author Bryan Newbold <bnewbold@archive.org> 2021-03-23 21:42:32 -0700
committer Bryan Newbold <bnewbold@archive.org> 2021-03-23 21:42:32 -0700
commit 5defd444135bc4adb0748b0d2b8c9b88708bdc1a
tree 599498f0a9ae5a3177d9702c3a7e8b70e39b2b4a /proposals/fatcat_indexing_pipeline.md
parent e70e7cff4b5c910405694fb297330507b49937b1
proposals: add 2021 UI updates, and rename all to have a date in filename
Diffstat (limited to 'proposals/fatcat_indexing_pipeline.md')
-rw-r--r-- proposals/fatcat_indexing_pipeline.md | 54
1 files changed, 0 insertions, 54 deletions
diff --git a/proposals/fatcat_indexing_pipeline.md b/proposals/fatcat_indexing_pipeline.md
deleted file mode 100644
index deafb65..0000000
--- a/proposals/fatcat_indexing_pipeline.md
+++ /dev/null
@@ -1,54 +0,0 @@
-
-## High-Level
-
-Work-oriented: base input is arrays of expanded releases, all from the same
-work.
-
-Re-index pipeline would look at fatcat changelog or existing release feed, and
-use the `work_id` to fetch all other releases.
-
-Batch indexing pipeline would use a new variant of `fatcat-export` which emits
-expanded releases (one per line), grouped (or sorted) by work id.
-
-Then, pipeline looks like:
-
-- choose canonical release
-- choose best access
-- choose best fulltext file
- => iterate releases and files
- => soft prefer canonical release, file access, release_date, etc
- => check via postgrest query that fulltext is available
- => fetch raw fulltext
-- check if we expect a SIM copy to exist
- => e.g., using an issue db?
- => if so, fetch petabox metadata and try to confirm, so we can create a URL
- => if we don't have another fulltext source (?):
- => fetch djvu file and extract the pages in question (or just 1 if unsure?)
-- output "heavy" object
-
-Next step is:
-
-- summarize biblio metadata
-- select one abstract per language
-- sanitize abstracts and fulltext content for indexing
-- compute counts, epistemological quality, etc
-
-The output of that goes to Kafka for indexing into ES.
-
-This indexing process is probably going to be both CPU and network intensive.
-In Python we will want multiprocessing, and maybe also async?
-
-## Implementation
-
-Existing tools/libraries:
-
-- fatcat-openapi-client
-- postgrest client
-- S3/minio/seaweed client
-- ftfy
-- language detection
-
-New needed (eventually):
-
-- strip latex
-- strip JATS or HTML
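
The batch input format and the first pipeline step described in the deleted proposal above (expanded releases, one per line, grouped by work id, followed by "choose canonical release") could be sketched roughly as below. This is a minimal illustration, not the project's implementation. The JSON field names (`work_id`, `ident`, `release_stage`, `release_date`, `files`) are assumed to match expanded fatcat release entities, the function names are made up for this example, and the scoring rule in `pick_canonical` is a hypothetical heuristic, since the proposal does not say how the canonical release is chosen.

```python
import json
from itertools import groupby
from typing import Dict, Iterable, Iterator, List


def iter_works(lines: Iterable[str]) -> Iterator[List[Dict]]:
    """Yield one list of release dicts per work.

    Assumes the input is already grouped (or sorted) by work_id, as the
    proposal requires; groupby only merges *adjacent* lines.
    """
    releases = (json.loads(line) for line in lines if line.strip())
    for _work_id, group in groupby(releases, key=lambda r: r["work_id"]):
        yield list(group)


def pick_canonical(releases: List[Dict]) -> Dict:
    """Pick one release per work using a hypothetical soft preference.

    Assumed ordering (not from the proposal): published releases first,
    then releases with at least one file, then the latest release_date.
    """
    def score(release: Dict) -> tuple:
        return (
            release.get("release_stage") == "published",
            bool(release.get("files")),
            release.get("release_date") or "",  # ISO dates sort as strings
        )

    return max(releases, key=score)
```

Because `iter_works` streams line by line, it could be fed directly from a large export file without loading everything into memory; the per-work lists are the "arrays of expanded releases" the proposal uses as base input.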