author | Bryan Newbold <bnewbold@archive.org> | 2021-03-23 21:42:32 -0700 |
---|---|---|
committer | Bryan Newbold <bnewbold@archive.org> | 2021-03-23 21:42:32 -0700 |
commit | 5defd444135bc4adb0748b0d2b8c9b88708bdc1a (patch) | |
tree | 599498f0a9ae5a3177d9702c3a7e8b70e39b2b4a /proposals/fatcat_indexing_pipeline.md | |
parent | e70e7cff4b5c910405694fb297330507b49937b1 (diff) | |
download | fatcat-scholar-5defd444135bc4adb0748b0d2b8c9b88708bdc1a.tar.gz fatcat-scholar-5defd444135bc4adb0748b0d2b8c9b88708bdc1a.zip |
proposals: add 2021 UI updates, and rename all to have a date in filename
Diffstat (limited to 'proposals/fatcat_indexing_pipeline.md')
-rw-r--r-- | proposals/fatcat_indexing_pipeline.md | 54 |
1 files changed, 0 insertions, 54 deletions
diff --git a/proposals/fatcat_indexing_pipeline.md b/proposals/fatcat_indexing_pipeline.md
deleted file mode 100644
index deafb65..0000000
--- a/proposals/fatcat_indexing_pipeline.md
+++ /dev/null
@@ -1,54 +0,0 @@
-
-## High-Level
-
-Work-oriented: base input is arrays of expanded releases, all from the same
-work.
-
-Re-index pipeline would look at fatcat changelog or existing release feed, and
-use the `work_id` to fetch all other releases.
-
-Batch indexing pipeline would use a new variant of `fatcat-export` which is
-expanded releases (one-per-line), grouped (or sorted) by work id.
-
-Then, pipeline looks like:
-
-- choose canonical release
-- choose best access
-- choose best fulltext file
-    => iterate releases and files
-    => soft prefer canonical release, file access, release_date, etc
-    => check via postgrest query that fulltext is available
-    => fetch raw fulltext
-- check if we expect a SIM copy to exist
-    => eg, using an issue db?
-    => if so, fetch petabox metadata and try to confirm, so we can create a URL
-    => if we don't have another fulltext source (?):
-        => fetch djvu file and extract the pages in question (or just 1 if unsure?)
-- output "heavy" object
-
-Next step is:
-
-- summarize biblio metadata
-- select one abstract per language
-- sanitize abstracts and fulltext content for indexing
-- compute counts, epistimological quality, etc
-
-The output of that goes to Kafka for indexing into ES.
-
-This indexing process is probably going to be both CPU and network intensive.
-In python will want multiprocessing and maybe also async?
-
-## Implementation
-
-Existing tools/libraries:
-
-- fatcat-openapi-client
-- postgrest client
-- S3/minio/seaweed client
-- ftfy
-- language detection
-
-New needed (eventually):
-
-- strip latex
-- strip JATS or HTML
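
For readers of the removed proposal, the work-oriented batch input and the "choose canonical release" step could be sketched roughly as follows. This is an illustrative sketch only, not code from the fatcat-scholar repository: the function names `iter_work_batches` and `choose_canonical`, the scoring order, and the use of the `release_stage`, `files`, and `release_date` fields are assumptions made for the example. It also assumes the input is already sorted by `work_id`, as the proposal describes for the `fatcat-export` variant.

```python
import json
from itertools import groupby
from typing import Dict, Iterable, Iterator, List


def iter_work_batches(lines: Iterable[str]) -> Iterator[List[Dict]]:
    """Yield lists of release dicts, one list per work.

    Assumes one JSON release per line, already sorted by work_id
    (hypothetical input shape, following the proposal's description).
    """
    releases = (json.loads(line) for line in lines if line.strip())
    for _work_id, group in groupby(releases, key=lambda r: r["work_id"]):
        yield list(group)


def choose_canonical(releases: List[Dict]) -> Dict:
    """Pick one release per work using a soft preference order.

    The ordering here (published stage, then has files, then latest
    release_date) is an assumption for illustration only.
    """
    def score(release: Dict):
        return (
            release.get("release_stage") == "published",
            bool(release.get("files")),
            release.get("release_date") or "",
        )

    # max() returns the first release among those with the top score
    return max(releases, key=score)


sample_lines = [
    '{"ident": "r1", "work_id": "w1", "release_stage": "draft"}',
    '{"ident": "r2", "work_id": "w1", "release_stage": "published", "files": [{}]}',
    '{"ident": "r3", "work_id": "w2", "release_date": "2020-01-01"}',
]
batches = list(iter_work_batches(sample_lines))
canonical_idents = [choose_canonical(batch)["ident"] for batch in batches]
```

Because `itertools.groupby` only merges adjacent items, unsorted input would split one work across several batches, which is why the sorted (or grouped) export matters for this approach.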