author: Bryan Newbold <bnewbold@archive.org> 2020-06-03 13:13:02 -0700
committer: Bryan Newbold <bnewbold@archive.org> 2020-06-03 13:13:02 -0700
commit: 510a070c9d7b886d6c8e3aa43b3b44bfa6ff1f6d
tree: 1e95ac8880c581be55bd325d6838a5d7aa6d87fc /README.md
parent: 048e10e79662fec42f26c30f791b20df7b67407e
commit prototype pipeline notes (in README)
Diffstat (limited to 'README.md')
-rw-r--r-- README.md 47
1 files changed, 47 insertions, 0 deletions
diff --git a/README.md b/README.md
index 7c8d99f..84f0722 100644
--- a/README.md
+++ b/README.md
@@ -24,3 +24,50 @@ Use gunicorn plus uvicorn, to get multiple worker processes, each running
async:
gunicorn example:app -w 4 -k uvicorn.workers.UvicornWorker
+
+## Prototype Pipeline
+
+Requires staff credentials in the environment for the `internetarchive` Python library.
+
+TODO: pass these credentials via ansible/dotenv
+
+Generate complete SIM issue database:
+
+ ia search "collection:periodicals collection:sim_microfilm mediatype:collection" --itemlist | rg "^pub_" > data/sim_collections.tsv
+ ia search "collection:periodicals collection:sim_microfilm mediatype:texts" --itemlist | rg "^sim_" > data/sim_items.tsv
+
+ cat data/sim_collections.tsv | parallel -j4 ia metadata {} | jq . -c | pv -l > data/sim_collections.json
+ cat data/sim_items.tsv | parallel -j8 ia metadata {} | jq . -c | pv -l > data/sim_items.json
+
+ cat data/sim_collections.2020-05-15.json | pv -l | python -m fatcat_scholar.issue_db load_pubs
+ cat data/sim_items.2020-05-15.json | pv -l | python -m fatcat_scholar.issue_db load_issues
+ python -m fatcat_scholar.issue_db load_counts
+
+Create QA elasticsearch index (localhost):
+
+ http put ":9200/qa_scholar_fulltext_v01?include_type_name=true" < schema/scholar_fulltext.v01.json
+ http put ":9200/qa_scholar_fulltext_v01/_alias/qa_scholar_fulltext"
+
+Fetch "heavy" fulltext documents (JSON) for full SIM database:
+
+ python -m fatcat_scholar.sim_pipeline run_issue_db | pv -l | gzip > data/sim_intermediate.json.gz
+
+Re-use existing COVID-19 database to index releases:
+
+ cat /srv/fatcat_covid19/metadata/fatcat_hits.2020-04-27.enrich.json \
+ | jq -c .fatcat_release \
+ | rg -v "^null" \
+ | parallel -j8 --linebuffer --round-robin --pipe python -m fatcat_scholar.work_pipeline run_releases --fulltext-cache-dir /srv/fatcat_covid19/fulltext_web \
+ | pv -l \
+ | gzip > data/work_intermediate.json.gz
+
+ => 48.3k 0:17:58 [44.8 /s]
+
+Transform and index both into the local Elasticsearch instance:
+
+ zcat data/work_intermediate.json.gz data/sim_intermediate.json.gz \
+ | parallel -j8 --linebuffer --round-robin --pipe python -m fatcat_scholar.transform run_transform \
+ | esbulk -verbose -size 100 -id key -w 4 -index qa_scholar_fulltext_v01 -type _doc
+
+ => 132635 docs in 2m18.787824205s at 955.667 docs/s with 4 workers
+
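
The `jq -c .fatcat_release | rg -v "^null"` stage in the COVID-19 re-use step picks one field out of each JSON line and discards records where that field is missing or null. A minimal Python sketch of the same filter follows, assuming every input line is a JSON object. The function name `extract_releases` is hypothetical and is not part of `fatcat_scholar`.

```python
import json
from typing import Iterable, Iterator


def extract_releases(lines: Iterable[str]) -> Iterator[str]:
    """Yield the `fatcat_release` value of each JSON line as compact JSON.

    Assumption: each non-empty line is a JSON object, as in the enriched
    metadata file that the pipeline reads.
    """
    for line in lines:
        line = line.strip()
        if not line:
            # Skip blank lines instead of failing on them.
            continue
        release = json.loads(line).get("fatcat_release")
        if release is None:
            # Covers both a missing key and an explicit null, which is
            # what `rg -v "^null"` drops after `jq` prints "null".
            continue
        # Compact separators and unescaped non-ASCII approximate `jq -c` output.
        yield json.dumps(release, separators=(",", ":"), ensure_ascii=False)
```

Passing `sys.stdin` as `lines` and printing each yielded string would let this stand in for the two filter stages before `parallel`.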