author Bryan Newbold <bnewbold@archive.org> 2020-12-08 16:38:56 -0800
committer Bryan Newbold <bnewbold@archive.org> 2020-12-08 16:38:56 -0800
commit 568da1363626d07d0956fb8a60bb7d7fd9054d83
tree 3f0542ff71626a463cc3372c7fc5743fc70c5cae /notes/ingest
parent 2d338ab3649642affbedeb28470a96a6a5ba7597
commit sept 2020 scielo ingest notes
Diffstat (limited to 'notes/ingest')
-rw-r--r-- notes/ingest/2020-09_scielo.md | 21
1 files changed, 21 insertions, 0 deletions
diff --git a/notes/ingest/2020-09_scielo.md b/notes/ingest/2020-09_scielo.md
new file mode 100644
index 0000000..4ec6fbd
--- /dev/null
+++ b/notes/ingest/2020-09_scielo.md
@@ -0,0 +1,21 @@
+
+As a follow-up to `SCIELO-CRAWL-2020-07`, going to bulk ingest all existing
+fatcat releases with no IA copy and with `publisher_type:scielo`. There are
+200k+ such releases.
+
+It seems like some of these are HTML or XML, e.g.: https://doi.org/10.4321/s1132-12962011000300008
+
+Could try XML ingest of these!
+
+## Bulk Ingest
+
+Dump ingest requests
+
+    ./fatcat_ingest.py --allow-non-oa query "publisher_type:scielo" | pv -l > /srv/fatcat/snapshots/scielo_papers_20200914.ingest_request.json
+    Expecting 212529 release objects in search queries
+
+Enqueue
+
+    cat /srv/fatcat/snapshots/scielo_papers_20200914.ingest_request.json | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
+    => done 2020-09-14
+
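For reference, the filtering stage of the enqueue pipeline above (`rg -v "\\\\" | jq . -c`) drops every line that contains a literal backslash and re-serializes each remaining JSON object in compact, one-per-line form. A minimal Python sketch of that behavior follows. The function name `prepare_ingest_requests` and the sample request fields are hypothetical, chosen for illustration; producing to Kafka is deliberately left out.

```python
import json
from typing import Iterable, Iterator


def prepare_ingest_requests(lines: Iterable[str]) -> Iterator[str]:
    """Yield compact JSON strings for lines that contain no backslash.

    Approximates `rg -v "\\\\" | jq . -c`; this is an illustrative
    stand-in, not part of sandcrawler or fatcat.
    """
    for line in lines:
        line = line.strip()
        if not line:
            # Skip blank lines, which carry no request.
            continue
        if "\\" in line:
            # Same effect as `rg -v "\\\\"`: drop any line with a backslash.
            continue
        obj = json.loads(line)
        # Compact separators and unescaped non-ASCII roughly match `jq -c`.
        yield json.dumps(obj, separators=(",", ":"), ensure_ascii=False)


if __name__ == "__main__":
    # Hypothetical sample requests; real dumps come from fatcat_ingest.py.
    sample = [
        '{"base_url": "https://example.org/a", "ingest_type": "pdf"}',
        '{"base_url": "https://example.org/b\\\\c"}',
    ]
    for request in prepare_ingest_requests(sample):
        print(request)
```

Filtering on the raw line, before parsing, mirrors the shell pipeline: any request whose serialized form contains an escape sequence is skipped entirely instead of being repaired.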