author | Bryan Newbold <bnewbold@archive.org> | 2020-12-08 16:38:56 -0800 |
---|---|---|
committer | Bryan Newbold <bnewbold@archive.org> | 2020-12-08 16:38:56 -0800 |
commit | 568da1363626d07d0956fb8a60bb7d7fd9054d83 (patch) | |
tree | 3f0542ff71626a463cc3372c7fc5743fc70c5cae | |
parent | 2d338ab3649642affbedeb28470a96a6a5ba7597 (diff) | |
commit sept 2020 scielo ingest notes
-rw-r--r-- | notes/ingest/2020-09_scielo.md | 21 |
1 files changed, 21 insertions, 0 deletions
diff --git a/notes/ingest/2020-09_scielo.md b/notes/ingest/2020-09_scielo.md
new file mode 100644
index 0000000..4ec6fbd
--- /dev/null
+++ b/notes/ingest/2020-09_scielo.md
@@ -0,0 +1,21 @@
+
+As a follow-up to `SCIELO-CRAWL-2020-07`, going to bulk ingest all existing
+fatcat releases with no IA copy and with `publisher_type:scielo`. There are
+200k+ such releases.
+
+It seems like some of these are HTML or XML, eg: https://doi.org/10.4321/s1132-12962011000300008
+
+Could try XML ingest of these!
+
+## Bulk Ingest
+
+Dump ingest requests
+
+    ./fatcat_ingest.py --allow-non-oa query "publisher_type:scielo" | pv -l > /srv/fatcat/snapshots/scielo_papers_20200914.ingest_request.json
+    Expecting 212529 release objects in search queries
+
+Enqueue
+
+    cat /srv/fatcat/snapshots/scielo_papers_20200914.ingest_request.json | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
+    => done 2020-09-14
+
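For readers unfamiliar with the filter stage of the enqueue pipeline: `rg -v "\\\\"` drops every line that contains a literal backslash, and `jq . -c` re-serializes each remaining line as compact single-line JSON before it is handed to `kafkacat`. Below is a minimal Python sketch of just those two stages. The function name `filter_requests` is hypothetical and not part of sandcrawler or fatcat, and the Kafka publishing step is deliberately left out.

```python
import json


def filter_requests(lines):
    """Yield compact JSON for each input line that has no backslash.

    Rough equivalent of `rg -v "\\\\" | jq . -c`; hypothetical helper,
    not taken from the sandcrawler codebase.
    """
    for line in lines:
        line = line.strip()
        # Skip blank lines, and skip any line containing a literal
        # backslash (the same lines `rg -v` would drop).
        if not line or "\\" in line:
            continue
        record = json.loads(line)
        # Compact separators and unescaped non-ASCII, similar to `jq -c`.
        yield json.dumps(record, separators=(",", ":"), ensure_ascii=False)
```

Passing an open file handle for the dumped `.ingest_request.json` file as `lines` would yield one compact request per surviving line. Note that this also discards requests whose JSON merely contains an escaped character (for example `\"` or `\u00e9`), which is presumably the intent of the original `rg` filter.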