Martin did a fresh scrape of many OAI-PMH endpoints, and we should ingest/crawl. Note that Martin excluded many Indonesian endpoints, will need to follow-up on those. ## Prep Fetch metadata snapshot: wget https://archive.org/download/oai_pmh_partial_dump_2022_03_01/oai_pmh_partial_dump_2022_03_01.ndj.zst wget https://archive.org/download/oai_pmh_partial_dump_2022_03_01/oai_pmh_partial_dump_2022_03_01_urls.txt.zst Pre-filter out a bunch of prefixes we won't crawl (out of scope, and large): zstdcat /srv/sandcrawler/tasks/oai-pmh/oai_pmh_partial_dump_2022_03_01.ndj.zst \ | rg -v 'oai:kb.dk:' \ | rg -v 'oai:bdr.oai.bsb-muenchen.de:' \ | rg -v 'oai:hispana.mcu.es:' \ | rg -v 'oai:bnf.fr:' \ | rg -v 'oai:ukm.si:' \ | rg -v 'oai:biodiversitylibrary.org:' \ | rg -v 'oai:hsp.org:' \ | rg -v 'oai:repec:' \ | rg -v 'oai:n/a:' \ | rg -v 'oai:quod.lib.umich.edu:' \ | rg -v 'oai:americanae.aecid.es:' \ | rg -v 'oai:www.irgrid.ac.cn:' \ | rg -v 'oai:espace.library.uq.edu:' \ | rg -v 'oai:edoc.mpg.de:' \ | rg -v 'oai:bibliotecadigital.jcyl.es:' \ | rg -v 'oai:repository.erciyes.edu.tr:' \ | rg -v 'oai:krm.or.kr:' \ | ./scripts/oai2ingestrequest.py - \ | pv -l \ | gzip \ > /srv/sandcrawler/tasks/oai-pmh/oai_pmh_partial_dump_2022_03_01.requests.json.gz These failed to transform in the expected way; a change in JSON schema from last time?