diff options
author | Bryan Newbold <bnewbold@archive.org> | 2022-10-07 12:38:00 -0700 |
---|---|---|
committer | Bryan Newbold <bnewbold@archive.org> | 2022-10-07 12:38:00 -0700 |
commit | 1657d4783a29eaee4adb610e0134a10b0126e202 (patch) | |
tree | 61cd917c85c0e2e7eeb982fb66d7d37f7374069e /notes/ingest/2022-03_oaipmh.md | |
parent | 54e14814080d9a706ff6f15694b3b54918200169 (diff) | |
download | sandcrawler-1657d4783a29eaee4adb610e0134a10b0126e202.tar.gz sandcrawler-1657d4783a29eaee4adb610e0134a10b0126e202.zip |
OAI-PMH updates
Diffstat (limited to 'notes/ingest/2022-03_oaipmh.md')
-rw-r--r-- | notes/ingest/2022-03_oaipmh.md | 40 |
1 files changed, 40 insertions, 0 deletions
diff --git a/notes/ingest/2022-03_oaipmh.md b/notes/ingest/2022-03_oaipmh.md new file mode 100644 index 0000000..d2a8d71 --- /dev/null +++ b/notes/ingest/2022-03_oaipmh.md @@ -0,0 +1,40 @@ + +Martin did a fresh scrape of many OAI-PMH endpoints, and we should ingest/crawl. + +Note that Martin excluded many Indonesian endpoints, will need to follow-up on +those. + +## Prep + +Fetch metadata snapshot: + + wget https://archive.org/download/oai_pmh_partial_dump_2022_03_01/oai_pmh_partial_dump_2022_03_01.ndj.zst + + wget https://archive.org/download/oai_pmh_partial_dump_2022_03_01/oai_pmh_partial_dump_2022_03_01_urls.txt.zst + +Pre-filter out a bunch of prefixes we won't crawl (out of scope, and large): + + zstdcat /srv/sandcrawler/tasks/oai-pmh/oai_pmh_partial_dump_2022_03_01.ndj.zst \ + | rg -v 'oai:kb.dk:' \ + | rg -v 'oai:bdr.oai.bsb-muenchen.de:' \ + | rg -v 'oai:hispana.mcu.es:' \ + | rg -v 'oai:bnf.fr:' \ + | rg -v 'oai:ukm.si:' \ + | rg -v 'oai:biodiversitylibrary.org:' \ + | rg -v 'oai:hsp.org:' \ + | rg -v 'oai:repec:' \ + | rg -v 'oai:n/a:' \ + | rg -v 'oai:quod.lib.umich.edu:' \ + | rg -v 'oai:americanae.aecid.es:' \ + | rg -v 'oai:www.irgrid.ac.cn:' \ + | rg -v 'oai:espace.library.uq.edu:' \ + | rg -v 'oai:edoc.mpg.de:' \ + | rg -v 'oai:bibliotecadigital.jcyl.es:' \ + | rg -v 'oai:repository.erciyes.edu.tr:' \ + | rg -v 'oai:krm.or.kr:' \ + | ./scripts/oai2ingestrequest.py - \ + | pv -l \ + | gzip \ + > /srv/sandcrawler/tasks/oai-pmh/oai_pmh_partial_dump_2022_03_01.requests.json.gz + +These failed to transform in the expected way; a change in JSON schema from last time? |