notes/ingest/2022-03_oaipmh.md


Martin did a fresh scrape of many OAI-PMH endpoints, and we should ingest/crawl the results.

Note that Martin excluded many Indonesian endpoints; we will need to follow up on
those.

## Prep

Fetch metadata snapshot:

    wget https://archive.org/download/oai_pmh_partial_dump_2022_03_01/oai_pmh_partial_dump_2022_03_01.ndj.zst

    wget https://archive.org/download/oai_pmh_partial_dump_2022_03_01/oai_pmh_partial_dump_2022_03_01_urls.txt.zst

Pre-filter out a bunch of prefixes we won't crawl (out of scope, and large):

    zstdcat /srv/sandcrawler/tasks/oai-pmh/oai_pmh_partial_dump_2022_03_01.ndj.zst \
        | rg -v 'oai:kb.dk:' \
        | rg -v 'oai:bdr.oai.bsb-muenchen.de:' \
        | rg -v 'oai:hispana.mcu.es:' \
        | rg -v 'oai:bnf.fr:' \
        | rg -v 'oai:ukm.si:' \
        | rg -v 'oai:biodiversitylibrary.org:' \
        | rg -v 'oai:hsp.org:' \
        | rg -v 'oai:repec:' \
        | rg -v 'oai:n/a:' \
        | rg -v 'oai:quod.lib.umich.edu:' \
        | rg -v 'oai:americanae.aecid.es:' \
        | rg -v 'oai:www.irgrid.ac.cn:' \
        | rg -v 'oai:espace.library.uq.edu:' \
        | rg -v 'oai:edoc.mpg.de:' \
        | rg -v 'oai:bibliotecadigital.jcyl.es:' \
        | rg -v 'oai:repository.erciyes.edu.tr:' \
        | rg -v 'oai:krm.or.kr:' \
        | ./scripts/oai2ingestrequest.py - \
        | pv -l \
        | gzip \
        > /srv/sandcrawler/tasks/oai-pmh/oai_pmh_partial_dump_2022_03_01.requests.json.gz

These failed to transform in the expected way; a change in JSON schema from last time?
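
One way to check the schema-change hypothesis is to tally the top-level keys in a sample of the dump and compare them against what `oai2ingestrequest.py` reads. The following is only a rough sketch: the expected field names are not recorded in these notes, so it just reports what is present, and the function name `tally_keys` is made up for illustration.

```python
import json
import sys
from collections import Counter
from typing import Iterable


def tally_keys(lines: Iterable[str]) -> Counter:
    """Count how often each top-level key appears across NDJSON records."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if isinstance(record, dict):
            counts.update(record.keys())
        else:
            # a record that is not a JSON object would itself be a schema surprise
            counts["<non-object record>"] += 1
    return counts


if __name__ == "__main__":
    # reads NDJSON on stdin, prints "count<TAB>key", most common first
    for key, n in tally_keys(sys.stdin).most_common():
        print(f"{n}\t{key}")
```

Saved under a hypothetical name such as `tally_keys.py`, this could be fed a sample with `zstdcat oai_pmh_partial_dump_2022_03_01.ndj.zst | head -n 10000 | python3 tally_keys.py`. Keys that are missing, renamed, or present in only a fraction of records would be the first place to look.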