Martin did a fresh scrape of many OAI-PMH endpoints, and we should ingest and
crawl the results. Note that Martin excluded many Indonesian endpoints; we will
need to follow up on those.

## Prep
Fetch metadata snapshot:

    wget https://archive.org/download/oai_pmh_partial_dump_2022_03_01/oai_pmh_partial_dump_2022_03_01.ndj.zst
    wget https://archive.org/download/oai_pmh_partial_dump_2022_03_01/oai_pmh_partial_dump_2022_03_01_urls.txt.zst

Pre-filter out a bunch of prefixes we won't crawl (out of scope, and large):

    zstdcat /srv/sandcrawler/tasks/oai-pmh/oai_pmh_partial_dump_2022_03_01.ndj.zst \
        | rg -v 'oai:kb.dk:' \
        | rg -v 'oai:bdr.oai.bsb-muenchen.de:' \
        | rg -v 'oai:hispana.mcu.es:' \
        | rg -v 'oai:bnf.fr:' \
        | rg -v 'oai:ukm.si:' \
        | rg -v 'oai:biodiversitylibrary.org:' \
        | rg -v 'oai:hsp.org:' \
        | rg -v 'oai:repec:' \
        | rg -v 'oai:n/a:' \
        | rg -v 'oai:quod.lib.umich.edu:' \
        | rg -v 'oai:americanae.aecid.es:' \
        | rg -v 'oai:www.irgrid.ac.cn:' \
        | rg -v 'oai:espace.library.uq.edu:' \
        | rg -v 'oai:edoc.mpg.de:' \
        | rg -v 'oai:bibliotecadigital.jcyl.es:' \
        | rg -v 'oai:repository.erciyes.edu.tr:' \
        | rg -v 'oai:krm.or.kr:' \
        | ./scripts/oai2ingestrequest.py - \
        | pv -l \
        | gzip \
        > /srv/sandcrawler/tasks/oai-pmh/oai_pmh_partial_dump_2022_03_01.requests.json.gz

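
For reference, the chain of `rg -v` filters could probably be collapsed into one pass. The following is a minimal sketch, not part of the actual pipeline: the names `EXCLUDED_PREFIXES`, `keep_line`, and `filter_lines` are hypothetical, and it matches the prefixes as literal substrings, whereas `rg` treats each `.` as a regex wildcard, so it is slightly stricter than the shell version.

```python
# Hypothetical single-pass equivalent of the `rg -v` chain above.
# Prefixes are copied from the shell pipeline; matching is a literal
# substring test against the whole line (assumption: that is close
# enough to the regex behaviour of `rg` for these patterns).
EXCLUDED_PREFIXES = (
    "oai:kb.dk:",
    "oai:bdr.oai.bsb-muenchen.de:",
    "oai:hispana.mcu.es:",
    "oai:bnf.fr:",
    "oai:ukm.si:",
    "oai:biodiversitylibrary.org:",
    "oai:hsp.org:",
    "oai:repec:",
    "oai:n/a:",
    "oai:quod.lib.umich.edu:",
    "oai:americanae.aecid.es:",
    "oai:www.irgrid.ac.cn:",
    "oai:espace.library.uq.edu:",
    "oai:edoc.mpg.de:",
    "oai:bibliotecadigital.jcyl.es:",
    "oai:repository.erciyes.edu.tr:",
    "oai:krm.or.kr:",
)


def keep_line(line: str) -> bool:
    """Return True if the line contains none of the excluded prefixes."""
    return not any(prefix in line for prefix in EXCLUDED_PREFIXES)


def filter_lines(lines):
    """Yield only the lines that pass keep_line, preserving order."""
    for line in lines:
        if keep_line(line):
            yield line
```

If this were saved as a script, something like `sys.stdout.writelines(filter_lines(sys.stdin))` could sit between `zstdcat` and `oai2ingestrequest.py` in place of the individual `rg -v` stages.
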
The records failed to transform in the expected way. Did the JSON schema change since last time?
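
One way to check for a schema change would be to tally the top-level keys of a sample of records and compare them against what `oai2ingestrequest.py` expects. This is only a sketch: `key_signature_counts` is a hypothetical helper, and it assumes the dump is newline-delimited JSON with one object per line.

```python
import json
from collections import Counter


def key_signature_counts(lines, limit=1000):
    """Count distinct sets of top-level keys in the first `limit` lines.

    Assumes each non-blank line is a standalone JSON value; anything
    that is not a JSON object is tallied under a placeholder signature.
    """
    counts = Counter()
    for i, line in enumerate(lines):
        if i >= limit:
            break
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if isinstance(record, dict):
            # Sort keys so that key order does not create new signatures.
            counts[tuple(sorted(record))] += 1
        else:
            counts[("<non-object>",)] += 1
    return counts
```

Feeding it the first few thousand lines of the decompressed dump (for example via `sys.stdin`) and printing `counts.most_common()` should show whether the fields differ from the previous dump.
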