Martin did another OAI-PMH bulk crawl, this time with the old JSON format. I
updated the transform script to block some additional domains.

## Prep

Fetch the snapshot:

    cd /srv/sandcrawler/tasks/
    wget https://archive.org/download/oai_harvest_20220921/2022-09-21-oai-pmh-metadata-compat.jsonl.zst

Transform to ingest requests:

    cd /srv/sandcrawler/src/python
    git log | head -n1
    # commit dfd4605d84712eccb95a63e50b0bcb343642b433

    pipenv shell

    zstdcat /srv/sandcrawler/tasks/2022-09-21-oai-pmh-metadata-compat.jsonl.zst \
        | ./scripts/oai2ingestrequest.py - \
        | pv -l \
        | gzip \
        > /srv/sandcrawler/tasks/2022-09-21_oaipmh_ingestrequests.json.gz
    # 16.1M 1:01:02 [4.38k/s]

Curious about types, though this would probably be handled at fatcat ingest time:

    zstdcat 2022-09-21-oai-pmh-metadata-compat.jsonl.zst | jq '.types[]' -r | sort | uniq -c | sort -nr > oai_type_counts.txt

    head oai_type_counts.txt -n30

    5623867 info:eu-repo/semantics/article
    5334928 info:eu-repo/semantics/publishedVersion
    3870359 text
    1240225 Text
     829169 Article
     769849 NonPeerReviewed
     665700 PeerReviewed
     648740 Peer-reviewed Article
     547857 article
     482906 info:eu-repo/semantics/bachelorThesis
     353814 Thesis
     329269 Student thesis
     262650 info:eu-repo/semantics/conferenceObject
     185354 Journal articles
     162021 info:eu-repo/semantics/doctoralThesis
     152079 Journal Article
     150226 Research Article
     130217 Conference papers
     127255 Artículo revisado por pares
     124243 Newspaper
     123908 ##rt.metadata.pkp.peerReviewed##
     123309 Photograph
     122981 info:eu-repo/semantics/masterThesis
     116719 Book
     108946 Image
     108216 Report
     107946 Other
     103562 masterThesis
     103038 info:eu-repo/semantics/other
     101404 StillImage
     [...]

And formats:

    zstdcat 2022-09-21-oai-pmh-metadata-compat.jsonl.zst | jq '.formats[]' -r | sort | uniq -c | sort -nr > oai_format_counts.txt

    head -n 20 oai_format_counts.txt

    11151928 application/pdf
      677413 text
      561656 text/html
      498518 image/jpeg
      231219 Text
      193638 text/xml
      147214 Image
      117073 image/jpg
      110872 pdf
       91323 image/tiff
       76948 bib
       75393 application/xml
       70244 Digitized from 35 mm. microfilm.
       68206 mods
       59227 PDF
       57677 application/epub+zip
       57602 application/octet-stream
       52072 text/plain
       51620 application/msword
       47227 audio/mpeg

Also, just overall size (number of records):

    zstdcat 2022-09-21-oai-pmh-metadata-compat.jsonl.zst | wc -l
    # 20,840,301
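As an aside, the "block some additional domains" change mentioned at the top
lives in `./scripts/oai2ingestrequest.py`. A minimal sketch of that kind of
filter, with made-up blocklist entries and helper names (the actual script's
list and record field names may differ):

    # hypothetical sketch of a domain blocklist filter, in the spirit of
    # oai2ingestrequest.py; the real blocklist and structure may differ
    from urllib.parse import urlparse

    DOMAIN_BLOCKLIST = [
        "example-aggregator.com",   # illustrative entries only
        "example-paywall.org",
    ]

    def url_is_blocked(url: str) -> bool:
        """True if the URL's host is a blocked domain or a subdomain of one."""
        host = urlparse(url).netloc.lower().split(":")[0]
        return any(host == d or host.endswith("." + d) for d in DOMAIN_BLOCKLIST)

    def record_to_requests(record: dict):
        """Map one OAI-PMH metadata record to ingest requests, dropping blocked URLs."""
        for url in record.get("urls") or []:
            if url_is_blocked(url):
                continue
            yield {
                "base_url": url,
                "ingest_type": "pdf",
                "link_source": "oai",
            }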
Next, load into sandcrawler DB:

    zcat /srv/sandcrawler/tasks/2022-09-21_oaipmh_ingestrequests.json.gz \
        | pv -l \
        | ./persist_tool.py ingest-request -

    Traceback (most recent call last):
      File "./persist_tool.py", line 311, in <module>
        main()
      File "./persist_tool.py", line 307, in main
        args.func(args)
      File "./persist_tool.py", line 119, in run_ingest_request
        pusher.run()
      File "/1/srv/sandcrawler/src/python/sandcrawler/workers.py", line 397, in run
        self.worker.push_batch(batch)
      File "/1/srv/sandcrawler/src/python/sandcrawler/persist.py", line 342, in push_batch
        resp = self.db.insert_ingest_request(self.cur, irequests)
      File "/1/srv/sandcrawler/src/python/sandcrawler/db.py", line 459, in insert_ingest_request
        resp = psycopg2.extras.execute_values(cur, sql, rows, page_size=250, fetch=True)
      File "/1/srv/sandcrawler/src/python/.venv/lib/python3.8/site-packages/psycopg2/extras.py", line 1270, in execute_values
        cur.execute(b''.join(parts))
    psycopg2.errors.ProgramLimitExceeded: index row size 3400 exceeds btree version 4 maximum 2704 for index "ingest_request_base_url_idx"
    DETAIL:  Index row references tuple (6893121,3) in relation "ingest_request".
    HINT:  Values larger than 1/3 of a buffer page cannot be indexed.
    Consider a function index of an MD5 hash of the value, or use full text indexing.

    15.7M 0:41:48 [6.27k/s]

Darn, this means we won't get reasonable stats about how many rows were
inserted/updated.

Patched the persist tool to skip very long URLs, and ran again (backwards, so
just the URLs which didn't get inserted already):

    zcat /srv/sandcrawler/tasks/2022-09-21_oaipmh_ingestrequests.json.gz \
        | tac \
        | head -n1000000 \
        | pv -l \
        | ./persist_tool.py ingest-request -
    # 1.00M 0:03:04 [5.41k/s]
    # Worker: Counter({'total': 1000000, 'insert-requests': 124701, 'skip-url-too-long': 1, 'update-requests': 0})
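The `skip-url-too-long` counter comes from that patch: the btree index on
`base_url` (`ingest_request_base_url_idx`) can't hold entries larger than
roughly 2704 bytes, so over-long URLs get dropped before insert. A rough sketch
of that kind of check; the exact threshold and where it sits in
`sandcrawler/persist.py` are assumptions:

    from collections import Counter

    # hypothetical sketch of the "skip very long URLs" patch; the real check in
    # sandcrawler/persist.py may use a different threshold or location
    MAX_URL_LEN = 2048  # stay comfortably under the ~2704-byte btree index row limit

    def filter_ingest_requests(requests: list, counts: Counter) -> list:
        """Drop requests whose base_url is too long to index, counting the skips."""
        kept = []
        for req in requests:
            if len(req.get("base_url") or "") > MAX_URL_LEN:
                counts["skip-url-too-long"] += 1
                continue
            kept.append(req)
        return kept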
Status of just the new lines:

    SELECT ingest_file_result.status, COUNT(*)
    FROM ingest_request
    LEFT JOIN ingest_file_result
        ON ingest_file_result.ingest_type = ingest_request.ingest_type
        AND ingest_file_result.base_url = ingest_request.base_url
    WHERE
        ingest_request.ingest_type = 'pdf'
        AND ingest_request.link_source = 'oai'
        AND date(ingest_request.created) > '2022-09-01'
    GROUP BY status
    ORDER BY COUNT DESC
    LIMIT 20;

             status          |  count
    -------------------------+---------
                             | 6398455
     success                 |  540219
     no-pdf-link             |   41316
     link-loop               |   23871
     no-capture              |   11350
     redirect-loop           |    8315
     wrong-mimetype          |    2394
     terminal-bad-status     |    1540
     null-body               |    1038
     cdx-error               |     272
     empty-blob              |     237
     petabox-error           |     213
     wayback-error           |     186
     blocked-cookie          |     107
     timeout                 |      47
     wayback-content-error   |      26
     spn2-cdx-lookup-failure |      21
     skip-url-blocklist      |      16
     spn2-backoff            |      15
     body-too-large          |      13
    (20 rows)

## Bulk Ingest

Domains/prefixes should already have been filtered in the transform script, so
no additional filters here.

    COPY (
        SELECT row_to_json(ingest_request.*)
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_request.link_source = 'oai'
            AND date(ingest_request.created) > '2022-09-01'
            AND ingest_file_result.status IS NULL
    ) TO '/srv/sandcrawler/tasks/oai_noingest_20220921.rows.json';
    # COPY 6398455

    ./scripts/ingestrequest_row2json.py /srv/sandcrawler/tasks/oai_noingest_20220921.rows.json \
        | pv -l \
        | shuf \
        > /srv/sandcrawler/tasks/oai_noingest_20220921.ingest_request.json
    # 6.40M 0:02:18 [46.2k/s]

    cat /srv/sandcrawler/tasks/oai_noingest_20220921.ingest_request.json \
        | rg -v "\\\\" \
        | jq . -c \
        | kafkacat -P -b wbgrp-svc350.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
    # DONE

Expect this ingest to take a week or so.

Then, run stats again:

    SELECT ingest_file_result.status, COUNT(*)
    FROM ingest_request
    LEFT JOIN ingest_file_result
        ON ingest_file_result.ingest_type = ingest_request.ingest_type
        AND ingest_file_result.base_url = ingest_request.base_url
    WHERE
        ingest_request.ingest_type = 'pdf'
        AND ingest_request.link_source = 'oai'
        AND date(ingest_request.created) > '2022-09-01'
    GROUP BY status
    ORDER BY COUNT DESC
    LIMIT 20;

             status          |  count
    -------------------------+---------
     no-capture              | 3617175
     success                 | 2775036
     no-pdf-link             |  449298
     link-loop               |   74260
     terminal-bad-status     |   47819
     wrong-mimetype          |   20195
     redirect-loop           |   18197
     empty-blob              |   12127
     cdx-error               |    3038
     skip-url-blocklist      |    2630
     wayback-error           |    2599
     petabox-error           |    2354
     wayback-content-error   |    1617
     blocked-cookie          |    1293
     null-body               |    1038
     body-too-large          |     670
                             |     143
     bad-gzip-encoding       |      64
     timeout                 |      47
     spn2-cdx-lookup-failure |      20
    (20 rows)

## Crawl Seedlist

    COPY (
        SELECT row_to_json(ingest_request.*)
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_request.link_source = 'oai'
            AND date(ingest_request.created) > '2022-09-01'
            AND (
                ingest_file_result.status = 'no-capture'
                OR ingest_file_result.status = 'redirect-loop'
                OR ingest_file_result.status = 'terminal-bad-status'
                OR ingest_file_result.status = 'cdx-error'
                OR ingest_file_result.status = 'petabox-error'
                OR ingest_file_result.status = 'wayback-error'
                OR ingest_file_result.status = 'timeout'
                OR ingest_file_result.status = 'wayback-content-error'
            )
    ) TO '/srv/sandcrawler/tasks/oai_nocapture_20220921.rows.json';
    => COPY 3692846

    ./scripts/ingestrequest_row2json.py /srv/sandcrawler/tasks/oai_nocapture_20220921.rows.json \
        | pv -l \
        | shuf \
        > /srv/sandcrawler/tasks/oai_nocapture_20220921.ingest_request.json
    => 3.69M 0:01:19 [46.6k/s]

This will be used for re-ingest later. For now, extract URLs:

    cat /srv/sandcrawler/tasks/oai_nocapture_20220921.rows.json \
        | jq .base_url -r \
        | sort -u -S 4G \
        | pv -l \
        > /srv/sandcrawler/tasks/oai_nocapture_20220921.base_url.txt
    => 3.66M 0:00:59 [61.8k/s]

    cat /srv/sandcrawler/tasks/oai_nocapture_20220921.rows.json \
        | rg '"terminal_url"' \
        | jq -r .result.terminal_url \
        | rg -v ^null$ \
        | sort -u -S 4G \
        | pv -l \
        > /srv/sandcrawler/tasks/oai_nocapture_20220921.terminal_url.txt
    => 0.00  0:00:05 [0.00 /s]

(The terminal URL extraction comes up empty because the COPY above exports only
`ingest_request` columns, so these rows have no `result.terminal_url` field.)

    cat /srv/sandcrawler/tasks/oai_nocapture_20220921.base_url.txt /srv/sandcrawler/tasks/oai_nocapture_20220921.terminal_url.txt \
        | awk '{print "F+ " $1}' \
        | shuf \
        > /srv/sandcrawler/tasks/oai_nocapture_20220921.schedule
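The resulting `.schedule` file is one URL per line with an `F+ ` prefix,
presumably for a Heritrix action directory, where (as I understand the
convention) the `F+` force-schedules a URI even if it has been seen before.
Lines look roughly like this (the URL is illustrative, not from the actual
seedlist):

    F+ https://repository.example.edu/record/1234/document.pdf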
What domains are we crawling?

    cat /srv/sandcrawler/tasks/oai_nocapture_20220921.base_url.txt /srv/sandcrawler/tasks/oai_nocapture_20220921.terminal_url.txt \
        | sort -u -S 4G \
        | cut -d/ -f3 \
        | sort \
        | uniq -c \
        | sort -nr \
        > /srv/sandcrawler/tasks/oai_nocapture_20220921.domains.txt

    head -n20 /srv/sandcrawler/tasks/oai_nocapture_20220921.domains.txt
      91899 raco.cat
      70116 islandora.wrlc.org
      68708 urn.kb.se
      63726 citeseerx.ist.psu.edu
      50370 publications.rwth-aachen.de
      44885 urn.nsk.hr
      38429 server15795.contentdm.oclc.org
      33041 periodicos.ufpb.br
      32519 nbn-resolving.org
      31990 www.ajol.info
      24745 hal.archives-ouvertes.fr
      22569 id.nii.ac.jp
      17239 tilburguniversity.on.worldcat.org
      15873 dspace.nbuv.gov.ua
      15436 digitalcommons.wustl.edu
      14885 www.iiste.org
      14623 www.manchester.ac.uk
      14033 nbn-resolving.de
      13999 opus4.kobv.de
      13689 www.redalyc.org

Sizes:

    wc -l /srv/sandcrawler/tasks/oai_nocapture_20220921.base_url.txt /srv/sandcrawler/tasks/oai_nocapture_20220921.terminal_url.txt /srv/sandcrawler/tasks/oai_nocapture_20220921.schedule

      3662864 /srv/sandcrawler/tasks/oai_nocapture_20220921.base_url.txt
            0 /srv/sandcrawler/tasks/oai_nocapture_20220921.terminal_url.txt
      3662864 /srv/sandcrawler/tasks/oai_nocapture_20220921.schedule

Copy seedlist to crawler:

    # as regular user
    scp /srv/sandcrawler/tasks/oai_nocapture_20220921.schedule wbgrp-svc206.us.archive.org:/tmp

## Post-Crawl Bulk Ingest

    # ran 2022-11-16, after crawl cleanup
    cat /srv/sandcrawler/tasks/oai_nocapture_20220921.ingest_request.json \
        | rg -v "\\\\" \
        | jq . -c \
        | kafkacat -P -b wbgrp-svc350.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
    => DONE

    SELECT ingest_file_result.status, COUNT(*)
    FROM ingest_request
    LEFT JOIN ingest_file_result
        ON ingest_file_result.ingest_type = ingest_request.ingest_type
        AND ingest_file_result.base_url = ingest_request.base_url
    WHERE
        ingest_request.ingest_type = 'pdf'
        AND ingest_request.link_source = 'oai'
        AND date(ingest_request.created) > '2022-09-01'
    GROUP BY status
    ORDER BY COUNT DESC
    LIMIT 20;

            status         |  count
    -----------------------+---------
     success               | 4721164   (+1,946,128)
     no-pdf-link           | 1116290
     no-capture            |  673939
     terminal-bad-status   |  232217
     link-loop             |  148544
     wrong-mimetype        |   68841
     redirect-loop         |   26262
     empty-blob            |   17759
     cdx-error             |    6570
     blocked-cookie        |    4026
     blocked-wall          |    3054
     skip-url-blocklist    |    2924
     body-too-large        |    2404
     bad-redirect          |    1565
     wayback-error         |    1320
     petabox-error         |    1083
     null-body             |    1038
     wayback-content-error |     264
     bad-gzip-encoding     |     150
                           |     143
    (20 rows)

The `+1,946,128` annotation on the `success` row is the increase relative to
the pre-crawl stats above (4,721,164 vs 2,775,036 successful ingests).
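For reference, the kafkacat pushes above could also be done directly from
Python with `confluent-kafka`; the broker and topic are the ones from the
commands above, but this is just a sketch, not the tooling actually used:

    # sketch of pushing ingest requests to the bulk ingest topic from Python,
    # mirroring the `rg -v "\\\\" | jq . -c | kafkacat -P ...` pipeline above;
    # not the actual sandcrawler tooling
    import json
    from confluent_kafka import Producer

    producer = Producer({"bootstrap.servers": "wbgrp-svc350.us.archive.org"})
    topic = "sandcrawler-prod.ingest-file-requests-bulk"

    with open("/srv/sandcrawler/tasks/oai_nocapture_20220921.ingest_request.json") as f:
        for line in f:
            if "\\" in line:
                # same effect as `rg -v "\\\\"`: skip lines containing a backslash
                continue
            req = json.loads(line)  # validate and re-compact, like `jq . -c`
            producer.produce(topic, json.dumps(req).encode("utf-8"))
            producer.poll(0)  # serve delivery callbacks / free queue space

    producer.flush()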