diff options
Diffstat (limited to 'notes/ingest/2021-12-13_datasets.md')
-rw-r--r-- | notes/ingest/2021-12-13_datasets.md | 398 |
1 files changed, 398 insertions, 0 deletions
diff --git a/notes/ingest/2021-12-13_datasets.md b/notes/ingest/2021-12-13_datasets.md new file mode 100644 index 0000000..edad789 --- /dev/null +++ b/notes/ingest/2021-12-13_datasets.md @@ -0,0 +1,398 @@ + +First round of production dataset ingest. Aiming to get one or two small +repositories entirely covered, and a few thousand datasets from all supported +platforms. + +Planning to run with sandcrawler in batch mode on `wbgrp-svc263`, expecting up +to a TByte of content locally (on spinning disk). For successful output, will +run through fatcat import; for a subset of unsuccessful, will start a small +heritrix crawl. + + +## Ingest Generation + +Summary: + + wc -l /srv/fatcat/tasks/ingest_dataset_*pilot.json + 2 /srv/fatcat/tasks/ingest_dataset_dataverse_archiveorg_pilot.json + 1702 /srv/fatcat/tasks/ingest_dataset_dataverse_goettingen_pilot.json + 2975 /srv/fatcat/tasks/ingest_dataset_dataverse_harvard_pilot.json + 10000 /srv/fatcat/tasks/ingest_dataset_figshare_pilot.json + 10000 /srv/fatcat/tasks/ingest_dataset_zenodo_pilot.json + +All the below ingest requests were combined into a single large file: + + cat /srv/fatcat/tasks/ingest_dataset*pilot.json | shuf | pv -l | gzip > /srv/fatcat/tasks/ingest_dataset_combined.json.gz + # 24.7k 0:00:00 [91.9k/s] + +### Figshare + +- sample 10k datasets (not other types) +- want only "versioned" DOIs; use regex on DOI to ensure + + ./fatcat_ingest.py --limit 50000 --ingest-type dataset --allow-non-oa query 'doi_prefix:10.6084 type:dataset' \ + | rg '10\.6084/m9\.figshare\.\d+.v\d+' \ + | shuf -n10000 \ + | pv -l \ + > /srv/fatcat/tasks/ingest_dataset_figshare_pilot.json + # Counter({'estimate': 505968, 'ingest_request': 50000, 'elasticsearch_release': 50000}) + +### Zenodo + +- has DOIs (of course) +- want only "versioned" DOIs? how to skip? +- sample 10k + + ./fatcat_ingest.py --limit 50000 --ingest-type dataset --allow-non-oa query 'doi_prefix:10.5281 type:dataset' \ + | rg '10\.5281/zenodo' \ + | shuf -n10000 \ + | pv -l \ + > /srv/fatcat/tasks/ingest_dataset_zenodo_pilot.json + +### Goettingen Research Online + +- <https://data.goettingen-research-online.de/> +- Dataverse instance, not harvard-hosted +- ~1,400 datasets, ~10,500 files +- has DOIs +- `doi_prefix:10.25625`, then filter to only one slash + + ./fatcat_ingest.py --ingest-type dataset --allow-non-oa query 'doi_prefix:10.25625 type:dataset' \ + | rg -v '10\.25625/[a-z0-9]+/[a-z0-9]' \ + | shuf \ + | pv -l \ + > /srv/fatcat/tasks/ingest_dataset_dataverse_goettingen_pilot.json + # Counter({'ingest_request': 12739, 'elasticsearch_release': 12739, 'estimate': 12739}) # 1.7k 0:01:29 [ 19 /s] + +### Harvard Dataverse + +- main harvard dataverse instance, many "sub-dataverses" +- ~137,000 datasets, ~1,400,000 files +- 10k sample + + ./fatcat_ingest.py --limit 50000 --ingest-type dataset --allow-non-oa query 'doi_prefix:10.7910 type:dataset' \ + | rg '10\.7910/dvn/[a-z0-9]{6}' \ + | rg -v '10\.7910/dvn/[a-z0-9]{6}/[a-z0-9]' \ + | shuf -n10000 \ + | pv -l \ + > /srv/fatcat/tasks/ingest_dataset_dataverse_harvard_pilot.json + # Counter({'estimate': 660979, 'ingest_request': 50000, 'elasticsearch_release': 50000}) # 2.97k 0:03:26 [14.4 /s] + +Note that this was fewer than expected, but moving on anyways. + +### archive.org + +A couple hand-filtered items. + +"CAT" dataset +- item: <https://archive.org/details/CAT_DATASET> +- fatcat release (for paper): `release_36vy7s5gtba67fmyxlmijpsaui` + +"The Representativeness of Automated Web Crawls as a Surrogate for Human Browsing" +- https://archive.org/details/academictorrents_5e9ef2b5531ce3b965681be6eccab1fbd114af62 +- https://fatcat.wiki/release/7owybd2hrvdmdpm4zpo7hkn2pu (paper) + + + { + "ingest_type": "dataset", + "ingest_request_source": "savepapernow", + "base_url": "https://archive.org/details/CAT_DATASET", + "release_stage": "published", + "fatcat": { + "release_ident": "36vy7s5gtba67fmyxlmijpsaui", + "work_ident": "ycqtbhnfmzamheq2amztiwbsri" + }, + "ext_ids": {}, + "link_source": "spn", + "link_source_id": "36vy7s5gtba67fmyxlmijpsaui" + } + { + "ingest_type": "dataset", + "ingest_request_source": "savepapernow", + "base_url": "https://archive.org/details/academictorrents_5e9ef2b5531ce3b965681be6eccab1fbd114af62", + "release_stage": "published", + "fatcat": { + "release_ident": "7owybd2hrvdmdpm4zpo7hkn2pu", + "work_ident": "3xkz7iffwbdfhbwhnd73iu66cu" + }, + "ext_ids": {}, + "link_source": "spn", + "link_source_id": "7owybd2hrvdmdpm4zpo7hkn2pu" + } + + # paste and then Ctrl-D: + cat | jq . -c > /srv/fatcat/tasks/ingest_dataset_dataverse_archiveorg_pilot.json + + +## Ingest Command + +On `wbgrp-svc263`. + +In the current version of tool, `skip_cleanup_local_files=True` by default, so +files will stick around. + +Note that `--no-spn2` is passed, so we are expecting a lot of `no-capture` in the output. + + + # first a small sample + zcat /srv/sandcrawler/tasks/ingest_dataset_combined.json.gz \ + | head -n5 \ + | pv -l \ + | parallel -j4 --linebuffer --round-robin --pipe ./ingest_tool.py requests --no-spn2 - \ + > /srv/sandcrawler/tasks/ingest_dataset_combined_results.ramp.json + + # ok, run the whole batch through + zcat /srv/sandcrawler/tasks/ingest_dataset_combined.json.gz \ + | pv -l \ + | parallel -j4 --linebuffer --round-robin --pipe ./ingest_tool.py requests --no-spn2 - \ + > /srv/sandcrawler/tasks/ingest_dataset_combined_results.json + +Got an error: + + internetarchive.exceptions.AuthenticationError: No access_key or secret_key set! Have you run `ia configure`? + +Did a hot patch to try to have the uploads happen under a session, with config from ENV, but didn't work: + + AttributeError: 'ArchiveSession' object has no attribute 'upload' + +Going to hack with config in homedir for now. + +Extract URLs for crawling: + + cat /srv/sandcrawler/tasks/ingest_dataset_combined_results*.json \ + | rg '"no-capture"' \ + | rg -v '"manifest"' \ + | jq 'select(.status = "no-capture")' -c \ + | jq .request.base_url -r \ + | pv -l \ + > /srv/sandcrawler/tasks/dataset_seedlist.base_url.txt + + cat /srv/sandcrawler/tasks/ingest_dataset_combined_results*.json \ + | rg '"no-capture"' \ + | rg '"manifest"' \ + | jq 'select(.status = "no-capture")' -c \ + | rg '"web-' \ + | jq .manifest[].terminal_url -r \ + | pv -l \ + > /srv/sandcrawler/tasks/dataset_seedlist.manifest_terminal.txt + +### Exceptions Encountered + + File "/srv/sandcrawler/src/python/sandcrawler/fileset_strategies.py", line 193, in process + internetarchive.upload + [...] + ConnectionResetError: [Errno 104] Connection reset by peer + urllib3.exceptions.ProtocolError + requests.exceptions.ConnectionError: (ProtocolError('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer')), 'https://s3.us.archive.org/zenodo.org-3275525/rhOverM_Asymptotic_GeometricUnits_CoM.h5') + + + Traceback (most recent call last): + File "./ingest_tool.py", line 208, in <module> + main() + File "./ingest_tool.py", line 204, in main + args.func(args) + File "./ingest_tool.py", line 57, in run_requests + result = fileset_worker.process(request) + File "/srv/sandcrawler/src/python/sandcrawler/ingest_fileset.py", line 375, in process + archive_result = strategy_helper.process(dataset_meta) + File "/srv/sandcrawler/src/python/sandcrawler/fileset_strategies.py", line 130, in process + r.raise_for_status() + File "/srv/sandcrawler/src/python/.venv/lib/python3.8/site-packages/requests/models.py", line 953, in raise_for_status + raise HTTPError(http_error_msg, response=self) + requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://ndownloader.figshare.com/files/5474201 + +download sometimes just slowly time out, like after a day or more + + + Traceback (most recent call last): + File "./ingest_tool.py", line 208, in <module> + main() + File "./ingest_tool.py", line 204, in main + args.func(args) + File "./ingest_tool.py", line 57, in run_requests + result = fileset_worker.process(request) + File "/srv/sandcrawler/src/python/sandcrawler/ingest_fileset.py", line 381, in process + archive_result = strategy_helper.process(dataset_meta) + File "/srv/sandcrawler/src/python/sandcrawler/fileset_strategies.py", line 155, in process + file_meta = gen_file_metadata_path(local_path, allow_empty=True) + File "/srv/sandcrawler/src/python/sandcrawler/misc.py", line 89, in gen_file_metadata_path + mimetype = magic.Magic(mime=True).from_file(path) + File "/srv/sandcrawler/src/python/.venv/lib/python3.8/site-packages/magic/__init__.py", line 111, in from_file + with _real_open(filename): + FileNotFoundError: [Errno 2] No such file or directory: '/tmp/sandcrawler/figshare.com-7925396-v1/HG02070.dedup.realigned.recalibrated.hc.g.vcf.gz' + + + Traceback (most recent call last): + File "./ingest_tool.py", line 208, in <module> + main() + File "./ingest_tool.py", line 204, in main + args.func(args) + File "./ingest_tool.py", line 57, in run_requests + result = fileset_worker.process(request) + File "/srv/sandcrawler/src/python/sandcrawler/ingest_fileset.py", line 314, in process + dataset_meta = platform_helper.process_request(request, resource, html_biblio) + File "/srv/sandcrawler/src/python/sandcrawler/fileset_platforms.py", line 208, in process_request + obj_latest = obj["data"]["latestVersion"] + KeyError: 'latestVersion' + +Fixed the above, trying again: + + git log | head -n1 + # commit ffdc901fa067db55fe6cfeb8d0c3807d29df092c + + Wed Dec 15 21:57:42 UTC 2021 + + zcat /srv/sandcrawler/tasks/ingest_dataset_combined.json.gz \ + | shuf \ + | parallel -j4 --linebuffer --round-robin --pipe ./ingest_tool.py requests --no-spn2 --enable-sentry - \ + | pv -l \ + > /srv/sandcrawler/tasks/ingest_dataset_combined_results4.json + +Zenodo seems really slow, let's try filtering those out: + + zcat /srv/sandcrawler/tasks/ingest_dataset_combined.json.gz \ + | rg -v 10.5281 \ + | shuf \ + | parallel -j8 --linebuffer --round-robin --pipe ./ingest_tool.py requests --no-spn2 --enable-sentry - \ + | pv -l \ + > /srv/sandcrawler/tasks/ingest_dataset_combined_results5.json + # 3.76k 15:12:53 [68.7m/s] + + zcat /srv/sandcrawler/tasks/ingest_dataset_combined.json.gz \ + | rg -v 10.5281 \ + | shuf \ + | parallel -j8 --linebuffer --round-robin --pipe ./ingest_tool.py requests --no-spn2 --enable-sentry - \ + | pv -l \ + > /srv/sandcrawler/tasks/ingest_dataset_combined_results6.json + +## Fatcat Import + + wc -l ingest_dataset_combined_results*.json + 126 ingest_dataset_combined_results2.json + 153 ingest_dataset_combined_results3.json + 275 ingest_dataset_combined_results4.json + 3762 ingest_dataset_combined_results5.json + 7736 ingest_dataset_combined_results6.json + 182 ingest_dataset_combined_results.json + 5 ingest_dataset_combined_results.ramp.json + 12239 total + + cat ingest_dataset_combined_results*.json \ + | rg '^\{' \ + | jq '[.request.fatcat.release_ident, . | tostring] | @tsv' -r \ + | sort \ + | uniq --check-chars 26 \ + | cut -f2 \ + | rg -v '\\\\' \ + | pv -l \ + > uniq_ingest_dataset_combined_results.json + # 9.48k 0:00:06 [1.54k/s] + + cat uniq_ingest_dataset_combined_results.json | jq .status -r | sort | uniq -c | sort -nr + 7941 no-capture + 374 platform-404 + 369 terminal-bad-status + 348 success-file + 172 success + 79 platform-scope + 77 error-platform-download + 47 empty-manifest + 27 platform-restricted + 20 too-many-files + 12 redirect-loop + 6 error-archiveorg-upload + 3 too-large-size + 3 mismatch + 1 no-platform-match + + cat uniq_ingest_dataset_combined_results.json \ + | rg '"success' \ + | jq 'select(.status == "success") | .' -c \ + > uniq_ingest_dataset_combined_results.success.json + + cat uniq_ingest_dataset_combined_results.json \ + | rg '"success' \ + | jq 'select(.status == "success-file") | .' -c \ + > uniq_ingest_dataset_combined_results.success-file.json + +On fatcat QA instance: + + git log | head -n1 + # commit cca680e2cc4768a4d45e199f6256a433b25b4075 + + head /tmp/uniq_ingest_dataset_combined_results.success-file.json \ + | ./fatcat_import.py ingest-fileset-results - + # Counter({'total': 10, 'skip': 10, 'skip-single-file': 10, 'insert': 0, 'update': 0, 'exists': 0}) + + head /tmp/uniq_ingest_dataset_combined_results.success-file.json \ + | ./fatcat_import.py ingest-file-results - + # Counter({'total': 10, 'skip': 10, 'skip-ingest-type': 10, 'insert': 0, 'update': 0, 'exists': 0}) + +Need to update fatcat file worker to support single-file filesets... was that the plan? + + head /tmp/uniq_ingest_dataset_combined_results.success.json \ + | ./fatcat_import.py ingest-fileset-results - + # Counter({'total': 10, 'skip': 10, 'skip-no-access-url': 10, 'insert': 0, 'update': 0, 'exists': 0}) + + # Counter({'total': 10, 'insert': 10, 'skip': 0, 'update': 0, 'exists': 0}) + + +## Summary + +As follow-up, it may be worth doing another manual round of ingest requests. +After that, would be good to fill in "glue" code so that this can be done with +kafka workers, and do re-tries/dumps using sandcrawler SQL database. Then can +start scaling up more ingest, using ingest tool, "bulk mode" processing, +heritrix crawls from `no-capture` dumps, etc, similar to bulk file ingest +process. + +For scaling, let's do a "full" ingest request generation of all datasets, and +crawl the base URL with heritrix, in fast/direct mode. Expect this to be tens +of millions of mostly DOIs (doi.org URLs), should crawl quickly. + +Then, do bulk downloading with ingest worker, perhaps on misc-vm or aitio. +uploading large datasets to archive.org, but not doing SPN web requests. Feed +the resulting huge file seedlist into a heritrix crawl to download web files. + +Will need to add support for more specific platforms. + + +### Huge Bulk Ingest Prep + +On prod instance: + + ./fatcat_ingest.py --ingest-type dataset --allow-non-oa query type:dataset \ + | pv -l \ + | gzip \ + > /srv/fatcat/tasks/ingest_dataset_bulk.2022-01-05.json.gz + # Expecting 11264787 release objects in search queries + # TIMEOUT ERROR + # 6.07M 19:13:02 [87.7 /s] (partial) + +As follow-up, should do a full batch (not partial). For now search index is too +unreliable (read timeouts). + + zcat ingest_dataset_bulk.2022-01-05.partial.json.gz \ + | jq .base_url -r \ + | sort -u \ + | shuf \ + | awk '{print "F+ " $1}' \ + > ingest_dataset_bulk.2022-01-05.partial.schedule + +## Retries (2022-01-12) + +This is after having done a bunch of crawling. + + cat ingest_dataset_combined_results6.json \ + | rg '"no-capture"' \ + | jq 'select(.status = "no-capture")' -c \ + | jq .request -c \ + | pv -l \ + > ingest_dataset_retry.json + => 6.51k 0:00:01 [3.55k/s] + + cat /srv/sandcrawler/tasks/ingest_dataset_retry.json \ + | parallel -j4 --linebuffer --round-robin --pipe ./ingest_tool.py requests --no-spn2 --enable-sentry - \ + | pv -l \ + > /srv/sandcrawler/tasks/ingest_dataset_retry_results.json + |