First round of production dataset ingest. Aiming to get one or two small repositories entirely covered, and a few thousand datasets from all supported platforms. Planning to run with sandcrawler in batch mode on `wbgrp-svc263`, expecting up to a TByte of content locally (on spinning disk). For successful output, will run through fatcat import; for a subset of unsuccessful, will start a small heritrix crawl. ## Ingest Generation Summary: wc -l /srv/fatcat/tasks/ingest_dataset_*pilot.json 2 /srv/fatcat/tasks/ingest_dataset_dataverse_archiveorg_pilot.json 1702 /srv/fatcat/tasks/ingest_dataset_dataverse_goettingen_pilot.json 2975 /srv/fatcat/tasks/ingest_dataset_dataverse_harvard_pilot.json 10000 /srv/fatcat/tasks/ingest_dataset_figshare_pilot.json 10000 /srv/fatcat/tasks/ingest_dataset_zenodo_pilot.json All the below ingest requests were combined into a single large file: cat /srv/fatcat/tasks/ingest_dataset*pilot.json | shuf | pv -l | gzip > /srv/fatcat/tasks/ingest_dataset_combined.json.gz # 24.7k 0:00:00 [91.9k/s] ### Figshare - sample 10k datasets (not other types) - want only "versioned" DOIs; use regex on DOI to ensure ./fatcat_ingest.py --limit 50000 --ingest-type dataset --allow-non-oa query 'doi_prefix:10.6084 type:dataset' \ | rg '10\.6084/m9\.figshare\.\d+.v\d+' \ | shuf -n10000 \ | pv -l \ > /srv/fatcat/tasks/ingest_dataset_figshare_pilot.json # Counter({'estimate': 505968, 'ingest_request': 50000, 'elasticsearch_release': 50000}) ### Zenodo - has DOIs (of course) - want only "versioned" DOIs? how to skip? - sample 10k ./fatcat_ingest.py --limit 50000 --ingest-type dataset --allow-non-oa query 'doi_prefix:10.5281 type:dataset' \ | rg '10\.5281/zenodo' \ | shuf -n10000 \ | pv -l \ > /srv/fatcat/tasks/ingest_dataset_zenodo_pilot.json ### Goettingen Research Online - - Dataverse instance, not harvard-hosted - ~1,400 datasets, ~10,500 files - has DOIs - `doi_prefix:10.25625`, then filter to only one slash ./fatcat_ingest.py --ingest-type dataset --allow-non-oa query 'doi_prefix:10.25625 type:dataset' \ | rg -v '10\.25625/[a-z0-9]+/[a-z0-9]' \ | shuf \ | pv -l \ > /srv/fatcat/tasks/ingest_dataset_dataverse_goettingen_pilot.json # Counter({'ingest_request': 12739, 'elasticsearch_release': 12739, 'estimate': 12739}) # 1.7k 0:01:29 [ 19 /s] ### Harvard Dataverse - main harvard dataverse instance, many "sub-dataverses" - ~137,000 datasets, ~1,400,000 files - 10k sample ./fatcat_ingest.py --limit 50000 --ingest-type dataset --allow-non-oa query 'doi_prefix:10.7910 type:dataset' \ | rg '10\.7910/dvn/[a-z0-9]{6}' \ | rg -v '10\.7910/dvn/[a-z0-9]{6}/[a-z0-9]' \ | shuf -n10000 \ | pv -l \ > /srv/fatcat/tasks/ingest_dataset_dataverse_harvard_pilot.json # Counter({'estimate': 660979, 'ingest_request': 50000, 'elasticsearch_release': 50000}) # 2.97k 0:03:26 [14.4 /s] Note that this was fewer than expected, but moving on anyways. ### archive.org A couple hand-filtered items. "CAT" dataset - item: - fatcat release (for paper): `release_36vy7s5gtba67fmyxlmijpsaui` "The Representativeness of Automated Web Crawls as a Surrogate for Human Browsing" - https://archive.org/details/academictorrents_5e9ef2b5531ce3b965681be6eccab1fbd114af62 - https://fatcat.wiki/release/7owybd2hrvdmdpm4zpo7hkn2pu (paper) { "ingest_type": "dataset", "ingest_request_source": "savepapernow", "base_url": "https://archive.org/details/CAT_DATASET", "release_stage": "published", "fatcat": { "release_ident": "36vy7s5gtba67fmyxlmijpsaui", "work_ident": "ycqtbhnfmzamheq2amztiwbsri" }, "ext_ids": {}, "link_source": "spn", "link_source_id": "36vy7s5gtba67fmyxlmijpsaui" } { "ingest_type": "dataset", "ingest_request_source": "savepapernow", "base_url": "https://archive.org/details/academictorrents_5e9ef2b5531ce3b965681be6eccab1fbd114af62", "release_stage": "published", "fatcat": { "release_ident": "7owybd2hrvdmdpm4zpo7hkn2pu", "work_ident": "3xkz7iffwbdfhbwhnd73iu66cu" }, "ext_ids": {}, "link_source": "spn", "link_source_id": "7owybd2hrvdmdpm4zpo7hkn2pu" } # paste and then Ctrl-D: cat | jq . -c > /srv/fatcat/tasks/ingest_dataset_dataverse_archiveorg_pilot.json ## Ingest Command On `wbgrp-svc263`. In the current version of tool, `skip_cleanup_local_files=True` by default, so files will stick around. Note that `--no-spn2` is passed, so we are expecting a lot of `no-capture` in the output. # first a small sample zcat /srv/sandcrawler/tasks/ingest_dataset_combined.json.gz \ | head -n5 \ | pv -l \ | parallel -j4 --linebuffer --round-robin --pipe ./ingest_tool.py requests --no-spn2 - \ > /srv/sandcrawler/tasks/ingest_dataset_combined_results.ramp.json # ok, run the whole batch through zcat /srv/sandcrawler/tasks/ingest_dataset_combined.json.gz \ | pv -l \ | parallel -j4 --linebuffer --round-robin --pipe ./ingest_tool.py requests --no-spn2 - \ > /srv/sandcrawler/tasks/ingest_dataset_combined_results.json Got an error: internetarchive.exceptions.AuthenticationError: No access_key or secret_key set! Have you run `ia configure`? Did a hot patch to try to have the uploads happen under a session, with config from ENV, but didn't work: AttributeError: 'ArchiveSession' object has no attribute 'upload' Going to hack with config in homedir for now. Extract URLs for crawling: cat /srv/sandcrawler/tasks/ingest_dataset_combined_results*.json \ | rg '"no-capture"' \ | rg -v '"manifest"' \ | jq 'select(.status = "no-capture")' -c \ | jq .request.base_url -r \ | pv -l \ > /srv/sandcrawler/tasks/dataset_seedlist.base_url.txt cat /srv/sandcrawler/tasks/ingest_dataset_combined_results*.json \ | rg '"no-capture"' \ | rg '"manifest"' \ | jq 'select(.status = "no-capture")' -c \ | rg '"web-' \ | jq .manifest[].terminal_url -r \ | pv -l \ > /srv/sandcrawler/tasks/dataset_seedlist.manifest_terminal.txt ### Exceptions Encountered File "/srv/sandcrawler/src/python/sandcrawler/fileset_strategies.py", line 193, in process internetarchive.upload [...] ConnectionResetError: [Errno 104] Connection reset by peer urllib3.exceptions.ProtocolError requests.exceptions.ConnectionError: (ProtocolError('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer')), 'https://s3.us.archive.org/zenodo.org-3275525/rhOverM_Asymptotic_GeometricUnits_CoM.h5') Traceback (most recent call last): File "./ingest_tool.py", line 208, in main() File "./ingest_tool.py", line 204, in main args.func(args) File "./ingest_tool.py", line 57, in run_requests result = fileset_worker.process(request) File "/srv/sandcrawler/src/python/sandcrawler/ingest_fileset.py", line 375, in process archive_result = strategy_helper.process(dataset_meta) File "/srv/sandcrawler/src/python/sandcrawler/fileset_strategies.py", line 130, in process r.raise_for_status() File "/srv/sandcrawler/src/python/.venv/lib/python3.8/site-packages/requests/models.py", line 953, in raise_for_status raise HTTPError(http_error_msg, response=self) requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://ndownloader.figshare.com/files/5474201 download sometimes just slowly time out, like after a day or more Traceback (most recent call last): File "./ingest_tool.py", line 208, in main() File "./ingest_tool.py", line 204, in main args.func(args) File "./ingest_tool.py", line 57, in run_requests result = fileset_worker.process(request) File "/srv/sandcrawler/src/python/sandcrawler/ingest_fileset.py", line 381, in process archive_result = strategy_helper.process(dataset_meta) File "/srv/sandcrawler/src/python/sandcrawler/fileset_strategies.py", line 155, in process file_meta = gen_file_metadata_path(local_path, allow_empty=True) File "/srv/sandcrawler/src/python/sandcrawler/misc.py", line 89, in gen_file_metadata_path mimetype = magic.Magic(mime=True).from_file(path) File "/srv/sandcrawler/src/python/.venv/lib/python3.8/site-packages/magic/__init__.py", line 111, in from_file with _real_open(filename): FileNotFoundError: [Errno 2] No such file or directory: '/tmp/sandcrawler/figshare.com-7925396-v1/HG02070.dedup.realigned.recalibrated.hc.g.vcf.gz' Traceback (most recent call last): File "./ingest_tool.py", line 208, in main() File "./ingest_tool.py", line 204, in main args.func(args) File "./ingest_tool.py", line 57, in run_requests result = fileset_worker.process(request) File "/srv/sandcrawler/src/python/sandcrawler/ingest_fileset.py", line 314, in process dataset_meta = platform_helper.process_request(request, resource, html_biblio) File "/srv/sandcrawler/src/python/sandcrawler/fileset_platforms.py", line 208, in process_request obj_latest = obj["data"]["latestVersion"] KeyError: 'latestVersion' Fixed the above, trying again: git log | head -n1 # commit ffdc901fa067db55fe6cfeb8d0c3807d29df092c Wed Dec 15 21:57:42 UTC 2021 zcat /srv/sandcrawler/tasks/ingest_dataset_combined.json.gz \ | shuf \ | parallel -j4 --linebuffer --round-robin --pipe ./ingest_tool.py requests --no-spn2 --enable-sentry - \ | pv -l \ > /srv/sandcrawler/tasks/ingest_dataset_combined_results4.json Zenodo seems really slow, let's try filtering those out: zcat /srv/sandcrawler/tasks/ingest_dataset_combined.json.gz \ | rg -v 10.5281 \ | shuf \ | parallel -j8 --linebuffer --round-robin --pipe ./ingest_tool.py requests --no-spn2 --enable-sentry - \ | pv -l \ > /srv/sandcrawler/tasks/ingest_dataset_combined_results5.json # 3.76k 15:12:53 [68.7m/s] zcat /srv/sandcrawler/tasks/ingest_dataset_combined.json.gz \ | rg -v 10.5281 \ | shuf \ | parallel -j8 --linebuffer --round-robin --pipe ./ingest_tool.py requests --no-spn2 --enable-sentry - \ | pv -l \ > /srv/sandcrawler/tasks/ingest_dataset_combined_results6.json ## Fatcat Import wc -l ingest_dataset_combined_results*.json 126 ingest_dataset_combined_results2.json 153 ingest_dataset_combined_results3.json 275 ingest_dataset_combined_results4.json 3762 ingest_dataset_combined_results5.json 7736 ingest_dataset_combined_results6.json 182 ingest_dataset_combined_results.json 5 ingest_dataset_combined_results.ramp.json 12239 total cat ingest_dataset_combined_results*.json \ | rg '^\{' \ | jq '[.request.fatcat.release_ident, . | tostring] | @tsv' -r \ | sort \ | uniq --check-chars 26 \ | cut -f2 \ | rg -v '\\\\' \ | pv -l \ > uniq_ingest_dataset_combined_results.json # 9.48k 0:00:06 [1.54k/s] cat uniq_ingest_dataset_combined_results.json | jq .status -r | sort | uniq -c | sort -nr 7941 no-capture 374 platform-404 369 terminal-bad-status 348 success-file 172 success 79 platform-scope 77 error-platform-download 47 empty-manifest 27 platform-restricted 20 too-many-files 12 redirect-loop 6 error-archiveorg-upload 3 too-large-size 3 mismatch 1 no-platform-match cat uniq_ingest_dataset_combined_results.json \ | rg '"success' \ | jq 'select(.status == "success") | .' -c \ > uniq_ingest_dataset_combined_results.success.json cat uniq_ingest_dataset_combined_results.json \ | rg '"success' \ | jq 'select(.status == "success-file") | .' -c \ > uniq_ingest_dataset_combined_results.success-file.json On fatcat QA instance: git log | head -n1 # commit cca680e2cc4768a4d45e199f6256a433b25b4075 head /tmp/uniq_ingest_dataset_combined_results.success-file.json \ | ./fatcat_import.py ingest-fileset-results - # Counter({'total': 10, 'skip': 10, 'skip-single-file': 10, 'insert': 0, 'update': 0, 'exists': 0}) head /tmp/uniq_ingest_dataset_combined_results.success-file.json \ | ./fatcat_import.py ingest-file-results - # Counter({'total': 10, 'skip': 10, 'skip-ingest-type': 10, 'insert': 0, 'update': 0, 'exists': 0}) Need to update fatcat file worker to support single-file filesets... was that the plan? head /tmp/uniq_ingest_dataset_combined_results.success.json \ | ./fatcat_import.py ingest-fileset-results - # Counter({'total': 10, 'skip': 10, 'skip-no-access-url': 10, 'insert': 0, 'update': 0, 'exists': 0}) # Counter({'total': 10, 'insert': 10, 'skip': 0, 'update': 0, 'exists': 0}) ## Summary As follow-up, it may be worth doing another manual round of ingest requests. After that, would be good to fill in "glue" code so that this can be done with kafka workers, and do re-tries/dumps using sandcrawler SQL database. Then can start scaling up more ingest, using ingest tool, "bulk mode" processing, heritrix crawls from `no-capture` dumps, etc, similar to bulk file ingest process. For scaling, let's do a "full" ingest request generation of all datasets, and crawl the base URL with heritrix, in fast/direct mode. Expect this to be tens of millions of mostly DOIs (doi.org URLs), should crawl quickly. Then, do bulk downloading with ingest worker, perhaps on misc-vm or aitio. uploading large datasets to archive.org, but not doing SPN web requests. Feed the resulting huge file seedlist into a heritrix crawl to download web files. Will need to add support for more specific platforms. ### Huge Bulk Ingest Prep On prod instance: ./fatcat_ingest.py --ingest-type dataset --allow-non-oa query type:dataset \ | pv -l \ | gzip \ > /srv/fatcat/tasks/ingest_dataset_bulk.2022-01-05.json.gz # Expecting 11264787 release objects in search queries # TIMEOUT ERROR # 6.07M 19:13:02 [87.7 /s] (partial) As follow-up, should do a full batch (not partial). For now search index is too unreliable (read timeouts). zcat ingest_dataset_bulk.2022-01-05.partial.json.gz \ | jq .base_url -r \ | sort -u \ | shuf \ | awk '{print "F+ " $1}' \ > ingest_dataset_bulk.2022-01-05.partial.schedule ## Retries (2022-01-12) This is after having done a bunch of crawling. cat ingest_dataset_combined_results6.json \ | rg '"no-capture"' \ | jq 'select(.status = "no-capture")' -c \ | jq .request -c \ | pv -l \ > ingest_dataset_retry.json => 6.51k 0:00:01 [3.55k/s] cat /srv/sandcrawler/tasks/ingest_dataset_retry.json \ | parallel -j4 --linebuffer --round-robin --pipe ./ingest_tool.py requests --no-spn2 --enable-sentry - \ | pv -l \ > /srv/sandcrawler/tasks/ingest_dataset_retry_results.json ## Retries (2022-02) Finally got things to complete end to end for this batch! cat ingest_dataset_retry_results5.json | jq .status -r | sort | uniq -c | sort -nr 3220 terminal-bad-status 2120 no-capture 380 empty-manifest 264 success-file 251 success 126 success-existing 39 mismatch 28 error-platform-download 24 too-many-files 20 platform-scope 13 platform-restricted 13 mismatch-size 6 too-large-size 3 transfer-encoding-error 2 no-platform-match 2 error-archiveorg-upload 1 redirect-loop 1 empty-blob Some more URLs to crawl: cat ingest_dataset_retry_results5.json \ | rg '"no-capture"' \ | rg -v '"manifest"' \ | jq 'select(.status = "no-capture")' -c \ | jq .request.base_url -r \ | pv -l \ > /srv/sandcrawler/tasks/dataset_seedlist_retries5.base_url.txt # 1.00 # just a single DOI that failed to crawl, for whatever reason cat ingest_dataset_retry_results5.json \ | rg '"no-capture"' \ | rg '"manifest"' \ | jq 'select(.status = "no-capture")' -c \ | rg '"web-' \ | jq .manifest[].terminal_url -r \ | pv -l \ > /srv/sandcrawler/tasks/dataset_seedlist_retries5.manifest_terminal.txt These are ready to crawl, in the existing dataset crawl. cat /srv/sandcrawler/tasks/dataset_seedlist_retries5.manifest_terminal.txt \ | sort -u \ | shuf \ | awk '{print "F+ " $1}' \ > /srv/sandcrawler/tasks/dataset_seedlist_retries5.manifest_terminal.schedule