First round of production dataset ingest. Aiming to get one or two small
repositories entirely covered, and a few thousand datasets from all supported
platforms.
Planning to run with sandcrawler in batch mode on `wbgrp-svc263`, expecting up
to a TByte of content locally (on spinning disk). For successful output, will
run through fatcat import; for a subset of unsuccessful, will start a small
heritrix crawl.
## Ingest Generation
Summary:
wc -l /srv/fatcat/tasks/ingest_dataset_*pilot.json
2 /srv/fatcat/tasks/ingest_dataset_dataverse_archiveorg_pilot.json
1702 /srv/fatcat/tasks/ingest_dataset_dataverse_goettingen_pilot.json
2975 /srv/fatcat/tasks/ingest_dataset_dataverse_harvard_pilot.json
10000 /srv/fatcat/tasks/ingest_dataset_figshare_pilot.json
10000 /srv/fatcat/tasks/ingest_dataset_zenodo_pilot.json
All the below ingest requests were combined into a single large file:
cat /srv/fatcat/tasks/ingest_dataset*pilot.json | shuf | pv -l | gzip > /srv/fatcat/tasks/ingest_dataset_combined.json.gz
# 24.7k 0:00:00 [91.9k/s]
### Figshare
- sample 10k datasets (not other types)
- want only "versioned" DOIs; use regex on DOI to ensure
./fatcat_ingest.py --limit 50000 --ingest-type dataset --allow-non-oa query 'doi_prefix:10.6084 type:dataset' \
| rg '10\.6084/m9\.figshare\.\d+\.v\d+' \
| shuf -n10000 \
| pv -l \
> /srv/fatcat/tasks/ingest_dataset_figshare_pilot.json
# Counter({'estimate': 505968, 'ingest_request': 50000, 'elasticsearch_release': 50000})
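For reference, the same filter-and-sample step can be sketched in Python. This is a minimal sketch and not part of `fatcat_ingest.py`: the function name `sample_versioned`, the `ext_ids.doi` request field, and the sample DOIs are assumptions for illustration. The pattern escapes the dot before `v`, so only a literal `.vN` suffix matches.

```python
import json
import random
import re

# Same intent as the `rg` filter: keep only version-specific figshare DOIs.
VERSIONED_FIGSHARE = re.compile(r"10\.6084/m9\.figshare\.\d+\.v\d+")


def sample_versioned(lines, k, seed=None):
    """Keep request lines containing a versioned figshare DOI, then sample up to k."""
    matched = [line for line in lines if VERSIONED_FIGSHARE.search(line)]
    rng = random.Random(seed)
    rng.shuffle(matched)  # equivalent in spirit to `shuf -n k`
    return matched[:k]


# Illustrative request lines (hypothetical shape, only the DOI matters here).
requests = [
    json.dumps({"ext_ids": {"doi": "10.6084/m9.figshare.7925396.v1"}}),
    json.dumps({"ext_ids": {"doi": "10.6084/m9.figshare.7925396"}}),
]
print(sample_versioned(requests, k=10))
```

Only the first request survives, since the second DOI has no version suffix.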
### Zenodo
- has DOIs (of course)
- want only "versioned" DOIs? how to skip?
- sample 10k
./fatcat_ingest.py --limit 50000 --ingest-type dataset --allow-non-oa query 'doi_prefix:10.5281 type:dataset' \
| rg '10\.5281/zenodo' \
| shuf -n10000 \
| pv -l \
> /srv/fatcat/tasks/ingest_dataset_zenodo_pilot.json
### Goettingen Research Online
-
- Dataverse instance, not harvard-hosted
- ~1,400 datasets, ~10,500 files
- has DOIs
- `doi_prefix:10.25625`, then filter to only one slash
./fatcat_ingest.py --ingest-type dataset --allow-non-oa query 'doi_prefix:10.25625 type:dataset' \
| rg -v '10\.25625/[a-z0-9]+/[a-z0-9]' \
| shuf \
| pv -l \
> /srv/fatcat/tasks/ingest_dataset_dataverse_goettingen_pilot.json
# Counter({'ingest_request': 12739, 'elasticsearch_release': 12739, 'estimate': 12739}) # 1.7k 0:01:29 [ 19 /s]
### Harvard Dataverse
- main harvard dataverse instance, many "sub-dataverses"
- ~137,000 datasets, ~1,400,000 files
- 10k sample
./fatcat_ingest.py --limit 50000 --ingest-type dataset --allow-non-oa query 'doi_prefix:10.7910 type:dataset' \
| rg '10\.7910/dvn/[a-z0-9]{6}' \
| rg -v '10\.7910/dvn/[a-z0-9]{6}/[a-z0-9]' \
| shuf -n10000 \
| pv -l \
> /srv/fatcat/tasks/ingest_dataset_dataverse_harvard_pilot.json
# Counter({'estimate': 660979, 'ingest_request': 50000, 'elasticsearch_release': 50000}) # 2.97k 0:03:26 [14.4 /s]
Note that this yielded only 2,975 requests, fewer than the 10k sample target, but moving on anyway.
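The two Dataverse filters above share one idea: file-level DOIs append an extra `/<suffix>` to the dataset DOI, so dataset-level requests are the ones without that extra path segment. A minimal Python sketch of the same logic follows; the function name `is_dataset_level` and the example DOIs are hypothetical, and the patterns are copied from the `rg` commands.

```python
import re

# Patterns copied from the `rg` filters in the notes.
HARVARD_DATASET = re.compile(r"10\.7910/dvn/[a-z0-9]{6}")
HARVARD_FILE = re.compile(r"10\.7910/dvn/[a-z0-9]{6}/[a-z0-9]")
GOETTINGEN_FILE = re.compile(r"10\.25625/[a-z0-9]+/[a-z0-9]")


def is_dataset_level(line):
    """True if a request line refers to a dataset DOI rather than a file DOI."""
    if "10.7910/" in line:
        # Must look like a DVN dataset DOI and must not carry a file suffix.
        return bool(HARVARD_DATASET.search(line)) and not HARVARD_FILE.search(line)
    if "10.25625/" in line:
        # Goettingen: only one slash allowed after the prefix.
        return not GOETTINGEN_FILE.search(line)
    return False


# Hypothetical DOIs, for illustration only.
for doi in ["10.7910/dvn/abc123", "10.7910/dvn/abc123/xyz789",
            "10.25625/abcdef", "10.25625/abcdef/1"]:
    print(doi, is_dataset_level(doi))
```

Like `rg`, this is case-sensitive, which assumes the DOIs in the request lines are already lowercased.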
### archive.org
A couple hand-filtered items.
"CAT" dataset
- item:
- fatcat release (for paper): `release_36vy7s5gtba67fmyxlmijpsaui`
"The Representativeness of Automated Web Crawls as a Surrogate for Human Browsing"
- https://archive.org/details/academictorrents_5e9ef2b5531ce3b965681be6eccab1fbd114af62
- https://fatcat.wiki/release/7owybd2hrvdmdpm4zpo7hkn2pu (paper)
{
"ingest_type": "dataset",
"ingest_request_source": "savepapernow",
"base_url": "https://archive.org/details/CAT_DATASET",
"release_stage": "published",
"fatcat": {
"release_ident": "36vy7s5gtba67fmyxlmijpsaui",
"work_ident": "ycqtbhnfmzamheq2amztiwbsri"
},
"ext_ids": {},
"link_source": "spn",
"link_source_id": "36vy7s5gtba67fmyxlmijpsaui"
}
{
"ingest_type": "dataset",
"ingest_request_source": "savepapernow",
"base_url": "https://archive.org/details/academictorrents_5e9ef2b5531ce3b965681be6eccab1fbd114af62",
"release_stage": "published",
"fatcat": {
"release_ident": "7owybd2hrvdmdpm4zpo7hkn2pu",
"work_ident": "3xkz7iffwbdfhbwhnd73iu66cu"
},
"ext_ids": {},
"link_source": "spn",
"link_source_id": "7owybd2hrvdmdpm4zpo7hkn2pu"
}
# paste and then Ctrl-D:
cat | jq . -c > /srv/fatcat/tasks/ingest_dataset_dataverse_archiveorg_pilot.json
## Ingest Command
On `wbgrp-svc263`.
In the current version of the tool, `skip_cleanup_local_files=True` by default, so
files will stick around.
Note that `--no-spn2` is passed, so we are expecting a lot of `no-capture` in the output.
# first a small sample
zcat /srv/sandcrawler/tasks/ingest_dataset_combined.json.gz \
| head -n5 \
| pv -l \
| parallel -j4 --linebuffer --round-robin --pipe ./ingest_tool.py requests --no-spn2 - \
> /srv/sandcrawler/tasks/ingest_dataset_combined_results.ramp.json
# ok, run the whole batch through
zcat /srv/sandcrawler/tasks/ingest_dataset_combined.json.gz \
| pv -l \
| parallel -j4 --linebuffer --round-robin --pipe ./ingest_tool.py requests --no-spn2 - \
> /srv/sandcrawler/tasks/ingest_dataset_combined_results.json
Got an error:
internetarchive.exceptions.AuthenticationError: No access_key or secret_key set! Have you run `ia configure`?
Did a hot patch to try to have the uploads happen under a session, with config from ENV, but didn't work:
AttributeError: 'ArchiveSession' object has no attribute 'upload'
Going to hack with config in homedir for now.
Extract URLs for crawling:
cat /srv/sandcrawler/tasks/ingest_dataset_combined_results*.json \
| rg '"no-capture"' \
| rg -v '"manifest"' \
| jq 'select(.status == "no-capture")' -c \
| jq .request.base_url -r \
| pv -l \
> /srv/sandcrawler/tasks/dataset_seedlist.base_url.txt
cat /srv/sandcrawler/tasks/ingest_dataset_combined_results*.json \
| rg '"no-capture"' \
| rg '"manifest"' \
| jq 'select(.status == "no-capture")' -c \
| rg '"web-' \
| jq .manifest[].terminal_url -r \
| pv -l \
> /srv/sandcrawler/tasks/dataset_seedlist.manifest_terminal.txt
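The two seedlist pipelines above can be sketched as a single pass in Python. This is a rough equivalent under stated assumptions: the field names (`status`, `manifest`, `terminal_url`, `request.base_url`) are taken from the `jq` expressions, the `"web-` check is done on the raw line exactly as `rg` does (which field it is meant to hit is not stated in these notes), and the function name `seed_urls` is made up. Unlike `jq -r`, entries without a `terminal_url` are skipped rather than printed as `null`.

```python
import json


def seed_urls(result_lines):
    """Split no-capture results into base URLs and manifest terminal URLs."""
    base_urls, terminal_urls = [], []
    for line in result_lines:
        if not line.startswith("{"):
            continue  # skip tracebacks and other non-JSON output
        row = json.loads(line)
        if row.get("status") != "no-capture":  # exact comparison on status
            continue
        if "manifest" not in row:
            # No manifest yet: the landing page itself needs crawling.
            base_urls.append(row["request"]["base_url"])
        elif '"web-' in line:
            # Manifest present: crawl each file's terminal URL.
            for entry in row["manifest"] or []:
                if entry.get("terminal_url"):
                    terminal_urls.append(entry["terminal_url"])
    return base_urls, terminal_urls
```

Doing the comparison on the parsed `status` field avoids relying on the `"no-capture"` string appearing only in that field.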
### Exceptions Encountered
File "/srv/sandcrawler/src/python/sandcrawler/fileset_strategies.py", line 193, in process
internetarchive.upload
[...]
ConnectionResetError: [Errno 104] Connection reset by peer
urllib3.exceptions.ProtocolError
requests.exceptions.ConnectionError: (ProtocolError('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer')), 'https://s3.us.archive.org/zenodo.org-3275525/rhOverM_Asymptotic_GeometricUnits_CoM.h5')
Traceback (most recent call last):
File "./ingest_tool.py", line 208, in <module>
main()
File "./ingest_tool.py", line 204, in main
args.func(args)
File "./ingest_tool.py", line 57, in run_requests
result = fileset_worker.process(request)
File "/srv/sandcrawler/src/python/sandcrawler/ingest_fileset.py", line 375, in process
archive_result = strategy_helper.process(dataset_meta)
File "/srv/sandcrawler/src/python/sandcrawler/fileset_strategies.py", line 130, in process
r.raise_for_status()
File "/srv/sandcrawler/src/python/.venv/lib/python3.8/site-packages/requests/models.py", line 953, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://ndownloader.figshare.com/files/5474201
Downloads sometimes just slowly time out, e.g. after a day or more.
Traceback (most recent call last):
File "./ingest_tool.py", line 208, in <module>
main()
File "./ingest_tool.py", line 204, in main
args.func(args)
File "./ingest_tool.py", line 57, in run_requests
result = fileset_worker.process(request)
File "/srv/sandcrawler/src/python/sandcrawler/ingest_fileset.py", line 381, in process
archive_result = strategy_helper.process(dataset_meta)
File "/srv/sandcrawler/src/python/sandcrawler/fileset_strategies.py", line 155, in process
file_meta = gen_file_metadata_path(local_path, allow_empty=True)
File "/srv/sandcrawler/src/python/sandcrawler/misc.py", line 89, in gen_file_metadata_path
mimetype = magic.Magic(mime=True).from_file(path)
File "/srv/sandcrawler/src/python/.venv/lib/python3.8/site-packages/magic/__init__.py", line 111, in from_file
with _real_open(filename):
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/sandcrawler/figshare.com-7925396-v1/HG02070.dedup.realigned.recalibrated.hc.g.vcf.gz'
Traceback (most recent call last):
File "./ingest_tool.py", line 208, in <module>
main()
File "./ingest_tool.py", line 204, in main
args.func(args)
File "./ingest_tool.py", line 57, in run_requests
result = fileset_worker.process(request)
File "/srv/sandcrawler/src/python/sandcrawler/ingest_fileset.py", line 314, in process
dataset_meta = platform_helper.process_request(request, resource, html_biblio)
File "/srv/sandcrawler/src/python/sandcrawler/fileset_platforms.py", line 208, in process_request
obj_latest = obj["data"]["latestVersion"]
KeyError: 'latestVersion'
Fixed the above, trying again:
git log | head -n1
# commit ffdc901fa067db55fe6cfeb8d0c3807d29df092c
Wed Dec 15 21:57:42 UTC 2021
zcat /srv/sandcrawler/tasks/ingest_dataset_combined.json.gz \
| shuf \
| parallel -j4 --linebuffer --round-robin --pipe ./ingest_tool.py requests --no-spn2 --enable-sentry - \
| pv -l \
> /srv/sandcrawler/tasks/ingest_dataset_combined_results4.json
Zenodo seems really slow, let's try filtering those out:
zcat /srv/sandcrawler/tasks/ingest_dataset_combined.json.gz \
| rg -v '10\.5281' \
| shuf \
| parallel -j8 --linebuffer --round-robin --pipe ./ingest_tool.py requests --no-spn2 --enable-sentry - \
| pv -l \
> /srv/sandcrawler/tasks/ingest_dataset_combined_results5.json
# 3.76k 15:12:53 [68.7m/s]
zcat /srv/sandcrawler/tasks/ingest_dataset_combined.json.gz \
| rg -v '10\.5281' \
| shuf \
| parallel -j8 --linebuffer --round-robin --pipe ./ingest_tool.py requests --no-spn2 --enable-sentry - \
| pv -l \
> /srv/sandcrawler/tasks/ingest_dataset_combined_results6.json
## Fatcat Import
wc -l ingest_dataset_combined_results*.json
126 ingest_dataset_combined_results2.json
153 ingest_dataset_combined_results3.json
275 ingest_dataset_combined_results4.json
3762 ingest_dataset_combined_results5.json
7736 ingest_dataset_combined_results6.json
182 ingest_dataset_combined_results.json
5 ingest_dataset_combined_results.ramp.json
12239 total
cat ingest_dataset_combined_results*.json \
| rg '^\{' \
| jq '[.request.fatcat.release_ident, . | tostring] | @tsv' -r \
| sort \
| uniq --check-chars 26 \
| cut -f2 \
| rg -v '\\\\' \
| pv -l \
> uniq_ingest_dataset_combined_results.json
# 9.48k 0:00:06 [1.54k/s]
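The de-duplication above keys on the 26-character `release_ident` at the start of each TSV line. A minimal Python sketch of the same idea follows, with two stated differences: it keeps the first result seen per release in input order rather than the first in sorted order, and it does not reproduce the extra `rg -v` backslash filter. The function name `dedupe_by_release` is hypothetical.

```python
import json


def dedupe_by_release(result_lines):
    """Yield at most one JSON result line per fatcat release_ident."""
    seen = set()
    for line in result_lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # same intent as `rg '^\{'`: drop non-JSON noise
        row = json.loads(line)
        ident = (row.get("request", {}).get("fatcat") or {}).get("release_ident")
        if ident in seen:
            continue
        seen.add(ident)
        yield line
```

Usage would be along the lines of `for line in dedupe_by_release(open(path)): print(line)`, with `path` pointing at a concatenation of the results files.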
cat uniq_ingest_dataset_combined_results.json | jq .status -r | sort | uniq -c | sort -nr
7941 no-capture
374 platform-404
369 terminal-bad-status
348 success-file
172 success
79 platform-scope
77 error-platform-download
47 empty-manifest
27 platform-restricted
20 too-many-files
12 redirect-loop
6 error-archiveorg-upload
3 too-large-size
3 mismatch
1 no-platform-match
cat uniq_ingest_dataset_combined_results.json \
| rg '"success' \
| jq 'select(.status == "success") | .' -c \
> uniq_ingest_dataset_combined_results.success.json
cat uniq_ingest_dataset_combined_results.json \
| rg '"success' \
| jq 'select(.status == "success-file") | .' -c \
> uniq_ingest_dataset_combined_results.success-file.json
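The status tally and the two success splits above can also be sketched in Python. This is only an illustration of the same steps, with hypothetical function names; it compares the parsed `status` field exactly, which is what the `jq` `select(... == ...)` step does after the coarser `rg '"success'` pre-filter.

```python
import json
from collections import Counter


def tally_statuses(result_lines):
    """Count results per status, like `jq .status -r | sort | uniq -c`."""
    return Counter(json.loads(line).get("status") for line in result_lines)


def select_status(result_lines, status):
    """Keep only lines whose status equals `status` exactly."""
    return [line for line in result_lines if json.loads(line).get("status") == status]


# Tiny illustrative input; real input is one JSON result per line.
lines = [
    '{"status": "success"}',
    '{"status": "success-file"}',
    '{"status": "no-capture"}',
    '{"status": "no-capture"}',
]
for status, count in tally_statuses(lines).most_common():
    print(count, status)
```

`most_common()` gives the same descending order as the trailing `sort -nr`.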
On fatcat QA instance:
git log | head -n1
# commit cca680e2cc4768a4d45e199f6256a433b25b4075
head /tmp/uniq_ingest_dataset_combined_results.success-file.json \
| ./fatcat_import.py ingest-fileset-results -
# Counter({'total': 10, 'skip': 10, 'skip-single-file': 10, 'insert': 0, 'update': 0, 'exists': 0})
head /tmp/uniq_ingest_dataset_combined_results.success-file.json \
| ./fatcat_import.py ingest-file-results -
# Counter({'total': 10, 'skip': 10, 'skip-ingest-type': 10, 'insert': 0, 'update': 0, 'exists': 0})
Need to update fatcat file worker to support single-file filesets... was that the plan?
head /tmp/uniq_ingest_dataset_combined_results.success.json \
| ./fatcat_import.py ingest-fileset-results -
# Counter({'total': 10, 'skip': 10, 'skip-no-access-url': 10, 'insert': 0, 'update': 0, 'exists': 0})
# Counter({'total': 10, 'insert': 10, 'skip': 0, 'update': 0, 'exists': 0})
## Summary
As follow-up, it may be worth doing another manual round of ingest requests.
After that, would be good to fill in "glue" code so that this can be done with
kafka workers, and do re-tries/dumps using sandcrawler SQL database. Then can
start scaling up more ingest, using ingest tool, "bulk mode" processing,
heritrix crawls from `no-capture` dumps, etc, similar to bulk file ingest
process.
For scaling, let's do a "full" ingest request generation of all datasets, and
crawl the base URL with heritrix, in fast/direct mode. Expect this to be tens
of millions of mostly DOIs (doi.org URLs), should crawl quickly.
Then, do bulk downloading with ingest worker, perhaps on misc-vm or aitio,
uploading large datasets to archive.org but not doing SPN web requests. Feed
the resulting huge file seedlist into a heritrix crawl to download web files.
Will need to add support for more specific platforms.
### Huge Bulk Ingest Prep
On prod instance:
./fatcat_ingest.py --ingest-type dataset --allow-non-oa query type:dataset \
| pv -l \
| gzip \
> /srv/fatcat/tasks/ingest_dataset_bulk.2022-01-05.json.gz
# Expecting 11264787 release objects in search queries
# TIMEOUT ERROR
# 6.07M 19:13:02 [87.7 /s] (partial)
As follow-up, should do a full batch (not partial). For now the search index is
too unreliable (read timeouts).
zcat ingest_dataset_bulk.2022-01-05.partial.json.gz \
| jq .base_url -r \
| sort -u \
| shuf \
| awk '{print "F+ " $1}' \
> ingest_dataset_bulk.2022-01-05.partial.schedule
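The schedule file above is just the unique base URLs, shuffled, each prefixed with `F+ `. A minimal Python sketch of that transformation follows; the function name `schedule_lines` is hypothetical and the `base_url` field name comes from the `jq` expression.

```python
import json
import random


def schedule_lines(request_lines, seed=None):
    """Turn ingest request lines into shuffled, de-duplicated 'F+ <url>' lines."""
    urls = sorted({json.loads(line)["base_url"] for line in request_lines})  # sort -u
    random.Random(seed).shuffle(urls)                                        # shuf
    return ["F+ " + url for url in urls]                                     # awk prefix
```

One difference worth noting: `awk '{print "F+ " $1}'` keeps only the first whitespace-separated field, so a URL containing a space would be truncated there, while this sketch keeps the URL whole.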
## Retries (2022-01-12)
This is after having done a bunch of crawling.
cat ingest_dataset_combined_results6.json \
| rg '"no-capture"' \
| jq 'select(.status == "no-capture")' -c \
| jq .request -c \
| pv -l \
> ingest_dataset_retry.json
# 6.51k 0:00:01 [3.55k/s]
cat /srv/sandcrawler/tasks/ingest_dataset_retry.json \
| parallel -j4 --linebuffer --round-robin --pipe ./ingest_tool.py requests --no-spn2 --enable-sentry - \
| pv -l \
> /srv/sandcrawler/tasks/ingest_dataset_retry_results.json
## Retries (2022-02)
Finally got things to complete end to end for this batch!
cat ingest_dataset_retry_results5.json | jq .status -r | sort | uniq -c | sort -nr
3220 terminal-bad-status
2120 no-capture
380 empty-manifest
264 success-file
251 success
126 success-existing
39 mismatch
28 error-platform-download
24 too-many-files
20 platform-scope
13 platform-restricted
13 mismatch-size
6 too-large-size
3 transfer-encoding-error
2 no-platform-match
2 error-archiveorg-upload
1 redirect-loop
1 empty-blob
Some more URLs to crawl:
cat ingest_dataset_retry_results5.json \
| rg '"no-capture"' \
| rg -v '"manifest"' \
| jq 'select(.status == "no-capture")' -c \
| jq .request.base_url -r \
| pv -l \
> /srv/sandcrawler/tasks/dataset_seedlist_retries5.base_url.txt
# 1.00
# just a single DOI that failed to crawl, for whatever reason
cat ingest_dataset_retry_results5.json \
| rg '"no-capture"' \
| rg '"manifest"' \
| jq 'select(.status == "no-capture")' -c \
| rg '"web-' \
| jq .manifest[].terminal_url -r \
| pv -l \
> /srv/sandcrawler/tasks/dataset_seedlist_retries5.manifest_terminal.txt
These are ready to crawl, in the existing dataset crawl.
cat /srv/sandcrawler/tasks/dataset_seedlist_retries5.manifest_terminal.txt \
| sort -u \
| shuf \
| awk '{print "F+ " $1}' \
> /srv/sandcrawler/tasks/dataset_seedlist_retries5.manifest_terminal.schedule