author     Bryan Newbold <bnewbold@archive.org>    2022-01-27 17:55:15 -0800
committer  Bryan Newbold <bnewbold@archive.org>    2022-01-27 17:55:15 -0800
commit     c8e2462471a010e4ae368941b539e9404f3768fc (patch)
tree       8b9eaa02fbe7e75a8dfb09f341f77e6d645cc3b9 /notes/ingest/2021-12-13_datasets.md
parent     2a96e2baeb7d318a4aa2abbda7052757a02f5167 (diff)
ingest notes: various in-progress projects
Diffstat (limited to 'notes/ingest/2021-12-13_datasets.md')
-rw-r--r--    notes/ingest/2021-12-13_datasets.md    398
1 file changed, 398 insertions, 0 deletions
diff --git a/notes/ingest/2021-12-13_datasets.md b/notes/ingest/2021-12-13_datasets.md
new file mode 100644
index 0000000..edad789
--- /dev/null
+++ b/notes/ingest/2021-12-13_datasets.md
@@ -0,0 +1,398 @@
+
+First round of production dataset ingest. Aiming to get one or two small
+repositories entirely covered, and a few thousand datasets from all supported
+platforms.
+
+Planning to run sandcrawler in batch mode on `wbgrp-svc263`, expecting up to a
+terabyte of content stored locally (on spinning disk). Successful results will
+be run through fatcat import; for a subset of the unsuccessful ones, a small
+heritrix crawl will be started.
+
+
+## Ingest Generation
+
+Summary:
+
+ wc -l /srv/fatcat/tasks/ingest_dataset_*pilot.json
+ 2 /srv/fatcat/tasks/ingest_dataset_dataverse_archiveorg_pilot.json
+ 1702 /srv/fatcat/tasks/ingest_dataset_dataverse_goettingen_pilot.json
+ 2975 /srv/fatcat/tasks/ingest_dataset_dataverse_harvard_pilot.json
+ 10000 /srv/fatcat/tasks/ingest_dataset_figshare_pilot.json
+ 10000 /srv/fatcat/tasks/ingest_dataset_zenodo_pilot.json
+
+All of the ingest requests generated below were combined into a single large file:
+
+ cat /srv/fatcat/tasks/ingest_dataset*pilot.json | shuf | pv -l | gzip > /srv/fatcat/tasks/ingest_dataset_combined.json.gz
+ # 24.7k 0:00:00 [91.9k/s]
+
+### Figshare
+
+- sample 10k datasets (not other item types)
+- want only "versioned" DOIs; filter with a regex on the DOI to ensure this
+
+ ./fatcat_ingest.py --limit 50000 --ingest-type dataset --allow-non-oa query 'doi_prefix:10.6084 type:dataset' \
+ | rg '10\.6084/m9\.figshare\.\d+.v\d+' \
+ | shuf -n10000 \
+ | pv -l \
+ > /srv/fatcat/tasks/ingest_dataset_figshare_pilot.json
+ # Counter({'estimate': 505968, 'ingest_request': 50000, 'elasticsearch_release': 50000})
+
+### Zenodo
+
+- has DOIs (of course)
+- want only "versioned" DOIs? not obvious how to skip concept DOIs (see the sketch after the command below)
+- sample 10k
+
+ ./fatcat_ingest.py --limit 50000 --ingest-type dataset --allow-non-oa query 'doi_prefix:10.5281 type:dataset' \
+ | rg '10\.5281/zenodo' \
+ | shuf -n10000 \
+ | pv -l \
+ > /srv/fatcat/tasks/ingest_dataset_zenodo_pilot.json
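+
+One possible answer to the "versioned vs. concept DOI" question above, as a
+sketch against the public Zenodo REST API (this was not run for the pilot; the
+endpoint behavior and field names are assumptions worth verifying): a concept
+DOI matches the record's `conceptdoi`, while version DOIs differ from it.
+
+    # Sketch: decide whether a Zenodo DOI is a specific version or the concept DOI.
+    # Assumes /api/records/<id> resolves concept record ids to the latest version.
+    import requests
+
+    def is_versioned_zenodo_doi(doi: str) -> bool:
+        # e.g. "10.5281/zenodo.3275525" -> record id "3275525"
+        record_id = doi.strip().lower().split("zenodo.")[-1]
+        resp = requests.get(f"https://zenodo.org/api/records/{record_id}", timeout=30)
+        resp.raise_for_status()
+        record = resp.json()
+        # pre-versioning records may lack conceptdoi entirely; treat them as versioned
+        conceptdoi = record.get("conceptdoi")
+        return conceptdoi is None or conceptdoi.lower() != doi.strip().lower()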
+
+### Goettingen Research Online
+
+- <https://data.goettingen-research-online.de/>
+- Dataverse instance, not hosted by Harvard
+- ~1,400 datasets, ~10,500 files
+- has DOIs
+- `doi_prefix:10.25625`, then filter to DOIs with only a single slash (dataset-level, not file-level)
+
+ ./fatcat_ingest.py --ingest-type dataset --allow-non-oa query 'doi_prefix:10.25625 type:dataset' \
+ | rg -v '10\.25625/[a-z0-9]+/[a-z0-9]' \
+ | shuf \
+ | pv -l \
+ > /srv/fatcat/tasks/ingest_dataset_dataverse_goettingen_pilot.json
+ # Counter({'ingest_request': 12739, 'elasticsearch_release': 12739, 'estimate': 12739}) # 1.7k 0:01:29 [ 19 /s]
+
+### Harvard Dataverse
+
+- main Harvard Dataverse instance, with many "sub-dataverses"
+- ~137,000 datasets, ~1,400,000 files
+- 10k sample
+
+ ./fatcat_ingest.py --limit 50000 --ingest-type dataset --allow-non-oa query 'doi_prefix:10.7910 type:dataset' \
+ | rg '10\.7910/dvn/[a-z0-9]{6}' \
+ | rg -v '10\.7910/dvn/[a-z0-9]{6}/[a-z0-9]' \
+ | shuf -n10000 \
+ | pv -l \
+ > /srv/fatcat/tasks/ingest_dataset_dataverse_harvard_pilot.json
+ # Counter({'estimate': 660979, 'ingest_request': 50000, 'elasticsearch_release': 50000}) # 2.97k 0:03:26 [14.4 /s]
+
+Note that this yielded fewer requests than expected, but moving on anyway.
+
+### archive.org
+
+A couple hand-filtered items.
+
+"CAT" dataset
+- item: <https://archive.org/details/CAT_DATASET>
+- fatcat release (for paper): `release_36vy7s5gtba67fmyxlmijpsaui`
+
+"The Representativeness of Automated Web Crawls as a Surrogate for Human Browsing"
+- https://archive.org/details/academictorrents_5e9ef2b5531ce3b965681be6eccab1fbd114af62
+- https://fatcat.wiki/release/7owybd2hrvdmdpm4zpo7hkn2pu (paper)
+
+
+ {
+ "ingest_type": "dataset",
+ "ingest_request_source": "savepapernow",
+ "base_url": "https://archive.org/details/CAT_DATASET",
+ "release_stage": "published",
+ "fatcat": {
+ "release_ident": "36vy7s5gtba67fmyxlmijpsaui",
+ "work_ident": "ycqtbhnfmzamheq2amztiwbsri"
+ },
+ "ext_ids": {},
+ "link_source": "spn",
+ "link_source_id": "36vy7s5gtba67fmyxlmijpsaui"
+ }
+ {
+ "ingest_type": "dataset",
+ "ingest_request_source": "savepapernow",
+ "base_url": "https://archive.org/details/academictorrents_5e9ef2b5531ce3b965681be6eccab1fbd114af62",
+ "release_stage": "published",
+ "fatcat": {
+ "release_ident": "7owybd2hrvdmdpm4zpo7hkn2pu",
+ "work_ident": "3xkz7iffwbdfhbwhnd73iu66cu"
+ },
+ "ext_ids": {},
+ "link_source": "spn",
+ "link_source_id": "7owybd2hrvdmdpm4zpo7hkn2pu"
+ }
+
+ # paste and then Ctrl-D:
+ cat | jq . -c > /srv/fatcat/tasks/ingest_dataset_dataverse_archiveorg_pilot.json
+
+
+## Ingest Command
+
+On `wbgrp-svc263`.
+
+In the current version of the tool, `skip_cleanup_local_files=True` is the
+default, so downloaded files will stick around on disk.
+
+Note that `--no-spn2` is passed, so we are expecting a lot of `no-capture` in the output.
+
+
+ # first a small sample
+ zcat /srv/sandcrawler/tasks/ingest_dataset_combined.json.gz \
+ | head -n5 \
+ | pv -l \
+ | parallel -j4 --linebuffer --round-robin --pipe ./ingest_tool.py requests --no-spn2 - \
+ > /srv/sandcrawler/tasks/ingest_dataset_combined_results.ramp.json
+
+ # ok, run the whole batch through
+ zcat /srv/sandcrawler/tasks/ingest_dataset_combined.json.gz \
+ | pv -l \
+ | parallel -j4 --linebuffer --round-robin --pipe ./ingest_tool.py requests --no-spn2 - \
+ > /srv/sandcrawler/tasks/ingest_dataset_combined_results.json
+
+Got an error:
+
+ internetarchive.exceptions.AuthenticationError: No access_key or secret_key set! Have you run `ia configure`?
+
+Did a hot patch to try to run the uploads under a session, with config from the environment, but that didn't work:
+
+ AttributeError: 'ArchiveSession' object has no attribute 'upload'
+
+Going to hack around this with config in the home directory for now.
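+
+Roughly what the hot patch was aiming for (a sketch, not the deployed code; the
+environment variable names and identifier are illustrative): build an
+`ArchiveSession` with S3 keys and upload through an `Item`, since
+`ArchiveSession` itself has no `upload()` method (hence the `AttributeError`
+above).
+
+    # Sketch: internetarchive upload with credentials passed explicitly, rather
+    # than relying on ~/.config/internetarchive.ini existing for the worker user.
+    import os
+    from internetarchive import get_session
+
+    ses = get_session(config={
+        "s3": {
+            "access": os.environ["IA_ACCESS_KEY"],  # env var names are illustrative
+            "secret": os.environ["IA_SECRET_KEY"],
+        },
+    })
+    item = ses.get_item("zenodo.org-3275525")  # example identifier from the log below
+    item.upload(files={"example.h5": "/tmp/sandcrawler/example.h5"})  # remote name -> local path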
+
+Extract URLs for crawling:
+
+ cat /srv/sandcrawler/tasks/ingest_dataset_combined_results*.json \
+ | rg '"no-capture"' \
+ | rg -v '"manifest"' \
+ | jq 'select(.status == "no-capture")' -c \
+ | jq .request.base_url -r \
+ | pv -l \
+ > /srv/sandcrawler/tasks/dataset_seedlist.base_url.txt
+
+ cat /srv/sandcrawler/tasks/ingest_dataset_combined_results*.json \
+ | rg '"no-capture"' \
+ | rg '"manifest"' \
+ | jq 'select(.status == "no-capture")' -c \
+ | rg '"web-' \
+ | jq .manifest[].terminal_url -r \
+ | pv -l \
+ > /srv/sandcrawler/tasks/dataset_seedlist.manifest_terminal.txt
+
+### Exceptions Encountered
+
+ File "/srv/sandcrawler/src/python/sandcrawler/fileset_strategies.py", line 193, in process
+ internetarchive.upload
+ [...]
+ ConnectionResetError: [Errno 104] Connection reset by peer
+ urllib3.exceptions.ProtocolError
+ requests.exceptions.ConnectionError: (ProtocolError('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer')), 'https://s3.us.archive.org/zenodo.org-3275525/rhOverM_Asymptotic_GeometricUnits_CoM.h5')
+
+
+ Traceback (most recent call last):
+ File "./ingest_tool.py", line 208, in <module>
+ main()
+ File "./ingest_tool.py", line 204, in main
+ args.func(args)
+ File "./ingest_tool.py", line 57, in run_requests
+ result = fileset_worker.process(request)
+ File "/srv/sandcrawler/src/python/sandcrawler/ingest_fileset.py", line 375, in process
+ archive_result = strategy_helper.process(dataset_meta)
+ File "/srv/sandcrawler/src/python/sandcrawler/fileset_strategies.py", line 130, in process
+ r.raise_for_status()
+ File "/srv/sandcrawler/src/python/.venv/lib/python3.8/site-packages/requests/models.py", line 953, in raise_for_status
+ raise HTTPError(http_error_msg, response=self)
+ requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://ndownloader.figshare.com/files/5474201
+
+Downloads sometimes just slowly time out, sometimes only after a day or more.
+
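+One way to bound these hangs (a sketch, not the current sandcrawler code; the
+limits are arbitrary): set a per-read timeout on the streaming request and
+enforce an overall wall-clock deadline while writing chunks.
+
+    # Sketch: requests' timeout covers connect time and individual socket reads,
+    # not the whole transfer, so track a total deadline separately.
+    import time
+    import requests
+
+    def download_with_deadline(url: str, dest_path: str, max_seconds: int = 4 * 3600) -> None:
+        deadline = time.monotonic() + max_seconds
+        with requests.get(url, stream=True, timeout=(30, 60)) as resp:
+            resp.raise_for_status()
+            with open(dest_path, "wb") as f:
+                for chunk in resp.iter_content(chunk_size=1024 * 1024):
+                    if time.monotonic() > deadline:
+                        raise TimeoutError(f"download exceeded {max_seconds}s: {url}")
+                    f.write(chunk)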
+
+ Traceback (most recent call last):
+ File "./ingest_tool.py", line 208, in <module>
+ main()
+ File "./ingest_tool.py", line 204, in main
+ args.func(args)
+ File "./ingest_tool.py", line 57, in run_requests
+ result = fileset_worker.process(request)
+ File "/srv/sandcrawler/src/python/sandcrawler/ingest_fileset.py", line 381, in process
+ archive_result = strategy_helper.process(dataset_meta)
+ File "/srv/sandcrawler/src/python/sandcrawler/fileset_strategies.py", line 155, in process
+ file_meta = gen_file_metadata_path(local_path, allow_empty=True)
+ File "/srv/sandcrawler/src/python/sandcrawler/misc.py", line 89, in gen_file_metadata_path
+ mimetype = magic.Magic(mime=True).from_file(path)
+ File "/srv/sandcrawler/src/python/.venv/lib/python3.8/site-packages/magic/__init__.py", line 111, in from_file
+ with _real_open(filename):
+ FileNotFoundError: [Errno 2] No such file or directory: '/tmp/sandcrawler/figshare.com-7925396-v1/HG02070.dedup.realigned.recalibrated.hc.g.vcf.gz'
+
+
+ Traceback (most recent call last):
+ File "./ingest_tool.py", line 208, in <module>
+ main()
+ File "./ingest_tool.py", line 204, in main
+ args.func(args)
+ File "./ingest_tool.py", line 57, in run_requests
+ result = fileset_worker.process(request)
+ File "/srv/sandcrawler/src/python/sandcrawler/ingest_fileset.py", line 314, in process
+ dataset_meta = platform_helper.process_request(request, resource, html_biblio)
+ File "/srv/sandcrawler/src/python/sandcrawler/fileset_platforms.py", line 208, in process_request
+ obj_latest = obj["data"]["latestVersion"]
+ KeyError: 'latestVersion'
+
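+A minimal sketch of the kind of guard this needs (hypothetical; the actual fix
+is in the sandcrawler commit noted below): the Dataverse native API can return
+a dataset record without a `latestVersion` block (presumably drafts or
+deaccessioned datasets), so the platform helper should not index into it
+blindly.
+
+    # Hypothetical guard, not the actual patch: fail soft when latestVersion is missing.
+    def get_latest_version(obj: dict) -> dict:
+        latest = obj.get("data", {}).get("latestVersion")
+        if not latest:
+            # real code would map this to an ingest status along the lines of 'platform-scope'
+            raise ValueError("dataverse record has no latestVersion (draft or deaccessioned?)")
+        return latest
+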
+Fixed the above, trying again:
+
+ git log | head -n1
+ # commit ffdc901fa067db55fe6cfeb8d0c3807d29df092c
+
+ Wed Dec 15 21:57:42 UTC 2021
+
+ zcat /srv/sandcrawler/tasks/ingest_dataset_combined.json.gz \
+ | shuf \
+ | parallel -j4 --linebuffer --round-robin --pipe ./ingest_tool.py requests --no-spn2 --enable-sentry - \
+ | pv -l \
+ > /srv/sandcrawler/tasks/ingest_dataset_combined_results4.json
+
+Zenodo seems really slow, let's try filtering those out:
+
+ zcat /srv/sandcrawler/tasks/ingest_dataset_combined.json.gz \
+ | rg -v 10.5281 \
+ | shuf \
+ | parallel -j8 --linebuffer --round-robin --pipe ./ingest_tool.py requests --no-spn2 --enable-sentry - \
+ | pv -l \
+ > /srv/sandcrawler/tasks/ingest_dataset_combined_results5.json
+ # 3.76k 15:12:53 [68.7m/s]
+
+ zcat /srv/sandcrawler/tasks/ingest_dataset_combined.json.gz \
+ | rg -v 10.5281 \
+ | shuf \
+ | parallel -j8 --linebuffer --round-robin --pipe ./ingest_tool.py requests --no-spn2 --enable-sentry - \
+ | pv -l \
+ > /srv/sandcrawler/tasks/ingest_dataset_combined_results6.json
+
+## Fatcat Import
+
+ wc -l ingest_dataset_combined_results*.json
+ 126 ingest_dataset_combined_results2.json
+ 153 ingest_dataset_combined_results3.json
+ 275 ingest_dataset_combined_results4.json
+ 3762 ingest_dataset_combined_results5.json
+ 7736 ingest_dataset_combined_results6.json
+ 182 ingest_dataset_combined_results.json
+ 5 ingest_dataset_combined_results.ramp.json
+ 12239 total
+
+Deduplicate by `release_ident` (fatcat idents are 26 characters, hence
+`--check-chars 26` on the sorted output), dropping the few records containing
+backslashes, which would not survive the TSV round-trip:
+
+ cat ingest_dataset_combined_results*.json \
+ | rg '^\{' \
+ | jq '[.request.fatcat.release_ident, . | tostring] | @tsv' -r \
+ | sort \
+ | uniq --check-chars 26 \
+ | cut -f2 \
+ | rg -v '\\\\' \
+ | pv -l \
+ > uniq_ingest_dataset_combined_results.json
+ # 9.48k 0:00:06 [1.54k/s]
+
+ cat uniq_ingest_dataset_combined_results.json | jq .status -r | sort | uniq -c | sort -nr
+ 7941 no-capture
+ 374 platform-404
+ 369 terminal-bad-status
+ 348 success-file
+ 172 success
+ 79 platform-scope
+ 77 error-platform-download
+ 47 empty-manifest
+ 27 platform-restricted
+ 20 too-many-files
+ 12 redirect-loop
+ 6 error-archiveorg-upload
+ 3 too-large-size
+ 3 mismatch
+ 1 no-platform-match
+
+ cat uniq_ingest_dataset_combined_results.json \
+ | rg '"success' \
+ | jq 'select(.status == "success") | .' -c \
+ > uniq_ingest_dataset_combined_results.success.json
+
+ cat uniq_ingest_dataset_combined_results.json \
+ | rg '"success' \
+ | jq 'select(.status == "success-file") | .' -c \
+ > uniq_ingest_dataset_combined_results.success-file.json
+
+On fatcat QA instance:
+
+ git log | head -n1
+ # commit cca680e2cc4768a4d45e199f6256a433b25b4075
+
+ head /tmp/uniq_ingest_dataset_combined_results.success-file.json \
+ | ./fatcat_import.py ingest-fileset-results -
+ # Counter({'total': 10, 'skip': 10, 'skip-single-file': 10, 'insert': 0, 'update': 0, 'exists': 0})
+
+ head /tmp/uniq_ingest_dataset_combined_results.success-file.json \
+ | ./fatcat_import.py ingest-file-results -
+ # Counter({'total': 10, 'skip': 10, 'skip-ingest-type': 10, 'insert': 0, 'update': 0, 'exists': 0})
+
+Need to update fatcat file worker to support single-file filesets... was that the plan?
+
+ head /tmp/uniq_ingest_dataset_combined_results.success.json \
+ | ./fatcat_import.py ingest-fileset-results -
+ # Counter({'total': 10, 'skip': 10, 'skip-no-access-url': 10, 'insert': 0, 'update': 0, 'exists': 0})
+
+ # Counter({'total': 10, 'insert': 10, 'skip': 0, 'update': 0, 'exists': 0})
+
+
+## Summary
+
+As a follow-up, it may be worth doing another manual round of ingest requests.
+After that, it would be good to fill in the "glue" code so that this can be
+done with kafka workers, with re-tries and dumps driven by the sandcrawler SQL
+database. Then we can start scaling up ingest using the ingest tool, "bulk
+mode" processing, and heritrix crawls from `no-capture` dumps, similar to the
+bulk file ingest process.
+
+For scaling, let's generate a "full" set of ingest requests for all datasets
+and crawl the base URLs with heritrix in fast/direct mode. Expect this to be
+tens of millions of URLs, mostly DOIs (doi.org URLs), which should crawl
+quickly.
+
+Then, do bulk downloading with the ingest worker, perhaps on misc-vm or aitio,
+uploading large datasets to archive.org but not making SPN web requests. Feed
+the resulting huge file seedlist into a heritrix crawl to download web files.
+
+Will need to add support for more specific platforms.
+
+
+### Huge Bulk Ingest Prep
+
+On prod instance:
+
+ ./fatcat_ingest.py --ingest-type dataset --allow-non-oa query type:dataset \
+ | pv -l \
+ | gzip \
+ > /srv/fatcat/tasks/ingest_dataset_bulk.2022-01-05.json.gz
+ # Expecting 11264787 release objects in search queries
+ # TIMEOUT ERROR
+ # 6.07M 19:13:02 [87.7 /s] (partial)
+
+As a follow-up, should do a full batch (not partial). For now the search index
+is too unreliable (read timeouts) to complete one.
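+
+One possible mitigation for the read timeouts (a sketch only, not what
+`fatcat_ingest.py` currently does; the host, index name, and field are
+assumptions): give the scroll a longer keep-alive and generous per-request
+timeouts.
+
+    # Sketch: elasticsearch-dsl scan over a huge result set with larger timeouts.
+    from elasticsearch import Elasticsearch
+    from elasticsearch_dsl import Search
+
+    client = Elasticsearch("https://search.fatcat.wiki", timeout=120)
+    search = Search(using=client, index="fatcat_release").query(
+        "query_string", query="type:dataset"
+    )
+    # keep the scroll context alive longer and allow slow individual requests
+    for hit in search.params(scroll="30m", request_timeout=120).scan():
+        print(hit.ident)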
+
+ zcat ingest_dataset_bulk.2022-01-05.partial.json.gz \
+ | jq .base_url -r \
+ | sort -u \
+ | shuf \
+ | awk '{print "F+ " $1}' \
+ > ingest_dataset_bulk.2022-01-05.partial.schedule
+
+(The `F+ ` prefix force-schedules each URL when the file is dropped into the
+heritrix crawl's action directory.)
+
+## Retries (2022-01-12)
+
+This is after having done a bunch of crawling.
+
+ cat ingest_dataset_combined_results6.json \
+ | rg '"no-capture"' \
+ | jq 'select(.status == "no-capture")' -c \
+ | jq .request -c \
+ | pv -l \
+ > ingest_dataset_retry.json
+ # 6.51k 0:00:01 [3.55k/s]
+
+ cat /srv/sandcrawler/tasks/ingest_dataset_retry.json \
+ | parallel -j4 --linebuffer --round-robin --pipe ./ingest_tool.py requests --no-spn2 --enable-sentry - \
+ | pv -l \
+ > /srv/sandcrawler/tasks/ingest_dataset_retry_results.json
+