First round of production dataset ingest. Aiming to get one or two small
repositories entirely covered, and a few thousand datasets from all supported
platforms.

Planning to run with sandcrawler in batch mode on `wbgrp-svc263`, expecting up
to a TByte of content locally (on spinning disk). For successful output, will
run through fatcat import; for a subset of unsuccessful, will start a small
heritrix crawl.


## Ingest Generation

Summary:

    wc -l /srv/fatcat/tasks/ingest_dataset_*pilot.json
          2 /srv/fatcat/tasks/ingest_dataset_dataverse_archiveorg_pilot.json
       1702 /srv/fatcat/tasks/ingest_dataset_dataverse_goettingen_pilot.json
       2975 /srv/fatcat/tasks/ingest_dataset_dataverse_harvard_pilot.json
      10000 /srv/fatcat/tasks/ingest_dataset_figshare_pilot.json
      10000 /srv/fatcat/tasks/ingest_dataset_zenodo_pilot.json

All the below ingest requests were combined into a single large file:

    cat /srv/fatcat/tasks/ingest_dataset*pilot.json | shuf | pv -l | gzip > /srv/fatcat/tasks/ingest_dataset_combined.json.gz
    # 24.7k 0:00:00 [91.9k/s]

### Figshare

- sample 10k datasets (not other types)
- want only "versioned" DOIs; use regex on DOI to ensure

    ./fatcat_ingest.py --limit 50000 --ingest-type dataset --allow-non-oa query 'doi_prefix:10.6084 type:dataset' \
        | rg '10\.6084/m9\.figshare\.\d+.v\d+' \
        | shuf -n10000 \
        | pv -l \
        > /srv/fatcat/tasks/ingest_dataset_figshare_pilot.json
    # Counter({'estimate': 505968, 'ingest_request': 50000, 'elasticsearch_release': 50000})

### Zenodo

- has DOIs (of course)
- want only "versioned" DOIs? how to skip?
- sample 10k

    ./fatcat_ingest.py --limit 50000 --ingest-type dataset --allow-non-oa query 'doi_prefix:10.5281 type:dataset' \
        | rg '10\.5281/zenodo' \
        | shuf -n10000 \
        | pv -l \
        > /srv/fatcat/tasks/ingest_dataset_zenodo_pilot.json

### Goettingen Research Online

- <https://data.goettingen-research-online.de/>
- Dataverse instance, not harvard-hosted
- ~1,400 datasets, ~10,500 files
- has DOIs
- `doi_prefix:10.25625`, then filter to only one slash

    ./fatcat_ingest.py --ingest-type dataset --allow-non-oa query 'doi_prefix:10.25625 type:dataset' \
        | rg -v '10\.25625/[a-z0-9]+/[a-z0-9]' \
        | shuf \
        | pv -l \
        > /srv/fatcat/tasks/ingest_dataset_dataverse_goettingen_pilot.json
    # Counter({'ingest_request': 12739, 'elasticsearch_release': 12739, 'estimate': 12739})                                                                       # 1.7k 0:01:29 [  19 /s]

### Harvard Dataverse

- main harvard dataverse instance, many "sub-dataverses"
- ~137,000 datasets, ~1,400,000 files
- 10k sample

    ./fatcat_ingest.py --limit 50000 --ingest-type dataset --allow-non-oa query 'doi_prefix:10.7910 type:dataset' \
        | rg '10\.7910/dvn/[a-z0-9]{6}' \
        | rg -v '10\.7910/dvn/[a-z0-9]{6}/[a-z0-9]' \
        | shuf -n10000 \
        | pv -l \
        > /srv/fatcat/tasks/ingest_dataset_dataverse_harvard_pilot.json
    # Counter({'estimate': 660979, 'ingest_request': 50000, 'elasticsearch_release': 50000})                                                                      # 2.97k 0:03:26 [14.4 /s]

Note that this was fewer than expected, but moving on anyways.

### archive.org

A couple hand-filtered items.

"CAT" dataset
- item: <https://archive.org/details/CAT_DATASET>
- fatcat release (for paper): `release_36vy7s5gtba67fmyxlmijpsaui`

"The Representativeness of Automated Web Crawls as a Surrogate for Human Browsing"
- https://archive.org/details/academictorrents_5e9ef2b5531ce3b965681be6eccab1fbd114af62
- https://fatcat.wiki/release/7owybd2hrvdmdpm4zpo7hkn2pu (paper)


    {
        "ingest_type": "dataset",
        "ingest_request_source": "savepapernow",
        "base_url": "https://archive.org/details/CAT_DATASET",
        "release_stage": "published",
        "fatcat": {
            "release_ident": "36vy7s5gtba67fmyxlmijpsaui",
            "work_ident": "ycqtbhnfmzamheq2amztiwbsri"
        },
        "ext_ids": {},
        "link_source": "spn",
        "link_source_id": "36vy7s5gtba67fmyxlmijpsaui"
    }
    {
        "ingest_type": "dataset",
        "ingest_request_source": "savepapernow",
        "base_url": "https://archive.org/details/academictorrents_5e9ef2b5531ce3b965681be6eccab1fbd114af62",
        "release_stage": "published",
        "fatcat": {
            "release_ident": "7owybd2hrvdmdpm4zpo7hkn2pu",
            "work_ident": "3xkz7iffwbdfhbwhnd73iu66cu"
        },
        "ext_ids": {},
        "link_source": "spn",
        "link_source_id": "7owybd2hrvdmdpm4zpo7hkn2pu"
    }

    # paste and then Ctrl-D:
    cat | jq . -c > /srv/fatcat/tasks/ingest_dataset_dataverse_archiveorg_pilot.json


## Ingest Command

On `wbgrp-svc263`.

In the current version of tool, `skip_cleanup_local_files=True` by default, so
files will stick around.

Note that `--no-spn2` is passed, so we are expecting a lot of `no-capture` in the output.


    # first a small sample
    zcat /srv/sandcrawler/tasks/ingest_dataset_combined.json.gz \
        | head -n5 \
        | pv -l \
        | parallel -j4 --linebuffer --round-robin --pipe ./ingest_tool.py requests --no-spn2 - \
        > /srv/sandcrawler/tasks/ingest_dataset_combined_results.ramp.json

    # ok, run the whole batch through
    zcat /srv/sandcrawler/tasks/ingest_dataset_combined.json.gz \
        | pv -l \
        | parallel -j4 --linebuffer --round-robin --pipe ./ingest_tool.py requests --no-spn2 - \
        > /srv/sandcrawler/tasks/ingest_dataset_combined_results.json

Got an error:

    internetarchive.exceptions.AuthenticationError: No access_key or secret_key set! Have you run `ia configure`?

Did a hot patch to try to have the uploads happen under a session, with config from ENV, but didn't work:

    AttributeError: 'ArchiveSession' object has no attribute 'upload'

Going to hack with config in homedir for now.

Extract URLs for crawling:

    cat /srv/sandcrawler/tasks/ingest_dataset_combined_results*.json \
        | rg '"no-capture"' \
        | rg -v '"manifest"' \
        | jq 'select(.status = "no-capture")' -c \
        | jq .request.base_url -r \
        | pv -l \
        > /srv/sandcrawler/tasks/dataset_seedlist.base_url.txt

    cat /srv/sandcrawler/tasks/ingest_dataset_combined_results*.json \
        | rg '"no-capture"' \
        | rg '"manifest"' \
        | jq 'select(.status = "no-capture")' -c \
        | rg '"web-' \
        | jq .manifest[].terminal_url -r \
        | pv -l \
        > /srv/sandcrawler/tasks/dataset_seedlist.manifest_terminal.txt

### Exceptions Encountered

    File "/srv/sandcrawler/src/python/sandcrawler/fileset_strategies.py", line 193, in process
        internetarchive.upload
    [...]
    ConnectionResetError: [Errno 104] Connection reset by peer
    urllib3.exceptions.ProtocolError
    requests.exceptions.ConnectionError: (ProtocolError('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer')), 'https://s3.us.archive.org/zenodo.org-3275525/rhOverM_Asymptotic_GeometricUnits_CoM.h5')


    Traceback (most recent call last):
      File "./ingest_tool.py", line 208, in <module>
        main()
      File "./ingest_tool.py", line 204, in main
        args.func(args)
      File "./ingest_tool.py", line 57, in run_requests
        result = fileset_worker.process(request)
      File "/srv/sandcrawler/src/python/sandcrawler/ingest_fileset.py", line 375, in process
        archive_result = strategy_helper.process(dataset_meta)
      File "/srv/sandcrawler/src/python/sandcrawler/fileset_strategies.py", line 130, in process
        r.raise_for_status()
      File "/srv/sandcrawler/src/python/.venv/lib/python3.8/site-packages/requests/models.py", line 953, in raise_for_status  
        raise HTTPError(http_error_msg, response=self)
    requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://ndownloader.figshare.com/files/5474201

download sometimes just slowly time out, like after a day or more


    Traceback (most recent call last):
      File "./ingest_tool.py", line 208, in <module>
        main()
      File "./ingest_tool.py", line 204, in main
        args.func(args)
      File "./ingest_tool.py", line 57, in run_requests
        result = fileset_worker.process(request)
      File "/srv/sandcrawler/src/python/sandcrawler/ingest_fileset.py", line 381, in process
        archive_result = strategy_helper.process(dataset_meta)
      File "/srv/sandcrawler/src/python/sandcrawler/fileset_strategies.py", line 155, in process
        file_meta = gen_file_metadata_path(local_path, allow_empty=True)
      File "/srv/sandcrawler/src/python/sandcrawler/misc.py", line 89, in gen_file_metadata_path
        mimetype = magic.Magic(mime=True).from_file(path)
      File "/srv/sandcrawler/src/python/.venv/lib/python3.8/site-packages/magic/__init__.py", line 111, in from_file
        with _real_open(filename):
    FileNotFoundError: [Errno 2] No such file or directory: '/tmp/sandcrawler/figshare.com-7925396-v1/HG02070.dedup.realigned.recalibrated.hc.g.vcf.gz'


    Traceback (most recent call last):
      File "./ingest_tool.py", line 208, in <module>
        main()
      File "./ingest_tool.py", line 204, in main
        args.func(args)
      File "./ingest_tool.py", line 57, in run_requests
        result = fileset_worker.process(request)
      File "/srv/sandcrawler/src/python/sandcrawler/ingest_fileset.py", line 314, in process
        dataset_meta = platform_helper.process_request(request, resource, html_biblio)
      File "/srv/sandcrawler/src/python/sandcrawler/fileset_platforms.py", line 208, in process_request
        obj_latest = obj["data"]["latestVersion"]
    KeyError: 'latestVersion'

Fixed the above, trying again:

    git log | head -n1
    # commit ffdc901fa067db55fe6cfeb8d0c3807d29df092c

    Wed Dec 15 21:57:42 UTC 2021

    zcat /srv/sandcrawler/tasks/ingest_dataset_combined.json.gz \
        | shuf \
        | parallel -j4 --linebuffer --round-robin --pipe ./ingest_tool.py requests --no-spn2 --enable-sentry - \
        | pv -l \
        > /srv/sandcrawler/tasks/ingest_dataset_combined_results4.json

Zenodo seems really slow, let's try filtering those out:

    zcat /srv/sandcrawler/tasks/ingest_dataset_combined.json.gz \
        | rg -v 10.5281 \
        | shuf \
        | parallel -j8 --linebuffer --round-robin --pipe ./ingest_tool.py requests --no-spn2 --enable-sentry - \
        | pv -l \
        > /srv/sandcrawler/tasks/ingest_dataset_combined_results5.json
    # 3.76k 15:12:53 [68.7m/s]

    zcat /srv/sandcrawler/tasks/ingest_dataset_combined.json.gz \
        | rg -v 10.5281 \
        | shuf \
        | parallel -j8 --linebuffer --round-robin --pipe ./ingest_tool.py requests --no-spn2 --enable-sentry - \
        | pv -l \
        > /srv/sandcrawler/tasks/ingest_dataset_combined_results6.json

## Fatcat Import

    wc -l ingest_dataset_combined_results*.json
         126 ingest_dataset_combined_results2.json
         153 ingest_dataset_combined_results3.json
         275 ingest_dataset_combined_results4.json
        3762 ingest_dataset_combined_results5.json
        7736 ingest_dataset_combined_results6.json
         182 ingest_dataset_combined_results.json
           5 ingest_dataset_combined_results.ramp.json
       12239 total

    cat ingest_dataset_combined_results*.json \
        | rg '^\{' \
        | jq '[.request.fatcat.release_ident, . | tostring] | @tsv' -r \
        | sort \
        | uniq --check-chars 26 \
        | cut -f2 \
        | rg -v '\\\\' \
        | pv -l \
        > uniq_ingest_dataset_combined_results.json
    # 9.48k 0:00:06 [1.54k/s]

    cat uniq_ingest_dataset_combined_results.json | jq .status -r | sort | uniq -c | sort -nr
       7941 no-capture
        374 platform-404
        369 terminal-bad-status
        348 success-file
        172 success
         79 platform-scope
         77 error-platform-download
         47 empty-manifest
         27 platform-restricted
         20 too-many-files
         12 redirect-loop
          6 error-archiveorg-upload
          3 too-large-size
          3 mismatch
          1 no-platform-match

    cat uniq_ingest_dataset_combined_results.json \
        | rg '"success' \
        | jq 'select(.status == "success") | .' -c \
        > uniq_ingest_dataset_combined_results.success.json

    cat uniq_ingest_dataset_combined_results.json \
        | rg '"success' \
        | jq 'select(.status == "success-file") | .' -c \
        > uniq_ingest_dataset_combined_results.success-file.json

On fatcat QA instance:

    git log | head -n1
    # commit cca680e2cc4768a4d45e199f6256a433b25b4075

    head /tmp/uniq_ingest_dataset_combined_results.success-file.json \
        | ./fatcat_import.py ingest-fileset-results -
    # Counter({'total': 10, 'skip': 10, 'skip-single-file': 10, 'insert': 0, 'update': 0, 'exists': 0})

    head /tmp/uniq_ingest_dataset_combined_results.success-file.json \
        | ./fatcat_import.py ingest-file-results -
    # Counter({'total': 10, 'skip': 10, 'skip-ingest-type': 10, 'insert': 0, 'update': 0, 'exists': 0})

Need to update fatcat file worker to support single-file filesets... was that the plan?

    head /tmp/uniq_ingest_dataset_combined_results.success.json \
        | ./fatcat_import.py ingest-fileset-results -
    # Counter({'total': 10, 'skip': 10, 'skip-no-access-url': 10, 'insert': 0, 'update': 0, 'exists': 0})

    # Counter({'total': 10, 'insert': 10, 'skip': 0, 'update': 0, 'exists': 0})


## Summary

As follow-up, it may be worth doing another manual round of ingest requests.
After that, would be good to fill in "glue" code so that this can be done with
kafka workers, and do re-tries/dumps using sandcrawler SQL database. Then can
start scaling up more ingest, using ingest tool, "bulk mode" processing,
heritrix crawls from `no-capture` dumps, etc, similar to bulk file ingest
process.

For scaling, let's do a "full" ingest request generation of all datasets, and
crawl the base URL with heritrix, in fast/direct mode. Expect this to be tens
of millions of mostly DOIs (doi.org URLs), should crawl quickly.

Then, do bulk downloading with ingest worker, perhaps on misc-vm or aitio.
uploading large datasets to archive.org, but not doing SPN web requests. Feed
the resulting huge file seedlist into a heritrix crawl to download web files.

Will need to add support for more specific platforms.


### Huge Bulk Ingest Prep

On prod instance:

    ./fatcat_ingest.py --ingest-type dataset --allow-non-oa query type:dataset \
        | pv -l \
        | gzip \
        > /srv/fatcat/tasks/ingest_dataset_bulk.2022-01-05.json.gz
    # Expecting 11264787 release objects in search queries
    # TIMEOUT ERROR
    # 6.07M 19:13:02 [87.7 /s] (partial)

As follow-up, should do a full batch (not partial). For now search index is too
unreliable (read timeouts).

    zcat ingest_dataset_bulk.2022-01-05.partial.json.gz \
        | jq .base_url -r \
        | sort -u \
        | shuf \
        | awk '{print "F+ " $1}' \
        > ingest_dataset_bulk.2022-01-05.partial.schedule

## Retries (2022-01-12)

This is after having done a bunch of crawling.

    cat ingest_dataset_combined_results6.json \
        | rg '"no-capture"' \
        | jq 'select(.status = "no-capture")' -c \
        | jq .request -c \
        | pv -l \
        > ingest_dataset_retry.json
    => 6.51k 0:00:01 [3.55k/s]

    cat /srv/sandcrawler/tasks/ingest_dataset_retry.json \
        | parallel -j4 --linebuffer --round-robin --pipe ./ingest_tool.py requests --no-spn2 --enable-sentry - \
        | pv -l \
        > /srv/sandcrawler/tasks/ingest_dataset_retry_results.json

## Retries (2022-02)

Finally got things to complete end to end for this batch!

    cat ingest_dataset_retry_results5.json | jq .status -r | sort | uniq -c | sort -nr
       3220 terminal-bad-status
       2120 no-capture
        380 empty-manifest
        264 success-file
        251 success
        126 success-existing
         39 mismatch
         28 error-platform-download
         24 too-many-files
         20 platform-scope
         13 platform-restricted
         13 mismatch-size
          6 too-large-size
          3 transfer-encoding-error
          2 no-platform-match
          2 error-archiveorg-upload
          1 redirect-loop
          1 empty-blob

Some more URLs to crawl:

    cat ingest_dataset_retry_results5.json \
        | rg '"no-capture"' \
        | rg -v '"manifest"' \
        | jq 'select(.status = "no-capture")' -c \
        | jq .request.base_url -r \
        | pv -l \
        > /srv/sandcrawler/tasks/dataset_seedlist_retries5.base_url.txt
    # 1.00
    # just a single DOI that failed to crawl, for whatever reason

    cat ingest_dataset_retry_results5.json \
        | rg '"no-capture"' \
        | rg '"manifest"' \
        | jq 'select(.status = "no-capture")' -c \
        | rg '"web-' \
        | jq .manifest[].terminal_url -r \
        | pv -l \
        > /srv/sandcrawler/tasks/dataset_seedlist_retries5.manifest_terminal.txt

These are ready to crawl, in the existing dataset crawl.

    cat /srv/sandcrawler/tasks/dataset_seedlist_retries5.manifest_terminal.txt \
        | sort -u \
        | shuf \
        | awk '{print "F+ " $1}' \
        > /srv/sandcrawler/tasks/dataset_seedlist_retries5.manifest_terminal.schedule