author    | Bryan Newbold <bnewbold@archive.org> | 2022-03-22 16:03:46 -0700
committer | Bryan Newbold <bnewbold@archive.org> | 2022-03-22 16:03:46 -0700
commit    | d3638a9fd9ed11fb4484038852f8e02b2f5a7b41 (patch)
tree      | b4b8a9856eca7694d048f4f3e8086f8c3539682d
parent    | fd6dc7f36aecb6a303513476825cfe681500f02d (diff)
download  | sandcrawler-d3638a9fd9ed11fb4484038852f8e02b2f5a7b41.tar.gz
          | sandcrawler-d3638a9fd9ed11fb4484038852f8e02b2f5a7b41.zip
various ingest/task notes
-rw-r--r-- | notes/ingest/2021-12-13_datasets.md  | 53
-rw-r--r-- | notes/ingest/2022-01-13_doi_crawl.md | 29
-rw-r--r-- | notes/ingest/2022-03_doaj.md         | 12
-rw-r--r-- | notes/tasks/2021-12-06_regrobid.md   |  8
4 files changed, 97 insertions, 5 deletions
diff --git a/notes/ingest/2021-12-13_datasets.md b/notes/ingest/2021-12-13_datasets.md
index edad789..1df633f 100644
--- a/notes/ingest/2021-12-13_datasets.md
+++ b/notes/ingest/2021-12-13_datasets.md
@@ -396,3 +396,56 @@ This is after having done a bunch of crawling.
         | pv -l \
         > /srv/sandcrawler/tasks/ingest_dataset_retry_results.json
+
+## Retries (2022-02)
+
+Finally got things to complete end to end for this batch!
+
+    cat ingest_dataset_retry_results5.json | jq .status -r | sort | uniq -c | sort -nr
+       3220 terminal-bad-status
+       2120 no-capture
+        380 empty-manifest
+        264 success-file
+        251 success
+        126 success-existing
+         39 mismatch
+         28 error-platform-download
+         24 too-many-files
+         20 platform-scope
+         13 platform-restricted
+         13 mismatch-size
+          6 too-large-size
+          3 transfer-encoding-error
+          2 no-platform-match
+          2 error-archiveorg-upload
+          1 redirect-loop
+          1 empty-blob
+
+Some more URLs to crawl:
+
+    cat ingest_dataset_retry_results5.json \
+        | rg '"no-capture"' \
+        | rg -v '"manifest"' \
+        | jq 'select(.status == "no-capture")' -c \
+        | jq .request.base_url -r \
+        | pv -l \
+        > /srv/sandcrawler/tasks/dataset_seedlist_retries5.base_url.txt
+    # 1.00
+    # just a single DOI that failed to crawl, for whatever reason
+
+    cat ingest_dataset_retry_results5.json \
+        | rg '"no-capture"' \
+        | rg '"manifest"' \
+        | jq 'select(.status == "no-capture")' -c \
+        | rg '"web-' \
+        | jq .manifest[].terminal_url -r \
+        | pv -l \
+        > /srv/sandcrawler/tasks/dataset_seedlist_retries5.manifest_terminal.txt
+
+These are ready to crawl, in the existing dataset crawl.
+
+    cat /srv/sandcrawler/tasks/dataset_seedlist_retries5.manifest_terminal.txt \
+        | sort -u \
+        | shuf \
+        | awk '{print "F+ " $1}' \
+        > /srv/sandcrawler/tasks/dataset_seedlist_retries5.manifest_terminal.schedule
+
diff --git a/notes/ingest/2022-01-13_doi_crawl.md b/notes/ingest/2022-01-13_doi_crawl.md
index 09a3b46..a6f08dd 100644
--- a/notes/ingest/2022-01-13_doi_crawl.md
+++ b/notes/ingest/2022-01-13_doi_crawl.md
@@ -137,7 +137,7 @@ many of these are likely to crawl successfully.
         > /srv/fatcat/tasks/ingest_nonoa_doi.json.gz
     # re-running 2022-02-08 after this VM was upgraded
     # Expecting 8321448 release objects in search queries
-    # TODO: in-progress
+    # DONE
 
 This is large enough that it will probably be a bulk ingest, and then probably
 a follow-up crawl.
@@ -219,3 +219,30 @@ Added to `JOURNALS-PATCH-CRAWL-2022-01`
 
 Unless it is a 404, should retry. TODO: generate this list
+
+## Non-OA DOI Bulk Ingest
+
+Had previously run:
+
+    cat ingest_nonoa_doi.json.gz \
+        | rg -v "doi.org/10.2139/" \
+        | rg -v "doi.org/10.1021/" \
+        | rg -v "doi.org/10.1121/" \
+        | rg -v "doi.org/10.1515/" \
+        | rg -v "doi.org/10.1093/" \
+        | rg -v "europepmc.org" \
+        | pv -l \
+        | gzip \
+        > nonoa_doi.filtered.ingests.json.gz
+    # 7.35M 0:01:13 [99.8k/s]
+
+Starting a bulk ingest of these on 2022-03-18, which is *before* the crawl has
+entirely finished, but after almost all queues (domains) have been done for
+several days.
+
+    zcat nonoa_doi.filtered.ingests.json.gz \
+        | rg -v "\\\\" \
+        | jq . -c \
+        | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
+
+Looks like many jstage `no-capture` status; these are still (slowly) crawling.
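The `rg -v "\\\\" | jq . -c` stage above drops any request lines containing a literal backslash (which have caused JSON handling problems) and re-serializes each surviving record as compact single-line JSON before producing to Kafka. A minimal sketch of that sanitization logic in Python, on made-up example records (the kafkacat producer stage is omitted):

```python
import json


def sanitize(lines):
    """Drop lines containing a literal backslash, then re-serialize each
    remaining record as compact single-line JSON (the Python equivalent
    of the `rg -v "\\\\" | jq . -c` stage in the notes above)."""
    out = []
    for line in lines:
        if "\\" in line:
            # mirrors `rg -v "\\\\"`: skip any line with a backslash
            continue
        record = json.loads(line)
        # mirrors `jq . -c`: compact, single-line serialization
        out.append(json.dumps(record, separators=(",", ":")))
    return out


# Made-up example records, not taken from the actual ingest request files:
lines = [
    '{"base_url": "https://example.com/a", "ingest_type": "file"}',
    '{"base_url": "https://example.com/b\\x00", "ingest_type": "file"}',
    '{ "base_url": "https://example.com/c",  "ingest_type": "file" }',
]
for clean in sanitize(lines):
    print(clean)
```

The second sample line contains a backslash escape and is filtered out before it could fail JSON parsing; the third is merely reformatted.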
diff --git a/notes/ingest/2022-03_doaj.md b/notes/ingest/2022-03_doaj.md
index bace480..9722459 100644
--- a/notes/ingest/2022-03_doaj.md
+++ b/notes/ingest/2022-03_doaj.md
@@ -264,3 +264,15 @@ Create seedlist:
 
 Sent off an addition to the `TARGETED-ARTICLE-CRAWL-2022-03` heritrix crawl,
 will re-ingest when that completes (a week or two?).
+
+
+## Bulk Ingest
+
+After `TARGETED-ARTICLE-CRAWL-2022-03` wrap-up.
+
+    # 2022-03-22
+    cat /srv/sandcrawler/tasks/doaj_seedlist_2022-03-10.requests.json \
+        | rg -v "\\\\" \
+        | jq . -c \
+        | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
+
diff --git a/notes/tasks/2021-12-06_regrobid.md b/notes/tasks/2021-12-06_regrobid.md
index d879277..79ea9f9 100644
--- a/notes/tasks/2021-12-06_regrobid.md
+++ b/notes/tasks/2021-12-06_regrobid.md
@@ -258,10 +258,10 @@ Submit individual batches like:
 Overall progress:
 
     x  ungrobided_fatcat.2021-12-11.grobid_old.split_00.json
-    .  ungrobided_fatcat.2021-12-11.grobid_old.split_01.json
-    => ungrobided_fatcat.2021-12-11.grobid_old.split_02.json
-    => ungrobided_fatcat.2021-12-11.grobid_old.split_03.json
-    => ungrobided_fatcat.2021-12-11.grobid_old.split_04.json
+    x  ungrobided_fatcat.2021-12-11.grobid_old.split_01.json
+    x  ungrobided_fatcat.2021-12-11.grobid_old.split_02.json
+    x  ungrobided_fatcat.2021-12-11.grobid_old.split_03.json
+    .  ungrobided_fatcat.2021-12-11.grobid_old.split_04.json
     => ungrobided_fatcat.2021-12-11.grobid_old.split_05.json
     => ungrobided_fatcat.2021-12-11.grobid_old.split_06.json
     => ungrobided_fatcat.2021-12-11.grobid_old.split_07.json
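The status tallies that appear throughout these notes come from the `jq .status -r | sort | uniq -c | sort -nr` idiom applied to newline-delimited JSON ingest results. The same counting logic, sketched in Python on made-up example records (field names follow the `status` / `request.base_url` shape the pipelines reference):

```python
import json
from collections import Counter


def tally_statuses(lines):
    """Count ingest result records by status, most common first
    (equivalent to: jq .status -r | sort | uniq -c | sort -nr)."""
    counts = Counter(json.loads(line)["status"] for line in lines)
    return counts.most_common()


# Made-up example results, not actual ingest output:
lines = [
    '{"status": "no-capture", "request": {"base_url": "https://example.com/1"}}',
    '{"status": "success", "request": {"base_url": "https://example.com/2"}}',
    '{"status": "no-capture", "request": {"base_url": "https://example.com/3"}}',
]
for status, count in tally_statuses(lines):
    # right-align counts, like `uniq -c` output
    print(f"{count:7d} {status}")
```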