author Bryan Newbold <bnewbold@archive.org> 2022-03-22 16:03:46 -0700
committer Bryan Newbold <bnewbold@archive.org> 2022-03-22 16:03:46 -0700
commit d3638a9fd9ed11fb4484038852f8e02b2f5a7b41 (patch)
tree b4b8a9856eca7694d048f4f3e8086f8c3539682d
parent fd6dc7f36aecb6a303513476825cfe681500f02d (diff)
various ingest/task notes
-rw-r--r-- notes/ingest/2021-12-13_datasets.md | 53
-rw-r--r-- notes/ingest/2022-01-13_doi_crawl.md | 29
-rw-r--r-- notes/ingest/2022-03_doaj.md | 12
-rw-r--r-- notes/tasks/2021-12-06_regrobid.md | 8
4 files changed, 97 insertions, 5 deletions
diff --git a/notes/ingest/2021-12-13_datasets.md b/notes/ingest/2021-12-13_datasets.md
index edad789..1df633f 100644
--- a/notes/ingest/2021-12-13_datasets.md
+++ b/notes/ingest/2021-12-13_datasets.md
@@ -396,3 +396,56 @@ This is after having done a bunch of crawling.
| pv -l \
> /srv/sandcrawler/tasks/ingest_dataset_retry_results.json
+## Retries (2022-02)
+
+Finally got things to complete end to end for this batch!
+
+ cat ingest_dataset_retry_results5.json | jq .status -r | sort | uniq -c | sort -nr
+ 3220 terminal-bad-status
+ 2120 no-capture
+ 380 empty-manifest
+ 264 success-file
+ 251 success
+ 126 success-existing
+ 39 mismatch
+ 28 error-platform-download
+ 24 too-many-files
+ 20 platform-scope
+ 13 platform-restricted
+ 13 mismatch-size
+ 6 too-large-size
+ 3 transfer-encoding-error
+ 2 no-platform-match
+ 2 error-archiveorg-upload
+ 1 redirect-loop
+ 1 empty-blob
+
+Some more URLs to crawl:
+
+ cat ingest_dataset_retry_results5.json \
+ | rg '"no-capture"' \
+ | rg -v '"manifest"' \
+ | jq 'select(.status == "no-capture")' -c \
+ | jq .request.base_url -r \
+ | pv -l \
+ > /srv/sandcrawler/tasks/dataset_seedlist_retries5.base_url.txt
+ # 1.00
+ # just a single DOI that failed to crawl, for whatever reason
+
+ cat ingest_dataset_retry_results5.json \
+ | rg '"no-capture"' \
+ | rg '"manifest"' \
+ | jq 'select(.status == "no-capture")' -c \
+ | rg '"web-' \
+ | jq .manifest[].terminal_url -r \
+ | pv -l \
+ > /srv/sandcrawler/tasks/dataset_seedlist_retries5.manifest_terminal.txt
+
+These are ready to crawl, in the existing dataset crawl.
+
+ cat /srv/sandcrawler/tasks/dataset_seedlist_retries5.manifest_terminal.txt \
+ | sort -u \
+ | shuf \
+ | awk '{print "F+ " $1}' \
+ > /srv/sandcrawler/tasks/dataset_seedlist_retries5.manifest_terminal.schedule
+
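The retry steps in the dataset notes above (tally statuses, pull terminal URLs out of `no-capture` rows that carry a web-style manifest, then emit `F+ ` schedule lines) can be sketched in Python roughly as follows. This is a minimal illustration, not the sandcrawler implementation: the function names and `SAMPLE_ROWS` are made up for the example, and `hypothetical_strategy` is a placeholder key, since the notes only grep for the raw text `"web-` without naming a field. Unlike the `jq -r` step, the sketch skips manifest entries with no `terminal_url` instead of printing `null`.

```python
import json
import random
from collections import Counter


def tally_statuses(lines):
    """Count rows per status, most frequent first (like sort | uniq -c | sort -nr)."""
    counts = Counter(json.loads(line).get("status") for line in lines if line.strip())
    return counts.most_common()


def manifest_terminal_urls(lines):
    """Collect terminal URLs from no-capture rows whose raw JSON mentions "web-."""
    urls = set()
    for line in lines:
        # Mirror the raw-text `rg '"web-'` filter without guessing the field name.
        if '"web-' not in line:
            continue
        row = json.loads(line)
        # A real equality test on the status field.
        if row.get("status") != "no-capture":
            continue
        for entry in row.get("manifest") or []:
            if entry.get("terminal_url"):
                urls.add(entry["terminal_url"])
    return urls


def to_schedule(urls, seed=None):
    """De-duplicate, shuffle, and prefix each URL with "F+ " (sort -u | shuf | awk)."""
    ordered = sorted(set(urls))
    random.Random(seed).shuffle(ordered)
    return ["F+ " + url for url in ordered]


# Made-up rows for illustration only; "hypothetical_strategy" is not a confirmed field.
SAMPLE_ROWS = [
    json.dumps({
        "status": "no-capture",
        "hypothetical_strategy": "web-example",
        "manifest": [
            {"terminal_url": "https://example.com/a"},
            {"terminal_url": "https://example.com/b"},
        ],
    }),
    json.dumps({"status": "no-capture", "request": {"base_url": "https://example.com/doi"}}),
    json.dumps({
        "status": "success",
        "hypothetical_strategy": "web-example",
        "manifest": [{"terminal_url": "https://example.com/c"}],
    }),
]

if __name__ == "__main__":
    for status, count in tally_statuses(SAMPLE_ROWS):
        print(count, status)
    for schedule_line in to_schedule(manifest_terminal_urls(SAMPLE_ROWS)):
        print(schedule_line)
```

Testing equality explicitly matters here: in jq, `select(.status = "no-capture")` assigns rather than compares, so only the preceding `rg '"no-capture"'` text filter narrows the rows.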
diff --git a/notes/ingest/2022-01-13_doi_crawl.md b/notes/ingest/2022-01-13_doi_crawl.md
index 09a3b46..a6f08dd 100644
--- a/notes/ingest/2022-01-13_doi_crawl.md
+++ b/notes/ingest/2022-01-13_doi_crawl.md
@@ -137,7 +137,7 @@ many of these are likely to crawl successfully.
> /srv/fatcat/tasks/ingest_nonoa_doi.json.gz
# re-running 2022-02-08 after this VM was upgraded
# Expecting 8321448 release objects in search queries
- # TODO: in-progress
+ # DONE
This is large enough that it will probably be a bulk ingest, and then probably
a follow-up crawl.
@@ -219,3 +219,30 @@ Added to `JOURNALS-PATCH-CRAWL-2022-01`
Unless it is a 404, should retry.
TODO: generate this list
+
+## Non-OA DOI Bulk Ingest
+
+Had previously run:
+
+ cat ingest_nonoa_doi.json.gz \
+ | rg -v "doi.org/10.2139/" \
+ | rg -v "doi.org/10.1021/" \
+ | rg -v "doi.org/10.1121/" \
+ | rg -v "doi.org/10.1515/" \
+ | rg -v "doi.org/10.1093/" \
+ | rg -v "europepmc.org" \
+ | pv -l \
+ | gzip \
+ > nonoa_doi.filtered.ingests.json.gz
+ # 7.35M 0:01:13 [99.8k/s]
+
+Starting a bulk ingest of these on 2022-03-18, which is *before* the crawl has
+entirely finished, but after almost all queues (domains) have been done for
+several days.
+
+ zcat nonoa_doi.filtered.ingests.json.gz \
+ | rg -v "\\\\" \
+ | jq . -c \
+ | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
+
+Looks like many jstage `no-capture` status; these are still (slowly) crawling.
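The filtering done before the bulk ingest above (drop requests for certain DOI prefixes and `europepmc.org`, drop any line containing a backslash, then re-emit compact JSON) might look like this in Python. This is a sketch under assumptions: `filter_requests` and the sample lines in the usage note are invented for illustration, plain substring checks stand in for the `rg -v` regex patterns (where `.` would match any character), and the Kafka publishing step is left out.

```python
import json

# Substrings taken from the rg -v filters in the notes above.
EXCLUDED_SUBSTRINGS = (
    "doi.org/10.2139/",
    "doi.org/10.1021/",
    "doi.org/10.1121/",
    "doi.org/10.1515/",
    "doi.org/10.1093/",
    "europepmc.org",
)


def filter_requests(lines):
    """Yield compact JSON for each request line that survives the filters."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Skip excluded DOI prefixes and hosts (plain substring match).
        if any(substring in line for substring in EXCLUDED_SUBSTRINGS):
            continue
        # Skip lines containing a literal backslash, as `rg -v "\\\\"` does.
        if "\\" in line:
            continue
        # Re-serialize one object per line, similar to `jq . -c`.
        yield json.dumps(json.loads(line), separators=(",", ":"), ensure_ascii=False)
```

For example, given one made-up request for `https://doi.org/10.1234/abc` and another for `https://doi.org/10.1093/xyz`, only the first would be yielded.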
diff --git a/notes/ingest/2022-03_doaj.md b/notes/ingest/2022-03_doaj.md
index bace480..9722459 100644
--- a/notes/ingest/2022-03_doaj.md
+++ b/notes/ingest/2022-03_doaj.md
@@ -264,3 +264,15 @@ Create seedlist:
Sent off and added to the `TARGETED-ARTICLE-CRAWL-2022-03` heritrix crawl; will
re-ingest when that completes (a week or two?).
+
+
+## Bulk Ingest
+
+After `TARGETED-ARTICLE-CRAWL-2022-03` wrap-up.
+
+ # 2022-03-22
+ cat /srv/sandcrawler/tasks/doaj_seedlist_2022-03-10.requests.json \
+ | rg -v "\\\\" \
+ | jq . -c \
+ | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
+
diff --git a/notes/tasks/2021-12-06_regrobid.md b/notes/tasks/2021-12-06_regrobid.md
index d879277..79ea9f9 100644
--- a/notes/tasks/2021-12-06_regrobid.md
+++ b/notes/tasks/2021-12-06_regrobid.md
@@ -258,10 +258,10 @@ Submit individual batches like:
Overall progress:
x ungrobided_fatcat.2021-12-11.grobid_old.split_00.json
- . ungrobided_fatcat.2021-12-11.grobid_old.split_01.json
- => ungrobided_fatcat.2021-12-11.grobid_old.split_02.json
- => ungrobided_fatcat.2021-12-11.grobid_old.split_03.json
- => ungrobided_fatcat.2021-12-11.grobid_old.split_04.json
+ x ungrobided_fatcat.2021-12-11.grobid_old.split_01.json
+ x ungrobided_fatcat.2021-12-11.grobid_old.split_02.json
+ x ungrobided_fatcat.2021-12-11.grobid_old.split_03.json
+ . ungrobided_fatcat.2021-12-11.grobid_old.split_04.json
=> ungrobided_fatcat.2021-12-11.grobid_old.split_05.json
=> ungrobided_fatcat.2021-12-11.grobid_old.split_06.json
=> ungrobided_fatcat.2021-12-11.grobid_old.split_07.json