author    | Bryan Newbold <bnewbold@archive.org> | 2022-03-22 16:03:46 -0700
committer | Bryan Newbold <bnewbold@archive.org> | 2022-03-22 16:03:46 -0700
commit    | d3638a9fd9ed11fb4484038852f8e02b2f5a7b41 (patch)
tree      | b4b8a9856eca7694d048f4f3e8086f8c3539682d
parent    | fd6dc7f36aecb6a303513476825cfe681500f02d (diff)
download  | sandcrawler-d3638a9fd9ed11fb4484038852f8e02b2f5a7b41.tar.gz
          | sandcrawler-d3638a9fd9ed11fb4484038852f8e02b2f5a7b41.zip
various ingest/task notes
-rw-r--r-- | notes/ingest/2021-12-13_datasets.md  | 53
-rw-r--r-- | notes/ingest/2022-01-13_doi_crawl.md | 29
-rw-r--r-- | notes/ingest/2022-03_doaj.md         | 12
-rw-r--r-- | notes/tasks/2021-12-06_regrobid.md   |  8
4 files changed, 97 insertions, 5 deletions
diff --git a/notes/ingest/2021-12-13_datasets.md b/notes/ingest/2021-12-13_datasets.md
index edad789..1df633f 100644
--- a/notes/ingest/2021-12-13_datasets.md
+++ b/notes/ingest/2021-12-13_datasets.md
@@ -396,3 +396,56 @@ This is after having done a bunch of crawling.
         | pv -l \
         > /srv/sandcrawler/tasks/ingest_dataset_retry_results.json
+
+## Retries (2022-02)
+
+Finally got things to complete end to end for this batch!
+
+    cat ingest_dataset_retry_results5.json | jq .status -r | sort | uniq -c | sort -nr
+       3220 terminal-bad-status
+       2120 no-capture
+        380 empty-manifest
+        264 success-file
+        251 success
+        126 success-existing
+         39 mismatch
+         28 error-platform-download
+         24 too-many-files
+         20 platform-scope
+         13 platform-restricted
+         13 mismatch-size
+          6 too-large-size
+          3 transfer-encoding-error
+          2 no-platform-match
+          2 error-archiveorg-upload
+          1 redirect-loop
+          1 empty-blob
+
+Some more URLs to crawl:
+
+    cat ingest_dataset_retry_results5.json \
+        | rg '"no-capture"' \
+        | rg -v '"manifest"' \
+        | jq 'select(.status == "no-capture")' -c \
+        | jq .request.base_url -r \
+        | pv -l \
+        > /srv/sandcrawler/tasks/dataset_seedlist_retries5.base_url.txt
+    # 1.00
+    # just a single DOI that failed to crawl, for whatever reason
+
+    cat ingest_dataset_retry_results5.json \
+        | rg '"no-capture"' \
+        | rg '"manifest"' \
+        | jq 'select(.status == "no-capture")' -c \
+        | rg '"web-' \
+        | jq .manifest[].terminal_url -r \
+        | pv -l \
+        > /srv/sandcrawler/tasks/dataset_seedlist_retries5.manifest_terminal.txt
+
+These are ready to crawl, in the existing dataset crawl.
+
+    cat /srv/sandcrawler/tasks/dataset_seedlist_retries5.manifest_terminal.txt \
+        | sort -u \
+        | shuf \
+        | awk '{print "F+ " $1}' \
+        > /srv/sandcrawler/tasks/dataset_seedlist_retries5.manifest_terminal.schedule
+
diff --git a/notes/ingest/2022-01-13_doi_crawl.md b/notes/ingest/2022-01-13_doi_crawl.md
index 09a3b46..a6f08dd 100644
--- a/notes/ingest/2022-01-13_doi_crawl.md
+++ b/notes/ingest/2022-01-13_doi_crawl.md
@@ -137,7 +137,7 @@ many of these are likely to crawl successfully.
         > /srv/fatcat/tasks/ingest_nonoa_doi.json.gz
     # re-running 2022-02-08 after this VM was upgraded
     # Expecting 8321448 release objects in search queries
-    # TODO: in-progress
+    # DONE
 
 This is large enough that it will probably be a bulk ingest, and then probably
 a follow-up crawl.
@@ -219,3 +219,30 @@ Added to `JOURNALS-PATCH-CRAWL-2022-01`
 
 Unless it is a 404, should retry. TODO: generate this list
+
+## Non-OA DOI Bulk Ingest
+
+Had previously run:
+
+    cat ingest_nonoa_doi.json.gz \
+        | rg -v "doi.org/10.2139/" \
+        | rg -v "doi.org/10.1021/" \
+        | rg -v "doi.org/10.1121/" \
+        | rg -v "doi.org/10.1515/" \
+        | rg -v "doi.org/10.1093/" \
+        | rg -v "europepmc.org" \
+        | pv -l \
+        | gzip \
+        > nonoa_doi.filtered.ingests.json.gz
+    # 7.35M 0:01:13 [99.8k/s]
+
+Starting a bulk ingest of these on 2022-03-18, which is *before* the crawl has
+entirely finished, but after almost all queues (domains) have been done for
+several days.
+
+    zcat nonoa_doi.filtered.ingests.json.gz \
+        | rg -v "\\\\" \
+        | jq . -c \
+        | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
+
+Looks like many jstage `no-capture` status; these are still (slowly) crawling.
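The `rg -v "\\\\" | jq . -c` stage above drops any request lines containing a literal backslash (which have caused JSON handling problems) and re-serializes each surviving record as compact single-line JSON before producing to Kafka. A minimal sketch of that sanitization logic in Python, on made-up example records (the kafkacat producer stage is omitted):

```python
import json


def sanitize(lines):
    """Drop lines containing a literal backslash, then re-serialize each
    remaining record as compact single-line JSON (the Python equivalent
    of the `rg -v "\\\\" | jq . -c` stage in the notes above)."""
    out = []
    for line in lines:
        if "\\" in line:
            # mirrors `rg -v "\\\\"`: skip any line with a backslash
            continue
        record = json.loads(line)
        # mirrors `jq . -c`: compact, single-line serialization
        out.append(json.dumps(record, separators=(",", ":")))
    return out


# Made-up example records, not taken from the actual ingest request files:
lines = [
    '{"base_url": "https://example.com/a", "ingest_type": "file"}',
    '{"base_url": "https://example.com/b\\x00", "ingest_type": "file"}',
    '{ "base_url": "https://example.com/c",  "ingest_type": "file" }',
]
for clean in sanitize(lines):
    print(clean)
```

The second sample line contains a backslash escape and is filtered out before it could fail JSON parsing; the third is merely reformatted.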
diff --git a/notes/ingest/2022-03_doaj.md b/notes/ingest/2022-03_doaj.md
index bace480..9722459 100644
--- a/notes/ingest/2022-03_doaj.md
+++ b/notes/ingest/2022-03_doaj.md
@@ -264,3 +264,15 @@ Create seedlist:
 
 Sent off an addition to the `TARGETED-ARTICLE-CRAWL-2022-03` heritrix crawl,
 will re-ingest when that completes (a week or two?).
+
+
+## Bulk Ingest
+
+After `TARGETED-ARTICLE-CRAWL-2022-03` wrap-up.
+
+    # 2022-03-22
+    cat /srv/sandcrawler/tasks/doaj_seedlist_2022-03-10.requests.json \
+        | rg -v "\\\\" \
+        | jq . -c \
+        | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
+
diff --git a/notes/tasks/2021-12-06_regrobid.md b/notes/tasks/2021-12-06_regrobid.md
index d879277..79ea9f9 100644
--- a/notes/tasks/2021-12-06_regrobid.md
+++ b/notes/tasks/2021-12-06_regrobid.md
@@ -258,10 +258,10 @@ Submit individual batches like:
 Overall progress:
 
     x  ungrobided_fatcat.2021-12-11.grobid_old.split_00.json
-    .  ungrobided_fatcat.2021-12-11.grobid_old.split_01.json
-    => ungrobided_fatcat.2021-12-11.grobid_old.split_02.json
-    => ungrobided_fatcat.2021-12-11.grobid_old.split_03.json
-    => ungrobided_fatcat.2021-12-11.grobid_old.split_04.json
+    x  ungrobided_fatcat.2021-12-11.grobid_old.split_01.json
+    x  ungrobided_fatcat.2021-12-11.grobid_old.split_02.json
+    x  ungrobided_fatcat.2021-12-11.grobid_old.split_03.json
+    .  ungrobided_fatcat.2021-12-11.grobid_old.split_04.json
     => ungrobided_fatcat.2021-12-11.grobid_old.split_05.json
     => ungrobided_fatcat.2021-12-11.grobid_old.split_06.json
     => ungrobided_fatcat.2021-12-11.grobid_old.split_07.json
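The status tallies that appear throughout these notes come from the `jq .status -r | sort | uniq -c | sort -nr` idiom applied to newline-delimited JSON ingest results. The same counting logic, sketched in Python on made-up example records (field names follow the `status` / `request.base_url` shape the pipelines reference):

```python
import json
from collections import Counter


def tally_statuses(lines):
    """Count ingest result records by status, most common first
    (equivalent to: jq .status -r | sort | uniq -c | sort -nr)."""
    counts = Counter(json.loads(line)["status"] for line in lines)
    return counts.most_common()


# Made-up example results, not actual ingest output:
lines = [
    '{"status": "no-capture", "request": {"base_url": "https://example.com/1"}}',
    '{"status": "success", "request": {"base_url": "https://example.com/2"}}',
    '{"status": "no-capture", "request": {"base_url": "https://example.com/3"}}',
]
for status, count in tally_statuses(lines):
    # right-align counts, like `uniq -c` output
    print(f"{count:7d} {status}")
```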