various ingest/task notes

author: Bryan Newbold <bnewbold@archive.org> 2022-03-22 16:03:46 -0700
committer: Bryan Newbold <bnewbold@archive.org> 2022-03-22 16:03:46 -0700
commit: d3638a9fd9ed11fb4484038852f8e02b2f5a7b41 (patch)
tree: b4b8a9856eca7694d048f4f3e8086f8c3539682d /notes/ingest/2022-01-13_doi_crawl.md
parent: fd6dc7f36aecb6a303513476825cfe681500f02d (diff)
download: sandcrawler-d3638a9fd9ed11fb4484038852f8e02b2f5a7b41.tar.gz
sandcrawler-d3638a9fd9ed11fb4484038852f8e02b2f5a7b41.zip
1 files changed, 28 insertions, 1 deletions
diff --git a/notes/ingest/2022-01-13_doi_crawl.md b/notes/ingest/2022-01-13_doi_crawl.md
index 09a3b46..a6f08dd 100644
--- a/notes/ingest/2022-01-13_doi_crawl.md
+++ b/notes/ingest/2022-01-13_doi_crawl.md
@@ -137,7 +137,7 @@ many of these are likely to crawl successfully.
         > /srv/fatcat/tasks/ingest_nonoa_doi.json.gz
     # re-running 2022-02-08 after this VM was upgraded
     # Expecting 8321448 release objects in search queries
-    # TODO: in-progress
+    # DONE
 
 This is large enough that it will probably be a bulk ingest, and then probably
 a follow-up crawl.
@@ -219,3 +219,30 @@ Added to `JOURNALS-PATCH-CRAWL-2022-01`
 Unless it is a 404, should retry.
 
 TODO: generate this list
+
+## Non-OA DOI Bulk Ingest
+
+Had previously run:
+
+    zcat ingest_nonoa_doi.json.gz \
+        | rg -v "doi.org/10.2139/" \
+        | rg -v "doi.org/10.1021/" \
+        | rg -v "doi.org/10.1121/" \
+        | rg -v "doi.org/10.1515/" \
+        | rg -v "doi.org/10.1093/" \
+        | rg -v "europepmc.org" \
+        | pv -l \
+        | gzip \
+        > nonoa_doi.filtered.ingests.json.gz
+    # 7.35M 0:01:13 [99.8k/s]
+
+Starting a bulk ingest of these on 2022-03-18, which is *before* the crawl has
+entirely finished, but after almost all queues (domains) have been done for
+several days.
+
+    zcat nonoa_doi.filtered.ingests.json.gz \
+        | rg -v "\\\\" \
+        | jq . -c \
+        | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
+
+Looks like many jstage `no-capture` status; these are still (slowly) crawling.
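
The exclusion step in the patch above (a chain of `rg -v` calls) could also be written as one reusable shell function. This is only a minimal sketch, not what was actually run: the function name `filter_ingest_requests` is hypothetical, and it swaps `rg` for `grep -F`, which matches the prefixes as fixed strings, so a `.` matches only a literal dot. It assumes one JSON ingest request per line on stdin.

```shell
# Hypothetical helper (not part of sandcrawler): drop every input line that
# mentions one of the excluded DOI prefixes or the europepmc.org domain.
# Input is assumed to be one JSON ingest request per line on stdin.
filter_ingest_requests() {
    # -v inverts the match; -F treats each -e pattern as a fixed string.
    grep -v -F \
        -e "doi.org/10.2139/" \
        -e "doi.org/10.1021/" \
        -e "doi.org/10.1121/" \
        -e "doi.org/10.1515/" \
        -e "doi.org/10.1093/" \
        -e "europepmc.org"
}
```

Under those assumptions it would slot into the original pipeline as `zcat ingest_nonoa_doi.json.gz | filter_ingest_requests | pv -l | gzip > nonoa_doi.filtered.ingests.json.gz`. Note that `grep -v` exits non-zero when every line is filtered out, which matters under `set -e` or `pipefail`.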