path: root/notes/ingest/2022-01-13_doi_crawl.md
Diffstat (limited to 'notes/ingest/2022-01-13_doi_crawl.md')
-rw-r--r-- notes/ingest/2022-01-13_doi_crawl.md | 29
1 files changed, 28 insertions, 1 deletions
diff --git a/notes/ingest/2022-01-13_doi_crawl.md b/notes/ingest/2022-01-13_doi_crawl.md
index 09a3b46..a6f08dd 100644
--- a/notes/ingest/2022-01-13_doi_crawl.md
+++ b/notes/ingest/2022-01-13_doi_crawl.md
@@ -137,7 +137,7 @@ many of these are likely to crawl successfully.
> /srv/fatcat/tasks/ingest_nonoa_doi.json.gz
# re-running 2022-02-08 after this VM was upgraded
# Expecting 8321448 release objects in search queries
- # TODO: in-progress
+ # DONE
This is large enough that it will probably be a bulk ingest, and then probably
a follow-up crawl.
@@ -219,3 +219,30 @@ Added to `JOURNALS-PATCH-CRAWL-2022-01`
Unless it is a 404, should retry.
TODO: generate this list
+
+## Non-OA DOI Bulk Ingest
+
+Had previously run:
+
+ zcat ingest_nonoa_doi.json.gz \
+ | rg -v "doi.org/10.2139/" \
+ | rg -v "doi.org/10.1021/" \
+ | rg -v "doi.org/10.1121/" \
+ | rg -v "doi.org/10.1515/" \
+ | rg -v "doi.org/10.1093/" \
+ | rg -v "europepmc.org" \
+ | pv -l \
+ | gzip \
+ > nonoa_doi.filtered.ingests.json.gz
+ # 7.35M 0:01:13 [99.8k/s]
+
+Starting a bulk ingest of these on 2022-03-18, which is *before* the crawl has
+entirely finished, but after almost all queues (domains) have been done for
+several days.
+
+ zcat nonoa_doi.filtered.ingests.json.gz \
+ | rg -v "\\\\" \
+ | jq . -c \
+ | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
+
+Looks like many jstage requests have `no-capture` status; these are still (slowly) crawling.