author | Bryan Newbold <bnewbold@archive.org> | 2022-03-22 16:03:46 -0700 |
---|---|---|
committer | Bryan Newbold <bnewbold@archive.org> | 2022-03-22 16:03:46 -0700 |
commit | d3638a9fd9ed11fb4484038852f8e02b2f5a7b41 | |
tree | b4b8a9856eca7694d048f4f3e8086f8c3539682d /notes/ingest/2022-01-13_doi_crawl.md | |
parent | fd6dc7f36aecb6a303513476825cfe681500f02d | |
various ingest/task notes
Diffstat (limited to 'notes/ingest/2022-01-13_doi_crawl.md')
-rw-r--r-- | notes/ingest/2022-01-13_doi_crawl.md | 29 |
1 files changed, 28 insertions, 1 deletions
diff --git a/notes/ingest/2022-01-13_doi_crawl.md b/notes/ingest/2022-01-13_doi_crawl.md
index 09a3b46..a6f08dd 100644
--- a/notes/ingest/2022-01-13_doi_crawl.md
+++ b/notes/ingest/2022-01-13_doi_crawl.md
@@ -137,7 +137,7 @@ many of these are likely to crawl successfully.
         > /srv/fatcat/tasks/ingest_nonoa_doi.json.gz
     # re-running 2022-02-08 after this VM was upgraded
     # Expecting 8321448 release objects in search queries
-    # TODO: in-progress
+    # DONE
 
 This is large enough that it will probably be a bulk ingest, and then probably
 a follow-up crawl.
@@ -219,3 +219,30 @@ Added to `JOURNALS-PATCH-CRAWL-2022-01`
 Unless it is a 404, should retry.
 
 TODO: generate this list
+
+## Non-OA DOI Bulk Ingest
+
+Had previously run:
+
+    cat ingest_nonoa_doi.json.gz \
+        | rg -v "doi.org/10.2139/" \
+        | rg -v "doi.org/10.1021/" \
+        | rg -v "doi.org/10.1121/" \
+        | rg -v "doi.org/10.1515/" \
+        | rg -v "doi.org/10.1093/" \
+        | rg -v "europepmc.org" \
+        | pv -l \
+        | gzip \
+        > nonoa_doi.filtered.ingests.json.gz
+    # 7.35M 0:01:13 [99.8k/s]
+
+Starting a bulk ingest of these on 2022-03-18, which is *before* the crawl has
+entirely finished, but after almost all queues (domains) have been done for
+several days.
+
+    zcat nonoa_doi.filtered.ingests.json.gz \
+        | rg -v "\\\\" \
+        | jq . -c \
+        | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
+
+Looks like many jstage `no-capture` status; these are still (slowly) crawling.
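The chain of `rg -v` filters in the added notes could probably be collapsed into a single pass. The following is only a sketch, not what the commit ran: the function name `filter_ingest_requests` is hypothetical, and it swaps in plain `grep -F` (literal matching) for `rg`, so a `.` in a pattern matches only a literal dot rather than any character.

```shell
#!/bin/sh
# Hypothetical helper (not from the notes): drop ingest request lines that
# mention any of the excluded DOI prefixes or hosts.
# Reads JSON lines on stdin and writes the surviving lines to stdout.
filter_ingest_requests() {
    # -v inverts the match; -F treats each -e pattern as a literal string
    grep -v -F \
        -e "doi.org/10.2139/" \
        -e "doi.org/10.1021/" \
        -e "doi.org/10.1121/" \
        -e "doi.org/10.1515/" \
        -e "doi.org/10.1093/" \
        -e "europepmc.org"
}
```

It would sit in the pipeline where the six `rg -v` stages are now, between the step that reads the request file and `pv -l`. One process instead of six mainly makes the exclusion list easier to read and extend.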