author | Bryan Newbold <bnewbold@archive.org> | 2022-03-22 16:03:46 -0700 |
---|---|---|
committer | Bryan Newbold <bnewbold@archive.org> | 2022-03-22 16:03:46 -0700 |
commit | d3638a9fd9ed11fb4484038852f8e02b2f5a7b41 (patch) | |
tree | b4b8a9856eca7694d048f4f3e8086f8c3539682d /notes/ingest/2022-03_doaj.md | |
parent | fd6dc7f36aecb6a303513476825cfe681500f02d (diff) | |
various ingest/task notes
Diffstat (limited to 'notes/ingest/2022-03_doaj.md')
-rw-r--r-- | notes/ingest/2022-03_doaj.md | 12 |
1 files changed, 12 insertions, 0 deletions
diff --git a/notes/ingest/2022-03_doaj.md b/notes/ingest/2022-03_doaj.md
index bace480..9722459 100644
--- a/notes/ingest/2022-03_doaj.md
+++ b/notes/ingest/2022-03_doaj.md
@@ -264,3 +264,15 @@ Create seedlist:
 
 Send off an added to `TARGETED-ARTICLE-CRAWL-2022-03` heritrix crawl, will
 re-ingest when that completes (a week or two?).
+
+
+## Bulk Ingest
+
+After `TARGETED-ARTICLE-CRAWL-2022-03` wrap-up.
+
+    # 2022-03-22
+    cat /srv/sandcrawler/tasks/doaj_seedlist_2022-03-10.requests.json \
+        | rg -v "\\\\" \
+        | jq . -c \
+        | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
+
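
The `rg -v "\\\\"` step in the added pipeline drops every request line that contains a literal backslash before anything reaches Kafka. A minimal dry-run sketch of that filter follows, useful for checking how many requests would be dropped before producing to the topic. It is not part of the commit: the function name `count_requests` is hypothetical, `grep -F` stands in for `rg`, and the input is assumed to hold one JSON request per line.

```shell
#!/bin/sh
# Hypothetical dry-run helper (not from the commit): report how many request
# lines the backslash filter would keep and drop, without touching Kafka.
# Assumption: the requests file holds one JSON object per line.
count_requests() {
    file="$1"
    # An empty pattern matches every line, so -c yields the total line count.
    total=$(grep -c '' "$file" || true)
    # -F treats the pattern as a fixed string: a single literal backslash.
    dropped=$(grep -cF '\' "$file" || true)
    echo "total=$total dropped=$dropped kept=$((total - dropped))"
}
```

Calling `count_requests /srv/sandcrawler/tasks/doaj_seedlist_2022-03-10.requests.json` would print the three counts on one line; the `kept` figure is the number of lines the pipeline would pass on to `jq` and `kafkacat`.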