author: Bryan Newbold <bnewbold@archive.org> 2020-09-02 16:10:13 -0700
committer: Bryan Newbold <bnewbold@archive.org> 2020-09-02 16:10:13 -0700
commit: 8cc3cebd2392d16026214f5e92b99a322ef2e044
tree: e7252254193fb06d1f686bbdd4c18014a9f0ad15 /notes/tasks
parent: 48e72c4a49b2a8e057d74fa5f9cbf5c7d145289c
follow-up notes on processing 'holes'
Diffstat (limited to 'notes/tasks')
-rw-r--r-- notes/tasks/2020-07-22_processing_holes.md | 19
1 files changed, 19 insertions, 0 deletions
diff --git a/notes/tasks/2020-07-22_processing_holes.md b/notes/tasks/2020-07-22_processing_holes.md
index 363989a..70e2b59 100644
--- a/notes/tasks/2020-07-22_processing_holes.md
+++ b/notes/tasks/2020-07-22_processing_holes.md
@@ -18,6 +18,11 @@ Full batch:
cat dump_unextracted_pdf_petabox.2020-07-22.json | rg -v "\\\\" | jq . -c | pv -l | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.unextracted -p -1
+Re-ran on 2020-08-19:
+
+ wc -l dump_unextracted_pdf_petabox.2020-08-19.json
+ 971194 dump_unextracted_pdf_petabox.2020-08-19.json
+
## `pdf_meta` missing CDX rows
First, the GROBID-ized rows but only if has a fatcat file as well.
@@ -26,6 +31,13 @@ First, the GROBID-ized rows but only if has a fatcat file as well.
cat dump_unextracted_pdf.fatcat.2020-07-22.json | rg -v "\\\\" | jq . -c | pv -l | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.unextracted -p -1
+Re-ran on 2020-08-19:
+
+ wc -l dump_unextracted_pdf.fatcat.2020-08-19.json
+ 65517 dump_unextracted_pdf.fatcat.2020-08-19.json
+
+Enqueued!
+
## `GROBID` missing petabox rows
wc -l /grande/snapshots/dump_ungrobided_pdf_petabox.2020-07-22.json
@@ -39,6 +51,13 @@ Full batch:
cat dump_ungrobided_pdf_petabox.2020-07-22.json | rg -v "\\\\" | jq . -c | pv -l | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ungrobided-pg -p -1
+Re-ran on 2020-08-19:
+
+ wc -l dump_ungrobided_pdf_petabox.2020-08-19.json
+ 933 dump_ungrobided_pdf_petabox.2020-08-19.json
+
+Enqueued!
+
## `GROBID` for missing CDX rows in fatcat
wc -l dump_ungrobided_pdf.fatcat.2020-07-22.json