author Bryan Newbold <bnewbold@archive.org> 2020-10-11 21:43:03 -0700
committer Bryan Newbold <bnewbold@archive.org> 2020-10-11 21:43:03 -0700
commit 3315e52d32701492c81758e2d297dbb501e17bc9
tree 3f91d5a414523c4b584ad8a92b956ae45dde11fa /notes/ingest
parent ca75f7295c3f5383534b25069ec1e64e4064cef6
update unpaywall 2020-04 notes
Diffstat (limited to 'notes/ingest')
-rw-r--r-- notes/ingest/2020-04_unpaywall.md 32
1 files changed, 32 insertions, 0 deletions
diff --git a/notes/ingest/2020-04_unpaywall.md b/notes/ingest/2020-04_unpaywall.md
index 87600fd..a5e3bb1 100644
--- a/notes/ingest/2020-04_unpaywall.md
+++ b/notes/ingest/2020-04_unpaywall.md
@@ -277,4 +277,36 @@ Enqueue internal failures for re-ingest:
ingest_file_result.status = 'wayback-error'
)
) TO '/grande/snapshots/unpaywall_errors_2020-08-28.rows.json';
+ => 409606
+ ./scripts/ingestrequest_row2json.py /grande/snapshots/unpaywall_errors_2020-08-28.rows.json | pv -l | shuf > /grande/snapshots/unpaywall_errors_2020-08-28.requests.json
+
+ cat /grande/snapshots/unpaywall_errors_2020-08-28.requests.json | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
+
+And after *that* (which ran quickly):
+
+ status | count
+ -------------------------------------+----------
+ success | 22281874
+ no-pdf-link | 2258352
+ redirect-loop | 1499251
+ terminal-bad-status | 1004781
+ no-capture | 401333
+ wrong-mimetype | 112068
+ cdx-error | 32259
+ link-loop | 30137
+ null-body | 13886
+ wayback-error | 11653
+ gateway-timeout | 3689
+ spn2-cdx-lookup-failure | 1229
+ petabox-error | 1036
+ redirects-exceeded | 749
+ invalid-host-resolution | 464
+ spn2-error | 107
+ spn2-error:job-failed | 91
+ bad-redirect | 26
+ spn2-error:soft-time-limit-exceeded | 9
+ bad-gzip-encoding | 5
+ (20 rows)
+
+22063013 -> 22281874 = + 218,861 success, not bad!
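
A quick way to double-check the arithmetic in that last line is to tally the status table directly. This is a minimal sketch, not part of sandcrawler: `parse_status_table`, `PREVIOUS_SUCCESS`, and `REQUEUED` are hypothetical names, and the counts are copied by hand from the table and the `=> 409606` row count above. The final ratio assumes nothing else moved the success count between the two snapshots, which the notes do not confirm.

```python
# Status counts copied by hand from the post-re-ingest table in the notes.
STATUS_TABLE = """
success                             | 22281874
no-pdf-link                         |  2258352
redirect-loop                       |  1499251
terminal-bad-status                 |  1004781
no-capture                          |   401333
wrong-mimetype                      |   112068
cdx-error                           |    32259
link-loop                           |    30137
null-body                           |    13886
wayback-error                       |    11653
gateway-timeout                     |     3689
spn2-cdx-lookup-failure             |     1229
petabox-error                       |     1036
redirects-exceeded                  |      749
invalid-host-resolution             |      464
spn2-error                          |      107
spn2-error:job-failed               |       91
bad-redirect                        |       26
spn2-error:soft-time-limit-exceeded |        9
bad-gzip-encoding                   |        5
"""

# Success count before the re-ingest (left-hand side of the note's last line).
PREVIOUS_SUCCESS = 22063013
# Rows exported for re-ingest (the "=> 409606" line in the notes).
REQUEUED = 409606


def parse_status_table(text: str) -> dict[str, int]:
    """Parse psql-style 'status | count' rows into a dict (hypothetical helper)."""
    counts = {}
    for line in text.strip().splitlines():
        status, _, count = line.partition("|")
        counts[status.strip()] = int(count.strip())
    return counts


counts = parse_status_table(STATUS_TABLE)
total = sum(counts.values())
gained = counts["success"] - PREVIOUS_SUCCESS

print(f"statuses: {len(counts)}, total rows: {total}")
print(f"success share of all rows: {counts['success'] / total:.1%}")
print(f"new successes: {gained:,}")
# Assumption: every new success came from the re-enqueued batch.
print(f"share of re-enqueued requests now successful: {gained / REQUEUED:.1%}")
```

The `gained` value matches the +218,861 reported in the notes.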