author | Bryan Newbold <bnewbold@archive.org> | 2020-10-11 21:43:03 -0700 |
---|---|---|
committer | Bryan Newbold <bnewbold@archive.org> | 2020-10-11 21:43:03 -0700 |
commit | 3315e52d32701492c81758e2d297dbb501e17bc9 (patch) | |
tree | 3f91d5a414523c4b584ad8a92b956ae45dde11fa /notes/ingest | |
parent | ca75f7295c3f5383534b25069ec1e64e4064cef6 (diff) | |
update unpaywall 2020-04 notes
Diffstat (limited to 'notes/ingest')
-rw-r--r-- | notes/ingest/2020-04_unpaywall.md | 32 |
1 files changed, 32 insertions, 0 deletions
diff --git a/notes/ingest/2020-04_unpaywall.md b/notes/ingest/2020-04_unpaywall.md
index 87600fd..a5e3bb1 100644
--- a/notes/ingest/2020-04_unpaywall.md
+++ b/notes/ingest/2020-04_unpaywall.md
@@ -277,4 +277,36 @@ Enqueue internal failures for re-ingest:
             ingest_file_result.status = 'wayback-error'
         )
     ) TO '/grande/snapshots/unpaywall_errors_2020-08-28.rows.json';
 
+    => 409606
+    ./scripts/ingestrequest_row2json.py /grande/snapshots/unpaywall_errors_2020-08-28.rows.json | pv -l | shuf > /grande/snapshots/unpaywall_errors_2020-08-28.requests.json
+
+    cat /grande/snapshots/unpaywall_errors_2020-08-28.requests.json | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
+
+And after *that* (which ran quickly):
+
+                   status                |  count
+    -------------------------------------+----------
+     success                             | 22281874
+     no-pdf-link                         |  2258352
+     redirect-loop                       |  1499251
+     terminal-bad-status                 |  1004781
+     no-capture                          |   401333
+     wrong-mimetype                      |   112068
+     cdx-error                           |    32259
+     link-loop                           |    30137
+     null-body                           |    13886
+     wayback-error                       |    11653
+     gateway-timeout                     |     3689
+     spn2-cdx-lookup-failure             |     1229
+     petabox-error                       |     1036
+     redirects-exceeded                  |      749
+     invalid-host-resolution             |      464
+     spn2-error                          |      107
+     spn2-error:job-failed               |       91
+     bad-redirect                        |       26
+     spn2-error:soft-time-limit-exceeded |        9
+     bad-gzip-encoding                   |        5
+    (20 rows)
+
+22063013 -> 22281874 = + 218,861 success, not bad!
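The `rg -v "\\\\" | jq . -c` stage in the pipeline above drops any request line containing a literal backslash and re-emits the rest as compact one-line JSON before handing them to `kafkacat`. A rough Python sketch of that filtering step follows, for illustration only: `filter_requests` is a hypothetical helper, not part of sandcrawler, and the sample records and their field names are made up. The last line just rechecks the success delta quoted in the notes.

```python
import json
from typing import Iterable, Iterator


def filter_requests(lines: Iterable[str]) -> Iterator[str]:
    """Approximate `rg -v "\\\\" | jq . -c` (hypothetical helper, not sandcrawler code)."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # rg -v "\\\\" discards every line that contains a literal backslash
        if "\\" in line:
            continue
        # Like jq, json.loads raises on a malformed line instead of skipping it
        record = json.loads(line)
        # Compact separators mimic the one-object-per-line output of `jq -c`
        yield json.dumps(record, separators=(",", ":"), ensure_ascii=False)


if __name__ == "__main__":
    # Made-up sample rows; the field names are assumptions for illustration
    sample = [
        '{"base_url": "https://example.com/a.pdf", "ingest_type": "pdf"}',
        '{"base_url": "https://example.com/b\\\\c.pdf", "ingest_type": "pdf"}',
    ]
    for out in filter_requests(sample):
        print(out)

    # Success delta between the two status tables, as stated in the notes
    print(22281874 - 22063013)
```

Dropping whole lines on any backslash is a blunt filter: it also discards requests whose JSON merely contains an escaped character, which is presumably an accepted trade-off for a bulk re-ingest.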