author: Bryan Newbold <bnewbold@archive.org> 2020-05-28 14:28:08 -0700
committer: Bryan Newbold <bnewbold@archive.org> 2020-05-28 14:28:08 -0700
commit: b839dcb734805397b8bf611eb77942b9555f4915
tree: 1f2046d15216c65e6c3949ef25804eb9297c395e
parent: 46c422e4b6d8e6a36ea65af19afd124ab42e457c
ingest: OAI-PMH count table
-rw-r--r-- notes/ingest/2020-05_oai_pmh.md | 24
1 files changed, 24 insertions, 0 deletions
diff --git a/notes/ingest/2020-05_oai_pmh.md b/notes/ingest/2020-05_oai_pmh.md
index 37e7dfc..2f20415 100644
--- a/notes/ingest/2020-05_oai_pmh.md
+++ b/notes/ingest/2020-05_oai_pmh.md
@@ -142,6 +142,30 @@ but doesn't matter because fatcat wasn't importing these anyways):
ORDER BY COUNT DESC
LIMIT 20;
+ status | count
+ -------------------------+----------
+ no-capture | 42565875
+ success | 5227609
+ no-pdf-link | 2156341
+ redirect-loop | 559721
+ cdx-error | 260446
+ wrong-mimetype | 148871
+ terminal-bad-status | 109725
+ link-loop | 92792
+ null-body | 30688
+ | 15287
+ petabox-error | 11109
+ wayback-error | 6261
+ skip-url-blocklist | 184
+ gateway-timeout | 86
+ bad-gzip-encoding | 25
+ invalid-host-resolution | 24
+ spn2-cdx-lookup-failure | 22
+ bad-redirect | 15
+ spn2-error | 4
+ spn2-error:job-failed | 2
+ (20 rows)
+
Dump again for crawling:
COPY (