New snapshot released 2021-07-02. Should be "boring" ingest and crawl.

## Transform and Load

    # in sandcrawler pipenv on sandcrawler1-vm (svc506)
    zcat /srv/sandcrawler/tasks/unpaywall_snapshot_2021-07-02T151134.jsonl.gz | ./scripts/unpaywall2ingestrequest.py - | pv -l > /srv/sandcrawler/tasks/unpaywall_snapshot_2021-07-02.ingest_request.json
    => 32.2M 3:01:52 [2.95k/s]

    cat /srv/sandcrawler/tasks/unpaywall_snapshot_2021-07-02.ingest_request.json | pv -l | ./persist_tool.py ingest-request -
    => Worker: Counter({'total': 32196260, 'insert-requests': 3325954, 'update-requests': 0})
    => JSON lines pushed: Counter({'total': 32196260, 'pushed': 32196260})
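The transform step maps each unpaywall snapshot record (one JSON object per DOI, with a list of OA locations) into one ingest request per URL. A minimal, hypothetical sketch of that mapping, assuming the public unpaywall field names (`doi`, `oa_locations`, `url_for_pdf`, `version`) and only the request fields that show up in the SQL queries later in these notes; the real `./scripts/unpaywall2ingestrequest.py` handles more fields and edge cases:

    import json
    import sys

    # Hypothetical sketch of the snapshot -> ingest-request mapping; the real
    # unpaywall2ingestrequest.py differs in detail.
    STAGE_MAP = {
        "publishedVersion": "published",
        "acceptedVersion": "accepted",
        "submittedVersion": "submitted",
    }

    for line in sys.stdin:
        record = json.loads(line)
        for loc in record.get("oa_locations") or []:
            url = loc.get("url_for_pdf") or loc.get("url")
            if not url:
                continue
            request = {
                "base_url": url,
                "ingest_type": "pdf",
                "link_source": "unpaywall",
                "link_source_id": record.get("doi"),              # assumed field name
                "release_stage": STAGE_MAP.get(loc.get("version")),
            }
            print(json.dumps(request, sort_keys=True))

Most of the ~32.2M generated request lines already existed from previous snapshots; only the 3,325,954 `insert-requests` counted by `persist_tool.py` are new URLs.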
## Dump new URLs, Transform, Bulk Ingest

    COPY (
        SELECT row_to_json(ingest_request.*)
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_request.link_source = 'unpaywall'
            -- AND date(ingest_request.created) > '2021-01-01'
            AND (ingest_file_result.status IS NULL
                OR ingest_file_result.status = 'no-capture')
    ) TO '/srv/sandcrawler/tasks/unpaywall_noingest_2021-07-02.rows.json';
    => COPY 3556146

    # previous, 2020-10 run: COPY 4216339
    # previous, 2021-07 run: COPY 3277484

Oops, should have run this instead, with the date filter:

    COPY (
        SELECT row_to_json(ingest_request.*)
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_request.link_source = 'unpaywall'
            AND date(ingest_request.created) > '2021-07-01'
            AND (ingest_file_result.status IS NULL
                OR ingest_file_result.status = 'no-capture')
    ) TO '/srv/sandcrawler/tasks/unpaywall_noingest_2021-07-02.rows.json';

But didn't, so all 3,556,146 rows were processed instead of just those from this snapshot.

Transform and shuffle:

    ./scripts/ingestrequest_row2json.py /srv/sandcrawler/tasks/unpaywall_noingest_2021-07-02.rows.json | pv -l | shuf > /srv/sandcrawler/tasks/unpaywall_noingest_2021-07-02.ingest_request.json
    => 3.56M 0:01:59 [29.8k/s]

Enqueue the whole batch:

    cat /srv/sandcrawler/tasks/unpaywall_noingest_2021-07-02.ingest_request.json | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
    => done, on 2021-07-13

## Check Pre-Crawl Status

Only the recent bulk ingest:

    SELECT ingest_file_result.status, COUNT(*)
    FROM ingest_request
    LEFT JOIN ingest_file_result
        ON ingest_file_result.ingest_type = ingest_request.ingest_type
        AND ingest_file_result.base_url = ingest_request.base_url
    WHERE
        ingest_request.ingest_type = 'pdf'
        AND ingest_request.link_source = 'unpaywall'
        AND date(ingest_request.created) > '2021-07-01'
    GROUP BY status
    ORDER BY COUNT DESC
    LIMIT 20;

              status           |  count
     --------------------------+---------
      no-capture               | 1831827
      success                  | 1343604
      redirect-loop            |  103999
      terminal-bad-status      |   19845
      no-pdf-link              |   17448
      link-loop                |    5027
      wrong-mimetype           |    2270
      cdx-error                |     523
      body-too-large           |     321
      null-body                |     298
      wayback-content-error    |     242
      petabox-error            |     155
      gateway-timeout          |     138
      invalid-host-resolution  |     120
      wayback-error            |     109
      blocked-cookie           |       9
      timeout                  |       7
                               |       3
      bad-redirect             |       3
      spn2-cdx-lookup-failure  |       3
     (20 rows)

## Dump Seedlist

Dump rows:

    COPY (
        SELECT row_to_json(t1.*)
        FROM (
            SELECT ingest_request.*, ingest_file_result as result
            FROM ingest_request
            LEFT JOIN ingest_file_result
                ON ingest_file_result.ingest_type = ingest_request.ingest_type
                AND ingest_file_result.base_url = ingest_request.base_url
            WHERE
                ingest_request.ingest_type = 'pdf'
                AND date(ingest_request.created) > '2021-07-01'
                AND ingest_request.link_source = 'unpaywall'
                AND (ingest_file_result.status = 'no-capture'
                    OR ingest_file_result.status = 'cdx-error'
                    OR ingest_file_result.status = 'wayback-error'
                    OR ingest_file_result.status = 'gateway-timeout'
                    OR ingest_file_result.status = 'spn2-cdx-lookup-failure'
                )
                AND ingest_request.base_url NOT LIKE '%journals.sagepub.com%'
                AND ingest_request.base_url NOT LIKE '%pubs.acs.org%'
                AND ingest_request.base_url NOT LIKE '%ahajournals.org%'
                AND ingest_request.base_url NOT LIKE '%www.journal.csj.jp%'
                AND ingest_request.base_url NOT LIKE '%aip.scitation.org%'
                AND ingest_request.base_url NOT LIKE '%academic.oup.com%'
                AND ingest_request.base_url NOT LIKE '%tandfonline.com%'
                AND ingest_request.base_url NOT LIKE '%.archive.org%'
                AND ingest_request.base_url NOT LIKE '%://archive.org%'
                AND ingest_file_result.terminal_url NOT LIKE '%journals.sagepub.com%'
                AND ingest_file_result.terminal_url NOT LIKE '%pubs.acs.org%'
                AND ingest_file_result.terminal_url NOT LIKE '%ahajournals.org%'
                AND ingest_file_result.terminal_url NOT LIKE '%www.journal.csj.jp%'
                AND ingest_file_result.terminal_url NOT LIKE '%aip.scitation.org%'
                AND ingest_file_result.terminal_url NOT LIKE '%academic.oup.com%'
                AND ingest_file_result.terminal_url NOT LIKE '%tandfonline.com%'
                AND ingest_file_result.terminal_url NOT LIKE '%.archive.org%'
                AND ingest_file_result.terminal_url NOT LIKE '%://archive.org%'
        ) t1
    ) TO '/srv/sandcrawler/tasks/unpaywall_seedlist_2021-07-02.rows.json';
    => COPY 1743186

Prep ingest requests (for post-crawl use):

    ./scripts/ingestrequest_row2json.py /srv/sandcrawler/tasks/unpaywall_seedlist_2021-07-02.rows.json | pv -l > /srv/sandcrawler/tasks/unpaywall_crawl_ingest_2021-07-02.json
    => 1.74M 0:01:33 [18.6k/s]

And actually dump seedlist(s):

    cat /srv/sandcrawler/tasks/unpaywall_seedlist_2021-07-02.rows.json | jq -r .base_url | sort -u -S 4G > /srv/sandcrawler/tasks/unpaywall_seedlist_2021-07-02.url.txt
    cat /srv/sandcrawler/tasks/unpaywall_seedlist_2021-07-02.rows.json | rg '"no-capture"' | jq -r .result.terminal_url | rg -v ^null$ | sort -u -S 4G > /srv/sandcrawler/tasks/unpaywall_seedlist_2021-07-02.terminal_url.txt
    cat /srv/sandcrawler/tasks/unpaywall_seedlist_2021-07-02.rows.json | rg -v '"no-capture"' | jq -r .base_url | sort -u -S 4G > /srv/sandcrawler/tasks/unpaywall_seedlist_2021-07-02.no_terminal_url.txt

    wc -l /srv/sandcrawler/tasks/unpaywall_seedlist_2021-07-02.*.txt
          1 /srv/sandcrawler/tasks/unpaywall_seedlist_2021-07-02.no_terminal_url.txt
    1643963 /srv/sandcrawler/tasks/unpaywall_seedlist_2021-07-02.terminal_url.txt
    1644028 /srv/sandcrawler/tasks/unpaywall_seedlist_2021-07-02.url.txt
    3287992 total
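The seed selection logic above, restated as a small Python sketch (an approximation of the `rg`/`jq` pipelines, reading the `.rows.json` dump on stdin):

    import json
    import sys

    # Approximate restatement of the seedlist pipelines above: every row
    # contributes its base_url to the main seed list; rows that previously
    # failed with "no-capture" also contribute the terminal_url the earlier
    # attempt ended at; rows with any other status go in the fallback list.
    all_urls, terminal_urls, fallback_urls = set(), set(), set()

    for line in sys.stdin:
        row = json.loads(line)
        result = row.get("result") or {}
        all_urls.add(row["base_url"])
        if result.get("status") == "no-capture":
            if result.get("terminal_url"):
                terminal_urls.add(result["terminal_url"])
        else:
            fallback_urls.add(row["base_url"])

    print(f"{len(all_urls)} base URLs; {len(terminal_urls)} terminal URLs; "
          f"{len(fallback_urls)} with no terminal URL", file=sys.stderr)

Seeding the crawler with the terminal URL rather than the original `base_url` presumably avoids re-walking redirect chains that previous ingest attempts already resolved.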
Then run crawl (see `journal-crawls` git repo).

## Post-Crawl Bulk Ingest

    cat /srv/sandcrawler/tasks/unpaywall_crawl_ingest_2021-07-02.json | rg -v "\\\\" | jq . -c | pv -l | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
    => 1.74M 0:01:59 [14.6k/s]

## Post-Ingest Stats

Only the recent updates:

    SELECT ingest_file_result.status, COUNT(*)
    FROM ingest_request
    LEFT JOIN ingest_file_result
        ON ingest_file_result.ingest_type = ingest_request.ingest_type
        AND ingest_file_result.base_url = ingest_request.base_url
    WHERE
        ingest_request.ingest_type = 'pdf'
        AND ingest_request.link_source = 'unpaywall'
        AND date(ingest_request.created) > '2021-07-01'
    GROUP BY status
    ORDER BY COUNT DESC
    LIMIT 20;

              status           |  count
     --------------------------+---------
      success                  | 2690258
      redirect-loop            |  227328
      no-capture               |  157368
      terminal-bad-status      |  118943
      no-pdf-link              |   92698
      blocked-cookie           |   19478
      link-loop                |    9249
      wrong-mimetype           |    4918
      cdx-error                |    1786
      wayback-error            |    1497
      null-body                |    1302
      body-too-large           |     433
      wayback-content-error    |     245
      petabox-error            |     171
      gateway-timeout          |     138
      invalid-host-resolution  |     120
      timeout                  |      12
      bad-redirect             |       4
                               |       3
      spn2-cdx-lookup-failure  |       1
     (20 rows)

Only the recent updates, by publication stage:

    SELECT ingest_request.release_stage, ingest_file_result.status, COUNT(*)
    FROM ingest_request
    LEFT JOIN ingest_file_result
        ON ingest_file_result.ingest_type = ingest_request.ingest_type
        AND ingest_file_result.base_url = ingest_request.base_url
    WHERE
        ingest_request.ingest_type = 'pdf'
        AND ingest_request.link_source = 'unpaywall'
        AND date(ingest_request.created) > '2021-07-01'
    GROUP BY release_stage, status
    ORDER BY release_stage, COUNT DESC
    LIMIT 100;

      release_stage |         status           |  count
     ---------------+--------------------------+---------
      accepted      | success                  |  103144
      accepted      | no-pdf-link              |   53981
      accepted      | terminal-bad-status      |    4102
      accepted      | link-loop                |    2799
      accepted      | no-capture               |    2315
      accepted      | redirect-loop            |    2171
      accepted      | blocked-cookie           |     234
      accepted      | cdx-error                |     140
      accepted      | wayback-error            |     101
      accepted      | wrong-mimetype           |      38
      accepted      | null-body                |      10
      accepted      | petabox-error            |       5
      accepted      | wayback-content-error    |       4
      accepted      | gateway-timeout          |       2
      accepted      | body-too-large           |       2
      published     | success                  | 1919100
      published     | no-capture               |  130104
      published     | redirect-loop            |  127482
      published     | terminal-bad-status      |   43118
      published     | no-pdf-link              |   33505
      published     | blocked-cookie           |   19034
      published     | link-loop                |    6241
      published     | wrong-mimetype           |    4163
      published     | null-body                |    1195
      published     | cdx-error                |    1151
      published     | wayback-error            |    1105
      published     | wayback-content-error    |     197
      published     | body-too-large           |     195
      published     | petabox-error            |     118
      published     | gateway-timeout          |      35
      published     | invalid-host-resolution  |      13
      published     | timeout                  |       8
      published     | bad-redirect             |       2
      published     | spn2-cdx-lookup-failure  |       1
      published     | bad-gzip-encoding        |       1
      submitted     | success                  |  668014
      submitted     | redirect-loop            |   97675
      submitted     | terminal-bad-status      |   71723
      submitted     | no-capture               |   24949
      submitted     | no-pdf-link              |    5212
      submitted     | wrong-mimetype           |     717
      submitted     | cdx-error                |     495
      submitted     | wayback-error            |     291
      submitted     | body-too-large           |     236
      submitted     | blocked-cookie           |     210
      submitted     | link-loop                |     209
      submitted     | invalid-host-resolution  |     107
      submitted     | gateway-timeout          |     101
      submitted     | null-body                |      97
      submitted     | petabox-error            |      48
      submitted     | wayback-content-error    |      44
      submitted     | timeout                  |       4
      submitted     |                          |       3
      submitted     | bad-redirect             |       2
      submitted     | remote-server-error      |       1
     (55 rows)

In total, this iteration of unpaywall ingest resulted in:

- 3,325,954 new raw ingest requests (new URLs total)
- 1,743,186 (52%) of these apparently had not been captured from any prior source, and were queued for heritrix crawling
- 1,346,654 new successes (77% of the crawled URLs) came from the new heritrix crawl
- 2,690,258 (80%) total success, including URLs already captured by earlier crawls for other reasons; the percentage is out of all new URLs, including those never expected to succeed
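These derived figures can be cross-checked from the raw counts in the pre-crawl and post-crawl status tables; a quick sketch:

    # Cross-check of the summary figures above, using counts from the pre-crawl
    # and post-crawl status tables.
    new_requests = 3_325_954      # insert-requests from persist_tool
    crawled = 1_743_186           # seedlist dump (COPY count)
    success_before = 1_343_604    # 'success' before the crawl
    success_after = 2_690_258     # 'success' after the crawl + bulk re-ingest

    print(f"fraction crawled:     {crawled / new_requests:.1%}")                      # ~52.4%
    print(f"new successes:        {success_after - success_before:,}")                # 1,346,654
    print(f"crawl success rate:   {(success_after - success_before) / crawled:.1%}")  # ~77.3%
    print(f"overall success rate: {success_after / new_requests:.1%}")                # ~80.9%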
## Live Ingest Follow-Up

Will run SPN requests on the ~160k `no-capture` URLs:

    COPY (
        SELECT row_to_json(ingest_request.*)
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_request.link_source = 'unpaywall'
            AND date(ingest_request.created) > '2021-07-01'
            AND (ingest_file_result.status IS NULL
                OR ingest_file_result.status = 'no-capture')
    ) TO '/srv/sandcrawler/tasks/unpaywall_noingest_2021-07-30.rows.json';
    => COPY 157371

    ./scripts/ingestrequest_row2json.py /srv/sandcrawler/tasks/unpaywall_noingest_2021-07-30.rows.json | pv -l | shuf > /srv/sandcrawler/tasks/unpaywall_noingest_2021-07-30.ingest_request.json
    => 157k 0:00:04 [31.6k/s]

Enqueue the whole batch:

    cat /srv/sandcrawler/tasks/unpaywall_noingest_2021-07-30.ingest_request.json | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests -p -1
    => DONE
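The kafkacat enqueues used in these notes (the `-bulk` topic earlier, and the live `sandcrawler-prod.ingest-file-requests` topic here for SPN-style requests) can also be driven from Python; a minimal sketch, assuming the `confluent_kafka` client is installed and request JSON lines arrive on stdin:

    import json
    import sys

    from confluent_kafka import Producer

    # Broker and topic taken from the kafkacat invocation above; swap in the
    # sandcrawler-prod.ingest-file-requests-bulk topic for bulk (non-SPN) ingest.
    producer = Producer({"bootstrap.servers": "wbgrp-svc263.us.archive.org"})
    topic = "sandcrawler-prod.ingest-file-requests"

    for line in sys.stdin:
        line = line.strip()
        if not line or "\\" in line:
            # mirror `rg -v "\\\\"`: skip lines with backslash escaping problems
            continue
        # mirror `jq . -c`: parse and re-serialize to validate and compact the JSON
        producer.produce(topic, json.dumps(json.loads(line)).encode("utf-8"))
        producer.poll(0)

    producer.flush()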