This is the first ingest (and crawl) of URLs from DOAJ article-level metadata. It will include at least 'pdf' and 'html' ingest requests, not just 'pdf' as in the past. Working off a 2020-11-13 snapshot.

## Transform and Load

    # in sandcrawler pipenv on aitio
    zcat /schnell/DOAJ-CRAWL-2020-11/doaj_article_data_2020-11-13_all.json.gz | ./scripts/doaj2ingestrequest.py - | pv -l > /schnell/DOAJ-CRAWL-2020-11/doaj_20201113.ingest_request.json
    => 6.7M 0:24:28 [4.57k/s]

    cat /schnell/DOAJ-CRAWL-2020-11/doaj_20201113.ingest_request.json | pv -l | ./persist_tool.py ingest-request -
    => ran into an error with blank `base_url`

Second try, after patches:

    zcat /schnell/DOAJ-CRAWL-2020-11/doaj_article_data_2020-11-13_all.json.gz | ./scripts/doaj2ingestrequest.py - | pv -l > /schnell/DOAJ-CRAWL-2020-11/doaj_20201113.ingest_request.json
    => 6.7M 0:24:29 [4.56k/s]

    cat /schnell/DOAJ-CRAWL-2020-11/doaj_20201113.ingest_request.json | pv -l | ./persist_tool.py ingest-request -
    => Worker: Counter({'total': 6703036, 'insert-requests': 163854, 'update-requests': 0})
    => JSON lines pushed: Counter({'total': 6703036, 'pushed': 6703036})

## Check Pre-Crawl Status

    SELECT ingest_request.ingest_type, ingest_file_result.status, COUNT(*)
    FROM ingest_request
    LEFT JOIN ingest_file_result
        ON ingest_file_result.ingest_type = ingest_request.ingest_type
        AND ingest_file_result.base_url = ingest_request.base_url
    WHERE
        ingest_request.link_source = 'doaj'
    GROUP BY ingest_request.ingest_type, status
    -- next time include ingest_type in sort
    ORDER BY COUNT DESC
    LIMIT 30;

     ingest_type |         status          |  count
    -------------+-------------------------+---------
     pdf         |                         | 3711532
     html        |                         | 2429003
     pdf         | success                 |  454403
     pdf         | redirect-loop           |   48587
     pdf         | no-pdf-link             |   24901
     pdf         | no-capture              |   11569
     xml         |                         |    9442
     pdf         | link-loop               |    8466
     pdf         | terminal-bad-status     |    2015
     pdf         | wrong-mimetype          |    1441
     pdf         | null-body               |    1057
     pdf         | petabox-error           |     299
     pdf         | cdx-error               |     124
     pdf         | gateway-timeout         |     114
     pdf         | wayback-error           |      77
     pdf         | spn2-cdx-lookup-failure |      20
     pdf         | invalid-host-resolution |       4
     pdf         | spn2-error              |       1
    (18 rows)

## Dump new URLs, Transform, Bulk Ingest (PDF and XML only)

Dump:

    COPY (
        SELECT row_to_json(ingest_request.*)
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.base_url = ingest_request.base_url
            AND ingest_file_result.ingest_type = ingest_request.ingest_type
        WHERE
            (ingest_request.ingest_type = 'pdf'
                OR ingest_request.ingest_type = 'xml')
            AND ingest_request.link_source = 'doaj'
            -- AND date(ingest_request.created) > '2020-12-01'
            AND (ingest_file_result.status IS NULL
                OR ingest_file_result.status = 'no-capture')
    ) TO '/grande/snapshots/doaj_noingest_2020-11-19.rows.json';
    => COPY 3732543

Transform:

    ./scripts/ingestrequest_row2json.py /grande/snapshots/doaj_noingest_2020-11-19.rows.json | pv -l | shuf > /grande/snapshots/doaj_noingest_2020-11-19.ingest_request.json
    => 3.73M 0:02:18 [26.9k/s]

There are definitely some non-URL strings in there; we should filter those out earlier in the transform process, and/or add a constraint on the URL column in the database.
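A minimal sketch of what such a filter could look like, reading JSON lines on stdin and passing through only rows with a plausible absolute http(s) `base_url`. The script name and the exact validation rule are assumptions for illustration, not existing sandcrawler code:

    #!/usr/bin/env python3
    # scripts/filter_ingest_request_urls.py (hypothetical)
    # Drop ingest request rows whose base_url is not an absolute http(s) URL.
    import json
    import sys
    from urllib.parse import urlparse

    def valid_url(url):
        try:
            parsed = urlparse(url)
        except ValueError:
            return False
        return parsed.scheme in ("http", "https") and bool(parsed.netloc)

    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        if row.get("base_url") and valid_url(row["base_url"]):
            print(line)

This would slot into the pipeline between the transform and the persist/enqueue steps, e.g. `./scripts/doaj2ingestrequest.py - | ./scripts/filter_ingest_request_urls.py | pv -l > ...`.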
Enqueue the whole batch:

    cat /grande/snapshots/doaj_noingest_2020-11-19.ingest_request.json | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1

Started this batch off at 2020-11-19 18:10 (Pacific time).

Stats after run:

    SELECT ingest_request.ingest_type, ingest_file_result.status, COUNT(*)
    FROM ingest_request
    LEFT JOIN ingest_file_result
        ON ingest_file_result.ingest_type = ingest_request.ingest_type
        AND ingest_file_result.base_url = ingest_request.base_url
    WHERE
        ingest_request.link_source = 'doaj'
    GROUP BY ingest_request.ingest_type, status
    ORDER BY ingest_request.ingest_type, COUNT DESC
    LIMIT 30;

## Dump Seedlist

After the preliminary bulk ingest attempts, dump rows:

    COPY (
        SELECT row_to_json(t1.*)
        FROM (
            SELECT ingest_request.*, ingest_file_result as result
            FROM ingest_request
            LEFT JOIN ingest_file_result
                ON ingest_file_result.base_url = ingest_request.base_url
                AND ingest_file_result.ingest_type = ingest_request.ingest_type
            WHERE
                ingest_request.link_source = 'doaj'
                AND (ingest_request.ingest_type = 'pdf'
                    OR ingest_request.ingest_type = 'xml')
                AND ingest_file_result.status != 'success'
                AND ingest_request.base_url NOT LIKE '%journals.sagepub.com%'
                AND ingest_request.base_url NOT LIKE '%pubs.acs.org%'
                AND ingest_request.base_url NOT LIKE '%ahajournals.org%'
                AND ingest_request.base_url NOT LIKE '%www.journal.csj.jp%'
                AND ingest_request.base_url NOT LIKE '%aip.scitation.org%'
                AND ingest_request.base_url NOT LIKE '%academic.oup.com%'
                AND ingest_request.base_url NOT LIKE '%tandfonline.com%'
                AND ingest_request.base_url NOT LIKE '%://archive.org/%'
                AND ingest_request.base_url NOT LIKE '%://web.archive.org/%'
                AND ingest_request.base_url NOT LIKE '%://www.archive.org/%'
                AND ingest_file_result.terminal_url NOT LIKE '%journals.sagepub.com%'
                AND ingest_file_result.terminal_url NOT LIKE '%pubs.acs.org%'
                AND ingest_file_result.terminal_url NOT LIKE '%ahajournals.org%'
                AND ingest_file_result.terminal_url NOT LIKE '%www.journal.csj.jp%'
                AND ingest_file_result.terminal_url NOT LIKE '%aip.scitation.org%'
                AND ingest_file_result.terminal_url NOT LIKE '%academic.oup.com%'
                AND ingest_file_result.terminal_url NOT LIKE '%tandfonline.com%'
                AND ingest_file_result.terminal_url NOT LIKE '%://archive.org/%'
                AND ingest_file_result.terminal_url NOT LIKE '%://web.archive.org/%'
                AND ingest_file_result.terminal_url NOT LIKE '%://www.archive.org/%'
        ) t1
    ) TO '/grande/snapshots/doaj_seedlist_2020-11-19.rows.json';
    => 1,899,555

TODO: filter for valid URLs (see the constraint sketch at the end of this section)

Prep ingest requests (for post-crawl use):

    ./scripts/ingestrequest_row2json.py /grande/snapshots/doaj_seedlist_2020-11-19.rows.json | pv -l > /grande/snapshots/doaj_crawl_ingest_2020-11-19.json

And actually dump seedlist(s):

    cat /grande/snapshots/doaj_seedlist_2020-11-19.rows.json | jq -r .base_url | sort -u -S 4G > /grande/snapshots/doaj_seedlist_2020-11-19.url.txt
    cat /grande/snapshots/doaj_seedlist_2020-11-19.rows.json | rg '"no-capture"' | jq -r .result.terminal_url | rg -v ^null$ | sort -u -S 4G > /grande/snapshots/doaj_seedlist_2020-11-19.terminal_url.txt
    cat /grande/snapshots/doaj_seedlist_2020-11-19.rows.json | rg -v '"no-capture"' | jq -r .base_url | sort -u -S 4G > /grande/snapshots/doaj_seedlist_2020-11-19.no_terminal_url.txt

    wc -l doaj_seedlist_2020-11-19.*.txt
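On the "filter for valid URLs" TODO (and the URL-column constraint idea from the transform step): a database-side guard would catch bad rows at insert time. This is an untested sketch; the constraint name and predicate are assumptions, and `NOT VALID` skips re-checking rows already in the table (some of which are known-bad):

    ALTER TABLE ingest_request
        ADD CONSTRAINT ingest_request_base_url_valid
        CHECK (base_url LIKE 'http://%' OR base_url LIKE 'https://%')
        NOT VALID;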
## Post-Crawl Ingest

Re-run all the ingest requests from the original batch (pdf, xml, and html), now that DOAJ identifiers are all in fatcat:

    cat /schnell/DOAJ-CRAWL-2020-11/doaj_20201113.ingest_request.json | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
    # started 2020-12-23 15:05 (Pacific)
    # finished around 2020-12-31, after one long/slow partition

Stats again after everything:

    SELECT ingest_request.ingest_type, ingest_file_result.status, COUNT(*)
    FROM ingest_request
    LEFT JOIN ingest_file_result
        ON ingest_file_result.ingest_type = ingest_request.ingest_type
        AND ingest_file_result.base_url = ingest_request.base_url
    WHERE
        ingest_request.link_source = 'doaj'
    GROUP BY ingest_request.ingest_type, status
    ORDER BY ingest_request.ingest_type, COUNT DESC
    LIMIT 50;

     ingest_type |          status          |  count
    -------------+--------------------------+---------
     html        | wrong-scope              | 1089423
     html        | no-capture               |  423917
     html        | redirect-loop            |  212910
     html        | unknown-scope            |  204069
     html        | html-resource-no-capture |  165587
     html        | success                  |  122937
     html        | null-body                |  100296
     html        | wayback-content-error    |   53918
     html        | wrong-mimetype           |   18908
     html        | terminal-bad-status      |   14059
     html        | petabox-error            |   13520
     html        | cdx-error                |    6823
     html        | wayback-error            |     890
     html        |                          |     620
     html        | blocked-cookie           |     543
     html        | blocked-captcha          |     250
     html        | redirects-exceeded       |     135
     html        | too-many-resources       |     111
     html        | max-hops-exceeded        |      84
     html        | bad-redirect             |       3
     pdf         | success                  | 2851324
     pdf         | no-pdf-link              |  529914
     pdf         | redirect-loop            |  349494
     pdf         | no-capture               |  272202
     pdf         | null-body                |  129027
     pdf         | terminal-bad-status      |   91796
     pdf         | link-loop                |   25267
     pdf         | wrong-mimetype           |    6504
     pdf         | wayback-error            |    2968
     pdf         |                          |    2068
     pdf         | wayback-content-error    |    1548
     pdf         | cdx-error                |    1095
     pdf         | petabox-error            |    1024
     pdf         | bad-redirect             |     203
     pdf         | redirects-exceeded       |     135
     pdf         | timeout                  |      20
     pdf         | max-hops-exceeded        |      19
     pdf         | bad-gzip-encoding        |       2
     xml         | success                  |    6897
     xml         | null-body                |    2353
     xml         | wrong-mimetype           |     184
     xml         | no-capture               |       5
     xml         | cdx-error                |       3
    (43 rows)
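For a quicker read on status tables like this, a per-type success rate could be computed directly in Postgres with aggregate `FILTER` clauses; an untested sketch of that variant of the status query:

    SELECT
        ingest_request.ingest_type,
        COUNT(*) AS total,
        COUNT(*) FILTER (WHERE ingest_file_result.status = 'success') AS success,
        ROUND(100.0 * COUNT(*) FILTER (WHERE ingest_file_result.status = 'success') / COUNT(*), 1) AS success_pct
    FROM ingest_request
    LEFT JOIN ingest_file_result
        ON ingest_file_result.ingest_type = ingest_request.ingest_type
        AND ingest_file_result.base_url = ingest_request.base_url
    WHERE
        ingest_request.link_source = 'doaj'
    GROUP BY ingest_request.ingest_type;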
And on the filtered subset that we actually crawled:

    SELECT ingest_request.ingest_type, ingest_file_result.status, COUNT(*)
    FROM ingest_request
    LEFT JOIN ingest_file_result
        ON ingest_file_result.ingest_type = ingest_request.ingest_type
        AND ingest_file_result.base_url = ingest_request.base_url
    WHERE
        ingest_request.link_source = 'doaj'
        AND (ingest_request.ingest_type = 'pdf'
            OR ingest_request.ingest_type = 'xml')
        AND ingest_request.base_url NOT LIKE '%journals.sagepub.com%'
        AND ingest_request.base_url NOT LIKE '%pubs.acs.org%'
        AND ingest_request.base_url NOT LIKE '%ahajournals.org%'
        AND ingest_request.base_url NOT LIKE '%www.journal.csj.jp%'
        AND ingest_request.base_url NOT LIKE '%aip.scitation.org%'
        AND ingest_request.base_url NOT LIKE '%academic.oup.com%'
        AND ingest_request.base_url NOT LIKE '%tandfonline.com%'
        AND ingest_request.base_url NOT LIKE '%://archive.org/%'
        AND ingest_request.base_url NOT LIKE '%://web.archive.org/%'
        AND ingest_request.base_url NOT LIKE '%://www.archive.org/%'
        AND ingest_file_result.terminal_url NOT LIKE '%journals.sagepub.com%'
        AND ingest_file_result.terminal_url NOT LIKE '%pubs.acs.org%'
        AND ingest_file_result.terminal_url NOT LIKE '%ahajournals.org%'
        AND ingest_file_result.terminal_url NOT LIKE '%www.journal.csj.jp%'
        AND ingest_file_result.terminal_url NOT LIKE '%aip.scitation.org%'
        AND ingest_file_result.terminal_url NOT LIKE '%academic.oup.com%'
        AND ingest_file_result.terminal_url NOT LIKE '%tandfonline.com%'
        AND ingest_file_result.terminal_url NOT LIKE '%://archive.org/%'
        AND ingest_file_result.terminal_url NOT LIKE '%://web.archive.org/%'
        AND ingest_file_result.terminal_url NOT LIKE '%://www.archive.org/%'
    GROUP BY ingest_request.ingest_type, status
    ORDER BY ingest_request.ingest_type, COUNT DESC
    LIMIT 50;

     ingest_type |        status         |  count
    -------------+-----------------------+---------
     pdf         | success               | 2851286
     pdf         | no-pdf-link           |  527495
     pdf         | redirect-loop         |  345138
     pdf         | no-capture            |  268140
     pdf         | null-body             |  129027
     pdf         | terminal-bad-status   |   91125
     pdf         | link-loop             |   25267
     pdf         | wrong-mimetype        |    6504
     pdf         | wayback-error         |    2907
     pdf         | petabox-error         |     363
     pdf         | wayback-content-error |     242
     pdf         | bad-redirect          |     203
     pdf         | redirects-exceeded    |     135
     pdf         | max-hops-exceeded     |      19
     pdf         | cdx-error             |      15
     pdf         | bad-gzip-encoding     |       2
     xml         | success               |    6897
     xml         | null-body             |    2353
     xml         | wrong-mimetype        |     184
     xml         | no-capture            |       5
    (20 rows)
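`no-pdf-link` is now the largest remaining pdf bucket. To spot-check what those landing pages actually look like, a random sample could be pulled from the rows dump; a sketch, re-using the pre-crawl 2020-11-19 dump (a fresh post-crawl dump would be needed to reflect current statuses):

    cat /grande/snapshots/doaj_seedlist_2020-11-19.rows.json \
        | rg '"no-pdf-link"' \
        | shuf -n 20 \
        | jq -r .result.terminal_url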