A new snapshot was released in April 2020 (the snapshot is from 2020-02-25,
but was not released until more than a month later).

Primary goals are:

- generate ingest requests for only *new* URLs
- bulk ingest these new URLs
- crawl any no-capture URLs from that batch
- re-bulk-ingest the no-capture batch
- analytics on failed ingests; eg, any particular domains that are failing to crawl

This ingest pipeline was started on 2020-04-07 by bnewbold.

Ran through the first two steps again on 2020-05-03 after unpaywall had
released another dump (dated 2020-04-27).

## Transform and Load

    # in sandcrawler pipenv on aitio
    zcat /schnell/UNPAYWALL-PDF-CRAWL-2020-04/unpaywall_snapshot_2020-02-25T115244.jsonl.gz | ./scripts/unpaywall2ingestrequest.py - | pv -l > /grande/snapshots/unpaywall_snapshot_2020-02-25.ingest_request.json
    => 24.7M 5:17:03 [ 1.3k/s]

    cat /grande/snapshots/unpaywall_snapshot_2020-02-25.ingest_request.json | pv -l | ./persist_tool.py ingest-request -
    => 24.7M
    => Worker: Counter({'total': 24712947, 'insert-requests': 4282167, 'update-requests': 0})

Second time:

    # in sandcrawler pipenv on aitio
    zcat /schnell/UNPAYWALL-PDF-CRAWL-2020-04/unpaywall_snapshot_2020-04-27T153236.jsonl.gz | ./scripts/unpaywall2ingestrequest.py - | pv -l > /grande/snapshots/unpaywall_snapshot_2020-04-27.ingest_request.json
    => 25.2M 3:16:28 [2.14k/s]

    cat /grande/snapshots/unpaywall_snapshot_2020-04-27.ingest_request.json | pv -l | ./persist_tool.py ingest-request -
    => Worker: Counter({'total': 25189390, 'insert-requests': 1408915, 'update-requests': 0})
    => JSON lines pushed: Counter({'pushed': 25189390, 'total': 25189390})

## Dump new URLs and Bulk Ingest

    COPY (
        SELECT row_to_json(ingest_request.*)
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_request.link_source = 'unpaywall'
            AND date(ingest_request.created) > '2020-04-01'
            AND ingest_file_result.status IS NULL
    ) TO '/grande/snapshots/unpaywall_noingest_2020-04-08.rows.json';
    => 3696189

WARNING: forgot to transform from rows to ingest requests.

    cat /grande/snapshots/unpaywall_noingest_2020-04-08.rows.json | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1

Second time:

    COPY (
        SELECT row_to_json(ingest_request.*)
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_request.link_source = 'unpaywall'
            AND date(ingest_request.created) > '2020-05-01'
            AND ingest_file_result.status IS NULL
    ) TO '/grande/snapshots/unpaywall_noingest_2020-05-03.rows.json';
    => 1799760

WARNING: forgot to transform from rows to ingest requests.

    cat /grande/snapshots/unpaywall_noingest_2020-05-03.rows.json | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
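The "forgot to transform" warnings above matter because the `.rows.json` dumps
are raw `row_to_json()` output from the `ingest_request` table, not the ingest
request documents the bulk workers expect. A minimal sketch of the kind of
transform that `./scripts/ingestrequest_row2json.py` is used for later in these
notes (the `created` and nested `request` column names are assumptions about
the table layout, not taken from the actual script):

    #!/usr/bin/env python3
    """
    Sketch: convert row_to_json() dumps of the ingest_request table into
    plain ingest request documents, one JSON object per line on stdout.
    """

    import json
    import sys

    def row2request(row: dict) -> dict:
        # drop database-only bookkeeping columns (assumed column name)
        row.pop('created', None)
        # fold any extra request fields stored in a nested 'request' JSON
        # column (assumed) into the top-level document
        extra = row.pop('request', None) or {}
        if isinstance(extra, str):
            extra = json.loads(extra)
        row.update(extra)
        return row

    def main():
        for line in sys.stdin:
            line = line.strip()
            if not line:
                continue
            print(json.dumps(row2request(json.loads(line)), sort_keys=True))

    if __name__ == '__main__':
        main()

Something like `cat dump.rows.json | ./sketch.py | kafkacat ...` would then
push proper request documents instead of raw rows.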
## Dump no-capture, Run Crawl

Make two ingest request dumps: one with "all" URLs, which we will have
heritrix attempt to crawl, and then one with certain domains filtered out,
which we may or may not bother trying to ingest (due to expectation of
failure).

    COPY (
        SELECT row_to_json(ingest_request.*)
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_request.link_source = 'unpaywall'
            AND date(ingest_request.created) > '2020-04-01'
            AND ingest_file_result.status = 'no-capture'
    ) TO '/grande/snapshots/unpaywall_nocapture_all_2020-05-04.rows.json';
    => 2734145

    COPY (
        SELECT row_to_json(ingest_request.*)
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_request.link_source = 'unpaywall'
            AND date(ingest_request.created) > '2020-04-01'
            AND ingest_file_result.status = 'no-capture'
            AND ingest_request.base_url NOT LIKE '%journals.sagepub.com%'
            AND ingest_request.base_url NOT LIKE '%pubs.acs.org%'
            AND ingest_request.base_url NOT LIKE '%ahajournals.org%'
            AND ingest_request.base_url NOT LIKE '%www.journal.csj.jp%'
            AND ingest_request.base_url NOT LIKE '%aip.scitation.org%'
            AND ingest_request.base_url NOT LIKE '%academic.oup.com%'
            AND ingest_request.base_url NOT LIKE '%tandfonline.com%'
    ) TO '/grande/snapshots/unpaywall_nocapture_2020-05-04.rows.json';
    => 2602408

NOTE: forgot here to transform from "rows" to ingest requests.

Not actually a very significant size difference after all.

See `journal-crawls` repo for details on seedlist generation and crawling.

## Re-Ingest Post-Crawl

NOTE: if we *do* want to do cleanup eventually, could look for fatcat edits
between 2020-04-01 and 2020-05-25 which have limited "extra" metadata (eg, no
evidence or `oa_status`).

The earlier bulk ingests were done wrong (forgot to transform from rows to
full ingest request docs), so going to re-do those, which should be a
superset of the no-capture crawl URLs:

    ./scripts/ingestrequest_row2json.py /grande/snapshots/unpaywall_noingest_2020-04-08.rows.json | pv -l > /grande/snapshots/unpaywall_noingest_2020-04-08.json
    => 1.26M 0:00:58 [21.5k/s]
    => previously: 3,696,189

    ./scripts/ingestrequest_row2json.py /grande/snapshots/unpaywall_noingest_2020-05-03.rows.json | pv -l > /grande/snapshots/unpaywall_noingest_2020-05-03.json
    => 1.26M 0:00:56 [22.3k/s]

Crap, looks like the 2020-04-08 segment got overwritten with 2020-05 data by
accident. Hrm... need to re-ingest *all* recent unpaywall URLs:

    COPY (
        SELECT row_to_json(ingest_request.*)
        FROM ingest_request
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_request.link_source = 'unpaywall'
            AND date(ingest_request.created) > '2020-04-01'
    ) TO '/grande/snapshots/unpaywall_all_recent_requests_2020-05-26.rows.json';
    => COPY 5691106

    ./scripts/ingestrequest_row2json.py /grande/snapshots/unpaywall_all_recent_requests_2020-05-26.rows.json | pv -l | shuf > /grande/snapshots/unpaywall_all_recent_requests_2020-05-26.requests.json
    => 5.69M 0:04:26 [21.3k/s]

Start small:

    cat /grande/snapshots/unpaywall_all_recent_requests_2020-05-26.requests.json | head -n200 | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1

Looks good (whew), run the full thing:

    cat /grande/snapshots/unpaywall_all_recent_requests_2020-05-26.requests.json | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
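The last item in the original plan (analytics on failed ingests) is not
written up here yet. As a first pass, domains could be tallied directly from
the no-capture row dumps generated above, without going back to the database;
a rough sketch (only the `base_url` field is taken from the dumps above, the
rest is illustrative):

    #!/usr/bin/env python3
    """
    Sketch: count which domains dominate a dump of failed/no-capture ingest
    rows, to spot publishers that need crawl-config attention.
    """

    import json
    import sys
    from collections import Counter
    from urllib.parse import urlparse

    def main():
        counts = Counter()
        for line in sys.stdin:
            line = line.strip()
            if not line:
                continue
            row = json.loads(line)
            domain = urlparse(row.get('base_url', '')).netloc.lower()
            if domain:
                counts[domain] += 1
        for domain, count in counts.most_common(30):
            print(f"{count}\t{domain}")

    if __name__ == '__main__':
        main()

For example, piping `/grande/snapshots/unpaywall_nocapture_all_2020-05-04.rows.json`
through this would list the 30 most common no-capture domains; the same
numbers could also be pulled with a `GROUP BY` over `ingest_file_result.status`
in SQL.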