It seems that many gold OA DOIs were not ingesting simply because the HTML URL extraction was not working for a particular version of OJS. Let's re-try all ~2.5 million of these in bulk mode and see how many are `no-capture` vs. other errors, then possibly re-crawl a large number.

## Bulk Ingest

Dump ingest requests:

    ./fatcat_ingest.py query 'is_oa:true preservation:none !arxiv_id:* !pmcid:* !publisher_type:big5 type:article-journal' \
        | pv -l \
        > /srv/fatcat/snapshots/oa_doi_20200915.ingest_request.json

    Expecting 2569876 release objects in search queries
    Counter({'elasticsearch_release': 2569880, 'estimate': 2569880, 'ingest_request': 2063034})

Enqueue for bulk ingest:

    cat /srv/fatcat/snapshots/oa_doi_20200915.ingest_request.json \
        | rg -v "\\\\" \
        | jq . -c \
        | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1

Started at about: Thu Sep 17 00:15:00 UTC 2020 (2020-09-17T00:15:00Z)
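As a side note, here is a minimal Python sketch of what that `rg -v "\\\\" | jq . -c` stage is doing, as I read it: drop any line containing a backslash (presumably to dodge JSON-escaping problems when producing to Kafka) and re-serialize each remaining record as compact single-line JSON. This is just an illustration, not part of the actual pipeline.

    import json
    import sys

    # Sketch of the `rg -v "\\\\" | jq . -c` filter stage: skip lines containing
    # a backslash, then re-serialize each remaining JSON record compactly so
    # kafkacat sees exactly one clean record per line.
    for line in sys.stdin:
        if "\\" in line:
            continue
        try:
            record = json.loads(line)
        except ValueError:
            continue  # not valid JSON; skip it
        print(json.dumps(record))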
## Stats

    SELECT ingest_file_result.status, COUNT(*)
    FROM ingest_request
    LEFT JOIN ingest_file_result
        ON ingest_file_result.ingest_type = ingest_request.ingest_type
        AND ingest_file_result.base_url = ingest_request.base_url
    WHERE
        ingest_request.ingest_type = 'pdf'
        AND ingest_request.ingest_request_source = 'fatcat-ingest'
        AND ingest_file_result.updated >= '2020-09-16'
    GROUP BY status
    ORDER BY COUNT DESC
    LIMIT 30;

                  status                 |  count
    -------------------------------------+--------
     no-capture                          | 513462
     success                             | 206042
     no-pdf-link                         | 186779
     terminal-bad-status                 |  40372
     redirect-loop                       |  33103
     cdx-error                           |  24078
     link-loop                           |  13494
     spn2-cdx-lookup-failure             |  10247
     gateway-timeout                     |   4407
     wrong-mimetype                      |   3213
     petabox-error                       |    866
     null-body                           |    449
     spn2-error                          |    217
     wayback-error                       |    129
     spn2-error:job-failed               |     64
     bad-redirect                        |      6
     spn2-error:soft-time-limit-exceeded |      1
    (17 rows)

This only covers about half of the enqueued requests. Try a broader query:

    SELECT ingest_file_result.status, COUNT(*)
    FROM ingest_request
    LEFT JOIN ingest_file_result
        ON ingest_file_result.ingest_type = ingest_request.ingest_type
        AND ingest_file_result.base_url = ingest_request.base_url
    WHERE
        ingest_request.ingest_type = 'pdf'
        AND ingest_request.link_source = 'doi'
        AND (ingest_request.ingest_request_source = 'fatcat-ingest'
             OR ingest_request.ingest_request_source = 'fatcat-changelog')
        AND ingest_file_result.updated >= '2020-09-15'
        AND ingest_file_result.updated <= '2020-09-20'
    GROUP BY status
    ORDER BY COUNT DESC
    LIMIT 30;

                  status                 |  count
    -------------------------------------+--------
     no-capture                          | 579952
     success                             | 387325
     no-pdf-link                         | 380406
     terminal-bad-status                 |  63743
     redirect-loop                       |  53893
     cdx-error                           |  46024
     spn2-cdx-lookup-failure             |  28347
     link-loop                           |  22573
     gateway-timeout                     |  11686
     wrong-mimetype                      |   6294
     null-body                           |   3509
     petabox-error                       |   2388
     spn2-error                          |   1023
     spn2-error:job-failed               |    462
     wayback-error                       |    347
     spn2-error:soft-time-limit-exceeded |     20
     bad-redirect                        |     11
    (17 rows)

What are the top domains for those `no-pdf-link` (or similar) statuses?

    SELECT domain, status, COUNT((domain, status))
    FROM (
        SELECT
            ingest_file_result.ingest_type,
            ingest_file_result.status,
            substring(ingest_file_result.terminal_url FROM '[^/]+://([^/]*)') AS domain
        FROM ingest_file_result
        LEFT JOIN ingest_request
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_request.link_source = 'doi'
            AND (ingest_request.ingest_request_source = 'fatcat-ingest'
                 OR ingest_request.ingest_request_source = 'fatcat-changelog')
            AND ingest_file_result.updated >= '2020-09-15'
            AND ingest_file_result.updated <= '2020-09-20'
    ) t1
    WHERE
        t1.domain != ''
        AND t1.status != 'success'
        AND t1.status != 'no-capture'
    GROUP BY domain, status
    ORDER BY COUNT DESC
    LIMIT 30;

                domain            |         status          | count
    ------------------------------+-------------------------+-------
     zenodo.org                   | no-pdf-link             | 56488
     figshare.com                 | no-pdf-link             | 55337
     www.egms.de                  | redirect-loop           | 22686
     zenodo.org                   | terminal-bad-status     | 22128
     tandf.figshare.com           | no-pdf-link             | 20027
     springernature.figshare.com  | no-pdf-link             | 17181
     cairn.info                   | terminal-bad-status     | 13836
     www.persee.fr                | terminal-bad-status     |  7565
     projecteuclid.org            | link-loop               |  7449
     www.cairn.info               | no-pdf-link             |  6992
     scialert.net                 | no-pdf-link             |  6621
     www.cairn.info               | link-loop               |  5870
     utpjournals.press            | no-pdf-link             |  5772
     journals.openedition.org     | redirect-loop           |  5464
     www.egms.de                  | no-pdf-link             |  5223
     archaeologydataservice.ac.uk | no-pdf-link             |  4881
     rs.figshare.com              | no-pdf-link             |  4773
     www.degruyter.com            | spn2-cdx-lookup-failure |  4763
     koreascience.or.kr           | no-pdf-link             |  4487
     cancerres.aacrjournals.org   | no-pdf-link             |  4124
     cms.math.ca                  | no-pdf-link             |  3441
     volcano.si.edu               | no-pdf-link             |  3424
     www.mathnet.ru               | no-pdf-link             |  3229
     tidsskriftet.no              | no-pdf-link             |  3012
     journals.plos.org            | no-pdf-link             |  3005
     tudigit.ulb.tu-darmstadt.de  | no-pdf-link             |  2796
     www.cairn.info:80            | link-loop               |  2647
     hammer.figshare.com          | no-pdf-link             |  2627
     www.psychosocial.com         | no-pdf-link             |  2457
     osf.io                       | terminal-bad-status     |  2388
    (30 rows)

Will look at PDF link extraction for the following domains (a rough spot-check sketch follows this list):

- scialert.net
- utpjournals.press
- koreascience.or.kr
- cancerres.aacrjournals.org
- cms.math.ca
- volcano.si.edu
- www.mathnet.ru
- www.psychosocial.com
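A minimal sketch of the kind of spot-check I have in mind for those domains, not sandcrawler's actual extraction code: fetch a landing page and look for a `citation_pdf_url` meta tag, which is one common pattern; several of these sites will presumably need site-specific rules instead. The script name, function name, and the use of requests/BeautifulSoup here are assumptions for illustration only.

    import sys

    import requests
    from bs4 import BeautifulSoup

    def find_pdf_link(landing_url):
        """Hypothetical spot-check helper: look for a citation_pdf_url meta tag."""
        resp = requests.get(landing_url, timeout=30)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")
        meta = soup.find("meta", attrs={"name": "citation_pdf_url"})
        if meta and meta.get("content"):
            return meta["content"]
        return None

    if __name__ == "__main__":
        # usage: python3 check_pdf_link.py <landing-page-url>
        print(find_pdf_link(sys.argv[1]))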
## Re-Ingest

Going to re-run ingest for the `no-capture` cases, to extract the missing terminal URLs:

    COPY (
        SELECT row_to_json(ingest_request.*)
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_request.link_source = 'doi'
            AND (ingest_request.ingest_request_source = 'fatcat-ingest'
                 OR ingest_request.ingest_request_source = 'fatcat-changelog')
            AND ingest_file_result.updated >= '2020-09-15'
            AND ingest_file_result.updated <= '2020-09-20'
            AND ingest_file_result.status = 'no-capture'
            -- AND ingest_request.base_url NOT LIKE '%journals.sagepub.com%'
    ) TO '/grande/snapshots/oa_doi_reingest_nocapture_20201012.rows.json';
    => COPY 579952

    ./scripts/ingestrequest_row2json.py /grande/snapshots/oa_doi_reingest_nocapture_20201012.rows.json \
        | pv -l \
        | shuf \
        > /grande/snapshots/oa_doi_reingest_nocapture_20201012.ingest_request.json
    => 579k 0:00:22 [25.9k/s]

    cat /grande/snapshots/oa_doi_reingest_nocapture_20201012.ingest_request.json \
        | rg -v "\\\\" \
        | jq . -c \
        | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1

After that, will re-crawl somewhat broadly:

    COPY (
        SELECT row_to_json(r) FROM (
            SELECT ingest_request.*, ingest_file_result.terminal_url AS terminal_url
            FROM ingest_request
            LEFT JOIN ingest_file_result
                ON ingest_file_result.ingest_type = ingest_request.ingest_type
                AND ingest_file_result.base_url = ingest_request.base_url
            WHERE
                ingest_request.ingest_type = 'pdf'
                AND ingest_request.link_source = 'doi'
                AND (ingest_request.ingest_request_source = 'fatcat-ingest'
                     OR ingest_request.ingest_request_source = 'fatcat-changelog')
                AND (
                    (ingest_file_result.updated >= '2020-09-15' AND ingest_file_result.updated <= '2020-09-20')
                    OR ingest_file_result.updated >= '2020-10-11'
                )
                AND ingest_file_result.status != 'success'
        ) r
    ) TO '/grande/snapshots/oa_doi_reingest_recrawl_20201014.rows.json';
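The usual next step for the re-crawl would be to turn those dumped rows into a flat seedlist of URLs. A minimal sketch, assuming each input line is one JSON object with `terminal_url` and `base_url` fields (in practice the COPY output would presumably be run through `./scripts/ingestrequest_row2json.py` first, as above); this is not the production seedlist tooling:

    import json
    import sys

    # Sketch only: build a de-duplicated seedlist from the rows dumped above.
    # Prefer the terminal URL where the earlier attempt got that far; fall
    # back to the original base URL otherwise.
    seen = set()
    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        url = row.get("terminal_url") or row.get("base_url")
        if url and url not in seen:
            seen.add(url)
            print(url)

Something like `cat oa_doi_reingest_recrawl_20201014.rows.json | python3 make_seedlist.py > seedlist.txt` (hypothetical script name) would then produce the crawl seed file.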