## Queries to find broken domains

Top domains with failed ingests:

    SELECT domain, status, COUNT((domain, status))
    FROM (SELECT status, substring(terminal_url FROM '[^/]+://([^/]*)') AS domain FROM ingest_file_result) t1
    WHERE t1.domain != ''
    AND t1.status != 'success'
    AND t1.status != 'no-capture'
    GROUP BY domain, status
    ORDER BY COUNT DESC
    LIMIT 30;

Status overview for a particular domain:

    SELECT domain, status, COUNT((domain, status))
    FROM (SELECT status, substring(terminal_url FROM '[^/]+://([^/]*)') AS domain FROM ingest_file_result) t1
    WHERE t1.domain = 'www.osapublishing.org'
    GROUP BY domain, status
    ORDER BY COUNT DESC;

    SELECT domain, terminal_status_code, COUNT((domain, terminal_status_code))
    FROM (SELECT terminal_status_code, substring(terminal_url FROM '[^/]+://([^/]*)') AS domain FROM ingest_file_result) t1
    WHERE t1.domain = 'www.osapublishing.org'
    AND t1.terminal_status_code is not null
    GROUP BY domain, terminal_status_code
    ORDER BY COUNT DESC;

Sample recent failures:

    SELECT * FROM ingest_file_result
    WHERE terminal_url LIKE '%osapublishing.org%'
    AND status = 'terminal-bad-status'
    ORDER BY updated DESC
    LIMIT 10;

## Failing

www.osapublishing.org

this publisher (The Optical Society) is systematically using a CAPTCHA to gate access to PDFs. bummer! could ask them to white-list?

has citation_pdf_url, so that isn't an issue

    status: "no-pdf-link"
    hops:
        "https://doi.org/10.1364/optica.6.000798",
        "https://www.osapublishing.org/viewmedia.cfm?uri=optica-6-6-798&seq=0"
        "https://www.osapublishing.org/captcha/?guid=830CEAB5-09BD-6140-EABD-751200C78B1C"

            domain         |       status        | count
    -----------------------+---------------------+-------
     www.osapublishing.org | no-capture          | 16680
     www.osapublishing.org | no-pdf-link         |   373
     www.osapublishing.org | redirect-loop       |    19
     www.osapublishing.org | terminal-bad-status |     5
     www.osapublishing.org | cdx-error           |     1
     www.osapublishing.org | wrong-mimetype      |     1
     www.osapublishing.org | spn-error           |     1
     www.osapublishing.org | success             |     1
     www.osapublishing.org | wayback-error       |     1
    (9 rows)

www.persee.fr

Seems to be mostly blocking or rate-limiting?

        domain     |               status                | count
    ---------------+-------------------------------------+-------
     www.persee.fr | no-capture                          | 37862
     www.persee.fr | terminal-bad-status                 |  3134
     www.persee.fr | gateway-timeout                     |  2828
     www.persee.fr | no-pdf-link                         |   431
     www.persee.fr | spn-error                           |    75
     www.persee.fr | redirect-loop                       |    23
     www.persee.fr | success                             |     8
     www.persee.fr | spn2-error                          |     2
     www.persee.fr | spn2-error:soft-time-limit-exceeded |     1
     www.persee.fr | wrong-mimetype                      |     1
    (10 rows)

journals.openedition.org

PDF access is via "freemium" subscription. Get redirects to:

    https://auth.openedition.org/authorized_ip?url=http%3A%2F%2Fjournals.openedition.org%2Fnuevomundo%2Fpdf%2F61053

Content is technically open access (HTML and license; for all content?), but can't be crawled as PDF without subscription.

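Both blocking patterns noted so far are visible in the URL itself: the www.osapublishing.org hop chain ends on a `/captcha/` path, and journals.openedition.org bounces to `auth.openedition.org/authorized_ip` with the requested URL in the `url` parameter. A minimal sketch of spotting these from a terminal URL follows. The function name `classify_block` and its labels are hypothetical and not part of the ingest code.

```python
from urllib.parse import urlsplit, parse_qs


def classify_block(terminal_url):
    """Return (label, original_url) for known blocking patterns, else (None, None).

    The labels "captcha" and "auth-redirect" are made up for this sketch.
    """
    parts = urlsplit(terminal_url)
    host = parts.hostname or ""
    # osapublishing.org style: the redirect chain ends on a /captcha/ page
    if parts.path.startswith("/captcha/"):
        return ("captcha", None)
    # openedition.org style: bounce to an IP-authorization page that carries
    # the originally requested URL in the "url" query parameter
    if host == "auth.openedition.org" and parts.path == "/authorized_ip":
        original = parse_qs(parts.query).get("url", [None])[0]
        return ("auth-redirect", original)
    return (None, None)


print(classify_block(
    "https://auth.openedition.org/authorized_ip"
    "?url=http%3A%2F%2Fjournals.openedition.org%2Fnuevomundo%2Fpdf%2F61053"
))
```

Rows classified this way could be counted separately from genuine redirect-loop or no-pdf-link failures.
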
              domain          |         status          | count
    --------------------------+-------------------------+-------
     journals.openedition.org | redirect-loop           | 29587
     journals.openedition.org | success                 |  6821
     journals.openedition.org | no-pdf-link             |  1507
     journals.openedition.org | no-capture              |   412
     journals.openedition.org | wayback-error           |    32
     journals.openedition.org | wrong-mimetype          |    27
     journals.openedition.org | terminal-bad-status     |    13
     journals.openedition.org | spn2-cdx-lookup-failure |     4
     journals.openedition.org | spn-remote-error        |     1
     journals.openedition.org | null-body               |     1
     journals.openedition.org | cdx-error               |     1
    (11 rows)

journals.lww.com no-pdf-link

          domain      |     status     | count
    ------------------+----------------+-------
     journals.lww.com | no-pdf-link    | 11668
     journals.lww.com | wrong-mimetype |   131
    (2 rows)

doi prefix: 10.1097

    data-pdf-url="https://pdfs.journals.lww.com/spinejournal/9000/00000/Making_the_Most_of_Systematic_Reviews_and.94318.pdf?token=method|ExpireAbsolute;source|Journals;ttl|1582413672903;payload|mY8D3u1TCCsNvP5E421JYK6N6XICDamxByyYpaNzk7FKjTaa1Yz22MivkHZqjGP4kdS2v0J76WGAnHACH69s21Csk0OpQi3YbjEMdSoz2UhVybFqQxA7lKwSUlA502zQZr96TQRwhVlocEp/sJ586aVbcBFlltKNKo+tbuMfL73hiPqJliudqs17cHeLcLbV/CqjlP3IO0jGHlHQtJWcICDdAyGJMnpi6RlbEJaRheGeh5z5uvqz3FLHgPKVXJzdiVgCTnUeUQFYzcJRFhNtc2gv+ECZGji7HUicj1/6h85Y07DBRl1x2MGqlHWXUawD;hash|6cqYBa15ZK407m4VhFfJLw=="

Some weird thing going on, maybe they are blocking-via-redirect based on our User-Agent? Seems like wget works, so funny that they don't block that.

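The journals.lww.com pages carry the PDF location in a `data-pdf-url` attribute, which may be part of why link extraction reports no-pdf-link. As a rough sketch, such attributes can be pulled out of fetched HTML with the standard-library parser. The class and function names are hypothetical, and the sample fragment is invented (which element actually carries the attribute is not recorded in these notes, so every tag is scanned).

```python
from html.parser import HTMLParser


class DataPdfUrlParser(HTMLParser):
    """Collect the value of every data-pdf-url attribute, on any tag."""

    def __init__(self):
        super().__init__()
        self.pdf_urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased
        for name, value in attrs:
            if name == "data-pdf-url" and value:
                self.pdf_urls.append(value)


def extract_data_pdf_urls(html):
    parser = DataPdfUrlParser()
    parser.feed(html)
    parser.close()
    return parser.pdf_urls


# Hypothetical page fragment; the real token is much longer.
sample = (
    '<div><a class="pdf" '
    'data-pdf-url="https://pdfs.journals.lww.com/example.pdf?token=abc">PDF</a></div>'
)
print(extract_data_pdf_urls(sample))
```

Whether the extracted URL can then be fetched is a separate question, given the possible User-Agent based blocking noted above.
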
musewide.aip.de no-pdf-link

koreascience.or.kr | no-pdf-link | 8867

    SELECT domain, status, COUNT((domain, status))
    FROM (SELECT status, substring(terminal_url FROM '[^/]+://([^/]*)') AS domain FROM ingest_file_result) t1
    WHERE t1.domain = 'www.osapublishing.org'
    GROUP BY domain, status
    ORDER BY COUNT DESC;

    SELECT * FROM ingest_file_result
    WHERE terminal_url LIKE '%osapublishing.org%'
    AND status = 'terminal-bad-status'
    ORDER BY updated DESC
    LIMIT 10;

Next rows from the top-domains list:

    www.cairn.info    | link-loop   | 8717
    easy.dans.knaw.nl | no-pdf-link | 8262

scielo.conicyt.cl | no-pdf-link | 7925

    SELECT domain, status, COUNT((domain, status))
    FROM (SELECT status, substring(terminal_url FROM '[^/]+://([^/]*)') AS domain FROM ingest_file_result) t1
    WHERE t1.domain = 'scielo.conicyt.cl'
    GROUP BY domain, status
    ORDER BY COUNT DESC;

    SELECT * FROM ingest_file_result
    WHERE terminal_url LIKE '%scielo.conicyt.cl%'
    AND status = 'terminal-bad-status'
    ORDER BY updated DESC
    LIMIT 10;

          domain       |       status        | count
    -------------------+---------------------+-------
     scielo.conicyt.cl | no-pdf-link         |  7926
     scielo.conicyt.cl | success             |  4972
     scielo.conicyt.cl | terminal-bad-status |  1474
     scielo.conicyt.cl | wrong-mimetype      |     6
     scielo.conicyt.cl | no-capture          |     4
     scielo.conicyt.cl | null-body           |     1

    pdf | https://doi.org/10.4067/s0370-41061980000300002 | 2020-02-22 23:55:56.235822+00 | f | terminal-bad-status | https://scielo.conicyt.cl/scielo.php?script=sci_arttext&pid=S0370-41061980000300002&lng=en&nrm=iso&tlng=en | 20200212201727 | 200 |
    pdf | https://doi.org/10.4067/s0718-221x2019005000201 | 2020-02-22 23:01:49.070104+00 | f | terminal-bad-status | https://scielo.conicyt.cl/scielo.php?script=sci_arttext&pid=S0718-221X2019005000201&lng=en&nrm=iso&tlng=en | 20200214105308 | 200 |
    pdf | https://doi.org/10.4067/s0717-75262011000200002 | 2020-02-22 22:49:36.429717+00 | f | terminal-bad-status | https://scielo.conicyt.cl/scielo.php?script=sci_arttext&pid=S0717-75262011000200002&lng=en&nrm=iso&tlng=en | 20200211205804 | 200 |
    pdf | https://doi.org/10.4067/s0717-95022006000400029 | 2020-02-22 22:33:07.761766+00 | f | terminal-bad-status | https://scielo.conicyt.cl/scielo.php?script=sci_arttext&pid=S0717-95022006000400029&lng=en&nrm=iso&tlng=en | 20200209044048 | 200 |

These seem, on retry, like success? Maybe previous was a matter of warc/revisit not getting handled correctly?

    pdf | https://doi.org/10.4067/s0250-71611998007100009 | 2020-02-22 23:57:16.481703+00 | f | no-pdf-link | https://scielo.conicyt.cl/scielo.php?script=sci_arttext&pid=S0250-71611998007100009&lng=en&nrm=iso&tlng=en | 20200212122939 | 200 |
    pdf | https://doi.org/10.4067/s0716-27902005020300006 | 2020-02-22 23:56:01.247616+00 | f | no-pdf-link | https://scielo.conicyt.cl/scielo.php?script=sci_arttext&pid=S0716-27902005020300006&lng=en&nrm=iso&tlng=en | 20200214192151 | 200 |
    pdf | https://doi.org/10.4067/s0718-23762005000100015 | 2020-02-22 23:53:55.81526+00 | f | no-pdf-link | https://scielo.conicyt.cl/scielo.php?script=sci_arttext&pid=S0718-23762005000100015&lng=en&nrm=iso&tlng=en | 20200214173237 | 200 |

Look like web/xml only.

TODO: XML ingest (and replay?) support. These are as "", not sure if that is JATS or what.

Next rows from the top-domains list:

    www.kci.go.kr             | no-pdf-link | 6842
    www.m-hikari.com          | no-pdf-link | 6763
    cshprotocols.cshlp.org    | no-pdf-link | 6553
    www.bibliotekevirtual.org | no-pdf-link | 6309
    data.hpc.imperial.ac.uk   | no-pdf-link | 6071

projecteuclid.org | link-loop | 5970

    SELECT domain, status, COUNT((domain, status))
    FROM (SELECT status, substring(terminal_url FROM '[^/]+://([^/]*)') AS domain FROM ingest_file_result) t1
    WHERE t1.domain = 'projecteuclid.org'
    GROUP BY domain, status
    ORDER BY COUNT DESC;

    SELECT * FROM ingest_file_result
    WHERE terminal_url LIKE '%projecteuclid.org%'
    AND status = 'link-loop'
    ORDER BY updated DESC
    LIMIT 10;

          domain       |         status          | count
    -------------------+-------------------------+-------
     projecteuclid.org | link-loop               |  5985
     projecteuclid.org | success                 |    26
     projecteuclid.org | wayback-error           |    26
     projecteuclid.org | wrong-mimetype          |    17
     projecteuclid.org | spn2-cdx-lookup-failure |     4
     projecteuclid.org | other-mimetype          |     4
     projecteuclid.org | no-capture              |     3
     projecteuclid.org | terminal-bad-status     |     2
     projecteuclid.org | spn2-error:job-failed   |     1
     projecteuclid.org | spn-remote-error        |     1
    (10 rows)

Doing a cookie check and redirect.

TODO: brozzler behavior to "click the link" instead?

www.scielo.br | no-pdf-link | 5823

    SELECT domain, status, COUNT((domain, status))
    FROM (SELECT status, substring(terminal_url FROM '[^/]+://([^/]*)') AS domain FROM ingest_file_result) t1
    WHERE t1.domain = 'www.scielo.br'
    GROUP BY domain, status
    ORDER BY COUNT DESC;

    SELECT * FROM ingest_file_result
    WHERE terminal_url LIKE '%www.scielo.br%'
    AND status = 'no-pdf-link'
    ORDER BY updated DESC
    LIMIT 10;

        domain     |         status          | count
    ---------------+-------------------------+-------
     www.scielo.br | success                 | 35150
     www.scielo.br | no-pdf-link             |  5839
     www.scielo.br | terminal-bad-status     |   429
     www.scielo.br | no-capture              |   189
     www.scielo.br | wrong-mimetype          |     7
     www.scielo.br | spn2-cdx-lookup-failure |     2
    (6 rows)

Seems to just be the subset with no PDFs.

    get.iedadata.org            | no-pdf-link         | 5822
    www.pdcnet.org              | no-pdf-link         | 5798
    publications.rwth-aachen.de | no-pdf-link         | 5323
    www.sciencedomain.org       | no-pdf-link         | 5231
    medicalforum.ch             | terminal-bad-status | 4574
    jrnl.nau.edu.ua             | link-loop           | 4145
    ojs.academypublisher.com    | no-pdf-link         | 4017

## MAG bulk ingest

- dialnet.unirioja.es | redirect-loop | 240967
  dialnet.unirioja.es | terminal-bad-status | 20320
  => may be worth re-crawling via heritrix?
- agupubs.onlinelibrary.wiley.com | no-pdf-link | 72639
  => and other *.onlinelibrary.wiley.com
- www.researchgate.net | redirect-loop | 42859
- www.redalyc.org:9081 | no-pdf-link | 10515
- www.repository.naturalis.nl | redirect-loop | 8213
- bjp.rcpsych.org | link-loop | 8045
- journals.tubitak.gov.tr | wrong-mimetype | 7159
- www.erudit.org | redirect-loop | 6819
- papers.ssrn.com | redirect-loop | 27328
  => blocking is pretty aggressive, using cookies or referrer or something.
  maybe a brozzler behavior would work, but doesn't currently

## Out of Scope

Datasets only?

- plutof.ut.ee
- www.gbif.org
- doi.pangaea.de
- www.plate-archive.org

Historical non-paper content:

- dhz.uni-passau.de (newspapers)
- digital.ucd.ie (Irish historical)

Mostly datasets (some PDF content):

- *.figshare.com
- zenodo.org
- data.mendeley.com
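
If these out-of-scope domains should be skipped when ranking failures, the lists can be applied as shell-style patterns. A minimal sketch follows. The names `OUT_OF_SCOPE_PATTERNS` and `is_out_of_scope` are hypothetical, and only some of the listed domains are included.

```python
from fnmatch import fnmatch

# A few of the patterns listed above; extend as needed.
OUT_OF_SCOPE_PATTERNS = [
    "plutof.ut.ee",
    "www.gbif.org",
    "doi.pangaea.de",
    "www.plate-archive.org",
    "dhz.uni-passau.de",
    "digital.ucd.ie",
    "*.figshare.com",
    "data.mendeley.com",
]


def is_out_of_scope(domain, patterns=OUT_OF_SCOPE_PATTERNS):
    """True if the domain matches any out-of-scope pattern (shell-style wildcards)."""
    domain = domain.lower()
    return any(fnmatch(domain, pattern) for pattern in patterns)


for d in ["example.figshare.com", "doi.pangaea.de", "www.scielo.br"]:
    print(d, is_out_of_scope(d))
```

A pattern like `*.figshare.com` only matches subdomains, so the bare `figshare.com` would need its own entry if it ever shows up.
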