Summary of top large broken domains (2021-04-21 "30 day" snapshot):

## acervus.unicamp.br

           domain       |   status    | count
    --------------------+-------------+-------
     acervus.unicamp.br |             |  1967
     acervus.unicamp.br | no-pdf-link |  1853

    select * from ingest_file_result where updated >= '2021-03-01' and terminal_url like '%acervus.unicamp.br%' and status = 'no-pdf-link' limit 5;

    http://acervus.unicamp.br/index.asp?codigo_sophia=963332

seems like many of these were captures with a blank page? or a redirect to the homepage?

    http://web.archive.org/web/20200129110523/http://acervus.unicamp.br/index.html

messy, going to move on.

## apex.ipk-gatersleben.de

     apex.ipk-gatersleben.de |             | 1253
     apex.ipk-gatersleben.de | no-pdf-link | 1132

    select * from ingest_file_result where updated >= '2021-03-01' and terminal_url like '%apex.ipk-gatersleben.de%' and status = 'no-pdf-link' limit 5;

    https://doi.org/10.25642/ipk/rescoll/4886
    https://apex.ipk-gatersleben.de/apex/f?p=PGRDOI:RESOLVE:::NO:RP:DOI:10.25642/IPK/RESCOLL/7331

seem to be datasets/species, not articles.

prefix: 10.25642/ipk

## crossref.org

     apps.crossref.org |             | 4693
     apps.crossref.org | no-pdf-link | 4075

    https://doi.org/10.1515/9781501747045-013
    https://apps.crossref.org/coaccess/coaccess.html?doi=10.1515%2F9781501747045-013

Derp, they are doing a dynamic/AJAX thing, so access links are not in the HTML.
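The per-domain sampling queries above (acervus.unicamp.br, apex.ipk-gatersleben.de) follow a single pattern, so they could be generated rather than retyped. A minimal sketch, assuming a hypothetical helper name `sample_query`; only the table and column names come from the queries in these notes:

```python
def sample_query(domain: str, status: str = "no-pdf-link",
                 since: str = "2021-03-01", limit: int = 5) -> str:
    """Build the sampling query used in these notes for one domain.

    `sample_query` is a hypothetical helper, not part of sandcrawler.
    Values are interpolated directly for readability only; use bound
    parameters if the inputs are not trusted.
    """
    return (
        "select * from ingest_file_result "
        f"where updated >= '{since}' "
        f"and terminal_url like '%{domain}%' "
        f"and status = '{status}' "
        f"limit {limit};"
    )


if __name__ == "__main__":
    # Print the query for another domain from the list above.
    print(sample_query("apps.crossref.org"))
```

The defaults mirror the two hand-written queries, so calling it with only a domain reproduces them.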
## openedition

     books.openedition.org |             | 1784
     books.openedition.org | no-pdf-link | 1466

    https://doi.org/10.4000/books.pul.34492
    https://books.openedition.org/pul/34492

these are not actually OA books (or at least, not all are)

## chemrxiv.org (figshare)

     chemrxiv.org |             | 857
     chemrxiv.org | no-pdf-link | 519

    https://doi.org/10.26434/chemrxiv.14411081
    https://chemrxiv.org/articles/preprint/Prediction_and_Optimization_of_Ion_Transport_Characteristics_in_Nanoparticle-Based_Electrolytes_Using_Convolutional_Neural_Networks/14411081

these all seem to be *multi-file* entities, thus not good for single file ingest pipeline.

## direct.mit.edu

     direct.mit.edu |             | 996
     direct.mit.edu | no-pdf-link | 869

    https://doi.org/10.7551/mitpress/14056.003.0004
    https://direct.mit.edu/books/monograph/5111/chapter-abstract/3060134/Adding-Technology-to-Contact-Tracing?redirectedFrom=fulltext

"not available"

    https://doi.org/10.7551/mitpress/12444.003.0004

"not available"

## dlc.library.columbia.edu

     dlc.library.columbia.edu |                    | 4225
     dlc.library.columbia.edu | no-pdf-link        | 2395
     dlc.library.columbia.edu | spn2-wayback-error | 1568

    https://doi.org/10.7916/d8-506w-kk49
    https://dlc.library.columbia.edu/durst/cul:18931zcrk9

document repository.
this one goes to IA! actually many seem to.
added extractor, should re-ingest with:

    publisher:"Columbia University" doi_prefix:10.7916 !journal:*

actually, that is like 600k+ results and many are not digitized, so perhaps not.

## doi.ala.org.au

     doi.ala.org.au |             | 2570
     doi.ala.org.au | no-pdf-link | 2153

    https://doi.org/10.26197/ala.811d55e3-2ff4-4501-b3e7-e19249507052
    https://doi.ala.org.au/doi/811d55e3-2ff4-4501-b3e7-e19249507052

this is a data repository, with filesets, not papers. datacite metadata is incorrect.
## fldeploc.dep.state.fl.us

     fldeploc.dep.state.fl.us |             | 774
     fldeploc.dep.state.fl.us | no-pdf-link | 718

    https://doi.org/10.35256/ic29
    http://fldeploc.dep.state.fl.us/geodb_query/fgs_doi.asp?searchCode=IC29

re-ingest with:

    # only ~800 works
    doi_prefix:10.35256 publisher:Florida

## geoscan.nrcan.gc.ca

     geoscan.nrcan.gc.ca |             | 2056
     geoscan.nrcan.gc.ca | no-pdf-link | 2019

    https://doi.org/10.4095/295366
    https://geoscan.nrcan.gc.ca/starweb/geoscan/servlet.starweb?path=geoscan/fulle.web&search1=R=295366

this is a geographic repository, not papers.

## kiss.kstudy.com

     kiss.kstudy.com |             | 747
     kiss.kstudy.com | no-pdf-link | 686

    https://doi.org/10.22143/hss21.12.1.121
    http://kiss.kstudy.com/thesis/thesis-view.asp?key=3862523

Korean. seems to not actually be theses? can't download.

## linkinghub.elsevier.com

     linkinghub.elsevier.com |                         | 5079
     linkinghub.elsevier.com | forbidden               | 2226
     linkinghub.elsevier.com | spn2-wayback-error      | 1625
     linkinghub.elsevier.com | spn2-cdx-lookup-failure |  758

skipping for now, looks like mostly 'forbidden'?

## osf.io

These are important!

     osf.io |                    | 3139
     osf.io | not-found          | 2288
     osf.io | spn2-wayback-error |  582

    https://doi.org/10.31219/osf.io/jux3w
    https://accounts.osf.io/login?service=https://osf.io/jux3w/download

many of these are 404s by browser as well. what does that mean?

## peerj.com

     peerj.com |             | 785
     peerj.com | no-pdf-link | 552

    https://doi.org/10.7287/peerj.11155v0.1/reviews/2
    https://peerj.com/articles/11155/reviews/

these are HTML reviews, not papers

## preprints.jmir.org

     preprints.jmir.org |             | 763
     preprints.jmir.org | no-pdf-link | 611

    https://doi.org/10.2196/preprints.22556
    https://preprints.jmir.org/preprint/22556

UGH, looks simple, but javascript.

could try to re-write URL into S3 format? meh.

## psyarxiv.com (OSF?)

     psyarxiv.com |             | 641
     psyarxiv.com | no-pdf-link | 546

    https://doi.org/10.31234/osf.io/5jaqg
    https://psyarxiv.com/5jaqg/

Also infuriatingly Javascript, but can do URL hack.
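The notes do not spell out the URL hack. One plausible form, sketched here as an assumption, is appending `download` to the preprint landing URL, mirroring the `https://osf.io/jux3w/download` target that appears in the osf.io login redirect above. The function name `osf_download_url` is hypothetical, and whether psyarxiv.com actually serves the PDF at that path is not confirmed by these notes.

```python
from urllib.parse import urlsplit, urlunsplit


def osf_download_url(landing_url: str) -> str:
    """Rewrite an OSF-style landing URL to a guessed direct-download URL.

    Assumption: OSF-hosted preprint sites expose the file at
    <landing path>/download, as the osf.io redirect target suggests.
    """
    parts = urlsplit(landing_url)
    # Drop any trailing slash, then append the download path segment.
    path = parts.path.rstrip("/") + "/download"
    # Query string and fragment are discarded on purpose.
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


if __name__ == "__main__":
    print(osf_download_url("https://psyarxiv.com/5jaqg/"))
```

Doing the rewrite before fetching avoids having to render the Javascript landing page at all.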
Should reingest, and potentially force-recrawl:

    # about 67k
    publisher:"Center for Open Science" in_ia:false

## publons.com

     publons.com |             | 6998
     publons.com | no-pdf-link | 6982

    https://doi.org/10.1002/jmor.21338/v2/review1
    https://publons.com/publon/40260824/

These are just HTML reviews, not papers.

## saemobilus.sae.org

     saemobilus.sae.org |             | 795
     saemobilus.sae.org | no-pdf-link | 669

    https://doi.org/10.4271/as1426c
    https://saemobilus.sae.org/content/as1426c

These seem to be standards, and are not open access (paywall)

## scholar.dkyobobook.co.kr

     scholar.dkyobobook.co.kr |             | 1043
     scholar.dkyobobook.co.kr | no-pdf-link |  915

    https://doi.org/10.22471/crisis.2021.6.1.18
    http://scholar.dkyobobook.co.kr/searchDetail.laf?barcode=4010028199536

Korean. complex javascript, skipping.

## unreserved.rba.gov.au

     unreserved.rba.gov.au |             | 823
     unreserved.rba.gov.au | no-pdf-link | 821

    https://doi.org/10.47688/rba_archives_2006/04129
    https://unreserved.rba.gov.au/users/login

Don't need to login when I tried in browser? document repo, not papers.

## wayf.switch.ch

     wayf.switch.ch |             | 1169
     wayf.switch.ch | no-pdf-link |  809

    https://doi.org/10.24451/arbor.11128
    https://wayf.switch.ch/SWITCHaai/WAYF?entityID=https%3A%2F%2Farbor.bfh.ch%2Fshibboleth&return=https%3A%2F%2Farbor.bfh.ch%2FShibboleth.sso%2FLogin%3FSAMLDS%3D1%26target%3Dss%253Amem%253A5056fc0a97aeab16e5007ca63bede254cb5669d94173064d6c74c62a0f88b022

Loginwall

## www.bloomsburycollections.com

     www.bloomsburycollections.com |             | 1745
     www.bloomsburycollections.com | no-pdf-link | 1571

    https://doi.org/10.5040/9781849664264.0008
    https://www.bloomsburycollections.com/book/the-political-economies-of-media-the-transformation-of-the-global-media-industries/the-political-economies-of-media-and-the-transformation-of-the-global-media-industries

These are primarily not OA/available.
## www.emc2020.eu

     www.emc2020.eu |             | 791
     www.emc2020.eu | no-pdf-link | 748

    https://doi.org/10.22443/rms.emc2020.146
    https://www.emc2020.eu/abstract/evaluation-of-different-rectangular-scan-strategies-for-hrstem-imaging.html

These are just abstracts, not papers.

## Emerald

     www.emerald.com |             | 2420
     www.emerald.com | no-pdf-link | 1986

    https://doi.org/10.1108/ramj-11-2020-0065
    https://www.emerald.com/insight/content/doi/10.1108/RAMJ-11-2020-0065/full/html

Note that these URLs are already HTML fulltext. but the PDF is also available and easy.

re-ingest:

    # only ~3k or so missing
    doi_prefix:10.1108 publisher:emerald in_ia:false is_oa:true

## www.humankineticslibrary.com

     www.humankineticslibrary.com |             | 1122
     www.humankineticslibrary.com | no-pdf-link |  985

    https://doi.org/10.5040/9781718206625.ch-002
    https://www.humankineticslibrary.com/encyclopedia-chapter?docid=b-9781718206625&tocid=b-9781718206625-chapter2

paywall

## www.inderscience.com

     www.inderscience.com |             | 1532
     www.inderscience.com | no-pdf-link | 1217

    https://doi.org/10.1504/ijdmb.2020.10036342
    https://www.inderscience.com/info/ingeneral/forthcoming.php?jcode=ijdmb

paywall

## www.ingentaconnect.com

     www.ingentaconnect.com |             | 885
     www.ingentaconnect.com | no-pdf-link | 783

    https://doi.org/10.15258/sst.2021.49.1.07
    https://www.ingentaconnect.com/content/ista/sst/pre-prints/content-7_sst.2021.49.1_63-71;jsessionid=1joc5mmi1juht.x-ic-live-02

Annoying javascript, but easy to work around.

re-ingest:

    # only a couple hundred; also re-ingest
    doi_prefix:10.15258 in_ia:false year:>2018

## www.nomos-elibrary.de

     www.nomos-elibrary.de |                    | 2235
     www.nomos-elibrary.de | no-pdf-link        | 1128
     www.nomos-elibrary.de | spn2-wayback-error |  559

    https://doi.org/10.5771/9783748907084-439
    https://www.nomos-elibrary.de/10.5771/9783748907084-439/verzeichnis-der-autorinnen-und-autoren

Javascript obfuscated download button?

## www.oecd-ilibrary.org

     www.oecd-ilibrary.org |             | 3046
     www.oecd-ilibrary.org | no-pdf-link | 2869

    https://doi.org/10.1787/543e84ed-en
    https://www.oecd-ilibrary.org/development/applying-evaluation-criteria-thoughtfully_543e84ed-en

Paywall.
## www.osapublishing.org

     www.osapublishing.org |             | 821
     www.osapublishing.org | no-pdf-link | 615

    https://doi.org/10.1364/boe.422199
    https://www.osapublishing.org/boe/abstract.cfm?doi=10.1364/BOE.422199

Some of these are "pre-registered" DOIs, not published yet. Many of the remaining are actually HTML articles, and/or have some stuff in the `citation_pdf_url`. A core problem is captchas.

Have started adding support to fatcat for HTML crawl type based on container.

re-ingest:

    container_twtpsm6ytje3nhuqfu3pa7ca7u (optica)
    container_cg4vcsfty5dfvgmat5wm62wgie (optics express)

## www.oxfordscholarlyeditions.com

     www.oxfordscholarlyeditions.com |             | 759
     www.oxfordscholarlyeditions.com | no-pdf-link | 719

    https://doi.org/10.1093/oseo/instance.00266789
    https://www.oxfordscholarlyeditions.com/view/10.1093/actrade/9780199593668.book.1/actrade-9780199593668-div1-27

loginwall/paywall

## www.schweizerbart.de

     www.schweizerbart.de |             | 730
     www.schweizerbart.de | no-pdf-link | 653

    https://doi.org/10.1127/zfg/40/1996/461
    https://www.schweizerbart.de/papers/zfg/detail/40/97757/Theoretical_model_of_surface_karstic_processes?af=crossref

paywall

## www.sciencedirect.com

     www.sciencedirect.com |                    | 14757
     www.sciencedirect.com | no-pdf-link        | 12733
     www.sciencedirect.com | spn2-wayback-error |  1503

    https://doi.org/10.1016/j.landurbplan.2021.104104
    https://www.sciencedirect.com/science/article/pii/S0169204621000670

Bunch of crazy new hacks, but seems to be working!

re-ingest:

    # to start! about 50k
    doi_prefix:10.1016 is_oa:true year:2021

## www.sciendo.com

     www.sciendo.com |             | 1955
     www.sciendo.com | no-pdf-link | 1176

    https://doi.org/10.2478/awutm-2019-0012
    https://www.sciendo.com/article/10.2478/awutm-2019-0012

uses lots of javascript, hard to scrape.
## Others (for reference)

     |                         | 725990
     | no-pdf-link             | 209933
     | success                 | 206134
     | spn2-wayback-error      | 127015
     | spn2-cdx-lookup-failure |  53384
     | blocked-cookie          |  35867
     | link-loop               |  25834
     | too-many-redirects      |  16430
     | redirect-loop           |  14648
     | forbidden               |  13794
     | terminal-bad-status     |   8055
     | not-found               |   6399
     | remote-server-error     |   2402
     | wrong-mimetype          |   2011
     | spn2-error:unauthorized |    912
     | bad-redirect            |    555
     | read-timeout            |    530

## Re-ingests

All the above combined:

    container_twtpsm6ytje3nhuqfu3pa7ca7u (optica)
    container_cg4vcsfty5dfvgmat5wm62wgie (optics express)

    ./fatcat_ingest.py --env prod --enqueue-kafka --kafka-hosts wbgrp-svc263.us.archive.org --ingest-type html container --container-id twtpsm6ytje3nhuqfu3pa7ca7u
    => Counter({'ingest_request': 1142, 'elasticsearch_release': 1142, 'estimate': 1142, 'kafka': 1142})

    ./fatcat_ingest.py --env prod --enqueue-kafka --kafka-hosts wbgrp-svc263.us.archive.org --ingest-type html container --container-id cg4vcsfty5dfvgmat5wm62wgie
    => Counter({'elasticsearch_release': 33482, 'estimate': 33482, 'ingest_request': 32864, 'kafka': 32864})

    # only ~800 works
    doi_prefix:10.35256 publisher:Florida

    ./fatcat_ingest.py --env prod --enqueue-kafka --kafka-hosts wbgrp-svc263.us.archive.org --allow-non-oa query "doi_prefix:10.35256 publisher:Florida"
    => Counter({'ingest_request': 843, 'elasticsearch_release': 843, 'estimate': 843, 'kafka': 843})

    # only ~3k or so missing
    doi_prefix:10.1108 publisher:emerald in_ia:false is_oa:true

    ./fatcat_ingest.py --env prod --enqueue-kafka --kafka-hosts wbgrp-svc263.us.archive.org query "doi_prefix:10.1108 publisher:emerald"
    => Counter({'ingest_request': 3812, 'elasticsearch_release': 3812, 'estimate': 3812, 'kafka': 3812})

    # only a couple hundred; also re-ingest
    doi_prefix:10.15258 in_ia:false year:>2018

    ./fatcat_ingest.py --env prod --enqueue-kafka --kafka-hosts wbgrp-svc263.us.archive.org --allow-non-oa --force-recrawl query "doi_prefix:10.15258 year:>2018"
    => Counter({'ingest_request': 140, 'elasticsearch_release': 140, 'estimate': 140, 'kafka': 140})

    # to start! about 50k
    doi_prefix:10.1016 is_oa:true year:2020
    doi_prefix:10.1016 is_oa:true year:2021

    ./fatcat_ingest.py --env prod --enqueue-kafka --kafka-hosts wbgrp-svc263.us.archive.org query "doi_prefix:10.1016 year:2020"
    => Counter({'ingest_request': 75936, 'elasticsearch_release': 75936, 'estimate': 75936, 'kafka': 75936})

    ./fatcat_ingest.py --env prod --enqueue-kafka --kafka-hosts wbgrp-svc263.us.archive.org query "doi_prefix:10.1016 year:2021"
    => Counter({'ingest_request': 54824, 'elasticsearch_release': 54824, 'estimate': 54824, 'kafka': 54824})

    pmcid:* year:2018
    pmcid:* year:2019

    ./fatcat_ingest.py --env prod --enqueue-kafka --kafka-hosts wbgrp-svc263.us.archive.org --force-recrawl query "pmcid:* year:2018"
    => Counter({'ingest_request': 25366, 'elasticsearch_release': 25366, 'estimate': 25366, 'kafka': 25366})

    ./fatcat_ingest.py --env prod --enqueue-kafka --kafka-hosts wbgrp-svc263.us.archive.org --force-recrawl query "pmcid:* year:2019"
    => Counter({'ingest_request': 55658, 'elasticsearch_release': 55658, 'estimate': 55658, 'kafka': 55658})
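To see how many requests the re-ingest runs above enqueued in total, the `=> Counter(...)` lines can be parsed and summed. A minimal sketch, assuming the lines are pasted in by hand; `parse_counter_line` is a hypothetical helper, and the numbers are copied from the output recorded above:

```python
import ast
import re
from collections import Counter


def parse_counter_line(line: str) -> Counter:
    """Parse one '=> Counter({...})' output line into a Counter."""
    match = re.search(r"Counter\((\{.*\})\)", line)
    if match is None:
        raise ValueError(f"no Counter(...) found in: {line!r}")
    # literal_eval only accepts Python literals, so this is safe on pasted text.
    return Counter(ast.literal_eval(match.group(1)))


# Output lines copied from the re-ingest runs recorded in these notes.
RUNS = [
    "=> Counter({'ingest_request': 1142, 'elasticsearch_release': 1142, 'estimate': 1142, 'kafka': 1142})",
    "=> Counter({'elasticsearch_release': 33482, 'estimate': 33482, 'ingest_request': 32864, 'kafka': 32864})",
    "=> Counter({'ingest_request': 843, 'elasticsearch_release': 843, 'estimate': 843, 'kafka': 843})",
    "=> Counter({'ingest_request': 3812, 'elasticsearch_release': 3812, 'estimate': 3812, 'kafka': 3812})",
    "=> Counter({'ingest_request': 140, 'elasticsearch_release': 140, 'estimate': 140, 'kafka': 140})",
    "=> Counter({'ingest_request': 75936, 'elasticsearch_release': 75936, 'estimate': 75936, 'kafka': 75936})",
    "=> Counter({'ingest_request': 54824, 'elasticsearch_release': 54824, 'estimate': 54824, 'kafka': 54824})",
    "=> Counter({'ingest_request': 25366, 'elasticsearch_release': 25366, 'estimate': 25366, 'kafka': 25366})",
    "=> Counter({'ingest_request': 55658, 'elasticsearch_release': 55658, 'estimate': 55658, 'kafka': 55658})",
]

# Counter addition sums matching keys across all runs.
TOTAL = sum((parse_counter_line(line) for line in RUNS), Counter())

if __name__ == "__main__":
    for key, value in sorted(TOTAL.items()):
        print(f"{key}: {value}")
```

Only the optics express run shows `ingest_request` below `estimate`, so the summed totals differ between those two keys by that run's gap.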