## HTML `html-resource-no-capture` Fixes
Tracing down some `html-resource-no-capture` issues. Eg, `javascript:` resources causing errors.
SQL query:
select * from ingest_file_result where ingest_type = 'html' and status = 'html-resource-no-capture' limit 100;
select * from ingest_file_result where ingest_type = 'html' and status = 'html-resource-no-capture' order by random() limit 100;
select count(*) from ingest_file_result where ingest_type = 'html' and status = 'html-resource-no-capture';
=> 210,528
http://agroengineering.it/index.php/jae/article/view/568/609
- old capture, from `20171017204935`
- missing .css file; seems like an actual case of missing content?
- TODO: re-crawl/re-ingest when CDX is old
https://www.karger.com/Article/FullText/484130
- missing: https://www.karger.com/WebMaterial/ShowThumbnail/895999?imgType=2
- resource is live
- this was from DOI-LANDING crawl, no resources captured
- TODO: re-crawl
https://www.mdpi.com/1996-1073/13/21/5563/htm
- missing: https://www.mdpi.com/1996-1073/13/21/5563/htm
- common crawl capture; no/few resources?
- TODO: re-crawl
http://www.scielo.br/scielo.php?script=sci_arttext&pid=S0100-736X2013000500011&lng=en&tlng=en
- missing: http://www.scielo.br/img/revistas/pvb/v33n5/a11tab01.jpg
not on live web
- old (2013) wide crawl
- TODO: re-crawl
http://g3journal.org/lookup/doi/10.1534/g3.116.027730
- missing: http://www.g3journal.org/sites/default/files/highwire/ggg/6/8/2553/embed/mml-math-4.gif
- old 2018 landing crawl (no resources)
- TODO: re-crawl
https://www.frontiersin.org/articles/10.3389/fimmu.2020.576134/full
- "error_message": "revisit record missing URI and/or DT: warc:abc.net.au-news-20220328-130654/IA-FOC-abc.net.au-news-20220618135308-00003.warc.gz offset:768320762"
- specific URL: https://www.frontiersin.org/areas/articles/js/app?v=uC9Es8wJ9fbTy8Rj4KipiyIXvhx7XEVhCTHvIrM4ShA1
- archiveteam crawl
- seems like a weird corner case. look at more 'frontiersin' articles, and re-crawl this page
https://www.frontiersin.org/articles/10.3389/fonc.2020.01386/full
- WORKING
https://doi.org/10.4000/trajectoires.2317
- redirect: https://journals.openedition.org/trajectoires/2317
- missing: "https://journals.openedition.org/trajectoires/Ce fichier n'existe pas" (note spaces)
- FIXED
http://www.scielosp.org/scielo.php?script=sci_arttext&pid=S1413-81232002000200008&lng=en&tlng=en
- WORKING
https://f1000research.com/articles/9-571/v2
- petabox-error on 'https://www.recaptcha.net/recaptcha/api.js'
- added recaptcha.net to blocklist
- still needs a re-crawl
- SPN capture, from 2020, but images were missing?
- re-capture has images (though JS still wonky)
- TODO: re-crawl with SPN2
http://bio.biologists.org/content/4/9/1163
- DOI LANDING crawl, no sub-resources
- TODO: recrawl
http://err.ersjournals.com/content/26/145/170039.full
- missing: http://err.ersjournals.com/sites/default/files/highwire/errev/26/145/170039/embed/graphic-5.gif
on live web
- 2017 targetted heritrix crawl
- TODO: recrawl
http://www.dovepress.com/synthesis-characterization-and-antimicrobial-activity-of-an-ampicillin-peer-reviewed-article-IJN
- missing: https://www.dovepress.com/cr_data/article_fulltext/s61000/61143/img/IJN-61143-F02-Thumb.jpg
- recent archiveteam crawl
- TODO: recrawl
http://journals.ed.ac.uk/lithicstudies/article/view/1444
- missing: http://journals.ed.ac.uk/lithicstudies/article/download/1444/2078/6081
- common crawl
- TODO: recrawl
http://medisan.sld.cu/index.php/san/article/view/495
- missing: http://ftp.scu.sld.cu/galen/medisan/logos/redib.jpg
- this single resource is legit missing
seems like it probably isn't a bad idea to just re-crawl all of these with fresh SPNv2 requests
request sources:
- fatcat-changelog (doi)
- fatcat-ingest (doi)
- doaj
COPY (
SELECT row_to_json(ingest_request.*)
FROM ingest_request
LEFT JOIN ingest_file_result
ON ingest_file_result.ingest_type = ingest_request.ingest_type
AND ingest_file_result.base_url = ingest_request.base_url
WHERE
ingest_request.ingest_type = 'html'
AND ingest_file_result.status = 'html-resource-no-capture'
AND (
ingest_request.link_source = 'doi'
OR ingest_request.link_source = 'doaj'
)
) TO '/srv/sandcrawler/tasks/retry_html_resourcenocapture.2022-07-15.rows.json';
=> COPY 210749
./scripts/ingestrequest_row2json.py --force-recrawl /srv/sandcrawler/tasks/retry_html_resourcenocapture.2022-07-15.rows.json > /srv/sandcrawler/tasks/retry_html_resourcenocapture.2022-07-15.json
Try a sample of 300:
shuf -n300 /srv/sandcrawler/tasks/retry_html_resourcenocapture.2022-07-15.json | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc350.us.archive.org -t sandcrawler-prod.ingest-file-requests-daily -p -1
Seeing a bunch of:
["doaj","wayback-content-error","https://www.frontiersin.org/article/10.3389/fphys.2020.00454/full","https://www.frontiersin.org/articles/10.3389/fphys.2020.00454/full","revisit record missing URI and/or DT: warc:foxnews.com-20220402-051934/IA-FOC-foxnews.com-20220712070651-00000.warc.gz offset:937365431"]
["doaj","wayback-content-error","https://www.frontiersin.org/article/10.3389/fmicb.2019.02507/full","https://www.frontiersin.org/articles/10.3389/fmicb.2019.02507/full","revisit record missing URI and/or DT: warc:foxnews.com-20220402-051934/IA-FOC-foxnews.com-20220712070651-00000.warc.gz offset:937365431"]
["doaj","wayback-content-error","https://www.mdpi.com/2218-1989/10/9/366","https://www.mdpi.com/2218-1989/10/9/366/htm","revisit record missing URI and/or DT: warc:foxnews.com-20220402-051934/IA-FOC-foxnews.com-20220712070651-00000.warc.gz offset:964129887"]
"error_message": "revisit record missing URI and/or DT: warc:online.wsj.com-home-page-20220324-211958/IA-FOC-online.wsj.com-home-page-20220716075018-00001.warc.gz offset:751923069",
["doaj","wayback-content-error","https://www.frontiersin.org/article/10.3389/fnins.2020.00724/full","https://www.frontiersin.org/articles/10.3389/fnins.2020.00724/full","wayback payload sha1hex mismatch: 20220715222216 https://static.frontiersin.org/areas/articles/js/app?v=DfnFHSIgqDJBKQy2bbQ2S8vWyHe2dEMZ1Lg9o6vSS1g1"]
These seem to be transfer encoding issues; fixed?
["doaj","html-resource-no-capture","http://www.scielosp.org/scielo.php?script=sci_arttext&pid=S0021-25712013000400003&lng=en&tlng=en","https://scielosp.org/article/aiss/2013.v49n4/336-339/en/","HTML sub-resource not found: https://ssm.scielo.org/media/assets/css/scielo-print.css"]
Full batch:
# TODO: cat /srv/sandcrawler/tasks/retry_html_resourcenocapture.2022-07-15.json | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc350.us.archive.org -t sandcrawler-prod.ingest-file-requests-daily -p -1
Not running the full batch for now, because there are almost all `wayback-content-error` issues.
cat /srv/sandcrawler/tasks/retry_html_resourcenocapture.2022-07-15.json | rg -v frontiersin.org | wc -l
114935
cat /srv/sandcrawler/tasks/retry_html_resourcenocapture.2022-07-15.json | rg -v frontiersin.org | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc350.us.archive.org -t sandcrawler-prod.ingest-file-requests-daily -p -1
## Redirect Loops
Seems like there might have been a bug in how ingest pipeline dealt with
multiple redirects (eg, 301 to 302 or vice-versa), due to how CDX lookups and
normalization was happening.
This could be a really big deal because we have over 11 million such ingest
requests! and may even have stopped crawling domains on the basis of redirect
looping.
select * from ingest_file_result where ingest_type = 'pdf' and status = 'redirect-loop' limit 50;
http://ieeexplore.ieee.org/iel7/7259950/7275573/07275755.pdf
- 'skip-url-blocklist'
- paywall on live web
http://www.redjournal.org/article/S0360301616308276/pdf
- redirect to 'secure.jbs.elsevierhealth.com'
- ... but re-crawling with SPNv2 worked
- TODO: reingest this entire journal with SPNv2
http://www.jmirs.org/article/S1939865415001551/pdf
- blocked-cookie (secure.jbs.elsevierhealth.com)
- RECRAWL: success
http://www.cell.com/article/S0006349510026147/pdf
- blocked-cookie (secure.jbs.elsevierhealth.com)
- TODO: try SPNv2?
- RECRAWL: success
http://infoscience.epfl.ch/record/256431/files/SPL_2018.pdf
- FIXED: success
http://www.nature.com/articles/hdy1994143.pdf
- blocked-cookie (idp.nature.com / cookies_not_supported)
- RECRAWL: gateway-timeout
http://www.thelancet.com/article/S0140673619327606/pdf
- blocked-cookie (secure.jbs.elsevierhealth.com)
- RECRAWL: success
https://pure.mpg.de/pubman/item/item_2065970_2/component/file_2065971/Haase_2014.pdf
- FIXED: success
http://hdl.handle.net/21.11116/0000-0001-B1A2-F
- FIXED: success
http://repositorio.ufba.br/ri/bitstream/ri/6072/1/%2858%29v21n6a03.pdf
- FIXED: success
http://www.jto.org/article/S1556086416329999/pdf
- blocked-cookie (secure.jbs.elsevierhealth.com)
- RECRAWL spn2: success
http://www.jahonline.org/article/S1054139X16303020/pdf
- blocked-cookie (secure.jbs.elsevierhealth.com)
- RECRAWL spn2: success
So, wow wow wow, a few things to do here:
- just re-try all these redirect-loop attempts to update status
- re-ingest all these elsevierhealth blocked crawls with SPNv2. this could take a long time!
Possibly the elsevierhealth stuff will require some deeper fiddling to crawl
correctly.
COPY (
SELECT row_to_json(ingest_request.*)
FROM ingest_request
LEFT JOIN ingest_file_result
ON ingest_file_result.ingest_type = ingest_request.ingest_type
AND ingest_file_result.base_url = ingest_request.base_url
WHERE
ingest_file_result.status = 'redirect-loop'
-- AND ingest_request.ingest_type = 'pdf'
AND (
ingest_request.link_source = 'doi'
OR ingest_request.link_source = 'doaj'
OR ingest_request.link_source = 'unpaywall'
)
) TO '/srv/sandcrawler/tasks/retry_redirectloop.2022-07-15.rows.json';
=> COPY 6611342
./scripts/ingestrequest_row2json.py /srv/sandcrawler/tasks/retry_redirectloop.2022-07-15.rows.json > /srv/sandcrawler/tasks/retry_redirectloop.2022-07-15.json
Start with a sample:
shuf -n200 /srv/sandcrawler/tasks/retry_redirectloop.2022-07-15.json | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc350.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
Wow that is a lot of ingest! And a healthy fraction of 'success', almost all
via unpaywall (maybe should have done DOAJ/DOI only first). Let's do this full
batch:
cat /srv/sandcrawler/tasks/retry_redirectloop.2022-07-15.json | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc350.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
TODO: repeat with broader query (eg, OAI-PMH, MAG, etc).
## Other
Revist resolution failed: \"Didn't get exact CDX url/datetime match. url:https://www.cairn.info/static/images//logo/logo-cairn-negatif.png dt:20220430145322 got:CdxRow(surt='info,cairn)/static/images/logo/logo-cairn-negatif.png', datetime='20220430145322', url='https://www.cairn.info/static/images/logo/logo-cairn-negatif.png', mimetype='image/png', status_code=200, sha1b32='Y3VQOPO2NFUR2EUWNXLYGYGNZPZLQYHU', sha1hex='c6eb073dda69691d12966dd78360cdcbf2b860f4', warc_csize=10875, warc_offset=2315284914, warc_path='archiveteam_archivebot_go_20220430212134_59230631/old.worldurbancampaign.org-inf-20220430-140628-acnq5-00000.warc.gz')\""
https://www.cairn.info/static/images//logo/logo-cairn-negatif.png 20220430145322
https://www.cairn.info/static/images/logo/logo-cairn-negatif.png 20220430145322
Fixed!
## Broken WARC Record?
cdx line:
net,cloudfront,d1bxh8uas1mnw7)/assets/embed.js 20220716084026 https://d1bxh8uas1mnw7.cloudfront.net/assets/embed.js warc/revisit - U5E5UA6DS5GGCHJ2IZSOIEGPN6P64JRB - - 660 751923069 online.wsj.com-home-page-20220324-211958/IA-FOC-online.wsj.com-home-page-20220716075018-00001.warc.gz
download WARC and run:
zcat IA-FOC-online.wsj.com-home-page-20220716075018-00001.warc.gz | rg d1bxh8uas1mnw7.cloudfront.net/assets/embed.js -a -C 20
the WARC record:
WARC/1.0
WARC-Type: revisit
WARC-Target-URI: https://d1bxh8uas1mnw7.cloudfront.net/assets/embed.js
WARC-Date: 2022-07-16T08:40:26Z
WARC-Payload-Digest: sha1:U5E5UA6DS5GGCHJ2IZSOIEGPN6P64JRB
WARC-IP-Address: 13.227.21.220
WARC-Profile: http://netpreserve.org/warc/1.0/revisit/identical-payload-digest
WARC-Truncated: length
WARC-Record-ID:
Content-Type: application/http; msgtype=response
Content-Length: 493
HTTP/1.1 200 OK
Content-Type: application/javascript
Content-Length: 512
Connection: close
Last-Modified: Fri, 22 Apr 2022 08:45:38 GMT
Accept-Ranges: bytes
Server: AmazonS3
Date: Fri, 15 Jul 2022 16:36:08 GMT
ETag: "1c28db48d4012f0221b63224a3bb7137"
Vary: Accept-Encoding
X-Cache: Hit from cloudfront
Via: 1.1 5b475307685b5cecdd0df414286f5438.cloudfront.net (CloudFront)
X-Amz-Cf-Pop: SFO20-C1
X-Amz-Cf-Id: SIRR_1LT8mkp3QVaiGYttPuomxyDfJ-vB6dh0Slg_qqyW0_WwnA1eg==
Age: 57859
where are the `WARC-Refers-To-Target-URI` and `WARC-Refers-To-Date` lines?
## osf.io
select status, terminal_status_code, count(*) from ingest_file_result where base_url LIKE 'https://doi.org/10.17605/osf.io/%' and ingest_type = 'pdf' group by status, terminal_status_code order by count(*) desc limit 30;
status | terminal_status_code | count
-------------------------+----------------------+-------
terminal-bad-status | 404 | 92110
no-pdf-link | 200 | 46932
not-found | 200 | 20212
no-capture | | 8599
success | 200 | 7604
redirect-loop | 301 | 2125
terminal-bad-status | 503 | 1657
cdx-error | | 1301
wrong-mimetype | 200 | 901
terminal-bad-status | 410 | 364
read-timeout | | 167
wayback-error | | 142
gateway-timeout | | 139
terminal-bad-status | 500 | 76
spn2-error | | 63
spn2-backoff | | 42
petabox-error | | 39
spn2-backoff | 200 | 27
redirect-loop | 302 | 19
terminal-bad-status | 400 | 15
terminal-bad-status | 401 | 15
remote-server-error | | 14
timeout | | 11
terminal-bad-status | | 11
petabox-error | 200 | 10
empty-blob | 200 | 8
null-body | 200 | 6
spn2-error:unknown | | 5
redirect-loop | 308 | 4
spn2-cdx-lookup-failure | | 4
(30 rows)
Many of these are now non-existant, or datasets/registrations not articles.
Hrm.
## Large DOAJ no-pdf-link Domains
SELECT
substring(ingest_file_result.terminal_url FROM '[^/]+://([^/]*)') AS domain,
COUNT(*)
FROM ingest_request
LEFT JOIN ingest_file_result ON
ingest_request.ingest_type = ingest_file_result.ingest_type
AND ingest_request.base_url = ingest_file_result.base_url
WHERE
ingest_file_result.status = 'no-pdf-link'
AND ingest_request.link_source = 'doaj'
GROUP BY
domain
ORDER BY
COUNT(*) DESC
LIMIT 50;
domain | count
-------------------------------------------------------+--------
www.sciencedirect.com | 211090
auth.openedition.org | 20741
journal.frontiersin.org:80 | 11368
journal.frontiersin.org | 6494
ejde.math.txstate.edu | 4301
www.arkat-usa.org | 4001
www.scielo.br | 3736
www.lcgdbzz.org | 2892
revistas.uniandes.edu.co | 2715
scielo.sld.cu | 2612
www.egms.de | 2488
journals.lww.com | 2415
ter-arkhiv.ru | 2239
www.kitlv-journals.nl | 2076
www.degruyter.com | 2061
jwcn-eurasipjournals.springeropen.com | 1929
www.cjcnn.org | 1908
www.aimspress.com | 1885
vsp.spr-journal.ru | 1873
dx.doi.org | 1648
www.dlib.si | 1582
aprendeenlinea.udea.edu.co | 1548
www.math.u-szeged.hu | 1448
dergipark.org.tr | 1444
revistas.uexternado.edu.co | 1429
learning-analytics.info | 1419
drive.google.com | 1399
www.scielo.cl | 1326
www.economics-ejournal.org | 1267
www.jssm.org | 1240
html.rhhz.net | 1232
journalofinequalitiesandapplications.springeropen.com | 1214
revistamedicina.net | 1197
filclass.ru | 1154
ceramicayvidrio.revistas.csic.es | 1152
gynecology.orscience.ru | 1126
www.tobaccoinduceddiseases.org | 1090
www.tandfonline.com | 1046
www.querelles-net.de | 1038
www.swjpcc.com | 1032
microbiologyjournal.org | 1028
revistas.usal.es | 1027
www.medwave.cl | 1023
ijtech.eng.ui.ac.id | 1023
www.scielo.sa.cr | 1021
vestnik.szd.si | 986
www.biomedcentral.com:80 | 984
scielo.isciii.es | 983
bid.ub.edu | 970
www.meirongtv.com | 959
(50 rows)
select base_url from ingest_file_result where ingest_type = 'pdf' and status = 'no-pdf-link' and terminal_url like 'https://ejde.math.txstate.edu%' limit 5;
http://ejde.math.txstate.edu/Volumes/2018/30/abstr.html
http://ejde.math.txstate.edu/Volumes/2012/137/abstr.html
http://ejde.math.txstate.edu/Volumes/2016/268/abstr.html
http://ejde.math.txstate.edu/Volumes/2015/194/abstr.html
http://ejde.math.txstate.edu/Volumes/2014/43/abstr.html
# plain HTML, not really parse-able
select base_url from ingest_file_result where ingest_type = 'pdf' and status = 'no-pdf-link' and terminal_url like 'https://www.arkat-usa.org%' limit 5;
https://www.arkat-usa.org/arkivoc-journal/browse-arkivoc/ark.5550190.0006.913
https://www.arkat-usa.org/arkivoc-journal/browse-arkivoc/ark.5550190.0013.909
https://www.arkat-usa.org/arkivoc-journal/browse-arkivoc/ark.5550190.0007.717
https://www.arkat-usa.org/arkivoc-journal/browse-arkivoc/ark.5550190.p008.158
https://www.arkat-usa.org/arkivoc-journal/browse-arkivoc/ark.5550190.0014.216
# fixed (embed PDF)
select base_url from ingest_file_result where ingest_type = 'pdf' and status = 'no-pdf-link' and terminal_url like 'https://www.scielo.br%' limit 5;
https://doi.org/10.5935/0034-7280.20200075
https://doi.org/10.5935/0004-2749.20200071
https://doi.org/10.5935/0034-7280.20200035
http://www.scielo.br/scielo.php?script=sci_arttext&pid=S1516-44461999000400014
https://doi.org/10.5935/0034-7280.20200047
# need recrawls?
# then success
select base_url from ingest_file_result where ingest_type = 'pdf' and status = 'no-pdf-link' and terminal_url like 'https://www.lcgdbzz.org%' limit 5;
select base_url from ingest_file_result where ingest_type = 'pdf' and status = 'no-pdf-link' and terminal_url like 'https://revistas.uniandes.edu.co%' limit 5;
select base_url from ingest_file_result where ingest_type = 'pdf' and status = 'no-pdf-link' and terminal_url like 'https://scielo.sld.cu%' limit 5;
select base_url from ingest_file_result where ingest_type = 'pdf' and status = 'no-pdf-link' and terminal_url like 'https://www.egms.de%' limit 5;
https://doi.org/10.3205/16dgnc020
http://nbn-resolving.de/urn:nbn:de:0183-19degam1126
http://www.egms.de/en/meetings/dgpraec2019/19dgpraec032.shtml
http://www.egms.de/en/meetings/dkou2019/19dkou070.shtml
http://nbn-resolving.de/urn:nbn:de:0183-20nrwgu625
# mostly abstracts, don't have PDF versions
select base_url from ingest_file_result where ingest_type = 'pdf' and status = 'no-pdf-link' and terminal_url like 'https://ter-arkhiv.ru%' limit 5;
https://doi.org/10.26442/terarkh201890114-47
https://doi.org/10.26442/00403660.2019.12.000206
https://journals.eco-vector.com/0040-3660/article/download/32246/pdf
https://journals.eco-vector.com/0040-3660/article/download/33578/pdf
https://doi.org/10.26442/00403660.2019.12.000163
# working, needed recrawls (some force re-crawls)
select base_url from ingest_file_result where ingest_type = 'pdf' and status = 'no-pdf-link' and terminal_url like 'https://www.kitlv-journals.nl%' limit 5;
select base_url from ingest_file_result where ingest_type = 'pdf' and status = 'no-pdf-link' and terminal_url like 'https://www.cjcnn.org%' limit 5;
select base_url from ingest_file_result where ingest_type = 'pdf' and status = 'no-pdf-link' and terminal_url like 'https://www.dlib.si%' limit 5;
https://srl.si/ojs/srl/article/view/2910
https://srl.si/ojs/srl/article/view/3640
https://srl.si/ojs/srl/article/view/2746
https://srl.si/ojs/srl/article/view/2557
https://srl.si/ojs/srl/article/view/2583
# fixed? (dlib.si)
select base_url from ingest_file_result where ingest_type = 'pdf' and status = 'no-pdf-link' and terminal_url like 'https://www.jssm.org%' limit 5;
http://www.jssm.org/vol4/n4/8/v4n4-8text.php
http://www.jssm.org/vol7/n1/19/v7n1-19text.php
http://www.jssm.org/vol9/n3/10/v9n3-10text.php
http://www.jssm.org/abstresearcha.php?id=jssm-14-347.xml
http://www.jssm.org/vol7/n2/11/v7n2-11text.php
# works as an HTML document? otherwise hard to select on PDF link
select base_url from ingest_file_result where ingest_type = 'pdf' and status = 'no-pdf-link' and terminal_url like 'https://filclass.ru%' limit 5;
https://filclass.ru/en/archive/2018/2-52/the-chronicle-of-domestic-literary-criticism
https://filclass.ru/en/archive/2015/42/training-as-an-effective-form-of-preparation-for-the-final-essay
https://filclass.ru/en/archive/2020/vol-25-3/didaktizatsiya-literatury-rossijskikh-nemtsev-zanyatie-po-poeme-viktora-klyajna-jungengesprach
https://filclass.ru/en/archive/2015/40/the-communicative-behaviour-of-the-russian-intelligentsia-and-its-reflection-in-reviews-as-a-genre-published-in-online-literary-journals-abroad
https://filclass.ru/en/archive/2016/46/discoursive-means-of-implication-of-instructive-components-within-the-anti-utopia-genre
# fixed
# TODO: XXX: re-crawl/ingest
select base_url from ingest_file_result where ingest_type = 'pdf' and status = 'no-pdf-link' and terminal_url like 'https://microbiologyjournal.org%' limit 5;
https://microbiologyjournal.org/the-relationship-between-the-type-of-infection-and-antibiotic-resistance/
https://microbiologyjournal.org/antimicrobial-resistant-shiga-toxin-producing-escherichia-coli-isolated-from-ready-to-eat-meat-products-and-fermented-milk-sold-in-the-formal-and-informal-sectors-in-harare-zimbabwe/
https://microbiologyjournal.org/emerging-antibiotic-resistance-in-mycoplasma-microorganisms-designing-effective-and-novel-drugs-therapeutic-targets-current-knowledge-and-futuristic-prospects/
https://microbiologyjournal.org/microbiological-and-physicochemicalpropertiesofraw-milkproduced-from-milking-to-delivery-to-milk-plant/
https://microbiologyjournal.org/association-of-insulin-based-insulin-resistance-with-liver-biomarkers-in-type-2-diabetes-mellitus/
# HTML article, no PDF
# ... but only sometimes
select base_url from ingest_file_result where ingest_type = 'pdf' and status = 'no-pdf-link' and terminal_url like 'https://www.medwave.cl%' limit 5;
http://www.medwave.cl/link.cgi/Medwave/Perspectivas/Cartas/6878
https://www.medwave.cl/link.cgi/Medwave/Revisiones/RevisionClinica/8037.act
http://dx.doi.org/10.5867/medwave.2012.03.5332
https://www.medwave.cl/link.cgi/Medwave/Estudios/Casos/7683.act
http://www.medwave.cl/link.cgi/Medwave/Revisiones/CAT/5964
# HTML article, no PDF
Re-ingest HTML:
https://fatcat.wiki/container/mafob4ewkzczviwipyul7knndu (DONE)
https://fatcat.wiki/container/6rgnsrp3rnexdoks3bxcmbleda (DONE)
Re-ingest PDF:
doi_prefix:10.5935 (DONE)
doi_prefix:10.26442
## More Scielo
More scielo? `doi_prefix:10.5935 in_ia:false`
http://revistaadmmade.estacio.br/index.php/reeduc/article/view/1910/47965873
# OJS? fixed
https://revistas.unicentro.br/index.php/repaa/article/view/2667/2240
# working, but needed re-crawl
http://www.rbcp.org.br/details/2804/piezoelectric-preservative-rhinoplasty--an-alternative-approach-for-treating-bifid-nose-in-tessier-no--0-facial-cleft
A few others, mostly now working
## Recent OA DOIs
fatcat-cli search release 'is_oa:true (type:article-journal OR type:article OR type:paper-conference) !doi_prefix:10.5281 !doi_prefix:10.6084 !doi_prefix:10.48550 !doi_prefix:10.25446 !doi_prefix:10.25384 doi:* date:>2022-06-15 date:<2022-07-15 in_ia:false !publisher_type:big5' --index-json --limit 0 | pv -l > recent_missing_oa.json
wc -l recent_missing_oa.json
24433
cat recent_missing_oa.json | jq .doi_prefix -r | sort | uniq -c | sort -nr | head
4968 10.3390
1261 10.1080
687 10.23668
663 10.1021
472 10.1088
468 10.4000
367 10.3917
357 10.1364
308 10.4230
303 10.17863
cat recent_missing_oa.json | jq .doi_registrar -r | sort | uniq -c | sort -nr
19496 crossref
4836 datacite
101 null
cat recent_missing_oa.json | jq .publisher_type -r | sort | uniq -c | sort -nr
9575 longtail
8419 null
3861 society
822 unipress
449 oa
448 scielo
430 commercial
400 repository
22 other
7 archive
cat recent_missing_oa.json | jq .publisher -r | sort | uniq -c | sort -nr | head
4871 MDPI AG
1107 Informa UK (Taylor & Francis)
665 EAG-Publikationen
631 American Chemical Society
451 IOP Publishing
357 The Optical Society
347 OpenEdition
309 CAIRN
308 Schloss Dagstuhl - Leibniz-Zentrum für Informatik
303 Apollo - University of Cambridge Repository
cat recent_missing_oa.json | jq .container_name -r | sort | uniq -c | sort -nr | head
4908 null
378 Sustainability
327 ACS Omega
289 Optics Express
271 International Journal of Environmental Research and Public Health
270 International Journal of Health Sciences
238 Sensors
223 International Journal of Molecular Sciences
207 Molecules
193 Proceedings of the National Academy of Sciences of the United States of America
cat recent_missing_oa.json \
| rg -v "(MDPI|Informa UK|American Chemical Society|IOP Publishing|CAIRN|OpenEdition)" \
| wc -l
16558
cat recent_missing_oa.json | rg -i mdpi | shuf -n10 | jq .doi -r
10.3390/molecules27144419
=> was a 404
=> recrawl was successful
10.3390/math10142398
=> was a 404
10.3390/smartcities5030039
=> was a 404
Huh, we need to re-try/re-crawl MDPI URLs every week or so? Or special-case this situation.
Could be just a fatcat script, or a sandcrawler query.
cat recent_missing_oa.json \
| rg -v "(MDPI|Informa UK|American Chemical Society|IOP Publishing|CAIRN|OpenEdition)" \
| shuf -n10 | jq .doi -r
https://doi.org/10.18452/24860
=> success (just needed quarterly retry?)
=> b8c6c86aebd6cd2d85515441bbce052bcff033f2 (not in fatcat.wiki)
=> current status is "bad-redirect"
https://doi.org/10.26181/20099540.v1
=> success
=> 3f9b1ff2a09f3ea9051dbbef277579e8a0b4df30
=> this is figshare, and versioned. PDF was already attached to another DOI: https://doi.org/10.26181/20099540
https://doi.org/10.4230/lipics.sea.2022.22
=> there is a bug resulting in trailing slash in `citation_pdf_url`
=> fixed as a quirks mode
=> emailed to report
https://doi.org/10.3897/aca.5.e89679
=> success
=> e6fd1e066c8a323dc56246631748202d5fb48808
=> current status is 'bad-redirect'
https://doi.org/10.1103/physrevd.105.115035
=> was 404
=> success after force-recrawl of the terminal URL (not base URL)
https://doi.org/10.1155/2022/4649660
=> was 404
=> success after force-recrawl (of base_url)
https://doi.org/10.1090/spmj/1719
=> paywall (not actually OA)
=> https://fatcat.wiki/container/x6jfhegb3fbv3bcbqn2i3espiu is on Szczepanski list, but isn't all OA?
https://doi.org/10.1139/as-2022-0011
=> was no-pdf-link
=> fixed fulltext URL extraction
=> still needed to re-crawl terminal PDF link? hrm
https://doi.org/10.31703/grr.2022(vii-ii).02
=> was no-pdf-link
=> fixed! success
https://doi.org/10.1128/spectrum.00154-22
=> was 404
=> now repeatably 503, via SPN
https://doi.org/10.51601/ijersc.v3i3.393
=> 503 server error
https://doi.org/10.25416/ntr.20137379.v1
=> is figshare
=> docx (not PDF)
https://doi.org/10.25394/pgs.20263698.v1
=> figshare
=> embargo'd
https://doi.org/10.24850/j-tyca-14-4-7
=> was no-pdf-link
=> docs.google.com/viewer (!)
=> now handle this (success)
https://doi.org/10.26267/unipi_dione/1832
=> was bad-redirect
=> success
https://doi.org/10.25560/98019
=> body-too-large
=> also, PDF metadata fails to parse
=> is actually like 388 MByte
https://doi.org/10.14738/abr.106.12511
=> max-hops-exceeded
=> bumped max-hops from 6 to 8
=> then success (via google drive)
https://doi.org/10.24350/cirm.v.19933803
=> video, not PDF
https://doi.org/10.2140/pjm.2022.317.67
=> link-loop
=> not actually OA
https://doi.org/10.26265/polynoe-2306
=> was bad-redirect
=> now success
https://doi.org/10.3389/fpls.2022.826875
=> frontiers
=> was terminal-bad-status (403)
=> success on retry (not sure why)
=> maybe this is also a date-of-publication thing?
=> not sure all these should be retried though
https://doi.org/10.14198/medcom.22240
=> was terminal-bad-status (404)
=> force-recrawl resulted in an actual landing page, but still no-pdf-link
=> but actual PDF is a real 404, it seems. oh well
https://doi.org/10.31729/jnma.7579
=> no-capture
https://doi.org/10.25373/ctsnet.20146931.v2
=> figshare
=> video, not document or PDF
https://doi.org/10.1007/s42600-022-00224-0
=> not yet crawled/attempted (!)
=> springer
=> not actually OA
https://doi.org/10.37391/ijeer.100207
=> some upstream issue (server not found)
https://doi.org/10.1063/5.0093946
=> aip.scitation.org, is actually OA (can download in browser)
=> cookie trap?
=> redirect-loop (seems like a true redirect loop)
=> retrying the terminal PDF URL seems to have worked
https://doi.org/10.18502/jchr.v11i2.9998
=> no actual fulltext on publisher site
https://doi.org/10.1128/spectrum.01144-22
=> this is a 503 error, even after retrying. weird!
DONE: check `publisher_type` in chocula for:
- "MDPI AG"
- "Informa UK (Taylor & Francis)"
cat recent_missing_oa.json | jq '[.publisher, .publisher_type]' -c | sort | uniq -c | sort -nr | head -n40
4819 ["MDPI AG","longtail"]
924 ["Informa UK (Taylor & Francis)",null]
665 ["EAG-Publikationen",null]
631 ["American Chemical Society","society"]
449 ["IOP Publishing","society"]
357 ["The Optical Society","society"]
336 ["OpenEdition","oa"]
309 ["CAIRN","repository"]
308 ["Schloss Dagstuhl - Leibniz-Zentrum für Informatik",null]
303 ["Apollo - University of Cambridge Repository",null]
292 ["Springer (Biomed Central Ltd.)",null]
275 ["Purdue University Graduate School",null]
270 ["Suryasa and Sons","longtail"]
257 ["La Trobe",null]
216 ["Frontiers Media SA","longtail"]
193 ["Proceedings of the National Academy of Sciences","society"]
182 ["Informa UK (Taylor & Francis)","longtail"]
176 ["American Physical Society","society"]
168 ["Institution of Electrical Engineers","society"]
166 ["Oxford University Press","unipress"]
153 ["Loughborough University",null]
chocula mostly seems to set these correctly. is the issue that the chocula
computed values aren't coming through or getting updated? probably. both
the release (from container) metadata update; and chocula importer not
doing updates based on this field; and some old/incorrect values.
did some cleanups of specific containers, and next chocula update should
result in a bunch more `publisher_type` getting populated on older
containers
TODO: verify URLs are actualy URLs... somewhere? in the ingest pipeline
TODO: fatcat: don't ingest figshare "work" DOIs, only the "versioned" ones (?)
doi_prefix:10.26181
WIP: sandcrawler: regularly (weekly?) re-try 404 errors (the terminal URL, not the base url?) (or, some kind of delay?)
doi_prefix:10.3390 (MDPI)
doi_prefix:10.1103
doi_prefix:10.1155
DONE: simply re-ingest all:
doi_prefix:10.4230
./fatcat_ingest.py --env prod --enqueue-kafka --kafka-hosts wbgrp-svc280.us.archive.org,wbgrp-svc284.us.archive.org,wbgrp-svc350.us.archive.org --kafka-request-topic sandcrawler-prod.ingest-file-requests-daily --ingest-type pdf query 'doi_prefix:10.4230'
# Counter({'ingest_request': 2096, 'elasticsearch_release': 2096, 'estimate': 2096, 'kafka': 2096})
container_65lzi3vohrat5nnymk3dqpoycy
./fatcat_ingest.py --env prod --enqueue-kafka --kafka-hosts wbgrp-svc280.us.archive.org,wbgrp-svc284.us.archive.org,wbgrp-svc350.us.archive.org --kafka-request-topic sandcrawler-prod.ingest-file-requests-daily --ingest-type pdf container --container-id 65lzi3vohrat5nnymk3dqpoycy
# Counter({'ingest_request': 187, 'elasticsearch_release': 187, 'estimate': 187, 'kafka': 187})
container_5vp2bio65jdc3blx6rfhp3chde
./fatcat_ingest.py --env prod --enqueue-kafka --kafka-hosts wbgrp-svc280.us.archive.org,wbgrp-svc284.us.archive.org,wbgrp-svc350.us.archive.org --kafka-request-topic sandcrawler-prod.ingest-file-requests-daily --ingest-type pdf container --container-id 5vp2bio65jdc3blx6rfhp3chde
# Counter({'ingest_request': 83, 'elasticsearch_release': 83, 'estimate': 83, 'kafka': 83})
DONE: verify and maybe re-ingest all:
is_oa:true publisher:"Canadian Science Publishing" in_ia:false
./fatcat_ingest.py --env prod --enqueue-kafka --kafka-hosts wbgrp-svc280.us.archive.org,wbgrp-svc284.us.archive.org,wbgrp-svc350.us.archive.org --kafka-request-topic sandcrawler-prod.ingest-file-requests-daily --allow-non-oa --ingest-type pdf --force-recrawl query 'year:>2010 is_oa:true publisher:"Canadian Science Publishing" in_ia:false !journal:print'
# Counter({'ingest_request': 1041, 'elasticsearch_release': 1041, 'estimate': 1041, 'kafka': 1041})
## Re-Ingest bad-redirect, max-hops-exceeded, and google drive
Similar to `redirect-loop`:
COPY (
SELECT row_to_json(ingest_request.*)
FROM ingest_request
LEFT JOIN ingest_file_result
ON ingest_file_result.ingest_type = ingest_request.ingest_type
AND ingest_file_result.base_url = ingest_request.base_url
WHERE
ingest_file_result.status = 'bad-redirect'
-- AND ingest_request.ingest_type = 'pdf'
AND (
ingest_request.link_source = 'doi'
OR ingest_request.link_source = 'doaj'
OR ingest_request.link_source = 'unpaywall'
)
) TO '/srv/sandcrawler/tasks/retry_badredirect.2022-07-20.rows.json';
# COPY 100011
# after first run: COPY 5611
COPY (
SELECT row_to_json(ingest_request.*)
FROM ingest_request
LEFT JOIN ingest_file_result
ON ingest_file_result.ingest_type = ingest_request.ingest_type
AND ingest_file_result.base_url = ingest_request.base_url
WHERE
ingest_file_result.status = 'max-hops-exceeded'
-- AND ingest_request.ingest_type = 'pdf'
AND (
ingest_request.link_source = 'doi'
OR ingest_request.link_source = 'doaj'
OR ingest_request.link_source = 'unpaywall'
)
) TO '/srv/sandcrawler/tasks/retry_maxhops.2022-07-20.rows.json';
# COPY 3546
COPY (
SELECT row_to_json(ingest_request.*)
FROM ingest_request
LEFT JOIN ingest_file_result
ON ingest_file_result.ingest_type = ingest_request.ingest_type
AND ingest_file_result.base_url = ingest_request.base_url
WHERE
ingest_file_result.hit is false
AND ingest_file_result.terminal_url like 'https://docs.google.com/viewer%'
AND (
ingest_request.link_source = 'doi'
OR ingest_request.link_source = 'doaj'
OR ingest_request.link_source = 'unpaywall'
)
) TO '/srv/sandcrawler/tasks/retry_googledocs.2022-07-20.rows.json';
# COPY 1082
./scripts/ingestrequest_row2json.py /srv/sandcrawler/tasks/retry_badredirect.2022-07-20.rows.json > /srv/sandcrawler/tasks/retry_badredirect.2022-07-20.json
./scripts/ingestrequest_row2json.py /srv/sandcrawler/tasks/retry_maxhops.2022-07-20.rows.json > /srv/sandcrawler/tasks/retry_maxhops.2022-07-20.json
./scripts/ingestrequest_row2json.py /srv/sandcrawler/tasks/retry_googledocs.2022-07-20.rows.json > /srv/sandcrawler/tasks/retry_googledocs.2022-07-20.json
cat /srv/sandcrawler/tasks/retry_badredirect.2022-07-20.json | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc350.us.archive.org -t sandcrawler-prod.ingest-file-requests-daily -p -1
cat /srv/sandcrawler/tasks/retry_maxhops.2022-07-20.json | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc350.us.archive.org -t sandcrawler-prod.ingest-file-requests-daily -p -1
cat /srv/sandcrawler/tasks/retry_googledocs.2022-07-20.json | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc350.us.archive.org -t sandcrawler-prod.ingest-file-requests-daily -p -1
# DONE