Diffstat (limited to 'notes/ingest/2022-01-13_doi_crawl.md')
-rw-r--r-- | notes/ingest/2022-01-13_doi_crawl.md | 157 |
1 files changed, 157 insertions, 0 deletions
diff --git a/notes/ingest/2022-01-13_doi_crawl.md b/notes/ingest/2022-01-13_doi_crawl.md
new file mode 100644
index 0000000..6f3b2c8
--- /dev/null
+++ b/notes/ingest/2022-01-13_doi_crawl.md
@@ -0,0 +1,157 @@
+
+Could roll this in to current patch crawl instead of starting a new crawl from scratch.
+
+## KBART "almost complete" experimentation
+
+Random 10 releases:
+
+    cat missing_releases.json | shuf -n10 | jq .ident -r | awk '{print "https://fatcat.wiki/release/" $1}'
+    https://fatcat.wiki/release/suggmo4fnfaave64frttaqqoja - domain gone
+    https://fatcat.wiki/release/uw2dq2p3mzgolk4alze2smv7bi - DOAJ, then OJS PDF link. sandcrawler failed, fixed
+    https://fatcat.wiki/release/fjamhzxxdndq5dcariobxvxu3u - OJS; sandcrawler fix works
+    https://fatcat.wiki/release/z3ubnko5ifcnbhhlegc24kya2u - OJS; sandcrawler failed, fixed (separate pattern)
+    https://fatcat.wiki/release/pysc3w2cdbehvffbyca4aqex3i - DOAJ, OJS bilingual, failed with 'redirect-loop'. force re-crawl worked for one copy
+    https://fatcat.wiki/release/am2m5agvjrbvnkstke3o3xtney - not attempted previously (?), success
+    https://fatcat.wiki/release/4zer6m56zvh6fd3ukpypdu7ita - cover page of journal (not an article). via crossref
+    https://fatcat.wiki/release/6njc4rdaifbg5jye3bbfdhkbsu - OJS; success
+    https://fatcat.wiki/release/jnmip3z7xjfsdfeex4piveshvu - OJS; not crawled previously; success
+    https://fatcat.wiki/release/wjxxcknnpjgtnpbzhzge6rkndi - no-pdf-link, fixed
+
+Try some more!
+
+    https://fatcat.wiki/release/ywidvbhtfbettmfj7giu2htbdm - not attempted, success
+    https://fatcat.wiki/release/ou2kqv5k3rbk7iowfohpitelfa - OJS, not attempted, success?
+    https://fatcat.wiki/release/gv2glplmofeqrlrvfs524v5qa4 - scirp.org; 'redirect-loop'; HTML/PDF/XML all available; then 'gateway-timeout' on retry
+    https://fatcat.wiki/release/5r5wruxyyrf6jneorux3negwpe - gavinpublishers.com; broken site
+    https://fatcat.wiki/release/qk4atst6svg4hb73jdwacjcacu - horyzonty.ignatianum.edu.pl; broken DOI
+    https://fatcat.wiki/release/mp5ec3ycrjauxeve4n4weq7kqm - old cert; OJS; success
+    https://fatcat.wiki/release/sqnovcsmizckjdlwg3hipxrfqm - not attempted, success
+    https://fatcat.wiki/release/42ruewjuvbblxgnek6fpj5lp5m - OJS URL, but domain broken
+    https://fatcat.wiki/release/crg6aiypx5enveldvmwy5judp4 - volume/cover (stub)
+    https://fatcat.wiki/release/jzih3vvxj5ctxk3tbzyn5kokha - success
+
+
+## Seeds: fixed OJS URLs
+
+Made some recent changes to sandcrawler, should re-attempt OJS URLs, particularly from DOI or DOAJ, with pattern like:
+
+- `no-pdf-link` with terminal URL like `/article/view/`
+- `redirect-loop` with terminal URL like `/article/view/`
+
+    COPY (
+        SELECT row_to_json(ingest_request.*)
+        FROM ingest_request
+        LEFT JOIN ingest_file_result
+            ON ingest_file_result.ingest_type = ingest_request.ingest_type
+            AND ingest_file_result.base_url = ingest_request.base_url
+        WHERE
+            ingest_request.ingest_type = 'pdf'
+            AND ingest_file_result.status = 'no-pdf-link'
+            AND (
+                ingest_file_result.terminal_url LIKE '%/article/view/%'
+                OR ingest_file_result.terminal_url LIKE '%/article/download/%'
+            )
+            AND (
+                ingest_request.link_source = 'doi'
+                OR ingest_request.link_source = 'doaj'
+                OR ingest_request.link_source = 'unpaywall'
+            )
+    ) TO '/srv/sandcrawler/tasks/retry_ojs_nopdflink.2022-01-13.rows.json';
+    => COPY 326577
+
+    ./scripts/ingestrequest_row2json.py /srv/sandcrawler/tasks/retry_ojs_nopdflink.2022-01-13.rows.json > /srv/sandcrawler/tasks/retry_ojs_nopdflink.2022-01-13.json
+    cat /srv/sandcrawler/tasks/retry_ojs_nopdflink.2022-01-13.json | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
+
+Done/running.
+
+    COPY (
+        SELECT ingest_file_result.terminal_url
+        FROM ingest_request
+        LEFT JOIN ingest_file_result
+            ON ingest_file_result.ingest_type = ingest_request.ingest_type
+            AND ingest_file_result.base_url = ingest_request.base_url
+        WHERE
+            ingest_request.ingest_type = 'pdf'
+            AND (
+                ingest_file_result.status = 'redirect-loop'
+                OR ingest_file_result.status = 'link-loop'
+            )
+            AND (
+                ingest_file_result.terminal_url LIKE '%/article/view/%'
+                OR ingest_file_result.terminal_url LIKE '%/article/download/%'
+            )
+    ) TO '/srv/sandcrawler/tasks/retry_ojs_loop.2022-01-13.txt';
+    => COPY 342415
+
+    cat /srv/sandcrawler/tasks/retry_ojs_loop.2022-01-13.txt | awk '{print "F+ " $1}' > /srv/sandcrawler/tasks/retry_ojs_loop.2022-01-13.schedule
+
+Done/seeded.
+
+## Seeds: scitemed.com
+
+Batch retry sandcrawler `no-pdf-link` with terminal URL like: `scitemed.com/article`
+
+    COPY (
+        SELECT row_to_json(ingest_request.*)
+        FROM ingest_request
+        LEFT JOIN ingest_file_result
+            ON ingest_file_result.ingest_type = ingest_request.ingest_type
+            AND ingest_file_result.base_url = ingest_request.base_url
+        WHERE
+            ingest_request.ingest_type = 'pdf'
+            AND ingest_file_result.status = 'no-pdf-link'
+            AND ingest_file_result.terminal_url LIKE '%/article/view/%'
+            AND (
+                ingest_request.link_source = 'doi'
+                OR ingest_request.link_source = 'doaj'
+                OR ingest_request.link_source = 'unpaywall'
+            )
+    ) TO '/srv/sandcrawler/tasks/retry_scitemed.2022-01-13.rows.json';
+    # SKIPPED
+
+Actually there are very few of these.
+
+## Seeds: non-OA paper DOIs
+
+There are many DOIs out there which are likely to be from small publishers, on
+the web, and would ingest just fine (eg, in OJS).
+
+    fatcat-cli search release in_ia:false is_oa:false 'doi:*' release_type:article-journal 'container_id:*' '!publisher_type:big5' --count
+    30,938,106
+
+    fatcat-cli search release in_ia:false is_oa:false 'doi:*' release_type:article-journal 'container_id:*' '!publisher_type:big5' 'preservation:none' --count
+    6,664,347
+
+    fatcat-cli search release in_ia:false is_oa:false 'doi:*' release_type:article-journal 'container_id:*' '!publisher_type:big5' 'in_kbart:false' --count
+    8,258,111
+
+Do the 8 million first, then maybe try the 30.9 million later? Do sampling to
+see how many are actually accessible? From experience with KBART generation,
+many of these are likely to crawl successfully.
+
+    ./fatcat_ingest.py --ingest-type pdf --allow-non-oa query 'in_ia:false is_oa:false doi:* release_type:article-journal container_id:* !publisher_type:big5 in_kbart:false' \
+        | pv -l \
+        | gzip \
+        > /srv/fatcat/tasks/ingest_nonoa_doi.json.gz
+    # Expecting 8255693 release objects in search queries
+
+## Seeds: not daily, but OA DOI
+
+There are a bunch of things we are no longer attempting daily, but should do
+heritrix crawls of periodically.
+
+TODO: maybe in daily crawling, should check container coverage and see if most URLs are bright, and if so do ingest? hrm
+TODO: What are they? zenodo.org?
+
+## Seeds: HTML and XML links from HTML biblio
+
+    kafkacat -C -b wbgrp-svc284.us.archive.org:9092 -t sandcrawler-prod.ingest-file-results -e \
+        | pv -l \
+        | rg '"(html|xml)_fulltext_url"' \
+        | rg '"no-pdf-link"' \
+        | gzip \
+        > ingest_file_result_fulltext_urls.2022-01-13.json.gz
+
+## Seeds: most doi.org terminal non-success
+
+Unless it is a 404, should retry.
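
A minimal sketch of the "retry unless 404" rule for doi.org terminals, not something from the notes themselves. It assumes the results are available as JSON lines with `status` and `terminal_url` keys (matching the `ingest_file_result` columns used in the queries above) plus a `terminal_status_code` key, which is an assumed field name. The helper names `should_retry` and `select_retries` are hypothetical.

```python
import json
from urllib.parse import urlparse


def should_retry(row: dict) -> bool:
    """True for non-success rows whose crawl ended at doi.org, except 404s."""
    if row.get("status") == "success":
        return False
    host = urlparse(row.get("terminal_url") or "").hostname or ""
    # Only rows that never got past the DOI resolver (doi.org, dx.doi.org, ...)
    if host != "doi.org" and not host.endswith(".doi.org"):
        return False
    # 'terminal_status_code' is an assumed field name; a 404 means the DOI
    # itself did not resolve, so a retry is unlikely to help
    return row.get("terminal_status_code") != 404


def select_retries(lines):
    """Yield parsed rows worth retrying from an iterable of JSON lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        if should_retry(row):
            yield row
```

The selected rows could then be turned back into ingest requests or a seed list in the same way as the OJS retries, though that step is left out here.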