Could roll this into the current patch crawl instead of starting a new crawl from scratch.

## KBART "almost complete" experimentation

Random 10 releases:

    cat missing_releases.json | shuf -n10 | jq .ident -r | awk '{print "https://fatcat.wiki/release/" $1}'

https://fatcat.wiki/release/suggmo4fnfaave64frttaqqoja - domain gone
https://fatcat.wiki/release/uw2dq2p3mzgolk4alze2smv7bi - DOAJ, then OJS PDF link. sandcrawler failed, fixed
https://fatcat.wiki/release/fjamhzxxdndq5dcariobxvxu3u - OJS; sandcrawler fix works
https://fatcat.wiki/release/z3ubnko5ifcnbhhlegc24kya2u - OJS; sandcrawler failed, fixed (separate pattern)
https://fatcat.wiki/release/pysc3w2cdbehvffbyca4aqex3i - DOAJ, OJS bilingual, failed with 'redirect-loop'. force re-crawl worked for one copy
https://fatcat.wiki/release/am2m5agvjrbvnkstke3o3xtney - not attempted previously (?), success
https://fatcat.wiki/release/4zer6m56zvh6fd3ukpypdu7ita - cover page of journal (not an article); via crossref
https://fatcat.wiki/release/6njc4rdaifbg5jye3bbfdhkbsu - OJS; success
https://fatcat.wiki/release/jnmip3z7xjfsdfeex4piveshvu - OJS; not crawled previously; success
https://fatcat.wiki/release/wjxxcknnpjgtnpbzhzge6rkndi - no-pdf-link, fixed

Try some more!

https://fatcat.wiki/release/ywidvbhtfbettmfj7giu2htbdm - not attempted, success
https://fatcat.wiki/release/ou2kqv5k3rbk7iowfohpitelfa - OJS, not attempted, success?
https://fatcat.wiki/release/gv2glplmofeqrlrvfs524v5qa4 - scirp.org; 'redirect-loop'; HTML/PDF/XML all available; then 'gateway-timeout' on retry
https://fatcat.wiki/release/5r5wruxyyrf6jneorux3negwpe - gavinpublishers.com; broken site
https://fatcat.wiki/release/qk4atst6svg4hb73jdwacjcacu - horyzonty.ignatianum.edu.pl; broken DOI
https://fatcat.wiki/release/mp5ec3ycrjauxeve4n4weq7kqm - old cert; OJS; success
https://fatcat.wiki/release/sqnovcsmizckjdlwg3hipxrfqm - not attempted, success
https://fatcat.wiki/release/42ruewjuvbblxgnek6fpj5lp5m - OJS URL, but domain broken
https://fatcat.wiki/release/crg6aiypx5enveldvmwy5judp4 - volume/cover (stub)
https://fatcat.wiki/release/jzih3vvxj5ctxk3tbzyn5kokha - success

## Seeds: fixed OJS URLs

Made some recent changes to sandcrawler; should re-attempt OJS URLs, particularly from DOI or DOAJ, with patterns like:

- `no-pdf-link` with terminal URL like `/article/view/`
- `redirect-loop` with terminal URL like `/article/view/`

    COPY (
        SELECT row_to_json(ingest_request.*)
        FROM ingest_request
        LEFT JOIN ingest_file_result ON
            ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_file_result.status = 'no-pdf-link'
            AND (
                ingest_file_result.terminal_url LIKE '%/article/view/%'
                OR ingest_file_result.terminal_url LIKE '%/article/download/%'
            )
            AND (
                ingest_request.link_source = 'doi'
                OR ingest_request.link_source = 'doaj'
                OR ingest_request.link_source = 'unpaywall'
            )
    ) TO '/srv/sandcrawler/tasks/retry_ojs_nopdflink.2022-01-13.rows.json';
    => COPY 326577

    ./scripts/ingestrequest_row2json.py /srv/sandcrawler/tasks/retry_ojs_nopdflink.2022-01-13.rows.json > /srv/sandcrawler/tasks/retry_ojs_nopdflink.2022-01-13.json

    cat /srv/sandcrawler/tasks/retry_ojs_nopdflink.2022-01-13.json | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1

Done/running.
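
Once the bulk queue drains, a status breakdown over the same join would show whether the `no-pdf-link` rows are converting to `success`. This is just a sketch reusing the tables, columns, and URL patterns from the dump above, not a query that was actually run:

    -- sketch: status breakdown for the retried OJS-pattern requests
    SELECT ingest_file_result.status, COUNT(*)
    FROM ingest_request
    LEFT JOIN ingest_file_result ON
        ingest_file_result.ingest_type = ingest_request.ingest_type
        AND ingest_file_result.base_url = ingest_request.base_url
    WHERE
        ingest_request.ingest_type = 'pdf'
        -- same OJS terminal URL patterns as the dump above
        AND (
            ingest_file_result.terminal_url LIKE '%/article/view/%'
            OR ingest_file_result.terminal_url LIKE '%/article/download/%'
        )
        AND ingest_request.link_source IN ('doi', 'doaj', 'unpaywall')
    GROUP BY ingest_file_result.status
    ORDER BY COUNT(*) DESC
    LIMIT 20;
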
    COPY (
        SELECT ingest_file_result.terminal_url
        FROM ingest_request
        LEFT JOIN ingest_file_result ON
            ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND (
                ingest_file_result.status = 'redirect-loop'
                OR ingest_file_result.status = 'link-loop'
            )
            AND (
                ingest_file_result.terminal_url LIKE '%/article/view/%'
                OR ingest_file_result.terminal_url LIKE '%/article/download/%'
            )
    ) TO '/srv/sandcrawler/tasks/retry_ojs_loop.2022-01-13.txt';
    => COPY 342415

    cat /srv/sandcrawler/tasks/retry_ojs_loop.2022-01-13.txt | awk '{print "F+ " $1}' > /srv/sandcrawler/tasks/retry_ojs_loop.2022-01-13.schedule

Done/seeded.

## Seeds: scitemed.com

Batch retry sandcrawler `no-pdf-link` with terminal URL like: `scitemed.com/article`

    COPY (
        SELECT row_to_json(ingest_request.*)
        FROM ingest_request
        LEFT JOIN ingest_file_result ON
            ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_file_result.status = 'no-pdf-link'
            AND ingest_file_result.terminal_url LIKE '%scitemed.com/article%'
            AND (
                ingest_request.link_source = 'doi'
                OR ingest_request.link_source = 'doaj'
                OR ingest_request.link_source = 'unpaywall'
            )
    ) TO '/srv/sandcrawler/tasks/retry_scitemed.2022-01-13.rows.json';
    # SKIPPED

Actually there are very few of these.

## Seeds: non-OA paper DOIs

There are many DOIs out there which are likely from small publishers, on the web, and would ingest just fine (e.g., OJS sites).

    fatcat-cli search release in_ia:false is_oa:false 'doi:*' release_type:article-journal 'container_id:*' '!publisher_type:big5' --count
    30,938,106

    fatcat-cli search release in_ia:false is_oa:false 'doi:*' release_type:article-journal 'container_id:*' '!publisher_type:big5' 'preservation:none' --count
    6,664,347

    fatcat-cli search release in_ia:false is_oa:false 'doi:*' release_type:article-journal 'container_id:*' '!publisher_type:big5' 'in_kbart:false' --count
    8,258,111

Do the 8 million first, then maybe try the 30.9 million later? Do some sampling to see how many are actually accessible. From experience with KBART generation, many of these are likely to crawl successfully.

    ./fatcat_ingest.py --ingest-type pdf --allow-non-oa query 'in_ia:false is_oa:false doi:* release_type:article-journal container_id:* !publisher_type:big5 in_kbart:false' \
        | pv -l \
        | gzip \
        > /srv/fatcat/tasks/ingest_nonoa_doi.json.gz
    # Expecting 8255693 release objects in search queries

## Seeds: not daily, but OA DOI

There are a bunch of things we are no longer attempting daily, but should do heritrix crawls of periodically.

TODO: maybe in daily crawling, should check container coverage and see if most URLs are bright, and if so do ingest? hrm
TODO: what are they? zenodo.org?

## Seeds: HTML and XML links from HTML biblio

    kafkacat -C -b wbgrp-svc284.us.archive.org:9092 -t sandcrawler-prod.ingest-file-results -e \
        | pv -l \
        | rg '"(html|xml)_fulltext_url"' \
        | rg '"no-pdf-link"' \
        | gzip \
        > ingest_file_result_fulltext_urls.2022-01-13.json.gz

## Seeds: most doi.org terminal non-success

Unless it is a 404, should retry.
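
No query written for this one yet; following the pattern of the earlier sections, the dump might look roughly like the sketch below. The output filename, the doi.org terminal URL match, and the 404 exclusion via a `terminal_status_code` column are all assumptions to check against the actual schema:

    -- sketch only, not run; column and file names are assumptions
    COPY (
        SELECT row_to_json(ingest_request.*)
        FROM ingest_request
        LEFT JOIN ingest_file_result ON
            ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_file_result.status != 'success'
            -- ingests which never made it past the doi.org redirect
            AND ingest_file_result.terminal_url LIKE '%doi.org/10%'
            -- skip plain 404s, per the note above
            AND (ingest_file_result.terminal_status_code IS NULL
                OR ingest_file_result.terminal_status_code != 404)
    ) TO '/srv/sandcrawler/tasks/retry_doi_terminal.2022-01-13.rows.json';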