Could roll this into the current patch crawl instead of starting a new crawl from scratch.

This file is misnamed; these are mostly non-DOI-specific small updates.

## KBART "almost complete" experimentation

Random 10 releases:

    cat missing_releases.json | shuf -n10 | jq .ident -r | awk '{print "https://fatcat.wiki/release/" $1}'

    https://fatcat.wiki/release/suggmo4fnfaave64frttaqqoja - domain gone
    https://fatcat.wiki/release/uw2dq2p3mzgolk4alze2smv7bi - DOAJ, then OJS PDF link; sandcrawler failed, fixed
    https://fatcat.wiki/release/fjamhzxxdndq5dcariobxvxu3u - OJS; sandcrawler fix works
    https://fatcat.wiki/release/z3ubnko5ifcnbhhlegc24kya2u - OJS; sandcrawler failed, fixed (separate pattern)
    https://fatcat.wiki/release/pysc3w2cdbehvffbyca4aqex3i - DOAJ, OJS bilingual; failed with 'redirect-loop', force re-crawl worked for one copy
    https://fatcat.wiki/release/am2m5agvjrbvnkstke3o3xtney - not attempted previously (?), success
    https://fatcat.wiki/release/4zer6m56zvh6fd3ukpypdu7ita - cover page of journal (not an article); via crossref
    https://fatcat.wiki/release/6njc4rdaifbg5jye3bbfdhkbsu - OJS; success
    https://fatcat.wiki/release/jnmip3z7xjfsdfeex4piveshvu - OJS; not crawled previously; success
    https://fatcat.wiki/release/wjxxcknnpjgtnpbzhzge6rkndi - no-pdf-link, fixed

Try some more!

    https://fatcat.wiki/release/ywidvbhtfbettmfj7giu2htbdm - not attempted, success
    https://fatcat.wiki/release/ou2kqv5k3rbk7iowfohpitelfa - OJS, not attempted, success (?)
    https://fatcat.wiki/release/gv2glplmofeqrlrvfs524v5qa4 - scirp.org; 'redirect-loop'; HTML/PDF/XML all available; then 'gateway-timeout' on retry
    https://fatcat.wiki/release/5r5wruxyyrf6jneorux3negwpe - gavinpublishers.com; broken site
    https://fatcat.wiki/release/qk4atst6svg4hb73jdwacjcacu - horyzonty.ignatianum.edu.pl; broken DOI
    https://fatcat.wiki/release/mp5ec3ycrjauxeve4n4weq7kqm - old cert; OJS; success
    https://fatcat.wiki/release/sqnovcsmizckjdlwg3hipxrfqm - not attempted, success
    https://fatcat.wiki/release/42ruewjuvbblxgnek6fpj5lp5m - OJS URL, but domain broken
    https://fatcat.wiki/release/crg6aiypx5enveldvmwy5judp4 - volume/cover (stub)
    https://fatcat.wiki/release/jzih3vvxj5ctxk3tbzyn5kokha - success

## Seeds: fixed OJS URLs

Made some recent changes to sandcrawler; should re-attempt OJS URLs, particularly from DOI or DOAJ sources, with patterns like:

- `no-pdf-link` with terminal URL like `/article/view/`
- `redirect-loop` with terminal URL like `/article/view/`

    COPY (
        SELECT row_to_json(ingest_request.*)
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_file_result.status = 'no-pdf-link'
            AND (
                ingest_file_result.terminal_url LIKE '%/article/view/%'
                OR ingest_file_result.terminal_url LIKE '%/article/download/%'
            )
            AND (
                ingest_request.link_source = 'doi'
                OR ingest_request.link_source = 'doaj'
                OR ingest_request.link_source = 'unpaywall'
            )
    ) TO '/srv/sandcrawler/tasks/retry_ojs_nopdflink.2022-01-13.rows.json';
    => COPY 326577

    ./scripts/ingestrequest_row2json.py /srv/sandcrawler/tasks/retry_ojs_nopdflink.2022-01-13.rows.json > /srv/sandcrawler/tasks/retry_ojs_nopdflink.2022-01-13.json
    cat /srv/sandcrawler/tasks/retry_ojs_nopdflink.2022-01-13.json | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1

Done/running.

    COPY (
        SELECT ingest_file_result.terminal_url
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND (
                ingest_file_result.status = 'redirect-loop'
                OR ingest_file_result.status = 'link-loop'
            )
            AND (
                ingest_file_result.terminal_url LIKE '%/article/view/%'
                OR ingest_file_result.terminal_url LIKE '%/article/download/%'
            )
    ) TO '/srv/sandcrawler/tasks/retry_ojs_loop.2022-01-13.txt';
    => COPY 342415

    cat /srv/sandcrawler/tasks/retry_ojs_loop.2022-01-13.txt | awk '{print "F+ " $1}' > /srv/sandcrawler/tasks/retry_ojs_loop.2022-01-13.schedule

Done/seeded.
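Before the whole batch churns through, a quick spot-check in the spirit of the KBART sampling above could confirm that the fixed patterns actually resolve. A minimal sketch, assuming `curl` is available and the `F+ <url>` schedule format generated above:

    # sample 10 rescheduled URLs; print final HTTP status, content type, and URL
    shuf -n10 /srv/sandcrawler/tasks/retry_ojs_loop.2022-01-13.schedule \
        | awk '{print $2}' \
        | while read -r url; do
            echo "$(curl -sL -m 30 -o /dev/null -w '%{http_code} %{content_type}' "$url") $url"
        done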
## Seeds: scitemed.com

Batch retry of sandcrawler `no-pdf-link` results with terminal URL like `scitemed.com/article`:

    COPY (
        SELECT row_to_json(ingest_request.*)
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_file_result.status = 'no-pdf-link'
            AND ingest_file_result.terminal_url LIKE '%scitemed.com/article%'
            AND (
                ingest_request.link_source = 'doi'
                OR ingest_request.link_source = 'doaj'
                OR ingest_request.link_source = 'unpaywall'
            )
    ) TO '/srv/sandcrawler/tasks/retry_scitemed.2022-01-13.rows.json';
    # SKIPPED

Skipped: there are actually very few of these.

## Seeds: non-OA paper DOIs

There are many DOIs out there which are likely to be from small publishers, hosted on the open web, and which would ingest just fine (eg, OJS sites).

    fatcat-cli search release in_ia:false is_oa:false 'doi:*' release_type:article-journal 'container_id:*' '!publisher_type:big5' --count
    30,938,106

    fatcat-cli search release in_ia:false is_oa:false 'doi:*' release_type:article-journal 'container_id:*' '!publisher_type:big5' 'preservation:none' --count
    6,664,347

    fatcat-cli search release in_ia:false is_oa:false 'doi:*' release_type:article-journal 'container_id:*' '!publisher_type:big5' 'in_kbart:false' --count
    8,258,111

Do the 8.3 million first, then maybe try the 30.9 million later? Do some sampling to see how many are actually accessible (see the sketch at the end of this section)? From experience with KBART generation, many of these are likely to crawl successfully.

    ./fatcat_ingest.py --ingest-type pdf --allow-non-oa query 'in_ia:false is_oa:false doi:* release_type:article-journal container_id:* !publisher_type:big5 in_kbart:false' \
        | pv -l \
        | gzip \
        > /srv/fatcat/tasks/ingest_nonoa_doi.json.gz
    # re-running 2022-02-08 after this VM was upgraded
    # Expecting 8321448 release objects in search queries
    # DONE

This is large enough that it will probably be a bulk ingest, and then probably a follow-up crawl.
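The sampling question above could be answered cheaply before committing to the full batch by tallying which DOI prefixes dominate the export, since a handful of large prefixes may be worth excluding (as ends up happening in the bulk ingest section below). A sketch, assuming each line of the dump is a JSON ingest request whose `base_url` is a `doi.org` URL:

    # DOI prefix breakdown over a random sample of the export
    zcat /srv/fatcat/tasks/ingest_nonoa_doi.json.gz \
        | shuf -n 100000 \
        | jq -r .base_url \
        | rg -o 'doi\.org/10\.[0-9]+' \
        | sort | uniq -c | sort -nr | head -n20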
## Seeds: HTML and XML links from HTML biblio

    kafkacat -C -b wbgrp-svc284.us.archive.org:9092 -t sandcrawler-prod.ingest-file-results -e \
        | pv -l \
        | rg '"(html|xml)_fulltext_url"' \
        | rg '"no-pdf-link"' \
        | gzip \
        > ingest_file_result_fulltext_urls.2022-01-13.json.gz
    # cut off at some point, so the gzip stream is terminated weirdly

    zcat ingest_file_result_fulltext_urls.2022-01-13.json.gz | wc -l
    # gzip: ingest_file_result_fulltext_urls.2022-01-13.json.gz: unexpected end of file
    # 2,538,433

Prepare seedlists (to include in Heritrix patch crawl):

    zcat ingest_file_result_fulltext_urls.2022-01-13.json.gz \
        | jq .html_biblio.xml_fulltext_url -r \
        | rg '://' \
        | sort -u -S 4G \
        | pv -l \
        | gzip \
        > ingest_file_result_fulltext_urls.2022-01-13.xml_urls.txt.gz
    # 1.24M 0:01:35 [12.9k/s]

    zcat ingest_file_result_fulltext_urls.2022-01-13.json.gz \
        | jq .html_biblio.html_fulltext_url -r \
        | rg '://' \
        | sort -u -S 4G \
        | pv -l \
        | gzip \
        > ingest_file_result_fulltext_urls.2022-01-13.html_urls.txt.gz
    # 549k 0:01:27 [6.31k/s]

    zcat ingest_file_result_fulltext_urls.2022-01-13.xml_urls.txt.gz ingest_file_result_fulltext_urls.2022-01-13.html_urls.txt.gz \
        | cut -f3 -d/ \
        | sort -S 4G \
        | uniq -c \
        | sort -nr \
        | head -n20

    534005 dlc.library.columbia.edu
    355319 www.degruyter.com
    196421 zenodo.org
    101450 serval.unil.ch
    100631 biblio.ugent.be
     47986 digi.ub.uni-heidelberg.de
     39187 www.emerald.com
     33195 www.cairn.info
     25703 boris.unibe.ch
     19516 journals.openedition.org
     15911 academic.oup.com
     11091 repository.dl.itc.u-tokyo.ac.jp
      9847 oxfordworldsclassics.com
      9698 www.thieme-connect.de
      9552 www.idunn.no
      9265 www.zora.uzh.ch
      8030 www.scielo.br
      6543 www.hanspub.org
      6229 asmedigitalcollection.asme.org
      5651 brill.com

    zcat ingest_file_result_fulltext_urls.2022-01-13.xml_urls.txt.gz ingest_file_result_fulltext_urls.2022-01-13.html_urls.txt.gz \
        | awk '{print "F+ " $1}' \
        > ingest_file_result_fulltext_urls.2022-01-13.xml_and_html.schedule

    wc -l ingest_file_result_fulltext_urls.2022-01-13.xml_and_html.schedule
    1785901 ingest_file_result_fulltext_urls.2022-01-13.xml_and_html.schedule

Added to `JOURNALS-PATCH-CRAWL-2022-01`.

## Seeds: most doi.org terminal non-success

Unless the terminal status is a 404, these should be retried.

TODO: generate this list

## Non-OA DOI Bulk Ingest

Had previously run:

    cat ingest_nonoa_doi.json.gz \
        | rg -v "doi.org/10.2139/" \
        | rg -v "doi.org/10.1021/" \
        | rg -v "doi.org/10.1121/" \
        | rg -v "doi.org/10.1515/" \
        | rg -v "doi.org/10.1093/" \
        | rg -v "europepmc.org" \
        | pv -l \
        | gzip \
        > nonoa_doi.filtered.ingests.json.gz
    # 7.35M 0:01:13 [99.8k/s]

Starting a bulk ingest of these on 2022-03-18, which is *before* the crawl has entirely finished, but after almost all queues (domains) have been done for several days.

    zcat nonoa_doi.filtered.ingests.json.gz \
        | rg -v "\\\\" \
        | jq . -c \
        | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1

Looks like many jstage URLs have `no-capture` status; these are still (slowly) crawling.
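To watch how this bulk ingest lands (eg, whether the jstage `no-capture` fraction shrinks as the crawl catches up), the results topic could be tallied the same way it was dumped above. A rough sketch, assuming jstage result messages mention the domain in their terminal URL:

    # status breakdown for jstage ingest results
    kafkacat -C -b wbgrp-svc284.us.archive.org:9092 -t sandcrawler-prod.ingest-file-results -e \
        | rg 'jstage' \
        | jq -r .status \
        | sort | uniq -c | sort -nr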