Could roll this into the current patch crawl instead of starting a new crawl
from scratch. This file is misnamed; these are mostly non-DOI-specific small
updates.

## KBART "almost complete" experimentation

Random 10 releases:

    cat missing_releases.json | shuf -n10 | jq .ident -r | awk '{print "https://fatcat.wiki/release/" $1}'

    https://fatcat.wiki/release/suggmo4fnfaave64frttaqqoja - domain gone
    https://fatcat.wiki/release/uw2dq2p3mzgolk4alze2smv7bi - DOAJ, then OJS PDF link; sandcrawler failed, fixed
    https://fatcat.wiki/release/fjamhzxxdndq5dcariobxvxu3u - OJS; sandcrawler fix works
    https://fatcat.wiki/release/z3ubnko5ifcnbhhlegc24kya2u - OJS; sandcrawler failed, fixed (separate pattern)
    https://fatcat.wiki/release/pysc3w2cdbehvffbyca4aqex3i - DOAJ, OJS bilingual; failed with 'redirect-loop'; forced re-crawl worked for one copy
    https://fatcat.wiki/release/am2m5agvjrbvnkstke3o3xtney - not attempted previously (?); success
    https://fatcat.wiki/release/4zer6m56zvh6fd3ukpypdu7ita - cover page of journal (not an article); via crossref
    https://fatcat.wiki/release/6njc4rdaifbg5jye3bbfdhkbsu - OJS; success
    https://fatcat.wiki/release/jnmip3z7xjfsdfeex4piveshvu - OJS; not crawled previously; success
    https://fatcat.wiki/release/wjxxcknnpjgtnpbzhzge6rkndi - no-pdf-link, fixed

Try some more!

    https://fatcat.wiki/release/ywidvbhtfbettmfj7giu2htbdm - not attempted; success
    https://fatcat.wiki/release/ou2kqv5k3rbk7iowfohpitelfa - OJS; not attempted; success?
    https://fatcat.wiki/release/gv2glplmofeqrlrvfs524v5qa4 - scirp.org; 'redirect-loop'; HTML/PDF/XML all available; then 'gateway-timeout' on retry
    https://fatcat.wiki/release/5r5wruxyyrf6jneorux3negwpe - gavinpublishers.com; broken site
    https://fatcat.wiki/release/qk4atst6svg4hb73jdwacjcacu - horyzonty.ignatianum.edu.pl; broken DOI
    https://fatcat.wiki/release/mp5ec3ycrjauxeve4n4weq7kqm - old cert; OJS; success
    https://fatcat.wiki/release/sqnovcsmizckjdlwg3hipxrfqm - not attempted; success
    https://fatcat.wiki/release/42ruewjuvbblxgnek6fpj5lp5m - OJS URL, but domain broken
    https://fatcat.wiki/release/crg6aiypx5enveldvmwy5judp4 - volume/cover (stub)
    https://fatcat.wiki/release/jzih3vvxj5ctxk3tbzyn5kokha - success

## Seeds: fixed OJS URLs

Made some recent changes to sandcrawler; should re-attempt OJS URLs,
particularly from DOI or DOAJ, with patterns like:

- `no-pdf-link` with terminal URL like `/article/view/`
- `redirect-loop` with terminal URL like `/article/view/`

The `no-pdf-link` batch, as re-ingest requests:

    COPY (
        SELECT row_to_json(ingest_request.*)
        FROM ingest_request
        LEFT JOIN ingest_file_result ON
            ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_file_result.status = 'no-pdf-link'
            AND (
                ingest_file_result.terminal_url LIKE '%/article/view/%'
                OR ingest_file_result.terminal_url LIKE '%/article/download/%'
            )
            AND (
                ingest_request.link_source = 'doi'
                OR ingest_request.link_source = 'doaj'
                OR ingest_request.link_source = 'unpaywall'
            )
    ) TO '/srv/sandcrawler/tasks/retry_ojs_nopdflink.2022-01-13.rows.json';
    => COPY 326577

    ./scripts/ingestrequest_row2json.py /srv/sandcrawler/tasks/retry_ojs_nopdflink.2022-01-13.rows.json > /srv/sandcrawler/tasks/retry_ojs_nopdflink.2022-01-13.json

    cat /srv/sandcrawler/tasks/retry_ojs_nopdflink.2022-01-13.json \
        | rg -v "\\\\" \
        | jq . -c \
        | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1

Done/running.
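Once the bulk queue drains, a status rollup over the same join should show
whether the sandcrawler fixes took. A minimal sketch, reusing the exact tables
and URL patterns from the export above (untested):

    SELECT ingest_file_result.status, COUNT(*)
    FROM ingest_request
    LEFT JOIN ingest_file_result ON
        ingest_file_result.ingest_type = ingest_request.ingest_type
        AND ingest_file_result.base_url = ingest_request.base_url
    WHERE
        ingest_request.ingest_type = 'pdf'
        AND (
            ingest_file_result.terminal_url LIKE '%/article/view/%'
            OR ingest_file_result.terminal_url LIKE '%/article/download/%'
        )
        AND (
            ingest_request.link_source = 'doi'
            OR ingest_request.link_source = 'doaj'
            OR ingest_request.link_source = 'unpaywall'
        )
    -- counts per status; 'no-pdf-link' should shrink as 'success' grows
    GROUP BY ingest_file_result.status
    ORDER BY COUNT(*) DESC
    LIMIT 25;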
Similarly, the `redirect-loop`/`link-loop` batch, output as a heritrix crawl
schedule rather than re-ingest requests:

    COPY (
        SELECT ingest_file_result.terminal_url
        FROM ingest_request
        LEFT JOIN ingest_file_result ON
            ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND (
                ingest_file_result.status = 'redirect-loop'
                OR ingest_file_result.status = 'link-loop'
            )
            AND (
                ingest_file_result.terminal_url LIKE '%/article/view/%'
                OR ingest_file_result.terminal_url LIKE '%/article/download/%'
            )
    ) TO '/srv/sandcrawler/tasks/retry_ojs_loop.2022-01-13.txt';
    => COPY 342415

    cat /srv/sandcrawler/tasks/retry_ojs_loop.2022-01-13.txt \
        | awk '{print "F+ " $1}' \
        > /srv/sandcrawler/tasks/retry_ojs_loop.2022-01-13.schedule

Done/seeded.

## Seeds: scitemed.com

Batch retry sandcrawler `no-pdf-link` with terminal URL like `scitemed.com/article`:

    COPY (
        SELECT row_to_json(ingest_request.*)
        FROM ingest_request
        LEFT JOIN ingest_file_result ON
            ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_file_result.status = 'no-pdf-link'
            AND ingest_file_result.terminal_url LIKE '%scitemed.com/article%'
            AND (
                ingest_request.link_source = 'doi'
                OR ingest_request.link_source = 'doaj'
                OR ingest_request.link_source = 'unpaywall'
            )
    ) TO '/srv/sandcrawler/tasks/retry_scitemed.2022-01-13.rows.json';
    # SKIPPED

Actually there are very few of these, so skipped.

## Seeds: non-OA paper DOIs

There are many DOIs out there which are likely to be from small publishers, on
the web, and would ingest just fine (e.g., in OJS):

    fatcat-cli search release in_ia:false is_oa:false 'doi:*' release_type:article-journal 'container_id:*' '!publisher_type:big5' --count
    30,938,106

    fatcat-cli search release in_ia:false is_oa:false 'doi:*' release_type:article-journal 'container_id:*' '!publisher_type:big5' 'preservation:none' --count
    6,664,347

    fatcat-cli search release in_ia:false is_oa:false 'doi:*' release_type:article-journal 'container_id:*' '!publisher_type:big5' 'in_kbart:false' --count
    8,258,111

Do the 8 million first, then maybe try the 30.9 million later? Do sampling to
see how many are actually accessible? From experience with KBART generation,
many of these are likely to crawl successfully.

    ./fatcat_ingest.py --ingest-type pdf --allow-non-oa query 'in_ia:false is_oa:false doi:* release_type:article-journal container_id:* !publisher_type:big5 in_kbart:false' \
        | pv -l \
        | gzip \
        > /srv/fatcat/tasks/ingest_nonoa_doi.json.gz
    # re-running 2022-02-08 after this VM was upgraded
    # Expecting 8321448 release objects in search queries
    # DONE

This is large enough that it will probably be a bulk ingest, and then probably
a follow-up crawl.

## Seeds: HTML and XML links from HTML biblio

    kafkacat -C -b wbgrp-svc284.us.archive.org:9092 -t sandcrawler-prod.ingest-file-results -e \
        | pv -l \
        | rg '"(html|xml)_fulltext_url"' \
        | rg '"no-pdf-link"' \
        | gzip \
        > ingest_file_result_fulltext_urls.2022-01-13.json.gz
    # cut this off at some point?
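On the "cut this off" question: the dump could be bounded up front instead of
killed mid-stream. A sketch, assuming kafkacat's standard `-c` (message count)
and `-o` (offset) consumer flags plus coreutils `timeout`; the output filename
is a placeholder. Either way only the kafkacat end of the pipe stops, so gzip
still writes a clean trailer:

    # stop after a fixed number of messages (from the start of each partition)
    kafkacat -C -b wbgrp-svc284.us.archive.org:9092 -t sandcrawler-prod.ingest-file-results -o beginning -c 5000000 \
        | gzip \
        > bounded_dump.json.gz

    # or stop after a wall-clock budget
    timeout 6h kafkacat -C -b wbgrp-svc284.us.archive.org:9092 -t sandcrawler-prod.ingest-file-results -e \
        | gzip \
        > bounded_dump.json.gz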
The gzip stream is terminated weirdly (the dump was cut off mid-write), but the
rows up to the truncation point read out fine:

    zcat ingest_file_result_fulltext_urls.2022-01-13.json.gz | wc -l
    # gzip: ingest_file_result_fulltext_urls.2022-01-13.json.gz: unexpected end of file
    # 2,538,433

Prepare seedlists (to include in heritrix patch crawl):

    zcat ingest_file_result_fulltext_urls.2022-01-13.json.gz \
        | jq .html_biblio.xml_fulltext_url -r \
        | rg '://' \
        | sort -u -S 4G \
        | pv -l \
        | gzip \
        > ingest_file_result_fulltext_urls.2022-01-13.xml_urls.txt.gz
    # 1.24M 0:01:35 [12.9k/s]

    zcat ingest_file_result_fulltext_urls.2022-01-13.json.gz \
        | jq .html_biblio.html_fulltext_url -r \
        | rg '://' \
        | sort -u -S 4G \
        | pv -l \
        | gzip \
        > ingest_file_result_fulltext_urls.2022-01-13.html_urls.txt.gz
    # 549k 0:01:27 [6.31k/s]

Top domains:

    zcat ingest_file_result_fulltext_urls.2022-01-13.xml_urls.txt.gz ingest_file_result_fulltext_urls.2022-01-13.html_urls.txt.gz \
        | cut -f3 -d/ \
        | sort -S 4G \
        | uniq -c \
        | sort -nr \
        | head -n20

    534005 dlc.library.columbia.edu
    355319 www.degruyter.com
    196421 zenodo.org
    101450 serval.unil.ch
    100631 biblio.ugent.be
     47986 digi.ub.uni-heidelberg.de
     39187 www.emerald.com
     33195 www.cairn.info
     25703 boris.unibe.ch
     19516 journals.openedition.org
     15911 academic.oup.com
     11091 repository.dl.itc.u-tokyo.ac.jp
      9847 oxfordworldsclassics.com
      9698 www.thieme-connect.de
      9552 www.idunn.no
      9265 www.zora.uzh.ch
      8030 www.scielo.br
      6543 www.hanspub.org
      6229 asmedigitalcollection.asme.org
      5651 brill.com

    zcat ingest_file_result_fulltext_urls.2022-01-13.xml_urls.txt.gz ingest_file_result_fulltext_urls.2022-01-13.html_urls.txt.gz \
        | awk '{print "F+ " $1}' \
        > ingest_file_result_fulltext_urls.2022-01-13.xml_and_html.schedule

    wc -l ingest_file_result_fulltext_urls.2022-01-13.xml_and_html.schedule
    1785901 ingest_file_result_fulltext_urls.2022-01-13.xml_and_html.schedule

Added to `JOURNALS-PATCH-CRAWL-2022-01`.

## Seeds: most doi.org terminal non-success

Unless it is a 404, should retry. TODO: generate this list (a sketch of the
export query is at the end of this file).

## Non-OA DOI Bulk Ingest

Had previously run:

    zcat ingest_nonoa_doi.json.gz \
        | rg -v "doi.org/10.2139/" \
        | rg -v "doi.org/10.1021/" \
        | rg -v "doi.org/10.1121/" \
        | rg -v "doi.org/10.1515/" \
        | rg -v "doi.org/10.1093/" \
        | rg -v "europepmc.org" \
        | pv -l \
        | gzip \
        > nonoa_doi.filtered.ingests.json.gz
    # 7.35M 0:01:13 [99.8k/s]

Starting a bulk ingest of these on 2022-03-18, which is *before* the crawl has
entirely finished, but after almost all queues (domains) have been done for
several days:

    zcat nonoa_doi.filtered.ingests.json.gz \
        | rg -v "\\\\" \
        | jq . -c \
        | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1

Looks like many jstage URLs have `no-capture` status; these are still (slowly)
crawling.
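A cheap way to watch these statuses roll in without hitting the database is to
tail the results topic directly. A sketch, assuming kafkacat's negative-offset
syntax ("last N messages per partition"); broker and topic match the consumer
command earlier in this file:

    # rough status distribution over the most recent results
    kafkacat -C -b wbgrp-svc284.us.archive.org:9092 -t sandcrawler-prod.ingest-file-results -o -10000 -e \
        | jq .status -r \
        | sort \
        | uniq -c \
        | sort -nr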
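And for the doi.org terminal non-success TODO above, the export would
presumably follow the same shape as the other seed queries. A rough, untested
sketch; it assumes the `terminal_status_code` column carries the HTTP status
(so plain 404s can be excluded), and `<date>` is a placeholder:

    COPY (
        SELECT row_to_json(ingest_request.*)
        FROM ingest_request
        LEFT JOIN ingest_file_result ON
            ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_file_result.status != 'success'
            -- requests that never got past doi.org
            AND ingest_file_result.terminal_url LIKE '%//doi.org/%'
            -- skip hard 404s; everything else is worth a retry
            AND (
                ingest_file_result.terminal_status_code IS NULL
                OR ingest_file_result.terminal_status_code != 404
            )
    ) TO '/srv/sandcrawler/tasks/retry_doiorg_terminal.<date>.rows.json';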