A new snapshot was released in April 2020 (the snapshot itself is dated 2020-02-25, but was not released for more than a month).

Primary goals are:

- generate ingest requests for only *new* URLs
- bulk ingest these new URLs
- crawl any no-capture URLs from that batch
- re-bulk-ingest the no-capture batch
- analytics on failed ingests, e.g. any particular domains that are failing to crawl

This ingest pipeline was started on 2020-04-07 by bnewbold.

Ran through the first two steps again on 2020-05-03 after unpaywall had released another dump (dated 2020-04-27).

## Transform and Load

    # in sandcrawler pipenv on aitio
    zcat /schnell/UNPAYWALL-PDF-CRAWL-2020-04/unpaywall_snapshot_2020-02-25T115244.jsonl.gz | ./scripts/unpaywall2ingestrequest.py - | pv -l > /grande/snapshots/unpaywall_snapshot_2020-02-25.ingest_request.json
    => 24.7M 5:17:03 [ 1.3k/s]

    cat /grande/snapshots/unpaywall_snapshot_2020-02-25.ingest_request.json | pv -l | ./persist_tool.py ingest-request -
    => 24.7M
    => Worker: Counter({'total': 24712947, 'insert-requests': 4282167, 'update-requests': 0})

Second time:

    # in sandcrawler pipenv on aitio
    zcat /schnell/UNPAYWALL-PDF-CRAWL-2020-04/unpaywall_snapshot_2020-04-27T153236.jsonl.gz | ./scripts/unpaywall2ingestrequest.py - | pv -l > /grande/snapshots/unpaywall_snapshot_2020-04-27.ingest_request.json
    => 25.2M 3:16:28 [2.14k/s]

    cat /grande/snapshots/unpaywall_snapshot_2020-04-27.ingest_request.json | pv -l | ./persist_tool.py ingest-request -
    => Worker: Counter({'total': 25189390, 'insert-requests': 1408915, 'update-requests': 0})
    => JSON lines pushed: Counter({'pushed': 25189390, 'total': 25189390})

## Dump new URLs and Bulk Ingest

    COPY (
        SELECT row_to_json(ingest_request.*)
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_request.link_source = 'unpaywall'
            AND date(ingest_request.created) > '2020-04-01'
            AND ingest_file_result.status IS NULL
    ) TO '/grande/snapshots/unpaywall_noingest_2020-04-08.rows.json';
    => 3696189

WARNING: forgot to transform from rows to ingest requests (a sketch of the missing transform is at the end of this section).

    cat /grande/snapshots/unpaywall_noingest_2020-04-08.rows.json | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1

Second time:

    COPY (
        SELECT row_to_json(ingest_request.*)
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_request.link_source = 'unpaywall'
            AND date(ingest_request.created) > '2020-05-01'
            AND ingest_file_result.status IS NULL
    ) TO '/grande/snapshots/unpaywall_noingest_2020-05-03.rows.json';
    => 1799760

WARNING: forgot to transform from rows to ingest requests.

    cat /grande/snapshots/unpaywall_noingest_2020-05-03.rows.json | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
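The conversion that should have happened here is what `./scripts/ingestrequest_row2json.py` (used later in this note) does: reshape raw `row_to_json` database rows into the ingest request documents the bulk workers expect. A minimal sketch of that reshaping, assuming the dump rows carry the `ingest_request` table columns as top-level keys plus a `request` JSON column of per-source extra fields (those column names are assumptions, not checked against the real script):

    #!/usr/bin/env python3
    # Hypothetical sketch only; the real logic lives in
    # ./scripts/ingestrequest_row2json.py.
    import json
    import sys

    def row2request(row):
        # merge the (assumed) 'request' JSON column into the top-level dict
        extra = row.pop('request', None) or {}
        if isinstance(extra, str):
            extra = json.loads(extra)
        row.update(extra)
        # the 'created' timestamp only matters inside the database
        row.pop('created', None)
        return row

    if __name__ == '__main__':
        for line in sys.stdin:
            line = line.strip()
            if line:
                print(json.dumps(row2request(json.loads(line)), sort_keys=True))

Skipping this step is why these two batches get re-done in the "Re-Ingest Post-Crawl" section below.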
## Dump no-capture, Run Crawl

Make two ingest request dumps: one with "all" URLs, which we will have heritrix attempt to crawl, and then one with certain domains filtered out, which we may or may not bother trying to ingest (due to expectation of failure).

    COPY (
        SELECT row_to_json(ingest_request.*)
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_request.link_source = 'unpaywall'
            AND date(ingest_request.created) > '2020-04-01'
            AND ingest_file_result.status = 'no-capture'
    ) TO '/grande/snapshots/unpaywall_nocapture_all_2020-05-04.rows.json';
    => 2734145

    COPY (
        SELECT row_to_json(ingest_request.*)
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_request.link_source = 'unpaywall'
            AND date(ingest_request.created) > '2020-04-01'
            AND ingest_file_result.status = 'no-capture'
            AND ingest_request.base_url NOT LIKE '%journals.sagepub.com%'
            AND ingest_request.base_url NOT LIKE '%pubs.acs.org%'
            AND ingest_request.base_url NOT LIKE '%ahajournals.org%'
            AND ingest_request.base_url NOT LIKE '%www.journal.csj.jp%'
            AND ingest_request.base_url NOT LIKE '%aip.scitation.org%'
            AND ingest_request.base_url NOT LIKE '%academic.oup.com%'
            AND ingest_request.base_url NOT LIKE '%tandfonline.com%'
    ) TO '/grande/snapshots/unpaywall_nocapture_2020-05-04.rows.json';
    => 2602408

NOTE: forgot here to transform from "rows" to ingest requests. Not actually a very significant size difference after all.

See `journal-crawls` repo for details on seedlist generation and crawling.

## Re-Ingest Post-Crawl

NOTE: if we *do* want to do cleanup eventually, could look for fatcat edits between 2020-04-01 and 2020-05-25 which have limited "extra" metadata (e.g. no evidence or `oa_status`).

The earlier bulk ingests were done wrong (forgot to transform from rows to full ingest request docs), so going to re-do those, which should be a superset of the no-capture crawl URLs:

    ./scripts/ingestrequest_row2json.py /grande/snapshots/unpaywall_noingest_2020-04-08.rows.json | pv -l > /grande/snapshots/unpaywall_noingest_2020-04-08.json
    => 1.26M 0:00:58 [21.5k/s]
    => previously: 3,696,189

    ./scripts/ingestrequest_row2json.py /grande/snapshots/unpaywall_noingest_2020-05-03.rows.json | pv -l > /grande/snapshots/unpaywall_noingest_2020-05-03.json
    => 1.26M 0:00:56 [22.3k/s]

Crap, looks like the 2020-04-08 segment got overwritten with 2020-05 data by accident. Hrm... need to re-ingest *all* recent unpaywall URLs:

    COPY (
        SELECT row_to_json(ingest_request.*)
        FROM ingest_request
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_request.link_source = 'unpaywall'
            AND date(ingest_request.created) > '2020-04-01'
    ) TO '/grande/snapshots/unpaywall_all_recent_requests_2020-05-26.rows.json';
    => COPY 5691106

    ./scripts/ingestrequest_row2json.py /grande/snapshots/unpaywall_all_recent_requests_2020-05-26.rows.json | pv -l | shuf > /grande/snapshots/unpaywall_all_recent_requests_2020-05-26.requests.json
    => 5.69M 0:04:26 [21.3k/s]

Start small:

    cat /grande/snapshots/unpaywall_all_recent_requests_2020-05-26.requests.json | head -n200 | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1

Looks good (whew), run the full thing:

    cat /grande/snapshots/unpaywall_all_recent_requests_2020-05-26.requests.json | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
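The `rg -v "\\\\" | jq . -c | kafkacat` combination used throughout drops any line containing a backslash escape, re-compacts the JSON, and produces each line to the bulk ingest topic. For reference, a rough Python equivalent of that push step (a sketch only, assuming the `confluent-kafka` client is available in the sandcrawler pipenv; this is not how it was actually run):

    # Sketch of the rg/jq/kafkacat pipeline above.
    import json
    from confluent_kafka import Producer

    producer = Producer({'bootstrap.servers': 'wbgrp-svc263.us.archive.org'})
    topic = 'sandcrawler-prod.ingest-file-requests-bulk'

    with open('/grande/snapshots/unpaywall_all_recent_requests_2020-05-26.requests.json') as f:
        for line in f:
            if '\\' in line:
                # same intent as `rg -v "\\\\"`: skip lines with backslash escapes
                continue
            # compact the JSON, like `jq . -c`
            record = json.dumps(json.loads(line), separators=(',', ':'))
            producer.produce(topic, record.encode('utf-8'))
            producer.poll(0)  # serve delivery callbacks as we go

    producer.flush()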
## Post-ingest stats (2020-08-28)

Overall status:

    SELECT ingest_file_result.status, COUNT(*)
    FROM ingest_request
    LEFT JOIN ingest_file_result
        ON ingest_file_result.ingest_type = ingest_request.ingest_type
        AND ingest_file_result.base_url = ingest_request.base_url
    WHERE
        ingest_request.ingest_type = 'pdf'
        AND ingest_request.link_source = 'unpaywall'
    GROUP BY status
    ORDER BY COUNT DESC
    LIMIT 20;

                   status                |  count
    -------------------------------------+----------
     success                             | 22063013
     no-pdf-link                         |  2192606
     redirect-loop                       |  1471135
     terminal-bad-status                 |   995106
     no-capture                          |   359440
     cdx-error                           |   358909
     wrong-mimetype                      |   111685
     wayback-error                       |    50705
     link-loop                           |    29359
     null-body                           |    13667
     gateway-timeout                     |     3689
     spn2-cdx-lookup-failure             |     1229
     petabox-error                       |     1007
     redirects-exceeded                  |      747
     invalid-host-resolution             |      464
     spn2-error                          |      107
     spn2-error:job-failed               |       91
     bad-redirect                        |       26
     spn2-error:soft-time-limit-exceeded |        9
     bad-gzip-encoding                   |        5
    (20 rows)

Failures by domain:

    SELECT domain, status, COUNT((domain, status))
    FROM (
        SELECT
            ingest_file_result.ingest_type,
            ingest_file_result.status,
            substring(ingest_file_result.terminal_url FROM '[^/]+://([^/]*)') AS domain
        FROM ingest_file_result
        LEFT JOIN ingest_request
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_file_result.ingest_type = 'pdf'
            AND ingest_request.link_source = 'unpaywall'
    ) t1
    WHERE
        t1.domain != ''
        AND t1.status != 'success'
        AND t1.status != 'no-capture'
    GROUP BY domain, status
    ORDER BY COUNT DESC
    LIMIT 30;

                  domain               |       status        | count
    -----------------------------------+---------------------+--------
     academic.oup.com                  | no-pdf-link         | 415441
     watermark.silverchair.com         | terminal-bad-status | 345937
     www.tandfonline.com               | no-pdf-link         | 262488
     journals.sagepub.com              | no-pdf-link         | 235707
     onlinelibrary.wiley.com           | no-pdf-link         | 225876
     iopscience.iop.org                | terminal-bad-status | 170783
     www.nature.com                    | redirect-loop       | 145522
     www.degruyter.com                 | redirect-loop       | 131898
     files-journal-api.frontiersin.org | terminal-bad-status | 126091
     pubs.acs.org                      | no-pdf-link         | 119223
     society.kisti.re.kr               | no-pdf-link         | 112401
     www.ahajournals.org               | no-pdf-link         | 105953
     dialnet.unirioja.es               | terminal-bad-status |  96505
     www.cell.com                      | redirect-loop       |  87560
     www.ncbi.nlm.nih.gov              | redirect-loop       |  49890
     ageconsearch.umn.edu              | redirect-loop       |  45989
     ashpublications.org               | no-pdf-link         |  45833
     pure.mpg.de                       | redirect-loop       |  45278
     www.degruyter.com                 | terminal-bad-status |  43642
     babel.hathitrust.org              | terminal-bad-status |  42057
     osf.io                            | redirect-loop       |  41119
     scialert.net                      | no-pdf-link         |  39009
     dialnet.unirioja.es               | redirect-loop       |  38839
     www.jci.org                       | redirect-loop       |  34209
     www.spandidos-publications.com    | redirect-loop       |  33167
     www.journal.csj.jp                | no-pdf-link         |  30915
     journals.openedition.org          | redirect-loop       |  30409
     www.valueinhealthjournal.com      | redirect-loop       |  30090
     dergipark.org.tr                  | no-pdf-link         |  29146
     journals.ametsoc.org              | no-pdf-link         |  29133
    (30 rows)
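The domain rollup above pulls the hostname out of `terminal_url` with a regex substring. The same aggregation can be done offline for ad-hoc checks; a sketch, assuming a JSON-lines dump of `ingest_file_result` rows with `terminal_url` and `status` fields (the `.rows.json` dumps in this note are `ingest_request` rows, so this would need a separate dump):

    # Offline version of the failures-by-domain query; sketch only.
    import json
    import sys
    from collections import Counter
    from urllib.parse import urlparse

    counts = Counter()
    for line in sys.stdin:
        row = json.loads(line)
        status = row.get('status') or ''
        domain = urlparse(row.get('terminal_url') or '').netloc
        if not domain or status in ('success', 'no-capture'):
            continue
        counts[(domain, status)] += 1

    for (domain, status), count in counts.most_common(30):
        print(f"{domain}\t{status}\t{count}")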
Enqueue internal failures for re-ingest:

    COPY (
        SELECT row_to_json(ingest_request.*)
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_request.link_source = 'unpaywall'
            AND (
                ingest_file_result.status = 'cdx-error'
                OR ingest_file_result.status = 'wayback-error'
            )
    ) TO '/grande/snapshots/unpaywall_errors_2020-08-28.rows.json';
    => 409606

    ./scripts/ingestrequest_row2json.py /grande/snapshots/unpaywall_errors_2020-08-28.rows.json | pv -l | shuf > /grande/snapshots/unpaywall_errors_2020-08-28.requests.json

    cat /grande/snapshots/unpaywall_errors_2020-08-28.requests.json | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1

And after *that* (which ran quickly):

                   status                |  count
    -------------------------------------+----------
     success                             | 22281874
     no-pdf-link                         |  2258352
     redirect-loop                       |  1499251
     terminal-bad-status                 |  1004781
     no-capture                          |   401333
     wrong-mimetype                      |   112068
     cdx-error                           |    32259
     link-loop                           |    30137
     null-body                           |    13886
     wayback-error                       |    11653
     gateway-timeout                     |     3689
     spn2-cdx-lookup-failure             |     1229
     petabox-error                       |     1036
     redirects-exceeded                  |      749
     invalid-host-resolution             |      464
     spn2-error                          |      107
     spn2-error:job-failed               |       91
     bad-redirect                        |       26
     spn2-error:soft-time-limit-exceeded |        9
     bad-gzip-encoding                   |        5
    (20 rows)

22063013 -> 22281874 = +218,861 success, not bad!
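A trivial cross-check of that delta, with the counts that moved copied from the two status tables above:

    # Before/after deltas for the statuses that changed most; counts are
    # hard-coded from the tables in this section.
    before = {'success': 22063013, 'cdx-error': 358909, 'wayback-error': 50705, 'no-capture': 359440}
    after  = {'success': 22281874, 'cdx-error': 32259,  'wayback-error': 11653, 'no-capture': 401333}

    for status in before:
        print(f"{status:15s} {after[status] - before[status]:+,}")
    # success         +218,861
    # cdx-error       -326,650
    # wayback-error   -39,052
    # no-capture      +41,893

Most of the re-ingested cdx-error and wayback-error requests converted to success; the remainder spread across no-capture and the other failure statuses.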