New snapshot released 2020-10-09. Want to do a mostly straightforward
load/ingest/crawl.

Proposed changes this time around:

- have bulk ingest store missing URLs with `no-capture` status in sandcrawler-db,
  and include those URLs in the heritrix3 crawl
- tweak heritrix3 config for additional PDF URL extraction patterns,
  particularly to improve OJS yield (see the sketch below)
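
As an illustration of the OJS pattern in question (a sketch, not the actual
heritrix3 config change): OJS galley landing pages typically look like
`.../article/view/<articleId>/<galleyId>`, while the direct PDF usually lives
at the matching `.../article/download/<articleId>/<galleyId>` URL.

    # illustration only, not the heritrix3 config itself: map an OJS galley
    # "view" URL to the corresponding direct "download" URL
    echo 'https://example.com/index.php/journal/article/view/123/456' \
        | sed -E 's|/article/view/([0-9]+)/([0-9]+)|/article/download/\1/\2|'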


## Transform and Load

    # in sandcrawler pipenv on aitio
    zcat /schnell/unpaywall/unpaywall_snapshot_2020-10-09T153852.jsonl.gz | ./scripts/unpaywall2ingestrequest.py - | pv -l > /grande/snapshots/unpaywall_snapshot_2020-10-09.ingest_request.json
    => 28.3M 3:19:03 [2.37k/s]

    cat /grande/snapshots/unpaywall_snapshot_2020-10-09.ingest_request.json | pv -l | ./persist_tool.py ingest-request -
    => 28.3M 1:11:29 [ 6.6k/s]
    => Worker: Counter({'total': 28298500, 'insert-requests': 4119939, 'update-requests': 0})
    => JSON lines pushed: Counter({'total': 28298500, 'pushed': 28298500})
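
For reference, a quick spot-check of the transform output (a sketch;
`base_url`, `ingest_type`, `link_source`, and `release_stage` are the fields
relied on in the queries below, and the full schema may contain more):

    head -n1 /grande/snapshots/unpaywall_snapshot_2020-10-09.ingest_request.json | jq .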

## Dump new URLs, Transform, Bulk Ingest

    COPY (
        SELECT row_to_json(ingest_request.*)
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_request.link_source = 'unpaywall'
            -- AND date(ingest_request.created) > '2020-10-09'
            AND (ingest_file_result.status IS NULL
                OR ingest_file_result.status = 'no-capture')
    ) TO '/grande/snapshots/unpaywall_noingest_2020-10-09.rows.json';
    => COPY 4216339

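    # shuf randomizes the request order before writing, presumably to spread
    # bulk ingest load across many different hosts at once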
    ./scripts/ingestrequest_row2json.py /grande/snapshots/unpaywall_noingest_2020-10-09.rows.json | pv -l | shuf > /grande/snapshots/unpaywall_noingest_2020-10-09.ingest_request.json
    => 4.22M 0:02:48 [  25k/s]

Start small, to test no-capture behavior:

    cat /grande/snapshots/unpaywall_noingest_2020-10-09.ingest_request.json | head -n1000 | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1

`no-capture` change looks good. Enqueue the whole batch:

    cat /grande/snapshots/unpaywall_noingest_2020-10-09.ingest_request.json | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1


## Check Pre-Crawl Status

    SELECT ingest_file_result.status, COUNT(*)
    FROM ingest_request
    LEFT JOIN ingest_file_result
        ON ingest_file_result.ingest_type = ingest_request.ingest_type
        AND ingest_file_result.base_url = ingest_request.base_url
    WHERE 
        ingest_request.ingest_type = 'pdf'
        AND ingest_request.link_source = 'unpaywall'
    GROUP BY status
    ORDER BY COUNT DESC
    LIMIT 20;


             status          |  count   
    -------------------------+----------
     success                 | 23661282
     no-capture              |  3015447
     no-pdf-link             |  2302102
     redirect-loop           |  1542566
     terminal-bad-status     |  1044676
     wrong-mimetype          |   114315
     link-loop               |    36358
     cdx-error               |    20150
     null-body               |    14513
     wayback-error           |    13644
     gateway-timeout         |     3776
     spn2-cdx-lookup-failure |     1260
     petabox-error           |     1171
     redirects-exceeded      |      752
     invalid-host-resolution |      464
     spn2-error              |      147
     bad-redirect            |      131
     spn2-error:job-failed   |       91
     wayback-content-error   |       45
     timeout                 |       19
    (20 rows)

## Dump Seedlist

Dump rows worth re-crawling (`no-capture` plus transient error statuses),
excluding several large publisher domains:

    COPY (
        SELECT row_to_json(t1.*)
        FROM (
            SELECT ingest_request.*, ingest_file_result as result
            FROM ingest_request
            LEFT JOIN ingest_file_result
                ON ingest_file_result.ingest_type = ingest_request.ingest_type
                AND ingest_file_result.base_url = ingest_request.base_url
            WHERE
                ingest_request.ingest_type = 'pdf'
                AND ingest_request.link_source = 'unpaywall'
                AND (ingest_file_result.status = 'no-capture'
                    OR ingest_file_result.status = 'cdx-error'
                    OR ingest_file_result.status = 'wayback-error'
                    OR ingest_file_result.status = 'gateway-timeout'
                    OR ingest_file_result.status = 'spn2-cdx-lookup-failure'
                )
                AND ingest_request.base_url NOT LIKE '%journals.sagepub.com%'
                AND ingest_request.base_url NOT LIKE '%pubs.acs.org%'
                AND ingest_request.base_url NOT LIKE '%ahajournals.org%'
                AND ingest_request.base_url NOT LIKE '%www.journal.csj.jp%'
                AND ingest_request.base_url NOT LIKE '%aip.scitation.org%'
                AND ingest_request.base_url NOT LIKE '%academic.oup.com%'
                AND ingest_request.base_url NOT LIKE '%tandfonline.com%'
                AND ingest_file_result.terminal_url NOT LIKE '%journals.sagepub.com%'
                AND ingest_file_result.terminal_url NOT LIKE '%pubs.acs.org%'
                AND ingest_file_result.terminal_url NOT LIKE '%ahajournals.org%'
                AND ingest_file_result.terminal_url NOT LIKE '%www.journal.csj.jp%'
                AND ingest_file_result.terminal_url NOT LIKE '%aip.scitation.org%'
                AND ingest_file_result.terminal_url NOT LIKE '%academic.oup.com%'
                AND ingest_file_result.terminal_url NOT LIKE '%tandfonline.com%'
        ) t1
    ) TO '/grande/snapshots/unpaywall_seedlist_2020-11-02.rows.json';
    => 2,936,404

    # TODO: in the future also exclude "www.archive.org"

Prep ingest requests (for post-crawl re-ingest; these get enqueued in the
Bulk Ingest section below):

    ./scripts/ingestrequest_row2json.py /grande/snapshots/unpaywall_seedlist_2020-11-02.rows.json | pv -l > /grande/snapshots/unpaywall_crawl_ingest_2020-11-02.json

And actually dump seedlist(s):

    cat /grande/snapshots/unpaywall_seedlist_2020-11-02.rows.json | jq -r .base_url | sort -u -S 4G > /grande/snapshots/unpaywall_seedlist_2020-11-02.url.txt
    cat /grande/snapshots/unpaywall_seedlist_2020-11-02.rows.json | rg '"no-capture"' | jq -r .result.terminal_url | rg -v ^null$ | sort -u -S 4G > /grande/snapshots/unpaywall_seedlist_2020-11-02.terminal_url.txt
    cat /grande/snapshots/unpaywall_seedlist_2020-11-02.rows.json | rg -v '"no-capture"' | jq -r .base_url | sort -u -S 4G > /grande/snapshots/unpaywall_seedlist_2020-11-02.no_terminal_url.txt

    wc -l unpaywall_seedlist_2020-11-02.*.txt
     2701178 unpaywall_seedlist_2020-11-02.terminal_url.txt
     2713866 unpaywall_seedlist_2020-11-02.url.txt

Because of session tokens like `jsessionid` in base URLs, suspect that
crawling just the terminal URLs is going to work better than crawling both
the base and terminal URLs.
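
As a quick sanity check (a sketch, not re-run as part of this log), one could
count how many URLs in each list carry a session token:

    # count URLs containing a jsessionid in the base vs terminal URL lists
    rg -ci 'jsessionid' /grande/snapshots/unpaywall_seedlist_2020-11-02.url.txt
    rg -ci 'jsessionid' /grande/snapshots/unpaywall_seedlist_2020-11-02.terminal_url.txt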

Finding that a fraction of the `no-capture` rows have partial/stub URLs as
their terminal URL.

TODO: investigate scale of partial/stub `terminal_url` (eg, not HTTP/S or FTP).
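
A rough way to measure that (a sketch; not run as part of this log):

    # count no-capture rows whose terminal_url is not a full http(s)/ftp URL
    cat /grande/snapshots/unpaywall_seedlist_2020-11-02.rows.json \
        | rg '"no-capture"' \
        | jq -r .result.terminal_url \
        | rg -v '^null$' \
        | rg -v '^(https?|ftp)://' \
        | wc -l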


## Bulk Ingest and Status

Note: removing archive.org links (per the TODO above):

    cat /grande/snapshots/unpaywall_crawl_ingest_2020-11-02.json | rg -v www.archive.org | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1

Overall status (checked 2020-12-08):

    SELECT ingest_file_result.status, COUNT(*)
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_request.link_source = 'unpaywall'
        GROUP BY status
        ORDER BY COUNT DESC
        LIMIT 20;

             status          |  count   
    -------------------------+----------
     success                 | 25004559
     no-pdf-link             |  2531841
     redirect-loop           |  1671375
     terminal-bad-status     |  1389463
     no-capture              |   893880
     wrong-mimetype          |   119332
     link-loop               |    66508
     wayback-content-error   |    30339
     cdx-error               |    21790
     null-body               |    20710
     wayback-error           |    13976
     gateway-timeout         |     3775
     petabox-error           |     2420
     spn2-cdx-lookup-failure |     1218
     redirects-exceeded      |      889
     invalid-host-resolution |      464
     bad-redirect            |      147
     spn2-error              |      112
     spn2-error:job-failed   |       91
     timeout                 |       21
    (20 rows)
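
Compared to the pre-crawl status, `success` grew from 23,661,282 to 25,004,559
(about +1.34M) and `no-capture` dropped from 3,015,447 to 893,880; most of the
remainder shifted into `no-pdf-link`, `redirect-loop`, and `terminal-bad-status`.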

Ingest stats broken down by publication stage:

    SELECT ingest_request.release_stage, ingest_file_result.status, COUNT(*)
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_request.link_source = 'unpaywall'
        GROUP BY release_stage, status
        ORDER BY release_stage, COUNT DESC
        LIMIT 100;


     release_stage |               status                |  count
    ---------------+-------------------------------------+----------
     accepted      | success                             |  1101090
     accepted      | no-pdf-link                         |    28590
     accepted      | redirect-loop                       |    10923
     accepted      | no-capture                          |     9540
     accepted      | terminal-bad-status                 |     6339
     accepted      | cdx-error                           |      952
     accepted      | wrong-mimetype                      |      447
     accepted      | link-loop                           |      275
     accepted      | wayback-error                       |      202
     accepted      | petabox-error                       |      177
     accepted      | redirects-exceeded                  |      122
     accepted      | null-body                           |       27
     accepted      | wayback-content-error               |       14
     accepted      | spn2-cdx-lookup-failure             |        5
     accepted      | gateway-timeout                     |        4
     accepted      | bad-redirect                        |        1
     published     | success                             | 18595278
     published     | no-pdf-link                         |  2434935
     published     | redirect-loop                       |  1364110
     published     | terminal-bad-status                 |  1185328
     published     | no-capture                          |   718792
     published     | wrong-mimetype                      |   112923
     published     | link-loop                           |    63874
     published     | wayback-content-error               |    30268
     published     | cdx-error                           |    17302
     published     | null-body                           |    15209
     published     | wayback-error                       |    10782
     published     | gateway-timeout                     |     1966
     published     | petabox-error                       |     1611
     published     | spn2-cdx-lookup-failure             |      879
     published     | redirects-exceeded                  |      760
     published     | invalid-host-resolution             |      453
     published     | bad-redirect                        |      115
     published     | spn2-error:job-failed               |       77
     published     | spn2-error                          |       75
     published     | timeout                             |       21
     published     | bad-gzip-encoding                   |        5
     published     | spn2-error:soft-time-limit-exceeded |        4
     published     | spn2-error:pending                  |        1
     published     | blocked-cookie                      |        1
     published     |                                     |        1
     published     | pending                             |        1
     submitted     | success                             |  5308166
     submitted     | redirect-loop                       |   296322
     submitted     | terminal-bad-status                 |   197785
     submitted     | no-capture                          |   165545
     submitted     | no-pdf-link                         |    68274
     submitted     | wrong-mimetype                      |     5962
     submitted     | null-body                           |     5474
     submitted     | cdx-error                           |     3536
     submitted     | wayback-error                       |     2992
     submitted     | link-loop                           |     2359
     submitted     | gateway-timeout                     |     1805
     submitted     | petabox-error                       |      632
     submitted     | spn2-cdx-lookup-failure             |      334
     submitted     | wayback-content-error               |       57
     submitted     | spn2-error                          |       37
     submitted     | bad-redirect                        |       31
     submitted     | spn2-error:job-failed               |       14
     submitted     |                                     |       12
     submitted     | invalid-host-resolution             |       11
     submitted     | redirects-exceeded                  |        7
     submitted     | spn2-error:soft-time-limit-exceeded |        5
     submitted     | bad-gzip-encoding                   |        1
     submitted     | skip-url-blocklist                  |        1
                   | no-pdf-link                         |       42
                   | success                             |       25
                   | redirect-loop                       |       20
                   | terminal-bad-status                 |       11
                   | no-capture                          |        3
    (70 rows)