Primary Goal: start a large crawl of OAI landing pages that we haven't seen

Fields of interest for ingest:

- oai identifier
- doi
- formats
- urls (maybe also "relations")
- types (type+stage)

## Other Tasks

About 150 million total lines.

Types coverage:

    zstdcat oai.ndjson.zst | pv -l | jq "select(.types != null) | .types[]" -r | sort -S 5G | uniq -c | sort -nr -S 1G > types_counts.txt

Dump all ISSNs, with counts; quick check how many are in chocula/fatcat:

    zstdcat oai.ndjson.zst | pv -l | jq "select(.issn != null) | .issn[]" -r | sort -S 5G | uniq -c | sort -nr -S 1G > issn_counts.txt

Language coverage:

    zstdcat oai.ndjson.zst | pv -l | jq "select(.languages != null) | .languages[]" -r | sort -S 5G | uniq -c | sort -nr -S 1G > languages_counts.txt

Format coverage:

    zstdcat oai.ndjson.zst | pv -l | jq "select(.formats != null) | .formats[]" -r | sort -S 5G | uniq -c | sort -nr -S 1G > formats_counts.txt
    => 150M 0:56:14 [44.7k/s]

Have a DOI?

    zstdcat oai.ndjson.zst | pv -l | rg '"doi":' | rg '"10.' | wc -l
    => 16,013,503

    zstdcat oai.ndjson.zst | pv -l | jq "select(.doi != null) | .doi[]" -r | sort -u -S 5G > doi_raw.txt
    => 11,940,950

## Transform, Load, Bulk Ingest

    zstdcat oai.ndjson.zst | ./oai2ingestrequest.py - | pv -l | gzip > oai.202002.requests.json.gz
    => 80M 6:36:55 [3.36k/s]

    time zcat /schnell/oai-pmh/oai.202002.requests.json.gz | pv -l | ./persist_tool.py ingest-request -
    => 80M 4:00:21 [5.55k/s]
    => Worker: Counter({'total': 80013963, 'insert-requests': 51169081, 'update-requests': 0})
    => JSON lines pushed: Counter({'pushed': 80013963, 'total': 80013963})
    => real    240m21.207s
    => user    85m12.576s
    => sys     3m29.580s

    select count(*) from ingest_request where ingest_type = 'pdf' and link_source = 'oai';
    => 51,185,088

Why were so many (roughly 30 million) requests skipped? Were they just not unique?

    zcat oai.202002.requests.json.gz | jq '[.link_source_id, .base_url]' -c | sort -u -S 4G | wc -l
    => 51,185,088

    zcat oai.202002.requests.json.gz | jq .base_url -r | pv -l | sort -u -S 4G > request_url.txt
    wc -l request_url.txt
    => 50,002,674 request_url.txt

    zcat oai.202002.requests.json.gz | jq .link_source_id -r | pv -l | sort -u -S 4G > requires_oai.txt
    wc -l requires_oai.txt
    => 34,622,083 requires_oai.txt

Yup, tons of duplication. And remember this de-duplication is on exact URL, not SURT or similar.
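So the gap is explained by de-duplication: uniqueness is on the (link_source_id, base_url) pair, which matches the 51,185,088 unique pairs counted above. A minimal stand-alone sketch of the same uniqueness check (not `persist_tool.py` itself, and assuming only the `link_source_id` and `base_url` fields used in the `jq` commands):

    # Sketch only: count unique (link_source_id, base_url) pairs in the
    # request dump, mirroring the jq/sort/uniq pipelines above.
    import gzip
    import json

    def count_unique_requests(path):
        pairs = set()
        with gzip.open(path, "rt") as f:
            for line in f:
                req = json.loads(line)
                pairs.add((req.get("link_source_id"), req.get("base_url")))
        return len(pairs)

    if __name__ == "__main__":
        # would be expected to print 51,185,088 for oai.202002.requests.json.gz
        print(count_unique_requests("oai.202002.requests.json.gz"))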
How many of these are URLs we have seen and ingested already?

    SELECT ingest_file_result.status, COUNT(*)
    FROM ingest_request
    LEFT JOIN ingest_file_result
        ON ingest_file_result.ingest_type = ingest_request.ingest_type
        AND ingest_file_result.base_url = ingest_request.base_url
    WHERE
        ingest_request.ingest_type = 'pdf'
        AND ingest_request.link_source = 'oai'
    GROUP BY status
    ORDER BY COUNT DESC
    LIMIT 20;

             status           |  count
    --------------------------+----------
                              | 49491452
     success                  |  1469113
     no-capture               |   134611
     redirect-loop            |    59666
     no-pdf-link              |     8947
     cdx-error                |     7561
     terminal-bad-status      |     6704
     null-body                |     5042
     wrong-mimetype           |      879
     wayback-error            |      722
     petabox-error            |      198
     gateway-timeout          |       86
     link-loop                |       51
     invalid-host-resolution  |       24
     spn2-cdx-lookup-failure  |       22
     spn2-error               |        4
     bad-gzip-encoding        |        4
     spn2-error:job-failed    |        2
    (18 rows)

Dump ingest requests:

    COPY (
        SELECT row_to_json(ingest_request.*)
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_request.link_source = 'oai'
            AND date(ingest_request.created) > '2020-05-01'
            AND ingest_file_result.status IS NULL
    ) TO '/grande/snapshots/oai_noingest_20200506.rows.json';
    => COPY 49491452

WARNING: should have transformed from rows to requests here

    cat /grande/snapshots/oai_noingest_20200506.rows.json \
        | rg -v "\\\\" \
        | jq . -c \
        | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1

## Crawl and re-ingest

Updated stats after ingest (NOTE: the ingest requests were not really formed correctly, but that doesn't matter because fatcat wasn't importing these anyways):

    SELECT ingest_file_result.status, COUNT(*)
    FROM ingest_request
    LEFT JOIN ingest_file_result
        ON ingest_file_result.ingest_type = ingest_request.ingest_type
        AND ingest_file_result.base_url = ingest_request.base_url
    WHERE
        ingest_request.ingest_type = 'pdf'
        AND ingest_request.link_source = 'oai'
    GROUP BY status
    ORDER BY COUNT DESC
    LIMIT 20;

             status           |  count
    --------------------------+----------
     no-capture               | 42565875
     success                  |  5227609
     no-pdf-link              |  2156341
     redirect-loop            |   559721
     cdx-error                |   260446
     wrong-mimetype           |   148871
     terminal-bad-status      |   109725
     link-loop                |    92792
     null-body                |    30688
                              |    15287
     petabox-error            |    11109
     wayback-error            |     6261
     skip-url-blocklist       |      184
     gateway-timeout          |       86
     bad-gzip-encoding        |       25
     invalid-host-resolution  |       24
     spn2-cdx-lookup-failure  |       22
     bad-redirect             |       15
     spn2-error               |        4
     spn2-error:job-failed    |        2
    (20 rows)

Dump again for crawling:

    COPY (
        SELECT row_to_json(ingest_request.*)
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_request.link_source = 'oai'
            AND date(ingest_request.created) > '2020-05-01'
            AND (ingest_file_result.status = 'no-capture' OR ingest_file_result.status = 'cdx-error')
    ) TO '/grande/snapshots/oai_tocrawl_20200526.rows.json';

Notes about crawl setup are in the `journal-crawls` repo. Excluded the following domains:

    4876135 www.kb.dk               REMOVE: too large and generic
    3110009 kb-images.kb.dk         REMOVE: dead?
    1274638 mdz-nbn-resolving.de    REMOVE: maybe broken
     982312 aggr.ukm.um.si          REMOVE: maybe broken

Filtering and de-duplicating took the list from about 42,826,313 rows down to 31,773,874 unique URLs to crawl, so we should expect at least 11,052,439 requests to remain as `no-capture` ingest results (and should probably filter for these, or even delete them from the ingest request table).
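For reference, a rough sketch (not the actual crawl-prep tooling from `journal-crawls`) of the domain-exclusion and URL de-duplication step, assuming the `base_url` field from the row dump above:

    # Sketch only: filter the row dump for excluded domains and count
    # the unique URLs that would actually be crawled.
    import json
    from urllib.parse import urlparse

    EXCLUDE_DOMAINS = {
        "www.kb.dk",
        "kb-images.kb.dk",
        "mdz-nbn-resolving.de",
        "aggr.ukm.um.si",
    }

    def crawl_urls(row_file):
        seen = set()
        with open(row_file) as f:
            for line in f:
                row = json.loads(line)
                url = row["base_url"]
                domain = urlparse(url).netloc.lower()
                if domain in EXCLUDE_DOMAINS:
                    continue
                seen.add(url)
        return seen

    if __name__ == "__main__":
        urls = crawl_urls("/grande/snapshots/oai_tocrawl_20200526.rows.json")
        print(len(urls))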
## Post-ingest stats

    SELECT ingest_file_result.status, COUNT(*)
    FROM ingest_request
    LEFT JOIN ingest_file_result
        ON ingest_file_result.ingest_type = ingest_request.ingest_type
        AND ingest_file_result.base_url = ingest_request.base_url
    WHERE
        ingest_request.ingest_type = 'pdf'
        AND ingest_request.link_source = 'oai'
    GROUP BY status
    ORDER BY COUNT DESC
    LIMIT 20;

             status           |  count
    --------------------------+----------
     no-capture               | 16804277
     no-pdf-link              | 14895249
     success                  | 13898603
     redirect-loop            |  2709730
     cdx-error                |   827024
     terminal-bad-status      |   740037
     wrong-mimetype           |   604242
     link-loop                |   532553
     null-body                |    95721
     wayback-error            |    41864
     petabox-error            |    19204
                              |    15287
     gateway-timeout          |      510
     bad-redirect             |      318
     skip-url-blocklist       |      184
     bad-gzip-encoding        |      114
     timeout                  |       78
     spn2-cdx-lookup-failure  |       59
     invalid-host-resolution  |       19
     blocked-cookie           |        6
    (20 rows)

Hrm: about 8 million more 'success' results, but that is still a lot of no-capture. It may be worth dumping the full kafka result topic, filtering to OAI requests, and extracting the missing URLs.

Top counts by OAI prefix:

    SELECT
        oai_prefix,
        COUNT(CASE WHEN status = 'success' THEN 1 END) AS success,
        COUNT(*) AS total
    FROM (
        SELECT
            ingest_file_result.status AS status,
            -- eg "oai:cwi.nl:4881"
            substring(ingest_request.link_source_id FROM 'oai:([^:]+):.*') AS oai_prefix
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_request.link_source = 'oai'
    ) t1
    GROUP BY oai_prefix
    ORDER BY total DESC
    LIMIT 25;

            oai_prefix         | success |  total
    ---------------------------+---------+---------
     kb.dk                     |       0 | 7989412  (excluded)
     repec                     | 1118591 | 2783448
     bnf.fr                    |       0 | 2187277
     hispana.mcu.es            |   19404 | 1492639
     bdr.oai.bsb-muenchen.de   |      73 | 1319882  (excluded?)
     hal                       |  564700 | 1049607
     ukm.si                    |       0 |  982468  (excluded)
     hsp.org                   |       0 |  810281
     www.irgrid.ac.cn          |   17578 |  748828
     cds.cern.ch               |   72811 |  688091
     americanae.aecid.es       |   69678 |  572792
     biodiversitylibrary.org   |    2121 |  566154
     juser.fz-juelich.de       |   22777 |  518551
     espace.library.uq.edu.au  |    6494 |  508960
     igi.indrastra.com         |   58689 |  478577
     archive.ugent.be          |   63654 |  424014
     hrcak.srce.hr             |  395031 |  414897
     zir.nsk.hr                |  153889 |  397200
     renati.sunedu.gob.pe      |   78399 |  388355
     hypotheses.org            |       3 |  374296
     rour.neicon.ru            |    7963 |  354529
     generic.eprints.org       |  261221 |  340470
     invenio.nusl.cz           |    6184 |  325867
     evastar-karlsruhe.de      |   62044 |  317952
     quod.lib.umich.edu        |       5 |  309135
    (25 rows)
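For reference, a small Python sketch of the same prefix extraction that the SQL `substring()` above does, reading request JSON lines on stdin; `link_source_id` is the same field used in the SQL, and this is not part of the existing tooling:

    # Tally OAI prefixes, eg "oai:cwi.nl:4881" -> "cwi.nl"
    import json
    import re
    import sys
    from collections import Counter

    OAI_PREFIX_RE = re.compile(r"^oai:([^:]+):")

    def oai_prefix(link_source_id):
        # same extraction as substring(link_source_id FROM 'oai:([^:]+):.*')
        m = OAI_PREFIX_RE.match(link_source_id or "")
        return m.group(1) if m else None

    if __name__ == "__main__":
        counts = Counter(
            oai_prefix(json.loads(line).get("link_source_id")) for line in sys.stdin
        )
        for prefix, count in counts.most_common(25):
            print(f"{count}\t{prefix}")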
Top counts by OAI prefix and status:

    SELECT
        oai_prefix,
        status,
        COUNT((oai_prefix, status))
    FROM (
        SELECT
            ingest_file_result.status AS status,
            -- eg "oai:cwi.nl:4881"
            substring(ingest_request.link_source_id FROM 'oai:([^:]+):.*') AS oai_prefix
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_request.link_source = 'oai'
    ) t1
    GROUP BY oai_prefix, status
    ORDER BY COUNT DESC
    LIMIT 30;

            oai_prefix         |    status     |  count
    ---------------------------+---------------+---------
     kb.dk                     | no-capture    | 7955231  (excluded)
     bdr.oai.bsb-muenchen.de   | no-capture    | 1270209  (excluded?)
     repec                     | success       | 1118591
     hispana.mcu.es            | no-pdf-link   | 1118092
     bnf.fr                    | no-capture    | 1100591
     ukm.si                    | no-capture    |  976004  (excluded)
     hsp.org                   | no-pdf-link   |  773496
     repec                     | no-pdf-link   |  625629
     bnf.fr                    | no-pdf-link   |  607813
     hal                       | success       |  564700
     biodiversitylibrary.org   | no-pdf-link   |  531409
     cds.cern.ch               | no-capture    |  529842
     repec                     | redirect-loop |  504393
     juser.fz-juelich.de       | no-pdf-link   |  468813
     bnf.fr                    | redirect-loop |  436087
     americanae.aecid.es       | no-pdf-link   |  409954
     hrcak.srce.hr             | success       |  395031
     www.irgrid.ac.cn          | no-pdf-link   |  362087
     hal                       | no-pdf-link   |  352111
     www.irgrid.ac.cn          | no-capture    |  346963
     espace.library.uq.edu.au  | no-pdf-link   |  315302
     igi.indrastra.com         | no-pdf-link   |  312087
     repec                     | no-capture    |  309882
     invenio.nusl.cz           | no-pdf-link   |  302657
     hypotheses.org            | no-pdf-link   |  298750
     rour.neicon.ru            | redirect-loop |  291922
     renati.sunedu.gob.pe      | no-capture    |  276388
     t2r2.star.titech.ac.jp    | no-pdf-link   |  264109
     generic.eprints.org       | success       |  261221
     quod.lib.umich.edu        | no-pdf-link   |  253937
    (30 rows)

If we remove excluded prefixes, and some large/generic prefixes (bnf.fr, hispana.mcu.es, hsp.org), then the aggregate counts are:

    no-capture  | 16,804,277 -> 5,502,242
    no-pdf-link | 14,895,249 -> 12,395,848

Top status by terminal domain:

    SELECT domain, status, COUNT((domain, status))
    FROM (
        SELECT
            ingest_file_result.ingest_type,
            ingest_file_result.status,
            substring(ingest_file_result.terminal_url FROM '[^/]+://([^/]*)') AS domain
        FROM ingest_file_result
        LEFT JOIN ingest_request
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_file_result.ingest_type = 'pdf'
            AND ingest_request.link_source = 'oai'
    ) t1
    WHERE t1.domain != ''
    GROUP BY domain, status
    ORDER BY COUNT DESC
    LIMIT 30;

                  domain               |    status     | count
    -----------------------------------+---------------+--------
     hispana.mcu.es                    | no-pdf-link   | 709701  (national scope)
     gallica.bnf.fr                    | no-pdf-link   | 601193  (national scope)
     discover.hsp.org                  | no-pdf-link   | 524212  (historical)
     www.biodiversitylibrary.org       | no-pdf-link   | 479288
     gallica.bnf.fr                    | redirect-loop | 435981  (national scope)
     hrcak.srce.hr                     | success       | 389673
     hemerotecadigital.bne.es          | no-pdf-link   | 359243
     juser.fz-juelich.de               | no-pdf-link   | 345112
     espace.library.uq.edu.au          | no-pdf-link   | 304299
     invenio.nusl.cz                   | no-pdf-link   | 302586
     igi.indrastra.com                 | no-pdf-link   | 292006
     openrepository.ru                 | redirect-loop | 291555
     hal.archives-ouvertes.fr          | success       | 278134
     t2r2.star.titech.ac.jp            | no-pdf-link   | 263971
     bib-pubdb1.desy.de                | no-pdf-link   | 254879
     quod.lib.umich.edu                | no-pdf-link   | 250382
     encounters.hsp.org                | no-pdf-link   | 248132
     americanae.aecid.es               | no-pdf-link   | 245295
     www.irgrid.ac.cn                  | no-pdf-link   | 242496
     publikationen.bibliothek.kit.edu  | no-pdf-link   | 222041
     www.sciencedirect.com             | no-pdf-link   | 211756
     dialnet.unirioja.es               | redirect-loop | 203615
     edoc.mpg.de                       | no-pdf-link   | 195526
     bibliotecadigital.jcyl.es         | no-pdf-link   | 184671
     hal.archives-ouvertes.fr          | no-pdf-link   | 183809
     www.sciencedirect.com             | redirect-loop | 173439
     lup.lub.lu.se                     | no-pdf-link   | 165788
     orbi.uliege.be                    | no-pdf-link   | 158313
     www.erudit.org                    | success       | 155986
     lib.dr.iastate.edu                | success       | 153384
    (30 rows)
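A rough client-side equivalent of the terminal-domain/status tally, in case the same breakdown is wanted from a JSON dump instead of SQL; the `status` and `terminal_url` field names are assumed to match the SQL columns above, and this is only a sketch:

    # Tally (terminal domain, status) pairs from ingest result JSON lines on stdin
    import json
    import sys
    from collections import Counter
    from urllib.parse import urlparse

    counts = Counter()
    for line in sys.stdin:
        result = json.loads(line)
        # equivalent of substring(terminal_url FROM '[^/]+://([^/]*)')
        domain = urlparse(result.get("terminal_url") or "").netloc
        if domain:
            counts[(domain, result.get("status"))] += 1

    for (domain, status), count in counts.most_common(30):
        print(f"{count}\t{domain}\t{status}")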
Follow-ups are TBD but could include:

- crawling the ~5m `no-capture` links directly (eg, the actual links from the ingest result JSON, not `base_url`), while retaining the ingest requests for later re-ingest (see the sketch after this list)
- investigating and iterating on PDF link extraction, both for large platforms and for URLs randomly sampled from the long tail
- classifying OAI prefixes by type (subject repository, institutional repository, journal, national library, historical docs, greylit, law, etc)
- running pdftrio over some or all of this corpus
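For the first follow-up, a hedged sketch of pulling direct-crawl candidate URLs out of an ingest-result JSON dump; the `status`, `terminal_url`, and `base_url` field names are assumptions based on the SQL columns used above, not a confirmed schema:

    # Sketch only: emit de-duplicated crawl candidate URLs for
    # 'no-capture' results, preferring the URL the worker actually hit.
    import json
    import sys

    def crawl_candidates(lines):
        for line in lines:
            result = json.loads(line)
            if result.get("status") != "no-capture":
                continue
            # prefer the terminal URL, falling back to the original base_url
            yield result.get("terminal_url") or result.get("base_url")

    if __name__ == "__main__":
        seen = set()
        for url in crawl_candidates(sys.stdin):
            if url and url not in seen:
                seen.add(url)
                print(url)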