Notes here about re-ingesting or re-crawling large batches. The goal around the
end of 2020 is to generate a broad patch crawl of terminal no-capture attempts
for all major sources crawled thus far. Have already tried running this process
for unpaywall.

For each, want filtered ingest request JSON objects (filtering out platforms
that don't crawl well, and possibly things like figshare+zenodo), and a broader
seedlist (including terminal URLs). Will de-dupe all the seedlist URLs and do a
heritrix crawl with new config, then re-ingest all the requests individually.
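
A rough sketch of the seedlist half of this, pulling terminal URLs out of
`ingest_file_result` (the `terminal_url` column name and the output path are
assumptions here; adjust to the actual schema):

    COPY (
        SELECT DISTINCT ingest_file_result.terminal_url
        FROM ingest_file_result
        WHERE
            ingest_file_result.status = 'no-capture'
            AND ingest_file_result.terminal_url IS NOT NULL
    ) TO '/grande/snapshots/patch_crawl_seedlist_2020-12-08.txt';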

Summary of what to do here:

    OA DOI: expecting some 2.4 million seeds
    OAI-PMH: expecting some 5 million no-capture URLs, plus more from PDF URLs not being found (`no-pdf-link`)
    Unpaywall: another ~900k no-capture URLs (maybe filtered?)

For all sources, re-attempt requests with these status codes (a quick sizing query follows the lists below):

     no-capture
     cdx-error
     wayback-error
     petabox-error
     gateway-timeout (?)

And at least do a bulk re-ingest for this status, if updated before 2020-11-20 or so:

     no-pdf-link
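
Before kicking these off, something like the following could size the backlog
per status (a sketch against the `ingest_file_result` table used in the
queries below):

    SELECT ingest_file_result.status, COUNT(*)
    FROM ingest_file_result
    WHERE ingest_file_result.status IN
        ('no-capture', 'cdx-error', 'wayback-error', 'petabox-error', 'gateway-timeout')
    GROUP BY ingest_file_result.status;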

## OAI-PMH

Need to re-ingest all of the (many!) no-capture and no-pdf-link requests; a
filtered dump sketch follows the skip lists below.

TODO: repec-specific URL extraction?

Skip these OAI prefixes:

     kb.dk
     bnf.fr
     hispana.mcu.es
     bdr.oai.bsb-muenchen.de
     ukm.si
     hsp.org

Skip these domains:

    www.kb.dk (kb.dk)
    kb-images.kb.dk (kb.dk)
    mdz-nbn-resolving.de (TODO: what prefix?)
    aggr.ukm.um.si (ukm.si)

Check PDF link extraction for these prefixes, or skip them (TODO):

    repec (mixed success)
    biodiversitylibrary.org
    juser.fz-juelich.de
    americanae.aecid.es
    www.irgrid.ac.cn
    hal
    espace.library.uq.edu.au
    igi.indrastra.com
    invenio.nusl.cz
    hypotheses.org
    t2r2.star.titech.ac.jp
    quod.lib.umich.edu

    domain: hemerotecadigital.bne.es
    domain: bib-pubdb1.desy.de
    domain: publikationen.bibliothek.kit.edu
    domain: edoc.mpg.de
    domain: bibliotecadigital.jcyl.es
    domain: lup.lub.lu.se
    domain: orbi.uliege.be
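
The filtered dump could look something like this sketch, modeled on the
Unpaywall query below; the `link_source = 'oai'` value and the
`oai:<prefix>:` format of `link_source_id` are assumptions about how OAI-PMH
requests are recorded:

    COPY (
        SELECT row_to_json(ingest_request.*)
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_request.link_source = 'oai'
            AND ingest_file_result.status IN ('no-capture', 'no-pdf-link')
            AND ingest_request.link_source_id NOT LIKE 'oai:kb.dk:%'
            AND ingest_request.link_source_id NOT LIKE 'oai:bnf.fr:%'
            AND ingest_request.link_source_id NOT LIKE 'oai:hispana.mcu.es:%'
            AND ingest_request.link_source_id NOT LIKE 'oai:bdr.oai.bsb-muenchen.de:%'
            AND ingest_request.link_source_id NOT LIKE 'oai:ukm.si:%'
            AND ingest_request.link_source_id NOT LIKE 'oai:hsp.org:%'
            AND ingest_request.base_url NOT LIKE '%www.kb.dk%'
            AND ingest_request.base_url NOT LIKE '%kb-images.kb.dk%'
            AND ingest_request.base_url NOT LIKE '%mdz-nbn-resolving.de%'
            AND ingest_request.base_url NOT LIKE '%aggr.ukm.um.si%'
    ) TO '/grande/snapshots/oai_nocapture_2020-12-08.rows.json';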

TODO:
- consider deleting ingest requests from the skipped prefixes entirely (they take up significant database space)
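
If we do go the deletion route, a minimal sketch (same `link_source_id` format
assumption as above; one prefix shown, repeat per skipped prefix):

    DELETE FROM ingest_request
    WHERE link_source = 'oai'
        AND link_source_id LIKE 'oai:kb.dk:%';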


## Unpaywall

About 900k `no-capture`, and up to 2.5 million more `no-pdf-link`.

Re-bulk-ingest filtered requests which hit `no-pdf-link` before 2020-11-20:

    COPY (
        SELECT row_to_json(ingest_request.*)
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_request.link_source = 'unpaywall'
            AND date(ingest_request.created) < '2020-11-20'
            AND ingest_file_result.status = 'no-pdf-link'
            AND ingest_request.base_url NOT LIKE '%journals.sagepub.com%'
            AND ingest_request.base_url NOT LIKE '%pubs.acs.org%'
            AND ingest_request.base_url NOT LIKE '%ahajournals.org%'
            AND ingest_request.base_url NOT LIKE '%www.journal.csj.jp%'
            AND ingest_request.base_url NOT LIKE '%aip.scitation.org%'
            AND ingest_request.base_url NOT LIKE '%academic.oup.com%'
            AND ingest_request.base_url NOT LIKE '%tandfonline.com%'
            AND ingest_request.base_url NOT LIKE '%://archive.org/%'
            AND ingest_request.base_url NOT LIKE '%://web.archive.org/%'
            AND ingest_request.base_url NOT LIKE '%://www.archive.org/%'
    ) TO '/grande/snapshots/unpaywall_nopdflink_2020-12-08.rows.json';
    => COPY 1309990

    ./scripts/ingestrequest_row2json.py /grande/snapshots/unpaywall_nopdflink_2020-12-08.rows.json | pv -l | shuf > /grande/snapshots/unpaywall_nopdflink_2020-12-08.ingest_request.json
    => 1.31M 0:00:51 [25.6k/s]

    cat /grande/snapshots/unpaywall_nopdflink_2020-12-08.ingest_request.json | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1