Want to clean up missing/partial processing (GROBID, `pdf_meta`, `file_meta`)
in the sandcrawler database.


## `pdf_meta` for petabox rows

Ran the `dump_unextracted_pdf_petabox.sql` SQL script, which produced a .json dump file.

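For reference, a dump like this would be produced by running the SQL script
against the sandcrawler Postgres database; the exact invocation below is an
assumption (the script itself may `COPY` JSON rows straight to a snapshot
file, as the `/grande/snapshots/` paths further down suggest):

    # hypothetical: run the dump script against the sandcrawler database
    psql sandcrawler < dump_unextracted_pdf_petabox.sql

The resulting snapshot:
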
    wc -l dump_unextracted_pdf_petabox.2020-07-22.json
    1503086 dump_unextracted_pdf_petabox.2020-07-22.json

Great, 1.5 million, not too many. Start small:

    head -n1000 dump_unextracted_pdf_petabox.2020-07-22.json | rg -v "\\\\" | jq . -c | pv -l | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.unextracted -p -1

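Before pushing the full batch, one could spot-check that the test messages
actually landed on the topic; this consumer invocation is a sketch, not from
the original run:

    # consume recent messages (relative offset from the end of each partition)
    # and count them
    kafkacat -C -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.unextracted \
        -o -1000 -e | wc -l
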
Full batch:

    cat dump_unextracted_pdf_petabox.2020-07-22.json | rg -v "\\\\" | jq . -c | pv -l | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.unextracted -p -1

## `pdf_meta` missing CDX rows

First, the GROBID-ized rows, but only those that also have a fatcat file.

10,755,365! That is still a lot to process.

    cat dump_unextracted_pdf.fatcat.2020-07-22.json | rg -v "\\\\" | jq . -c | pv -l | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.unextracted -p -1
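
The 10,755,365 count above presumably comes from the dump query; a count along
these lines would reproduce it (table and column names here are assumptions
about the sandcrawler schema, not copied from the actual dump script):

    # hypothetical count of GROBID-processed files that have a fatcat file
    # but no pdf_meta row yet; schema details are guesses
    psql sandcrawler -c "
        SELECT COUNT(*)
        FROM grobid
        JOIN fatcat_file ON grobid.sha1hex = fatcat_file.sha1hex
        LEFT JOIN pdf_meta ON grobid.sha1hex = pdf_meta.sha1hex
        WHERE pdf_meta.sha1hex IS NULL;
    "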

## `GROBID` missing petabox rows

    wc -l /grande/snapshots/dump_ungrobided_pdf_petabox.2020-07-22.json 
    972221 /grande/snapshots/dump_ungrobided_pdf_petabox.2020-07-22.json

Start small:

    head -n1000 dump_ungrobided_pdf_petabox.2020-07-22.json | rg -v "\\\\" | jq . -c | pv -l | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ungrobided-pg -p -1

Full batch:

    cat dump_ungrobided_pdf_petabox.2020-07-22.json | rg -v "\\\\" | jq . -c | pv -l | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ungrobided-pg -p -1

## `GROBID` for missing CDX rows in fatcat

    wc -l dump_ungrobided_pdf.fatcat.2020-07-22.json
    1808580 dump_ungrobided_pdf.fatcat.2020-07-22.json

Full batch:

    cat dump_ungrobided_pdf.fatcat.2020-07-22.json | rg -v "\\\\" | jq . -c | pv -l | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ungrobided-pg -p -1

## `GROBID` for bad status

Eg, wayback errors.

TODO
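
A possible approach, once the bad statuses are enumerated: dump rows the same
way as the ungrobided dumps above, but filtered on the GROBID status, then
re-push them to the same topic. Everything below (column name, status match,
dump file name) is a sketch, not something that was actually run:

    # hypothetical: tally GROBID rows stuck on wayback-style errors
    # (the status column and values are guesses)
    psql sandcrawler -c "SELECT status, COUNT(*) FROM grobid
        WHERE status LIKE '%wayback%' GROUP BY status;"

    # then dump those rows in the same JSON shape as the ungrobided dumps
    # above and re-push to the same topic (file name is a placeholder)
    cat dump_regrobid_pdf.badstatus.2020-07-22.json | rg -v "\\\\" | jq . -c | pv -l \
        | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ungrobided-pg -p -1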

## `pdf_trio` for OA journal crawls

TODO

## `pdf_trio` for "included by heuristic", not in fatcat

TODO

## Live-ingest missing arxiv papers

    ./fatcat_ingest.py --allow-non-oa --limit 10000 query arxiv_id:* > /srv/fatcat/snapshots/arxiv_10k_ingest_requests.json
    => Expecting 1505184 release objects in search queries

    cat /srv/fatcat/snapshots/arxiv_10k_ingest_requests.json | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests -p 22

Repeating this every few days should (?) eventually get the whole backlog of
arxiv papers indexed. Could start by focusing on recent years (with a query
filter).
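
For the recent-years variant, the query string could be narrowed; the
`release_year` field name, the range syntax, and the output path below are
assumptions, not from the original run:

    # hypothetical: restrict to recent arxiv papers first
    ./fatcat_ingest.py --allow-non-oa --limit 10000 \
        query 'arxiv_id:* AND release_year:>2018' \
        > /srv/fatcat/snapshots/arxiv_recent_ingest_requests.json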

## re-ingest spn2 errors (all time)

Eg:

    spn2-cdx-lookup-failure: 143963
    spn-error: 101773
    spn2-error: 16342

TODO
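
The counts above presumably come from ingest result statuses; a re-ingest pass
could dump the matching requests and push them back onto the ingest request
topic. Table and column names below are assumptions about the sandcrawler
schema:

    # hypothetical: count spn2-related failures by status before re-queueing
    psql sandcrawler -c "
        SELECT status, COUNT(*) FROM ingest_file_result
        WHERE status IN ('spn2-cdx-lookup-failure', 'spn-error', 'spn2-error')
        GROUP BY status;
    "

The matching requests would then be re-serialized as JSON and pushed to the
`sandcrawler-prod.ingest-file-requests` topic, same as the arxiv requests above.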

## re-try CDX errors

Eg, for unpaywall only, bulk ingest all `cdx-error`.

TODO
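
A sketch of the selection, assuming ingest requests and results live in
`ingest_request` / `ingest_file_result` tables joined on the URL, with
unpaywall requests tagged via a `link_source` column (all assumptions):

    # hypothetical: how many unpaywall ingests failed at the CDX stage?
    psql sandcrawler -c "
        SELECT COUNT(*)
        FROM ingest_file_result
        JOIN ingest_request
          ON ingest_file_result.base_url = ingest_request.base_url
        WHERE ingest_request.link_source = 'unpaywall'
          AND ingest_file_result.status = 'cdx-error';
    "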

## live ingest unpaywall `no-capture` URLs

After re-trying the CDX errors for unpaywall URLs (see above), count all the
no-capture URLs, and if reasonable, recrawl them all in live mode ("reasonable"
meaning fewer than 200k or so URLs).
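
If the count (same query shape as the `cdx-error` count above, with
`status = 'no-capture'`) comes in under that threshold, the requests could be
re-pushed to the live ingest topic, same as the arxiv batch above; the dump
file name here is just a placeholder:

    # hypothetical re-push of no-capture unpaywall requests for live ingest
    cat unpaywall_nocapture_requests.json | rg -v "\\\\" | jq . -c | pv -l \
        | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests -p -1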

Could also force recrawl (not using CDX lookups) for some publisher platforms
if that made sense.

TODO