Over 500k file entities still lack complete metadata, such as SHA-256
checksums and verified mimetypes; presumably these also lack GROBID
processing. Most or all of them seem to be simply wayback captures with no CDX
metadata in sandcrawler-db, so they didn't get updated in prior cleanups.
Current plan, re-using existing tools and processes, is to:
1. create stub ingest requests containing file idents
2. process them "locally" on a large VM, in 'bulk' mode; writing output to stdout but using regular grobid and pdfextract "sinks" to Kafka
3. transform ingest results to a form for existing `file_meta` importer
4. run imports
The `file_meta` importer requires just the `file_meta` dict from sandcrawler.
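For reference, a sketch of what such a `file_meta` record looks like. Field names here follow common sandcrawler convention (`sha1hex`, `sha256hex`, `md5hex`, `size_bytes`, `mimetype`); the actual schema may carry additional fields.

```python
# Hypothetical example of the minimal `file_meta` dict the importer consumes;
# hash values below are placeholders, not real digests.
example_file_meta = {
    "sha1hex": "0000000000000000000000000000000000000000",
    "sha256hex": "0000000000000000000000000000000000000000000000000000000000000000",
    "md5hex": "00000000000000000000000000000000",
    "size_bytes": 12345,
    "mimetype": "application/pdf",
}
```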
## Prep
    zcat file_hashes.tsv.gz | pv -l | rg '\t\t' | wc -l
    # 521,553

    zcat file_export.json.gz \
        | rg -v '"sha256":' \
        | pv -l \
        | pigz \
        > files_missing_sha256.json.gz
    # 521k 0:10:21 [ 839 /s]
Want ingest requests with:
base_url: str
ingest_type: "pdf"
link_source: "fatcat"
link_source_id: file ident (with "file_" prefix)
ingest_request_source: "file-backfill"
ext_ids:
sha1: str
Use `file2ingestrequest.py` helper:
    zcat files_missing_sha256.json.gz \
        | ./file2ingestrequest.py \
        | pv -l \
        | pigz \
        > files_missing_sha256.ingest_request.json.gz
    # 519k 0:00:19 [26.5k/s]
So about 2k entities were filtered out; will investigate later.
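A minimal sketch of what the `file2ingestrequest.py` transform could look like, matching the request shape listed above. The file export field names (`urls`, `ident`, `sha1`) are assumptions here, and the real helper may differ; entities lacking a URL or SHA-1 get dropped, which would account for filtered-out records.

```python
import json
import sys


def file_to_ingest_request(file_entity):
    """Convert a fatcat file entity into a stub ingest request.

    Returns None if the entity lacks a URL or SHA-1 (such entities
    get filtered out of the output stream).
    """
    urls = file_entity.get("urls") or []
    sha1 = file_entity.get("sha1")
    if not urls or not sha1:
        return None
    return {
        "base_url": urls[0]["url"],
        "ingest_type": "pdf",
        "link_source": "fatcat",
        "link_source_id": "file_{}".format(file_entity["ident"]),
        "ingest_request_source": "file-backfill",
        "ext_ids": {"sha1": sha1},
    }


def main():
    # Stream JSON lines on stdin; emit one ingest request per usable entity.
    for line in sys.stdin:
        req = file_to_ingest_request(json.loads(line))
        if req is not None:
            print(json.dumps(req))
```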
    zcat files_missing_sha256.ingest_request.json.gz \
        | shuf -n1000 \
        > files_missing_sha256.ingest_request.sample.json

    head -n100 files_missing_sha256.ingest_request.sample.json \
        | ./ingest_tool.py requests --no-spn2 - \
        > sample_results.json

    #  4 "no-capture"
    #  1 "no-pdf-link"
    # 95 "success"
Seems like this is going to be a good start, but will need iteration.
Dev testing:
    head files_missing_sha256.ingest_request.sample.json \
        | ./ingest_tool.py file-requests-backfill - \
            --kafka-env qa \
            --kafka-hosts wbgrp-svc263.us.archive.org:9092,wbgrp-svc284.us.archive.org:9092,wbgrp-svc285.us.archive.org:9092 \
        > out_sample.json
## Commands
Production warm-up:
    cat /srv/sandcrawler/tasks/files_missing_sha256.ingest_request.sample.json \
        | ./ingest_tool.py file-requests-backfill - \
            --kafka-env prod \
            --kafka-hosts wbgrp-svc263.us.archive.org:9092,wbgrp-svc284.us.archive.org:9092,wbgrp-svc285.us.archive.org:9092 \
            --grobid-host http://localhost:8070 \
        > /srv/sandcrawler/tasks/files_missing_sha256.ingest_results.sample.json
Production parallel run:
    zcat /srv/sandcrawler/tasks/files_missing_sha256.ingest_request.json \
        | parallel -j24 --linebuffer --round-robin --pipe \
            ./ingest_tool.py file-requests-backfill - \
            --kafka-env qa \
            --kafka-hosts wbgrp-svc263.us.archive.org:9092,wbgrp-svc284.us.archive.org:9092,wbgrp-svc285.us.archive.org:9092 \
            --grobid-host http://localhost:8070 \
        > /srv/sandcrawler/tasks/files_missing_sha256.ingest_results.json
Filter and select file meta for import:
    head files_missing_sha256.ingest_results.json \
        | rg '"sha256hex"' \
        | jq 'select(.request.ext_ids.sha1 == .file_meta.sha1hex) | .file_meta' -c \
        > files_missing_sha256.file_meta.json

    # Worker: Counter({'total': 20925, 'success': 20003, 'no-capture': 545, 'link-loop': 115, 'wrong-mimetype': 104, 'redirect-loop': 46, 'wayback-error': 25, 'null-body': 20, 'no-pdf-link': 18, 'skip-url-blocklist': 17, 'terminal-bad-status': 16, 'cdx-error': 9, 'wayback-content-error': 4, 'blocked-cookie': 3})
    # [etc]
Had some GROBID issues, so we are not going to be able to get everything in the
first pass. Merge our partial results, keeping just the `file_meta`:
    cat files_missing_sha256.ingest_results.batch1.json files_missing_sha256.ingest_results.json \
        | jq .file_meta -c \
        | rg '"sha256hex"' \
        | pv -l \
        > files_missing_sha256.file_meta.json
    # 386k 0:00:41 [9.34k/s]
A bunch of these will need to be re-run once GROBID is in a healthier place.
Check that we don't have (many) dupes:
    cat files_missing_sha256.file_meta.json \
        | jq .sha1hex -r \
        | sort \
        | uniq -D \
        | wc -l
    # 86520
Huh, seems like a weirdly large number. Maybe related to re-crawling? Will need
to dedupe by sha1hex.
Check how many dupes in original:
    zcat files_missing_sha256.ingest_request.json.gz \
        | jq .ext_ids.sha1 -r \
        | sort \
        | uniq -D \
        | wc -l
That lines up with the number of dupes expected before the SHA-1 de-dupe run.
    cat files_missing_sha256.file_meta.json \
        | sort -u -S 4G \
        | pv -l \
        > files_missing_sha256.file_meta.uniq.json

    cat files_missing_sha256.file_meta.uniq.json \
        | jq .sha1hex -r \
        | sort \
        | uniq -D \
        | wc -l
    # 0
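Note that `sort -u` over whole lines only removes byte-identical records; it worked here (the remaining dupe count is 0, so the duplicates were exact re-emits), but a key-based dedupe on `sha1hex` would be more robust if re-crawls ever produce differing `file_meta` for the same file. A sketch of that alternative (not an existing tool):

```python
import json


def dedupe_by_sha1hex(lines):
    """Yield only the first JSON record seen for each sha1hex value.

    Unlike `sort -u` on whole lines, this also collapses records that
    share a sha1hex but differ in other fields (e.g. from re-crawls).
    """
    seen = set()
    for line in lines:
        record = json.loads(line)
        key = record.get("sha1hex")
        if key in seen:
            continue
        seen.add(key)
        yield line.rstrip("\n")
```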
Have seen a lot of errors like:
    %4|1637808915.562|TERMINATE|rdkafka#producer-1| [thrd:app]: Producer terminating with 1 message (650 bytes) still in queue or transit: use flush() to wait for outstanding message delivery
TODO: add manual `finish()` calls on sinks in tool `run` function
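The fix would follow the usual confluent-kafka pattern of flushing the producer before exit. A sketch of the shape (the `KafkaSink` class and `finish()` method names are hypothetical, taken from the TODO above; `finish()` wraps the producer's blocking `flush()`):

```python
class KafkaSink:
    """Sketch of a sink wrapping a Kafka producer object."""

    def __init__(self, producer):
        self.producer = producer

    def push_record(self, record):
        self.producer.produce(record)

    def finish(self):
        # Block until all queued messages are delivered; without this,
        # rdkafka warns about messages "still in queue or transit" at exit.
        self.producer.flush()


def run(requests, sinks):
    """Process all requests, then flush every sink before returning."""
    for req in requests:
        for sink in sinks:
            sink.push_record(req)
    # The TODO above: manual finish() calls at the end of the run.
    for sink in sinks:
        sink.finish()
```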
## QA Testing
    export FATCAT_API_AUTH_TOKEN... # sandcrawler-bot

    cat /srv/fatcat/datasets/files_missing_sha256.file_meta.uniq.sample.json \
        | ./fatcat_import.py --editgroup-description-override 'backfill of full file-level metadata for early-imported papers' file-meta -
    # Counter({'total': 1000, 'update': 503, 'skip-existing-complete': 403, 'skip-no-match': 94, 'skip': 0, 'insert': 0, 'exists': 0})

    head -n1000 /srv/fatcat/datasets/files_missing_sha256.file_meta.uniq.json \
        | parallel -j8 --round-robin --pipe -q \
            ./fatcat_import.py --editgroup-description-override 'backfill of full file-level metadata for early-imported papers' file-meta -
    # Counter({'total': 1000, 'update': 481, 'skip-existing-complete': 415, 'skip-no-match': 104, 'skip': 0, 'insert': 0, 'exists': 0})