There are a bunch of PDF captures in wayback, crawled by Common Crawl, which
have been truncated at roughly 128 KBytes (130775 bytes).

Most of these were presumably imported before GROBID success was required for
fatcat ingest.

Fixup should wait until `file_meta` cleanup has completed.

## Fatcat Files

Using an old snapshot, found 553 hits in elasticsearch. Index has not been
updated in a long time. Filtering to only those with a wayback URL, found just
415.

Here are some broken examples:

    https://fatcat.wiki/file/2e64rh3rc5gbvjzy2zux3qo36y
    sha1:44b54e9d272620f4e0641cadc1aa496fced5a950
    CDX warc_path:1224043144048_15-c/1224043303833_53.arc.gz
    CDX dt:20081006225231

    https://fatcat.wiki/file/n7ydoj6b6rhdhe4sc24lb7licq
    sha1:5916db9e338f2d6845af47a3e19b82cc59079249
    CDX warc_path:1216931712771_5-c/1216932408179_9.arc.gz
    CDX dt:20080708202110

    https://fatcat.wiki/file/iazux5lur5bfveraq6m7iewf5m
    sha1:645ff8b602a0ea94fd28ce93bfea4ff2f65aa124
    CDX warc_path:1216743864162_13-c/1216744138445_15.arc.gz
    CDX dt:20080706111123

This example seems fine (not broken):

    https://fatcat.wiki/file/lww5omk3grejhb2mlml3tikywq

Should be able to write a small function which can match on the Common Crawl
`warc_path` format.
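
A minimal sketch of such a matcher, assuming every affected capture has a
`warc_path` shaped like the three examples above (the exact pattern is inferred
from those samples and may need widening):

    import re

    # Matches warc_path values like "1224043144048_15-c/1224043303833_53.arc.gz";
    # the pattern is inferred from the examples above.
    COMMON_CRAWL_ARC = re.compile(r"^\d+_\d+-c/\d+_\d+\.arc\.gz$")

    def is_common_crawl_arc(warc_path: str) -> bool:
        """Return True if a CDX warc_path looks like a 2008-era Common Crawl ARC item."""
        return bool(COMMON_CRAWL_ARC.match(warc_path))

    assert is_common_crawl_arc("1224043144048_15-c/1224043303833_53.arc.gz")
    # hypothetical non-Common-Crawl item name, just to show a negative case
    assert not is_common_crawl_arc("LIVEWEB-20081006-EXAMPLE/example.warc.gz")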


## Sandcrawler SQL Exploration

    SELECT COUNT(*)
    FROM file_meta
    WHERE
        size_bytes = 130775;
    # 4904

## Proposed Cleanup

There are not many of these in fatcat. Propose:

0. wait until `file_meta` updates are complete, and re-dump metadata
1. filter `file_meta` dump to entities having this file size (using `rg` and `jq` filter)
2. for each, do a sandcrawler-db CDX read and check `warc_path` for the Common
   Crawl pattern (see the sketch after this list)
3. if it matches, update the file with `content_scope=corrupt` or
   `content_scope=truncated`, remove `release_ids`, then print out the file entity
4. use fatcat-cli to update file entity
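
A rough sketch of step 2, assuming the sandcrawler-db postgrest service exposes
a `/cdx` endpoint keyed by `sha1hex` (the endpoint name and column names are
assumptions; only `/file_meta`, `/pdf_meta`, and `/grobid` are exercised in the
commands below):

    import re

    import requests

    # postgrest host used throughout the commands below
    SANDCRAWLER_DB_API = "http://wbgrp-svc506.us.archive.org:3030"
    # same Common Crawl ARC pattern as the matcher sketched above
    COMMON_CRAWL_ARC = re.compile(r"^\d+_\d+-c/\d+_\d+\.arc\.gz$")

    def cdx_is_common_crawl(sha1hex: str) -> bool:
        """Check whether any CDX row for this file points at a Common Crawl ARC item.

        NOTE: the /cdx endpoint and its column names are assumed, not confirmed here.
        """
        resp = requests.get(
            f"{SANDCRAWLER_DB_API}/cdx",
            params={"sha1hex": f"eq.{sha1hex}"},
        )
        resp.raise_for_status()
        return any(COMMON_CRAWL_ARC.match(row.get("warc_path") or "") for row in resp.json())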

## Future Mitigation

Should have sandcrawler, and possibly the fatcat ingest worker, check for
size=130775 before accepting files. Probably in sandcrawler, so it can also
check for the Common Crawl WARC item name.
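
A minimal sketch of such an ingest-time check (the function name and where it
would hook in are assumptions; the size constant and ARC-path pattern come from
this document):

    import re

    TRUNCATED_SIZE = 130775  # observed size of Common-Crawl-truncated PDF captures
    COMMON_CRAWL_ARC = re.compile(r"^\d+_\d+-c/\d+_\d+\.arc\.gz$")

    def looks_cc_truncated(size_bytes: int, warc_path: str) -> bool:
        """Heuristic pre-filter: reject captures that hit the known truncation
        size *and* come from a Common Crawl ARC item."""
        return size_bytes == TRUNCATED_SIZE and bool(COMMON_CRAWL_ARC.match(warc_path))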

Could run a patch crawl to ensure we have some copy of all these URLs.

## Commands

    zcat ../2021-11-25/file_export.json.gz \
        | pv -l \
        | rg '"size":130775,' \
        > possibly_truncated.json

    wc -l possibly_truncated.json 
    # 655

Pretty few! At least one is not corrupt:

    https://web.archive.org/web/20050909185221/http://www.nature.com:80/cgi-taf/DynaPage.taf?file=/bmt/journal/v31/n4/full/1703836a.html&filetype=pdf

Fetching from sandcrawler-db:

    http get http://wbgrp-svc506.us.archive.org:3030/pdf_meta sha1hex==eq.7d5093fa09dc174471e590aab252d875bdecc7ed

    cat possibly_truncated.json \
        | jq .sha1 -r \
        | parallel -j8 curl 'http://wbgrp-svc506.us.archive.org:3030/file_meta?sha1hex=eq.{}' \
        | jq . -c \
        | pv -l \
        > possibly_truncated.file_meta.json

    rg '"sha1hex"' possibly_truncated.file_meta.json | wc -l
    # 556

    cat possibly_truncated.json \
        | jq .sha1 -r \
        | parallel -j8 curl 'http://wbgrp-svc506.us.archive.org:3030/pdf_meta?sha1hex=eq.{}' \
        | jq . -c \
        | pv -l \
        > possibly_truncated.pdf_meta.json

    cat possibly_truncated.pdf_meta.json | rg '"success"' | wc -l
    # 66

    rg '"status"' possibly_truncated.pdf_meta.json | wc -l
    # 427

    cat possibly_truncated.json \
        | jq .sha1 -r \
        | parallel -j8 curl 'http://wbgrp-svc506.us.archive.org:3030/grobid?sha1hex=eq.{}' \
        | jq . -c \
        | pv -l \
        > possibly_truncated.grobid.json

    cat possibly_truncated.grobid.json | rg '"success"' | wc -l
    # 67

    rg '"status"' possibly_truncated.grobid.json | wc -l
    # 430

    cat possibly_truncated.pdf_meta.json \
        | rg '"parse-error"' \
        | jq '.[0].sha1hex' -r \
        | sort \
        > truncated_sha1.txt

    wc -l truncated_sha1.txt
    # 360 truncated_sha1.txt

    cat possibly_truncated.json \
        | jq .sha1 -r \
        | sort \
        > possibly_sha1.txt

    cat possibly_truncated.pdf_meta.json \
        | rg '"success"' \
        | jq '.[0].sha1hex' -r \
        | sort \
        > pdf_fine_sha1.txt

    cat possibly_truncated.json \
        | rg -v '"rel"' \
        | jq .sha1 -r \
        | sort \
        > nourl_sha1.txt

    comm -23 possibly_sha1.txt truncated_sha1.txt \
        | comm -23 - pdf_fine_sha1.txt \
        | comm -23 - nourl_sha1.txt \
        > unknown_sha1.txt

Randomly sampled 10 entries from `truncated_sha1.txt` and all were broken PDFs.

For the others, will need to re-run this after finishing `file_meta` work?

## Prod Commands

Configure CLI:

    export FATCAT_API_HOST=https://api.fatcat.wiki
    export FATCAT_AUTH_WORKER_CLEANUP=[...]
    export FATCAT_API_AUTH_TOKEN=$FATCAT_AUTH_WORKER_CLEANUP

    fatcat-cli --version
    fatcat-cli 0.1.6

    fatcat-cli status
         API Version: 0.5.0 (local)
            API host: https://api.fatcat.wiki [successfully connected]
      Last changelog: 5636508
      API auth token: [configured]
             Account: cleanup-bot [bot] [admin] [active]
                      editor_vvnmtzskhngxnicockn4iavyxq

Start small and review:

    cat /srv/fatcat/datasets/truncated_sha1.txt \
        | awk '{print "sha1:" $0}' \
        | parallel -j1 fatcat-cli get {} --json \
        | jq . -c \
        | rg -v '"content_scope"' \
        | rg 130775 \
        | head -n10 \
        | fatcat-cli batch update file release_ids= content_scope=truncated --description 'Flag truncated/corrupt PDFs (due to common crawl truncation)'
    # editgroup_3mviue5zebge3d2lqafgkfgwqa

Reviewed and all were corrupt. Running the rest of the batch:

    cat /srv/fatcat/datasets/truncated_sha1.txt \
        | awk '{print "sha1:" $0}' \
        | parallel -j1 fatcat-cli get {} --json \
        | jq . -c \
        | rg -v '"content_scope"' \
        | rg 130775 \
        | fatcat-cli batch update file release_ids= content_scope=truncated --description 'Flag truncated/corrupt PDFs (due to common crawl truncation)' --auto-accept

And then the other batch for review (no `--auto-accept`):

    cat /srv/fatcat/datasets/unknown_sha1.txt \
        | awk '{print "sha1:" $0}' \
        | parallel -j1 fatcat-cli get {} --json \
        | jq . -c \
        | rg -v '"content_scope"' \
        | rg 130775 \
        | fatcat-cli batch update file release_ids= content_scope=truncated --description 'Flag truncated/corrupt PDFs (due to common crawl truncation)'
    # editgroup_7l32piag7vho5d6gz6ee6zbtgi
    # editgroup_cnoheod4jjbevdzez7m5z4o64i
    # editgroup_w3fdmv4ffjeytnfjh32t5yovsq

These files were *not* truncated:

    file_2unklhykw5dwpotmslsldhlofy / 68821b6042a0a15fc788e99a400a1e7129d651a3
    file_xyfssct5pvde3ebg64bxoudcri / 7f307ebc2ce71c5f8a32ea4a77319403c8b87d95
    file_rrwbddwwrjg5tk4zyr5g63p2xi / ebe9f9d5d9def885e5a1ef220238f9ea9907dde1
    file_5bbhyx6labaxniz3pm2lvkr3wq / e498ee945b57b2d556bb2ed4a7d83e103bb3cc07

All the other "unknown" files were truncated, and were updated (editgroups accepted).