Some PDF files in wayback were crawled externally by the Common Crawl project and
are truncated to roughly 128 KB (130775 bytes). This corrupts the PDFs and they
cannot be read. We want to remove these from the catalog, or at least mark them
as non-archival and "truncated" (in the `content_scope` field).
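
As a quick local sanity check (just a sketch, not part of the cleanup pipeline,
using a hypothetical local file): a PDF of exactly that size which is missing the
`%%EOF` trailer is almost certainly one of these truncated captures.

    # sketch: flag a local PDF that matches the truncation size and lacks an EOF trailer
    pdf=sample.pdf   # hypothetical local copy of a suspect capture
    if [ "$(stat -c%s "$pdf")" -eq 130775 ] && ! tail -c 64 "$pdf" | grep -q '%%EOF'; then
        echo "$pdf looks truncated"
    fi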

See the cleanup file for prep, counts, and other background.

There were over 600 file entities with this exact file size. Some are valid, but
over 400 were found to be corrupt and were updated in the catalog. The majority
seemed to also have alternate captures with distinct (non-truncated) file entities.
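
As a side note, one way to check for an alternate (non-truncated) capture of a
given sha1 is to expand the linked release and list its file entities. This is
just a sketch: it assumes the v0 API's `expand=files` parameter and takes the
sha1 as the first argument.

    # sketch: list the file entities attached to the first release linked from this sha1
    sha1="$1"
    release=$(fatcat-cli get "sha1:$sha1" --json | jq -r '.release_ids[0]')
    curl -s "https://api.fatcat.wiki/v0/release/$release?expand=files" \
        | jq -c '.files[] | {ident, sha1, size}'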

## Prod Commands

Configure CLI:

    export FATCAT_API_HOST=https://api.fatcat.wiki
    export FATCAT_AUTH_WORKER_CLEANUP=[...]
    export FATCAT_API_AUTH_TOKEN=$FATCAT_AUTH_WORKER_CLEANUP

    fatcat-cli --version
    fatcat-cli 0.1.6

    fatcat-cli status
         API Version: 0.5.0 (local)
            API host: https://api.fatcat.wiki [successfully connected]
      Last changelog: 5636508
      API auth token: [configured]
             Account: cleanup-bot [bot] [admin] [active]
                      editor_vvnmtzskhngxnicockn4iavyxq

Start small and review:

    cat /srv/fatcat/datasets/truncated_sha1.txt \
        | awk '{print "sha1:" $0}' \
        | parallel -j1 fatcat-cli get {} --json \
        | jq . -c \
        | rg -v '"content_scope"' \
        | rg 130775 \
        | head -n10 \
        | fatcat-cli batch update file release_ids= content_scope=truncated --description 'Flag truncated/corrupt PDFs (due to common crawl truncation)'
    # editgroup_3mviue5zebge3d2lqafgkfgwqa
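
Before committing to the full batch, a count-only variant of the same filter
pipeline (a sketch) shows how many entities would be touched:

    # sketch: count matching entities without creating an editgroup
    cat /srv/fatcat/datasets/truncated_sha1.txt \
        | awk '{print "sha1:" $0}' \
        | parallel -j1 fatcat-cli get {} --json \
        | jq . -c \
        | rg -v '"content_scope"' \
        | rg -c 130775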

Reviewed and all were corrupt. Running the rest of the batch:

    cat /srv/fatcat/datasets/truncated_sha1.txt \
        | awk '{print "sha1:" $0}' \
        | parallel -j1 fatcat-cli get {} --json \
        | jq . -c \
        | rg -v '"content_scope"' \
        | rg 130775 \
        | fatcat-cli batch update file release_ids= content_scope=truncated --description 'Flag truncated/corrupt PDFs (due to common crawl truncation)' --auto-accept
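
A quick spot-check after the auto-accepted batch (a sketch) re-fetches a few
entities from the list and shows which ones now carry `content_scope`:

    # sketch: re-fetch a few entities and check the content_scope field
    head -n5 /srv/fatcat/datasets/truncated_sha1.txt \
        | awk '{print "sha1:" $0}' \
        | parallel -j1 fatcat-cli get {} --json \
        | jq -c '{ident, size, content_scope}'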

And then the other batch for review (no `--auto-accept`):

    cat /srv/fatcat/datasets/unknown_sha1.txt \
        | awk '{print "sha1:" $0}' \
        | parallel -j1 fatcat-cli get {} --json \
        | jq . -c \
        | rg -v '"content_scope"' \
        | rg 130775 \
        | fatcat-cli batch update file release_ids= content_scope=truncated --description 'Flag truncated/corrupt PDFs (due to common crawl truncation)'
    # editgroup_7l32piag7vho5d6gz6ee6zbtgi
    # editgroup_cnoheod4jjbevdzez7m5z4o64i
    # editgroup_w3fdmv4ffjeytnfjh32t5yovsq
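
These editgroups were submitted for review rather than auto-accepted; they can be
inspected in the web interface before being accepted (a sketch, just printing the
review URLs):

    # sketch: print web review URLs for the submitted editgroups
    for eg in editgroup_7l32piag7vho5d6gz6ee6zbtgi \
              editgroup_cnoheod4jjbevdzez7m5z4o64i \
              editgroup_w3fdmv4ffjeytnfjh32t5yovsq; do
        echo "https://fatcat.wiki/editgroup/$eg"
    done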

These files were *not* truncated:

    file_2unklhykw5dwpotmslsldhlofy / 68821b6042a0a15fc788e99a400a1e7129d651a3
    file_xyfssct5pvde3ebg64bxoudcri / 7f307ebc2ce71c5f8a32ea4a77319403c8b87d95
    file_rrwbddwwrjg5tk4zyr5g63p2xi / ebe9f9d5d9def885e5a1ef220238f9ea9907dde1
    file_5bbhyx6labaxniz3pm2lvkr3wq / e498ee945b57b2d556bb2ed4a7d83e103bb3cc07

All of the other "unknown" files were truncated, and were updated (editgroups accepted).
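
To double-check (a sketch) that these four entities were left as-is, with no
`content_scope` set:

    # sketch: confirm the four non-truncated file entities were not modified
    for ident in file_2unklhykw5dwpotmslsldhlofy file_xyfssct5pvde3ebg64bxoudcri \
                 file_rrwbddwwrjg5tk4zyr5g63p2xi file_5bbhyx6labaxniz3pm2lvkr3wq; do
        fatcat-cli get "$ident" --json | jq -c '{ident, size, content_scope}'
    done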