

## Prep

Using `check_hashes.sh`:

    zcat $HASH_FILE \
        | awk '{print $3 "\t" $1}' \
        | rg -v '^\t' \
        | sort -S 4G \
        | uniq -D -w 40 \
        > sha1_ident.dupes.tsv

    wc -l sha1_ident.dupes.tsv 
    # 6,350

    cut -f1 sha1_ident.dupes.tsv | uniq | wc -l
    # 2,039
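
For reference, the `sort | uniq -D -w 40` step above can be sketched in Python. This is only an illustration, assuming the rows are `(sha1, ident)` pairs (that is, that column 1 of the hash file is the file ident); `duplicate_rows` is a hypothetical helper name, not part of any existing script:

```python
from itertools import groupby


def duplicate_rows(rows):
    """Yield every (sha1, ident) row whose SHA-1 occurs more than once.

    `rows` must already be sorted by SHA-1, mirroring `sort` before `uniq -D`.
    """
    for _sha1, group in groupby(rows, key=lambda row: row[0]):
        members = list(group)
        if len(members) > 1:
            # like `uniq -D`: emit all members of a repeated group
            yield from members


# made-up sample data; real SHA-1 values are 40 hex characters
rows = sorted([
    ("a" * 40, "ident1"),
    ("b" * 40, "ident2"),
    ("a" * 40, "ident3"),
])
dupes = list(duplicate_rows(rows))
```

With the counts above, 6,350 rows across 2,039 distinct hashes works out to roughly 3.1 file entities per duplicated SHA-1.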

Want to create one JSON object for each duplicate group, like:

    entity_type: "file"
    primary_id: str or None
    duplicate_ids: [str]
    evidence:
        extid: str
        extid_type: "sha1"

Run transform script:

    cat sha1_ident.dupes.tsv | ./file_dupe_to_json.py | pv -l > file_sha1_dupes.json
    # 2.04k 0:00:00 [9.16k/s]
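
The source of `file_dupe_to_json.py` is not shown here, so the following is only a rough sketch of what such a transform might look like, assuming sorted `sha1<TAB>ident` input lines and the schema above. The function name and the choice to leave `primary_id` unset are assumptions:

```python
import json
from itertools import groupby


def dupes_to_json(lines):
    """Turn sorted "sha1<TAB>ident" lines into one JSON string per SHA-1 group."""
    rows = (line.rstrip("\n").split("\t") for line in lines if line.strip())
    for sha1, group in groupby(rows, key=lambda row: row[0]):
        idents = [row[1] for row in group]
        if len(idents) < 2:
            continue  # not a duplicate group
        yield json.dumps({
            "entity_type": "file",
            "primary_id": None,  # assumption: the merger picks the primary
            "duplicate_ids": idents,
            "evidence": {"extid": sha1, "extid_type": "sha1"},
        }, sort_keys=True)


# made-up sample input
sample = [
    "a" * 40 + "\tident1\n",
    "a" * 40 + "\tident3\n",
    "b" * 40 + "\tident2\n",
]
out = list(dupes_to_json(sample))
```

Emitting one object per line matches the `pv -l` count of about 2.04k lines, one per duplicated hash.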


## QA Testing

    export FATCAT_AUTH_API_TOKEN=[...]

    head -n25 /srv/fatcat/datasets/file_sha1_dupes.json \
        | python -m fatcat_tools.mergers.files --editgroup-description-override "Automated merging of file entities with duplicate SHA-1 hashes" --dry-run merge-files -

Hit some small bugs running in QA; test coverage isn't great, but I think it
hits the important parts.

    head -n25 /srv/fatcat/datasets/file_sha1_dupes.json \
        | python -m fatcat_tools.mergers.files --editgroup-description-override "Automated merging of file entities with duplicate SHA-1 hashes" --dry-run merge-files -
    # Running in dry-run mode!
    # Counter({'updated-entities': 60, 'lines': 25, 'merged': 25, 'skip': 0, 'updated-total': 0})

Dry-run mode didn't actually work: the edits really happened (!).

Edits do look good.

Try again without `--dry-run`, to ensure that re-running over already-merged entities is handled:

    head -n25 /srv/fatcat/datasets/file_sha1_dupes.json | python -m fatcat_tools.mergers.files --editgroup-description-override "Automated merging of file entities with duplicate SHA-1 hashes" merge-files -
    # Counter({'lines': 25, 'skip': 25, 'skip-not-active-entity': 25, 'merged': 0, 'updated-total': 0})

And then run 500 through for more testing:

    head -n500 /srv/fatcat/datasets/file_sha1_dupes.json | python -m fatcat_tools.mergers.files --editgroup-description-override "Automated merging of file entities with duplicate SHA-1 hashes" merge-files -
    # Counter({'updated-entities': 1341, 'lines': 500, 'merged': 474, 'skip': 26, 'skip-not-active-entity': 25, 'skip-entity-not-found': 1, 'updated-total': 0})
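
A quick sanity check on the counters from the 500-line run, copied from the output above. How the keys relate to each other is an assumption here (that every line is either merged or skipped, and that the `skip-*` keys break down `skip`):

```python
from collections import Counter

stats = Counter({
    "updated-entities": 1341, "lines": 500, "merged": 474, "skip": 26,
    "skip-not-active-entity": 25, "skip-entity-not-found": 1,
    "updated-total": 0,
})

# assumption: every input line ends up either merged or skipped
lines_accounted = stats["merged"] + stats["skip"]

# assumption: the per-reason skip counters sum to the overall skip count
skip_reasons = sum(v for k, v in stats.items() if k.startswith("skip-"))

# average number of entities touched per merged group
avg_updated = stats["updated-entities"] / stats["merged"]
print(f"{avg_updated:.2f}")
```

The 25 `skip-not-active-entity` lines are consistent with the first 25 groups having already been merged by the earlier run that was supposed to be a dry run.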

The majority of merges seem to be cases where there are multiple articles in the same PDF.