diff options
author | Bryan Newbold <bnewbold@robocracy.org> | 2021-11-23 18:58:37 -0800 |
---|---|---|
committer | Bryan Newbold <bnewbold@robocracy.org> | 2021-11-23 18:58:37 -0800 |
commit | 8080eef139b5dcf6201e4f27076a879d0df20096 (patch) | |
tree | 4a19cb90122865bc55f172b5115e9ac052cfdf30 /notes/cleanups/file_sha1_dedupe.md | |
parent | f6c4bd65c104e3a728e94561be80242cf35cbea3 (diff) | |
download | fatcat-8080eef139b5dcf6201e4f27076a879d0df20096.tar.gz fatcat-8080eef139b5dcf6201e4f27076a879d0df20096.zip |
file de-dupe: notes on prep and QA testing
Diffstat (limited to 'notes/cleanups/file_sha1_dedupe.md')
-rw-r--r-- | notes/cleanups/file_sha1_dedupe.md | 64 |
1 files changed, 64 insertions, 0 deletions
diff --git a/notes/cleanups/file_sha1_dedupe.md b/notes/cleanups/file_sha1_dedupe.md new file mode 100644 index 00000000..0829bc79 --- /dev/null +++ b/notes/cleanups/file_sha1_dedupe.md @@ -0,0 +1,64 @@ + + +## Prep + +Using `check_hashes.sh`: + + zcat $HASH_FILE \ + | awk '{print $3 "\t" $1}' \ + | rg -v '^\t' \ + | sort -S 4G \ + | uniq -D -w 40 \ + > sha1_ident.dupes.tsv + + wc -l sha1_ident.dupes.tsv + # 6,350 + + cut -f1 sha1_ident.dupes.tsv | uniq | wc -l + # 2,039 + +Want to create JSON for each group, like: + + entity_type: "file" + primary_id: str or None + duplicate_ids: [str] + evidence: + extid: str + extid_type: "sha1" + +Run transform script: + + cat sha1_ident.dupes.tsv | ./file_dupe_to_json.py | pv -l > file_sha1_dupes.json + # 2.04k 0:00:00 [9.16k/s] + + +## QA Testing + + export FATCAT_AUTH_API_TOKEN=[...] + + head -n25 /srv/fatcat/datasets/file_sha1_dupes.json \ + | python -m fatcat_tools.mergers.files --editgroup-description-override "Automated merging of file entities with duplicate SHA-1 hashes" --dry-run merge-files - + +Hit some small bugs running in QA; test coverage isn't great, but I think hits +the important parts. + + head -n25 /srv/fatcat/datasets/file_sha1_dupes.json \ + | python -m fatcat_tools.mergers.files --editgroup-description-override "Automated merging of file entities with duplicate SHA-1 hashes" --dry-run merge-files - + # Running in dry-run mode! + # Counter({'updated-entities': 60, 'lines': 25, 'merged': 25, 'skip': 0, 'updated-total': 0}) + +Dry-run mode didn't actually work, and edits actually happened (!). + +Edits do look good. + +Try again, not dry-run, to ensure that case is handled: + + head -n25 /srv/fatcat/datasets/file_sha1_dupes.json | python -m fatcat_tools.mergers.files --editgroup-description-override "Automated merging of file entities with duplicate SHA-1 hashes" merge-files - + # Counter({'lines': 25, 'skip': 25, 'skip-not-active-entity': 25, 'merged': 0, 'updated-total': 0}) + +And then run 500 through for more testing: + + head -n500 /srv/fatcat/datasets/file_sha1_dupes.json | python -m fatcat_tools.mergers.files --editgroup-description-override "Automated merging of file entities with duplicate SHA-1 hashes" merge-files - + # Counter({'updated-entities': 1341, 'lines': 500, 'merged': 474, 'skip': 26, 'skip-not-active-entity': 25, 'skip-entity-not-found': 1, 'updated-total': 0}) + +The majority of merges seem to be cases where there are multiple articles in the same PDF. |