author    Bryan Newbold <bnewbold@robocracy.org>  2021-11-23 18:58:37 -0800
committer Bryan Newbold <bnewbold@robocracy.org>  2021-11-23 18:58:37 -0800
commit    8080eef139b5dcf6201e4f27076a879d0df20096 (patch)
tree      4a19cb90122865bc55f172b5115e9ac052cfdf30 /notes/cleanups/file_sha1_dedupe.md
parent    f6c4bd65c104e3a728e94561be80242cf35cbea3 (diff)
file de-dupe: notes on prep and QA testing
Diffstat (limited to 'notes/cleanups/file_sha1_dedupe.md')
-rw-r--r--  notes/cleanups/file_sha1_dedupe.md | 64
1 file changed, 64 insertions(+), 0 deletions(-)
diff --git a/notes/cleanups/file_sha1_dedupe.md b/notes/cleanups/file_sha1_dedupe.md
new file mode 100644
index 00000000..0829bc79
--- /dev/null
+++ b/notes/cleanups/file_sha1_dedupe.md
@@ -0,0 +1,64 @@
+
+
+## Prep
+
+Using `check_hashes.sh`:
+
+ zcat $HASH_FILE \
+ | awk '{print $3 "\t" $1}' \
+ | rg -v '^\t' \
+ | sort -S 4G \
+ | uniq -D -w 40 \
+ > sha1_ident.dupes.tsv
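+
+(`uniq -D -w 40` prints every line whose first 40 characters, i.e. the SHA-1
+column, match another line, so whole duplicate groups are kept.)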
+
+ wc -l sha1_ident.dupes.tsv
+ # 6,350
+
+ cut -f1 sha1_ident.dupes.tsv | uniq | wc -l
+ # 2,039
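+
+So on average each duplicated SHA-1 covers about three file entities
+(6,350 rows / 2,039 distinct hashes ≈ 3.1).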
+
+Want to create JSON for each group, like:
+
+ entity_type: "file"
+ primary_id: str or None
+ duplicate_ids: [str]
+ evidence:
+ extid: str
+ extid_type: "sha1"
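+
+For example, one duplicate group would serialize to a JSON line roughly like
+this (placeholder identifiers and hash, not values from the real dataset):
+
+    {"entity_type": "file", "primary_id": null,
+     "duplicate_ids": ["<file_ident_1>", "<file_ident_2>"],
+     "evidence": {"extid": "<sha1_hex>", "extid_type": "sha1"}}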
+
+Run transform script:
+
+ cat sha1_ident.dupes.tsv | ./file_dupe_to_json.py | pv -l > file_sha1_dupes.json
+ # 2.04k 0:00:00 [9.16k/s]
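+
+`file_dupe_to_json.py` itself isn't included in these notes; a minimal sketch
+of the transform (assuming the sorted `sha1<TAB>ident` input produced above,
+and not necessarily how the real script is written) could look like:
+
+    #!/usr/bin/env python3
+    # Hypothetical sketch, not the actual file_dupe_to_json.py: read sorted
+    # "sha1<TAB>ident" rows from stdin, group by SHA-1, and print one JSON
+    # object per duplicate group in the shape described above.
+    import json
+    import sys
+    from itertools import groupby
+
+    def main():
+        rows = (line.rstrip("\n").split("\t") for line in sys.stdin if line.strip())
+        # input is already sorted, so rows with the same SHA-1 are adjacent
+        for sha1, group in groupby(rows, key=lambda row: row[0]):
+            idents = [row[1] for row in group]
+            if len(idents) < 2:
+                continue
+            print(json.dumps({
+                "entity_type": "file",
+                "primary_id": None,
+                "duplicate_ids": idents,
+                "evidence": {"extid": sha1, "extid_type": "sha1"},
+            }))
+
+    if __name__ == "__main__":
+        main()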
+
+
+## QA Testing
+
+ export FATCAT_AUTH_API_TOKEN=[...]
+
+ head -n25 /srv/fatcat/datasets/file_sha1_dupes.json \
+ | python -m fatcat_tools.mergers.files --editgroup-description-override "Automated merging of file entities with duplicate SHA-1 hashes" --dry-run merge-files -
+
+Hit some small bugs running in QA; test coverage isn't great, but I think it
+hits the important parts.
+
+ head -n25 /srv/fatcat/datasets/file_sha1_dupes.json \
+ | python -m fatcat_tools.mergers.files --editgroup-description-override "Automated merging of file entities with duplicate SHA-1 hashes" --dry-run merge-files -
+ # Running in dry-run mode!
+ # Counter({'updated-entities': 60, 'lines': 25, 'merged': 25, 'skip': 0, 'updated-total': 0})
+
+Dry-run mode didn't actually work, and the edits really went through (!).
+
+Edits do look good.
+
+Try again, this time intentionally not in dry-run mode, to ensure the
+already-merged case is handled:
+
+ head -n25 /srv/fatcat/datasets/file_sha1_dupes.json | python -m fatcat_tools.mergers.files --editgroup-description-override "Automated merging of file entities with duplicate SHA-1 hashes" merge-files -
+ # Counter({'lines': 25, 'skip': 25, 'skip-not-active-entity': 25, 'merged': 0, 'updated-total': 0})
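+
+The 25 `skip-not-active-entity` lines are expected: those entities were already
+merged (and redirected) by the accidental run above, so they are no longer
+active and get skipped rather than re-merged.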
+
+And then run 500 through for more testing:
+
+ head -n500 /srv/fatcat/datasets/file_sha1_dupes.json | python -m fatcat_tools.mergers.files --editgroup-description-override "Automated merging of file entities with duplicate SHA-1 hashes" merge-files -
+ # Counter({'updated-entities': 1341, 'lines': 500, 'merged': 474, 'skip': 26, 'skip-not-active-entity': 25, 'skip-entity-not-found': 1, 'updated-total': 0})
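+
+That works out to roughly 2.8 entity updates per merged group (1,341 / 474),
+in line with the ~3.1 average group size measured during prep.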
+
+The majority of merges seem to be cases where there are multiple articles in the same PDF.