path: root/notes/cleanups/file_sha1_dedupe.md
author Bryan Newbold <bnewbold@robocracy.org> 2021-11-29 14:33:14 -0800
committer Bryan Newbold <bnewbold@robocracy.org> 2021-11-29 14:33:14 -0800
commit c5ea2dba358624f4c14da0a1a988ae14d0edfd59
tree 7d3934e4922439402f882a374fe477906fd41aae /notes/cleanups/file_sha1_dedupe.md
parent ec2809ef2ac51c992463839c1e3451927f5e1661
move 'cleanups' directory from notes to extra/
Diffstat (limited to 'notes/cleanups/file_sha1_dedupe.md')
-rw-r--r-- notes/cleanups/file_sha1_dedupe.md 64
1 files changed, 0 insertions, 64 deletions
diff --git a/notes/cleanups/file_sha1_dedupe.md b/notes/cleanups/file_sha1_dedupe.md
deleted file mode 100644
index 0829bc79..00000000
--- a/notes/cleanups/file_sha1_dedupe.md
+++ /dev/null
@@ -1,64 +0,0 @@
-
-
-## Prep
-
-Using `check_hashes.sh`:
-
-    zcat $HASH_FILE \
-        | awk '{print $3 "\t" $1}' \
-        | rg -v '^\t' \
-        | sort -S 4G \
-        | uniq -D -w 40 \
-        > sha1_ident.dupes.tsv
-
-    wc -l sha1_ident.dupes.tsv
-    # 6,350
-
-    cut -f1 sha1_ident.dupes.tsv | uniq | wc -l
-    # 2,039
-
-Want to create JSON for each group, like:
-
-    entity_type: "file"
-    primary_id: str or None
-    duplicate_ids: [str]
-    evidence:
-        extid: str
-        extid_type: "sha1"
-
-Run transform script:
-
-    cat sha1_ident.dupes.tsv | ./file_dupe_to_json.py | pv -l > file_sha1_dupes.json
-    # 2.04k 0:00:00 [9.16k/s]
-
-
-## QA Testing
-
-    export FATCAT_AUTH_API_TOKEN=[...]
-
-    head -n25 /srv/fatcat/datasets/file_sha1_dupes.json \
-        | python -m fatcat_tools.mergers.files --editgroup-description-override "Automated merging of file entities with duplicate SHA-1 hashes" --dry-run merge-files -
-
-Hit some small bugs running in QA; test coverage isn't great, but I think it
-hits the important parts.
-
-    head -n25 /srv/fatcat/datasets/file_sha1_dupes.json \
-        | python -m fatcat_tools.mergers.files --editgroup-description-override "Automated merging of file entities with duplicate SHA-1 hashes" --dry-run merge-files -
-    # Running in dry-run mode!
-    # Counter({'updated-entities': 60, 'lines': 25, 'merged': 25, 'skip': 0, 'updated-total': 0})
-
-Dry-run mode didn't actually work, and edits actually happened (!).
-
-Edits do look good.
-
-Try again, not dry-run, to ensure that case is handled:
-
-    head -n25 /srv/fatcat/datasets/file_sha1_dupes.json | python -m fatcat_tools.mergers.files --editgroup-description-override "Automated merging of file entities with duplicate SHA-1 hashes" merge-files -
-    # Counter({'lines': 25, 'skip': 25, 'skip-not-active-entity': 25, 'merged': 0, 'updated-total': 0})
-
-And then run 500 through for more testing:
-
-    head -n500 /srv/fatcat/datasets/file_sha1_dupes.json | python -m fatcat_tools.mergers.files --editgroup-description-override "Automated merging of file entities with duplicate SHA-1 hashes" merge-files -
-    # Counter({'updated-entities': 1341, 'lines': 500, 'merged': 474, 'skip': 26, 'skip-not-active-entity': 25, 'skip-entity-not-found': 1, 'updated-total': 0})
-
-The majority of merges seem to be cases where there are multiple articles in the same PDF.
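
The transform script `file_dupe_to_json.py` referenced in the Prep step is not part of this diff. The following is only a minimal sketch of the grouping that step describes, assuming each row of `sha1_ident.dupes.tsv` is `<sha1>\t<ident>` and that rows are already sorted by SHA-1, as the `sort | uniq -D` pipeline would leave them. The function name `group_dupes`, the sample identifiers, and the choice to leave `primary_id` as `None` are assumptions for illustration, not the actual script:

```python
import json
from itertools import groupby
from typing import Iterable, Iterator


def group_dupes(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one merge-request dict per run of rows sharing a SHA-1.

    Assumes input rows look like "<sha1>\t<ident>" and are sorted by sha1.
    """
    rows = (
        line.rstrip("\n").partition("\t")
        for line in lines
        if line.strip()  # skip blank lines
    )
    # groupby only merges adjacent rows, hence the sorted-input assumption
    for sha1, group in groupby(rows, key=lambda row: row[0]):
        idents = []
        for _, _, ident in group:
            if ident and ident not in idents:
                idents.append(ident)
        if len(idents) < 2:
            continue  # nothing to merge for this hash
        yield {
            "entity_type": "file",
            "primary_id": None,  # assumption: left for the merger to choose
            "duplicate_ids": idents,
            "evidence": {"extid": sha1, "extid_type": "sha1"},
        }


# Hypothetical sample rows, not real fatcat identifiers or hashes
sample = [
    "aaaa\tident1\n",
    "aaaa\tident2\n",
    "bbbb\tident3\n",
    "bbbb\tident4\n",
    "bbbb\tident5\n",
]
for obj in group_dupes(sample):
    print(json.dumps(obj))
```

Emitting one JSON object per line matches how the notes count groups with `pv -l` and select subsets with `head -n25`.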