## Prep

Using `check_hashes.sh`:

    zcat $HASH_FILE \
        | awk '{print $3 "\t" $1}' \
        | rg -v '^\t' \
        | sort -S 4G \
        | uniq -D -w 40 \
        > sha1_ident.dupes.tsv

    wc -l sha1_ident.dupes.tsv
    # 6,350

    cut -f1 sha1_ident.dupes.tsv | uniq | wc -l
    # 2,039

Want to create JSON for each group, like:

    entity_type: "file"
    primary_id: str or None
    duplicate_ids: [str]
    evidence:
        extid: str
        extid_type: "sha1"

Run the transform script (a rough sketch of the grouping it performs is included at the end of these notes):

    cat sha1_ident.dupes.tsv | ./file_dupe_to_json.py | pv -l > file_sha1_dupes.json
    # 2.04k 0:00:00 [9.16k/s]


## QA Testing

    export FATCAT_AUTH_API_TOKEN=[...]

    head -n25 /srv/fatcat/datasets/file_sha1_dupes.json \
        | python -m fatcat_tools.mergers.files --editgroup-description-override "Automated merging of file entities with duplicate SHA-1 hashes" --dry-run merge-files

Hit some small bugs running in QA; test coverage isn't great, but I think it
hits the important parts.

    head -n25 /srv/fatcat/datasets/file_sha1_dupes.json \
        | python -m fatcat_tools.mergers.files --editgroup-description-override "Automated merging of file entities with duplicate SHA-1 hashes" --dry-run merge-files

    # Running in dry-run mode!
    # Counter({'updated-entities': 60, 'lines': 25, 'merged': 25, 'skip': 0, 'updated-total': 0})

Dry-run mode didn't actually work, and the edits actually happened (!).

The edits themselves do look good, though.

Try again, without dry-run, to ensure that case is handled:

    head -n25 /srv/fatcat/datasets/file_sha1_dupes.json \
        | python -m fatcat_tools.mergers.files --editgroup-description-override "Automated merging of file entities with duplicate SHA-1 hashes" merge-files

    # Counter({'lines': 25, 'skip': 25, 'skip-not-active-entity': 25, 'merged': 0, 'updated-total': 0})

All 25 skip here, presumably because the previous run (which was supposed to be a dry run) had already merged those entities, leaving them no longer active.

And then run 500 through for more testing:

    head -n500 /srv/fatcat/datasets/file_sha1_dupes.json \
        | python -m fatcat_tools.mergers.files --editgroup-description-override "Automated merging of file entities with duplicate SHA-1 hashes" merge-files

    # Counter({'updated-entities': 1341, 'lines': 500, 'merged': 474, 'skip': 26, 'skip-not-active-entity': 25, 'skip-entity-not-found': 1, 'updated-total': 0})

The majority of merges seem to be cases where there are multiple articles in the same PDF.
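
## Appendix: grouping sketch

For reference, a minimal sketch of the kind of grouping `file_dupe_to_json.py` performs, assuming the input is the sorted `<sha1>\t<file_ident>` TSV produced above. The actual script is not shown in these notes, so the structure below and the choice to leave `primary_id` as `None` (letting the merger pick a primary) are assumptions.

    #!/usr/bin/env python3
    """
    Hypothetical sketch of the TSV-to-JSON grouping step; the real
    file_dupe_to_json.py may differ. Reads sorted "<sha1>\t<file_ident>"
    lines on stdin and emits one JSON object per SHA-1 group on stdout.
    """

    import json
    import sys
    from itertools import groupby

    def run() -> None:
        rows = (line.rstrip("\n").split("\t") for line in sys.stdin if line.strip())
        # input is already sorted by SHA-1, so adjacent rows form a group
        for sha1, group in groupby(rows, key=lambda r: r[0]):
            idents = sorted({r[1] for r in group})
            if len(idents) < 2:
                # only emit actual duplicate groups
                continue
            print(json.dumps({
                "entity_type": "file",
                "primary_id": None,
                "duplicate_ids": idents,
                "evidence": {
                    "extid": sha1,
                    "extid_type": "sha1",
                },
            }, sort_keys=True))

    if __name__ == "__main__":
        run()

Piping `sha1_ident.dupes.tsv` through something like this yields one JSON line per duplicate group, consistent with the ~2,039 groups counted during prep.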