author | Bryan Newbold <bnewbold@robocracy.org> | 2021-11-29 15:24:44 -0800 |
---|---|---|
committer | Bryan Newbold <bnewbold@robocracy.org> | 2021-11-29 15:24:44 -0800 |
commit | b47aba853da8ad127fb6d33933d763e5d64d436b (patch) | |
tree | b8aaaa99cec31d78145ea31eeecd855e3e0c1cc1 /extra/bulk_edits/2021-11-11_wayback_short_ts.md | |
parent | edfcf4b0d56e4ee9a7a77345a49d18fb698e1533 (diff) | |
update to truncated wayback timestamp issue
Diffstat (limited to 'extra/bulk_edits/2021-11-11_wayback_short_ts.md')
-rw-r--r-- | extra/bulk_edits/2021-11-11_wayback_short_ts.md | 24 |
1 files changed, 24 insertions, 0 deletions
diff --git a/extra/bulk_edits/2021-11-11_wayback_short_ts.md b/extra/bulk_edits/2021-11-11_wayback_short_ts.md
index 20349f0c..c6b284ed 100644
--- a/extra/bulk_edits/2021-11-11_wayback_short_ts.md
+++ b/extra/bulk_edits/2021-11-11_wayback_short_ts.md
@@ -50,3 +50,27 @@ Looks good! Run the full batch.
     Counter({'total': 1203309, 'update': 1199782, 'skip-bad-wayback-timestamp': 2556, 'skip': 971, 'skip-status': 923, 'skip-bad-replacement': 48, 'insert': 0, 'exists': 0})
 
 On the order of 99.7% were updated/fixed, over 9.5 million file entities, taking almost 13 hours.
+
+## Production Follow-up (2021-11-29)
+
+Fixed a small bug in `fetch_full_cdx_ts.py` helper script, and running import
+again:
+
+    git log | head -n1
+    # commit ec2809ef2ac51c992463839c1e3451927f5e1661
+
+    export FATCAT_AUTH_WORKER_CLEANUP=[...]
+
+    zcat /srv/fatcat/datasets/files_20211127_moreshortts.fetched.json.gz | wc -l
+    # 29494
+
+    zcat /srv/fatcat/datasets/files_20211127_moreshortts.fetched.json.gz \
+        | pv -l \
+        | python -m fatcat_tools.cleanups.file_short_wayback_ts -
+    # Counter({'total': 29494, 'update': 21358, 'skip': 8126, 'skip-status': 7677, 'skip-bad-replacement': 449, 'skip-bad-wayback': 9, 'skip-bad-wayback-timestamp': 1, 'insert': 0, 'exists': 0})
+
+That caught 72% of the outstanding files. At this point would almost be willing
+to just remove the outstanding bad URLs (possibly leaving the files with no
+access options), but might also be worth revisiting in the future to trace down
+exactly what is going on.
+
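The percentages quoted in the notes (99.7% for the full batch, 72% for the follow-up) can be recomputed from the two `Counter` lines. The following is a minimal sketch of that arithmetic, assuming the counts are copied by hand from the log output above. The `update_rate` and `is_consistent` helpers are hypothetical names used for illustration and are not part of `fatcat_tools`. Reading `skip` as the sum of `skip-status` and `skip-bad-replacement` is an inference from the numbers, not something the notes state.

```python
from collections import Counter

# Counts copied by hand from the two cleanup runs quoted in the notes.
full_batch = Counter({
    "total": 1203309, "update": 1199782, "skip-bad-wayback-timestamp": 2556,
    "skip": 971, "skip-status": 923, "skip-bad-replacement": 48,
    "insert": 0, "exists": 0,
})
followup = Counter({
    "total": 29494, "update": 21358, "skip": 8126, "skip-status": 7677,
    "skip-bad-replacement": 449, "skip-bad-wayback": 9,
    "skip-bad-wayback-timestamp": 1, "insert": 0, "exists": 0,
})


def update_rate(counts):
    """Fraction of processed entries that ended up updated (hypothetical helper)."""
    return counts["update"] / counts["total"]


def is_consistent(counts):
    """Check that the top-level outcomes add up to 'total'.

    Assumption: 'skip' already includes 'skip-status' and
    'skip-bad-replacement', so those two are not added again.
    """
    top_level = (
        counts["update"] + counts["skip"] + counts["skip-bad-wayback"]
        + counts["skip-bad-wayback-timestamp"] + counts["insert"] + counts["exists"]
    )
    return top_level == counts["total"]


for name, counts in [("full batch", full_batch), ("follow-up", followup)]:
    print(f"{name}: {update_rate(counts):.1%} updated, consistent={is_consistent(counts)}")
```

A missing key such as `skip-bad-wayback` in the full batch reads as 0 from a `Counter`, so the same check works on both runs without special-casing.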