author: Bryan Newbold <bnewbold@robocracy.org> 2021-11-29 14:34:02 -0800
committer: Bryan Newbold <bnewbold@robocracy.org> 2021-11-29 14:34:02 -0800
commit: c32154f2875a7fb9aac727013e1475cdd811e180
tree: f0e061498a101fa824995fb6ec9f91e7e44257e1 /notes/bulk_edits/2019-10-08_file_cleanups.md
parent: c5ea2dba358624f4c14da0a1a988ae14d0edfd59
move notes/bulk_edits/ to extra/bulk_edits/
Diffstat (limited to 'notes/bulk_edits/2019-10-08_file_cleanups.md')
-rw-r--r-- notes/bulk_edits/2019-10-08_file_cleanups.md | 59
1 files changed, 0 insertions, 59 deletions
diff --git a/notes/bulk_edits/2019-10-08_file_cleanups.md b/notes/bulk_edits/2019-10-08_file_cleanups.md
deleted file mode 100644
index 2eebb363..00000000
--- a/notes/bulk_edits/2019-10-08_file_cleanups.md
+++ /dev/null
@@ -1,59 +0,0 @@
-
-These cleanups are primarily intended to fix bogus 'None' datetime links to
-wayback for files that are actually in petabox (archive.org not
-web.archive.org). These URLs were created accidentally during fatcat
-bootstrapping; there are about 300k such file entities to fix.
-
-Will also update archive.org link reltype to 'archive' (instead of
-'repository'), which is the new preferred style.
-
-Generated the set of files to update like:
-
- zcat file_export.2019-07-07.json.gz | rg 'web.archive.org/web/None' | gzip > file_export.2019-07-07.None.json.gz
-
- zcat /srv/fatcat/datasets/file_export.2019-07-07.None.json.gz | wc -l
- 304308
-
-## QA
-
-Running at git rev:
-
- 984a1b157990f42f8c57815f4b3c00f6455a114f
-
-Created a new 'cleanup-bot' account and credentials. Put token in local env.
-
-Ran with a couple hundred entities first; edits look good.
-
- zcat /srv/fatcat/datasets/file_export.2019-07-07.None.json.gz | head -n200 | ./fatcat_cleanup.py files -
-
-Then the full command, with batchsize=100:
-
- time zcat /srv/fatcat/datasets/file_export.2019-07-07.None.json.gz | pv -l | ./fatcat_cleanup.py --batch-size 100 files -
-
-Should finish in a couple hours.
-
- 304k 1:05:19 [77.6 /s]
-
- Counter({'cleaned': 304308, 'lines': 304308, 'updated': 297308, 'skip-revision': 7000})
-
- real 65m20.613s
- user 20m40.828s
- sys 0m34.492s
-
-## Production
-
-Again ran with a couple hundred entities first; edits look good.
-
- zcat /srv/fatcat/datasets/file_export.2019-07-07.None.json.gz | head -n200 | ./fatcat_cleanup.py files -
-
-Then the full command, with batchsize=100:
-
- time zcat /srv/fatcat/datasets/file_export.2019-07-07.None.json.gz | pv -l | ./fatcat_cleanup.py --batch-size 100 files -
- [...]
- 304k 1:03:10 [80.3 /s]
- Counter({'cleaned': 304308, 'lines': 304308, 'updated': 304107, 'skip-revision': 201})
-
- real 63m11.631s
- user 21m8.504s
- sys 0m31.888s
-
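
For readers of the deleted notes, the selection and reltype steps they describe can be sketched roughly as below. This is an illustrative sketch, not the logic of `fatcat_cleanup.py`: the helper names (`has_bogus_wayback`, `retag_archive_org`, `scan`) are hypothetical, the entity layout (a `urls` list of objects with `url` and `rel` keys) is an assumption, and treating only the exact host `archive.org` as petabox is also an assumption. The sketch only flags the bogus 'None' wayback links and does not attempt to repair them, since the notes do not say how that was done.

```python
import json
from collections import Counter
from urllib.parse import urlparse

# Substring the notes grep for when building the work list.
BOGUS_MARKER = "web.archive.org/web/None"


def has_bogus_wayback(entity):
    """True if any URL on the (assumed) file entity is a 'None' wayback link."""
    return any(BOGUS_MARKER in u.get("url", "") for u in entity.get("urls", []))


def retag_archive_org(entity):
    """Switch rel 'repository' to 'archive' on archive.org URLs, in place.

    Returns the number of URL entries changed. Only the exact host
    'archive.org' is matched (an assumption of this sketch).
    """
    changed = 0
    for u in entity.get("urls", []):
        host = urlparse(u.get("url", "")).hostname
        if host == "archive.org" and u.get("rel") == "repository":
            u["rel"] = "archive"
            changed += 1
    return changed


def scan(lines):
    """Tally JSON lines, flagged wayback links, and retagged entities."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        counts["lines"] += 1
        entity = json.loads(line)
        if has_bogus_wayback(entity):
            counts["bogus-wayback"] += 1
        if retag_archive_org(entity):
            counts["retagged"] += 1
    return counts
```

Feeding `scan` the decompressed lines of an export (for example via `gzip.open(path, "rt")`) gives a dry-run tally comparable in spirit to the `Counter(...)` summaries recorded in the notes, without issuing any edits.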