-rw-r--r-- | notes/bulk_edits/2019-10-08_file_cleanups.md | 59
-rw-r--r-- | notes/bulk_edits/CHANGELOG.md | 6
2 files changed, 64 insertions, 1 deletion
diff --git a/notes/bulk_edits/2019-10-08_file_cleanups.md b/notes/bulk_edits/2019-10-08_file_cleanups.md
new file mode 100644
index 00000000..b61b37f0
--- /dev/null
+++ b/notes/bulk_edits/2019-10-08_file_cleanups.md
@@ -0,0 +1,59 @@
+
+These cleanups are primarily intended to fix bogus 'None' datetime links to
+wayback for files that are actually in petabox (archive.org not
+web.archive.org). These URLs were created accidentally during fatcat
+bootstrapping; there are about 300k such file entities to fix.
+
+Will also update archive.org link reltype to 'archive' (instead of
+'repository'), which is the new preferred style.
+
+Generated the set of files to update like:
+
+    zcat file_export.2019-07-07.json.gz | rg 'web.archive.org/web/None' | gzip > file_export.2019-07-07.None.json.gz
+
+    zcat /srv/fatcat/datasets/file_export.2019-07-07.None.json.gz | wc -l
+    304308
+
+## QA
+
+Running at git rev:
+
+    984a1b157990f42f8c57815f4b3c00f6455a114f
+
+Created a new 'cleanup-bot' account and credentials. Put token in local env.
+
+Ran with a couple hundred entities first; edits look good.
+
+    zcat /srv/fatcat/datasets/file_export.2019-07-07.None.json.gz | head -n200 | ./fatcat_cleanup.py files -
+
+Then the full command, with batchsize=100:
+
+    time zcat /srv/fatcat/datasets/file_export.2019-07-07.None.json.gz | pv -l | ./fatcat_cleanup.py --batch-size 100 files -
+
+Should finish in a couple hours.
+
+    304k 1:05:19 [77.6 /s]
+
+    Counter({'cleaned': 304308, 'lines': 304308, 'updated': 297308, 'skip-revision': 7000})
+
+    real 65m20.613s
+    user 20m40.828s
+    sys 0m34.492s
+
+## Production
+
+Again ran with a couple hundred entities first; edits look good.
+
+    zcat /srv/fatcat/datasets/file_export.2019-07-07.None.json.gz | head -n200 | ./fatcat_cleanup.py files -
+
+Then the full command, with batchsize=100:
+
+    time zcat /srv/fatcat/datasets/file_export.2019-07-07.None.json.gz | pv -l | ./fatcat_cleanup.py --batch-size 100 files -
+    [...]
+    304k 1:03:10 [80.3 /s]
+    Counter({'cleaned': 304308, 'lines': 304308, 'updated': 304107, 'skip-revision': 201})
+
+    real 63m11.631s
+    user 21m8.504s
+    sys 0m31.888s
+
diff --git a/notes/bulk_edits/CHANGELOG.md b/notes/bulk_edits/CHANGELOG.md
index 97b8f8a2..e1d11817 100644
--- a/notes/bulk_edits/CHANGELOG.md
+++ b/notes/bulk_edits/CHANGELOG.md
@@ -9,12 +9,16 @@ this file should probably get merged into the guide at some point.
 
 This file should not turn in to a TODO list!
 
+## 2019-10
+
+Updated 304,308 file entities to remove broken
+"https://web.archive.org/web/None/*" URLs.
+
 ## 2019-09
 
 Created and updated metadata for tens of thousands of containers, using
 "chocula" pipeline.
 
-
 ## 2019-08
 
 Merged/fixed roughly 100 container entities with invalid ISSN-L numbers (eg,
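
For readers who want a concrete picture of the per-entity change these notes describe, here is a minimal Python sketch. It is not the code in `fatcat_cleanup.py`: the function name `clean_file_entity` is hypothetical, and the entity shape (a dict with a `urls` list of `{"url", "rel"}` objects) is an assumption about the exported JSON. It drops URLs containing `web.archive.org/web/None/` and switches the rel of archive.org links from 'repository' to 'archive'.

```python
from typing import Any, Dict, List

# Substring that marks the bogus wayback links described in the notes.
BOGUS_WAYBACK = "://web.archive.org/web/None/"


def clean_file_entity(entity: Dict[str, Any]) -> Dict[str, Any]:
    """Return a cleaned copy of a file entity.

    Hypothetical helper: assumes `entity["urls"]` is a list of
    {"url": ..., "rel": ...} dicts. The input is left unmodified.
    """
    cleaned_urls: List[Dict[str, Any]] = []
    for u in entity.get("urls", []):
        url = u.get("url", "")
        # Drop wayback links whose datetime component is the literal 'None'.
        if BOGUS_WAYBACK in url:
            continue
        u = dict(u)  # copy so the caller's entity is not mutated
        # Petabox (archive.org) links move to the 'archive' rel.
        if "://archive.org/" in url and u.get("rel") == "repository":
            u["rel"] = "archive"
        cleaned_urls.append(u)
    cleaned = dict(entity)
    cleaned["urls"] = cleaned_urls
    return cleaned


if __name__ == "__main__":
    # Made-up example entity, for illustration only.
    example = {
        "urls": [
            {"url": "https://web.archive.org/web/None/http://example.com/a.pdf",
             "rel": "webarchive"},
            {"url": "https://archive.org/download/item/a.pdf",
             "rel": "repository"},
        ]
    }
    print(clean_file_entity(example))
```

Returning a copy rather than editing in place makes it easy to compare before and after, which is roughly what a 'skip-revision' style check would need, though how the real tool decides to skip is not stated in the notes.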