author Bryan Newbold <bnewbold@robocracy.org> 2019-10-09 13:11:33 -0700
committer Bryan Newbold <bnewbold@robocracy.org> 2019-10-09 13:11:33 -0700
commit 5748f3241117b52f5295dc589374ec0c219534e4 (patch)
tree 969bd2f4c2ff08331beccc02b5675905ab1752b0
parent 5808b06162263dee7e7d86d7369d19f299ddf4a9 (diff)
note file fixup pushed in prod
-rw-r--r-- notes/bulk_edits/2019-10-08_file_cleanups.md 59
-rw-r--r-- notes/bulk_edits/CHANGELOG.md 6
2 files changed, 64 insertions, 1 deletions
diff --git a/notes/bulk_edits/2019-10-08_file_cleanups.md b/notes/bulk_edits/2019-10-08_file_cleanups.md
new file mode 100644
index 00000000..b61b37f0
--- /dev/null
+++ b/notes/bulk_edits/2019-10-08_file_cleanups.md
@@ -0,0 +1,59 @@
+
+These cleanups are primarily intended to fix bogus 'None' datetime links to
+wayback for files that are actually in petabox (archive.org, not
+web.archive.org). These URLs were created accidentally during fatcat
+bootstrapping; there are about 300k such file entities to fix.
+
+Will also update archive.org link reltype to 'archive' (instead of
+'repository'), which is the new preferred style.
+
+Generated the set of files to update like:
+
+ zcat file_export.2019-07-07.json.gz | rg 'web.archive.org/web/None' | gzip > file_export.2019-07-07.None.json.gz
+
+ zcat /srv/fatcat/datasets/file_export.2019-07-07.None.json.gz | wc -l
+ 304308
+
+## QA
+
+Running at git rev:
+
+ 984a1b157990f42f8c57815f4b3c00f6455a114f
+
+Created a new 'cleanup-bot' account and credentials. Put token in local env.
+
+Ran with a couple hundred entities first; edits look good.
+
+ zcat /srv/fatcat/datasets/file_export.2019-07-07.None.json.gz | head -n200 | ./fatcat_cleanup.py files -
+
+Then the full command, with batchsize=100:
+
+ time zcat /srv/fatcat/datasets/file_export.2019-07-07.None.json.gz | pv -l | ./fatcat_cleanup.py --batch-size 100 files -
+
+Should finish in a couple hours.
+
+ 304k 1:05:19 [77.6 /s]
+
+ Counter({'cleaned': 304308, 'lines': 304308, 'updated': 297308, 'skip-revision': 7000})
+
+ real 65m20.613s
+ user 20m40.828s
+ sys 0m34.492s
+
+## Production
+
+Again ran with a couple hundred entities first; edits look good.
+
+ zcat /srv/fatcat/datasets/file_export.2019-07-07.None.json.gz | head -n200 | ./fatcat_cleanup.py files -
+
+Then the full command, with batchsize=100:
+
+ time zcat /srv/fatcat/datasets/file_export.2019-07-07.None.json.gz | pv -l | ./fatcat_cleanup.py --batch-size 100 files -
+ [...]
+ 304k 1:03:10 [80.3 /s]
+ Counter({'cleaned': 304308, 'lines': 304308, 'updated': 304107, 'skip-revision': 201})
+
+ real 63m11.631s
+ user 21m8.504s
+ sys 0m31.888s
+
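The per-entity cleanup described in the note above can be sketched in Python. This is a minimal illustration under stated assumptions, not the actual `fatcat_cleanup.py` logic: the function name `clean_file_entity` is hypothetical, each file entity is assumed to carry a `urls` list of objects with `url` and `rel` keys, and the bogus wayback link is simply dropped (matching the CHANGELOG wording "remove broken" URLs).

```python
from urllib.parse import urlparse

# Marker for the bogus wayback links described in the note (assumed substring match).
BOGUS_WAYBACK = "://web.archive.org/web/None/"

# Hostnames treated as petabox (archive.org proper); an assumption for this sketch.
PETABOX_HOSTS = ("archive.org", "www.archive.org")


def clean_file_entity(entity):
    """Return (cleaned_entity, changed) without mutating the input dict.

    Hypothetical helper: drops 'web/None' wayback URLs and switches the rel of
    archive.org links from 'repository' to 'archive'.
    """
    changed = False
    cleaned_urls = []
    for u in entity.get("urls", []):
        url = u.get("url", "")
        if BOGUS_WAYBACK in url:
            # Broken link with a 'None' datetime: drop it entirely.
            changed = True
            continue
        host = urlparse(url).netloc.lower()
        if host in PETABOX_HOSTS and u.get("rel") == "repository":
            # Copy the URL object so the caller's data is left untouched.
            u = dict(u, rel="archive")
            changed = True
        cleaned_urls.append(u)
    return dict(entity, urls=cleaned_urls), changed
```

A caller could read the exported JSON lines one at a time, pass each decoded entity through this function, and submit an edit only when `changed` is true; how the real tool batches edits and skips already-updated revisions is not shown here.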
diff --git a/notes/bulk_edits/CHANGELOG.md b/notes/bulk_edits/CHANGELOG.md
index 97b8f8a2..e1d11817 100644
--- a/notes/bulk_edits/CHANGELOG.md
+++ b/notes/bulk_edits/CHANGELOG.md
@@ -9,12 +9,16 @@ this file should probably get merged into the guide at some point.
This file should not turn in to a TODO list!
+## 2019-10
+
+Updated 304,308 file entities to remove broken
+"https://web.archive.org/web/None/*" URLs.
+
## 2019-09
Created and updated metadata for tens of thousands of containers, using
"chocula" pipeline.
-
## 2019-08
Merged/fixed roughly 100 container entities with invalid ISSN-L numbers (eg,