These cleanups are primarily intended to fix bogus wayback links with a 'None'
datetime for files that are actually in petabox (archive.org, not
web.archive.org). These URLs were created accidentally during fatcat
bootstrapping; there are about 300k such file entities to fix.

Will also update the reltype of archive.org links to 'archive' (instead of
'repository'), which is the new preferred style.
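
To make the two intended changes concrete, here is a minimal sketch of a per-entity cleanup. It is not the actual `fatcat_cleanup.py` logic, which is not shown in these notes. It assumes a file entity is a dict with a `urls` list of `{"url", "rel"}` objects, and it assumes the bogus wayback link can simply be dropped rather than rewritten. The helper name `clean_file_entity` is hypothetical.

```python
# Substring that marks a wayback URL with a missing ('None') datetime.
BOGUS_PREFIX = "://web.archive.org/web/None/"


def clean_file_entity(entity: dict) -> dict:
    """Return a cleaned copy of a file entity dict (hypothetical helper).

    Assumptions (not confirmed by the notes):
    - entity["urls"] is a list of {"url": ..., "rel": ...} dicts
    - bogus 'None' wayback links are dropped, not rewritten
    """
    cleaned_urls = []
    for u in entity.get("urls", []):
        url, rel = u["url"], u["rel"]
        if BOGUS_PREFIX in url:
            # Bogus wayback link with a 'None' datetime: skip it.
            continue
        if "://archive.org/" in url and rel == "repository":
            # archive.org (petabox) links use the newer 'archive' reltype.
            rel = "archive"
        cleaned_urls.append({"url": url, "rel": rel})
    # Copy the entity so the input is left unmodified.
    return {**entity, "urls": cleaned_urls}
```

Returning a copy instead of mutating in place makes it easy to compare the entity before and after, and to skip the update when nothing changed.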
Generated the set of files to update like:
zcat file_export.2019-07-07.json.gz | rg 'web.archive.org/web/None' | gzip > file_export.2019-07-07.None.json.gz
zcat /srv/fatcat/datasets/file_export.2019-07-07.None.json.gz | wc -l
304308
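
For reference, roughly the same filter-and-count step can be sketched in Python. The function name and arguments below are illustrative only. Unlike the `rg` pattern, where each `.` matches any character, this does a literal substring match, so it is slightly stricter.

```python
import gzip


def filter_lines(src_path, dst_path, needle="web.archive.org/web/None"):
    """Copy lines containing `needle` from one gzipped file to another.

    Returns the number of lines written (the equivalent of `wc -l`).
    """
    count = 0
    with gzip.open(src_path, "rt", encoding="utf-8") as src, \
            gzip.open(dst_path, "wt", encoding="utf-8") as dst:
        for line in src:
            if needle in line:
                dst.write(line)
                count += 1
    return count
```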
## QA
Running at git rev:
984a1b157990f42f8c57815f4b3c00f6455a114f
Created a new 'cleanup-bot' account and credentials. Put token in local env.
Ran with a couple hundred entities first; edits look good.
zcat /srv/fatcat/datasets/file_export.2019-07-07.None.json.gz | head -n200 | ./fatcat_cleanup.py files -
Then the full command, with batchsize=100:
time zcat /srv/fatcat/datasets/file_export.2019-07-07.None.json.gz | pv -l | ./fatcat_cleanup.py --batch-size 100 files -
Should finish in a couple hours.
304k 1:05:19 [77.6 /s]
Counter({'cleaned': 304308, 'lines': 304308, 'updated': 297308, 'skip-revision': 7000})
real 65m20.613s
user 20m40.828s
sys 0m34.492s
## Production
Again ran with a couple hundred entities first; edits look good.
zcat /srv/fatcat/datasets/file_export.2019-07-07.None.json.gz | head -n200 | ./fatcat_cleanup.py files -
Then the full command, with batchsize=100:
time zcat /srv/fatcat/datasets/file_export.2019-07-07.None.json.gz | pv -l | ./fatcat_cleanup.py --batch-size 100 files -
[...]
304k 1:03:10 [80.3 /s]
Counter({'cleaned': 304308, 'lines': 304308, 'updated': 304107, 'skip-revision': 201})
real 63m11.631s
user 21m8.504s
sys 0m31.888s
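
As a quick sanity check on the two runs, the counters and throughput above can be verified with a few lines of Python. This is only a sketch: it assumes that every input line ends up as either 'updated' or 'skip-revision', which is inferred from the numbers rather than from the script itself. Elapsed times are taken from the `pv` output.

```python
from collections import Counter


def counts_consistent(counts):
    """True if updated + skipped accounts for every line (assumed invariant)."""
    return (
        counts["updated"] + counts["skip-revision"] == counts["lines"]
        and counts["cleaned"] == counts["lines"]
    )


def lines_per_second(counts, elapsed_seconds):
    """Average throughput over the whole run."""
    return counts["lines"] / elapsed_seconds


qa = Counter({"cleaned": 304308, "lines": 304308,
              "updated": 297308, "skip-revision": 7000})
prod = Counter({"cleaned": 304308, "lines": 304308,
                "updated": 304107, "skip-revision": 201})

qa_rate = lines_per_second(qa, 65 * 60 + 19)      # pv reported 1:05:19
prod_rate = lines_per_second(prod, 63 * 60 + 10)  # pv reported 1:03:10

print(counts_consistent(qa), counts_consistent(prod))  # True True
print(f"QA: {qa_rate:.1f}/s, production: {prod_rate:.1f}/s")
# QA: 77.6/s, production: 80.3/s
```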