These cleanups are primarily intended to fix bogus 'None' datetime links to
wayback for files that are actually in petabox (archive.org, not
web.archive.org). These URLs were created accidentally during fatcat
bootstrapping; there are about 300k such file entities to fix.

Will also update archive.org link reltype to 'archive' (instead of
'repository'), which is the new preferred style.
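
As a rough illustration of the two fixes described above, here is a minimal
sketch. It is not the actual `fatcat_cleanup.py` logic. It assumes each file
URL is a plain dict with `url` and `rel` keys, and that the text after
`/web/None/` is the original petabox URL (following the usual wayback
`/web/<timestamp>/<original-url>` layout). The function name
`clean_file_urls` is hypothetical.

```python
from urllib.parse import urlsplit

# Marker for wayback links whose timestamp slot holds the literal string "None".
BOGUS_MARKER = "://web.archive.org/web/None/"


def clean_file_urls(urls):
    """Return (cleaned_urls, changed) for a list of {'url': ..., 'rel': ...} dicts.

    Hypothetical helper: the rewrite rule is an assumption, not taken from
    the real cleanup script.
    """
    cleaned = []
    changed = False
    for entry in urls:
        entry = dict(entry)  # copy, so the caller's data is not mutated
        if BOGUS_MARKER in entry["url"]:
            # Assumption: everything after the marker is the original URL.
            original = entry["url"].split(BOGUS_MARKER, 1)[1]
            if "://" not in original:
                original = "https://" + original
            entry["url"] = original
            changed = True
        host = urlsplit(entry["url"]).hostname
        # archive.org (petabox) links get reltype 'archive'; web.archive.org
        # has a different hostname and is left alone here.
        if host in ("archive.org", "www.archive.org") and entry.get("rel") != "archive":
            entry["rel"] = "archive"
            changed = True
        cleaned.append(entry)
    return cleaned, changed
```

An entity whose URLs come back with `changed == False` would not need an
edit at all.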

Generated the set of files to update like:

    zcat file_export.2019-07-07.json.gz | rg 'web.archive.org/web/None' | gzip > file_export.2019-07-07.None.json.gz

    zcat /srv/fatcat/datasets/file_export.2019-07-07.None.json.gz | wc -l
    304308

## QA

Running at git rev:

    984a1b157990f42f8c57815f4b3c00f6455a114f

Created a new 'cleanup-bot' account and credentials. Put token in local env.

Ran with a couple hundred entities first; edits look good.

    zcat /srv/fatcat/datasets/file_export.2019-07-07.None.json.gz | head -n200 | ./fatcat_cleanup.py files -

Then the full command, with batchsize=100:

    time zcat /srv/fatcat/datasets/file_export.2019-07-07.None.json.gz | pv -l | ./fatcat_cleanup.py --batch-size 100 files -

Expected this to finish in a couple of hours; it took about 65 minutes:

    304k 1:05:19 [77.6 /s]

    Counter({'cleaned': 304308, 'lines': 304308, 'updated': 297308, 'skip-revision': 7000})

    real    65m20.613s
    user    20m40.828s
    sys     0m34.492s

## Production

Again ran with a couple hundred entities first; edits look good.

    zcat /srv/fatcat/datasets/file_export.2019-07-07.None.json.gz | head -n200 | ./fatcat_cleanup.py files -

Then the full command, with batchsize=100:

    time zcat /srv/fatcat/datasets/file_export.2019-07-07.None.json.gz | pv -l | ./fatcat_cleanup.py --batch-size 100 files -
    [...]
    304k 1:03:10 [80.3 /s]
    Counter({'cleaned': 304308, 'lines': 304308, 'updated': 304107, 'skip-revision': 201})

    real    63m11.631s
    user    21m8.504s
    sys     0m31.888s
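
The counters and rates from both runs can be sanity-checked with a few lines
of arithmetic. This sketch only re-uses the numbers logged above; the helper
names are hypothetical, and reading 'updated' plus 'skip-revision' as
covering every input line is an interpretation of the counters rather than
documented behavior.

```python
def parse_elapsed(text):
    """Convert an 'H:MM:SS' elapsed time (as printed by pv) into seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def counters_consistent(counter):
    """True if every input line was either updated or skipped."""
    total = counter["updated"] + counter["skip-revision"]
    return total == counter["cleaned"] == counter["lines"]


def lines_per_second(counter, elapsed):
    """Average throughput, rounded to one decimal place like pv's display."""
    return round(counter["lines"] / parse_elapsed(elapsed), 1)


qa = {"cleaned": 304308, "lines": 304308, "updated": 297308, "skip-revision": 7000}
prod = {"cleaned": 304308, "lines": 304308, "updated": 304107, "skip-revision": 201}

print(counters_consistent(qa), lines_per_second(qa, "1:05:19"))      # True 77.6
print(counters_consistent(prod), lines_per_second(prod, "1:03:10"))  # True 80.3
```

Both runs account for all 304308 lines, and the computed rates agree with
what `pv` reported.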