notes/bootstrap/import_timing_20190129.txt


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135


This is the first attempt at a clean final production import. Running in QA; if
all goes well would dump and import in prod.

Made a number of changes since yesterday's import, so won't be surprised if run
in to problems. Plan is to make any fixes and push through to the end to turn
up any additional issues/bugs, then iterate yet again if needed.

NOTE: this import ended up being abandoned (too slow) in lieu of 2019-01-30.

## Service up/down

    sudo service fatcat-web stop
    sudo service fatcat-api stop

    # shutdown all the import/export/etc
    # delete any snapshots and /tmp/fatcat*
    sudo rm /srv/fatcat/snapshots/*
    sudo rm /tmp/fatcat_*

    # git pull
    # ansible playbook push
    # re-build fatcat-api to ensure that worked

    sudo service fatcat-web stop
    sudo service fatcat-api stop

    # as postgres user:
    DATABASE_URL=postgres://postgres@/fatcat_prod /opt/cargo/bin/diesel database reset
    sudo service postgresql restart

    http delete :9200/fatcat_release
    http delete :9200/fatcat_container
    http delete :9200/fatcat_changelog
    http put :9200/fatcat_release < release_schema.json
    http put :9200/fatcat_container < container_schema.json
    http put :9200/fatcat_changelog < changelog_schema.json
    sudo service elasticsearch stop
    sudo service kibana stop

    sudo service fatcat-api start

    # ensure rust/.env -> /srv/fatcat/config/fatcat_api.env
    wget https://archive.org/download/ia_journal_metadata/journal_metadata.2019-01-25.json

    # create new auth keys via bootstrap (edit debug -> release first)
    # update config/env/ansible/etc with new tokens
    # delete existing entities

    # run the imports!

    # after running below imports
    sudo service fatcat-web start
    sudo service elasticsearch start
    sudo service kibana start

## Import commands

    rust version (as webcrawl): 1.32.0
    git commit: 586458cacabd1d2f4feb0d0f1a9558f229f48f5e

    export LC_ALL=C.UTF-8
    export FATCAT_AUTH_WORKER_JOURNAL_METADATA="..."
    time ./fatcat_import.py journal-metadata /srv/fatcat/datasets/journal_metadata.2019-01-25.json

        hit a bug (see below) but safe to continue

        Counter({'total': 107869, 'insert': 102623, 'exists': 5200, 'skip': 46, 'update': 0})
        real    4m43.635s
        user    1m55.904s
        sys     0m5.376s

    export FATCAT_AUTH_WORKER_ORCID="..."
    time parallel --bar --pipepart -j8 -a /srv/fatcat/datasets/public_profiles_1_2_json.all.json ./fatcat_import.py orcid -

        hit another bug (see below), again safe to continue
        
        Counter({'total': 48888, 'insert': 48727, 'skip': 161, 'exists': 0, 'update': 0}) (etc)
        real    29m56.773s
        user    89m2.532s
        sys     5m11.104s

    export FATCAT_AUTH_WORKER_CROSSREF="..."
    time xzcat /srv/fatcat/datasets/crossref-works.2018-09-05.json.xz --verbose | time parallel -j20 --round-robin --pipe ./fatcat_import.py crossref - /srv/fatcat/datasets/20181203.ISSN-to-ISSN-L.txt --extid-map-file /srv/fatcat/datasets/release_ids.ia_munge_20180908.sqlite3 --bezerk-mode

        running very slow; maybe batch size of 100 was too large? pushing 80+
        MB/sec, but very little CPU utilization. some ISSN lookups taking up to
        a second each (!). no vacuum in progress. at xzcat, only '2.1 MiB/s'

        at current rate will take more than 48 hours. hrm.

        after 3.5 hours or so, cancelled and restarted in non-bezerk mode, with batch size of 50.

    xzcat /srv/fatcat/datasets/crossref-works.2018-09-05.json.xz --verbose | time parallel -j20 --round-robin --pipe ./fatcat_import.py --batch-size 50 crossref - /srv/fatcat/datasets/20181203.ISSN-to-ISSN-L.txt --extid-map-file /srv/fatcat/datasets/release_ids.ia_munge_20180908.sqlite3

        at xzcat, about 5 Mib/s. just after citation efficiency, full import was around 20 hours.

        if slows down again, may be due to some threads failing and not
        dumping. if that's the case, should try with 'head -n200000' or so to
        catch output errors.

        ps aux | rg fatcat_import.py | rg -v perl | wc -l   => 22

        at 7.2% (beyond earlier progress), and now inserting (not just
        lookups), pushing 5.6 MiB/sec, 17 hours (estimated) to go, seems to be
        running fine.

        at 12 hours in, at 20% and down to 1.9 MiB/sec again. Lots of disk I/O
        (80 MB/sec write), seems to be bottleneck, not sure why.

        would take... about an hour to restart, might save 20+ hours, might waste 14?

        Counter({'total': 5005785, 'insert': 4319312, 'exists': 457819, 'skip': 228654, 'update': 0})
        531544.60user 13597.32system 60:38:43elapsed 249%CPU (0avgtext+0avgdata 448748maxresident)k
        124037840inputs+395235552outputs (140major+41973732minor)pagefaults 0swaps

        real    3638m43.712s => 60 hours (!!!)
        user    8944m37.944s
        sys     232m25.200s

    export FATCAT_AUTH_SANDCRAWLER="..."
    export FATCAT_API_AUTH_TOKEN=$FATCAT_AUTH_SANDCRAWLER
    time zcat /srv/fatcat/datasets/ia_papers_manifest_2018-01-25.matched.json.gz | pv -l | time parallel -j12 --round-robin --pipe ./fatcat_import.py --batch-size 50 matched --bezerk-mode -

    time zcat /srv/fatcat/datasets/2018-12-18-2237.09-matchcrossref.insertable.json.gz | pv -l | time parallel -j12 --round-robin --pipe ./fatcat_import.py --batch-size 50 matched -

    time zcat /srv/fatcat/datasets/2018-09-23-0405.30-dumpgrobidmetainsertable.longtail_join.filtered.tsv.gz | pv -l | time parallel -j12 --round-robin --pipe ./fatcat_import.py --batch-size 50 grobid-metadata - --longtail-oa

## Bugs encountered

x broke a constraint or made an otherwise invalid request: name is required for all Container entities
    => wasn't bezerk mode, so should be fine to continue
x {"success":false,"error":"BadRequest","message":"broke a constraint or made an otherwise invalid request: display_name is required for all Creator entities"}
    => wasn't bezerk mode, so should be fine to continue