The QA import is running really slowly; this is a parallel attempt in case things
are faster on the fatcat-prod2-vm machine, using a batch size of 50 and bezerk mode.

NOTE: this ended up being the successful/"final" bootstrap import.

## Service up/down

    sudo service fatcat-web stop
    sudo service fatcat-api stop

    # shutdown all the import/export/etc
    # delete any snapshots and /tmp/fatcat*
    sudo rm /srv/fatcat/snapshots/*
    sudo rm /tmp/fatcat_*

    # git pull
    # ansible playbook push
    # re-build fatcat-api to ensure that worked

    sudo service fatcat-web stop
    sudo service fatcat-api stop

    # as postgres user:
    DATABASE_URL=postgres://postgres@/fatcat_prod /opt/cargo/bin/diesel database reset
    sudo service postgresql restart

    http delete :9200/fatcat_release
    http delete :9200/fatcat_container
    http delete :9200/fatcat_changelog
    http put :9200/fatcat_release < release_schema.json
    http put :9200/fatcat_container < container_schema.json
    http put :9200/fatcat_changelog < changelog_schema.json
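    # optional sanity check (not part of the original run): confirm the empty
    # indices were created, e.g.:
    #   http get :9200/_cat/indices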
    sudo service elasticsearch stop
    sudo service kibana stop

    sudo service fatcat-api start

    # ensure rust/.env -> /srv/fatcat/config/fatcat_api.env
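    #   e.g. (assumption; the checkout lives at /srv/fatcat/src per the dump transcript below):
    #   ln -sfn /srv/fatcat/config/fatcat_api.env /srv/fatcat/src/rust/.env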
    wget https://archive.org/download/ia_journal_metadata/journal_metadata.2019-01-25.json

    # if necessary:
    #  ALTER USER fatcat WITH SUPERUSER;
    #  ALTER USER fatcat WITH PASSWORD '...';
    # create new auth keys via bootstrap (edit debug -> release first)
    # update config/env/ansible/etc with new tokens
    # delete existing entities

    # run the imports!

    # after running below imports
    sudo service fatcat-web start
    sudo service elasticsearch start
    sudo service kibana start

## Import commands

    git commit (as webcrawl user): 1fe371288daf417cdf44b94e372b485426b47134
    rust version: 1.32.0

    export LC_ALL=C.UTF-8
    export FATCAT_AUTH_WORKER_JOURNAL_METADATA="..."
    time ./fatcat_import.py journal-metadata /srv/fatcat/datasets/journal_metadata.2019-01-25.json

        Counter({'total': 107869, 'insert': 107823, 'skip': 46, 'update': 0, 'exists': 0})
        real    6m2.287s
        user    2m4.612s
        sys     0m5.664s

    export FATCAT_AUTH_WORKER_ORCID="..."
    time parallel --bar --pipepart -j8 -a /srv/fatcat/datasets/public_profiles_1_2_json.all.json ./fatcat_import.py orcid -

        98% 79:1=22s
        Counter({'total': 48097, 'insert': 47908, 'skip': 189, 'exists': 0, 'update': 0})
        100% 80:0=0s                                                                                                 

        real    33m9.211s
        user    93m33.040s
        sys     5m32.176s

    export FATCAT_AUTH_WORKER_CROSSREF="..."
    time xzcat /srv/fatcat/datasets/crossref-works.2018-09-05.json.xz --verbose | time parallel -j20 --round-robin --pipe ./fatcat_import.py crossref - /srv/fatcat/datasets/20181203.ISSN-to-ISSN-L.txt --extid-map-file /srv/fatcat/datasets/release_ids.ia_munge_20180908.sqlite3 --bezerk-mode

        seems to be maintaining 9.1 MiB/s and estimates 15 hours; roughly 200 MB/s of disk writes. we'll see!

        100 %        33.2 GiB / 331.9 GiB = 0.100   3.6 MiB/s   26:16:57

        Counter({'total': 5001477, 'insert': 4784708, 'skip': 216769, 'update': 0, 'exists': 0})
        395971.48user 8101.15system 26:17:07elapsed 427%CPU (0avgtext+0avgdata 431560maxresident)k
        232972688inputs+477055792outputs (334645major+39067735minor)pagefaults 0swaps

        real    1577m7.908s
        user    6681m58.948s
        sys     141m25.560s

    export FATCAT_AUTH_SANDCRAWLER="..."
    export FATCAT_API_AUTH_TOKEN=$FATCAT_AUTH_SANDCRAWLER
    time zcat /srv/fatcat/datasets/ia_papers_manifest_2018-01-25.matched.json.gz | pv -l | time parallel -j12 --round-robin --pipe ./fatcat_import.py --batch-size 50 matched - --bezerk-mode

        (accidentally lost, but took about 3 hours)

    time zcat /srv/fatcat/datasets/2018-12-18-2237.09-matchcrossref.insertable.json.gz | pv -l | time parallel -j12 --round-robin --pipe ./fatcat_import.py --batch-size 50 matched -

        Counter({'total': 827944, 'insert': 555359, 'exists': 261441, 'update': 11129, 'skip': 15})
        32115.82user 1370.12system 4:30:25elapsed 206%CPU (0avgtext+0avgdata 37312maxresident)k
        28200inputs+3767112outputs (108major+471069minor)pagefaults 0swaps

        real    270m25.288s
        user    535m52.908s
        sys     22m56.328s

    time zcat /srv/fatcat/datasets/2018-09-23-0405.30-dumpgrobidmetainsertable.longtail_join.filtered.tsv.gz | pv -l | time parallel -j12 --round-robin --pipe ./fatcat_import.py --batch-size 50 grobid-metadata - --longtail-oa

        1.6M 2:02:05 [ 218 /s]
        Counter({'total': 133095, 'insert': 120176, 'inserted.release': 120176, 'exists': 12919, 'skip': 0, 'update': 0})
        20854.82user 422.09system 2:02:12elapsed 290%CPU (0avgtext+0avgdata 63816maxresident)k
        29688inputs+21057912outputs (118major+809972minor)pagefaults 0swaps

        real    122m12.533s
        user    350m14.824s
        sys     7m29.820s

## After Import Stats

    bnewbold@wbgrp-svc503$ df -h .
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/vda1       1.8T  591G  1.1T  36% /

    Size:  294.82G

    select count(*) from changelog => 2,306,900
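
These figures should be reproducible with something along these lines (a sketch;
the exact invocations weren't recorded in these notes):

    sudo -u postgres psql fatcat_prod -c "SELECT pg_size_pretty(pg_database_size('fatcat_prod'))"
    sudo -u postgres psql fatcat_prod -c "SELECT count(*) FROM changelog"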


                          table_name                          | table_size | indexes_size | total_size 
--------------------------------------------------------------+------------+--------------+------------
 "public"."refs_blob"                                         | 70 GB      | 1896 MB      | 72 GB
 "public"."release_rev"                                       | 36 GB      | 32 GB        | 68 GB
 "public"."release_contrib"                                   | 25 GB      | 23 GB        | 48 GB
 "public"."release_edit"                                      | 9342 MB    | 10 GB        | 19 GB
 "public"."work_edit"                                         | 9342 MB    | 10 GB        | 19 GB
 "public"."release_ident"                                     | 6334 MB    | 10235 MB     | 16 GB
 "public"."work_ident"                                        | 6333 MB    | 10235 MB     | 16 GB
 "public"."file_rev_url"                                      | 6085 MB    | 2251 MB      | 8337 MB
 "public"."work_rev"                                          | 4092 MB    | 3795 MB      | 7887 MB
 "public"."file_rev"                                          | 1706 MB    | 2883 MB      | 4589 MB
 "public"."abstracts"                                         | 4089 MB    | 300 MB       | 4390 MB
 "public"."file_edit"                                         | 1403 MB    | 1560 MB      | 2963 MB
 "public"."file_ident"                                        | 944 MB     | 1529 MB      | 2473 MB
 "public"."file_rev_release"                                  | 889 MB     | 1558 MB      | 2447 MB
 "public"."release_rev_abstract"                              | 404 MB     | 536 MB       | 941 MB
 "public"."creator_rev"                                       | 371 MB     | 457 MB       | 827 MB
 "public"."creator_edit"                                      | 377 MB     | 420 MB       | 797 MB
 "public"."editgroup"                                         | 480 MB     | 285 MB       | 766 MB
 "public"."creator_ident"                                     | 255 MB     | 412 MB       | 667 MB
 "public"."changelog"                                         | 135 MB     | 139 MB       | 274 MB
 "public"."container_rev"                                     | 31 MB      | 11 MB        | 42 MB
 "public"."container_edit"                                    | 10 MB      | 12 MB        | 22 MB
 "public"."container_ident"                                   | 7216 kB    | 12 MB        | 19 MB

       relname        | too_much_seq | case |  rel_size   | seq_scan | idx_scan  
----------------------+--------------+------+-------------+----------+-----------
 creator_edit         |       -94655 | OK   |   395558912 |        2 |     94657
 container_edit       |       -94655 | OK   |    10911744 |        2 |     94657
 file_edit            |       -94655 | OK   |  1470627840 |        2 |     94657
 work_edit            |       -94655 | OK   |  9793445888 |        2 |     94657
 release_edit         |       -94655 | OK   |  9793241088 |        2 |     94657
 container_rev        |     -1168077 | OK   |    32546816 |        3 |   1168080
 file_rev_release     |     -3405015 | OK   |   931627008 |        2 |   3405017
 file_rev_url         |     -3405015 | OK   |  6379298816 |        2 |   3405017
 changelog            |     -3883131 | OK   |   141934592 |      382 |   3883513
 abstracts            |     -8367919 | OK   |  4011868160 |        1 |   8367920
 creator_ident        |     -9066121 | OK   |   267124736 |        5 |   9066126
 creator_rev          |    -14129509 | OK   |   388431872 |        3 |  14129512
 release_contrib      |    -17121962 | OK   | 26559053824 |        3 |  17121965
 release_rev_abstract |    -17123930 | OK   |   423878656 |        3 |  17123933
 file_ident           |    -18428366 | OK   |   989888512 |        5 |  18428371
 refs_blob            |    -50251199 | OK   | 15969484800 |        1 |  50251200
 container_ident      |    -74332007 | OK   |     7364608 |        5 |  74332012
 file_rev             |    -99555196 | OK   |  1788166144 |        4 |  99555200
 release_ident        |   -132347345 | OK   |  6639624192 |        5 | 132347350
 work_rev             |   -193625747 | OK   |  4289314816 |        1 | 193625748
 work_ident           |   -196604815 | OK   |  6639476736 |        5 | 196604820
 editgroup            |   -214491911 | OK   |   503414784 |        3 | 214491914
 release_rev          |   -482813156 | OK   | 38609838080 |       11 | 482813167
(23 rows)
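
The two listings above can be regenerated with queries along these lines (a
sketch using standard Postgres catalog/statistics views; the exact SQL used for
this run wasn't recorded here). As postgres user, in psql (`psql fatcat_prod`):

    -- per-table and per-index disk usage (cf. the first listing)
    SELECT '"' || nspname || '"."' || relname || '"' AS table_name,
           pg_size_pretty(pg_table_size(c.oid)) AS table_size,
           pg_size_pretty(pg_indexes_size(c.oid)) AS indexes_size,
           pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
      FROM pg_class c
      LEFT JOIN pg_namespace n ON n.oid = c.relnamespace
     WHERE relkind = 'r' AND nspname NOT IN ('pg_catalog', 'information_schema')
     ORDER BY pg_total_relation_size(c.oid) DESC;

    -- sequential vs. index scan counts (cf. the second listing)
    SELECT relname,
           seq_scan - idx_scan AS too_much_seq,
           CASE WHEN seq_scan - idx_scan > 0 THEN 'Missing Index?' ELSE 'OK' END AS "case",
           pg_relation_size(relid) AS rel_size,
           seq_scan, idx_scan
      FROM pg_stat_user_tables
     WHERE pg_relation_size(relid) > 80000
     ORDER BY too_much_seq DESC;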

## Dump Stats / Process

    DATABASE_URL=fatcat_prod ./ident_table_snapshot.sh /tmp

        postgres@wbgrp-svc503:/srv/fatcat/src/extra/sql_dumps$ DATABASE_URL=fatcat_prod ./ident_table_snapshot.sh /tmp
        Will move output to '/tmp'
        Running SQL (from 'fatcat_prod')...
        BEGIN
        COPY 1
        COPY 3906704 -> creators
        COPY 107826 -> containers
        COPY 14378465 -> files
        COPY 3 -> filesets
        COPY 3 -> webcaptures
        COPY 96812903 -> releases
        COPY 96812903 -> works
        COPY 2306900 -> changelog
        ROLLBACK

        Done: /tmp/fatcat_idents.2019-02-01.214959.r2306900.tar.gz

    fatcat-export:
        x files
        x containers
        - releases_extended (TODO: estimate time to dump based on file timestamps)
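
        # /tmp/fatcat_ident_releases.tsv presumably comes from unpacking the
        # ident snapshot tarball produced above, e.g. (assumption):
        #   cd /tmp && tar xvf fatcat_idents.2019-02-01.214959.r2306900.tar.gz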

        cat /tmp/fatcat_ident_releases.tsv | ./target/release/fatcat-export release --expand files,filesets,webcaptures,container -j8 | pv -l | gzip > /srv/fatcat/snapshots/release_export_expanded.json.gz

        96.8M 7:37:51 [3.52k/s]

        -rw-rw-r-- 1 webcrawl webcrawl  64G Feb  2 05:45 release_export_expanded.json.gz

    sql dumps:

        time sudo -u postgres pg_dump --verbose --format=tar fatcat_prod | pigz > /srv/fatcat/snapshots/fatcat_private_dbdump_${DATESLUG}.tar.gz

        real    112m34.310s
        user    296m46.112s
        sys     22m35.004s

        -rw-rw-r-- 1 bnewbold bnewbold  81G Feb  2 04:15 fatcat_private_dbdump_2019-02-02.022209.tar.gz
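
        # $DATESLUG isn't defined anywhere in these notes; judging from the dump
        # filename above it was presumably set to something like (assumption):
        #   DATESLUG=$(date +%Y-%m-%d.%H%M%S)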

Looking for repeated SHA-1 and DOI:

    zcat file_hashes.tsv.gz | cut -f 3 | sort -S 8G | uniq -cd | sort -n > repeated_sha1.tsv
    => none

    zcat release_extid.tsv.gz | cut -f 3 | sort -S 8G | uniq -cd | sort -n > repeated_doi.tsv
    => a few million repeated *blank* lines... could filter out?
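
    # if re-run, the blank values could be filtered out first, e.g. (sketch,
    # assuming the blanks are literally empty fields):
    zcat release_extid.tsv.gz | cut -f 3 | grep -v '^$' | sort -S 8G | uniq -cd | sort -n > repeated_doi.tsv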

## Load Stats / Progress

    export LC_ALL=C.UTF-8
    time zcat /srv/fatcat/snapshots/release_export_expanded.json.gz | pv -l | ./fatcat_export.py transform-releases - - | esbulk -verbose -size 20000 -id ident -w 8 -index fatcat_release -type release
    time zcat /srv/fatcat/snapshots/container_export.json.gz | pv -l | ./fatcat_export.py transform-containers - - | esbulk -verbose -size 20000 -id ident -w 8 -index fatcat_container -type container


    time zcat /srv/fatcat/snapshots/2019-01-30/container_export.json.gz | pv -l | ./fatcat_export.py transform-containers - - | esbulk -verbose -size 20000 -id ident -w 8 -index fatcat_container -type container

        real    0m58.528s
        user    1m0.396s
        sys     0m2.412s

    # very python-CPU-limited, so crank it up to -j20
    # hadn't used '--linebuffer' with parallel before; without it, parallel
    # buffers each job's output and only passes it to the next pipe program
    # when the job finishes
    time zcat /srv/fatcat/snapshots/2019-01-30/release_export_expanded.json.gz | pv -l | parallel -j20 --linebuffer --round-robin --pipe ./fatcat_export.py transform-releases - - | esbulk -verbose -size 20000 -id ident -w 8 -index fatcat_release -type release

        165k 0:00:10 [18.4k/s]

        2019/02/02 09:30:49 96812900 docs in 2h27m32.835681602s at 10935.807 docs/s with 8 workers
        2019/02/02 09:30:49 applied setting: {"index": {"refresh_interval": "1s"}} with status 200 OK
        2019/02/02 09:30:49 applied setting: {"index": {"number_of_replicas": "1"}} with status 200 OK
        2019/02/02 09:31:03 index flushed: 200 OK

        real    147m46.387s
        user    2621m40.420s
        sys     56m11.456s

    sudo su postgres
    dropdb fatcat_prod
    #zcat fatcat_private_dbdump_2019-02-02.022209.tar.gz | pg_restore --clean --if-exists --create --exit-on-error -d fatcat_prod
    createdb fatcat_prod
    time zcat fatcat_private_dbdump_2019-02-02.022209.tar.gz  | pg_restore --exit-on-error --clean --if-exists --dbname fatcat_prod

        seems to go pretty fast, so multiple jobs are probably not needed (and a
        parallel pg_restore -j wouldn't work here anyway: that needs the custom
        or directory archive format, not a tar dump piped in on stdin)

        real    284m40.448s
        user    58m45.240s
        sys     7m33.600s

DONE: delete old elastic index

## Bugs/Issues encountered

x in_ia_sim is broken; not passing through
x elastic port (9200) was not open to the cluster
    => but it should stay closed; access should go over HTTP
x elasticsearch host wrong (should be search.fatcat.wiki)
    => search.fatcat.wiki
x postgres config wasn't actually getting installed in the right place by
  ansible (!!!), which probably had crazy effects on performance, etc
x postgres version confusion was because both versions (server and client) can
  be installed in parallel, and the older version "wins"; wiping the VM would solve this.
x should try pigz for things like ident_table_snapshot and exports? these seem to be gzip-limited
- fatcat-export and pg_dump seem to mutually lock (transaction-wise), which is
  unexpected. fatcat-export should have very loose (low-priority) transaction
  scope, because it already has the full release_rev id, and pg_dump should
  also be in background/non-linear mode (except for "public" dumps?)
    => this was somewhat subtle; didn't completely lock
- this machine is postgres 10, not postgres 11. same with fatcat-prod1-vm.

Added to TODO:
- want a better "write lock" flag (on database) other than clearing auth key
- KBART CLOCKSS reports (and maybe LOCKSS?) have repeated lines, which need to be merged
- empty AUTH_ALT_KEYS should just be ignored (not try to parse)

## Metadata Quality Notes

- crossref references look great!
- extra/crossref/alternative-id often includes exact full DOI
        10.1158/1538-7445.AM10-3529
        10.1158/1538-7445.am10-3529
    => but not always? publisher-specific
- contribs[]/extra/seq often has "first" from crossref
    => is this helpful?
- abstracts content is fine, but should probably check for "jats:" when setting
  mimetype
x BUG: `license_slug` not being set (?) when the license URL is https://creativecommons.org/licenses/by-nc-sa/4.0
    => https://api.qa.fatcat.wiki/v0/release/55y37c3dtfcw3nw5owugwwhave
       10.26891/jik.v10i2.2016.92-97
- original title works, yay!
    https://api.qa.fatcat.wiki/v0/release/nlmnplhrgbdalcy472hfb2z3im
    10.2504/kds.26.358
- new license: https://www.karger.com/Services/SiteLicenses
- not copying ISBNs: 10.1016/b978-0-08-037302-7.50022-7
    "9780080373027"
    could at least put in alternative-id?
- BUG: subtitle coming through as an array, not string
- `license_slug` does get set
    eg for PLOS ONE http://creativecommons.org/licenses/by/4.0/
- page-one.live.cf.public.springer.com seems to serve up bogus one-pagers; should exclude
- BUG (?): file missing size:
    https://fatcat.wiki/file/wpvkiqx2w5celc3ajyfsh3cfsa
- webface BUG: file-to-release links missing
- webface meh: still need to collapse links by domain better, and also dedupe www.X vs. X variants

I think this is good (enough)!

Possible other KBART sources: HathiTrust, PKP preservation network (open, OJS), Scholars Portal (?), British Library

Nature mag KBART comes up empty (?)
    ISSN-L: 0028-0836
    https://fatcat.wiki/container/drfdii35rzaibj3aml5uhvr5xm

Missing DOIs (out of scope?):

    DOI not found: 10.1023/a:1009888907797
    DOI not found: 10.1186/1471-2148-4-49
    DOI not found: 10.1023/a:1026471016927
    DOI not found: 10.1090/s0002-9939-04-07569-0
    DOI not found: 10.1186/1742-4682-1-11
    DOI not found: 10.1186/1477-3163-2-5
    DOI not found: 10.1186/gb-2003-4-4-210
    DOI not found: 10.1186/gb-2004-5-9-r63
    DOI not found: 10.13188/2330-2178.1000008
    DOI not found: 10.4135/9781473960749
    DOI not found: 10.1252/kakoronbunshu1953.36.479
    DOI not found: 10.2320/materia.42.461
    DOI not found: 10.1186/1742-4933-3-3
    DOI not found: 10.14257/ijsh
    DOI not found: 10.1023/a:1016008714781
    DOI not found: 10.1023/a:1016648722322
    DOI not found: 10.1787/5k990rjhvtlv-en
    DOI not found: 10.4064/fm
    DOI not found: 10.1090/s0002-9947-98-01992-8
    DOI not found: 10.1186/1475-925x-2-16
    DOI not found: 10.1186/1479-5868-3-9
    DOI not found: 10.1090/s0002-9939-03-07205-8
    DOI not found: 10.1023/a:1008111923880
    DOI not found: 10.1090/s0002-9939-98-04322-6
    DOI not found: 10.1186/gb-2005-6-11-r93
    DOI not found: 10.5632/jila1925.2.236
    DOI not found: 10.1023/a:1011359428672
    DOI not found: 10.1090/s0002-9947-97-01844-8
    DOI not found: 10.1155/4817
    DOI not found: 10.1186/1472-6807-1-5
    DOI not found: 10.1002/(issn)1542-0981
    DOI not found: 10.1186/rr115