path: root/notes/import_timing_20180920.txt

This was on fatcat-prod-vm (2TB disk).

    time ./fatcat_import.py import-issn /srv/fatcat/datasets/journal_extra_metadata.csv

    Processed 53300 lines, inserted 53283, updated 0.
    real    0m32.463s
    user    0m8.716s
    sys     0m0.284s

    time parallel --bar --pipepart -j8 -a /srv/fatcat/datasets/public_profiles_1_2_json.all.json ./fatcat_import.py import-orcid -

    Processed 48900 lines, inserted 48731, updated 0. <= per-block counts; multiply by 80 (blocks) for totals
    100% 80:0=0s

    real    10m20.598s
    user    26m16.544s
    sys     1m40.284s
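
A rough sketch of the implied ORCID totals and throughput. It assumes all 80 blocks reported counts close to the single block shown above, which the notes do not confirm, so the results are estimates only. All variable names are illustrative.

```python
# Estimate import-orcid totals from one block's counts.
# Assumption: every one of the 80 blocks processed a similar number of lines.
BLOCKS = 80
LINES_PER_BLOCK = 48900
INSERTED_PER_BLOCK = 48731
WALL_SECONDS = 10 * 60 + 20.598  # "real 10m20.598s"

total_lines = BLOCKS * LINES_PER_BLOCK
total_inserted = BLOCKS * INSERTED_PER_BLOCK
lines_per_second = total_lines / WALL_SECONDS

print(f"~{total_lines:,} lines processed, ~{total_inserted:,} inserted")
print(f"~{lines_per_second:,.0f} lines/second (wall clock)")
```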

    time xzcat /srv/fatcat/datasets/crossref-works.2018-01-21.json.xz | time parallel -j20 --round-robin --pipe ./fatcat_import.py import-crossref - /srv/fatcat/datasets/20180216.ISSN-to-ISSN-L.txt /srv/fatcat/datasets/release_ids.ia_munge_20180908.sqlite3

    Processed 4679900 lines, inserted 3755867, updated 0.
    107730.08user 4110.22system 16:31:25elapsed 188%CPU (0avgtext+0avgdata 447496maxresident)k
    77644160inputs+361948352outputs (105major+49094767minor)pagefaults 0swaps

    => 16.5 hours, faster!
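
The 16.5 hour figure and the 188%CPU value can both be re-derived from the GNU time line above. A minimal sketch; the helper name `elapsed_to_seconds` is made up for illustration.

```python
def elapsed_to_seconds(elapsed: str) -> float:
    """Convert a GNU time elapsed field ([h:]m:s) to seconds."""
    seconds = 0.0
    for part in elapsed.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


# Values copied from the time output of the crossref import.
user_s = 107730.08
sys_s = 4110.22
wall_s = elapsed_to_seconds("16:31:25")

hours = wall_s / 3600
# GNU time reports %CPU as (user + system) / elapsed.
cpu_percent = 100 * (user_s + sys_s) / wall_s

print(f"{hours:.1f} hours wall clock, {cpu_percent:.0f}% CPU")
```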

    select count(id) from release_ident; => 75106713

    Kernel/system crashed after the first file import (!), so there are no timing numbers for that step.

Table sizes at this point:

    select count(id) from file_ident; => 6334606

    Size:  389.25G

                          table_name                          | table_size | indexes_size | total_size 
--------------------------------------------------------------+------------+--------------+------------
 "public"."release_ref"                                       | 170 GB     | 47 GB        | 217 GB
 "public"."release_rev"                                       | 44 GB      | 21 GB        | 65 GB
 "public"."release_contrib"                                   | 19 GB      | 20 GB        | 39 GB
 "public"."release_edit"                                      | 6671 MB    | 6505 MB      | 13 GB
 "public"."work_edit"                                         | 6671 MB    | 6505 MB      | 13 GB
 "public"."release_ident"                                     | 4892 MB    | 5875 MB      | 11 GB
 "public"."work_ident"                                        | 4892 MB    | 5874 MB      | 11 GB
 "public"."work_rev"                                          | 3174 MB    | 2936 MB      | 6109 MB
 "public"."file_rev_url"                                      | 3634 MB    | 1456 MB      | 5090 MB
 "public"."file_rev"                                          | 792 MB     | 1281 MB      | 2073 MB
 "public"."abstracts"                                         | 1665 MB    | 135 MB       | 1800 MB
 "public"."file_edit"                                         | 565 MB     | 561 MB       | 1126 MB
 "public"."file_release"                                      | 380 MB     | 666 MB       | 1045 MB
 "public"."file_ident"                                        | 415 MB     | 496 MB       | 911 MB
 "public"."creator_rev"                                       | 371 MB     | 457 MB       | 828 MB
 "public"."creator_edit"                                      | 347 MB     | 353 MB       | 700 MB
 "public"."creator_ident"                                     | 255 MB     | 305 MB       | 559 MB
 "public"."release_rev_abstract"                              | 183 MB     | 237 MB       | 421 MB
 "public"."changelog"                                         | 122 MB     | 126 MB       | 247 MB
 "public"."editgroup"                                         | 138 MB     | 81 MB        | 219 MB
 "public"."container_rev"                                     | 52 MB      | 38 MB        | 89 MB
 "public"."container_edit"                                    | 32 MB      | 30 MB        | 62 MB
 "public"."container_ident"                                   | 24 MB      | 28 MB        | 52 MB
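
As a sanity check, the total_size column can be summed and compared with the reported 389.25G database size. A small sketch, assuming the sizes are pg_size_pretty-style strings with 1024-based units. Because the inputs are rounded, the sum only matches approximately.

```python
# total_size column copied from the table above, in table order.
TOTAL_SIZES = [
    "217 GB", "65 GB", "39 GB", "13 GB", "13 GB", "11 GB", "11 GB",
    "6109 MB", "5090 MB", "2073 MB", "1800 MB", "1126 MB", "1045 MB",
    "911 MB", "828 MB", "700 MB", "559 MB", "421 MB", "247 MB",
    "219 MB", "89 MB", "62 MB", "52 MB",
]

UNIT_TO_GB = {"GB": 1.0, "MB": 1.0 / 1024}


def to_gb(pretty: str) -> float:
    """Convert a string such as '6109 MB' to gigabytes (1024-based)."""
    value, unit = pretty.split()
    return float(value) * UNIT_TO_GB[unit]


total_gb = sum(to_gb(size) for size in TOTAL_SIZES)
release_ref_share = to_gb(TOTAL_SIZES[0]) / total_gb

print(f"sum of tables: ~{total_gb:.1f} GB")
print(f"release_ref share: ~{release_ref_share:.0%}")
```

The release_ref table alone accounts for more than half of the database.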


       relname        | too_much_seq | case |   rel_size   | seq_scan | idx_scan 
----------------------+--------------+------+--------------+----------+----------
 release_edit         |            0 | OK   |   6993084416 |        0 |        0
 container_rev        |            0 | OK   |     54124544 |        0 |        0
 creator_ident        |            0 | OK   |    266919936 |        0 |        0
 creator_edit         |            0 | OK   |    363921408 |        0 |        0
 work_rev             |            0 | OK   |   3327262720 |        0 |        0
 creator_rev          |            0 | OK   |    388726784 |        0 |        0
 work_ident           |            0 | OK   |   5128560640 |        0 |        0
 work_edit            |            0 | OK   |   6993108992 |        0 |        0
 container_ident      |            0 | OK   |     25092096 |        0 |        0
 file_edit            |            0 | OK   |    598278144 |        0 |        0
 container_edit       |            0 | OK   |     33857536 |        0 |        0
 changelog            |            0 | OK   |    127549440 |        0 |        0
 abstracts            |        -4714 | OK   |   1706713088 |        0 |     4714
 file_release         |       -13583 | OK   |    401752064 |        0 |    13583
 file_rev_url         |       -13583 | OK   |   3832389632 |        0 |    13583
 editgroup            |       -74109 | OK   |    144277504 |        0 |    74109
 release_contrib      |       -76699 | OK   |  20357849088 |        0 |    76699
 release_ref          |       -76700 | OK   | 183009157120 |        3 |    76703
 release_rev_abstract |       -76939 | OK   |    192102400 |        0 |    76939
 release_rev          |       -77965 | OK   |  47602647040 |        0 |    77965
 file_ident           |      -100089 | OK   |    438255616 |        3 |   100092
 release_ident        |      -152809 | OK   |   5128617984 |        0 |   152809
 file_rev             |      -440780 | OK   |    837705728 |        0 |   440780
(23 rows)
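
In the report above, too_much_seq matches seq_scan minus idx_scan for every row. The sketch below reproduces that arithmetic. It assumes the "case" column flags a table only when sequential scans outnumber index scans; the query itself is not in these notes, so the non-OK label here is a guess.

```python
def scan_report(seq_scan: int, idx_scan: int) -> tuple[int, str]:
    """Return (too_much_seq, verdict) for one table's scan counters."""
    too_much_seq = seq_scan - idx_scan
    # Assumption: a positive value means seq scans dominate; label is hypothetical.
    verdict = "Missing Index?" if too_much_seq > 0 else "OK"
    return too_much_seq, verdict


# (seq_scan, idx_scan) pairs copied from the report above.
rows = {
    "release_ref": (3, 76703),
    "file_ident": (3, 100092),
    "file_rev": (0, 440780),
}

for relname, (seq_scan, idx_scan) in rows.items():
    print(relname, *scan_report(seq_scan, idx_scan))
```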

Continuing imports:

    zcat /srv/fatcat/datasets/2018-08-27-2352.17-matchcrossref.insertable.json.gz | pv -l | time parallel -j12 --round-robin --pipe ./fatcat_import.py import-matched -
    => HTTP response body: {"message":"duplicate key value violates unique constraint \"file_edit_editgroup_id_ident_id_key\""}