Changes since last time:
- removed "WHERE IS NOT NULL" from lookup indexes
- some cleanups
- JSONB
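
For my own reference, the first and last of those translate to roughly this kind of
DDL (a sketch only, using the file_rev sha1 lookup as an example; not the actual
migration):

    -- old lookup indexes were partial, e.g.:
    --   CREATE INDEX file_rev_sha1_idx ON file_rev(sha1) WHERE sha1 IS NOT NULL;
    -- now just plain indexes:
    CREATE INDEX file_rev_sha1_idx ON file_rev(sha1);

    -- and extra_json columns converted to JSONB (assuming they were plain JSON before):
    ALTER TABLE file_rev ALTER COLUMN extra_json TYPE jsonb USING extra_json::jsonb;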

## QA Timing

master branch b15fa552e288bec5bbc2b07a3e11bab9235a1e7c

    time ./fatcat_import.py import-issn /srv/datasets/journal_extra_metadata.csv # --batch-size 100
    real    0m30.249s
    user    0m9.364s
    sys     0m0.268s

    time parallel --bar --pipepart -j8 -a /srv/datasets/public_profiles_1_2_json.all.json ./fatcat_import.py import-orcid - # --batch-size 100
    => something went sort of wrong? seemed to hang, with no activity.
    => may have sent a signal or something accidentally?

    xzcat /srv/datasets/crossref-works.2018-01-21.json.xz | time parallel -j20 --round-robin --pipe ./fatcat_import.py import-crossref - /srv/datasets/20180216.ISSN-to-ISSN-L.txt --batch-size 100
    => ran for a few seconds before restarting with increased batch size
    => increased batch size seems to be too much for releases? stick with 50
    => still lots of long autovacuum (ANALYZE)

    67017.73user 2054.96system 25:59:38elapsed 73%CPU (0avgtext+0avgdata 424692maxresident)k
    2085872inputs+219678920outputs (100457major+30837042minor)pagefaults 0swaps
    # 25:59:38 elapsed
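
A quick way to keep an eye on those autovacuum workers (my usual check, not output
captured from this run):

    -- autovacuum workers show up in pg_stat_activity with an "autovacuum:" query prefix
    SELECT pid, now() - xact_start AS runtime, query
    FROM pg_stat_activity
    WHERE query LIKE 'autovacuum:%';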

    time ./fatcat_import.py import-manifest /srv/datasets/idents_files_urls.sqlite # --batch-size 100
    => taking longer to start than I remembered...

    Done! Inserted 6607075
    real    1521m37.856s
    user    305m38.444s
    sys     18m13.916s
    => 25.5 hours

Disk/table usage:

    Size:  233.95G

                              table_name                          | table_size | indexes_size | total_size 
    --------------------------------------------------------------+------------+--------------+------------
     "public"."release_ref"                                       | 92 GB      | 26 GB        | 118 GB
     "public"."release_rev"                                       | 30 GB      | 14 GB        | 45 GB
     "public"."release_contrib"                                   | 12 GB      | 13 GB        | 26 GB
     "public"."release_edit"                                      | 4710 MB    | 3353 MB      | 8062 MB
     "public"."work_edit"                                         | 4710 MB    | 3352 MB      | 8062 MB
     "public"."work_ident"                                        | 3454 MB    | 4153 MB      | 7608 MB
     "public"."release_ident"                                     | 3454 MB    | 4152 MB      | 7606 MB
     "public"."work_rev"                                          | 2241 MB    | 2076 MB      | 4317 MB
     "public"."file_rev_url"                                      | 1994 MB    | 778 MB       | 2772 MB
     "public"."file_rev"                                          | 827 MB     | 1215 MB      | 2042 MB
     "public"."file_ident"                                        | 501 MB     | 533 MB       | 1034 MB
     "public"."file_edit"                                         | 590 MB     | 411 MB       | 1001 MB
     "public"."creator_rev"                                       | 371 MB     | 455 MB       | 826 MB
     "public"."abstracts"                                         | 649 MB     | 52 MB        | 701 MB
     "public"."file_release"                                      | 251 MB     | 435 MB       | 686 MB
     "public"."creator_edit"                                      | 347 MB     | 242 MB       | 590 MB
     "public"."creator_ident"                                     | 255 MB     | 305 MB       | 560 MB
     "public"."release_rev_abstract"                              | 70 MB      | 90 MB        | 160 MB
     "public"."changelog"                                         | 52 MB      | 53 MB        | 106 MB
     "public"."editgroup"                                         | 60 MB      | 35 MB        | 94 MB
     "public"."container_rev"                                     | 40 MB      | 27 MB        | 67 MB
     "public"."container_edit"                                    | 24 MB      | 16 MB        | 39 MB
     "public"."container_ident"                                   | 17 MB      | 20 MB        | 37 MB

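This is the usual table/index size report; roughly the following query (reconstructed
from memory, not necessarily the exact one used):

    SELECT c.relname AS table_name,
           pg_size_pretty(pg_table_size(c.oid)) AS table_size,
           pg_size_pretty(pg_indexes_size(c.oid)) AS indexes_size,
           pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
    FROM pg_class c
    JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE n.nspname = 'public' AND c.relkind = 'r'
    ORDER BY pg_total_relation_size(c.oid) DESC;
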
I think this is from before the JSONB schema change, so it won't reflect the actual/final sizes.

Perf:

           relname        | too_much_seq | case |  rel_size   | seq_scan | idx_scan  
    ----------------------+--------------+------+-------------+----------+-----------
     file_rev_url         |          -18 | OK   |  2089836544 |        2 |        20
     file_release         |        -4001 | OK   |   263249920 |        2 |      4003
     container_edit       |      -132140 | OK   |    24608768 |        2 |    132142
     release_edit         |      -132140 | OK   |  4937408512 |        2 |    132142
     work_edit            |      -132140 | OK   |  4937351168 |        2 |    132142
     file_edit            |      -132140 | OK   |   618545152 |        2 |    132142
     creator_edit         |      -132140 | OK   |   363954176 |        2 |    132142
     changelog            |      -134456 | OK   |    54919168 |        5 |    134461
     container_rev        |     -1276670 | OK   |    41680896 |   592993 |   1869663
     abstracts            |     -1353093 | OK   |   669425664 |        3 |   1353096
     release_ref          |     -3429019 | OK   | 99223470080 |        3 |   3429022
     release_contrib      |     -3429026 | OK   | 13406437376 |        3 |   3429029
     release_rev_abstract |     -3435251 | OK   |    73359360 |        3 |   3435254
     creator_ident        |     -5360858 | OK   |   266977280 |       11 |   5360869
     creator_rev          |     -8071846 | OK   |   388734976 |   257781 |   8329627
     file_ident           |    -13478348 | OK   |   524746752 |        7 |  13478355
     file_rev             |    -30368193 | OK   |   867401728 |        4 |  30368197
     container_ident      |    -47140767 | OK   |    18219008 |        9 |  47140776
     release_ident        |    -61021070 | OK   |  3620864000 |       24 |  61021094
     work_rev             |   -106056502 | OK   |  2349146112 |        1 | 106056503
     work_ident           |   -106320780 | OK   |  3620954112 |        6 | 106320786
     editgroup            |   -117742455 | OK   |    62398464 |        2 | 117742457
     release_rev          |   -563415274 | OK   | 32455860224 |        8 | 563415282
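
This looks like the standard "missing index" check against the statistics views;
roughly (my reconstruction, not necessarily the exact query used):

    SELECT relname,
           seq_scan - idx_scan AS too_much_seq,
           CASE WHEN seq_scan - idx_scan > 0 THEN 'Missing Index?' ELSE 'OK' END AS "case",
           pg_relation_size(relid) AS rel_size,
           seq_scan,
           idx_scan
    FROM pg_stat_user_tables
    WHERE schemaname = 'public'
    ORDER BY too_much_seq DESC;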

Thought I had resolved those scans... hrm. Pretty sure (but not 100%) that I
cleared the counters.
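
If the counters do need clearing again, this is the reset (note it wipes all
statistics counters for the current database, not per-table):

    -- superuser only; resets seq_scan/idx_scan and friends for the whole database
    SELECT pg_stat_reset();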

Dumps!

    time psql fatcat_prod < ../extra/quick_dump.sql
    real    4m5.729s
    user    0m0.056s
    sys     0m0.016s
    => while other stuff running

    time ./ident_table_snapshot.sh
    => seems very dominated by tar/xz compression time; should switch back to gzip

    DATABASE_URL=$DATABASE_URL time ./ident_table_snapshot.sh /tmp
    614.80user 31.74system 13:23.51elapsed 80%CPU (0avgtext+0avgdata 10124maxresident)k
    8311456inputs+9761200outputs (2major+3792minor)pagefaults 0swaps
    => 13.5 minutes
    => fatcat_idents.2018-09-13.215147.r874951.tar.gz: 4.7GB
    => uncompressed, 8.3GB total

    cat ../extra/sql_dumps/fatcat_ident_releases.tsv | time ./target/release/fatcat-export release -j8 -q | pv -l | gzip > fatcat_release_dump.json.gz
    => cannot run with more than -j8, it seems (postgres client connection limit)
    => TPS: 12522 (!)
    => 101k 0:00:34 [2.99k/s]
    => estimate 6.5 hours
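
The -j8 ceiling is presumably the server-side connection limit; easy enough to
check (not captured in the original run):

    SHOW max_connections;
    SELECT count(*) AS connections FROM pg_stat_activity;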

    # NOTE AFTER THE FACT: had "contaner" typo in the below, so wasn't expanding containers
    cat ../extra/sql_dumps/fatcat_ident_releases.tsv | time ./target/release/fatcat-export release --expand files,contaner -j8 -q | pv -l | gzip > fatcat_release_dump_expanded.json.gz
    => TPS: 29605 (!)
    => 183k 0:00:35 [ 5.8k/s] (how is this faster than the above? other disk stuff finished?)
    => 341k 0:01:07 [3.48k/s] (ah, must have been in disk cache)
    => 4.12M 0:16:11 (averaging 4.2k/s; estimate 4 hours total)
    => fatcat_release_dump_expanded.json.gz: 3.4G with 6.76M lines, so expecting 35GB or so

    ----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
    usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw 
     14   3  73   8   0   2|  43M   21M| 143k  326B|   0     0 |  64k  113k
     18   4  72   5   0   2|  55M 4096k| 246k  326B|   0     0 |  84k  157k
     17   4  72   6   0   2|  53M 9964k| 428k  908B|   0     0 |  84k  155k
     14   3  73   8   0   1|  46M   20M| 143k   87k|   0     0 |  71k  132k
     18   3  72   5   0   2|  52M 8880k| 318k   95k|   0     0 |  82k  149k

    => fatcat-export using about 200% CPU in a single thread; other 8 threads at about 25% each
        => might be interesting to profile... too many copies? switch to tsv library?
    => seems like postgres only about 50% utilized

Benchmarks!

    # stub entity
    $ wrk -t8 -c100 -d30s http://localhost:9411/v0/work/aaaaaaaaaaaaavkvaaaaaaaaai
    Running 30s test @ http://localhost:9411/v0/work/aaaaaaaaaaaaavkvaaaaaaaaai
      8 threads and 100 connections
      Thread Stats   Avg      Stdev     Max   +/- Stdev
        Latency     7.44ms    5.26ms  67.53ms   60.07%
        Req/Sec     1.71k   113.55     2.07k    68.79%
      408986 requests in 30.01s, 117.01MB read
    Requests/sec:  13627.62
    Transfer/sec:      3.90MB
    => TPS: 39958
    => lots of "BEGIN" and "SELECT 1", implying new connections? hrm.
    => should run longer for more stable results
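
If connection churn is the suspicion, a breakdown of backends by user and state
would confirm it (my usual check; not captured at the time):

    SELECT usename, state, count(*)
    FROM pg_stat_activity
    GROUP BY usename, state
    ORDER BY count(*) DESC;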

    wrk -t12 -c400 -d2m --latency http://localhost:9411/v0/work/aaaaaaaaaaaaavkvaaaaaaaaai

    Running 2m test @ http://localhost:9411/v0/work/aaaaaaaaaaaaavkvaaaaaaaaai
      12 threads and 400 connections
      Thread Stats   Avg      Stdev     Max   +/- Stdev
        Latency    25.75ms   17.02ms 263.74ms   65.11%
        Req/Sec     1.05k   377.38     1.82k    74.81%
      Latency Distribution
         50%   25.40ms
         75%   26.56ms
         90%   49.57ms
         99%   76.47ms
      1128253 requests in 2.00m, 322.80MB read
      Socket errors: connect 0, read 10, write 0, timeout 0
    Requests/sec:   9396.03
    Transfer/sec:      2.69MB

    => single 470% fatcatd thread; diesel?
    => 50% systemd-journald

    # rich entity
    $ wrk -t8 -c100 -d30s https://api.qa.fatcat.wiki/v0/release/aaaaaaaaaaaaarceaaaaaaaaam?expand=files,container
    Running 30s test @ https://api.qa.fatcat.wiki/v0/release/aaaaaaaaaaaaarceaaaaaaaaam?expand=files,container
      8 threads and 100 connections
      Thread Stats   Avg      Stdev     Max   +/- Stdev
        Latency    33.35ms   14.01ms 152.05ms   79.98%
        Req/Sec   363.14     30.36   464.00     76.48%
      86719 requests in 30.04s, 234.46MB read
    Requests/sec:   2886.55
    Transfer/sec:      7.80MB

    wrk -t12 -c400 -d2m --latency https://api.qa.fatcat.wiki/v0/release/aaaaaaaaaaaaarceaaaaaaaaam?expand=files,container

    Running 2m test @ https://api.qa.fatcat.wiki/v0/release/aaaaaaaaaaaaarceaaaaaaaaam?expand=files,container
      12 threads and 400 connections
      Thread Stats   Avg      Stdev     Max   +/- Stdev
        Latency   149.19ms  125.30ms   1.93s    95.06%
        Req/Sec   237.89     34.60   650.00     73.54%
      Latency Distribution
         50%  129.86ms
         75%  134.14ms
         90%  210.42ms
         99%  904.26ms
      340625 requests in 2.00m, 0.90GB read
      Socket errors: connect 0, read 0, write 0, timeout 303
    Requests/sec:   2836.17
    Transfer/sec:      7.67MB

    => TPS: 9090
    ----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
     usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw 
      4   0  93   3   0   0|2401k   13M|   0     0 |2317B 4812B|4274  3824 
     22   6  68   0   0   3|4096B    0 |  57k  956B|4096B    0 | 135k  255k
     21   7  68   0   0   3|4096B    0 |  61k   66B|4096B    0 | 137k  259k
     21   7  68   0   0   3|  16k 8192B| 179k  636B|  16k    0 | 134k  252k
     21   7  68   0   0   3|   0     0 |  53k   66B|   0     0 | 135k  255k
     22   6  68   0   0   3|   0    17M|  51k  326B|   0     0 | 137k  261k
     22   6  68   0   0   3|   0     0 |  82k  903B|   0     0 | 136k  257k

    # static (404)
    wrk -t12 -c400 -d2m --latency http://localhost:9411/v0/asdf
    => full 2600+% CPU in a single fatcat process (?)
    => lots of load in wrk also
    => system as a whole basically at 100% CPU on all cores, though mostly kernel time

    ----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
    usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw 
      4   0  93   3   0   0|2399k   13M|   0     0 |2315B 4808B|4359  3976 
     26  66   2   0   0   6|   0     0 |4191k  956B|   0     0 | 465k  536k
     26  67   1   0   0   6|   0     0 | 511k   66B|   0     0 | 466k  543k
     27  64   3   0   0   6|   0     0 |2648k  318B|   0     0 | 481k  555k
     25  67   2   0   0   6|   0     0 |1909k  768B|   0     0 | 481k  550k

    Running 2m test @ http://localhost:9411/v0/asdf
      12 threads and 400 connections
      Thread Stats   Avg      Stdev     Max   +/- Stdev
        Latency     1.40ms    1.65ms 219.48ms   93.33%
        Req/Sec    17.71k     8.97k   65.69k    63.84%
      Latency Distribution
         50%    1.13ms
         75%    1.89ms
         90%    2.71ms
         99%    4.76ms
      21145472 requests in 2.00m, 1.61GB read
      Socket errors: connect 0, read 12, write 0, timeout 0
      Non-2xx or 3xx responses: 21145472
    Requests/sec: 176071.79
    Transfer/sec:     13.77MB

Noting pretty non-trivial traffic to 207.241.229.230 and
es-worker-02.us.archive.org (207.241.229.246)... probably because this is a
datanode?

## Misc

Postgres really wanted to log: <https://wiki.postgresql.org/wiki/Per-user_log_settings>

    ALTER DATABASE fatcat_prod SET log_statement = 'ddl';
    ALTER USER fatcat SET log_statement = 'ddl';
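
To verify what actually stuck (not part of the original session), the per-database
and per-role overrides are visible in pg_db_role_setting:

    SELECT d.datname, r.rolname, s.setconfig
    FROM pg_db_role_setting s
    LEFT JOIN pg_database d ON d.oid = s.setdatabase
    LEFT JOIN pg_roles r ON r.oid = s.setrole;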


## Later Imports

    zcat /srv/datasets/2018-08-27-2352.17-matchcrossref.insertable.json.gz | pv -l | time parallel -j12 --round-robin --pipe ./fatcat_import.py import-matched -
    => only ~40 TPS
    => looks like almost all selects... probably skipping? huge postgres CPU churn, huh. missing index?

           relname        | too_much_seq |      case      |  rel_size   | seq_scan | idx_scan 
    ----------------------+--------------+----------------+-------------+----------+----------
     file_rev             |     -5343173 | OK             |   867483648 |      693 |  5343866

    EXPLAIN SELECT "file_ident"."id", "file_ident"."is_live", "file_ident"."rev_id", "file_ident"."redirect_id", "file_rev"."id", "file_rev"."extra_json", "file_rev"."size", "file_rev"."sha1", "file_rev"."sha256", "file_rev"."md5", "file_rev"."mimetype" FROM ("file_ident" INNER JOIN "file_rev" ON "file_ident"."rev_id" = "file_rev"."id") WHERE "file_rev"."sha1" = '7d97e98f8af710c7e7fe703abc8f639e0ee507c4' AND "file_rev"."sha1" IS NOT NULL AND "file_ident"."is_live" = 't' AND "file_ident"."redirect_id" IS NULL LIMIT 1;

Ugh, this whole cycle again. The query itself only takes 1ms for a miss, or 5ms for a hit.

    http get http://localhost:9411/v0/file/lookup?sha1=d77cb2a8b207507e0df27ba1393e8118eec9217f => not found
    http get http://localhost:9411/v0/file/lookup?sha1=d6c3f6ca785ee63293da02615c8495c5cd25fa25 => found
        => but no releases?

    SET auto_explain.log_min_duration = 0;
    SET auto_explain.log_analyze = true;
    ALTER USER fatcat SET log_statement = 'all';
    ALTER USER fatcat SET auto_explain.log_min_duration = 0;
    ALTER USER fatcat SET auto_explain.log_analyze = true;
    sudo service fatcat-api restart
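
One caveat (my note, not from the original session): the auto_explain settings only
take effect once the module is actually loaded, e.g.:

    -- load per-session:
    LOAD 'auto_explain';
    -- or for all sessions, in postgresql.conf:
    --   session_preload_libraries = 'auto_explain'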


2018-09-16 00:42:29.883 UTC [13868] fatcat@fatcat_prod LOG:  duration: 3028.143 ms  plan:
        Query Text: SELECT "file_ident"."id", "file_ident"."is_live", "file_ident"."rev_id", "file_ident"."redirect_id", "file_rev"."id", "file_rev"."extra_json", "file_rev"."size", "file_rev"."sha1", "file_rev"."sha256", "file_rev"."md5", "file_rev"."mimetype" FROM ("file_ident" INNER JOIN "file_rev" ON "file_ident"."rev_id" = "file_rev"."id") WHERE "file_rev"."sha1" = $1 AND "file_rev"."sha1" IS NOT NULL AND "file_ident"."is_live" = $2 AND "file_ident"."redirect_id" IS NULL LIMIT $3
        Limit  (cost=0.43..8.56 rows=1 width=454) (actual time=3028.139..3028.139 rows=0 loops=1)
          ->  Nested Loop  (cost=0.43..268756.48 rows=33056 width=454) (actual time=3028.138..3028.138 rows=0 loops=1)
                ->  Seq Scan on file_rev  (cost=0.00..205023.16 rows=33042 width=405) (actual time=3028.137..3028.137 rows=0 loops=1)
                      Filter: ((sha1 IS NOT NULL) AND ((sha1)::text = 'd68d4e2cc420647109cdfe410efed86cd64465e5'::text))
                      Rows Removed by Filter: 6607786
                ->  Index Scan using file_ident_rev_idx on file_ident  (cost=0.43..1.92 rows=1 width=49) (never executed)
                      Index Cond: (rev_id = file_rev.id)
                      Filter: (is_live AND (redirect_id IS NULL))


fatcat_prod=# EXPLAIN SELECT "file_ident"."id", "file_ident"."is_live", "file_ident"."rev_id", "file_ident"."redirect_id", "file_rev"."id", "file_rev"."extra_json", "file_rev"."size", "file_rev"."sha1", "file_rev"."sha256", "file_rev"."md5", "file_rev"."mimetype" FROM ("file_ident" INNER JOIN "file_rev" ON "file_ident"."rev_id" = "file_rev"."id") WHERE "file_rev"."sha1" = '3f242a192acc258bdfdb15194341943222222222222440c313' AND "file_rev"."sha1" IS NOT NULL AND "file_ident"."is_live" = 't' AND "file_ident"."redirect_id" IS NULL LIMIT 1;
                                                        QUERY PLAN                                                        
--------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.99..5.04 rows=1 width=454)
   ->  Nested Loop  (cost=0.99..5.04 rows=1 width=454)
         ->  Index Scan using file_rev_sha1_idx on file_rev  (cost=0.56..2.58 rows=1 width=405)
               Index Cond: ((sha1 IS NOT NULL) AND (sha1 = '3f242a192acc258bdfdb15194341943222222222222440c313'::bpchar))
         ->  Index Scan using file_ident_rev_idx on file_ident  (cost=0.43..2.45 rows=1 width=49)
               Index Cond: (rev_id = file_rev.id)
               Filter: (is_live AND (redirect_id IS NULL))

After all this debugging, I asked on IRC and learned that I really shouldn't ever
use CHAR() (or even VARCHAR()): with a CHAR column, the text-typed parameter forces
the "(sha1)::text" cast seen in the bad plan above, which defeats the bpchar index,
while the literal query gets compared as bpchar and can use it. Will update the
schema.
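
The fix is presumably along these lines (a sketch only; the real migration would
also need to cover the other CHAR/VARCHAR columns across the schema):

    -- switch fixed-width CHAR columns to TEXT so the parameter comparison
    -- no longer casts the column and the lookup index can be used
    ALTER TABLE file_rev ALTER COLUMN sha1 TYPE text;
    ALTER TABLE file_rev ALTER COLUMN sha256 TYPE text;
    ALTER TABLE file_rev ALTER COLUMN md5 TYPE text;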

## Investigating Missing Rows

    fatcat_prod=# select count(*) from release_ident; => 53028167

Thought I had more than that previously? Table sizes were also bigger. Hrm.