Run a partial ~5 million paper batch through:

    zcat /srv/fatcat_scholar/release_export.2019-07-07.5mil_fulltext.json.gz \
        | parallel -j8 --line-buffer --round-robin --pipe python -m fatcat_scholar.work_pipeline run_releases \
        | pv -l \
        | gzip > data/work_intermediate.5mil.json.gz
    => 5M 21:36:14 [64.3 /s]

    # runs about 70 works/sec with this parallelism => 1mil in 4hr, 5mil in 20hr
    # looks like seaweedfs is bottleneck?
    # tried stopping persist workers on seaweedfs and basically no change

    indexing to ES seems to take roughly an hour per million documents; can check
    index monitoring to get a better number

## 2020-07-23 First Full Release Batch

Patched to skip fetching `pdftext`

Run full batch through (on aitio), expecting this to take on the order of a
week:

    zcat /fast/download/release_export_expanded.json.gz \
        | parallel -j8 --line-buffer --compress --round-robin --pipe python -m fatcat_scholar.work_pipeline run_releases \
        | pv -l \
        | gzip > /grande/snapshots/fatcat_scholar_work_fulltext.20200723.json.gz

Ah, this was running really slow because `MINIO_SECRET_KEY` was not set. Really
should replace `minio` python client library as we are now using seaweedfs!

Got an error:

    36.1M 15:29:38 [ 664 /s]
    parallel: Error: Output is incomplete. Cannot append to buffer file in /fast/tmp. Is the disk full?
    parallel: Error: Change $TMPDIR with --tmpdir or use --compress.
    Warning: unable to close filehandle properly: No space left on device during global destruction.

Might have been due to `/` filling up (not `/fast/tmp`)? Had gotten pretty far
into processing. Restarted, will keep an eye on it.

To index, run from ES machine, as bnewbold:

    ssh aitio.us.archive.org cat /grande/snapshots/fatcat_scholar_work_fulltext.partial.20200723.json.gz \
    | gunzip \
    | sudo -u fatcat parallel -j8 --linebuffer --round-robin --pipe pipenv run python -m fatcat_scholar.transform run_transform \
    | esbulk -verbose -size 100 -id key -w 4 -index qa_scholar_fulltext_v01 -type _doc

Hrm, again:

    99.9M 56:04:41 [ 308 /s]
    parallel: Error: Output is incomplete. Cannot append to buffer file in /fast/tmp. Is the disk full?
    parallel: Error: Change $TMPDIR with --tmpdir or use --compress.

Confirmed that the disk was full in that moment; frustrating, as I had checked in
earlier and disk usage was low enough, and data was flowing to /grande (large
spinning disk). Should be sufficient to move the release dump to `/bigger` and
clear more space on `/fast` to do the full indexing.

    /dev/sdg1       917G  871G     0 100% /fast

    vs.

    /dev/sdg1       917G  442G  430G  51% /fast

    -rw-rw-r-- 1 bnewbold bnewbold  418G Jul 27 05:55 fatcat_scholar_work_fulltext.20200723.json.gz

Got to about 2/3 of full release dump. Current rough estimates for total
processing times:

    enrich 150 million releases: 80hr (3-4 days), 650 GByte on disk (gzip)
    transform and index 150 million releases: 55hr (2-3 days), 1.5 TByte on disk (?)

Failed again, due to null `release.extra` field.

    # 14919639 + 83111800 = 98031439
    ssh aitio.us.archive.org cat /grande/snapshots/fatcat_scholar_work_fulltext.20200723.json.gz \
    | gunzip \
    | tail -n +98031439 \
    | sudo -u fatcat parallel -j8 --linebuffer --round-robin --pipe pipenv run python -m fatcat_scholar.transform run_transform \
    | esbulk -verbose -size 100 -id key -w 4 -index qa_scholar_fulltext_v01 -type _doc

SIM remote indexing command:

    # size before (approx): 743.4 GByte, 98031407 docs; 546G disk free
    ssh aitio.us.archive.org cat /bigger/scholar_old/sim_intermediate.2020-07-23.json.gz \
    | gunzip \
    | sudo -u fatcat parallel -j8 --linebuffer --round-robin --pipe pipenv run python -m fatcat_scholar.transform run_transform \
    | esbulk -verbose -size 100 -id key -w 4 -index qa_scholar_fulltext_v01 -type _doc
    => 1967593 docs in 2h8m32.549646403s at 255.116 docs/s with 4 workers
    # size after: 753.8gb 99926090 docs, 533G disk free

Trying dump again on AITIO, with alternative tmpdir:

    git log | head -n1
    commit 2f0874c84e71a02a10e21b03688593a4aa5ef426

    df -h /sandcrawler-db/
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/sdf1       1.8T  684G  1.1T  40% /sandcrawler-db

    export TMPDIR=/sandcrawler-db/tmp
    zcat /fast/download/release_export_expanded.json.gz \
        | parallel -j8 --line-buffer --compress --round-robin --pipe python -m fatcat_scholar.work_pipeline run_releases \
        | pv -l \
        | gzip > /grande/snapshots/fatcat_scholar_work_fulltext.20200723_two.json.gz

## ES Performance Iteration (2020-07-27)

- schema: switch abstracts from nested to simple array
- query: include fewer fields: just biblio (with boost; and maybe title) and "everything"
- query: use date-level granularity for time queries (may already do this?)
- set replica=0 (for now)
- set shards=12, to optimize *individual query* performance
    => if estimating 800 GByte index size, this is 60-70 GByte per shard
- set `index.codec=best_compression` to leverage CPU vs. disk I/O (see the settings sketch after this list)
- ensure transform output is sorted by key
    => <https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-disk-usage.html#_put_fields_in_the_same_order_in_documents>
- ensure number of cores is large
- return fewer results (15 vs. 25)
    => less highlighting
    => fewer thumbnails to fetch
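
A minimal sketch of the index-level settings implied above (host and index name
here are placeholders; in practice these settings would live alongside the
mappings in `schema/scholar_fulltext.v01.json` and get applied with `http put`
as elsewhere in these notes):

    import requests

    # placeholder host/index; shard count, replicas, and codec match the list above
    settings = {
        "settings": {
            "index": {
                "number_of_shards": 12,       # ~60-70 GByte per shard at ~800 GByte total
                "number_of_replicas": 0,      # replica=0 for now
                "codec": "best_compression",  # trade CPU for disk I/O
            }
        }
    }
    resp = requests.put("http://localhost:9200/qa_scholar_settings_test", json=settings)
    print(resp.status_code, resp.json())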

## Work Grouping

Plan for work-grouped expanded release dumps:

Have release identifier dump script include, and sort by, `work_id`. This will
definitely slow down that stage, unclear if too much. `work_id` is indexed.

The bulk dump script iterates and makes per-work batches of releases to dump,
passing each Vec to worker threads. Worker threads pass back a Vec of entities,
which are then all printed (same work) sequentially.
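
A rough Python sketch of that grouping step, assuming the stream of release
JSON lines is already sorted by `work_id` (illustration only, not the actual
dump code):

    import itertools
    import json
    import sys

    def work_batches(lines):
        """Yield (work_id, [releases]) batches from a stream of release JSON
        lines already sorted by work_id."""
        releases = (json.loads(line) for line in lines)
        for work_id, batch in itertools.groupby(releases, key=lambda r: r["work_id"]):
            yield work_id, list(batch)

    if __name__ == "__main__":
        for work_id, batch in work_batches(sys.stdin):
            # a worker would expand and dump the whole batch, so all releases
            # of the same work end up adjacent in the output
            print(work_id, len(batch))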

## ES Performance Profiling (2020-08-05)

Index size:

    green open scholar_fulltext_v01            uthJZJvSS-mlLIhZxrlVnA 12 0 102039078 578722 748.9gb 748.9gb

Unless otherwise mentioned, these are with default filters in place.

Baseline:

    {"query": {"bool": {"filter": [{"terms": {"type": ["article-journal", "paper-conference", "chapter"]}}, {"terms": {"access_type": ["wayback", "ia_file", "ia_sim"]}}], "must": [{"boosting": {"positive": {"bool": {"must": [{"query_string": {"query": "coffee", "default_operator": "AND", "analyze_wildcard": true, "allow_leading_wildcard": false, "lenient": true, "quote_field_suffix": ".exact", "fields": ["title^5", "biblio_all^3", "abstracts.body^2", "fulltext.body", "everything"]}}], "should": [{"terms": {"access_type": ["ia_sim", "ia_file", "wayback"]}}]}}, "negative": {"bool": {"should": [{"bool": {"must_not": [{"exists": {"field": "title"}}]}}, {"bool": {"must_not": [{"exists": {"field": "year"}}]}}, {"bool": {"must_not": [{"exists": {"field": "type"}}]}}, {"bool": {"must_not": [{"exists": {"field": "stage"}}]}}, {"bool": {"must_not": [{"exists": {"field": "biblio.container_ident"}}]}}]}}, "negative_boost": 0.5}}]}}, "collapse": {"field": "collapse_key", "inner_hits": {"name": "more_pages", "size": 0}}, "from": 0, "size": 15, "highlight": {"fields": {"abstracts.body": {"number_of_fragments": 2, "fragment_size": 300}, "fulltext.body": {"number_of_fragments": 2, "fragment_size": 300}, "fulltext.acknowledgment": {"number_of_fragments": 2, "fragment_size": 300}, "fulltext.annex": {"number_of_fragments": 2, "fragment_size": 300}}}}


    jenny durkin
    => 60 Hits in 1.3sec

    "looking at you kid"
    => 83 Hits in 6.6sec

    LIGO black hole
    => 2,440 Hits in 1.6sec

    "configuration that formed when the core of a rapidly rotating massive star collapsed"
    => 1 Hits in 8.0sec
    => requery: in 0.3sec

Disable everything, query only `biblio_all`:

    {"query": {"query_string": {"query": "coffee", "default_operator": "AND", "analyze_wildcard": true, "allow_leading_wildcard": false, "lenient": true, "quote_field_suffix": ".exact", "fields": ["biblio_all^3"]}}, "from": 0, "size": 15}

    newbold
    => 2,930 Hits in 0.12sec

    guardian galaxy
    => 15 Hits in 0.19sec

    *
    => 102,039,078 Hits in 0.86sec (same on repeat)

Query only `everything`:

    guardian galaxy
    => 1,456 Hits in 0.26sec

    avocado mexico
    => 3,407 Hits in 0.3sec, repeat in 0.017sec

    *
    => 102,039,078 Hits in 0.9sec (same on repeat)


Query all the fields with boosting:

    {"query": {"query_string": {"query": "coffee", "default_operator": "AND", "analyze_wildcard": true, "allow_leading_wildcard": false, "lenient": true, "quote_field_suffix": ".exact", "fields": ["title^5", "biblio_all^3", "abstracts.body^2", "fulltext.body", "everything"]}}, "from": 0, "size": 15}

    berlin population
    => 168,690 Hits in 0.93sec, repeat in 0.11sec

    internet archive
    => 115,139 Hits in 1.1sec

    *
    => 102,039,078 Hits in 4.1sec (same on repeat)

Query only "everything", add highlighting (default config):

    indiana human
    => 86,556 Hits in 0.34sec repeat in 0.04sec
    => scholar-qa: 86,358 Hits in 2.4sec, repeat in 0.47sec

    wikipedia
    => 73,806 Hits in 0.13sec

Query only "everything", no highlighting, basic filters:

    {"query": {"bool": {"filter": [{"terms": {"type": ["article-journal", "paper-conference", "chapter"]}}, {"terms": {"access_type": ["wayback", "ia_file", "ia_sim"]}}], "must": [{"query_string": {"query": "reddit", "default_operator": "AND", "analyze_wildcard": true, "allow_leading_wildcard": false, "lenient": true, "quote_field_suffix": ".exact", "fields": ["everything"]}}]}}, "from": 0, "size": 15}

    reddit
    => 5,608 Hits in 0.12sec

    "participate in this collaborative editorial process"
    => 1 Hits in 7.9sec, repeat in 0.4sec
    scholar-qa: timeout (>10sec)

    "one if by land, two if by sea"
    => 20 Hits in 4.5sec

Query only "title", no highlighting, basic filters:

    "discontinuities and noise due to crosstalk"
    => 0 Hits in 0.24sec
    scholar-qa: 1 Hits in 4.7sec

Query only "everything", no highlighting, collapse key:

    greed
    => 35,941 Hits in 0.47sec

    bjog child
    => 6,616 Hits in 0.4sec 

Query only "everything", no highlighting, collapse key, boosting:

    blue
    => 2,407,966 Hits in 3.1sec
    scholar-qa: 2,407,967 Hits in 1.6sec

    distal fin tuna
    => 390 Hits in 0.61sec

    "greater speed made possible by the warm muscle"
    => 1 Hits in 1.2sec

Query "everything", highlight "everything", collapse key, boosting (default but
only "everything" match):

    NOTE: highlighting didn't work

    green
    => 2,742,004 Hits in 3.1sec, repeat in 2.8sec

    "comprehensive framework for the influences"
    => 1 Hits in 3.1sec

    bivalve extinct
    => 6,631 Hits in 0.47sec

    redwood "big basin"
    => 69 Hits in 0.5sec

Default, except only search+highlight "fulltext.body":

    {"query": {"bool": {"filter": [{"terms": {"type": ["article-journal", "paper-conference", "chapter"]}}, {"terms": {"access_type": ["wayback", "ia_file", "ia_sim"]}}], "must": [{"boosting": {"positive": {"bool": {"must": [{"query_string": {"query": "coffee", "default_operator": "AND", "analyze_wildcard": true, "allow_leading_wildcard": false, "lenient": true, "quote_field_suffix": ".exact", "fields": ["fulltext.body"]}}], "should": [{"terms": {"access_type": ["ia_sim", "ia_file", "wayback"]}}]}}, "negative": {"bool": {"should": [{"bool": {"must_not": [{"exists": {"field": "title"}}]}}, {"bool": {"must_not": [{"exists": {"field": "year"}}]}}, {"bool": {"must_not": [{"exists": {"field": "type"}}]}}, {"bool": {"must_not": [{"exists": {"field": "stage"}}]}}, {"bool": {"must_not": [{"exists": {"field": "biblio.container_ident"}}]}}]}}, "negative_boost": 0.5}}]}}, "collapse": {"field": "collapse_key", "inner_hits": {"name": "more_pages", "size": 0}}, "from": 0, "size": 15, "highlight": {"fields": {"fulltext.body": {"number_of_fragments": 2, "fragment_size": 300}}}}

    radioactive fish eye yellow
    => 1,401 Hits in 0.84sec

    "Ground color yellowish pale, snout and mouth pale gray"
    => 1 Hits in 1.1sec

Back to baseline:

    "palace of the fine arts"
    => 26 Hits in 7.4sec

    john
    => 1,812,894 Hits in 3.1sec

Everything disabled, but fulltext query all the default fields:

    {"query": {"query_string": {"query": "john", "default_operator": "AND", "analyze_wildcard": true, "allow_leading_wildcard": false, "lenient": true, "quote_field_suffix": ".exact", "fields": ["title^5", "biblio_all^3", "abstracts.body^2", "fulltext.body", "everything"]}}, "from": 0, "size": 15}

    jane
    => 318,757 Hits in 0.29sec

    distress dolphin plant echo
    => 355 Hits in 1.5sec

    "Michael Longley's most recent collection of poems"
    => 1 Hits in 1.2sec

    aqua
    => 95,628 Hits in 0.27sec

Defaults, but query only "biblio_all":

    "global warming"
    => 2,712 Hits in 0.29sec

    pink
    => 1,805 Hits in 0.24sec

    *
    => 20,426,310 Hits in 7.5sec

    review
    => 795,060 Hits in 1.5sec

    "to be or not"
    => 319 Hits in 0.81sec

Simple filters, `biblio_all`, boosting disabled:

    {"query": {"bool": {"filter": [{"terms": {"type": ["article-journal", "paper-conference", "chapter"]}}, {"terms": {"access_type": ["wayback", "ia_file", "ia_sim"]}}], "must": [{"query_string": {"query": "coffee", "default_operator": "AND", "analyze_wildcard": true, "allow_leading_wildcard": false, "lenient": true, "quote_field_suffix": ".exact", "fields": ["biblio_all^3"]}}]}}, "collapse": {"field": "collapse_key", "inner_hits": {"name": "more_pages", "size": 0}}, "from": 0, "size": 15}

    open
    => 155,337 Hits in 0.31sec

    all
    => 40,880 Hits in 0.24sec

    the
    => 7,369,084 Hits in 0.75sec

Boosting disabled, query only `biblio_all`:

    "triangulations among all simple spherical ones can be seen to be"
    => 0 Hits in 0.6sec, again in 0.028sec

    "di Terminal Agribisnis (Holding Ground) Rancamaya Bogor"
    => 1 Hits in 0.21sec

    "to be or not"
    => 319 Hits in 0.042sec

Same as above, add boosting back in:

    {"query": {"bool": {"filter": [{"terms": {"type": ["article-journal", "paper-conference", "chapter"]}}, {"terms": {"access_type": ["wayback", "ia_file", "ia_sim"]}}], "must": [{"query_string": {"query": "the", "default_operator": "AND", "analyze_wildcard": true, "allow_leading_wildcard": false, "lenient": true, "quote_field_suffix": ".exact", "fields": ["biblio_all^3"]}}, {"boosting": {"positive": {"bool": {"must": [{"query_string": {"query": "the", "default_operator": "AND", "analyze_wildcard": true, "allow_leading_wildcard": false, "lenient": true, "quote_field_suffix": ".exact", "fields": ["biblio_all^3"]}}], "should": [{"terms": {"access_type": ["ia_sim", "ia_file", "wayback"]}}]}}, "negative": {"bool": {"should": [{"bool": {"must_not": [{"exists": {"field": "title"}}]}}, {"bool": {"must_not": [{"exists": {"field": "year"}}]}}, {"bool": {"must_not": [{"exists": {"field": "type"}}]}}, {"bool": {"must_not": [{"exists": {"field": "stage"}}]}}, {"bool": {"must_not": [{"exists": {"field": "biblio.container_ident"}}]}}]}}, "negative_boost": 0.5}}]}}, "collapse": {"field": "collapse_key", "inner_hits": {"name": "more_pages", "size": 0}}, "from": 0, "size": 15}

    the
    => 7,369,084 Hits in 5.3sec, repeat in 5.1sec

Removing `poor_metadata` fields:

    tree
    => 1,521,663 Hits in 2.3sec, again in 2.2sec

    all but one removed...
    tree
    => 1,521,663 Hits in 1.0sec, again in 0.84sec

    3/5 negative...
    tree
    => 1,521,663 Hits in 3.5sec

    no boosting...
    tree
    => 1,521,663 Hits in 0.2sec

Testing "rescore" (with collapse disabled; `window_size`=50):


    # elasticsearch_dsl snippet: `Q` is elasticsearch_dsl.Q; basic_fulltext,
    # has_fulltext, and poor_metadata are query objects built earlier in the module
    search = search.query(basic_fulltext)
    search = search.extra(
        rescore={
            'window_size': 100,
            "query": {
                "rescore_query": Q(
                    "boosting",
                    positive=Q("bool", must=basic_fulltext, should=[has_fulltext],),
                    negative=poor_metadata,
                    negative_boost=0.5,
                ).to_dict(),
            },
        }
    )

    green; access:everything (rescoring)
    => 331,653 Hits in 0.05sec, again in 0.053sec

    *; access:everything (rescoring)
    => 93,043,404 Hits in 1.2sec, again in 1.2sec

    green; access:everything (rescoring)
    => 331,653 Hits in 0.041sec, again in 0.038sec

    *; access:everything (no boost)
    => 93,043,404 Hits in 1.1sec, again in 1.2sec

    green; access:everything (boost query)
    => 331,653 Hits in 0.96sec, again in 0.95sec

    *; access:everything (boost query)
    => 93,043,404 Hits in 13sec


Other notes:

    counting all records, default filters ("*")
        scholar-qa: 20,426,296 Hits in 7.4sec
        svc097: 20,426,310 Hits in 8.6sec

    "to be or not to be" hamlet
        scholar-qa: timeout, then 768 Hits in 0.73sec
        svc097: 768 Hits in 2.5sec, then 0.86 sec

    "to be or not to be"
        svc98: 16sec

Speculative notes:

querying more fields definitely seems heavy. should try `require_field_match`
with the highlighter, to allow query and highlight fields to be separate? or
perhaps even a separate highlighter query: query "everything", highlight
specific fields (see the sketch below).
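
A minimal sketch of separating query fields from highlight fields with
`elasticsearch_dsl` (nothing here is wired to a live client;
`require_field_match=False` lets the highlighter run on fields that were not
part of the query):

    from elasticsearch_dsl import Q, Search

    search = Search(index="scholar_fulltext_v01")
    search = search.query(
        Q("query_string", query='"to be or not"', default_operator="AND", fields=["everything"])
    )
    # highlight specific fields even though only "everything" was queried
    search = search.highlight_options(require_field_match=False)
    search = search.highlight(
        "abstracts.body", "fulltext.body", number_of_fragments=2, fragment_size=300
    )
    print(search.to_dict())  # inspect the generated query body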

scoring/boosting large responses (more than a few hundred thousand hits) seems
expensive. this includes the trivial '*' query.

some fulltext phrase queries seem to always be expensive. look into phrase
indexing, eg term n-grams? looks like the simple `index_phrases` mapping
parameter is sufficient for the basic case.
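
A hypothetical mapping fragment for that parameter (field details are an
assumption; `index_phrases` can only be set at index creation, so enabling it
means a reindex):

    # index 2-term shingles alongside the field; speeds up phrase queries at
    # the cost of a larger index
    mapping_fragment = {
        "mappings": {
            "properties": {
                "everything": {"type": "text", "index_phrases": True},
            }
        }
    }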

not a performance thing, but should revisit schema and field storage to reduce
size. eg, are we storing "exact" separately from stemming? does that increase
size? are fulltext.body and everything redundant?


TL;DR:
- scoring a large result set (with boost) is slow (eg, "*"), but not bad for smaller result sets
    => confirmed this makes a difference, but can't do collapse at the same time
- phrase queries are expensive, especially against fulltext
- querying/matching multiple fields is also proportionately expensive

TODO:
x index tweaks: smaller number types (eg, for year)
    https://www.elastic.co/guide/en/elasticsearch/reference/current/number.html
    volume, issue, pages, contrib counts
x also sort and remove null keys when sending docs to ES
    => already done
x experiment with rescore for things like `has_fulltext` and metadata quality boost. with a large window?
x query on fewer fields and separate highlight fields from query fields (title, `biblio_all`, everything)
x consider not having `biblio_all.exact`
x enable `index_phrases` on at least `everything`, then reindex
    => start with ~1mil test batch
x consider not storing `everything` on disk at all, and maybe not `biblio_all` either (only use these for querying). some way to not make fulltext.body queryable?
- PROBLEM: can't do `collapse` and `rescore` together
    => try only a boolean query instead of boosting
        => at least superficially, no large difference
    x special case "*" query and do no scoring, maybe even sort by `_doc` (sketch after this list)
        => huge difference for this specific query
    => could query twice: once with regular scoring + collapse, but "halt
       after" a short number of hits to reduce rescoring (?), and a second time
       with no responses to get the total count
    => could manually rescore in client code, just from the returned hits?
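
A minimal sketch of that special-cased "*" query: no boosting, no scoring,
sorted by `_doc` (index order), which is the cheapest ordering ES offers:

    # assumption: this body would replace the normal query only when the user
    # query is exactly "*"
    match_all_body = {
        "query": {"match_all": {}},
        "sort": ["_doc"],
        "size": 15,
    }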

future questions:
- consider deserializing hit _source documents to pydantic objects (to avoid null field errors)
- how much of current disk usage is terms? will `index_phrases` make it worse?
- do we need to store term offsets in indexes to make phrase queries faster/better, especially if the field is not stored?

Performance seems to have diverged between the two instances, not sure why.
Maybe some query terms just randomly are faster on one instance or the other?
Eg, "wood"

## 2020-08-07 Test Phrase Indexing

Indexing 1 million papers twice, with old and new schema, to check impact of
phrase indexing, in ES 7.x.

    release_export.2019-07-07.5mil_fulltext.json.gz

    git checkout 0c7a2ace5d7c5b357dd4afa708a07e3fa85849fd
    http put ":9200/qa_scholar_fulltext_0c7a2ace?include_type_name=true" < schema/scholar_fulltext.v01.json
    ssh aitio.us.archive.org cat /grande/snapshots/fatcat_scholar_work_fulltext.20200723_two.json.gz \
        | gunzip \
        | head -n1000000 \
        | sudo -u fatcat parallel -j8 --linebuffer --round-robin --pipe pipenv run python -m fatcat_scholar.transform run_transform \
        | esbulk -verbose -size 100 -id key -w 4 -index qa_scholar_fulltext_0c7a2ace -type _doc

    # master branch, phrase indexing
    git checkout 2c681e32756538c84b292cc95b623ee9758846a6
    http put ":9200/qa_scholar_fulltext_2c681e327?include_type_name=true" < schema/scholar_fulltext.v01.json
    ssh aitio.us.archive.org cat /grande/snapshots/fatcat_scholar_work_fulltext.20200723_two.json.gz \
        | gunzip \
        | head -n1000000 \
        | sudo -u fatcat parallel -j8 --linebuffer --round-robin --pipe pipenv run python -m fatcat_scholar.transform run_transform \
        | esbulk -verbose -size 100 -id key -w 4 -index qa_scholar_fulltext_2c681e327 -type _doc

    http get :9200/_cat/indices
    [...]
    green open qa_scholar_fulltext_0c7a2ace    BQ9tH5OZT0evFCXiIJMdUQ 12 0   1000000      0   6.7gb   6.7gb
    green open qa_scholar_fulltext_2c681e327   PgRMn5v-ReWzGlCTiP7b6g 12 0   1000000      0   9.5gb   9.5gb
    [...]

So phrase indexing gives a ~42% larger index on disk, even with other changes
to reduce size. We will probably approach 2 TByte total index size.

    "to be or not to be"
    => qa_scholar_fulltext_0c7a2ace: 65 Hits in 0.2sec (after repetitions)
    => qa_scholar_fulltext_2c681e327: 65 Hits in 0.065sec

    to be or not to be
    => qa_scholar_fulltext_0c7a2ace: 87,586 Hits in 0.16sec
    => qa_scholar_fulltext_2c681e327: 87,590 Hits in 0.16sec

    "Besides all beneficial properties studied for various LAB, a special attention need to be pay on the possible cytotoxicity levels of the expressed bacteriocins"
    => qa_scholar_fulltext_0c7a2ace: 1 Hits in 0.076sec
    => qa_scholar_fulltext_2c681e327: 1 Hits in 0.055sec

    "insect swarm"
    => qa_scholar_fulltext_0c7a2ace: 4 Hits in 0.032sec
    => qa_scholar_fulltext_2c681e327: 4 Hits in 0.024sec

    "how to"
    => qa_scholar_fulltext_0c7a2ace: 15,761 Hits in 0.11sec
    => qa_scholar_fulltext_2c681e327: 15,763 Hits in 0.054sec

Sort of splitting hairs at this scale, but does seem like phrase indexing helps
with some queries. Seems worth at least trying with large/full index.

## 2020-08-07 Iterated Release Batch

Sharded indexing:

    zcat /fast/download/release_export_expanded.2020-08-05.json.gz | split --lines 25000000 - release_export_expanded.split_ -d --additional-suffix .json

    export TMPDIR=/sandcrawler-db/tmp
    for SHARD in {00..06}; do
        cat /bigger/scholar/release_export_expanded.split_$SHARD.json \
            | parallel -j8 --line-buffer --compress --round-robin --pipe python -m fatcat_scholar.work_pipeline run_releases \
            | pv -l \
            | pigz > /grande/scholar/2020-12-30/fatcat_scholar_work_fulltext.split_$SHARD.json.gz
    done

Record counts:

    24.7M 15:09:08 [ 452 /s]
    24.7M 16:11:22 [ 423 /s]
    24.7M 16:38:19 [ 412 /s]
    24.7M 17:29:46 [ 392 /s]
    24.7M 14:55:53 [ 459 /s]
    24.7M 15:02:49 [ 456 /s]
    2M 1:10:36 [ 472 /s]

Have made transform code changes, now at git rev 7603dd0ade23e22197acd1fd1d35962c314cf797.

Transform and index, on svc097 machine:

    ssh aitio.us.archive.org cat /grande/snapshots/fatcat_scholar_work_fulltext.split_*.json.gz \
    | gunzip \
    | head -n2000000 \
    | sudo -u fatcat parallel -j8 --linebuffer --round-robin --pipe pipenv run python -m fatcat_scholar.transform run_transform \
    | esbulk -verbose -size 100 -id key -w 4 -index scholar_fulltext_v01 -type _doc

Derp, got a batch-size error. But maybe it was just a single huge doc? Added a
hack to try to skip transform of very large docs to start. In the future
should truncate specific fields (probably fulltext).

Ahah, actual error was:

    2020/08/12 23:19:15   {"mapper_parsing_exception" "failed to parse field [biblio.issue_int] of type [short] in document with id 'work_aezuqrgnnfcezkkeoyonr6ll54'. Preview of field's value: '48844'" "" "" ""}
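
The ES `short` type tops out at 32767, so an issue number like 48844 fails the
whole bulk batch. A hedged sketch of the kind of guard the transform could
apply (helper name is hypothetical):

    def clamp_short(value):
        """Drop values that don't fit in an ES 'short' (-32768..32767) rather
        than letting the bulk indexer reject the document."""
        if value is None:
            return None
        return value if -32768 <= value <= 32767 else None

    assert clamp_short(48844) is None   # the offending issue number above
    assert clamp_short(48) == 48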

Full indexing:

    ssh aitio.us.archive.org cat /grande/snapshots/fatcat_scholar_work_fulltext.split_*.json.gz \
    | gunzip \
    | sudo -u fatcat parallel -j8 --linebuffer --round-robin --pipe pipenv run python -m fatcat_scholar.transform run_transform \
    | pv -l \
    | esbulk -verbose -size 100 -id key -w 4 -index scholar_fulltext_v01 -type _doc \
    2> /tmp/error.txt 1> /tmp/output.txt

Started: 2020-08-12 14:24

    6.71M 2:46:56 [ 590 /s]

Yikes, is this going to take 60 hours to index? CPU and disk seem to be
basically maxed out, so don't think tweaking batch size or parallelism would
help much.

NOTE: `tail -n +700000`
NOTE: could filter line size: `awk 'length($0) < 16384'`

Had some hardware (?) issue and had to restart.

    ssh aitio.us.archive.org cat /grande/snapshots/fatcat_scholar_work_fulltext.split_{00..06}.json.gz \
    | gunzip \
    | sudo -u fatcat parallel -j8 --linebuffer --round-robin --pipe pipenv run python -m fatcat_scholar.transform run_transform \
    | pv -l \
    | esbulk -verbose -size 100 -id key -w 4 -index scholar_fulltext_v01 -type _doc \
    2> /tmp/error.txt 1> /tmp/output.txt

    => 150M 69:00:35 [ 604 /s]

    => green open scholar_fulltext_v01            2KrkdhuhRDa6SdNC36XR0A 12 0 150232272    130   1.3tb   1.3tb
    => Filesystem      Size  Used Avail Use% Mounted on
    => /dev/vda1       3.5T  1.4T  2.0T  42% /

    ssh aitio.us.archive.org cat /bigger/scholar_old/sim_intermediate.2020-07-23.json.gz \
    | gunzip \
    | sudo -u fatcat parallel -j8 --linebuffer --round-robin --pipe pipenv run python -m fatcat_scholar.transform run_transform \
    | esbulk -verbose -size 100 -id key -w 4 -index scholar_fulltext_v01 -type _doc \
    2> /tmp/error.txt 1> /tmp/output.txt

    => 2020/08/16 21:51:14 1895778 docs in 2h22m55.61416094s at 221.066 docs/s with 4 workers

    => green open scholar_fulltext_v01            2KrkdhuhRDa6SdNC36XR0A 12 0 152090351  26071   1.3tb   1.3tb
    => Filesystem      Size  Used Avail Use% Mounted on
    => /dev/vda1       3.5T  1.4T  2.0T  42% /

Stop elasticsearch, `sync`, restart, to ensure index is fully flushed to disk.

Some warm-up queries: "*", "blood", "to be or not to be"


## 2020-12-30 Simple Release Batch

Hopefully no special cases in this iteration!

    mkdir -p /grande/scholar/2020-12-30/
    cd /grande/scholar/2020-12-30/
    zcat /fast/download/release_export_expanded.2020-12-30.json.gz | split --lines 25000000 - release_export_expanded.split_ -d --additional-suffix .json

    export TMPDIR=/sandcrawler-db/tmp
    for SHARD in {00..06}; do
        cat /grande/scholar_index/2020-12-30/release_export_expanded.split_$SHARD.json \
            | parallel -j8 --line-buffer --compress --round-robin --pipe python -m fatcat_scholar.work_pipeline run_releases \
            | pv -l \
            | pigz > /grande/scholar_index/2020-12-30/fatcat_scholar_work_fulltext.split_$SHARD.json.gz
    done

Continuing 2021-01-16, on the new focal elasticsearch 7.10 cluster:

    # commit: e5a5318829e1f3a08a2e0dbc252d839cc6f5e8f0
    http put ":9200/scholar_fulltext_v01?include_type_name=true" < schema/scholar_fulltext.v01.json

    http put ":9200/scholar_fulltext_v01/_settings" index.routing.allocation.include._name=wbgrp-svc500

    # start with single shard (00)
    ssh aitio.us.archive.org cat /grande/snapshots/fatcat_scholar_work_fulltext.split_00.json.gz \
      | gunzip \
      | sudo -u fatcat parallel -j8 --compress --tmpdir /srv/tmp/ --line-buffer --round-robin --pipe pipenv run python -m fatcat_scholar.transform run_transform \
      | pv -l \
      | esbulk -verbose -size 100 -id key -w 4 -index scholar_fulltext_v01 -type _doc \
      2> /tmp/error.txt 1> /tmp/output.txt

Got an error:

    parallel: Error: Output is incomplete. Cannot append to buffer file in /tmp. Is the disk full?
    parallel: Error: Change $TMPDIR with --tmpdir or use --compress.
    Warning: unable to close filehandle properly: No space left on device during global destruction.

So added `--compress` and the `--tmpdir` (which needed to be created):

    # run other shards
    ssh aitio.us.archive.org cat /grande/snapshots/fatcat_scholar_work_fulltext.split_{01..06}.json.gz \
      | gunzip \
      | sudo -u fatcat parallel -j8 --compress --tmpdir /srv/tmp/ --line-buffer --round-robin --pipe pipenv run python -m fatcat_scholar.transform run_transform \
      | pv -l \
      | esbulk -verbose -size 100 -id key -w 4 -index scholar_fulltext_v01 -type _doc \
      2> /tmp/error.txt 1> /tmp/output.txt

## 2021-06-06 Simple Iteration

Some new paths, more parallelism, and more conservative file naming/handling,
but otherwise not much changed from the 2020-12-30 run above.

    export JOBDIR=/kubwa/scholar/2021-06-03
    mkdir -p $JOBDIR
    cd $JOBDIR
    zcat /fast/release_export_expanded.json.gz | split --lines 8000000 - release_export_expanded.split_ -d --additional-suffix .json

    cd /fast/fatcat-scholar
    pipenv shell
    export TMPDIR=/sandcrawler-db/tmp

    # transform
    set -u -o pipefail
    for SHARD in {00..20}; do
        cat $JOBDIR/release_export_expanded.split_$SHARD.json \
            | parallel -j8 --line-buffer --compress --tmpdir $TMPDIR --round-robin --pipe python -m fatcat_scholar.work_pipeline run_releases \
            | pv -l \
            | pigz \
            > $JOBDIR/fatcat_scholar_work_fulltext.split_$SHARD.json.gz.WIP \
            && mv $JOBDIR/fatcat_scholar_work_fulltext.split_$SHARD.json.gz.WIP $JOBDIR/fatcat_scholar_work_fulltext.split_$SHARD.json.gz
    done

    # dump refs
    set -u -o pipefail
    for SHARD in {00..20}; do
        zcat $JOBDIR/fatcat_scholar_work_fulltext.split_$SHARD.json.gz \
            | pv -l \
            | parallel -j8 --linebuffer --compress --tmpdir $TMPDIR --round-robin --pipe python -m fatcat_scholar.transform run_refs \
            | pigz \
            > $JOBDIR/fatcat_scholar_work_fulltext.split_$SHARD.refs.json.gz.WIP \
            && mv $JOBDIR/fatcat_scholar_work_fulltext.split_$SHARD.refs.json.gz.WIP $JOBDIR/fatcat_scholar_work_fulltext.split_$SHARD.refs.json.gz
    done

Ran into a problem with a single (!) bad TEI-XML document, due to bad text
encoding:

    xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 40, column 1122

Root cause was an issue in GROBID, which seems to have been fixed in more
recent versions of GROBID. Patched to continue, and separately committed a
patch to the fatcat-scholar code base.
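
A minimal sketch of the shape of that patch, assuming the TEI-XML is parsed
per document (function name is hypothetical):

    import xml.etree.ElementTree as ET
    from typing import Optional

    def parse_tei_safe(tei_xml: str) -> Optional[ET.Element]:
        """Parse a GROBID TEI-XML document, returning None on malformed input
        so a single bad document doesn't kill the whole batch."""
        try:
            return ET.fromstring(tei_xml)
        except ET.ParseError:
            return None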

Ran several retries, manually.

Upload to petabox:

    export BASENAME=scholar_corpus_bundle_2021-06-03
    for SHARD in {00..20}; do
        ia upload ${BASENAME}_split-${SHARD} $JOBDIR/README.md $JOBDIR/fatcat_scholar_work_fulltext.split_${SHARD}.json.gz -m collection:"scholarly-tdm" --checksum
    done

    ia upload scholar_corpus_refs_2021-06-03 fatcat_scholar_work_fulltext.split_*.refs.json.gz -m collection:"scholarly-tdm" --checksum


### Performance Notes (on 2021-06-06 run)

Recently added crossref refs via sandcrawler-db postgrest lookup. Seem to still
be getting around 40 works per second with a single thread, similar to
previous performance, so not a significant slowdown.
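
For reference, a hypothetical sketch of such a PostgREST lookup (host, port,
and table/column names here are assumptions, not the actual sandcrawler-db
schema; PostgREST filter syntax is `column=eq.value`):

    import requests

    def lookup_crossref_refs(doi: str) -> list:
        resp = requests.get(
            "http://sandcrawler-db.example.org:3030/crossref",
            params={"doi": "eq." + doi.lower()},
            timeout=30.0,
        )
        resp.raise_for_status()
        return resp.json()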