Run a partial ~5 million paper batch through:

    zcat /srv/fatcat_scholar/release_export.2019-07-07.5mil_fulltext.json.gz \
        | parallel -j8 --line-buffer --round-robin --pipe python -m fatcat_scholar.work_pipeline run_releases \
        | pv -l \
        | gzip > data/work_intermediate.5mil.json.gz
    => 5M 21:36:14 [64.3 /s]

    # runs about 70 works/sec with this parallelism => 1mil in 4hr, 5mil in 20hr
    # looks like seaweedfs is bottleneck?
    # tried stopping persist workers on seaweedfs and basically no change

    indexing to ES seems to take roughly an hour per million, or so; can check index
    monitoring to get a better number
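
One quick way to get a better number without waiting on index monitoring: sample the document count from the `_cat` API twice and divide by the interval. A minimal sketch, assuming the ES host answers on `localhost:9200` and using the `qa_scholar_fulltext_v01` index name from the QA indexing run below:

    # sample docs.count twice, 60 seconds apart, and print docs/sec
    # (host and index name are assumptions)
    before=$(curl -s 'localhost:9200/_cat/indices/qa_scholar_fulltext_v01?h=docs.count')
    sleep 60
    after=$(curl -s 'localhost:9200/_cat/indices/qa_scholar_fulltext_v01?h=docs.count')
    echo "$(( (after - before) / 60 )) docs/sec"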

## 2020-07-23 First Full Release Batch

Patched to skip fetching `pdftext`

Run full batch through (on aitio), expecting this to take on the order of a
week:

    zcat /fast/download/release_export_expanded.json.gz \
        | parallel -j8 --line-buffer --compress --round-robin --pipe python -m fatcat_scholar.work_pipeline run_releases \
        | pv -l \
        | gzip > /grande/snapshots/fatcat_scholar_work_fulltext.20200723.json.gz

Ah, this was running really slow because `MINIO_SECRET_KEY` was not set. Really
should replace the `minio` python client library, as we are now using seaweedfs!

Got an error:

    36.1M 15:29:38 [ 664 /s]
    parallel: Error: Output is incomplete. Cannot append to buffer file in /fast/tmp. Is the disk full?
    parallel: Error: Change $TMPDIR with --tmpdir or use --compress.
    Warning: unable to close filehandle properly: No space left on device during global destruction.

Might have been due to `/` filling up (not `/fast/tmp`)? Had gotten pretty far
into processing. Restarted; will keep an eye on it.
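
If it recurs and the GNU parallel buffer directory really is the problem, the `--tmpdir` flag the error message suggests can point the buffer files at a larger disk; a sketch of the same command with a hypothetical spool directory on `/bigger`:

    zcat /fast/download/release_export_expanded.json.gz \
        | parallel -j8 --line-buffer --compress --tmpdir /bigger/tmp --round-robin --pipe python -m fatcat_scholar.work_pipeline run_releases \
        | pv -l \
        | gzip > /grande/snapshots/fatcat_scholar_work_fulltext.20200723.json.gz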

To index, run from ES machine, as bnewbold:

    ssh aitio.us.archive.org cat /grande/snapshots/fatcat_scholar_work_fulltext.partial.20200723.json.gz \
    | gunzip \
    | sudo -u fatcat parallel -j8 --linebuffer --round-robin --pipe pipenv run python -m fatcat_scholar.transform run_transform \
    | esbulk -verbose -size 100 -id key -w 4 -index qa_scholar_fulltext_v01 -type _doc

Hrm, again:

    99.9M 56:04:41 [ 308 /s]
    parallel: Error: Output is incomplete. Cannot append to buffer file in /fast/tmp. Is the disk full?
    parallel: Error: Change $TMPDIR with --tmpdir or use --compress.

Confirmed that the disk was full at that moment; frustrating, as disk usage had
looked low enough when checked earlier, and data was flowing to `/grande` (large
spinning disk). Moving the release dump to `/bigger` and clearing more space on
`/fast` should be sufficient to do the full indexing.

    /dev/sdg1       917G  871G     0 100% /fast

    vs.

    /dev/sdg1       917G  442G  430G  51% /fast

    -rw-rw-r-- 1 bnewbold bnewbold  418G Jul 27 05:55 fatcat_scholar_work_fulltext.20200723.json.gz
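
To actually clear the space, something like the following (the target directory
on `/bigger` is a guess):

    # free up /fast by relocating the input dump, then re-check headroom
    mv /fast/download/release_export_expanded.json.gz /bigger/download/
    df -h /fast /grande /bigger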

Got to about 2/3 of the full release dump. Current rough estimates for total
processing times:

    enrich 150 million releases: 80hr (3-4 days), 650 GByte on disk (gzip)
    transform and index 150 million releases: 55hr (2-3 days), 1.5 TByte on disk (?)

## ES Performance Iteration (2020-07-27)

- schema: switch abstracts from nested to simple array
- query: include fewer fields: just biblio (with boost; and maybe title) and "everything"
- query: use date-level granularity for time queries (may already do this?)
- set replica=0 (for now)
- set shards=12, to optimize *individual query* performance
    => if estimating 800 GByte index size, this is 60-70 GByte per shard
- set `index.codec=best_compression` to leverage CPU vs. disk I/O (see the index-creation sketch after this list)
- ensure transform output is sorted by key
    => <https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-disk-usage.html#_put_fields_in_the_same_order_in_documents>
- ensure number of cores is large
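
A minimal sketch of applying the static settings above at index creation time (the host is an assumption, and mappings are omitted; the real index creation presumably still goes through the schema tooling):

    curl -XPUT 'localhost:9200/qa_scholar_fulltext_v01' \
        -H 'Content-Type: application/json' \
        -d '{
            "settings": {
                "index": {
                    "number_of_shards": 12,
                    "number_of_replicas": 0,
                    "codec": "best_compression"
                }
            }
        }'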

## Work Grouping

Plan for work-grouped expanded release dumps:

Have the release identifier dump script include, and sort by, `work_id`. This will
definitely slow down that stage; unclear if by too much. `work_id` is indexed.
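
For illustration, the same grouping effect can be had by sorting an ident dump after the fact; a rough sketch, assuming a hypothetical TSV layout with the release ident in column 1 and `work_id` in column 2:

    # sort on the work_id column so releases of the same work end up adjacent
    zcat release_idents.tsv.gz \
        | sort -t$'\t' -k2,2 --parallel=4 -S 4G \
        | gzip > release_idents.work_sorted.tsv.gz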

Bulk dump script iterates and makes per-work batches of releases to dump, passing
each batch as a Vec to worker threads. Worker threads pass back a Vec of entities,
which then get printed sequentially so all releases of the same work stay together.