Going to do some initial indexing of refs ("BiblioRefs" schema) into an
Elasticsearch 7 cluster.
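
For orientation, each line of the dump is one citation edge as a JSON
document, shaped very roughly like the sketch below. This is a guess, not a
dump excerpt: the field names are the ones mentioned elsewhere in these notes,
the `target_work_ident` and `ref_key` values are borrowed from examples
further down, and everything else is made up (the actual dump is one compact
JSON object per line).

    {
      "_id": "<hash-of-edge>",
      "target_work_ident": "o547luzdqragbd66ejlm4pughi",
      "ref_key": "bibr4-0036933016674866",
      "source_release_stage": "published",
      "source_year": 2016,
      "update_ts": 1618900000
    }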

## 2021-07-16

* generated "bref" dataset on aitio,
  "/magna/refcat/2021-07-06/BrefCombined/date-2021-07-06.json.zst",
  d69838fb71623a83b60e03be3493042b27539567
* 1,865,637,767 docs, about 40% increase since last version

The index name will be: `fatcat_ref_v02_20210716`

    http put :9200/fatcat_ref_v02_20210716 < /srv/fatcat/src/extra/elasticsearch/ref_schema.json
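
For context, the top of `ref_schema.json` presumably looks something like the
sketch below; this is not copied from the file. The shard/replica counts are
inferred from the `_cat` output further down, and the two field mappings are
guesses based on field names used in these notes.

    {
      "settings": {
        "index": {"number_of_shards": 6, "number_of_replicas": 0}
      },
      "mappings": {
        "properties": {
          "target_work_ident": {"type": "keyword"},
          "source_year": {"type": "integer"}
        }
      }
    }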

Route all shards to a single machine (wbgrp-svc500):

    http put ":9200/fatcat_ref_v02_20210716/_settings" index.routing.allocation.include._name=wbgrp-svc500

Confirm shard locations:

    $ http get :9200/_cat/shards/fatcat_ref_v02_20210716
    HTTP/1.1 200 OK
    content-encoding: gzip
    content-length: 117
    content-type: text/plain; charset=UTF-8

    fatcat_ref_v02_20210716 3 p STARTED 0 208b 207.241.225.228 wbgrp-svc500
    fatcat_ref_v02_20210716 1 p STARTED 0 208b 207.241.225.228 wbgrp-svc500
    fatcat_ref_v02_20210716 2 p STARTED 0 208b 207.241.225.228 wbgrp-svc500
    fatcat_ref_v02_20210716 4 p STARTED 0 208b 207.241.225.228 wbgrp-svc500
    fatcat_ref_v02_20210716 5 p STARTED 0 208b 207.241.225.228 wbgrp-svc500
    fatcat_ref_v02_20210716 0 p STARTED 0 208b 207.241.225.228 wbgrp-svc500

Expecting all 1,865,637,767 edges to index as unique documents, since
deduplication was improved after v01.

    zstdcat -T0 /srv/fatcat/datasets/fatcat_refs.date-2021-07-06.json.zst | esbulk -verbose -size 2000 -id _id -w 8 -index fatcat_ref_v02_20210716

Watch indexing:

    watch -n 10 'curl -s localhost:9200/_cat/indices | grep fatcat_ref_v02_20210716'

Indexing time:

    real    1599m7.314s
    user    1926m34.791s
    sys     169m16.626s

    $ curl -s localhost:9200/_cat/indices | grep fatcat_ref_v02_20210716
    green open fatcat_ref_v02_20210716             Cg-LNym9Q6OPUKJekNPCPw  6 0 1865637767        0 435.7gb 435.7gb

Indexing completed after about 26.5 hours. The indexed doc count matches the
input line count exactly, confirming only unique edges in the dataset; total
index size: 435.7 GB.
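
A direct way to double-check that count (standard `_count` endpoint; a quick
sketch, not run here):

    http get :9200/fatcat_ref_v02_20210716/_count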

## 2021-04-12

Reduced `number_of_shards` from 12 to 6.

Create index from schema:

    # note, *not* include_type_name=true in the URL
    http put :9200/fatcat_ref_v01 < ref_schema.json

Force all shards to a single machine (wbgrp-svc500):

    http put ":9200/fatcat_ref_v01/_settings" index.routing.allocation.include._name=wbgrp-svc500

    # would run this later, after adding more nodes to list, for production deployment
    #http put ":9200/fatcat_ref_v01/_settings" index.number_of_replicas=1

Confirm shard locations:

    http get :9200/_cat/shards/fatcat_ref_v01

        fatcat_ref_v01 3 p STARTED 0 208b 207.241.225.228 wbgrp-svc500
        fatcat_ref_v01 1 p STARTED 0 208b 207.241.225.228 wbgrp-svc500
        fatcat_ref_v01 2 p STARTED 0 208b 207.241.225.228 wbgrp-svc500
        fatcat_ref_v01 4 p STARTED 0 208b 207.241.225.228 wbgrp-svc500
        fatcat_ref_v01 5 p STARTED 0 208b 207.241.225.228 wbgrp-svc500
        fatcat_ref_v01 0 p STARTED 0 208b 207.241.225.228 wbgrp-svc500

Copied over `aitio:/magna/refcat/BiblioRefV2/date-2021-02-20.json.zst`, which
seems to have been output on 2021-03-24 (the date in the filename refers to the
date of the source raw reference dump, I believe).

    du -sh /srv/fatcat/datasets/date-2021-02-20.json.zst
    39G     /srv/fatcat/datasets/date-2021-02-20.json.zst

Check that esbulk is updated:

    esbulk -v
    0.7.3

Expecting on the order of:

    785,569,011 edges (~103% of 12/2020 OCI/crossref release), ~39G compressed, ~288G uncompressed

Start with 1 million documents to index:

    zstdcat /srv/fatcat/datasets/date-2021-02-20.json.zst | head -n1000000 | esbulk -verbose -size 2000 -id _id -w 8 -index fatcat_ref_v01

    => 2021/04/13 02:55:17 1000000 docs in 19.89s at 50265.935 docs/s with 8 workers

    => green open fatcat_ref_v01 tpHidEK_RSSrY0YDYgTH2Q 6 0 983111 16889 305.5mb 305.5mb

Very fast!

Not quite 1 million live documents: 983,111 live plus 16,889 deleted sums to
exactly 1,000,000, so some lines evidently shared an `_id` and were
overwritten rather than added. Extrapolating from 1 million docs ~= 0.3 GB,
1 billion docs ~= 300 GByte, which seems pretty reasonable.
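
A back-of-envelope check on that extrapolation, using the numbers from the
1 million run above (plain arithmetic, nothing index-specific):

    python3 -c 'print(305.5e6 / 983111 * 785569011 / 1e9)'
    # => ~244 (GB projected for all 785M edges)

The full run (below) actually came in at 184.2 GB, presumably thanks to
duplicate ids and better compression in larger merged segments.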

Bump to 20 million:

    zstdcat /srv/fatcat/datasets/date-2021-02-20.json.zst | head -n20000000 | esbulk -verbose -size 2000 -id _id -w 8 -index fatcat_ref_v01

    => 2021/04/13 03:05:17 20000000 docs in 396.53s at 50437.434 docs/s with 8 workers

    => green open fatcat_ref_v01 tpHidEK_RSSrY0YDYgTH2Q 6 0 19731816 171926 5.4gb 5.4gb

Configured es-public-proxy to enable public access to `fatcat_ref`, and added an alias:

    http put :9200/fatcat_ref_v01/_alias/fatcat_ref
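
Quick check that the alias took (standard `_alias` endpoint):

    http get :9200/_alias/fatcat_ref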

Crude queries look like this (HTTPie `q==` syntax passes `q` as a URL
parameter):

    http https://search.fatcat.wiki/fatcat_ref/_search q==target_work_ident:o547luzdqragbd66ejlm4pughi

Seems like these were all join/exact matches. Would like a more varied sample,
so index another 20 million drawn with `shuf` instead of `head`:

    zstdcat /srv/fatcat/datasets/date-2021-02-20.json.zst | shuf -n20000000 | esbulk -verbose -size 2000 -id _id -w 8 -index fatcat_ref_v01

Interestingly, `shuf` takes more CPU than `zstdcat`, presumably because it is
doing a lot of random number generation. Note that `shuf -n` uses reservoir
sampling, so it is usually pretty memory-efficient even on large inputs.

Back over on `aitio`, counting distinct `_id` values to gauge how much
duplication to expect:

    zstdcat /magna/tmp/date-2021-02-20.json.zst | jq ._id -c | sort -u | wc -l
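
That `sort -u` over ~785 million ids is slow in the default locale; if
re-running, byte-order collation and a parallel sort should speed it up
considerably (standard GNU coreutils flags):

    zstdcat /magna/tmp/date-2021-02-20.json.zst | jq -c ._id | LC_ALL=C sort -u --parallel=8 -S 25% | wc -l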

Ok, just run the whole thing:

    zstdcat /srv/fatcat/datasets/date-2021-02-20.json.zst | esbulk -verbose -size 2000 -id _id -w 8 -index fatcat_ref_v01

    => 2021/04/13 10:10:12 785569011 docs in 22309.29s at 35212.641 docs/s with 8 workers
    => green open fatcat_ref_v01 tpHidEK_RSSrY0YDYgTH2Q 6 0 734856532 38611510 184.2gb 184.2gb

Took a bit over 6 hours, and ~180 GByte total index size. Nice!

## Notes and Thoughts

- many of the `ref_key` values are pretty long, like `bibr4-0036933016674866`; maybe these can be pared down
- might want aggregations on `source_release_stage`, `source_year`, etc? (sketch below)
- `update_ts` may not have come through correctly? it is an integer, not a datetime
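
A sketch of the kind of aggregation mentioned above, using HTTPie's `:=`
raw-JSON syntax (untested; assumes `source_release_stage` is mapped as a
keyword field):

    http post :9200/fatcat_ref/_search size:=0 aggs:='{"by_stage": {"terms": {"field": "source_release_stage"}}}'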