Going to do some initial indexing of refs ("BiblioRefs" schema) into an
elasticsearch 7 cluster.

## 2021-04-12

Reduced `number_of_shards` from 12 to 6.

Create index from schema:

    # note, *not* include_type_name=true in the URL
    http put :9200/fatcat_ref_v01 < ref_schema.json
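
The relevant settings at the top of `ref_schema.json` presumably look
something like this (a sketch: the shard/replica counts match the `_cat`
output below, but the field mappings here are guesses based on fields
mentioned later in these notes):

    {
      "settings": {
        "index": {
          "number_of_shards": 6,
          "number_of_replicas": 0
        }
      },
      "mappings": {
        "properties": {
          "target_work_ident":    { "type": "keyword" },
          "source_release_stage": { "type": "keyword" },
          "source_year":          { "type": "keyword" },
          "ref_key":              { "type": "keyword" },
          "update_ts":            { "type": "long" }
        }
      }
    }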

Force all shards to a single machine (wbgrp-svc500):

    http put ":9200/fatcat_ref_v01/_settings" index.routing.allocation.include._name=wbgrp-svc500

    # would run this later, after adding more nodes to list, for production deployment
    #http put ":9200/fatcat_ref_v01/_settings" index.number_of_replicas=1
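
Settings can be double-checked with:

    http get :9200/fatcat_ref_v01/_settings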

Confirm shard locations:

    http get :9200/_cat/shards/fatcat_ref_v01

        fatcat_ref_v01 3 p STARTED 0 208b 207.241.225.228 wbgrp-svc500
        fatcat_ref_v01 1 p STARTED 0 208b 207.241.225.228 wbgrp-svc500
        fatcat_ref_v01 2 p STARTED 0 208b 207.241.225.228 wbgrp-svc500
        fatcat_ref_v01 4 p STARTED 0 208b 207.241.225.228 wbgrp-svc500
        fatcat_ref_v01 5 p STARTED 0 208b 207.241.225.228 wbgrp-svc500
        fatcat_ref_v01 0 p STARTED 0 208b 207.241.225.228 wbgrp-svc500

Copied over `aitio:/magna/refcat/BiblioRefV2/date-2021-02-20.json.zst`, which
seems to have been output on 2021-03-24 (the date in the filename refers to the
date of the source raw reference dump, I believe).

    du -sh /srv/fatcat/datasets/date-2021-02-20.json.zst
    39G     /srv/fatcat/datasets/date-2021-02-20.json.zst
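
Quick sanity check of the record shape before indexing (first line only):

    zstdcat /srv/fatcat/datasets/date-2021-02-20.json.zst | head -n1 | jq .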

Check that esbulk is updated:

    esbulk -v
    0.7.3

Expecting on the order of:

    785,569,011 edges (~103% of 12/2020 OCI/crossref release), ~39G compressed, ~288G uncompressed

Start with 1 million documents to index:

    zstdcat /srv/fatcat/datasets/date-2021-02-20.json.zst | head -n1000000 | esbulk -verbose -size 2000 -id _id -w 8 -index fatcat_ref_v01
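    # flag notes: -size 2000 is docs per bulk request, -id _id re-uses the
    # record's _id field as the ES document id (so duplicate _ids overwrite),
    # -w 8 is the number of parallel bulk workers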

    => 2021/04/13 02:55:17 1000000 docs in 19.89s at 50265.935 docs/s with 8 workers

    => green open fatcat_ref_v01 tpHidEK_RSSrY0YDYgTH2Q 6 0 983111 16889 305.5mb 305.5mb

Very fast!

Not quite 1 million documents; presumably some duplicate `_id` values
overwriting each other? The doc count (983,111) plus the deleted count
(16,889) sums to exactly 1 million, which fits that theory. If we extrapolate
from 1 mil ~= 0.3 GByte, then 1 bil ~= 300 GByte, which seems pretty
reasonable.
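
One way to check the duplicate-`_id` theory directly on the same slice (an
untested sketch; counts `_id` values that appear more than once, and the
`sort` will take a while):

    zstdcat /srv/fatcat/datasets/date-2021-02-20.json.zst \
        | head -n1000000 \
        | jq -r ._id \
        | sort \
        | uniq -d \
        | wc -l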

Bump to 20 million:

    zstdcat /srv/fatcat/datasets/date-2021-02-20.json.zst | head -n20000000 | esbulk -verbose -size 2000 -id _id -w 8 -index fatcat_ref_v01

    => 2021/04/13 03:05:17 20000000 docs in 396.53s at 50437.434 docs/s with 8 workers

    => green open fatcat_ref_v01 tpHidEK_RSSrY0YDYgTH2Q 6 0 19731816 171926 5.4gb 5.4gb

Configured es-public-proxy to enable public access to `fatcat_ref`, and added an alias:

    http put :9200/fatcat_ref_v01/_alias/fatcat_ref
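
The alias can be confirmed with:

    http get :9200/_alias/fatcat_ref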

Crude queries are like:

    http https://search.fatcat.wiki/fatcat_ref/_search q==target_work_ident:o547luzdqragbd66ejlm4pughi

Seems like these were all join/exact matches. Would like some others, so do an additional run with `shuf` instead of `head`:

    zstdcat /srv/fatcat/datasets/date-2021-02-20.json.zst | shuf -n20000000 | esbulk -verbose -size 2000 -id _id -w 8 -index fatcat_ref_v01

Interestingly, `shuf` takes more CPU than `zstdcat`, presumably because it is
doing a lot of random number generation. Note that `shuf -n` uses reservoir
sampling, so it is usually pretty efficient even on a stream this large.

Back over on `aitio`, counting distinct `_id` values to quantify the duplication:

    zstdcat /magna/tmp/date-2021-02-20.json.zst | jq ._id -c | sort -u | wc -l

Ok, just run the whole thing:

    zstdcat /srv/fatcat/datasets/date-2021-02-20.json.zst | esbulk -verbose -size 2000 -id _id -w 8 -index fatcat_ref_v01

    => 2021/04/13 10:10:12 785569011 docs in 22309.29s at 35212.641 docs/s with 8 workers
    => green open fatcat_ref_v01 tpHidEK_RSSrY0YDYgTH2Q 6 0 734856532 38611510 184.2gb 184.2gb

Took a bit over 6 hours, and ~180 GByte total index size. Nice! The final doc
count (734,856,532) is about 50.7 million (~6.5%) below the input line count,
consistent with the duplicate `_id` overwrites seen above.

## Notes and Thoughts

- many of the `ref_key` values are pretty long, like `bibr4-0036933016674866`. Maybe these can be parsed down to something shorter.
- might want aggregations on `source_release_stage`, `source_year`, etc? (see the sketch below)
- `update_ts` may not have come through correctly? it is an integer, not a datetime (a possible mapping fix is sketched below)
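
For the aggregation idea, a sketch using HTTPie's raw-JSON (`:=`) syntax,
assuming `source_year` is mapped as a keyword or numeric field:

    http post :9200/fatcat_ref/_search size:=0 aggs:='{"by_year": {"terms": {"field": "source_year"}}}'

And if `update_ts` is epoch seconds, a future schema revision could map it as
a real date along these lines:

    "update_ts": { "type": "date", "format": "epoch_second" }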