Going to do some initial indexing of refs (the "BiblioRefs" schema) into an
Elasticsearch 7 cluster.
## 2021-07-16
* generated "bref" dataset on aitio:
  `/magna/refcat/2021-07-06/BrefCombined/date-2021-07-06.json.zst`
  (d69838fb71623a83b60e03be3493042b27539567)
* 1,865,637,767 docs, about a 40% increase since the last version
The index name will be: `fatcat_ref_v02_20210716`
Create the index from the schema:

    http put :9200/fatcat_ref_v02_20210716 < /srv/fatcat/src/extra/elasticsearch/ref_schema.json
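For reference, a rough sketch of what the settings and a few of the mappings
in `ref_schema.json` might plausibly look like (an assumption, not the actual
file; field names are taken from queries and notes elsewhere in this log, the
shard count from the 2021-04-12 entry, and mapping `update_ts` as
`epoch_second` would address the integer-vs-datetime note at the bottom):

    {
      "settings": {"index": {"number_of_shards": 6, "number_of_replicas": 0}},
      "mappings": {
        "properties": {
          "target_work_ident":    {"type": "keyword"},
          "source_release_stage": {"type": "keyword"},
          "source_year":          {"type": "integer"},
          "ref_key":              {"type": "keyword"},
          "update_ts":            {"type": "date", "format": "epoch_second"}
        }
      }
    }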
Pin all shards to a single node (wbgrp-svc500):

    http put ":9200/fatcat_ref_v02_20210716/_settings" index.routing.allocation.include._name=wbgrp-svc500
Confirm:

    $ http get :9200/_cat/shards/fatcat_ref_v02_20210716
    HTTP/1.1 200 OK
    content-encoding: gzip
    content-length: 117
    content-type: text/plain; charset=UTF-8

    fatcat_ref_v02_20210716 3 p STARTED 0 208b 207.241.225.228 wbgrp-svc500
    fatcat_ref_v02_20210716 1 p STARTED 0 208b 207.241.225.228 wbgrp-svc500
    fatcat_ref_v02_20210716 2 p STARTED 0 208b 207.241.225.228 wbgrp-svc500
    fatcat_ref_v02_20210716 4 p STARTED 0 208b 207.241.225.228 wbgrp-svc500
    fatcat_ref_v02_20210716 5 p STARTED 0 208b 207.241.225.228 wbgrp-svc500
    fatcat_ref_v02_20210716 0 p STARTED 0 208b 207.241.225.228 wbgrp-svc500
Expecting all 1,865,637,767 edges to index as unique docs, since deduplication has improved since v01.
    zstdcat -T0 /srv/fatcat/datasets/fatcat_refs.date-2021-07-06.json.zst | esbulk -verbose -size 2000 -id _id -w 8 -index fatcat_ref_v02_20210716
Watch indexing progress:

    watch -n 10 'curl -s localhost:9200/_cat/indices | grep fatcat_ref_v02_20210716'
Indexing time:

    real    1599m7.314s
    user    1926m34.791s
    sys     169m16.626s
    $ curl -s localhost:9200/_cat/indices | grep fatcat_ref_v02_20210716
    green open fatcat_ref_v02_20210716 Cg-LNym9Q6OPUKJekNPCPw 6 0 1865637767 0 435.7gb 435.7gb
After ~26.5 hours, indexing completed with all 1,865,637,767 docs and zero deleted docs, confirming only unique edges in the dataset; total index size: 435.7 GB.
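As a final sanity check, the doc count can also be compared directly against
the expected edge count (a minimal sketch using the standard `_count` API):

    # expect 1865637767, matching the _cat/indices line above
    curl -s localhost:9200/fatcat_ref_v02_20210716/_count | jq .count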
## 2021-04-12
Reduced `number_of_shards` from 12 to 6.
Create index from schema:

    # note, *not* include_type_name=true in the URL
    http put :9200/fatcat_ref_v01 < ref_schema.json
Force all shards to a single machine (wbgrp-svc500):

    http put ":9200/fatcat_ref_v01/_settings" index.routing.allocation.include._name=wbgrp-svc500

    # would run this later, after adding more nodes to the list, for production deployment
    #http put ":9200/fatcat_ref_v01/_settings" index.number_of_replicas=1
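A sketch of that later production step, on the assumption that the allocation
filter should also be cleared so shards can rebalance across nodes (passing
`null` resets a dynamic index setting; `:=` is httpie's raw-JSON operator):

    # clear the allocation filter so shards can move to other nodes
    http put ":9200/fatcat_ref_v01/_settings" index.routing.allocation.include._name:=null

    # then add a replica for redundancy
    http put ":9200/fatcat_ref_v01/_settings" index.number_of_replicas:=1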
Confirm shard locations:

    $ http get :9200/_cat/shards/fatcat_ref_v01
    fatcat_ref_v01 3 p STARTED 0 208b 207.241.225.228 wbgrp-svc500
    fatcat_ref_v01 1 p STARTED 0 208b 207.241.225.228 wbgrp-svc500
    fatcat_ref_v01 2 p STARTED 0 208b 207.241.225.228 wbgrp-svc500
    fatcat_ref_v01 4 p STARTED 0 208b 207.241.225.228 wbgrp-svc500
    fatcat_ref_v01 5 p STARTED 0 208b 207.241.225.228 wbgrp-svc500
    fatcat_ref_v01 0 p STARTED 0 208b 207.241.225.228 wbgrp-svc500
Copied over `aitio:/magna/refcat/BiblioRefV2/date-2021-02-20.json.zst`, which
seems to have been output on 2021-03-24 (the date in the filename refers to the
date of the source raw reference dump, I believe).
    $ du -sh /srv/fatcat/datasets/date-2021-02-20.json.zst
    39G     /srv/fatcat/datasets/date-2021-02-20.json.zst
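Before kicking off a multi-hour run, a quick integrity and line-count check
may be worthwhile (a sketch; `zstd -t` verifies the archive, and the line
count should land near the expected 785,569,011 edges):

    zstd -t /srv/fatcat/datasets/date-2021-02-20.json.zst
    zstdcat -T0 /srv/fatcat/datasets/date-2021-02-20.json.zst | wc -l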
Check that esbulk is up to date:

    $ esbulk -v
    0.7.3
Expecting on the order of 785,569,011 edges (~103% of the 12/2020 OCI/Crossref release): ~39G compressed, ~288G uncompressed.
Start with 1 million documents to index:

    zstdcat /srv/fatcat/datasets/date-2021-02-20.json.zst | head -n1000000 | esbulk -verbose -size 2000 -id _id -w 8 -index fatcat_ref_v01

    => 2021/04/13 02:55:17 1000000 docs in 19.89s at 50265.935 docs/s with 8 workers
    => green open fatcat_ref_v01 tpHidEK_RSSrY0YDYgTH2Q 6 0 983111 16889 305.5mb 305.5mb
Very fast! The index shows slightly fewer than 1 million live documents;
presumably some lines share an `_id`, and the overwritten versions show up as
the 16,889 deleted docs (983,111 + 16,889 = 1,000,000). If we can extrapolate
from 1 mil ≈ 0.3 GB, then 1 bil ≈ 300 GB, which seems pretty reasonable.
Bump to 20 million:

    zstdcat /srv/fatcat/datasets/date-2021-02-20.json.zst | head -n20000000 | esbulk -verbose -size 2000 -id _id -w 8 -index fatcat_ref_v01

    => 2021/04/13 03:05:17 20000000 docs in 396.53s at 50437.434 docs/s with 8 workers
    => green open fatcat_ref_v01 tpHidEK_RSSrY0YDYgTH2Q 6 0 19731816 171926 5.4gb 5.4gb
Configured es-public-proxy to enable public access to `fatcat_ref`, and added an alias:

    http put :9200/fatcat_ref_v01/_alias/fatcat_ref
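For later version bumps (e.g. to a v02 index), the alias could be swapped
atomically with the `_aliases` API instead of being deleted and re-created; a
sketch:

    http post :9200/_aliases actions:='[
      {"remove": {"index": "fatcat_ref_v01", "alias": "fatcat_ref"}},
      {"add": {"index": "fatcat_ref_v02_20210716", "alias": "fatcat_ref"}}
    ]'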
Crude queries are like:

    http https://search.fatcat.wiki/fatcat_ref/_search q==target_work_ident:o547luzdqragbd66ejlm4pughi
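The same lookup as a structured query body (a sketch; a `term` query against
the keyword field avoids query-string parsing):

    http post https://search.fatcat.wiki/fatcat_ref/_search \
      query:='{"term": {"target_work_ident": "o547luzdqragbd66ejlm4pughi"}}'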
Seems like these first lines were all join/exact matches. Would like to see some other match types as well, so index a random sample using `shuf` instead of `head`:

    zstdcat /srv/fatcat/datasets/date-2021-02-20.json.zst | shuf -n20000000 | esbulk -verbose -size 2000 -id _id -w 8 -index fatcat_ref_v01
Interestingly, `shuf` takes more CPU than `zstdcat`, presumably because it is
doing a lot of random number generation. Note that `shuf` uses reservoir
sampling, so it is usually pretty efficient even on huge streams (see the
sketch below).
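For reference, a minimal sketch of reservoir sampling (Algorithm R) in awk,
which keeps a uniform k-line sample in O(k) memory over a stream of unknown
length:

    # keep a uniform random sample of k lines from stdin
    awk -v k=5 'BEGIN { srand() }
      NR <= k { r[NR] = $0; next }
      { j = int(rand() * NR) + 1; if (j <= k) r[j] = $0 }
      END { for (i = 1; i <= k; i++) print r[i] }'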
Back over on `aitio`, counting unique `_id` values:

    zstdcat /magna/tmp/date-2021-02-20.json.zst | jq ._id -c | sort -u | wc -l
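A `sort -u` over ~800 million ids can be slow with default settings; a sketch
of the same count with byte-wise collation and explicit memory/parallelism
(GNU sort flags):

    zstdcat -T0 /magna/tmp/date-2021-02-20.json.zst \
      | jq -rc ._id \
      | LC_ALL=C sort -u -S 8G --parallel=8 \
      | wc -l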
Ok, just run the whole thing:

    zstdcat /srv/fatcat/datasets/date-2021-02-20.json.zst | esbulk -verbose -size 2000 -id _id -w 8 -index fatcat_ref_v01

    => 2021/04/13 10:10:12 785569011 docs in 22309.29s at 35212.641 docs/s with 8 workers
    => green open fatcat_ref_v01 tpHidEK_RSSrY0YDYgTH2Q 6 0 734856532 38611510 184.2gb 184.2gb
Took a bit over 6 hours, and ~180 GByte total index size. Nice!
## Notes and Thoughts
- many of the `ref_key` values are pretty long, like `bibr4-0036933016674866`. Maybe these can be pared down.
- might want aggregations on `source_release_stage`, `source_year`, etc? (see the sketch after this list)
- `update_ts` may not have come through correctly? It is an integer, not a datetime.
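A sketch of such an aggregation query (field names as in the notes above,
assuming keyword/numeric mappings):

    http post :9200/fatcat_ref/_search size==0 aggs:='{
      "by_stage": {"terms": {"field": "source_release_stage"}},
      "by_year": {"histogram": {"field": "source_year", "interval": 10}}
    }'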