Going to do some initial indexing of refs ("BiblioRefs" schema) into the elasticsearch 7 cluster.

## 2021-07-16

* generated "bref" dataset on aitio, "/magna/refcat/2021-07-06/BrefCombined/date-2021-07-06.json.zst", d69838fb71623a83b60e03be3493042b27539567
* 1,865,637,767 docs, about 40% increase since last version

The index name will be: `fatcat_ref_v02_20210716`

    http put :9200/fatcat_ref_v02_20210716 < /srv/fatcat/src/extra/elasticsearch/ref_schema.json

All shards on a single machine (wbgrp-svc500):

    http put ":9200/fatcat_ref_v02_20210716/_settings" index.routing.allocation.include._name=wbgrp-svc500

Confirm:

    $ http get :9200/_cat/shards/fatcat_ref_v02_20210716
    HTTP/1.1 200 OK
    content-encoding: gzip
    content-length: 117
    content-type: text/plain; charset=UTF-8

    fatcat_ref_v02_20210716 3 p STARTED 0 208b 207.241.225.228 wbgrp-svc500
    fatcat_ref_v02_20210716 1 p STARTED 0 208b 207.241.225.228 wbgrp-svc500
    fatcat_ref_v02_20210716 2 p STARTED 0 208b 207.241.225.228 wbgrp-svc500
    fatcat_ref_v02_20210716 4 p STARTED 0 208b 207.241.225.228 wbgrp-svc500
    fatcat_ref_v02_20210716 5 p STARTED 0 208b 207.241.225.228 wbgrp-svc500
    fatcat_ref_v02_20210716 0 p STARTED 0 208b 207.241.225.228 wbgrp-svc500

Expecting 1,865,637,767 edges; deduplication has improved since v01.

    zstdcat -T0 /srv/fatcat/datasets/fatcat_refs.date-2021-07-06.json.zst | esbulk -verbose -size 2000 -id _id -w 8 -index fatcat_ref_v02_20210716

Watch indexing:

    watch -n 10 'curl -s localhost:9200/_cat/indices | grep fatcat_ref_v02_20210716'

Indexing time:

    real    1599m7.314s
    user    1926m34.791s
    sys     169m16.626s

    $ curl -s localhost:9200/_cat/indices | grep fatcat_ref_v02_20210716
    green open fatcat_ref_v02_20210716 Cg-LNym9Q6OPUKJekNPCPw 6 0 1865637767 0 435.7gb 435.7gb

After about 26 hours, indexing completed with all 1,865,637,767 docs indexed and zero deletions, so the dataset contains only unique edges; total index size is ~435 GB. Not bringing other nodes or replicas into the allocation just yet; TODO: specify the setup path to a balanced, replicated index.

## 2021-04-12

Reduced `number_of_shards` from 12 to 6.

Create index from schema:

    # note: do *not* pass include_type_name=true in the URL
    http put :9200/fatcat_ref_v01 < ref_schema.json

Force all shards to a single machine (wbgrp-svc500):

    http put ":9200/fatcat_ref_v01/_settings" index.routing.allocation.include._name=wbgrp-svc500

    # would run this later, after adding more nodes to the allocation list, for production deployment
    #http put ":9200/fatcat_ref_v01/_settings" index.number_of_replicas=1

Confirm shard locations:

    http get :9200/_cat/shards/fatcat_ref_v01

    fatcat_ref_v01 3 p STARTED 0 208b 207.241.225.228 wbgrp-svc500
    fatcat_ref_v01 1 p STARTED 0 208b 207.241.225.228 wbgrp-svc500
    fatcat_ref_v01 2 p STARTED 0 208b 207.241.225.228 wbgrp-svc500
    fatcat_ref_v01 4 p STARTED 0 208b 207.241.225.228 wbgrp-svc500
    fatcat_ref_v01 5 p STARTED 0 208b 207.241.225.228 wbgrp-svc500
    fatcat_ref_v01 0 p STARTED 0 208b 207.241.225.228 wbgrp-svc500

Copied over `aitio:/magna/refcat/BiblioRefV2/date-2021-02-20.json.zst`, which seems to have been output on 2021-03-24 (the date in the filename refers to the date of the source raw reference dump, I believe).
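The copy command itself isn't recorded in these notes; presumably something along these lines (hypothetical invocation), plus a quick integrity check before indexing:

    # hypothetical: pull the dump over from aitio, then test the zstd framing
    rsync -avP aitio:/magna/refcat/BiblioRefV2/date-2021-02-20.json.zst /srv/fatcat/datasets/
    zstd -t /srv/fatcat/datasets/date-2021-02-20.json.zst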
    du -sh /srv/fatcat/datasets/date-2021-02-20.json.zst
    39G     /srv/fatcat/datasets/date-2021-02-20.json.zst

Check that esbulk is up to date:

    esbulk -v
    0.7.3

Expecting on the order of 785,569,011 edges (~103% of the 12/2020 OCI/crossref release), ~39 GB compressed, ~288 GB uncompressed.

Start with 1 million documents to index:

    zstdcat /srv/fatcat/datasets/date-2021-02-20.json.zst | head -n1000000 | esbulk -verbose -size 2000 -id _id -w 8 -index fatcat_ref_v01

    => 2021/04/13 02:55:17 1000000 docs in 19.89s at 50265.935 docs/s with 8 workers
    => green open fatcat_ref_v01 tpHidEK_RSSrY0YDYgTH2Q 6 0 983111 16889 305.5mb 305.5mb

Very fast! Not quite 1 million live documents, though; presumably the ~17k "deleted" docs are duplicates (repeated `_id` values)? If we can extrapolate from 1 million ~= 0.3 GB, then 1 billion ~= 300 GB, which seems pretty reasonable.

Bump to 20 million:

    zstdcat /srv/fatcat/datasets/date-2021-02-20.json.zst | head -n20000000 | esbulk -verbose -size 2000 -id _id -w 8 -index fatcat_ref_v01

    => 2021/04/13 03:05:17 20000000 docs in 396.53s at 50437.434 docs/s with 8 workers
    => green open fatcat_ref_v01 tpHidEK_RSSrY0YDYgTH2Q 6 0 19731816 171926 5.4gb 5.4gb

Configured es-public-proxy to enable public access to `fatcat_ref`, and added an alias:

    http put :9200/fatcat_ref_v01/_alias/fatcat_ref

Crude queries look like:

    http https://search.fatcat.wiki/fatcat_ref/_search q==target_work_ident:o547luzdqragbd66ejlm4pughi

Seems like these were all join/exact matches. Would like some other kinds of matches in the sample, so index an additional 20 million drawn with `shuf` instead of `head`:

    zstdcat /srv/fatcat/datasets/date-2021-02-20.json.zst | shuf -n20000000 | esbulk -verbose -size 2000 -id _id -w 8 -index fatcat_ref_v01

Interestingly, `shuf` takes more CPU than `zstdcat`, presumably because it is doing a lot of random number generation. Note that `shuf` uses reservoir sampling, so it is usually pretty efficient.

Back over on `aitio`, counting unique `_id` values in the full dump:

    zstdcat /magna/tmp/date-2021-02-20.json.zst | jq ._id -c | sort -u | wc -l

Ok, just run the whole thing:

    zstdcat /srv/fatcat/datasets/date-2021-02-20.json.zst | esbulk -verbose -size 2000 -id _id -w 8 -index fatcat_ref_v01

    => 2021/04/13 10:10:12 785569011 docs in 22309.29s at 35212.641 docs/s with 8 workers
    => green open fatcat_ref_v01 tpHidEK_RSSrY0YDYgTH2Q 6 0 734856532 38611510 184.2gb 184.2gb

Took a bit over 6 hours, and ~184 GB total index size. Nice!

## Notes and Thoughts

- many of the `ref_key` values are pretty long, like `bibr4-0036933016674866`; maybe these can be pared down
- might want aggregations on `source_release_stage`, `source_year`, etc? (see the sketch after this list)
- `update_ts` may not have come through correctly? it is an integer, not a datetime
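A sketch of what those aggregations could look like against the public endpoint. This assumes `source_year` and `source_release_stage` are mapped as aggregatable types (integer/keyword) in `ref_schema.json`, which hasn't been checked here, and the bucket names are arbitrary:

    # hypothetical: top source years and release stages across all edges;
    # size=0 suppresses the hits themselves, returning only aggregation buckets
    curl -s -H 'Content-Type: application/json' \
        "https://search.fatcat.wiki/fatcat_ref/_search?size=0" \
        -d '{
          "aggs": {
            "by_source_year": {"terms": {"field": "source_year", "size": 20}},
            "by_source_release_stage": {"terms": {"field": "source_release_stage", "size": 10}}
          }
        }'

If `source_release_stage` turns out to be mapped as `text`, it would need a `.keyword` sub-field or a mapping change for a terms aggregation to work.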