Diffstat (limited to 'extra/elasticsearch/README.md')
-rw-r--r-- extra/elasticsearch/README.md 47
1 files changed, 44 insertions, 3 deletions
diff --git a/extra/elasticsearch/README.md b/extra/elasticsearch/README.md
index b9800143..0d205903 100644
--- a/extra/elasticsearch/README.md
+++ b/extra/elasticsearch/README.md
@@ -25,8 +25,49 @@ relation is *removed*. For example, if a file match against a given release is
removed, the old release elastic object needs to be updated to remove the file
from its `files`.
-## TODO
+## Loading Data
+
+Drop and rebuild the schema:
+
+    http delete :9200/fatcat
+    http put :9200/fatcat < release_schema.json
+
+Put a single object (good for debugging):
+
+    head -n1 examples.json | http post :9200/fatcat/release/0
+    http get :9200/fatcat/release/0
+
+Bulk insert from a file on disk:
+
+    esbulk -verbose -id ident -index fatcat -type release examples.json
-"enum" types, distinct from "keyword"?
+Or, for a bulk production load that transforms a snapshot dump on the fly:
+
+    time zcat /srv/fatcat/snapshots/fatcat_release_dump_expanded.json.gz | ./transform_release.py | esbulk -verbose -size 20000 -id ident -w 8 -index fatcat-releases -type release
+
+## Full-Text Querying
+
+A generic full-text "query string" query looks like this (replace "blood" with
+the actual query string, and set "size" to the maximum number of results to return):
+
+    GET /fatcat/release/_search
+    {
+      "query": {
+        "query_string": {
+          "query": "blood",
+          "analyzer": "textIcuSearch",
+          "default_operator": "AND",
+          "analyze_wildcard": true,
+          "lenient": true,
+          "fields": ["title^3", "contrib_names^3", "container_title"]
+        }
+      },
+      "size": 3
+    }
+
+In the results, take `.hits.hits[]._source` as the objects; `.hits.total` is the
+total number of search hits.
+
+## TODO
-Other identifiers in search index? core, wikidata
+- file URL domains? seems heavy
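
The query and result handling described in the added "Full-Text Querying" section can be sketched in Python as follows. This is only an illustration, not part of the commit: the helper names `build_query` and `parse_response` are hypothetical, the sketch works on an already-decoded JSON response rather than talking to a server, and the handling of an object-valued `hits.total` is an assumption about newer Elasticsearch versions that the README itself does not mention.

```python
import json


def build_query(query_string, size=3):
    """Build the request body for the generic "query string" search.

    Hypothetical helper: mirrors the JSON body shown in the README.
    """
    return {
        "query": {
            "query_string": {
                "query": query_string,
                "analyzer": "textIcuSearch",
                "default_operator": "AND",
                "analyze_wildcard": True,
                "lenient": True,
                "fields": ["title^3", "contrib_names^3", "container_title"],
            }
        },
        "size": size,
    }


def parse_response(response):
    """Return (total_hits, objects) from a decoded search response.

    Objects come from .hits.hits[]._source; the count from .hits.total.
    """
    hits = response["hits"]
    total = hits["total"]
    # Assumption: some Elasticsearch versions report the total as an
    # object such as {"value": 2, "relation": "eq"} instead of an integer.
    if isinstance(total, dict):
        total = total["value"]
    objects = [hit["_source"] for hit in hits["hits"]]
    return total, objects


if __name__ == "__main__":
    # Print the body that would be sent to /fatcat/release/_search.
    print(json.dumps(build_query("blood", size=3), indent=2))
```

The printed body could be sent with any HTTP client to `/fatcat/release/_search`; keeping the parsing step separate makes it easy to check against a saved response.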