Diffstat (limited to 'extra/elasticsearch/README.md')
-rw-r--r-- | extra/elasticsearch/README.md | 47 |
1 files changed, 44 insertions, 3 deletions
diff --git a/extra/elasticsearch/README.md b/extra/elasticsearch/README.md
index b9800143..0d205903 100644
--- a/extra/elasticsearch/README.md
+++ b/extra/elasticsearch/README.md
@@ -25,8 +25,49 @@ relation is *removed*. For example, if a file match against a given release is
 removed, the old release elastic object needs to be updated to remove the file
 from it's `files`.
 
-## TODO
+## Loading Data
+
+Drop and rebuild the schema:
+
+    http delete :9200/fatcat
+    http put :9200/fatcat < release_schema.json
+
+Put a single object (good for debugging):
+
+    head -n1 examples.json | http post :9200/fatcat/release/0
+    http get :9200/fatcat/release/0
+
+Bulk insert from a file on disk:
+
+    esbulk -verbose -id ident -index fatcat -type release examples.json
 
-"enum" types, distinct from "keyword"?
+Or, in a bulk production live-stream conversion:
+
+    time zcat /srv/fatcat/snapshots/fatcat_release_dump_expanded.json.gz | ./transform_release.py | esbulk -verbose -size 20000 -id ident -w 8 -index fatcat-releases -type release
+
+## Full-Text Querying
+
+A generic full-text "query string" query look like this (replace "blood" with
+actual query string, and "size" field with the max results to return):
+
+    GET /fatcat/release/_search
+    {
+      "query": {
+        "query_string": {
+          "query": "blood",
+          "analyzer": "textIcuSearch",
+          "default_operator": "AND",
+          "analyze_wildcard": true,
+          "lenient": true,
+          "fields": ["title^3", "contrib_names^3", "container_title"]
+        }
+      },
+      "size": 3
+    }
+
+In the results take `.hits.hits[]._source` as the objects; `.hits.total` is the
+total number of search hits.
+
+## TODO
 
-Other identifiers in search index? core, wikidata
+- file URL domains? seems heavy
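The "Full-Text Querying" section added in this diff describes a request body and how to read the response. As a rough illustration only, the same steps could be sketched in Python like this. The helper names `build_query` and `parse_results` are hypothetical and not part of the repository, and the sample response is made up for the example. The sketch also assumes that `.hits.total` may arrive either as a plain integer (as the README describes) or wrapped in an object with a `value` key, which is how newer Elasticsearch versions report it by default.

```python
import json


def build_query(query, size=3):
    """Build the query_string request body described in the README.

    Hypothetical helper: `query` replaces "blood" and `size` caps the
    number of results returned.
    """
    return {
        "query": {
            "query_string": {
                "query": query,
                "analyzer": "textIcuSearch",
                "default_operator": "AND",
                "analyze_wildcard": True,
                "lenient": True,
                "fields": ["title^3", "contrib_names^3", "container_title"],
            }
        },
        "size": size,
    }


def parse_results(response):
    """Return (total, objects) from a decoded _search response dict."""
    hits = response["hits"]
    total = hits["total"]
    # Assumption: newer Elasticsearch versions return an object such as
    # {"value": 7, "relation": "eq"} instead of a bare integer.
    if isinstance(total, dict):
        total = total["value"]
    # Equivalent of `.hits.hits[]._source`
    objects = [hit["_source"] for hit in hits["hits"]]
    return total, objects


# Made-up response, shaped like the fields the README refers to.
sample_response = {
    "hits": {
        "total": 1,
        "hits": [{"_source": {"title": "example release"}}],
    }
}

print(json.dumps(build_query("blood"), indent=2))
print(parse_results(sample_response))
```

The printed request body can be sent to `/fatcat/release/_search` with any HTTP client; the sketch deliberately leaves the network call out so it runs without a live index.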