summaryrefslogtreecommitdiffstats
path: root/extra/elasticsearch/README.md
diff options
context:
space:
mode:
Diffstat (limited to 'extra/elasticsearch/README.md')
-rw-r--r--extra/elasticsearch/README.md32
1 files changed, 32 insertions, 0 deletions
diff --git a/extra/elasticsearch/README.md b/extra/elasticsearch/README.md
new file mode 100644
index 00000000..b9800143
--- /dev/null
+++ b/extra/elasticsearch/README.md
@@ -0,0 +1,32 @@
+
+# Elasticsearch Schemas and Pipeline Docs
+
+Eventually, we might end up with schemas for multiple entity types, and in
+particular glom/merge releases under their work, but for now we just have a
+release-oriented schema that pulls in collection and files metadata.
+
+Elasticsearch has at least two uses: user-facing search for entities, and
+exploring aggregate numbes.
+
+The schema tries to stay close to the release entity type, but adds some extra
+aggregated fields and flags.
+
+The simple batch update pipeline currently in use is to:
+
+- make a fresh "expanded" release entity dump (JSON)
+- transform using `parallel` and a python script
+- bulk import into elastic using `esbulk`
+
+In the future, it would be nice to have a script that "tails" the changelog for
+edits and updates just those entities in the database. This is somewhat
+non-trivial because the "expand" data requires more sophisticated cache
+invalidation (entity updates), particularly in the case where an inter-entity
+relation is *removed*. For example, if a file match against a given release is
+removed, the old release elastic object needs to be updated to remove the file
+from it's `files`.
+
+## TODO
+
+"enum" types, distinct from "keyword"?
+
+Other identifiers in search index? core, wikidata