From 86d15bda26280437ac7a853e73d460d0bf9dd418 Mon Sep 17 00:00:00 2001
From: Bryan Newbold <bnewbold@robocracy.org>
Date: Fri, 21 Sep 2018 16:56:01 -0700
Subject: first pass at a release elastic schema

---
 extra/elasticsearch/README.md | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)
 create mode 100644 extra/elasticsearch/README.md

diff --git a/extra/elasticsearch/README.md b/extra/elasticsearch/README.md
new file mode 100644
index 00000000..b9800143
--- /dev/null
+++ b/extra/elasticsearch/README.md
@@ -0,0 +1,32 @@
+
+# Elasticsearch Schemas and Pipeline Docs
+
+Eventually, we might end up with schemas for multiple entity types, and in
+particular glom/merge releases under their work, but for now we just have a
+release-oriented schema that pulls in collection and files metadata.
+
+Elasticsearch has at least two uses: user-facing search for entities, and
+exploring aggregate numbers.
+
+The schema tries to stay close to the release entity type, but adds some extra
+aggregated fields and flags.
+
+The simple batch update pipeline currently in use is to:
+
+- make a fresh "expanded" release entity dump (JSON)
+- transform using `parallel` and a python script
+- bulk import into elastic using `esbulk`
+
+In the future, it would be nice to have a script that "tails" the changelog for
+edits and updates just those entities in the database. This is somewhat
+non-trivial because the "expand" data requires more sophisticated cache
+invalidation (entity updates), particularly in the case where an inter-entity
+relation is *removed*. For example, if a file match against a given release is
+removed, the old release elastic object needs to be updated to remove the file
+from its `files`.
+
+## TODO
+
+"enum" types, distinct from "keyword"?
+
+Other identifiers in search index? core, wikidata
-- 
cgit v1.2.3
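
The transform step of the batch pipeline described in the README could look roughly like the sketch below. It assumes the "expanded" dump holds one JSON object per line. Every field name used here (`ident`, `title`, `container`, `files`, `file_count`, `any_file`) is a hypothetical placeholder chosen for illustration, not taken from the actual schema.

```python
import json
import sys


def release_to_elastic(release: dict) -> dict:
    """Flatten one expanded release entity into a flat search document.

    All field names are hypothetical placeholders, not the real schema.
    """
    files = release.get("files") or []
    container = release.get("container") or {}
    return {
        "ident": release.get("ident"),
        "title": release.get("title"),
        "container_name": container.get("name"),
        # Aggregated fields and flags derived from the expanded data.
        "file_count": len(files),
        "any_file": len(files) > 0,
    }


def main() -> None:
    # Read one JSON object per line and write one per line, so the script
    # can sit between a dump file and a bulk loader.
    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        doc = release_to_elastic(json.loads(line))
        print(json.dumps(doc, sort_keys=True))


if __name__ == "__main__":
    main()
```

Because each input line is handled independently, a script of this shape can be fanned out across chunks of the dump with `parallel`, and its line-delimited output handed to a bulk loader such as `esbulk`.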