From 86d15bda26280437ac7a853e73d460d0bf9dd418 Mon Sep 17 00:00:00 2001
From: Bryan Newbold
Date: Fri, 21 Sep 2018 16:56:01 -0700
Subject: first pass at a release elastic schema

---
 extra/elasticsearch/README.md | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)
 create mode 100644 extra/elasticsearch/README.md

diff --git a/extra/elasticsearch/README.md b/extra/elasticsearch/README.md
new file mode 100644
index 00000000..b9800143
--- /dev/null
+++ b/extra/elasticsearch/README.md
@@ -0,0 +1,32 @@
+
+# Elasticsearch Schemas and Pipeline Docs
+
+Eventually, we might end up with schemas for multiple entity types, and in
+particular glom/merge releases under their work, but for now we just have a
+release-oriented schema that pulls in collection and files metadata.
+
+Elasticsearch has at least two uses: user-facing search for entities, and
+exploring aggregate numbers.
+
+The schema tries to stay close to the release entity type, but adds some extra
+aggregated fields and flags.
+
+The simple batch update pipeline currently in use is to:
+
+- make a fresh "expanded" release entity dump (JSON)
+- transform using `parallel` and a Python script
+- bulk import into elastic using `esbulk`
+
+In the future, it would be nice to have a script that "tails" the changelog for
+edits and updates just those entities in the database. This is somewhat
+non-trivial because the "expand" data requires more sophisticated cache
+invalidation (entity updates), particularly in the case where an inter-entity
+relation is *removed*. For example, if a file match against a given release is
+removed, the old release elastic object needs to be updated to remove the file
+from its `files`.
+
+## TODO
+
+"enum" types, distinct from "keyword"?
+
+Other identifiers in search index? core, wikidata
-- 
cgit v1.2.3
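
The transform step of the batch pipeline described in the README (a Python script run under `parallel`) might look roughly like the sketch below. It assumes the dump is one expanded release JSON object per line. Every field name in it (`ident`, `title`, `container`, `files`, `container_name`, `file_count`, `any_file`) is a hypothetical placeholder chosen for illustration, not the schema this patch defines, and the function names are likewise made up.

```python
"""Sketch: flatten expanded release JSON lines into search documents.

Field names and helpers are illustrative assumptions, not the real schema.
"""
import json


def release_to_doc(release):
    """Flatten one expanded release dict into a flat search document."""
    # Expanded sub-entities may be missing or null, so default them.
    files = release.get("files") or []
    container = release.get("container") or {}
    return {
        "ident": release.get("ident"),
        "title": release.get("title"),
        "container_name": container.get("name"),
        # Extra aggregated fields and flags derived from the file list.
        "file_count": len(files),
        "any_file": bool(files),
    }


def transform_lines(lines):
    """Yield one JSON-encoded document per non-empty input line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        yield json.dumps(release_to_doc(json.loads(line)), sort_keys=True)
```

Calling `transform_lines(sys.stdin)` from a `__main__` guard and printing each result would turn this into a line-oriented filter. Because each input line is handled independently, such a filter can be fanned out with GNU `parallel --pipe`, and the resulting newline-delimited JSON is the kind of input `esbulk` is meant to consume.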