author | Bryan Newbold <bnewbold@robocracy.org> | 2018-09-21 16:56:01 -0700 |
---|---|---|
committer | Bryan Newbold <bnewbold@robocracy.org> | 2018-09-21 16:56:01 -0700 |
commit | 86d15bda26280437ac7a853e73d460d0bf9dd418 (patch) | |
tree | cfd8347bb1f4e98cdab67cebb4637421458673a9 /extra/elasticsearch/README.md | |
parent | d495df1f76c44b7e09db2fb8b93615ffcdf6b818 (diff) | |
first pass at a release elastic schema
Diffstat (limited to 'extra/elasticsearch/README.md')
-rw-r--r-- | extra/elasticsearch/README.md | 32 |
1 files changed, 32 insertions, 0 deletions
diff --git a/extra/elasticsearch/README.md b/extra/elasticsearch/README.md
new file mode 100644
index 00000000..b9800143
--- /dev/null
+++ b/extra/elasticsearch/README.md
@@ -0,0 +1,32 @@
+
+# Elasticsearch Schemas and Pipeline Docs
+
+Eventually, we might end up with schemas for multiple entity types, and in
+particular glom/merge releases under their work, but for now we just have a
+release-oriented schema that pulls in collection and files metadata.
+
+Elasticsearch has at least two uses: user-facing search for entities, and
+exploring aggregate numbers.
+
+The schema tries to stay close to the release entity type, but adds some extra
+aggregated fields and flags.
+
+The simple batch update pipeline currently in use is to:
+
+- make a fresh "expanded" release entity dump (JSON)
+- transform using `parallel` and a python script
+- bulk import into elastic using `esbulk`
+
+In the future, it would be nice to have a script that "tails" the changelog for
+edits and updates just those entities in the database. This is somewhat
+non-trivial because the "expand" data requires more sophisticated cache
+invalidation (entity updates), particularly in the case where an inter-entity
+relation is *removed*. For example, if a file match against a given release is
+removed, the old release elastic object needs to be updated to remove the file
+from its `files`.
+
+## TODO
+
+"enum" types, distinct from "keyword"?
+
+Other identifiers in search index? core, wikidata
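The README's transform step and its note about removed relations could be sketched roughly as below. This is an illustration under stated assumptions, not the script the commit refers to: the function names (`release_to_doc`, `transform_lines`, `releases_to_reindex`) and every field name other than `files` are hypothetical, and the real schema may differ.

```python
import json


def release_to_doc(release):
    """Flatten one expanded release entity (a dict) into a search document.

    Field names other than `files` are assumptions for illustration,
    not the actual schema.
    """
    files = release.get("files") or []
    container = release.get("container") or {}
    return {
        "ident": release.get("ident"),
        "title": release.get("title"),
        "container_name": container.get("name"),
        # Extra aggregated fields and flags derived from the expanded data
        "file_count": len(files),
        "has_files": bool(files),
    }


def transform_lines(lines):
    """Yield one JSON document string per non-empty JSON input line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        yield json.dumps(release_to_doc(json.loads(line)), sort_keys=True)


def releases_to_reindex(old_release_ids, new_release_ids):
    """Return release idents whose documents may be stale after a file edit.

    Taking the union covers removed relations: a release that no longer
    references the file still needs its `files` list rebuilt.
    """
    return sorted(set(old_release_ids) | set(new_release_ids))
```

Feeding `sys.stdin` through `transform_lines` and printing each result would give a line-oriented filter of the kind that could sit between the dump and the bulk import. `releases_to_reindex` shows why a changelog-tailing updater would need the previous state of a file entity as well as the new one.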