update notes

author: Martin Czygan <martin.czygan@gmail.com> 2021-08-02 17:27:38 +0200
committer: Martin Czygan <martin.czygan@gmail.com> 2021-08-02 17:27:38 +0200
commit: a67c33a19166e610b153a3f77be19c6cf5cf4235 (patch)
tree: c0fa28129d86dedcd415ceb96b991c7207e2b07b /skate
parent: edd7a195846c8c1cf7fb4f915f8cd1610736d6d5 (diff)
download: refcat-a67c33a19166e610b153a3f77be19c6cf5cf4235.tar.gz
refcat-a67c33a19166e610b153a3f77be19c6cf5cf4235.zip
1 files changed, 20 insertions, 0 deletions
diff --git a/skate/README.md b/skate/README.md
index a63ce18..40a863d 100644
--- a/skate/README.md
+++ b/skate/README.md
@@ -90,3 +90,23 @@ Step 3 can now go into all depth understanding multiplicities, e.g. is it an
 "ebd", "ff" type? Does it come from different source (e.g. then choose the one
 most likely being correct, etc), ...
 
+### A simple pattern
+
+We have a typical pattern:
+
+* two medium-sized data sets with different schemas
+* mapper functions per schema
+* a reduce function streaming through the "mapped" files of both schemas
+
+Basically a `GROUP BY`, where we might want to group by a value that we need to
+compute first (e.g. title normalization, SOUNDEX, NYSIIS, ...) and where the
+aggregation is a full function, e.g. one able to generate a document in a third
+schema (e.g. a biblioref document).
+
+We could look into something like PostgreSQL (PG) and add custom functions: load
+JSON files, load functions, run. Or we keep the data at rest and try to implement
+a performant scan over it manually.
+
+One option: some type that encapsulates schema, extraction and reduction in a
+single, runnable entity.
+
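The pattern described in the added README section can be sketched roughly as follows. This is a minimal illustration in Python, not code from skate: the `Source` type, `normalize_title`, `reduce_groups`, `link_refs` and all field names (`ident`, `ref_id`, `unstructured_title`, and the shape of the output document) are assumptions made up for the example. Each schema gets its own mapper that computes the grouping key, and the reducer streams through the key-sorted output of both mappers.

```python
import heapq
import itertools
import re
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


def normalize_title(title: str) -> str:
    """Hypothetical key function: lowercase, keep only letters and digits."""
    return re.sub(r"[^a-z0-9]", "", title.lower())


@dataclass
class Source:
    """One schema: a name plus a mapper that extracts the grouping key."""
    name: str
    key: Callable[[dict], str]

    def mapped(self, docs: Iterable[dict]) -> list:
        # Emit (key, source name, original doc) tuples, sorted by key and name.
        rows = ((self.key(d), self.name, d) for d in docs)
        return sorted(rows, key=lambda t: t[:2])


def reduce_groups(streams, reducer) -> Iterator[dict]:
    """Merge the sorted streams and hand each key group to the reducer."""
    merged = heapq.merge(*streams, key=lambda t: t[:2])
    for key, group in itertools.groupby(merged, key=lambda t: t[0]):
        yield from reducer(key, list(group))


def link_refs(key: str, group: list) -> Iterator[dict]:
    """Hypothetical reducer: emit one document in a third schema per match."""
    releases = [d for _, name, d in group if name == "release"]
    refs = [d for _, name, d in group if name == "ref"]
    for ref in refs:
        for rel in releases:
            yield {"source": ref["ref_id"], "target": rel["ident"], "key": key}


# Two small, made-up data sets with different schemas.
releases = [
    {"ident": "r1", "title": "Deep Learning"},
    {"ident": "r2", "title": "Graph Theory"},
]
refs = [
    {"ref_id": "x9", "unstructured_title": "deep  learning."},
    {"ref_id": "x7", "unstructured_title": "Unmatched Work"},
]

release_source = Source("release", lambda d: normalize_title(d["title"]))
ref_source = Source("ref", lambda d: normalize_title(d["unstructured_title"]))

links = list(
    reduce_groups(
        [release_source.mapped(releases), ref_source.mapped(refs)],
        link_refs,
    )
)
```

Here the sort happens in memory for brevity. With medium-sized data sets, the mapped output would more likely be written to files and sorted externally, so that the reducer only needs a single pass over the merged stream.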