Diffstat (limited to 'skate')
-rw-r--r-- skate/README.md 20
1 files changed, 20 insertions, 0 deletions
diff --git a/skate/README.md b/skate/README.md
index a63ce18..40a863d 100644
--- a/skate/README.md
+++ b/skate/README.md
@@ -90,3 +90,23 @@ Step 3 can now go into all depth understanding multiplicities, e.g. is it an
"ebd", "ff" type? Does it come from different source (e.g. then choose the one
most likely being correct, etc), ...
+### A simple pattern
+
+We have a typical pattern:
+
+* two medium-sized data sets with different schemas
+* mapper functions per schema
+* a reduce function streaming through the "mapped" files of both schemas
+
+This is basically a `GROUP BY`, where we might want to group by a value that we
+need to compute first (e.g. title normalization, SOUNDEX, NYSIIS, ...), and
+where the aggregation is a full function, e.g. one able to generate a document
+in a third schema (e.g. a biblioref document).
+
+We could look into something like PostgreSQL and add custom functions: load JSON
+files, load functions, run. Or we could keep the data at rest and implement a
+performant scan over it manually.
+
+Ideally, some type would encapsulate schema, extraction and reduction in a
+single, runnable entity.
+
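A minimal sketch of the pattern the added section describes, written in Python purely for illustration. Every name here (`GroupJob`, `normalize_title`, `map_a`, `map_b`, `link`, and the `id`, `title`, `name` fields) is hypothetical and not taken from skate. The in-memory sort stands in for the sorted "mapped" files that a real run would stream from disk.

```python
import json
from dataclasses import dataclass
from itertools import groupby
from typing import Callable, Iterable, Iterator

# A mapper turns one parsed document into a (key, document) pair.
Mapper = Callable[[dict], tuple[str, dict]]
# A reducer receives a key and all (schema, document) pairs sharing that key.
Reducer = Callable[[str, list[tuple[str, dict]]], Iterable[dict]]


@dataclass
class GroupJob:
    """Hypothetical type bundling per-schema mappers with one reducer."""

    mappers: dict[str, Mapper]
    reducer: Reducer

    def run(self, inputs: dict[str, Iterable[str]]) -> Iterator[dict]:
        # Map phase: apply the schema-specific mapper to every JSON line.
        mapped = []
        for schema, lines in inputs.items():
            mapper = self.mappers[schema]
            for line in lines:
                key, doc = mapper(json.loads(line))
                mapped.append((key, schema, doc))
        # Assumption: data fits in memory; larger inputs would need an
        # external sort and a streaming merge instead.
        mapped.sort(key=lambda t: (t[0], t[1]))
        # Reduce phase: one reducer call per group of equal keys.
        for key, group in groupby(mapped, key=lambda t: t[0]):
            yield from self.reducer(key, [(s, d) for _, s, d in group])


def normalize_title(title: str) -> str:
    # Simplistic stand-in for title normalization: lowercase alphanumerics only.
    return "".join(c for c in title.lower() if c.isalnum())


def map_a(doc: dict) -> tuple[str, dict]:
    # Hypothetical schema "a" keeps its title in the "title" field.
    return normalize_title(doc["title"]), doc


def map_b(doc: dict) -> tuple[str, dict]:
    # Hypothetical schema "b" keeps its title in the "name" field.
    return normalize_title(doc["name"]), doc


def link(key: str, docs: list[tuple[str, dict]]) -> Iterator[dict]:
    # Emit one document in a third schema per (a, b) pair sharing the key.
    a_ids = [d["id"] for s, d in docs if s == "a"]
    b_ids = [d["id"] for s, d in docs if s == "b"]
    for a in a_ids:
        for b in b_ids:
            yield {"key": key, "source": a, "target": b}
```

A job would then be assembled as `GroupJob(mappers={"a": map_a, "b": map_b}, reducer=link)` and run over a dictionary of line iterables, one per schema. Keeping the key computation inside the mappers is what lets the reduce step stay a plain pass over sorted data.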