From a67c33a19166e610b153a3f77be19c6cf5cf4235 Mon Sep 17 00:00:00 2001
From: Martin Czygan
Date: Mon, 2 Aug 2021 17:27:38 +0200
Subject: update notes

---
 skate/README.md | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/skate/README.md b/skate/README.md
index a63ce18..40a863d 100644
--- a/skate/README.md
+++ b/skate/README.md
@@ -90,3 +90,23 @@ Step 3 can now go into all depth understanding multiplicities, e.g. is it an
 "ebd", "ff" type? Does it come from different source (e.g. then choose the one
 most likely being correct, etc), ...
 
+### A simple pattern
+
+We have a typical pattern:
+
+* two medium-sized data sets with different schemas
+* mapper functions per schema
+* a reduce function streaming through the "mapped" files of both schemas
+
+Basically a `GROUP BY`, where we might want to group by a value that we need
+to compute first (e.g. title normalization, SOUNDEX, NYSIIS, ...), and where
+the aggregation is a full function, e.g. one able to generate a document in a
+third schema (e.g. a biblioref document), etc.
+
+We could look into something like Postgres and add custom functions: load
+JSON files, load functions, run. Or keep the data at rest and implement a
+performant scan over it manually.
+
+What we want is a type that encapsulates schema, extraction and reduction in
+a single, runnable entity.
+
-- 
cgit v1.2.3
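
Below, a minimal Go sketch of the pattern described in the added README
section: a computed grouping key (a crude title normalization standing in for
SOUNDEX/NYSIIS) plus a full aggregation function, bundled into one runnable
entity. All names (`Doc`, `Mapper`, `Reducer`, `Job`) are hypothetical and not
part of skate's actual API; a real implementation would stream over sorted,
mapped files rather than build an in-memory map.

```go
// A minimal sketch, assuming hypothetical names; not skate's actual API.
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Doc is a parsed record from one of the two input schemas.
type Doc struct {
	Source string // which input data set the record came from
	Title  string // field we derive the grouping key from
}

// Mapper extracts a grouping key from a document, e.g. a normalized title.
type Mapper func(Doc) (key string, ok bool)

// Reducer sees all documents sharing a key and may emit a document in a
// third schema (a plain string here, standing in for e.g. a biblioref).
type Reducer func(key string, group []Doc) (string, bool)

// Job bundles extraction and reduction into a single runnable entity.
type Job struct {
	Map    Mapper
	Reduce Reducer
}

// Run groups the input by the mapped key and applies the reducer per group.
func (j Job) Run(docs []Doc) (out []string) {
	groups := make(map[string][]Doc)
	for _, d := range docs {
		if k, ok := j.Map(d); ok {
			groups[k] = append(groups[k], d)
		}
	}
	keys := make([]string, 0, len(groups))
	for k := range groups {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic output order
	for _, k := range keys {
		if doc, ok := j.Reduce(k, groups[k]); ok {
			out = append(out, doc)
		}
	}
	return out
}

func main() {
	job := Job{
		// Grouping key: a crude title normalization.
		Map: func(d Doc) (string, bool) {
			k := strings.ToLower(strings.TrimSpace(d.Title))
			return k, k != ""
		},
		// Aggregation: only emit groups that span both sources.
		Reduce: func(key string, group []Doc) (string, bool) {
			sources := make(map[string]bool)
			for _, d := range group {
				sources[d.Source] = true
			}
			if len(sources) < 2 {
				return "", false
			}
			return fmt.Sprintf("%s: %d candidate matches", key, len(group)), true
		},
	}
	for _, m := range job.Run([]Doc{
		{Source: "a", Title: "Deep Learning"},
		{Source: "b", Title: " deep learning"},
		{Source: "a", Title: "Unrelated"},
	}) {
		fmt.Println(m)
	}
}
```

Swapping the in-memory map for a sort-merge over the two mapped files would
turn this into the streaming `GROUP BY` the note has in mind, without changing
the `Job` interface.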