author Martin Czygan <martin.czygan@gmail.com> 2021-08-02 17:27:38 +0200
committer Martin Czygan <martin.czygan@gmail.com> 2021-08-02 17:27:38 +0200
commit a67c33a19166e610b153a3f77be19c6cf5cf4235
tree c0fa28129d86dedcd415ceb96b991c7207e2b07b
parent edd7a195846c8c1cf7fb4f915f8cd1610736d6d5
update notes
-rw-r--r-- skate/README.md 20
1 files changed, 20 insertions, 0 deletions
diff --git a/skate/README.md b/skate/README.md
index a63ce18..40a863d 100644
--- a/skate/README.md
+++ b/skate/README.md
@@ -90,3 +90,23 @@
 Step 3 can now go into all depth understanding multiplicities, e.g. is it an
 "ebd", "ff" type? Does it come from different source (e.g. then choose the one
 most likely being correct, etc), ...
+### A simple pattern
+
+We have a typical pattern:
+
+* two medium-sized data sets with different schemas
+* mapper functions per schema
+* a reduce function streaming through the "mapped" files of both schemas
+
+Basically a `GROUP BY`, where we might want to group by a value that we need to
+compute first (e.g. title normalization, SOUNDEX, NYSIIS, ...) and where the
+aggregation is a full function, e.g. one able to generate a document in a third
+schema (e.g. a biblioref document).
+
+We could look into something like PostgreSQL and add custom functions: load
+JSON files, load functions, run. Or keep data at rest and try to implement a
+performant scan over it, manually.
+
+Ideally, some type would encapsulate schema, extraction and reduction in a
+single, runnable entity.
+
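
The pattern described in the added notes can be sketched as follows. This is a minimal illustration and not code from skate: the `Pattern` type, the field names (`ident`, `title`, `id`, `ref_title`), the toy `normalize_title` key and the output schema are all assumptions made up for the example. It maps two small in-memory data sets to a computed key, merges the sorted "mapped" streams, and passes each key group to a reducer that emits documents in a third schema.

```python
import heapq
import itertools
import re
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List


def normalize_title(title: str) -> str:
    """Toy grouping key (an assumption): lowercase, keep ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", title.lower())


@dataclass
class Pattern:
    """Hypothetical type bundling one mapper per schema with a single reducer."""
    map_a: Callable[[dict], str]
    map_b: Callable[[dict], str]
    reduce: Callable[[str, List[dict], List[dict]], Iterable[dict]]

    @staticmethod
    def mapped(docs, mapper, tag):
        # Stand-in for a "mapped" file: (key, tag, doc) tuples sorted by key.
        return sorted(((mapper(d), tag, d) for d in docs), key=lambda t: t[0])

    def run(self, docs_a, docs_b) -> Iterator[dict]:
        # Stream through both sorted inputs at once, one key group at a time.
        merged = heapq.merge(
            self.mapped(docs_a, self.map_a, "a"),
            self.mapped(docs_b, self.map_b, "b"),
            key=lambda t: t[0],
        )
        for key, group in itertools.groupby(merged, key=lambda t: t[0]):
            rows = list(group)
            side_a = [doc for _, tag, doc in rows if tag == "a"]
            side_b = [doc for _, tag, doc in rows if tag == "b"]
            yield from self.reduce(key, side_a, side_b)


def link(key, releases, refs):
    """Reducer: emit one document in a third (made-up) schema per matching pair."""
    for ref in refs:
        for release in releases:
            yield {"source": ref["id"], "target": release["ident"], "key": key}


# Two tiny data sets with different (invented) schemas.
releases = [
    {"ident": "r1", "title": "A Simple Pattern"},
    {"ident": "r2", "title": "Other Work"},
]
refs = [
    {"id": "x9", "ref_title": "a simple pattern!"},
    {"id": "x7", "ref_title": "Unmatched"},
]

pattern = Pattern(
    map_a=lambda d: normalize_title(d["title"]),
    map_b=lambda d: normalize_title(d["ref_title"]),
    reduce=link,
)
results = list(pattern.run(releases, refs))
```

Sorting in memory only works for small inputs. For medium-sized files the same shape would more likely rely on an external sort of the mapped files, with the reducer reading both sorted files line by line.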