author: Martin Czygan <martin.czygan@gmail.com> 2021-05-24 16:35:37 +0200
committer: Martin Czygan <martin.czygan@gmail.com> 2021-05-24 16:35:37 +0200
commit: 1aa8471c5ba39b8472ca02c4f18d7b81a118752b
tree: d54677384bdb28f417fb3a23768649ff7b8da1ff /python
parent: d1d74b6ddcb432e8f2b7bfcbe1a784b2db0ae382
overview notes
Diffstat (limited to 'python')
-rw-r--r-- python/notes/overview.md 20
1 files changed, 20 insertions, 0 deletions
diff --git a/python/notes/overview.md b/python/notes/overview.md
new file mode 100644
index 0000000..27b9177
--- /dev/null
+++ b/python/notes/overview.md
@@ -0,0 +1,20 @@
+# Generic data processing approach
+
+## A basic setup
+
+* Python for orchestration; data deps; structured outputs; multistage pipelines
+ with inspectable intermediate results
+* Go for custom, fast tools, when needed
+
+## Quick Fusion
+
+For data fusion, e.g. merging OL works and editions to get a joint dataset:
+
+* select and tabularize data
+* a one-off merge script using, e.g. Pandas
+
+## Hadoop Scripting
+
+* Pig Latin
+* PySpark
+
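
The "Quick Fusion" steps in the notes (select and tabularize, then a one-off merge with Pandas) might look roughly like the sketch below. This is not code from the commit: the column names (`work_key`, `edition_key`, `work_title`, `isbn`), the sample rows, and the `fuse` helper are all hypothetical, and it assumes "OL" refers to Open Library and that works and editions have already been reduced to flat tables.

```python
import pandas as pd

# Hypothetical, already tabularized inputs; all column names and
# values are made up for illustration.
works = pd.DataFrame({
    "work_key": ["/works/OL1W", "/works/OL2W"],
    "work_title": ["Title A", "Title B"],
})
editions = pd.DataFrame({
    "edition_key": ["/books/OL10M", "/books/OL11M", "/books/OL12M"],
    "work_key": ["/works/OL1W", "/works/OL1W", "/works/OL3W"],
    "isbn": ["111", "222", "333"],
})


def fuse(works: pd.DataFrame, editions: pd.DataFrame) -> pd.DataFrame:
    """Attach work-level columns to every edition row (hypothetical helper).

    A left join keeps editions without a matching work; the `_merge`
    column added by `indicator=True` marks them as "left_only".
    `validate="many_to_one"` raises if a work key occurs more than once
    in `works`.
    """
    return editions.merge(
        works,
        on="work_key",
        how="left",
        indicator=True,
        validate="many_to_one",
    )


joint = fuse(works, editions)
print(joint)
```

Keeping unmatched editions visible through the `_merge` column makes the intermediate result inspectable, in line with the "inspectable intermediate results" point in the basic setup.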