author | Martin Czygan <martin.czygan@gmail.com> | 2021-05-24 16:35:37 +0200 |
---|---|---|
committer | Martin Czygan <martin.czygan@gmail.com> | 2021-05-24 16:35:37 +0200 |
commit | 1aa8471c5ba39b8472ca02c4f18d7b81a118752b (patch) | |
tree | d54677384bdb28f417fb3a23768649ff7b8da1ff | |
parent | d1d74b6ddcb432e8f2b7bfcbe1a784b2db0ae382 (diff) | |
overview notes
-rw-r--r-- | python/notes/overview.md | 20 |
1 files changed, 20 insertions, 0 deletions
diff --git a/python/notes/overview.md b/python/notes/overview.md
new file mode 100644
index 0000000..27b9177
--- /dev/null
+++ b/python/notes/overview.md
@@ -0,0 +1,20 @@
+# Generic data processing approach
+
+## A basic setup
+
+* Python for orchestration; data deps; structured outputs; multistage pipelines
+  with inspectable intermediate results
+* Go for custom, fast tools, when needed
+
+## Quick Fusion
+
+For data fusion, e.g. merging OL works and editions to get a joint dataset:
+
+* select and tabularize data
+* a one-off merge script using, e.g. Pandas
+
+## Hadoop Scripting
+
+* Pig Latin
+* PySpark
+
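The "Quick Fusion" steps in the added notes (select and tabularize, then a one-off merge with Pandas) could look roughly like the sketch below. It is illustrative only and not taken from the refcat repository: the column names (`work_key`, `edition_key`, `work_title`, `isbn`) and the records are hypothetical stand-ins for already-tabularized OL works and editions.

```python
import pandas as pd

# Hypothetical, hand-written rows standing in for tabularized OL data;
# the real selection step would produce these tables from the raw dumps.
works = pd.DataFrame(
    [
        {"work_key": "/works/OL1W", "work_title": "Example Work A"},
        {"work_key": "/works/OL2W", "work_title": "Example Work B"},
    ]
)
editions = pd.DataFrame(
    [
        {"edition_key": "/books/OL1M", "work_key": "/works/OL1W", "isbn": "0000000001"},
        {"edition_key": "/books/OL2M", "work_key": "/works/OL1W", "isbn": "0000000002"},
        {"edition_key": "/books/OL3M", "work_key": "/works/OL9W", "isbn": "0000000003"},
    ]
)

# Left join: keep every edition, attach work fields where a work matches.
# indicator=True adds a "_merge" column that flags unmatched editions.
joint = editions.merge(works, on="work_key", how="left", indicator=True)

# Editions that reference a work missing from the works table.
orphans = joint[joint["_merge"] == "left_only"]

print(joint)
print(f"{len(orphans)} edition(s) without a matching work")
```

A left join is one choice among several here; an inner join would silently drop editions whose work is missing, so the `_merge` indicator column is kept to make such gaps inspectable.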