# Generic data processing approach

## A basic setup

* Python for orchestration: data dependencies, structured outputs, and
  multistage pipelines with inspectable intermediate results
* Go for custom, fast tools, when needed
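
A minimal sketch of what "multistage pipeline with inspectable intermediate
results" could look like in plain Python. The names `run_stage`, `extract`,
`normalize`, and `run_pipeline` are hypothetical, and the records are made up
for illustration. In practice a workflow library may take this role; this only
shows the idea that each stage writes its output to a file that can be
inspected and reused.

```python
import json
from pathlib import Path


def run_stage(name, func, inputs, workdir):
    """Run one stage and persist its output as JSON in workdir."""
    target = Path(workdir) / f"{name}.json"
    if target.exists():
        # An earlier run already produced this intermediate result; reuse it.
        return json.loads(target.read_text())
    result = func(inputs)
    target.write_text(json.dumps(result, indent=2))
    return result


def extract(_):
    # Stand-in for reading raw data; these records are invented.
    return [{"id": 1, "title": " A Title "}, {"id": 2, "title": "another"}]


def normalize(records):
    # Trim whitespace and lowercase the title of every record.
    return [{**r, "title": r["title"].strip().lower()} for r in records]


def run_pipeline(workdir):
    raw = run_stage("extract", extract, None, workdir)
    return run_stage("normalize", normalize, raw, workdir)
```

After a run, `extract.json` and `normalize.json` sit in the working directory,
so any stage can be checked by hand, and deleting a file forces only that stage
to be recomputed.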

## Quick Fusion

For data fusion, e.g. merging Open Library (OL) works and editions into a joint dataset:

* select and tabularize the data
* write a one-off merge script, e.g. with pandas
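
The two steps above can be sketched with pandas as follows. The column names
(`work_key`, `edition_key`, `work_title`, `isbn`) and all values are
placeholders, not the actual Open Library schema. The sketch assumes both
sides have already been reduced to flat tables sharing one key column.

```python
import pandas as pd

# Placeholder tables; in a real run these would be loaded from
# tabularized dumps, e.g. with pd.read_csv.
works = pd.DataFrame(
    {
        "work_key": ["W1", "W2"],
        "work_title": ["First", "Second"],
    }
)
editions = pd.DataFrame(
    {
        "edition_key": ["E1", "E2", "E3"],
        "work_key": ["W1", "W1", "W3"],
        "isbn": ["111", "222", "333"],
    }
)


def fuse(works, editions):
    """Attach work-level columns to each edition row.

    A left join keeps every edition; the `_merge` indicator column
    marks editions whose work is missing from the works table.
    """
    return editions.merge(works, on="work_key", how="left", indicator=True)


joined = fuse(works, editions)
```

Using `indicator=True` makes unmatched rows easy to find
(`joined[joined["_merge"] == "left_only"]`), which is useful for checking the
quality of a one-off merge.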

## Hadoop Scripting

* Pig Latin
* PySpark