# Generic data processing approach
## A basic setup
* Python for orchestration: data dependencies, structured outputs, and
multistage pipelines with inspectable intermediate results
* Go for custom, fast tools when needed
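
A minimal sketch of the "multistage pipeline with inspectable intermediate results" idea, using only the standard library. The helper `run_stage`, the stage functions, and the sample records are hypothetical and serve only as illustration; a real setup might rely on a task framework instead. Each stage writes its output to a JSON file, so any step can be inspected or skipped on a rerun.

```python
import json
import tempfile
from pathlib import Path


def run_stage(name, func, source, workdir):
    """Run one stage and keep its output as workdir/<name>.json."""
    target = Path(workdir) / f"{name}.json"
    if not target.exists():  # reuse the artifact if it is already there
        records = json.loads(Path(source).read_text()) if source else None
        target.write_text(json.dumps(func(records), indent=2))
    return target


def extract(_):
    # Stand-in for a real data source (made-up records).
    return [{"id": 1, "title": " Foo "}, {"id": 2, "title": "bar"}]


def normalize(records):
    # Trim whitespace and lowercase titles.
    return [{**r, "title": r["title"].strip().lower()} for r in records]


workdir = tempfile.mkdtemp()
raw = run_stage("raw", extract, None, workdir)
clean = run_stage("clean", normalize, raw, workdir)
result = json.loads(clean.read_text())
print(result)
```

Because every stage output is a plain file, a failed or suspicious step can be examined with standard tools, and deleting one file forces only that stage to be recomputed on the next run (downstream artifacts would need to be removed as well in this simple version).
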
## Quick Fusion
For data fusion, e.g. merging Open Library (OL) works and editions to get a joint dataset:
* select and tabularize the data
* write a one-off merge script, e.g. using Pandas
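
The two steps above could look roughly like the following sketch. The column names and all values are made up for illustration and do not reflect an actual dump schema; in practice the two frames would be loaded from the tabularized files, e.g. with `pd.read_csv`.

```python
import pandas as pd

# Hypothetical, already tabularized extracts (made-up keys and values).
works = pd.DataFrame({
    "work_key": ["W1", "W2"],
    "work_title": ["Title A", "Title B"],
})
editions = pd.DataFrame({
    "edition_key": ["E1", "E2", "E3"],
    "work_key": ["W1", "W1", "W3"],
    "isbn": ["111", "222", "333"],
})

# Attach work data to each edition. validate= raises an error if a work key
# is duplicated; indicator= adds a "_merge" column showing where rows matched.
joint = editions.merge(
    works,
    on="work_key",
    how="left",
    validate="many_to_one",
    indicator=True,
)

# Editions that point to a work missing from the works table.
orphans = joint[joint["_merge"] == "left_only"]
print(joint)
print(orphans)
```

A left join keeps every edition, so unmatched rows stay visible instead of silently disappearing, which is usually what one wants to check first in a one-off fusion.
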
## Hadoop Scripting
* Pig Latin
* PySpark