Diffstat (limited to 'scratch/README.md')
-rw-r--r-- scratch/README.md 10
1 files changed, 10 insertions, 0 deletions
diff --git a/scratch/README.md b/scratch/README.md
new file mode 100644
index 0000000..4c3fa65
--- /dev/null
+++ b/scratch/README.md
@@ -0,0 +1,10 @@
+# PySpark Test Run
+
+* 2020-04-02
+
+Goal: We want to understand which URLs of the citation corpus have been
+preserved. We also want the GWB URL for each, if possible. We'll try PySpark.
+
+Our cluster runs Hadoop 2.6, so we'll try:
+
+    $ PYSPARK_HADOOP_VERSION=2.7 pip install pyspark