Diffstat (limited to 'scratch/README.md')
-rw-r--r-- | scratch/README.md | 10 |
1 files changed, 10 insertions, 0 deletions
diff --git a/scratch/README.md b/scratch/README.md
new file mode 100644
index 0000000..4c3fa65
--- /dev/null
+++ b/scratch/README.md
@@ -0,0 +1,10 @@
+# PySpark Test Run
+
+* 2020-04-02
+
+Goal: We want to understand, which URLs of the citation corpus have been
+preserved. Also we want the GWB URL if possible. We'll try pyspark.
+
+Our cluster runs Hadoop 2.6, so we'll try:
+
+    $ PYSPARK_HADOOP_VERSION=2.7 pip install pyspark