# PySpark Test Run

* 2020-04-02

Goal: We want to understand which URLs of the citation corpus have been
preserved. We also want the GWB URL, if possible.

We'll try pyspark. Our cluster runs Hadoop 2.6, so we'll try:

    $ PYSPARK_HADOOP_VERSION=2.7 pip install pyspark
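
To confirm the install worked before writing any job, a quick import check may help. This is a minimal sketch; `installed_pyspark_version` is a hypothetical helper name used for illustration and is not part of PySpark. It only reports the package version and does not verify which Hadoop build was bundled.

```python
import importlib.util
from typing import Optional


def installed_pyspark_version() -> Optional[str]:
    """Return the installed PySpark version, or None if it is not importable.

    Hypothetical helper for these notes; not a PySpark API.
    """
    # find_spec looks the package up without importing it.
    if importlib.util.find_spec("pyspark") is None:
        return None
    import pyspark

    return pyspark.__version__


if __name__ == "__main__":
    version = installed_pyspark_version()
    print(version or "pyspark is not installed")
```

Run it with the same interpreter that `pip` installed into, otherwise the check can report a missing package even though the install succeeded.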