CDX Record Pipeline (GrobId Edition)
=====================================

Hadoop-based pipeline to process PDFs from a specified IA CDX dataset.

## Local mode example ##

```
cat -n /home/bnewbold/100k_random_gwb_pdf.cdx | ./cdx-record-pipeline.py
```

## Cluster mode example ##

```
input=100k_random_gwb_pdf.cdx
output=100k_random_gwb_pdf.out
lines_per_map=1000

hadoop jar /home/webcrawl/hadoop-2/hadoop-mapreduce/hadoop-streaming.jar \
    -archives "hdfs://ia802400.us.archive.org:6000/lib/cdx-record-pipeline-venv.zip#cdx-record-pipeline-venv" \
    -D mapred.reduce.tasks=0 \
    -D mapred.job.name=Cdx-Record-Pipeline \
    -D mapreduce.job.queuename=extraction \
    -D mapred.line.input.format.linespermap=${lines_per_map} \
    -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
    -input ${input} \
    -output ${output} \
    -mapper cdx-record-pipeline.py \
    -file cdx-record-pipeline.py
```
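
Both examples feed the mapper one CDX record per line on stdin. The sketch below illustrates how a streaming mapper might read that input; it is not the contents of `cdx-record-pipeline.py`. It assumes each line arrives as a key, a tab, and then the CDX record (which is what `cat -n` produces locally, and what `NLineInputFormat` is expected to produce under Hadoop Streaming), and it assumes the common 11-column CDX layout. The names `CDX_FIELDS`, `parse_streaming_line`, and `main` are hypothetical.

```python
#!/usr/bin/env python3
"""Hypothetical sketch of a Hadoop Streaming mapper that reads CDX lines."""
import sys

# Assumed 11-column CDX layout ("N b a m s k r M S V g").
# The real dataset may use a different layout; adjust as needed.
CDX_FIELDS = ("surt", "datetime", "url", "mimetype", "http_status",
              "sha1", "redirect", "meta", "c_size", "offset", "warc")


def parse_streaming_line(raw):
    """Split one 'key<TAB>CDX record' line; return a dict, or None if malformed."""
    # Assumption: the input is prefixed with a key (line number or byte
    # offset) and a tab, as `cat -n` does in the local mode example.
    _key, sep, cdx_line = raw.rstrip("\n").partition("\t")
    if not sep:
        return None
    parts = cdx_line.split()
    if len(parts) != len(CDX_FIELDS):
        return None
    return dict(zip(CDX_FIELDS, parts))


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Emit 'sha1<TAB>url' for every well-formed PDF record on stdin."""
    for raw in stdin:
        record = parse_streaming_line(raw)
        if record is None or record["mimetype"] != "application/pdf":
            continue  # skip malformed lines and non-PDF captures
        stdout.write("{}\t{}\n".format(record["sha1"], record["url"]))
```

A script used as the `-mapper` would end with a call to `main()`. Because the job sets `mapred.reduce.tasks=0`, whatever the mapper writes to stdout would become the job output directly.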