author: Bryan Newbold <bnewbold@archive.org>, 2018-03-29 20:16:05 -0700
committer: Bryan Newbold <bnewbold@archive.org>, 2018-03-29 20:16:33 -0700
commit: 7c81b7bea3d670876faff1eb290c40656697dddb
tree: 4d3413d98089d56fa50de75f0f9c7ea310f02ce4 /cdx-record-pipeline/README.md
parent: d2203182c9ed6e1ff13fa70fb25f049ef87c75a0
move to top level
Diffstat (limited to 'cdx-record-pipeline/README.md')
-rw-r--r-- cdx-record-pipeline/README.md | 33
1 files changed, 33 insertions, 0 deletions
diff --git a/cdx-record-pipeline/README.md b/cdx-record-pipeline/README.md
new file mode 100644
index 0000000..797b8eb
--- /dev/null
+++ b/cdx-record-pipeline/README.md
@@ -0,0 +1,33 @@
+CDX Record Pipeline (GROBID Edition)
+=====================================
+
+Hadoop-based pipeline to process PDFs from a specified IA CDX dataset.
+
+## Local mode example ##
+
+```
+cat -n /home/bnewbold/100k_random_gwb_pdf.cdx | ./cdx-record-pipeline.py
+
+```
+
+## Cluster mode example ##
+
+```
+input=100k_random_gwb_pdf.cdx
+output=100k_random_gwb_pdf.out
+lines_per_map=1000
+
+hadoop jar /home/webcrawl/hadoop-2/hadoop-mapreduce/hadoop-streaming.jar \
+    -archives "hdfs://ia802400.us.archive.org:6000/lib/cdx-record-pipeline-venv.zip#cdx-record-pipeline-venv" \
+    -D mapred.reduce.tasks=0 \
+    -D mapred.job.name=Cdx-Record-Pipeline \
+    -D mapreduce.job.queuename=extraction \
+    -D mapred.line.input.format.linespermap=${lines_per_map} \
+    -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
+    -input ${input} \
+    -output ${output} \
+    -mapper cdx-record-pipeline.py \
+    -file cdx-record-pipeline.py
+
+```
+
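
Note on the input shape: the local mode example pipes the CDX file through `cat -n`, which prefixes every line with a line number and a tab. One plausible reason (an assumption, not stated in the README) is that Hadoop streaming with `NLineInputFormat` hands each mapper a `key<TAB>line` pair, so the local command imitates that shape. The Python sketch below shows how a mapper could strip that key and parse a record. It is not the actual `cdx-record-pipeline.py`: the function names, the 11-column CDX layout, and the tab-separated output are all hypothetical.

```python
import sys

# Assumed 11-column CDX layout; the real dataset may use a different one.
CDX_FIELDS = (
    "surt", "datetime", "url", "mimetype", "http_status",
    "sha1", "redirect", "meta", "c_size", "offset", "warc",
)


def strip_streaming_key(raw_line):
    """Drop the leading 'key<TAB>' and return the rest, or None if no tab."""
    _key, sep, value = raw_line.rstrip("\n").partition("\t")
    return value if sep else None


def parse_cdx_line(cdx_line):
    """Split a CDX line on whitespace; return a dict, or None if malformed."""
    fields = cdx_line.split()
    if len(fields) != len(CDX_FIELDS):
        return None
    return dict(zip(CDX_FIELDS, fields))


def run_mapper(lines):
    """Yield one tab-separated output line per well-formed PDF record."""
    for raw_line in lines:
        value = strip_streaming_key(raw_line)
        record = parse_cdx_line(value) if value else None
        if record is None or record["mimetype"] != "application/pdf":
            continue  # skip malformed lines and non-PDF captures
        yield "\t".join(
            record[name] for name in ("url", "datetime", "warc", "offset", "c_size")
        )
```

To use this as a streaming mapper, a script would print each result of `run_mapper(sys.stdin)` to standard output. Skipping malformed lines silently keeps the sketch short; a real job would probably count or log them.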