CDX Record Pipeline (GrobId Edition)
=====================================

Hadoop-based pipeline that processes the PDFs listed in a specified Internet Archive (IA) CDX dataset.
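
As a rough illustration of the kind of input involved, the sketch below parses one CDX line. It assumes the common 11-column, space-separated CDX layout. The field names, the helper functions, and the sample line are hypothetical and are not taken from `cdx-record-pipeline.py`.

```python
from collections import namedtuple

# Assumed column order for an 11-field CDX line; the names are illustrative.
CdxRecord = namedtuple("CdxRecord", [
    "urlkey", "timestamp", "url", "mimetype", "status",
    "digest", "redirect", "metatags", "length", "offset", "filename",
])

# Made-up sample line, used only to exercise the parser.
SAMPLE_LINE = (
    "org,example)/paper.pdf 20170101000000 http://example.org/paper.pdf "
    "application/pdf 200 ABCDEFGHIJKLMNOPQRSTUVWXYZ234567 - - "
    "12345 67890 EXAMPLE-CRAWL/EXAMPLE-00000.warc.gz"
)


def parse_cdx_line(line):
    """Return a CdxRecord, or None if the line does not have 11 fields."""
    fields = line.strip().split(" ")
    if len(fields) != len(CdxRecord._fields):
        return None
    return CdxRecord(*fields)


def is_pdf_capture(record):
    """True for a successful capture whose recorded MIME type is PDF."""
    return record.mimetype == "application/pdf" and record.status == "200"
```

A caller could skip lines for which `parse_cdx_line` returns `None` and hand only records that pass `is_pdf_capture` to the PDF-processing step.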

## Local mode example ##

```
cat -n /home/bnewbold/100k_random_gwb_pdf.cdx | ./cdx-record-pipeline.py
```
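
The `cat -n` in the command above prefixes every line with a line number and a tab. A plausible reason, stated here as an assumption, is that this mimics cluster mode, where Hadoop streaming with `NLineInputFormat` passes each record to the mapper as `key<TAB>line`. The sketch below shows one way a mapper's read loop could tolerate both prefixed and unprefixed input. The function names are hypothetical and this is not the actual code of `cdx-record-pipeline.py`.

```python
def strip_input_key(line):
    """Drop a leading numeric 'key<TAB>' prefix, if present."""
    line = line.rstrip("\n")
    key, sep, value = line.partition("\t")
    # `cat -n` pads the number with spaces, so strip before checking.
    if sep and key.strip().isdigit():
        return value
    return line


def map_lines(lines):
    """Yield the CDX record text for each non-empty input line."""
    for raw in lines:
        record = strip_input_key(raw)
        if not record:
            continue
        # A real mapper would fetch and process the PDF here;
        # this sketch only passes the record text through.
        yield record
```

In a streaming mapper this could be driven with `for record in map_lines(sys.stdin): print(record)`.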

## Cluster mode example ##

```
# Input CDX file and output directory
input=100k_random_gwb_pdf.cdx
output=100k_random_gwb_pdf.out
# Number of CDX lines handed to each map task
lines_per_map=1000

# Map-only streaming job: no reducers, one mapper per ${lines_per_map} input lines
hadoop jar /home/webcrawl/hadoop-2/hadoop-mapreduce/hadoop-streaming.jar \
	-archives "hdfs://ia802400.us.archive.org:6000/lib/cdx-record-pipeline-venv.zip#cdx-record-pipeline-venv" \
	-D mapred.reduce.tasks=0 \
	-D mapred.job.name=Cdx-Record-Pipeline \
	-D mapreduce.job.queuename=extraction \
	-D mapred.line.input.format.linespermap=${lines_per_map} \
	-inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
	-input ${input} \
	-output ${output} \
	-mapper cdx-record-pipeline.py \
	-file cdx-record-pipeline.py
```
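
`NLineInputFormat` gives each map task at most `lines_per_map` input lines, so the number of map tasks follows from the size of the input. The small sketch below does that arithmetic. The helper name is hypothetical, and the 100,000-line figure is only an assumption based on the `100k` in the example filename.

```python
def expected_map_tasks(total_lines, lines_per_map):
    """Number of map tasks when each task receives at most lines_per_map lines."""
    if total_lines < 0 or lines_per_map <= 0:
        raise ValueError("total_lines must be >= 0 and lines_per_map > 0")
    # Ceiling division using integers only.
    return (total_lines + lines_per_map - 1) // lines_per_map


# Assuming the example input really holds 100,000 lines:
tasks = expected_map_tasks(100_000, 1000)
```

Under that assumption the example job would run about 100 map tasks. Because `mapred.reduce.tasks=0` makes this a map-only job, each of those tasks writes its own output file under the output directory.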