blob: 797b8eb9e2ab5dae7831db9d512cfbea3dda6f2a (
plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
|
CDX Record Pipeline (GrobId Edition)
=====================================
Hadoop based pipeline to process PDFs from a specified IA CDX dataset
## Local mode example ##
```
cat -n /home/bnewbold/100k_random_gwb_pdf.cdx | ./cdx-record-pipeline.py
```
## Cluster mode example ##
```
input=100k_random_gwb_pdf.cdx
output=100k_random_gwb_pdf.out
lines_per_map=1000
hadoop jar /home/webcrawl/hadoop-2/hadoop-mapreduce/hadoop-streaming.jar
-archives "hdfs://ia802400.us.archive.org:6000/lib/cdx-record-pipeline-venv.zip#cdx-record-pipeline-venv"
-D mapred.reduce.tasks=0
-D mapred.job.name=Cdx-Record-Pipeline
-D mapreduce.job.queuename=extraction
-D mapred.line.input.format.linespermap=${lines_per_map}
-inputformat org.apache.hadoop.mapred.lib.NLineInputFormat
-input ${input}
-output ${output}
-mapper cdx-record-pipeline.py
-file cdx-record-pipeline.py
```
|