## QA matchcrossref
    [D8C7F2CA7620450991838D540489948D/8B17786779BE44579C98D8A325AC5959] sandcrawler.ScoreJob/(1/1) ...-24-2102.32-matchcrossref

    Submitted: Fri Aug 24 21:03:09 UTC 2018
    Started:   Fri Aug 24 21:03:20 UTC 2018
    Finished:  Sat Aug 25 09:46:55 UTC 2018
    Elapsed:   12hrs, 43mins, 34sec

    Diagnostics:
        Average Map Time        24mins, 31sec
        Average Shuffle Time    15sec
        Average Merge Time      21sec
        Average Reduce Time     7mins, 17sec

    Map     2312    2312
    Reduce  100     100

    Name                        Map         Reduce      Total
    crossref-rows-filtered      73901964    0           73901964
    grobid-rows-filtered        1092992     0           1092992
    joined-rows                 0           623837      623837

    cascading.flow.StepCounters:
        Tuples_Read             94831255    0           94831255
        Tuples_Written          0           623837      623837

    cascading.flow.SliceCounters:
        Read_Duration           7108430     352241      7460671
        Tuples_Read             94831255    74994956    169826211
        Tuples_Written          74994956    623837      75618793
        Write_Duration          7650302     21468       7671770
## QA UnGrobided
    Submitted: Sat Aug 25 01:23:22 UTC 2018
    Started:   Sat Aug 25 05:06:36 UTC 2018
    Finished:  Sat Aug 25 05:13:45 UTC 2018
    Elapsed:   7mins, 8sec

    Diagnostics:
        Average Map Time        1mins, 20sec
        Average Shuffle Time    12sec
        Average Merge Time      15sec
        Average Reduce Time     29sec

    Map     48  48
    Reduce  1   1

    bnewbold@bnewbold-dev$ gohdfs du -sh sandcrawler/output-qa/2018-08-25-0122.54-dumpungrobided/part*
    56.8M   /user/bnewbold/sandcrawler/output-qa/2018-08-25-0122.54-dumpungrobided/part-00000
## Prod UnGrobided
    [D76F6BF91D894E879E747C868B0DEDE7/394A1AFC44694992B71E6920AF8BA3FB] sandcrawler.DumpUnGrobidedJob/(1/1) ...26-0910.25-dumpungrobided

    Map     278     278
    Reduce  1       1

    Submitted: Sun Aug 26 09:10:51 UTC 2018
    Started:   Sun Aug 26 09:18:21 UTC 2018
    Finished:  Sun Aug 26 10:29:28 UTC 2018
    Elapsed:   1hrs, 11mins, 7sec

    Diagnostics:
        Average Map Time        4mins, 48sec
        Average Shuffle Time    24mins, 17sec
        Average Merge Time      14sec
        Average Reduce Time     13mins, 54sec

    cascading.flow.StepCounters:
        Name            Map         Reduce      Total
        Tuples_Read     64510564    0           64510564
        Tuples_Written  0           21618164    21618164
## Prod Crossref Match
    [6C063C0809244446BA8602C3BE99CEC2/5FE5D87899154F38991A1ED58BEB34D4] sandcrawler.ScoreJob/(1/1) ...-25-1753.01-matchcrossref

    Map     2427    2427
    Reduce  50      50

    Submitted: Sat Aug 25 17:53:50 UTC 2018
    Started:   Sat Aug 25 17:53:59 UTC 2018
    Finished:  Sun Aug 26 11:22:52 UTC 2018
    Elapsed:   17hrs, 28mins, 52sec

    Diagnostics:
        Average Map Time        31mins, 20sec
        Average Shuffle Time    1mins, 21sec
        Average Merge Time      41sec
        Average Reduce Time     3hrs, 14mins, 39sec

    Name                        Map         Reduce      Total
    crossref-rows-filtered      73901964    0           73901964
    grobid-rows-filtered        14222226    0           14222226
    joined-rows                 0           14115453    14115453
## "Prod" Fatcat Group Works (run 2019-08-10)
    ./please --prod groupworks-fatcat hdfs:///user/bnewbold/release_export.2019-07-07.json

    job_1559844455575_118299
    http://ia802401.us.archive.org:6988/proxy/application_1559844455575_118299
## Re-GROBID batch (2019-11-12)
Want to re-process "old" GROBID output with newer (0.5.5+fatcat) GROBID version
(vanilla training) plus biblio-glutton identification. Hoping to make a couple
million new fatcat matches; will probably do a later round of ML matching over
this batch as well.
    # in /grande/regrobid

    # as postgres
    psql sandcrawler < dump_regrobid_pdf.sql > dump_regrobid_pdf.txt

    # as bnewbold
    # uniq -w 40 dedupes on the first 40 characters (the SHA1 hex digest)
    cat dump_regrobid_pdf.txt | sort -S 4G | uniq -w 40 | cut -f2 | pv -l > dump_regrobid_pdf.2019-11-12.json

    # 41.5M lines, uniq by SHA1
    # NOTE: not the full 56m+ from the GROBID table... some are only in
    # archive.org, others are not application/pdf type. Will need to follow up
    # on those later.

    # intend to have 3 worker machines, but splitting 6 ways in case we need to
    # re-balance load or get extra machines or something
    split -n l/6 -a1 -d --additional-suffix=.json dump_regrobid_pdf.2019-11-12.json regrobid_cdx.split_
    # distribute to tmp001, tmp002, tmp003:
    tmp001: 0,1
    tmp002: 2,3
    tmp003: 4,5
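The copy step itself wasn't recorded; a minimal sketch of what it presumably looked like, assuming the splits land in the /srv/sandcrawler/tasks/ directory referenced below:

    # hypothetical sketch only: push each split to its worker per the mapping above
    scp regrobid_cdx.split_0.json regrobid_cdx.split_1.json tmp001:/srv/sandcrawler/tasks/
    scp regrobid_cdx.split_2.json regrobid_cdx.split_3.json tmp002:/srv/sandcrawler/tasks/
    scp regrobid_cdx.split_4.json regrobid_cdx.split_5.json tmp003:/srv/sandcrawler/tasks/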
    # test local grobid config:
    head /srv/sandcrawler/tasks/regrobid_cdx.split_0.json | pv -l | ./grobid_tool.py --grobid-host http://localhost:8070 -j0 extract-json - > example_out.json

    # expect at least a couple fatcat matches
    cat example_out.json | jq .tei_xml -r | rg fatcat

    # test GROBID+kafka config:
    cat /srv/sandcrawler/tasks/regrobid_cdx.split_*.json | pv -l | head | parallel -j40 --linebuffer --round-robin --pipe ./grobid_tool.py --kafka-env prod --kafka-hosts wbgrp-svc263.us.archive.org:9092,wbgrp-svc284.us.archive.org:9092,wbgrp-svc285.us.archive.org:9092 --kafka-mode --grobid-host http://localhost:8070 -j0 extract-json -

    # full run, in a screen session
    cat /srv/sandcrawler/tasks/regrobid_cdx.split_*.json | pv -l | parallel -j40 --linebuffer --round-robin --pipe ./grobid_tool.py --kafka-env prod --kafka-hosts wbgrp-svc263.us.archive.org:9092,wbgrp-svc284.us.archive.org:9092,wbgrp-svc285.us.archive.org:9092 --kafka-mode --grobid-host http://localhost:8070 -j0 extract-json -
NOTE: really should get a parallel kafka worker going soon. If there is a reboot
or something in the middle of this process, it will need to be re-run from the start.

Was getting a bunch of weird kafka INVALID_MSG errors on produce. Would be nice to be able to retry, so doing:
    cat /srv/sandcrawler/tasks/regrobid_cdx.split_*.json | pv -l | parallel --joblog regrobid_job.log --retries 5 -j40 --linebuffer --pipe ./grobid_tool.py --kafka-env prod --kafka-hosts wbgrp-svc263.us.archive.org:9092,wbgrp-svc284.us.archive.org:9092,wbgrp-svc285.us.archive.org:9092 --kafka-mode --grobid-host http://localhost:8070 -j0 extract-json -
Never mind, going to split into chunks which can be retried.
    cd /srv/sandcrawler/tasks
    sudo chown sandcrawler:staff .
    cat regrobid_cdx.split_* | split -l 20000 -a4 -d --additional-suffix=.json - chunk_
    ls /srv/sandcrawler/tasks/chunk_*.json | parallel -j4 ./extract_chunk.sh {}
extract_chunk.sh (the outer parallel -j4 above times the inner -j10 keeps total concurrency at the same 40 workers):
    #!/bin/bash

    # Process one chunk file through GROBID; drop a .SUCCESS marker on
    # completion so that re-runs skip already-finished chunks.
    set -x -e -u -o pipefail

    if [ -f "$1.SUCCESS" ]; then
        echo "Skipping: $1..."
        exit
    fi

    echo "Extracting $1..."
    date
    cat "$1" | parallel -j10 --linebuffer --round-robin --pipe ./grobid_tool.py --kafka-env prod --kafka-hosts wbgrp-svc263.us.archive.org:9092,wbgrp-svc284.us.archive.org:9092,wbgrp-svc285.us.archive.org:9092 --kafka-mode --grobid-host http://localhost:8070 -j0 extract-json -
    touch "$1.SUCCESS"
Seems to be working better! Tested: if there is a problem with one chunk, the others continue.
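A quick way to check progress (a sketch, relying only on the .SUCCESS marker convention above):

    # count finished chunks, then list any that still lack a .SUCCESS marker
    ls /srv/sandcrawler/tasks/chunk_*.json.SUCCESS | wc -l
    for f in /srv/sandcrawler/tasks/chunk_*.json; do
        [ -f "$f.SUCCESS" ] || echo "$f"
    done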
## Pig Joins (around 2019-12-24)
Partial (as a start):
    pig -param INPUT_CDX="/user/bnewbold/pdfs/gwb-pdf-20191005172329" -param INPUT_DIGEST="/user/bnewbold/scihash/shadow.20191222.sha1b32.sorted" -param OUTPUT="/user/bnewbold/scihash/gwb-pdf-20191005172329.shadow.20191222.join.cdx" join-cdx-sha1.pig
    HadoopVersion    PigVersion       UserId    StartedAt            FinishedAt           Features
    2.6.0-cdh5.11.2  0.12.0-cdh5.0.1  bnewbold  2019-12-27 00:39:38  2019-12-27 15:32:44  HASH_JOIN,ORDER_BY,DISTINCT,FILTER

    Success!

    Job Stats (time in seconds):
    JobId                    Maps  Reduces  MaxMapTime  MinMapTime  AvgMapTime  MedianMapTime  MaxReduceTime  MinReduceTime  AvgReduceTime  MedianReduceTime  Alias             Feature    Outputs
    job_1574819148370_46540  4880  0        143         10          27          21             n/a            n/a            n/a            n/a               cdx               MAP_ONLY
    job_1574819148370_46541  19    0        59          9           25          18             n/a            n/a            n/a            n/a               digests           MAP_ONLY
    job_1574819148370_46773  24    1        17          7           10          9              6              6              6              6                 digests           SAMPLER
    job_1574819148370_46774  7306  1        55          4           7           7              25             25             25             25                cdx               SAMPLER
    job_1574819148370_46778  7306  40       127         8           18          15             4970           1936           2768           2377              cdx               ORDER_BY
    job_1574819148370_46779  24    20       80          24          60          66             90             26             38             37                digests           ORDER_BY
    job_1574819148370_46822  22    3        101         27          53          48             1501           166            735            539                                 DISTINCT
    job_1574819148370_46828  7146  959      122         7           16          14             91             21             35             32                full_join,result  HASH_JOIN  /user/bnewbold/scihash/gwb-pdf-20191005172329.shadow.20191222.join.cdx,

    Input(s):
    Successfully read 1968654006 records (654323590996 bytes) from: "/user/bnewbold/pdfs/gwb-pdf-20191005172329"
    Successfully read 74254196 records (2451575849 bytes) from: "/user/bnewbold/scihash/shadow.20191222.sha1b32.sorted"

    Output(s):
    Successfully stored 0 records in: "/user/bnewbold/scihash/gwb-pdf-20191005172329.shadow.20191222.join.cdx"
Oops! Didn't upper-case the sha1b32 output.
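The fix is to upper-case the shadow digest list before joining, since SHA1 digests in CDX records are upper-case base32. A minimal sketch of that step (the un-sorted local filename is an assumption inferred from the HDFS path above):

    # hypothetical fix sketch: upper-case the base32 digests, re-sort to be safe
    tr '[:lower:]' '[:upper:]' < shadow.20191222.sha1b32 | sort -u > shadow.20191222.sha1b32.sorted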
Full GWB:
    pig -param INPUT_CDX="/user/bnewbold/pdfs/gwb-pdf-20191005172329" -param INPUT_DIGEST="/user/bnewbold/scihash/shadow.20191222.sha1b32.sorted" -param OUTPUT="/user/bnewbold/scihash/gwb-pdf-20191005172329.shadow.20191222.join.cdx" join-cdx-sha1.pig