## QA matchcrossref
[D8C7F2CA7620450991838D540489948D/8B17786779BE44579C98D8A325AC5959] sandcrawler.ScoreJob/(1/1) ...-24-2102.32-matchcrossref
Submitted: Fri Aug 24 21:03:09 UTC 2018
Started: Fri Aug 24 21:03:20 UTC 2018
Finished: Sat Aug 25 09:46:55 UTC 2018
Elapsed: 12hrs, 43mins, 34sec
Diagnostics:
Average Map Time 24mins, 31sec
Average Shuffle Time 15sec
Average Merge Time 21sec
Average Reduce Time 7mins, 17sec
Map     2312    2312
Reduce  100     100

Name                     Map        Reduce     Total
crossref-rows-filtered   73901964   0          73901964
grobid-rows-filtered     1092992    0          1092992
joined-rows              0          623837     623837

cascading.flow.StepCounters
Tuples_Read              94831255   0          94831255
Tuples_Written           0          623837     623837

Read_Duration            7108430    352241     7460671
Tuples_Read              94831255   74994956   169826211
Tuples_Written           74994956   623837     75618793
Write_Duration           7650302    21468      7671770
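
Back-of-envelope, not from the job output itself: joined rows per filtered
GROBID row come out around 57% on this QA sample. A single GROBID row can join
to multiple crossref candidates, so this is only a loose match-rate proxy.

python3 -c 'print(623837 / 1092992)'    # ~0.571 (own arithmetic)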
## QA UnGrobided
Submitted: Sat Aug 25 01:23:22 UTC 2018
Started: Sat Aug 25 05:06:36 UTC 2018
Finished: Sat Aug 25 05:13:45 UTC 2018
Elapsed: 7mins, 8sec
Diagnostics:
Average Map Time 1mins, 20sec
Average Shuffle Time 12sec
Average Merge Time 15sec
Average Reduce Time 29sec
Map 48 48
Reduce 1 1
bnewbold@bnewbold-dev$ gohdfs du -sh sandcrawler/output-qa/2018-08-25-0122.54-dumpungrobided/part*
56.8M /user/bnewbold/sandcrawler/output-qa/2018-08-25-0122.54-dumpungrobided/part-00000
## Prod UnGrobided
[D76F6BF91D894E879E747C868B0DEDE7/394A1AFC44694992B71E6920AF8BA3FB] sandcrawler.DumpUnGrobidedJob/(1/1) ...26-0910.25-dumpungrobided
Map 278 278
Reduce 1 1
Submitted: Sun Aug 26 09:10:51 UTC 2018
Started: Sun Aug 26 09:18:21 UTC 2018
Finished: Sun Aug 26 10:29:28 UTC 2018
Elapsed: 1hrs, 11mins, 7sec
Diagnostics:
Average Map Time 4mins, 48sec
Average Shuffle Time 24mins, 17sec
Average Merge Time 14sec
Average Reduce Time 13mins, 54sec
cascading.flow.StepCounters
Name            Map        Reduce     Total
Tuples_Read     64510564   0          64510564
Tuples_Written  0          21618164   21618164
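
By this count roughly a third of the rows get dumped as un-GROBIDed
(back-of-envelope, not from the job output):

python3 -c 'print(21618164 / 64510564)'    # ~0.335 (own arithmetic)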
## Prod Crossref Match
[6C063C0809244446BA8602C3BE99CEC2/5FE5D87899154F38991A1ED58BEB34D4] sandcrawler.ScoreJob/(1/1) ...-25-1753.01-matchcrossref
Map 2427 2427
Reduce 50 50
Submitted: Sat Aug 25 17:53:50 UTC 2018
Started: Sat Aug 25 17:53:59 UTC 2018
Finished: Sun Aug 26 11:22:52 UTC 2018
Elapsed: 17hrs, 28mins, 52sec
Diagnostics:
Average Map Time 31mins, 20sec
Average Shuffle Time 1mins, 21sec
Average Merge Time 41sec
Average Reduce Time 3hrs, 14mins, 39sec
Name                     Map        Reduce     Total
crossref-rows-filtered   73901964   0          73901964
grobid-rows-filtered     14222226   0          14222226
joined-rows              0          14115453   14115453
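
For comparison with the QA run above (again back-of-envelope, not from the job
output): joined rows here come to over 99% of filtered GROBID rows, versus
~57% on the QA sample; rows can join multiple candidates, so read it loosely.

python3 -c 'print(14115453 / 14222226)'    # ~0.992 (own arithmetic)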
## "Prod" Fatcat Group Works (run 2019-08-10)
./please --prod groupworks-fatcat hdfs:///user/bnewbold/release_export.2019-07-07.json
job_1559844455575_118299
http://ia802401.us.archive.org:6988/proxy/application_1559844455575_118299
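
To poll a run like this from a shell instead of the proxy UI, the stock YARN
CLI should work (nothing sandcrawler-specific about this):

# standard YARN CLI status check, using the application id from above
yarn application -status application_1559844455575_118299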
## Re-GROBID batch (2019-11-12)
Want to re-process "old" GROBID output with the newer (0.5.5+fatcat) GROBID
version (vanilla training) plus biblio-glutton identification. Hoping to make
a couple million new fatcat matches; will probably do a later round of ML
matching over this batch as well.
# in /grande/regrobid
# as postgres
psql sandcrawler < dump_regrobid_pdf.sql > dump_regrobid_pdf.txt
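
For the record: the downstream sort/uniq/cut implies the dump is TSV with a
40-char sha1hex in the first column and row JSON in the second. A hypothetical
equivalent one-liner (the "grobid" table and "sha1hex" column names are
guesses, not taken from dump_regrobid_pdf.sql):

# hypothetical shape of the dump query; schema names are assumptions
psql sandcrawler -c "COPY (SELECT sha1hex, row_to_json(grobid) FROM grobid) TO STDOUT"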
# as bnewbold
cat dump_regrobid_pdf.txt | sort -S 4G | uniq -w 40 | cut -f2 | pv -l > dump_regrobid_pdf.2019-11-12.json
# 41.5M lines, uniq by SHA1
# NOTE: not the full 56M+ rows from the GROBID table... some are in
# archive.org, others are not application/pdf type. Will need to follow
# up on those later
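
A quick sanity check on the dedupe, assuming the .txt really is sha1-first TSV
(my own addition, not part of the original workflow):

# these two counts should match: unique 40-char sha1 prefixes in,
# JSON lines out
cut -c1-40 dump_regrobid_pdf.txt | sort -u -S 4G | wc -l
wc -l dump_regrobid_pdf.2019-11-12.json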
# intend to have 3 worker machines, but splitting 6 ways in case we need to
# re-balance load or get extra machines or something
split -n l/6 -a1 -d --additional-suffix=.json dump_regrobid_pdf.2019-11-12.json regrobid_cdx.split_
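
Worth eyeballing the split output before shipping it around (my own check):

# expect regrobid_cdx.split_0.json .. split_5.json, with line counts
# summing to the ~41.5M total
wc -l regrobid_cdx.split_*.json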
# distribute to tmp001, tmp002, tmp003 (see scp sketch after this list):
tmp001: 0,1
tmp002: 2,3
tmp003: 4,5
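
A minimal distribution sketch, assuming the /srv/sandcrawler/tasks/ target
directory that the test commands below read from:

scp regrobid_cdx.split_{0,1}.json tmp001:/srv/sandcrawler/tasks/
scp regrobid_cdx.split_{2,3}.json tmp002:/srv/sandcrawler/tasks/
scp regrobid_cdx.split_{4,5}.json tmp003:/srv/sandcrawler/tasks/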
# test local grobid config:
head /srv/sandcrawler/tasks/regrobid_cdx.split_0.json | pv -l | ./grobid_tool.py --grobid-host http://localhost:8070 -j0 extract-json - > example_out.json
# expect at least a couple fatcat matches
cat example_out.json | jq .tei_xml -r | rg fatcat
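
If that check comes up empty, first confirm the local GROBID service is alive
at all; GROBID ships a standard liveness endpoint:

# should print "true" if the service is up
curl -s http://localhost:8070/api/isalive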
# test GROBID+kafka config:
cat /srv/sandcrawler/tasks/regrobid_cdx.split_*.json | pv -l | head | parallel -j40 --linebuffer --round-robin --pipe ./grobid_tool.py --kafka-env prod --kafka-hosts wbgrp-svc263.us.archive.org:9092,wbgrp-svc284.us.archive.org:9092,wbgrp-svc285.us.archive.org:9092 --kafka-mode --grobid-host http://localhost:8070 -j0 extract-json -
# full run, in a screen session
cat /srv/sandcrawler/tasks/regrobid_cdx.split_*.json | pv -l | parallel -j40 --linebuffer --round-robin --pipe ./grobid_tool.py --kafka-env prod --kafka-hosts wbgrp-svc263.us.archive.org:9092,wbgrp-svc284.us.archive.org:9092,wbgrp-svc285.us.archive.org:9092 --kafka-mode --grobid-host http://localhost:8070 -j0 extract-json -
NOTE: really should get a parallel kafka worker going soon. If there is a
reboot or anything else interrupts this process mid-run, it will need to be
re-run from the start; a possible stopgap sketch is below.
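
Until that worker exists, one crude partial-resume hack (entirely
hypothetical: assumes done_sha1.txt can be dumped from the output topic, one
hash per line, and that each input JSON line contains its sha1 literally):

# drop input lines whose sha1 already shows up in done_sha1.txt;
# grep -F with tens of millions of patterns is slow and memory-hungry,
# so this is a sketch, not a recommendation
grep -hvFf done_sha1.txt /srv/sandcrawler/tasks/regrobid_cdx.split_*.json > remaining.json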