## QA matchcrossref

[D8C7F2CA7620450991838D540489948D/8B17786779BE44579C98D8A325AC5959] sandcrawler.ScoreJob/(1/1) ...-24-2102.32-matchcrossref

Submitted:  Fri Aug 24 21:03:09 UTC 2018
Started:    Fri Aug 24 21:03:20 UTC 2018
Finished:   Sat Aug 25 09:46:55 UTC 2018
Elapsed:    12hrs, 43mins, 34sec
Diagnostics:    
Average Map Time    24mins, 31sec
Average Shuffle Time    15sec
Average Merge Time  21sec
Average Reduce Time 7mins, 17sec

Task Type   Total   Complete
Map 2312    2312
Reduce  100 100

Name    Map Reduce  Total
crossref-rows-filtered  73901964    0   73901964
grobid-rows-filtered    1092992 0   1092992
joined-rows 0   623837  623837

cascading.flow.StepCounters
Tuples_Read 94831255    0   94831255
Tuples_Written  0   623837  623837

Read_Duration   7108430 352241  7460671
Tuples_Read 94831255    74994956    169826211
Tuples_Written  74994956    623837  75618793
Write_Duration  7650302 21468   7671770

## QA UnGrobided

Submitted:  Sat Aug 25 01:23:22 UTC 2018
Started:    Sat Aug 25 05:06:36 UTC 2018
Finished:   Sat Aug 25 05:13:45 UTC 2018
Elapsed:    7mins, 8sec
Diagnostics:    
Average Map Time    1mins, 20sec
Average Shuffle Time    12sec
Average Merge Time  15sec
Average Reduce Time 29sec

Task Type   Total   Complete
Map 48  48
Reduce  1   1

bnewbold@bnewbold-dev$ gohdfs du -sh sandcrawler/output-qa/2018-08-25-0122.54-dumpungrobided/part*
56.8M   /user/bnewbold/sandcrawler/output-qa/2018-08-25-0122.54-dumpungrobided/part-00000

## Prod UnGrobided

[D76F6BF91D894E879E747C868B0DEDE7/394A1AFC44694992B71E6920AF8BA3FB] sandcrawler.DumpUnGrobidedJob/(1/1) ...26-0910.25-dumpungrobided

Task Type   Total   Complete
Map 278 278
Reduce  1   1

Submitted:  Sun Aug 26 09:10:51 UTC 2018
Started:    Sun Aug 26 09:18:21 UTC 2018
Finished:   Sun Aug 26 10:29:28 UTC 2018
Elapsed:    1hrs, 11mins, 7sec
Diagnostics:    
Average Map Time    4mins, 48sec
Average Shuffle Time    24mins, 17sec
Average Merge Time  14sec
Average Reduce Time 13mins, 54sec


cascading.flow.StepCounters
Name    Map Reduce  Total
Tuples_Read 64510564    0   64510564
Tuples_Written  0   21618164    21618164

## Prod Crossref Match

[6C063C0809244446BA8602C3BE99CEC2/5FE5D87899154F38991A1ED58BEB34D4] sandcrawler.ScoreJob/(1/1) ...-25-1753.01-matchcrossref

Task Type   Total   Complete
Map 2427    2427
Reduce  50  50

Submitted:  Sat Aug 25 17:53:50 UTC 2018
Started:    Sat Aug 25 17:53:59 UTC 2018
Finished:   Sun Aug 26 11:22:52 UTC 2018
Elapsed:    17hrs, 28mins, 52sec
Diagnostics:    
Average Map Time    31mins, 20sec
Average Shuffle Time    1mins, 21sec
Average Merge Time  41sec
Average Reduce Time 3hrs, 14mins, 39sec

Name    Map Reduce  Total
crossref-rows-filtered  73901964    0   73901964
grobid-rows-filtered    14222226    0   14222226
joined-rows 0   14115453    14115453

## "Prod" Fatcat Group Works (run 2019-08-10)

    ./please --prod groupworks-fatcat hdfs:///user/bnewbold/release_export.2019-07-07.json

    job_1559844455575_118299
    http://ia802401.us.archive.org:6988/proxy/application_1559844455575_118299
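
For checking on this job from the command line, the YARN CLI can be used against the application ID above (a sketch; assumes shell access to a node with the yarn client configured):

    yarn application -status application_1559844455575_118299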

## Re-GROBID batch (2019-11-12)

Want to re-process "old" GROBID output with a newer (0.5.5+fatcat) GROBID
version (vanilla training) plus biblio-glutton identification. Hoping to make a
couple million new fatcat matches; will probably do a later round of ML matching
over this batch as well.

    # in /grande/regrobid

    # as postgres
    psql sandcrawler < dump_regrobid_pdf.sql > dump_regrobid_pdf.txt

    # as bnewbold
    cat dump_regrobid_pdf.txt | sort -S 4G | uniq -w 40 | cut -f2 | pv -l > dump_regrobid_pdf.2019-11-12.json
    # 41.5M lines, uniq by SHA1
    # NOTE: not the full 56m+ from the GROBID table... some are in archive.org, others
    # are not application/pdf type. Will need to follow up on those later
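
    # (sanity-check sketch, not from the original run: the dedup above keys on the
    # first 40 characters, i.e. the SHA1 hex, so the unique-prefix count of the raw
    # dump should equal the line count of the deduped output)
    cut -c1-40 dump_regrobid_pdf.txt | sort -S 4G -u | wc -l
    wc -l dump_regrobid_pdf.2019-11-12.json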

    # intend to have 3 worker machines, but splitting 6 ways in case we need to
    # re-balance load or get extra machines or something
    split -n l/6 -a1 -d --additional-suffix=.json dump_regrobid_pdf.2019-11-12.json regrobid_cdx.split_

    # distribute to tmp001, tmp002, tmp003:
    tmp001: 0,1
    tmp002: 2,3
    tmp003: 4,5
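
    # (sketch of the copy step, which isn't recorded in this log; "tmp001" etc. are
    # the shorthand worker names above, and the target directory is assumed from
    # the paths used below)
    scp regrobid_cdx.split_{0,1}.json tmp001:/srv/sandcrawler/tasks/
    scp regrobid_cdx.split_{2,3}.json tmp002:/srv/sandcrawler/tasks/
    scp regrobid_cdx.split_{4,5}.json tmp003:/srv/sandcrawler/tasks/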

    # test local grobid config:
    head /srv/sandcrawler/tasks/regrobid_cdx.split_0.json | pv -l | ./grobid_tool.py --grobid-host http://localhost:8070 -j0 extract-json - > example_out.json
    # expect at least a couple fatcat matches
    cat example_out.json | jq .tei_xml -r | rg fatcat

    # test GROBID+kafka config:
    cat /srv/sandcrawler/tasks/regrobid_cdx.split_*.json | pv -l | head | parallel -j40 --linebuffer --round-robin --pipe ./grobid_tool.py --kafka-env prod --kafka-hosts wbgrp-svc263.us.archive.org:9092,wbgrp-svc284.us.archive.org:9092,wbgrp-svc285.us.archive.org:9092 --kafka-mode --grobid-host http://localhost:8070 -j0 extract-json -
    
    # full run, in a screen session
    cat /srv/sandcrawler/tasks/regrobid_cdx.split_*.json | pv -l | parallel -j40 --linebuffer --round-robin --pipe ./grobid_tool.py --kafka-env prod --kafka-hosts wbgrp-svc263.us.archive.org:9092,wbgrp-svc284.us.archive.org:9092,wbgrp-svc285.us.archive.org:9092 --kafka-mode --grobid-host http://localhost:8070 -j0 extract-json -
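
To keep an eye on the full run, it can help to spot-check that records are actually landing in the output Kafka topic. A sketch using kafkacat (the topic name here is a placeholder, not taken from this log):

    # consume a few of the most recent messages from the GROBID output topic
    kafkacat -b wbgrp-svc263.us.archive.org:9092 -t REPLACE_WITH_OUTPUT_TOPIC -C -o -3 -e | jq . | head -50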

NOTE: really should get a parallel kafka worker going soon. If there is a reboot
or something in the middle of this process, we will need to re-run from the start.

Was getting a bunch of weird kafka INVALID_MSG errors on produce. Would be nice to be able to retry, so doing:

    cat /srv/sandcrawler/tasks/regrobid_cdx.split_*.json | pv -l | parallel --joblog regrobid_job.log --retries 5 -j40 --linebuffer --pipe ./grobid_tool.py --kafka-env prod --kafka-hosts wbgrp-svc263.us.archive.org:9092,wbgrp-svc284.us.archive.org:9092,wbgrp-svc285.us.archive.org:9092 --kafka-mode --grobid-host http://localhost:8070 -j0 extract-json -

Never mind, going to split into chunks which can be retried.

    cd /srv/sandcrawler/tasks
    sudo chown sandcrawler:staff .
    cat regrobid_cdx.split_* | split -l 20000 -a4 -d --additional-suffix=.json - chunk_
    ls /srv/sandcrawler/tasks/chunk_*.json | parallel -j4 ./extract_chunk.sh {}

extract_chunk.sh:


    #!/bin/bash

    set -x -e -u -o pipefail

    if [ -f "$1.SUCCESS" ]; then
        echo "Skipping: $1..."
        exit
    fi

    echo "Extracting $1..."

    date
    cat "$1" | parallel -j10 --linebuffer --round-robin --pipe ./grobid_tool.py --kafka-env prod --kafka-hosts wbgrp-svc263.us.archive.org:9092,wbgrp-svc284.us.archive.org:9092,wbgrp-svc285.us.archive.org:9092 --kafka-mode --grobid-host http://localhost:8070 -j0 extract-json -

    touch "$1.SUCCESS"

Seems to be working better! Tested that if there is a problem with one chunk, the others continue.
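
For checking on progress under this chunk/SUCCESS-file convention, something like this works (a sketch, not part of the original run):

    # count completed chunks vs. total chunks
    ls /srv/sandcrawler/tasks/chunk_*.json.SUCCESS 2>/dev/null | wc -l
    ls /srv/sandcrawler/tasks/chunk_*.json | wc -l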