aboutsummaryrefslogtreecommitdiffstats
path: root/notes/ingest/2020-11_doaj.md
blob: 473dd0db143026944fcc40a053efa42bee304cd4 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
This is the first ingest (and crawl) of URLs from DOAJ article-level metadata.
It will include at least 'pdf' and 'html' ingest requests, not just 'pdf' as in
the past.

Working off a 2020-11-13 snapshot.

## Transform and Load

    # in sandcrawler pipenv on aitio
    zcat /schnell/DOAJ-CRAWL-2020-11/doaj_article_data_2020-11-13_all.json.gz | ./scripts/doaj2ingestrequest.py - | pv -l > /schnell/DOAJ-CRAWL-2020-11/doaj_20201113.ingest_request.json
    => 6.7M 0:24:28 [4.57k/s]

    cat /schnell/DOAJ-CRAWL-2020-11/doaj_20201113.ingest_request.json | pv -l | ./persist_tool.py ingest-request -
    => ran in to error with blank `base_url`

Second try after patches:

    zcat /schnell/DOAJ-CRAWL-2020-11/doaj_article_data_2020-11-13_all.json.gz | ./scripts/doaj2ingestrequest.py - | pv -l > /schnell/DOAJ-CRAWL-2020-11/doaj_20201113.ingest_request.json
    => 6.7M 0:24:29 [4.56k/s]

    cat /schnell/DOAJ-CRAWL-2020-11/doaj_20201113.ingest_request.json | pv -l | ./persist_tool.py ingest-request -
    => Worker: Counter({'total': 6703036, 'insert-requests': 163854, 'update-requests': 0})
    => JSON lines pushed: Counter({'total': 6703036, 'pushed': 6703036})

## Check Pre-Crawl Status

    SELECT ingest_request.ingest_type, ingest_file_result.status, COUNT(*)
    FROM ingest_request
    LEFT JOIN ingest_file_result
        ON ingest_file_result.ingest_type = ingest_request.ingest_type
        AND ingest_file_result.base_url = ingest_request.base_url
    WHERE 
        ingest_request.link_source = 'doaj'
    GROUP BY ingest_request.ingest_type, status
    -- next time include ingest_type in sort
    ORDER BY COUNT DESC
    LIMIT 30;


     ingest_type |         status          |  count  
    -------------+-------------------------+---------
     pdf         |                         | 3711532
     html        |                         | 2429003
     pdf         | success                 |  454403
     pdf         | redirect-loop           |   48587
     pdf         | no-pdf-link             |   24901
     pdf         | no-capture              |   11569
     xml         |                         |    9442
     pdf         | link-loop               |    8466
     pdf         | terminal-bad-status     |    2015
     pdf         | wrong-mimetype          |    1441
     pdf         | null-body               |    1057
     pdf         | petabox-error           |     299
     pdf         | cdx-error               |     124
     pdf         | gateway-timeout         |     114
     pdf         | wayback-error           |      77
     pdf         | spn2-cdx-lookup-failure |      20
     pdf         | invalid-host-resolution |       4
     pdf         | spn2-error              |       1
    (18 rows)

## Dump new URLs, Transform, Bulk Ingest (PDF and XML only)

Dump:

    COPY (
        SELECT row_to_json(ingest_request.*)
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.base_url = ingest_request.base_url
            AND ingest_file_result.ingest_type = ingest_request.ingest_type
        WHERE
            (ingest_request.ingest_type = 'pdf'
                OR ingest_request.ingest_type = 'xml')
            AND ingest_request.link_source = 'doaj'
            -- AND date(ingest_request.created) > '2020-12-01'
            AND (ingest_file_result.status IS NULL
                OR ingest_file_result.status = 'no-capture')
    ) TO '/grande/snapshots/doaj_noingest_2020-11-19.rows.json';
    => COPY 3732543

Transform:

    ./scripts/ingestrequest_row2json.py /grande/snapshots/doaj_noingest_2020-11-19.rows.json | pv -l | shuf > /grande/snapshots/doaj_noingest_2020-11-19.ingest_request.json
    => 3.73M 0:02:18 [26.9k/s]

Definitely some non-URL strings in there; should try to filter those out
earlier in the transform process. And/or have a constraint on the URL column in
the database.

Enqueue the whole batch:

    cat /grande/snapshots/doaj_noingest_2020-11-19.ingest_request.json | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1

Started this batch off at 2020-11-19 18:10 (Pacific time)

Stats after run:

    SELECT ingest_request.ingest_type, ingest_file_result.status, COUNT(*)
    FROM ingest_request
    LEFT JOIN ingest_file_result
        ON ingest_file_result.ingest_type = ingest_request.ingest_type
        AND ingest_file_result.base_url = ingest_request.base_url
    WHERE 
        ingest_request.link_source = 'doaj'
    GROUP BY ingest_request.ingest_type, status
    ORDER BY ingest_request.ingest_type, COUNT DESC
    LIMIT 30;

## Dump Seedlist

After preliminary bulk ingest attempts, dump rows:

    COPY (
        SELECT row_to_json(t1.*)
        FROM (
            SELECT ingest_request.*, ingest_file_result as result
            FROM ingest_request
            LEFT JOIN ingest_file_result
                ON ingest_file_result.base_url = ingest_request.base_url
                AND ingest_file_result.ingest_type = ingest_request.ingest_type
            WHERE
                ingest_request.link_source = 'doaj'
                AND (ingest_request.ingest_type = 'pdf'
                    OR ingest_request.ingest_type = 'xml')
                AND ingest_file_result.status != 'success'
                AND ingest_request.base_url NOT LIKE '%journals.sagepub.com%'
                AND ingest_request.base_url NOT LIKE '%pubs.acs.org%'
                AND ingest_request.base_url NOT LIKE '%ahajournals.org%'
                AND ingest_request.base_url NOT LIKE '%www.journal.csj.jp%'
                AND ingest_request.base_url NOT LIKE '%aip.scitation.org%'
                AND ingest_request.base_url NOT LIKE '%academic.oup.com%'
                AND ingest_request.base_url NOT LIKE '%tandfonline.com%'
                AND ingest_request.base_url NOT LIKE '%://archive.org/%'
                AND ingest_request.base_url NOT LIKE '%://web.archive.org/%'
                AND ingest_request.base_url NOT LIKE '%://www.archive.org/%'
                AND ingest_file_result.terminal_url NOT LIKE '%journals.sagepub.com%'
                AND ingest_file_result.terminal_url NOT LIKE '%pubs.acs.org%'
                AND ingest_file_result.terminal_url NOT LIKE '%ahajournals.org%'
                AND ingest_file_result.terminal_url NOT LIKE '%www.journal.csj.jp%'
                AND ingest_file_result.terminal_url NOT LIKE '%aip.scitation.org%'
                AND ingest_file_result.terminal_url NOT LIKE '%academic.oup.com%'
                AND ingest_file_result.terminal_url NOT LIKE '%tandfonline.com%'
                AND ingest_file_result.terminal_url NOT LIKE '%://archive.org/%'
                AND ingest_file_result.terminal_url NOT LIKE '%://web.archive.org/%'
                AND ingest_file_result.terminal_url NOT LIKE '%://www.archive.org/%'
        ) t1
    ) TO '/grande/snapshots/doaj_seedlist_2020-11-19.rows.json';
    => 1,899,555

TODO: filter for valid URLs

Prep ingest requests (for post-crawl use):

    ./scripts/ingestrequest_row2json.py /grande/snapshots/doaj_seedlist_2020-11-19.rows.json | pv -l > /grande/snapshots/doaj_crawl_ingest_2020-11-19.json

And actually dump seedlist(s):

    cat /grande/snapshots/doaj_seedlist_2020-11-19.rows.json | jq -r .base_url | sort -u -S 4G > /grande/snapshots/doaj_seedlist_2020-11-19.url.txt
    cat /grande/snapshots/doaj_seedlist_2020-11-19.rows.json | rg '"no-capture"' | jq -r .result.terminal_url | rg -v ^null$ | sort -u -S 4G > /grande/snapshots/doaj_seedlist_2020-11-19.terminal_url.txt
    cat /grande/snapshots/doaj_seedlist_2020-11-19.rows.json | rg -v '"no-capture"' | jq -r .base_url | sort -u -S 4G > /grande/snapshots/doaj_seedlist_2020-11-19.no_terminal_url.txt

    wc -l doaj_seedlist_2020-11-19.*.txt

## Post-Crawl Ingest

Re-run all ingests, from original batch (pdf, xml, and html), now that DOAJ
identifiers are all in fatcat:

    cat /schnell/DOAJ-CRAWL-2020-11/doaj_20201113.ingest_request.json | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1

    # started 2020-12-23 15:05 (Pacific)
    # finished around 2020-12-31, after one long/slow partition

Stats again after everything:

    SELECT ingest_request.ingest_type, ingest_file_result.status, COUNT(*)
    FROM ingest_request
    LEFT JOIN ingest_file_result
        ON ingest_file_result.ingest_type = ingest_request.ingest_type
        AND ingest_file_result.base_url = ingest_request.base_url
    WHERE 
        ingest_request.link_source = 'doaj'
    GROUP BY ingest_request.ingest_type, status
    ORDER BY ingest_request.ingest_type, COUNT DESC
    LIMIT 50;

     ingest_type |          status          |  count  
    -------------+--------------------------+---------
     html        | wrong-scope              | 1089423
     html        | no-capture               |  423917
     html        | redirect-loop            |  212910
     html        | unknown-scope            |  204069
     html        | html-resource-no-capture |  165587
     html        | success                  |  122937
     html        | null-body                |  100296
     html        | wayback-content-error    |   53918
     html        | wrong-mimetype           |   18908
     html        | terminal-bad-status      |   14059
     html        | petabox-error            |   13520
     html        | cdx-error                |    6823
     html        | wayback-error            |     890
     html        |                          |     620
     html        | blocked-cookie           |     543
     html        | blocked-captcha          |     250
     html        | redirects-exceeded       |     135
     html        | too-many-resources       |     111
     html        | max-hops-exceeded        |      84
     html        | bad-redirect             |       3
     pdf         | success                  | 2851324
     pdf         | no-pdf-link              |  529914
     pdf         | redirect-loop            |  349494
     pdf         | no-capture               |  272202
     pdf         | null-body                |  129027
     pdf         | terminal-bad-status      |   91796
     pdf         | link-loop                |   25267
     pdf         | wrong-mimetype           |    6504
     pdf         | wayback-error            |    2968
     pdf         |                          |    2068
     pdf         | wayback-content-error    |    1548
     pdf         | cdx-error                |    1095
     pdf         | petabox-error            |    1024
     pdf         | bad-redirect             |     203
     pdf         | redirects-exceeded       |     135
     pdf         | timeout                  |      20
     pdf         | max-hops-exceeded        |      19
     pdf         | bad-gzip-encoding        |       2
     xml         | success                  |    6897
     xml         | null-body                |    2353
     xml         | wrong-mimetype           |     184
     xml         | no-capture               |       5
     xml         | cdx-error                |       3
    (43 rows)


And on filtered subset that we actually crawled:

    SELECT ingest_request.ingest_type, ingest_file_result.status, COUNT(*)
    FROM ingest_request
    LEFT JOIN ingest_file_result
        ON ingest_file_result.ingest_type = ingest_request.ingest_type
        AND ingest_file_result.base_url = ingest_request.base_url
    WHERE 
        ingest_request.link_source = 'doaj'
        AND (ingest_request.ingest_type = 'pdf'
            OR ingest_request.ingest_type = 'xml')
        AND ingest_request.base_url NOT LIKE '%journals.sagepub.com%'
        AND ingest_request.base_url NOT LIKE '%pubs.acs.org%'
        AND ingest_request.base_url NOT LIKE '%ahajournals.org%'
        AND ingest_request.base_url NOT LIKE '%www.journal.csj.jp%'
        AND ingest_request.base_url NOT LIKE '%aip.scitation.org%'
        AND ingest_request.base_url NOT LIKE '%academic.oup.com%'
        AND ingest_request.base_url NOT LIKE '%tandfonline.com%'
        AND ingest_request.base_url NOT LIKE '%://archive.org/%'
        AND ingest_request.base_url NOT LIKE '%://web.archive.org/%'
        AND ingest_request.base_url NOT LIKE '%://www.archive.org/%'
        AND ingest_file_result.terminal_url NOT LIKE '%journals.sagepub.com%'
        AND ingest_file_result.terminal_url NOT LIKE '%pubs.acs.org%'
        AND ingest_file_result.terminal_url NOT LIKE '%ahajournals.org%'
        AND ingest_file_result.terminal_url NOT LIKE '%www.journal.csj.jp%'
        AND ingest_file_result.terminal_url NOT LIKE '%aip.scitation.org%'
        AND ingest_file_result.terminal_url NOT LIKE '%academic.oup.com%'
        AND ingest_file_result.terminal_url NOT LIKE '%tandfonline.com%'
        AND ingest_file_result.terminal_url NOT LIKE '%://archive.org/%'
        AND ingest_file_result.terminal_url NOT LIKE '%://web.archive.org/%'
        AND ingest_file_result.terminal_url NOT LIKE '%://www.archive.org/%'
    GROUP BY ingest_request.ingest_type, status
    ORDER BY ingest_request.ingest_type, COUNT DESC
    LIMIT 50;

     ingest_type |        status         |  count  
    -------------+-----------------------+---------
     pdf         | success               | 2851286
     pdf         | no-pdf-link           |  527495
     pdf         | redirect-loop         |  345138
     pdf         | no-capture            |  268140
     pdf         | null-body             |  129027
     pdf         | terminal-bad-status   |   91125
     pdf         | link-loop             |   25267
     pdf         | wrong-mimetype        |    6504
     pdf         | wayback-error         |    2907
     pdf         | petabox-error         |     363
     pdf         | wayback-content-error |     242
     pdf         | bad-redirect          |     203
     pdf         | redirects-exceeded    |     135
     pdf         | max-hops-exceeded     |      19
     pdf         | cdx-error             |      15
     pdf         | bad-gzip-encoding     |       2
     xml         | success               |    6897
     xml         | null-body             |    2353
     xml         | wrong-mimetype        |     184
     xml         | no-capture            |       5
    (20 rows)