
Going to look something like:

    zcat DOI-LANDING-CRAWL-2018-06-full_crawl_logs/DOI-LANDING-CRAWL-2018-06.$SHARD.us.archive.org.crawl.log.gz | tr -cd '[[:print:]]\n\r\t' | rg '//doi.org/' | /fast/scratch/unpaywall/make_doi_list.py > doi_list.$SHARD.txt

    zcat /fast/unpaywall-munging/DOI-LANDING-CRAWL-2018-06/DOI-LANDING-CRAWL-2018-06-full_crawl_logs/DOI-LANDING-CRAWL-2018-06.$SHARD.us.archive.org.crawl.log.gz | pv | /fast/scratch/unpaywall/make_map.py redirectmap.$SHARD.db

    cat /fast/unpaywall-munging/DOI-LANDING-CRAWL-2018-06/doi_list.$SHARD.txt | pv | /fast/scratch/unpaywall/make_output.py redirectmap.$SHARD.db > doi_index.$SHARD.tsv

Let's start with:

    mkdir UNPAYWALL-PDF-CRAWL-2018-07
    ia download UNPAYWALL-PDF-CRAWL-2018-07-full_crawl_logs

    export SHARD=wbgrp-svc279 # running
    export SHARD=wbgrp-svc280 # running
    export SHARD=wbgrp-svc281 # running
    export SHARD=wbgrp-svc282 # running
    zcat UNPAYWALL-PDF-CRAWL-2018-07-full_crawl_logs/UNPAYWALL-PDF-CRAWL-2018-07.$SHARD.us.archive.org.crawl.log.gz | pv | /fast/scratch/unpaywall/make_map.py redirectmap.$SHARD.db
    zcat UNPAYWALL-PDF-CRAWL-2018-07-full_crawl_logs/UNPAYWALL-PDF-CRAWL-2018-07-PATCH.$SHARD.us.archive.org.crawl.log.gz | pv | /fast/scratch/unpaywall/make_map.py redirectmap.$SHARD-PATCH.db

### Design

If possible, we'd like something that will work with as many crawls as
possible. It should work per-shard, with the shard outputs merged afterwards.

Output: JSON and/or sqlite rows with:

- identifier (optional?)
- initial-uri (indexed)
- breadcrumbs
- final-uri
- final-http-status
- final-sha1
- final-mimetype-normalized
- final-was-dedupe (boolean)
- final-cdx (string, if would be extracted)

This will allow filtering on various fields, checking success stats, etc.
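
A minimal sqlite schema sketch for this output (column names follow the list
above and the `crawl_result` queries later in these notes; types and indexes
are guesses, not necessarily what the real scripts create):

    import sqlite3

    def create_output_db(db_path):
        # Sketch only: one row per (initial-uri, terminal) result
        db = sqlite3.connect(db_path)
        db.executescript("""
            CREATE TABLE IF NOT EXISTS crawl_result (
                initial_uri TEXT NOT NULL,
                identifier TEXT,
                breadcrumbs TEXT,
                final_uri TEXT,
                final_http_status TEXT,
                final_sha1 TEXT,
                final_mimetype TEXT,
                final_was_dedupe INTEGER,
                final_cdx TEXT
            );
            CREATE INDEX IF NOT EXISTS crawl_result_initial_uri ON crawl_result (initial_uri);
            CREATE INDEX IF NOT EXISTS crawl_result_identifier ON crawl_result (identifier);
        """)
        db.commit()
        return db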

Components:

- {identifier, initial-uri} input (basically, seedlist)
- full crawl logs
- raw CDX, indexed by final-uri
- referer map

Process:

- use full crawl logs to generate a referer map; this is a dict with URI as
  key and {referer URI, status, breadcrumb, was-dedupe, mimetype} as value;
  the referer may be null. database backend can be whatever.
- iterate through CDX, filtering by HTTP status and mimetype (including
  revisits). for each potential terminal, look it up in the referer map. if
  the mimetype is confirmed, iterate back through the full referer chain and
  print a final line with everything but the identifier
- iterate through the identifier/URI list, filling in the identifier column
  (sketched below)
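
A rough sketch of this first pass in Python (dict-based referer map; field
offsets follow the crawl log notes below, and all names are illustrative
rather than the actual arabesque.py implementation):

    def build_referer_map(log_lines):
        """Map URI -> (referer URI, status, breadcrumbs, was-dedupe, mimetype).

        Assumes whitespace-separated crawl log fields and that deduplicated
        fetches carry a 'duplicate:' annotation (both assumptions, not checked
        against every heritrix config).
        """
        referer_map = {}
        for line in log_lines:
            fields = line.split()
            if len(fields) < 12:
                continue
            status, uri, breadcrumbs, referer, mime = \
                fields[1], fields[3], fields[4], fields[5], fields[6]
            referer_map[uri] = (
                None if referer == '-' else referer,
                status, breadcrumbs, 'duplicate:' in fields[11], mime)
        return referer_map

    def backward_chain(referer_map, terminal_uri, limit=20):
        """Walk referers from a terminal URI back towards the seed URI."""
        chain = [terminal_uri]
        uri = terminal_uri
        while uri in referer_map and referer_map[uri][0] and limit > 0:
            uri = referer_map[uri][0]
            chain.append(uri)
            limit -= 1
        return list(reversed(chain))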

Complications:

- non-PDF terminals: error codes, or HTML only (failed to find PDF)
- multiple terminals per seed; eg, multiple PDFs, or PDF+postscript+HTML or
  whatever

Process #2:

- use full crawl logs to generate a bi-directional referer map: a sqlite3 table
  with uri and referer-uri both indexed, plus {status, breadcrumb, was-dedupe,
  mimetype} columns (table sketched below)
- iterate through CDX, selecting successful "terminal" lines (by mimetype and
  status). use the referer map to iterate back to an initial URI, and generate
  a row. look up the output table by initial-uri; if an entry already exists,
  behavior is flag-dependent: overwrite if "better", or add a second line
- in a second pass, update rows with the identifier based on URI. if rows are
  not found/updated, do a "forwards" lookup to a terminal condition and write
  that status. note that these rows won't have CDX.
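
A sketch of the bi-directional referer map table plus a forward lookup (names
assumed; the forward direction can fan out when one page links multiple PDFs
or embeds, per the complications below):

    import sqlite3

    def create_referer_map_db(db_path):
        # uri and referer_uri both indexed so chains can be walked backwards
        # (terminal -> seed) and forwards (seed -> terminals)
        db = sqlite3.connect(db_path)
        db.executescript("""
            CREATE TABLE IF NOT EXISTS referer_map (
                uri TEXT NOT NULL,
                referer_uri TEXT,
                status TEXT,
                breadcrumbs TEXT,
                was_dedupe INTEGER,
                mimetype TEXT
            );
            CREATE INDEX IF NOT EXISTS referer_map_uri ON referer_map (uri);
            CREATE INDEX IF NOT EXISTS referer_map_referer ON referer_map (referer_uri);
        """)
        db.commit()
        return db

    def forward_children(db, uri):
        """All URIs whose referer is `uri`; may return several rows."""
        return [row[0] for row in
                db.execute("SELECT uri FROM referer_map WHERE referer_uri = ?", (uri,))]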

More Complications:

- handling revisits correctly... raw CDX probably not actually helpful for PDF
  case, only landing/HTML case
- given the above, should probably just iterate over crawl logs (or at least
  offer that as a mode) in the "backwards" stage
- fan-out of "forward" redirect map, in the case of embeds and PDF link
  extraction
- could pull out first and final URI domains for easier SQL stats/reporting
- should include final datetime (for wayback lookups)

NOTE/TODO: journal-level dumps of fatcat metadata would be cool... could roll
up release dumps as an alternative to hitting elasticsearch? or just hit
elasticsearch and both dump to sqlite and enrich the elastic doc? should
probably have an indexed "last updated" timestamp in all elastic docs

### Crawl Log Notes

Fields:

    0   timestamp (ISO8601) of log line
    1   status code (HTTP or negative)
    2   size in bytes (content only)
    3   URI of this download
    4   discovery breadcrumbs
    5   "referer" URI
    6   mimetype (as reported?)
    7   worker thread
    8   full timestamp (start of network fetch; this is dt?)
    9   SHA1
    10  source tag
    11  annotations
    12  partial CDX JSON
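
A parsing sketch for one such line (assumes whitespace separation, with only
the trailing partial CDX JSON possibly containing spaces):

    def parse_crawl_log_line(line):
        """Split a crawl log line into the 13 fields listed above.

        Sketch only: limits the split so the partial CDX JSON stays intact,
        and returns None for short/malformed lines.
        """
        fields = line.rstrip('\n').split(None, 12)
        if len(fields) < 13:
            return None
        keys = ('timestamp', 'status', 'size_bytes', 'uri', 'breadcrumbs',
                'referer', 'mimetype', 'worker_thread', 'fetch_timestamp',
                'sha1', 'source_tag', 'annotations', 'cdx_json')
        record = dict(zip(keys, fields))
        if record['referer'] == '-':
            record['referer'] = None
        return record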

### External Prep for, Eg, Unpaywall Crawl

    export LC_ALL=C
    sort -S 8G -u seedlist.shard > seedlist.shard.sorted

    zcat unpaywall_20180621.pdf_meta.tsv.gz | awk '{print $2 "\t" $1}' | sort -S 8G -u > unpaywall_20180621.seed_id.tsv

    join -t $'\t' unpaywall_20180621.seed_id.tsv unpaywall_crawl_patch_seedlist.split_3.schedule.sorted > seed_id.shard.tsv

TODO: why don't these sum/match correctly?

    bnewbold@orithena$ wc -l seed_id.shard.tsv unpaywall_crawl_patch_seedlist.split_3.schedule.sorted
    880737 seed_id.shard.tsv
    929459 unpaywall_crawl_patch_seedlist.split_3.schedule.sorted

    why is:
    http://00ec89c.netsolhost.com/brochures/200605_JAWMA_Hg_Paper_Lee_Hastings.pdf
    in unpaywall_crawl_patch_seedlist, but not unpaywall_20180621.pdf_meta?

    # Can't even filter on HTTP 200, because revisits are '-'
    #zcat UNPAYWALL-PDF-CRAWL-2018-07.cdx.gz | rg 'wbgrp-svc282' | rg ' 200 ' | rg '(pdf)|(revisit)' > UNPAYWALL-PDF-CRAWL-2018-07.svc282.filtered.cdx

    zcat UNPAYWALL-PDF-CRAWL-2018-07.cdx.gz | rg 'UNPAYWALL-PDF-CRAWL-2018-07-PATCH' | rg 'wbgrp-svc282' | rg '(pdf)|( warc/revisit )|(postscript)|( unk )' > UNPAYWALL-PDF-CRAWL-2018-07-PATCH.svc282.filtered.cdx

TODO: spaces in URLs, like 'https://www.termedia.pl/Journal/-7/pdf-27330-10?filename=A case.pdf'

### Revisit Notes

Neither the CDX nor the crawl logs seem to have revisits actually point to the
final content; they just point to the revisit record in the (crawl-local) WARC.

### sqlite3 stats

    select count(*) from crawl_result;

    select count(*) from crawl_result where identifier is null;

    select breadcrumbs, count(*) from crawl_result group by breadcrumbs;

    select final_was_dedupe, count(*) from crawl_result group by final_was_dedupe;

    select final_http_status, count(*) from crawl_result group by final_http_status;

    select final_mimetype, count(*) from crawl_result group by final_mimetype;

    select * from crawl_result where final_mimetype = 'text/html' and final_http_status = '200' order by random() limit 5;

    select count(*) from crawl_result where final_uri like 'https://academic.oup.com/Govern%';

    select count(distinct identifier) from crawl_result where final_sha1 is not null;

### testing shard notes

880737  `seed_id` lines
21776   breadcrumbs are null (no crawl logs line); mostly normalized URLs?
24985   "first" URIs with no identifier; mostly normalized URLs?

backward: Counter({'skip-cdx-scope': 807248, 'inserted': 370309, 'skip-map-scope': 2913})
forward (dirty): Counter({'inserted': 509242, 'existing-id-updated': 347218, 'map-uri-missing': 15556, 'existing-complete': 8721, '_normalized-seed-uri': 5520})

874131 identifier is not null
881551 breadcrumbs is not null
376057 final_mimetype is application/pdf
370309 final_sha1 is not null
332931 application/pdf in UNPAYWALL-PDF-CRAWL-2018-07-PATCH.svc282.filtered.cdx

summary:
    370309/874131 42% got a PDF
    264331/874131 30% some domain dead-end
        196747/874131 23% onlinelibrary.wiley.com
        33879/874131   4% www.nature.com
        11074/874131   1% www.tandfonline.com
    125883/874131 14% blocked, 404, other crawl failures
            select count(*) from crawl_result where final_http_status >= '400' or final_http_status < '200';
    121028/874131 14% HTTP 200, but not pdf
        105317/874131 12% academic.oup.com; all rate-limited or cookie fail
    15596/874131  1.7% didn't even try crawling (null final status)

TODO:
- add "success" flag (instead of "final_sha1 is null")
- 

    http://oriental-world.org.ua/sites/default/files/Archive/2017/3/4.pdf   10.15407/orientw2017.03.021 -       http://oriental-world.org.ua/sites/default/files/Archive/2017/3/4.pdf   403     ¤       application/pdf 0       ¤

Iterated:

    ./arabesque.py backward UNPAYWALL-PDF-CRAWL-2018-07-PATCH.svc282.filtered.cdx map.sqlite out.sqlite
    Counter({'skip-cdx-scope': 813760, 'inserted': 370435, 'skip-map-scope': 4620, 'skip-tiny-octetstream-': 30})

    ./arabesque.py forward unpaywall_20180621.seed_id.shard.tsv map.sqlite out.sqlite
    Counter({'inserted': 523594, 'existing-id-updated': 350009, '_normalized-seed-uri': 21371, 'existing-complete': 6638, 'map-uri-missing': 496})

894029 breadcrumbs is not null
874102 identifier is not null
20423 identifier is null
496 breadcrumbs is null
370435 final_sha1 is not null

### URL/seed non-match issues!

Easily fixable (normalization sketch at the end of this section):
- capitalization of domains
- empty port number, like `http://genesis.mi.ras.ru:/~razborov/hadamard.ps`

Encodable:
- URL encoding
    http://accounting.rutgers.edu/docs/seminars/Fall11/Clawbacks_9-27-11[1].pdf
    http://accounting.rutgers.edu/docs/seminars/Fall11/Clawbacks_9-27-11%5B1%5D.pdf
- whitespace in URL (should be url-encoded)
    https://www.termedia.pl/Journal/-7/pdf-27330-10?filename=A case.pdf
    https://www.termedia.pl/Journal/-7/pdf-27330-10?filename=A%EF%BF%BD%EF%BF%BDcase.pdf
- tricky hidden unicode
    http://goldhorde.ru/wp-content/uploads/2017/03/ЗО-1-2017-206-212.pdf
    http://goldhorde.ru/wp-content/uploads/2017/03/%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD-1-2017-206-212.pdf

Harder/Custom?
- paths including "/../" or "/./" are collapsed
- port number 80, like `http://fermet.misis.ru:80/jour/article/download/724/700`
- missing slash after the port, like `aos2.uniba.it:8080papers`

- fragments stripped by crawler: 'https://www.termedia.pl/Journal/-85/pdf-27083-10?filename=BTA#415-06-str307-316.pdf'
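
A sketch of a normalization pass covering the "easily fixable" and "encodable"
cases above (hypothetical helper; not necessarily how the crawler or the
matching scripts normalize):

    from urllib.parse import urlsplit, urlunsplit, quote

    def normalize_seed_uri(uri):
        """Best effort: lowercase scheme/host, drop empty and default ports,
        percent-encode spaces, brackets, and non-ASCII in path/query.
        Ignores the harder cases ('/../' collapsing, fragments, userinfo)."""
        parts = urlsplit(uri)
        host = parts.hostname or ''
        port = parts.port  # an empty port like 'host:' comes back as None
        if port and not ((parts.scheme == 'http' and port == 80) or
                         (parts.scheme == 'https' and port == 443)):
            host = '%s:%d' % (host, port)
        # '%' kept safe so already-encoded sequences aren't double-encoded
        path = quote(parts.path, safe="/%:@!$&'()*+,;=")
        query = quote(parts.query, safe="=&%:@!$'()*+,;/?")
        return urlunsplit((parts.scheme.lower(), host, path, query, parts.fragment))

For example, `normalize_seed_uri('http://accounting.rutgers.edu/docs/seminars/Fall11/Clawbacks_9-27-11[1].pdf')`
yields the `%5B1%5D` form listed above.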

### Debugging "redirect terminal" issue

Some are redirect loops; fine.

Some are from 'cookieSet=1' redirects, like 'http://journals.sagepub.com/doi/pdf/10.1177/105971230601400206?cookieSet=1'. This comes through like:

    sqlite> select * from crawl_result where initial_uri = 'http://adb.sagepub.com/cgi/reprint/14/2/147.pdf';
    initial_uri     identifier      breadcrumbs     final_uri       final_http_status       final_sha1      final_mimetype  final_was_dedupe        final_cdx
    http://adb.sagepub.com/cgi/reprint/14/2/147.pdf 10.1177/105971230601400206 R       http://journals.sagepub.com/doi/pdf/10.1177/105971230601400206  302     ¤       text/html       0       ¤

Using 'http' (note: this is not an OA article):

    http://adb.sagepub.com/cgi/reprint/14/2/147.pdf
    https://journals.sagepub.com/doi/pdf/10.1177/105971230601400206
    https://journals.sagepub.com/doi/pdf/10.1177/105971230601400206?cookieSet=1
    http://journals.sagepub.com/action/cookieAbsent

Is heritrix refusing to do that second redirect? In some cases it will do at
least the first, like:

    http://pubs.rsna.org/doi/pdf/10.1148/radiographics.11.1.1996385
    http://pubs.rsna.org/doi/pdf/10.1148/radiographics.11.1.1996385?cookieSet=1
    http://pubs.rsna.org/action/cookieAbsent

I think the vast majority of redirect terminals are when we redirect to a page
that has already been crawled. This is a bummer because we can't find the
redirect target in the logs.

Eg, academic.oup.com sometimes redirects to cookieSet, then cookieAbsent; other
times it redirects to Governer. It's important to distinguish between these.
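
A quick way to eyeball these cases against the output database (query sketch;
columns as in the `crawl_result` rows above):

    import sqlite3

    def sample_redirect_terminals(db_path, domain_fragment, limit=10):
        """Pull a few rows whose final status is a 3xx redirect, to check
        whether they dead-end at cookieSet/cookieAbsent or at a target that
        was simply already crawled (and so never shows up again in the logs)."""
        db = sqlite3.connect(db_path)
        return db.execute(
            "SELECT initial_uri, breadcrumbs, final_uri, final_http_status "
            "FROM crawl_result "
            "WHERE final_http_status LIKE '3%' AND final_uri LIKE ? "
            "ORDER BY RANDOM() LIMIT ?",
            ('%' + domain_fragment + '%', limit)).fetchall()

Eg, `sample_redirect_terminals('out.sqlite', 'sagepub.com')` against the shard
output above.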

### Scratch

What are actual advantages/use-cases of CDX mode?
=> easier CDX-to-WARC output mode
=> sending CDX along with WARCs as an index

Interested in scale-up behavior: full unpaywall PDF crawls, and/or full DOI landing crawls
=> eatmydata


    zcat UNPAYWALL-PDF-CRAWL-2018-07-PATCH* | time /fast/scratch/unpaywall/arabesque.py referrer - UNPAYWALL-PDF-CRAWL-2018-07-PATCH.map.sqlite
    [snip]
    ... referrer 5542000
    Referrer map complete.
    317.87user 274.57system 21:20.22elapsed 46%CPU (0avgtext+0avgdata 22992maxresident)k
    24inputs+155168464outputs (0major+802114minor)pagefaults 0swaps

    bnewbold@ia601101$ ls -lathr
    -rw-r--r-- 1 bnewbold bnewbold 1.7G Dec 12 12:33 UNPAYWALL-PDF-CRAWL-2018-07-PATCH.map.sqlite

Scaling!

    16,736,800 UNPAYWALL-PDF-CRAWL-2018-07.wbgrp-svc282.us.archive.org.crawl.log
    17,215,895 unpaywall_20180621.seed_id.tsv

Oops; need to shard the seed_id file.

Ugh, this one is a little derp because I didn't sort correctly. Let's say close enough though...

    4318674 unpaywall_crawl_seedlist.svc282.tsv
    3901403 UNPAYWALL-PDF-CRAWL-2018-07.wbgrp-svc282.seed_id.tsv


    /fast/scratch/unpaywall/arabesque.py everything CORE-UPSTREAM-CRAWL-2018-11.combined.log core_2018-03-01_metadata.seed_id.tsv CORE-UPSTREAM-CRAWL-2018-11.out.sqlite

    Counter({'inserted': 3226191, 'skip-log-scope': 2811395, 'skip-log-prereq': 108932, 'skip-tiny-octetstream-': 855, 'skip-map-scope': 2})
    Counter({'existing-id-updated': 3221984, 'inserted': 809994, 'existing-complete': 228909, '_normalized-seed-uri': 17287, 'map-uri-missing': 2511, '_redirect-recursion-limit': 221, 'skip-bad-seed-uri': 17})

    time /fast/scratch/unpaywall/arabesque.py everything UNPAYWALL-PDF-CRAWL-2018-07.wbgrp-svc282.us.archive.org.crawl.log UNPAYWALL-PDF-CRAWL-2018-07.wbgrp-svc282.seed_id.tsv UNPAYWALL-PDF-CRAWL-2018-07.out.sqlite

    Everything complete!
    Counter({'skip-log-scope': 13476816, 'inserted': 2536452, 'skip-log-prereq': 682460, 'skip-tiny-octetstream-': 41067})
    Counter({'existing-id-updated': 1652463, 'map-uri-missing': 1245789, 'inserted': 608802, 'existing-complete': 394349, '_normalized-seed-uri': 22573, '_redirect-recursion-limit': 157})

    real    63m42.124s
    user    53m31.007s
    sys     6m50.535s

### Performance

Before tweaks:

    real    2m55.975s
    user    2m6.772s
    sys     0m12.684s

After:

    real    1m51.500s
    user    1m44.600s
    sys     0m3.496s