1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
|
Want to dump a URL list to share with partners, filtered to content we think is
likely to be scholarly.
Columns to include:
- original URL
- capture timestamp
- SHA1
## Stats Overview
file_meta table, mimetype=application/pdf: 173,816,433
cdx table, mimetype=application/pdf: 131,346,703
ingest_file_result table, pdf, success: 66,487,928
## Ingested PDF URLs
"Ingested" URLs: ingest_file_result table, pdf and hit=true; include base URL also?
COPY (
SELECT
base_url as start_url,
terminal_url as pdf_url,
terminal_dt as pdf_url_timestamp,
terminal_sha1hex as pdf_sha1hex
FROM ingest_file_result
WHERE
ingest_type = 'pdf'
AND status = 'success'
)
TO '/srv/sandcrawler/tasks/wayback_pdf_targeted.2021-09-09.tsv'
WITH NULL '';
=> 77,892,849
## CDX PDFs
"All web PDFs": CDX query; left join file_meta, but don't require
COPY (
SELECT
cdx.url as pdf_url,
cdx.datetime as pdf_url_timestamp,
cdx.sha1hex as pdf_sha1hex
FROM cdx
LEFT JOIN file_meta
ON
cdx.sha1hex = file_meta.sha1hex
WHERE
file_meta.mimetype = 'application/pdf'
OR (
file_meta.mimetype IS NULL
AND cdx.mimetype = 'application/pdf'
)
)
TO '/srv/sandcrawler/tasks/wayback_pdf_speculative.2021-09-09.tsv'
WITH NULL '';
=> 147,837,935
## Processed web PDFs
"Parsed web PDFs": `file_meta`, left join CDX
(didn't do this one)
|