notes/tasks/2021-09-09_pdf_url_lists.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66


Want to dump a URL list to share with partners, filtered to content we think is
likely to be scholarly.

Columns to include:

- original URL
- capture timestamp
- SHA1

## Stats Overview

file_meta table, mimetype=application/pdf: 173,816,433

cdx table, mimetype=application/pdf: 131,346,703

ingest_file_result table, pdf, success: 66,487,928

## Ingested PDF URLs

"Ingested" URLs: ingest_file_result table, pdf and hit=true; include base URL also?

    COPY (
        SELECT
            base_url as start_url,
            terminal_url as pdf_url,
            terminal_dt as pdf_url_timestamp,
            terminal_sha1hex as pdf_sha1hex
        FROM ingest_file_result
        WHERE
            ingest_type = 'pdf'
            AND status = 'success'
    )
    TO '/srv/sandcrawler/tasks/wayback_pdf_targeted.2021-09-09.tsv'
    WITH NULL '';
    => 77,892,849

## CDX PDFs

"All web PDFs": CDX query; left join file_meta, but don't require

    COPY (
        SELECT
            cdx.url as pdf_url,
            cdx.datetime as pdf_url_timestamp,
            cdx.sha1hex as pdf_sha1hex
        FROM cdx
        LEFT JOIN file_meta
        ON
            cdx.sha1hex = file_meta.sha1hex
        WHERE
            file_meta.mimetype = 'application/pdf'
            OR (
                file_meta.mimetype IS NULL
                AND cdx.mimetype = 'application/pdf'
            )
    )
    TO '/srv/sandcrawler/tasks/wayback_pdf_speculative.2021-09-09.tsv'
    WITH NULL '';
    => 147,837,935

## Processed web PDFs

"Parsed web PDFs": `file_meta`, left join CDX

(didn't do this one)