TL;DR: replace HBase with PostgreSQL tables, fronted by a REST API (http://postgrest.org).

No primary storage of anything in these tables. Everything should be rapidly
re-creatable from dumps, Kafka topics (compressed), CDX, petabox metadata, etc.
This is a secondary view on all of that.

## Create Database and User

Create a database superuser role matching your system username, e.g.:

    sudo su postgres
    createuser -s bnewbold

Create the database using the `diesel` tool (see the fatcat rust docs for install notes):

    # DANGER: will delete/recreate entire database
    diesel database reset

In the future it would probably be better to create a real role/password and
supply these via the `DATABASE_URL` environment variable.
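
A minimal sketch of what that could look like; the role name and password are
placeholders, not anything configured in this repo:

    -- hypothetical application role; name and password are placeholders
    CREATE ROLE sandcrawler_app WITH LOGIN PASSWORD 'changeme';
    GRANT SELECT, INSERT, UPDATE ON ALL TABLES IN SCHEMA public TO sandcrawler_app;
    -- connection string would then be something like:
    --   postgres://sandcrawler_app:changeme@localhost/sandcrawler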

## Schema

    schema/database name is 'sandcrawler'

    cdx: include revisits or not?
        id: int64, PK
        sha1hex: string, not null, index
        cdx_sha1hex: string
        url: string, not null
        datetime: ISO 8601:1988 (string?), not null
        mimetype: string
        warc_path: string (item and filename)
        warc_offset: i64
        created: datetime, index (?)
       ?crawl: string
       ?domain: string

    file_meta
        sha1hex, string, PK
        md5hex: string
        sha256hex: string
        size_bytes: i64
        mime: string (verifying file status; optional for now?)

    fatcat_file
        sha1hex: string, PK
        file_ident: string, index?
        release_ident: ?

    petabox
        id: int64, PK
        sha1hex: string, not null, index
        item: string, not null
        path: string, not null (TODO: URL encoded? separate sub-archive path?)

    grobid
        sha1hex: string, PK
        updated: datetime
        grobid_version (string)
        status_code: i32
        status: string (JSONB?), only if status != 200
        metadata: JSONB, title, first author, year (not for now?)
        glutton_fatcat_release: string, index

    shadow
        sha1hex: string, PK
        shadow_corpus: string, PK
        shadow_id: string
        doi: string
        pmid: string
        isbn13: string
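
As a rough illustration only, a couple of the sketches above might translate
into DDL along these lines; column types and index names are guesses from the
notes, not a finalized schema:

    -- sketch of `file_meta` and `petabox` from the notes above (not final)
    CREATE TABLE file_meta (
        sha1hex     TEXT PRIMARY KEY,
        md5hex      TEXT,
        sha256hex   TEXT,
        size_bytes  BIGINT,
        mime        TEXT
    );

    CREATE TABLE petabox (
        id       BIGSERIAL PRIMARY KEY,
        sha1hex  TEXT NOT NULL,
        item     TEXT NOT NULL,
        path     TEXT NOT NULL
    );
    CREATE INDEX petabox_sha1hex_idx ON petabox (sha1hex);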

Alternatively, to be more like the existing system, we could have "one big
table", or multiple tables all keyed on the same column (sha1b32) with a UNIQUE
constraint. As is, every sha1 PK column is 40 bytes of both index and data, or
8+ GByte (combined) for each table with 100 million rows. Using raw bytes could
help, but makes all code/queries much trickier.
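
For reference, the raw-bytes variant would look roughly like this (sketch only,
not the chosen design):

    -- 20-byte binary key instead of 40 chars of hex text (sketch only)
    CREATE TABLE file_meta_bin (
        sha1 BYTEA PRIMARY KEY CHECK (octet_length(sha1) = 20)
    );
    -- every query then needs explicit encoding, e.g.:
    -- SELECT * FROM file_meta_bin
    --     WHERE sha1 = decode('120582c855a7cc3c70a8527c560d7f27e6027278', 'hex');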

Should we have "created" and/or "updated" timestamp columns on all of these
tables to enable Kafka tailing?

TODO:
- how to indicate a CDX sha1 vs. true sha1 mismatch? pretty rare; recrawl and delete the row from `gwb_cdx`?
- only most recent GROBID? or keep multiple versions? here and minio

## Existing Stuff Sizes (estimates)

     78.5G  /user/bnewbold/journal_crawl_cdx
     19.7G  /user/bnewbold/sandcrawler/output-prod/2018-12-14-1737.00-dumpfilemeta
      2.7G  file_hashes.tsv
    228.5G  /user/bnewbold/sandcrawler/output-prod/2018-09-23-0405.30-dumpgrobidmetainsertable

## Use Cases

The core goal here is to mostly kill HBase/Hadoop. What jobs are actually used there?

- backfill: load in-scope (fulltext) crawl results from CDX
    => bulk (many line) inserts
- rowcount: "how many unique PDFs crawled?"
    => trivial SQL query
- status code count: "how much GROBID progress?"
    => trivial SQL query
- dumpungrobided: "what files still need to be processed"
    => SQL join with a "first" on the CDX side (sketched below)
- dumpgrobidxml: "merge CDX/file info with extracted XML, for those that were successful"
    => SQL dump or rowscan, then minio fetches
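
For example, under the table sketch above, "rowcount" and "dumpungrobided"
might reduce to queries roughly like these (table and column names as sketched,
file-level semantics assumed):

    -- "how many unique PDFs crawled?"
    SELECT COUNT(DISTINCT sha1hex) FROM cdx;

    -- "what files still need to be processed?": one CDX row per un-GROBIDed file
    SELECT DISTINCT ON (cdx.sha1hex) cdx.*
        FROM cdx
        LEFT JOIN grobid ON grobid.sha1hex = cdx.sha1hex
        WHERE grobid.sha1hex IS NULL
        ORDER BY cdx.sha1hex, cdx.created;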

These tables are generally "single file raw fulltext metadata".

"Enrichment" jobs:

- GROBID
- glutton (if not GROBID)
- extra file metadata
- match newly enriched files to fatcat

What else?

- track additional raw file metadata
- dump all basic GROBID metadata (title, authors, year) to attempt merge/match

Questions we might want to answer (example queries sketched below):

- total size of PDF corpus (terabytes)
- unique files hit per domain
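
Rough sketches of those two, assuming the `file_meta` and `cdx` tables above
and the URL-parsing trick from the "Parse URLs" section below:

    -- total size of PDF corpus, in terabytes
    SELECT SUM(size_bytes) / 1e12 AS corpus_tb FROM file_meta;

    -- unique files hit per domain
    SELECT substring(url FROM '[^/]+://([^/]*)') AS domain,
           COUNT(DISTINCT sha1hex) AS unique_files
        FROM cdx
        GROUP BY domain
        ORDER BY unique_files DESC
        LIMIT 20;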

## Prototype Plan

- backfill all CDX crawl files (TSV transform? bulk `COPY` sketched below)
- load full GROBID XML (both into minio and into SQL)
- load full fatcat file dump (TSV transform)
- load dumpfilemeta
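
A minimal sketch of the CDX bulk-load step, assuming a TSV whose columns line
up with the `cdx` sketch above; the file path and column list are placeholders:

    -- server-side bulk load (or use psql's \copy for a client-side file);
    -- path and column order are placeholders
    COPY cdx (sha1hex, cdx_sha1hex, url, datetime, mimetype, warc_path, warc_offset)
        FROM '/srv/dumps/journal_crawl_cdx.tsv';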

## Example Useful Lookups

    http get :3030/cdx?url=eq.https://coleccionables.mercadolibre.com.ar/arduino-pdf_Installments_NoInterest_BestSellers_YES
    http get :3030/file_meta?sha1hex=eq.120582c855a7cc3c70a8527c560d7f27e6027278
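
The same two lookups directly in SQL, assuming the sketched `cdx` and
`file_meta` tables:

    SELECT * FROM cdx
        WHERE url = 'https://coleccionables.mercadolibre.com.ar/arduino-pdf_Installments_NoInterest_BestSellers_YES';
    SELECT * FROM file_meta
        WHERE sha1hex = '120582c855a7cc3c70a8527c560d7f27e6027278';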

## Parse URLs

One approach is to use regexes, something like:

    SELECT substring(column_name FROM '[^/]+://([^/]+)/') AS domain_name FROM table_name;

For example:

    SELECT domain, COUNT(domain)
        FROM (SELECT substring(base_url FROM '[^/]+://([^/]*)') AS domain FROM ingest_file_result) t1
        WHERE t1.domain != ''
        GROUP BY domain
        ORDER BY COUNT(domain) DESC
        LIMIT 10;

Or:

    SELECT domain, status, COUNT(*)
        FROM (SELECT status, substring(terminal_url FROM '[^/]+://([^/]*)') AS domain FROM ingest_file_result) t1
        WHERE t1.domain != ''
            AND t1.status != 'success'
        GROUP BY domain, status
        ORDER BY COUNT(*) DESC
        LIMIT 10;

You can also do quick lookups for a specific domain and protocol, like:

    SELECT *
        FROM ingest_file_result
        WHERE terminal_url LIKE 'https://insights.ovid.com/%'
        LIMIT 10;