TL;DR: replace hbase with postgresql tables, accessed via a REST API (http://postgrest.org)

No primary storage of anything in these tables. Everything should be rapidly
re-creatable from dumps, kafka topics (compressed), CDX, petabox metadata, etc.
This is a secondary view on all of that.

## Schema

Schema/database name is 'sandcrawler'.
    cdx: include revisits or not?
        id: int64, PK
        sha1hex: string, not null, index
        cdx_sha1hex: string
        url: string, not null
        datetime: ISO 8601:1988 (string?), not null
        mimetype: string
        warc_path: string (item and filename)
        warc_offset: i64
        created: datetime, index (?)
        ?crawl: string
        ?domain: string

    file_meta
        sha1hex: string, PK
        md5hex: string
        sha256hex: string
        size_bytes: i64
        mime: string (verifying file status; optional for now?)

    fatcat_file
        sha1hex: string, PK
        file_ident: string, index?
        release_ident: ?

    petabox
        id: int64, PK
        sha1hex: string, not null, index
        item: string, not null
        path: string, not null (TODO: URL encoded? separate sub-archive path?)

    grobid
        sha1hex: string, PK
        updated: datetime
        grobid_version: string
        status_code: i32
        status: string (JSONB?), only if status_code != 200
        metadata: JSONB (title, first author, year), not for now?
        glutton_fatcat_release: string, index

    shadow
        sha1hex: string, PK
        shadow_corpus: string, PK
        shadow_id: string
        doi: string
        pmid: string
        isbn13: string
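
A minimal SQL sketch of two of these tables (cdx and file_meta), just to make
the shape concrete; exact types, defaults, and index names beyond the field
list above are assumptions, not a final DDL:

```sql
-- Sketch only: types/constraints beyond the field list above are assumptions.
CREATE TABLE IF NOT EXISTS cdx (
    id          BIGSERIAL PRIMARY KEY,
    sha1hex     TEXT NOT NULL,
    cdx_sha1hex TEXT,
    url         TEXT NOT NULL,
    datetime    TIMESTAMPTZ NOT NULL,  -- or TEXT if keeping raw ISO 8601 strings
    mimetype    TEXT,
    warc_path   TEXT,                  -- item and filename
    warc_offset BIGINT,
    created     TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX IF NOT EXISTS cdx_sha1hex_idx ON cdx (sha1hex);

CREATE TABLE IF NOT EXISTS file_meta (
    sha1hex     TEXT PRIMARY KEY,
    md5hex      TEXT,
    sha256hex   TEXT,
    size_bytes  BIGINT,
    mime        TEXT
);
```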
Alternatively, to be more like the existing system, we could have "one big
table" or multiple tables all sharing the same key (sha1b32) with a UNIQUE
constraint. As is, every sha1 PK column is 40 bytes of both index and data,
or 8+ GBytes (combined) for each table with 100 million rows. Using raw bytes
could help, but makes all code/queries much trickier.
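
For illustration, the raw-bytes alternative would look roughly like this
(assumption, not the chosen design): a 20-byte BYTEA key instead of a 40-char
hex TEXT column, with hex encoding/decoding pushed to the query boundary.

```sql
-- Sketch of the raw-bytes alternative; table name is hypothetical.
CREATE TABLE IF NOT EXISTS file_meta_raw (
    sha1       BYTEA PRIMARY KEY CHECK (octet_length(sha1) = 20),
    size_bytes BIGINT
);

-- Every lookup then has to encode/decode at the boundary, e.g.:
SELECT size_bytes
FROM file_meta_raw
WHERE sha1 = decode('120582c855a7cc3c70a8527c560d7f27e6027278', 'hex');
```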
Should we have "created" or "updated" timestamps on all these tables to enable
kafka tailing?
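
If we do, one standard Postgres approach is an `updated` column kept current
by a trigger, which a tailer (or Kafka producer) can poll with
`WHERE updated > $last_seen`. A minimal sketch, assuming the `grobid` table's
`updated` column; function and trigger names are made up here:

```sql
-- Assumed helper: bump `updated` on every write so changes can be tailed.
CREATE OR REPLACE FUNCTION touch_updated() RETURNS trigger AS $$
BEGIN
    NEW.updated := now();
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER grobid_touch_updated
    BEFORE INSERT OR UPDATE ON grobid
    FOR EACH ROW EXECUTE PROCEDURE touch_updated();
```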
TODO:

- how to indicate a CDX sha1 vs. true sha1 mismatch? pretty rare. recrawl and delete the row from `gwb_cdx`?
- only keep the most recent GROBID run, or keep multiple versions? both here and in minio

## Existing Stuff Sizes (estimates)

     78.5G  /user/bnewbold/journal_crawl_cdx
     19.7G  /user/bnewbold/sandcrawler/output-prod/2018-12-14-1737.00-dumpfilemeta
      2.7G  file_hashes.tsv
    228.5G  /user/bnewbold/sandcrawler/output-prod/2018-09-23-0405.30-dumpgrobidmetainsertable

## Use Cases

Core goal here is to mostly kill hbase/hadoop. What jobs are actually used there?

- backfill: load in-scope (fulltext) crawl results from CDX
    => bulk (many-line) inserts
- rowcount: "how many unique PDFs crawled?"
    => trivial SQL query (sketched below)
- status code count: "how much GROBID progress?"
    => trivial SQL query (sketched below)
- dumpungrobided: "which files still need to be processed?"
    => SQL join with a "first" on the CDX side (sketched below)
- dumpgrobidxml: "merge CDX/file info with extracted XML, for those that were successful"
    => SQL dump or rowscan, then minio fetches

These tables are generally "single file raw fulltext metadata".
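
Rough SQL sketches for the query-style jobs above, assuming the table and
column names from the schema section; the "first" on the CDX side is
approximated here with `DISTINCT ON`:

```sql
-- rowcount: how many unique PDFs crawled?
SELECT COUNT(DISTINCT sha1hex) FROM cdx;

-- status code count: how much GROBID progress?
SELECT status_code, COUNT(*) FROM grobid GROUP BY status_code;

-- dumpungrobided: files crawled but not yet processed by GROBID,
-- taking one ("first") CDX row per sha1hex.
SELECT DISTINCT ON (cdx.sha1hex)
    cdx.sha1hex, cdx.url, cdx.datetime, cdx.warc_path, cdx.warc_offset
FROM cdx
LEFT JOIN grobid ON grobid.sha1hex = cdx.sha1hex
WHERE grobid.sha1hex IS NULL
ORDER BY cdx.sha1hex, cdx.datetime;
```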
"Enrichment" jobs:
- GROBID
- glutton (if not GROBID)
- extra file metadata
- match newly enriched files to fatcat
What else?
- track additional raw file metadata
- dump all basic GROBID metadata (title, authors, year) to attempt merge/match
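
A hedged sketch of that GROBID metadata dump, assuming the JSONB `metadata`
column carries keys like `title`, `first_author`, and `year` (the exact key
names are an assumption):

```sql
-- Dump basic GROBID metadata for merge/match attempts.
-- Key names inside the JSONB column are assumptions.
SELECT
    sha1hex,
    metadata->>'title'        AS title,
    metadata->>'first_author' AS first_author,
    metadata->>'year'         AS year
FROM grobid
WHERE status_code = 200 AND metadata IS NOT NULL;
```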
Questions we might want to answer (example queries sketched below):

- total size of PDF corpus (terabytes)
- unique files hit per domain
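
Sketches for those two, assuming `size_bytes` lives in `file_meta` and that
the per-domain grouping is derived from `cdx.url` (or from the optional
`domain` column, if populated):

```sql
-- Total size of the PDF corpus, in bytes (divide for TB).
SELECT SUM(size_bytes) FROM file_meta;

-- Unique files hit per domain, deriving the host from the URL.
-- If the optional `domain` column is populated, GROUP BY that instead.
SELECT
    substring(url FROM '://([^/]+)') AS domain,
    COUNT(DISTINCT sha1hex)          AS unique_files
FROM cdx
GROUP BY domain
ORDER BY unique_files DESC;
```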

## Prototype Plan

- backfill all CDX crawl files (TSV transform? see the COPY sketch below)
- load full GROBID XML (both into minio and into SQL)
- load full fatcat file dump (TSV transform)
- load dumpfilemeta
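
One plausible bulk-load path for the TSV transforms, assuming the dumps are
pre-transformed into columns matching the target tables; the file path and
column order here are hypothetical:

```sql
-- Bulk-load a pre-transformed TSV into the cdx table.
-- Server-side path shown; use psql's \copy for a client-side file.
COPY cdx (sha1hex, cdx_sha1hex, url, datetime, mimetype, warc_path, warc_offset)
FROM '/srv/sandcrawler/journal_crawl_cdx.transformed.tsv'
WITH (FORMAT text, DELIMITER E'\t', NULL '');

-- Same pattern for the fatcat file dump and dumpfilemeta outputs.
```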

## Example Useful Lookups

    http get :3030/cdx?url=eq.https://coleccionables.mercadolibre.com.ar/arduino-pdf_Installments_NoInterest_BestSellers_YES
    http get :3030/file_meta?sha1hex=eq.120582c855a7cc3c70a8527c560d7f27e6027278
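
For reference, PostgREST translates the `eq.` filters above into simple
equality queries, roughly:

```sql
-- Approximately what PostgREST runs for the lookups above.
SELECT * FROM cdx
WHERE url = 'https://coleccionables.mercadolibre.com.ar/arduino-pdf_Installments_NoInterest_BestSellers_YES';

SELECT * FROM file_meta
WHERE sha1hex = '120582c855a7cc3c70a8527c560d7f27e6027278';
```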