aboutsummaryrefslogtreecommitdiffstats
path: root/postgresql/README.md
diff options
context:
space:
mode:
Diffstat (limited to 'postgresql/README.md')
-rw-r--r--postgresql/README.md120
1 files changed, 120 insertions, 0 deletions
diff --git a/postgresql/README.md b/postgresql/README.md
new file mode 100644
index 0000000..4774adb
--- /dev/null
+++ b/postgresql/README.md
@@ -0,0 +1,120 @@
+
+TL;DR: replace hbase with postgresql tables with REST API (http://postgrest.org)
+
+No primary storage of anything in this table. Everything should be rapidly
+re-creatable from dumps, kafka topics (compressed), CDX, petabox metadata, etc.
+This is a secondary view on all of that.
+
+## Schema
+
+ schema/database name is 'sandcrawler'
+
+ cdx: include revisits or not?
+ id: int64, PK
+ sha1hex: string, not null, index
+ cdx_sha1hex: string
+ url: string, not null
+ datetime: ISO 8601:1988 (string?), not null
+ mimetype: string
+ warc_path: string (item and filename)
+ warc_offset: i64
+ created: datetime, index (?)
+ ?crawl: string
+ ?domain: string
+
+ file_meta
+ sha1hex, string, PK
+ md5hex: string
+ sha256hex: string
+ size_bytes: i64
+ mime: string (verifying file status; optional for now?)
+
+ fatcat_file
+ sha1hex: string, PK
+ file_ident: string, index?
+ release_ident: ?
+
+ petabox
+ id: int64, PK
+ sha1hex: string, notnull, index
+ item: string, notnull
+ path: string, notnull (TODO: URL encoded? separate sub-archive path?)
+
+ grobid
+ sha1hex: string, PK
+ updated: datetime
+ grobid_version (string)
+ status_code: i32
+ status: string (JSONB?), only if status != 200
+ metadata: JSONB, title, first author, year (not for now?)
+ glutton_fatcat_release: string, index
+
+ shadow
+ sha1hex: string, PK
+ shadow_corpus: string, PK
+ shadow_id: string
+ doi: string
+ pmid: string
+ isbn13: string
+
+Alternatively, to be more like existing system could have "one big table" or
+multiple tables all with same key (sha1b32) and UNIQ. As is, every sha1 pk
+column is 40 bytes of both index and data, or 8+ GByte (combined) for each
+table with 100 million rows. using raw bytes could help, but makes all
+code/queries much trickier.
+
+Should we have "created" or "updated" timestamps on all these columns to enable
+kafka tailing?
+
+TODO:
+- how to indicate CDX sha1 vs. true sha1 mis-match? pretty rare. recrawl and delete row from `gwb_cdx`?
+- only most recent GROBID? or keep multiple versions? here and minio
+
+## Existing Stuff Sizes (estimates)
+
+ 78.5G /user/bnewbold/journal_crawl_cdx
+ 19.7G /user/bnewbold/sandcrawler/output-prod/2018-12-14-1737.00-dumpfilemeta
+ 2.7G file_hashes.tsv
+ 228.5G /user/bnewbold/sandcrawler/output-prod/2018-09-23-0405.30-dumpgrobidmetainsertable
+
+## Use Cases
+
+Core goal here is to mostly kill hbase/hadoop. What jobs are actually used there?
+
+- backfill: load in-scope (fulltext) crawl results from CDX
+ => bulk (many line) inserts
+- rowcount: "how many unique PDFs crawled?"
+ => trivial SQL query
+- status code count: "how much GROBID progress?"
+ => trivial SQL query
+- dumpungrobided: "what files still need to be processed"
+ => SQL join with a "first" on CDX side
+- dumpgrobidxml: "merge CDX/file info with extracted XML, for those that were successful"
+ => SQL dump or rowscan, then minio fetches
+
+This table is generally "single file raw fulltext metadata".
+
+"Enrichment" jobs:
+
+- GROBID
+- glutton (if not GROBID)
+- extra file metadata
+- match newly enriched files to fatcat
+
+What else?
+
+- track additional raw file metadata
+- dump all basic GROBID metadata (title, authors, year) to attempt merge/match
+
+Questions we might want to answer
+
+- total size of PDF corpus (terabytes)
+- unqiue files hit per domain
+
+## Prototype Plan
+
+- backfill all CDX crawl files (TSV transform?)
+- load full GROBID XML (both into minio and into SQL)
+- load full fatcat file dump (TSV transform)
+- load dumpfilemeta
+