From 67755e366bcc1df455a9d75710a11030c3e2cc52 Mon Sep 17 00:00:00 2001
From: Bryan Newbold
Date: Fri, 24 Aug 2018 13:39:21 -0700
Subject: move HBase schema and notes from journal-infra repo

---
 hbase/schema_design.md | 71 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 71 insertions(+)
 create mode 100644 hbase/schema_design.md

(limited to 'hbase/schema_design.md')

diff --git a/hbase/schema_design.md b/hbase/schema_design.md
new file mode 100644
index 0000000..67a940f
--- /dev/null
+++ b/hbase/schema_design.md
@@ -0,0 +1,71 @@
+
+## PDF Table
+
+Table name: `wbgrp-journal-extract--`
+
+Eg: `wbgrp-journal-extract-0-prod`
+
+Key is the sha1 of the file, as raw bytes (20 bytes).
+
+Could conceivably need to handle, eg, postscript files, JATS XML, or even HTML
+in the future? If possible be filetype-agnostic, but only "fulltext" file types
+will end up in here, and don't bend over backwards.
+
+Keep only a single version (do we need `VERSIONS => 1`, or is 1 the default?)
+
+IMPORTANT: column names should be unique across column families. Eg, should not
+have both `grobid0:status` and `match0:status`. HBase and some client libraries
+don't care, but some map/reduce frameworks (eg, Scalding) can have name
+collisions. Differences between "orthogonal" columns *might* be OK (eg,
+`grobid0:status` and `grobid1:status`).
+
+Column families:
+
+- `key`: sha1 of the file in base32 (not a column or column family)
+- `f`: heritrix HBaseContentDigestHistory de-dupe
+    - `c`: (json string)
+        - `u`: original URL (required)
+        - `d`: original date (required; ISO 8601:1988)
+        - `f`: warc filename (recommended)
+        - `o`: warc offset (recommended)
+        - `c`: dupe count (optional)
+        - `i`: warc record ID (optional)
+- `file`: crawl and file metadata
+    - `size` (uint64), uncompressed (not in CDX)
+    - `mime` (string; might do postscript in the future; normalized)
+    - `cdx` (json string) with all as strings
+        - `surt`
+        - `url`
+        - `dt`
+        - `warc` (item and file name)
+        - `offset`
+        - `c_size` (compressed size)
+- `grobid0`: processing status, version, XML and JSON fulltext, JSON metadata, timestamp. Should be compressed! `COMPRESSION => SNAPPY`
+    - `status_code` (signed int; HTTP status from grobid)
+    - `quality` (int or string; we define the meaning ("good"/"marginal"/"bad"))
+    - `status` (json string from grobid)
+    - `tei_xml` (xml string from grobid)
+    - `tei_json` (json string with fulltext)
+    - `metadata` (json string with author, title, abstract, citations, etc)
+- `match0`: status of identification against "the catalog"
+    - `mstatus` (string; did it match?)
+    - `doi` (string)
+    - `minfo` (json string)
+
+Can add additional groups in the future for additional processing steps. For
+example, we might want to do first pass looking at files to see "is this a PDF
+or not", which would output some status (and maybe certainty).
+
+The Heritrix schema is fixed by the existing implementation. We could
+patch/extend heritrix to use the `file` schema in the future if we decide
+it's worth it. There are some important pieces of metadata missing, so at
+least to start I think we should keep `f` and `file` distinct. We could merge
+them later. `f` info will be populated by crawlers; `file` info should be
+populated when back-filling or processing CDX lines.
+
+If we wanted to support multiple CDX rows as part of the same row (eg, as
+alternate locations), we could use HBase's versions feature, which can
+automatically cap the number of versions stored.
+
+If we had enough RAM resources, we could store `f` (and maybe `file`) metadata
+in memory for faster access.
--
cgit v1.2.3
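
Reviewer sketch, not part of the patch: the schema notes above say the row key is the raw 20-byte sha1 of the file, shown in base32 for display, and that column names should be unique across column families. A minimal Python illustration of both rules follows. The function names (`row_key`, `display_key`, `colliding_qualifiers`) are hypothetical and do not come from the repository, and the check is only one possible reading of the uniqueness rule.

```python
import base64
import hashlib
from collections import defaultdict


def row_key(file_bytes: bytes) -> bytes:
    """Raw 20-byte SHA-1 digest of the file contents, used as the row key."""
    return hashlib.sha1(file_bytes).digest()


def display_key(key: bytes) -> str:
    """Base32 rendering of the raw key (20 bytes encode to 32 characters)."""
    return base64.b32encode(key).decode("ascii")


def colliding_qualifiers(columns):
    """Map each qualifier used under more than one family to those families.

    `columns` is an iterable of "family:qualifier" strings. This is an
    assumed, strict reading of the rule: it does not special-case the
    "orthogonal" families (eg, grobid0 vs grobid1) that the notes say
    might be acceptable.
    """
    families = defaultdict(set)
    for column in columns:
        family, qualifier = column.split(":", 1)
        families[qualifier].add(family)
    return {q: sorted(f) for q, f in families.items() if len(f) > 1}


key = row_key(b"%PDF-1.4 example bytes")
print(len(key), display_key(key))

# The schema as written avoids collisions by using `mstatus` in `match0`.
print(colliding_qualifiers(["grobid0:status", "match0:mstatus", "file:size"]))
# The example the notes warn against.
print(colliding_qualifiers(["grobid0:status", "match0:status"]))
```

A check like this could run against a proposed column list before a new family is added. Whether `grobid0:status` and `grobid1:status` should be exempt is left open, as in the notes.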