diff options
Diffstat (limited to 'hbase/schema_design.md')
-rw-r--r-- | hbase/schema_design.md | 79 |
1 files changed, 0 insertions, 79 deletions
diff --git a/hbase/schema_design.md b/hbase/schema_design.md deleted file mode 100644 index 2db8998..0000000 --- a/hbase/schema_design.md +++ /dev/null @@ -1,79 +0,0 @@ - -## PDF Table - -Table name: `wbgrp-journal-extract-<version>-<env>` - -Eg: `wbgrp-journal-extract-0-prod` - -Key is the sha1 of the file, as raw bytes (20 bytes). - -Could conceivably need to handle, eg, postscript files, JATS XML, or even HTML -in the future? If possible be filetype-agnostic, but only "fulltext" file types -will end up in here, and don't bend over backwards. - -Keep only a single version (do we need `VERSIONS => 1`, or is 1 the default?) - -IMPORTANT: column names should be unique across column families. Eg, should not -have both `grobid0:status` and `match0:status`. HBase and some client libraries -don't care, but some map/reduce frameworks (eg, Scalding) can have name -collisions. Differences between "orthogonal" columns *might* be OK (eg, -`grobid0:status` and `grobid1:status`). - -Column families: - -- `key`: sha1 of the file in base32 (not a column or column family) -- `f`: heritrix HBaseContentDigestHistory de-dupe - - `c`: (json string) - - `u`: original URL (required) - - `d`: original date (required; ISO 8601:1988) - - `f`: warc filename (recommend) - - `o`: warc offset (recommend) - - `c`: dupe count (optional) - - `i`: warc record ID (optional) -- `file`: crawl and file metadata - - `size` (uint64), uncompressed (not in CDX) - - `mime` (string; might do postscript in the future; normalized) - - `cdx` (json string) with all as strings - - `surt` - - `url` - - `dt` - - `warc` (item and file name) - - `offset` - - `c_size` (compressed size) - - `meta` (json string) - - `size` (int) - - `mime` (str) - - `magic` (str) - - `magic_mime` (str) - - `sha1` (hex str) - - `md5` (hex str) - - `sha256` (hex str) -- `grobid0`: processing status, version, XML and JSON fulltext, JSON metadata. timestamp. Should be compressed! `COMPRESSION => SNAPPY` - - `status_code` (signed int; HTTP status from grobid) - - `quality` (int or string; we define the meaning ("good"/"marginal"/"bad") - - `status` (json string from grobid) - - `tei_xml` (xml string from grobid) - - `tei_json` (json string with fulltext) - - `metadata` (json string with author, title, abstract, citations, etc) -- `match0`: status of identification against "the catalog" - - `mstatus` (string; did it match?) - - `doi` (string) - - `minfo` (json string) - -Can add additional groups in the future for additional processing steps. For -example, we might want to do first pass looking at files to see "is this a PDF -or not", which out output some status (and maybe certainty). - -The Heritrix schema is fixed by the existing implementation. We could -patch/extend heritrix to use the `file` schema in the future if we decide -it's worth it. There are some important pieces of metadata missing, so at -least to start I think we should keep `f` and `file` distinct. We could merge -them later. `f` info will be populated by crawlers; `file` info should be -populated when back-filling or processing CDX lines. - -If we wanted to support multiple CDX rows as part of the same row (eg, as -alternate locations), we could use HBase's versions feature, which can -automatically cap the number of versions stored. - -If we had enough RAM resources, we could store `f` (and maybe `file`) metadata -in memory for faster access. |