From 1e19371d6f47dc89744271c4791f3888a246dc4a Mon Sep 17 00:00:00 2001
From: Bryan Newbold
Date: Fri, 12 Apr 2019 13:51:10 -0700
Subject: schema notes on deeper file metadata

---
 hbase/schema_design.md | 8 ++++++++
 1 file changed, 8 insertions(+)

(limited to 'hbase')

diff --git a/hbase/schema_design.md b/hbase/schema_design.md
index 67a940f..2db8998 100644
--- a/hbase/schema_design.md
+++ b/hbase/schema_design.md
@@ -40,6 +40,14 @@ Column families:
 - `warc` (item and file name)
 - `offset`
 - `c_size` (compressed size)
+ - `meta` (json string)
+ - `size` (int)
+ - `mime` (str)
+ - `magic` (str)
+ - `magic_mime` (str)
+ - `sha1` (hex str)
+ - `md5` (hex str)
+ - `sha256` (hex str)
 - `grobid0`: processing status, version, XML and JSON fulltext, JSON metadata. timestamp. Should be compressed! `COMPRESSION => SNAPPY`
 - `status_code` (signed int; HTTP status from grobid)
 - `quality` (int or string; we define the meaning ("good"/"marginal"/"bad")
-- 
cgit v1.2.3
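
For illustration only, here is a minimal sketch of how a `meta` JSON string with the fields added in this patch might be built. It assumes the seven fields listed after `meta` are keys inside that JSON string (the flattened diff does not show indentation, so the nesting is a guess). The helper name `build_file_meta` is hypothetical and not part of the patch. The `mime`, `magic`, and `magic_mime` values are passed in by the caller because computing them would need a libmagic binding that the patch does not name.

```python
import hashlib
import json


def build_file_meta(blob, mime=None, magic=None, magic_mime=None):
    """Build a `meta` JSON string for one file body.

    Hypothetical helper: the field names come from the schema notes,
    everything else (name, signature, key ordering) is an assumption.
    """
    meta = {
        "size": len(blob),                          # int, size in bytes
        "mime": mime,                               # str, supplied by caller
        "magic": magic,                             # str, supplied by caller
        "magic_mime": magic_mime,                   # str, supplied by caller
        "sha1": hashlib.sha1(blob).hexdigest(),     # hex str
        "md5": hashlib.md5(blob).hexdigest(),       # hex str
        "sha256": hashlib.sha256(blob).hexdigest(), # hex str
    }
    # Sorted keys keep the serialized string stable across runs.
    return json.dumps(meta, sort_keys=True)


if __name__ == "__main__":
    print(build_file_meta(b"%PDF-1.4 example bytes", mime="application/pdf"))
```

Storing the digests as lowercase hex strings matches the `(hex str)` annotations in the schema notes; a real writer would still need to decide how the string is encoded into the HBase cell.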