-rw-r--r-- hbase/schema_design.md 8
1 files changed, 8 insertions, 0 deletions
diff --git a/hbase/schema_design.md b/hbase/schema_design.md
index 67a940f..2db8998 100644
--- a/hbase/schema_design.md
+++ b/hbase/schema_design.md
@@ -40,6 +40,14 @@ Column families:
- `warc` (item and file name)
- `offset`
- `c_size` (compressed size)
+ - `meta` (json string)
+ - `size` (int)
+ - `mime` (str)
+ - `magic` (str)
+ - `magic_mime` (str)
+ - `sha1` (hex str)
+ - `md5` (hex str)
+ - `sha256` (hex str)
- `grobid0`: processing status, version, XML and JSON fulltext, JSON metadata. timestamp. Should be compressed! `COMPRESSION => SNAPPY`
- `status_code` (signed int; HTTP status from grobid)
- `quality` (int or string; we define the meaning ("good"/"marginal"/"bad")
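
Note on the added fields: the hunk adds `meta` (json string) along with `size`, `mime`, `magic`, `magic_mime`, `sha1`, `md5`, and `sha256`. The following is a minimal sketch, not the project's actual code, of how the size and hash values could be derived from a file body using only the Python standard library. The helper name `file_meta` is hypothetical, and the sketch assumes the values are serialized as keys of the `meta` JSON string; the diff does not make clear whether they are instead stored as separate columns. The `mime`, `magic`, and `magic_mime` fields are left out because they would need a libmagic-style detector, which is not shown in the source.

```python
import hashlib
import json


def file_meta(blob: bytes) -> dict:
    """Hypothetical helper: compute size and hex digests for a file body."""
    return {
        "size": len(blob),                           # int, size in bytes
        "sha1": hashlib.sha1(blob).hexdigest(),      # hex str, 40 chars
        "md5": hashlib.md5(blob).hexdigest(),        # hex str, 32 chars
        "sha256": hashlib.sha256(blob).hexdigest(),  # hex str, 64 chars
    }


# Example: build the JSON string that would go in the `meta` column
# (assumed layout, see the note above).
meta = file_meta(b"hello")
meta_json = json.dumps(meta, sort_keys=True)
```

Sorting the keys keeps the serialized string stable across runs, which makes cell values easier to compare when rows are re-processed.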