author: Bryan Newbold <bnewbold@archive.org> 2019-04-12 13:51:10 -0700
committer: Bryan Newbold <bnewbold@archive.org> 2019-04-12 14:19:29 -0700
commit: 1e19371d6f47dc89744271c4791f3888a246dc4a
tree: 9bb55bd90e7bc5b0306715523b1d0de7fb75a5b9
parent: b23455fcc90416be370c4396c1f1e4bbe36b93d6
schema notes on deeper file metadata
-rw-r--r-- hbase/schema_design.md | 8
1 files changed, 8 insertions, 0 deletions
diff --git a/hbase/schema_design.md b/hbase/schema_design.md
index 67a940f..2db8998 100644
--- a/hbase/schema_design.md
+++ b/hbase/schema_design.md
@@ -40,6 +40,14 @@ Column families:
- `warc` (item and file name)
- `offset`
- `c_size` (compressed size)
+ - `meta` (json string)
+ - `size` (int)
+ - `mime` (str)
+ - `magic` (str)
+ - `magic_mime` (str)
+ - `sha1` (hex str)
+ - `md5` (hex str)
+ - `sha256` (hex str)
- `grobid0`: processing status, version, XML and JSON fulltext, JSON metadata. timestamp. Should be compressed! `COMPRESSION => SNAPPY`
- `status_code` (signed int; HTTP status from grobid)
- `quality` (int or string; we define the meaning ("good"/"marginal"/"bad")
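
As an illustration of the `meta` fields added in this commit, here is a minimal Python sketch of how such a JSON string might be assembled. It assumes the fields listed after `meta` are keys inside that JSON string, which the diff does not state outright. The helper name `file_meta_json` and its parameters are hypothetical and not part of sandcrawler. The `mime`, `magic`, and `magic_mime` values are taken as caller-supplied inputs, since detecting them would need a libmagic binding that this sketch deliberately leaves out.

```python
import hashlib
import json


def file_meta_json(blob, mime=None, magic=None, magic_mime=None):
    """Build a JSON string holding the file metadata fields from the schema notes.

    Hypothetical helper: the name, signature, and key layout are assumptions.
    """
    meta = {
        "size": len(blob),                            # int, size in bytes
        "mime": mime,                                 # str, supplied by the caller
        "magic": magic,                               # str, supplied by the caller
        "magic_mime": magic_mime,                     # str, supplied by the caller
        "sha1": hashlib.sha1(blob).hexdigest(),       # hex str
        "md5": hashlib.md5(blob).hexdigest(),         # hex str
        "sha256": hashlib.sha256(blob).hexdigest(),   # hex str
    }
    # sort_keys keeps the serialized string stable across runs
    return json.dumps(meta, sort_keys=True)
```

Calling `file_meta_json(pdf_bytes, mime="application/pdf")` would return a single string suitable for storing in one column. Keeping all seven values in one JSON cell instead of separate columns is a design choice this sketch infers from the "(json string)" annotation.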