author | Bryan Newbold <bnewbold@archive.org> | 2019-04-12 13:51:10 -0700 |
---|---|---|
committer | Bryan Newbold <bnewbold@archive.org> | 2019-04-12 14:19:29 -0700 |
commit | 1e19371d6f47dc89744271c4791f3888a246dc4a (patch) | |
tree | 9bb55bd90e7bc5b0306715523b1d0de7fb75a5b9 /hbase | |
parent | b23455fcc90416be370c4396c1f1e4bbe36b93d6 (diff) | |
schema notes on deeper file metadata
Diffstat (limited to 'hbase')
-rw-r--r-- | hbase/schema_design.md | 8 |
1 files changed, 8 insertions, 0 deletions
diff --git a/hbase/schema_design.md b/hbase/schema_design.md
index 67a940f..2db8998 100644
--- a/hbase/schema_design.md
+++ b/hbase/schema_design.md
@@ -40,6 +40,14 @@ Column families:
         - `warc` (item and file name)
         - `offset`
         - `c_size` (compressed size)
+    - `meta` (json string)
+        - `size` (int)
+        - `mime` (str)
+        - `magic` (str)
+        - `magic_mime` (str)
+        - `sha1` (hex str)
+        - `md5` (hex str)
+        - `sha256` (hex str)
 - `grobid0`: processing status, version, XML and JSON fulltext, JSON metadata. timestamp. Should be compressed! `COMPRESSION => SNAPPY`
     - `status_code` (signed int; HTTP status from grobid)
     - `quality` (int or string; we define the meaning ("good"/"marginal"/"bad")
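
As a rough illustration of the `meta` column added in this diff, the sketch below builds a JSON string with the listed fields from raw file bytes. It is not taken from the sandcrawler code: the function name `build_file_meta` and its `filename` parameter are hypothetical, `mime` is guessed from the file name with the standard library (it could equally come from an HTTP header), and `magic`/`magic_mime` are left as `None` because filling them would need a libmagic binding, which this sketch avoids.

```python
import hashlib
import json
import mimetypes


def build_file_meta(blob, filename=None):
    """Return a JSON string shaped like the `meta` column in the diff above.

    Hypothetical helper for illustration only; field names follow the schema
    notes, everything else is an assumption.
    """
    # Assumption: guess `mime` from the file extension when a name is given.
    mime = mimetypes.guess_type(filename)[0] if filename else None
    meta = {
        "size": len(blob),  # int: number of bytes in the file
        "mime": mime,  # str or None
        "magic": None,  # str: libmagic description (not computed here)
        "magic_mime": None,  # str: libmagic mimetype (not computed here)
        "sha1": hashlib.sha1(blob).hexdigest(),  # hex str
        "md5": hashlib.md5(blob).hexdigest(),  # hex str
        "sha256": hashlib.sha256(blob).hexdigest(),  # hex str
    }
    # Serialize to a single string, since the column is described as a json string.
    return json.dumps(meta, sort_keys=True)


if __name__ == "__main__":
    print(build_file_meta(b"%PDF-1.4 example bytes", "paper.pdf"))
```

Storing the digests as lowercase hex strings matches the `(hex str)` annotations in the notes; whether the real pipeline sorts keys or includes null fields is not stated in the commit.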