| author | Bryan Newbold <bnewbold@archive.org> | 2019-04-12 13:51:10 -0700 | 
|---|---|---|
| committer | Bryan Newbold <bnewbold@archive.org> | 2019-04-12 14:19:29 -0700 | 
| commit | 1e19371d6f47dc89744271c4791f3888a246dc4a | |
| tree | 9bb55bd90e7bc5b0306715523b1d0de7fb75a5b9 /hbase/schema_design.md | |
| parent | b23455fcc90416be370c4396c1f1e4bbe36b93d6 | |
schema notes on deeper file metadata
Diffstat (limited to 'hbase/schema_design.md')
| -rw-r--r-- | hbase/schema_design.md | 8 | 
1 files changed, 8 insertions, 0 deletions
diff --git a/hbase/schema_design.md b/hbase/schema_design.md
index 67a940f..2db8998 100644
--- a/hbase/schema_design.md
+++ b/hbase/schema_design.md
@@ -40,6 +40,14 @@ Column families:
         - `warc` (item and file name)
         - `offset`
         - `c_size` (compressed size)
+    - `meta` (json string)
+        - `size` (int)
+        - `mime` (str)
+        - `magic` (str)
+        - `magic_mime` (str)
+        - `sha1` (hex str)
+        - `md5` (hex str)
+        - `sha256` (hex str)
 - `grobid0`: processing status, version, XML and JSON fulltext, JSON metadata. timestamp. Should be compressed! `COMPRESSION => SNAPPY`
     - `status_code` (signed int; HTTP status from grobid)
     - `quality` (int or string; we define the meaning ("good"/"marginal"/"bad")
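
For illustration, the new `meta` value could be built roughly as follows. This is a minimal sketch and not code from this commit: the function name `build_file_meta` is hypothetical, and the `mime`, `magic`, and `magic_mime` strings are assumed to come from some external detector and are simply passed through here (left as `null` when unknown). Only the JSON keys and their types are taken from the schema notes above; the column family that holds `meta` is not visible in this hunk.

```python
import hashlib
import json
from typing import Optional


def build_file_meta(
    blob: bytes,
    mime: Optional[str] = None,
    magic: Optional[str] = None,
    magic_mime: Optional[str] = None,
) -> str:
    """Serialize file-level metadata as the JSON string stored in `meta`.

    Hypothetical helper: the key names follow the schema notes, everything
    else (name, signature, null handling) is an assumption.
    """
    meta = {
        "size": len(blob),                             # int, size in bytes
        "mime": mime,                                  # str, supplied by caller
        "magic": magic,                                # str, supplied by caller
        "magic_mime": magic_mime,                      # str, supplied by caller
        "sha1": hashlib.sha1(blob).hexdigest(),        # hex str, 40 chars
        "md5": hashlib.md5(blob).hexdigest(),          # hex str, 32 chars
        "sha256": hashlib.sha256(blob).hexdigest(),    # hex str, 64 chars
    }
    # sort_keys keeps the serialized string stable across runs
    return json.dumps(meta, sort_keys=True)
```

Calling `build_file_meta(pdf_bytes, mime="application/pdf")` returns a single JSON string, which matches the "json string" type noted for `meta`; the digests are lowercase hex, as `hexdigest()` produces.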
