move HBase schema and notes from journal-infra repo

author: Bryan Newbold <bnewbold@archive.org> 2018-08-24 13:39:21 -0700
committer: Bryan Newbold <bnewbold@archive.org> 2018-08-24 13:39:21 -0700
commit: 67755e366bcc1df455a9d75710a11030c3e2cc52 (patch)
tree: 9e46155ba3290634e9a328fc10fc3362789d448d /hbase/schema_design.md
parent: 1ae7fd2f0c5661560b15be86614c2c4d41b21205 (diff)
download: sandcrawler-67755e366bcc1df455a9d75710a11030c3e2cc52.tar.gz
sandcrawler-67755e366bcc1df455a9d75710a11030c3e2cc52.zip
1 files changed, 71 insertions, 0 deletions
diff --git a/hbase/schema_design.md b/hbase/schema_design.md
new file mode 100644
index 0000000..67a940f
--- /dev/null
+++ b/hbase/schema_design.md
@@ -0,0 +1,71 @@
+
+## PDF Table
+
+Table name: `wbgrp-journal-extract-<version>-<env>`
+
+Eg: `wbgrp-journal-extract-0-prod`
+
+Key is the sha1 of the file, as raw bytes (20 bytes).
+
+We could conceivably need to handle, eg, postscript files, JATS XML, or even
+HTML in the future. If possible, be filetype-agnostic, but only "fulltext" file
+types will end up in here, so don't bend over backwards.
+
+Keep only a single version (do we need `VERSIONS => 1`, or is 1 the default?)
+
+IMPORTANT: column names should be unique across column families. Eg, should not
+have both `grobid0:status` and `match0:status`. HBase and some client libraries
+don't care, but some map/reduce frameworks (eg, Scalding) can have name
+collisions. Differences between "orthogonal" columns *might* be OK (eg,
+`grobid0:status` and `grobid1:status`).
+
+Column families:
+
+- `key`: sha1 of the file in base32 (not a column or column family)
+- `f`: heritrix HBaseContentDigestHistory de-dupe
+    - `c`: (json string)
+        - `u`: original URL (required)
+        - `d`: original date (required; ISO 8601:1988)
+        - `f`: warc filename (recommended)
+        - `o`: warc offset (recommended)
+        - `c`: dupe count (optional)
+        - `i`: warc record ID (optional)
+- `file`: crawl and file metadata
+    - `size` (uint64), uncompressed (not in CDX)
+    - `mime` (string; might do postscript in the future; normalized)
+    - `cdx` (json string) with all as strings
+        - `surt`
+        - `url`
+        - `dt`
+        - `warc` (item and file name)
+        - `offset`
+        - `c_size` (compressed size)
+- `grobid0`: processing status, version, XML and JSON fulltext, JSON metadata, timestamp. Should be compressed! `COMPRESSION => SNAPPY`
+    - `status_code` (signed int; HTTP status from grobid)
+    - `quality` (int or string; we define the meaning ("good"/"marginal"/"bad"))
+    - `status` (json string from grobid)
+    - `tei_xml` (xml string from grobid)
+    - `tei_json` (json string with fulltext)
+    - `metadata` (json string with author, title, abstract, citations, etc)
+- `match0`: status of identification against "the catalog"
+    - `mstatus` (string; did it match?)
+    - `doi` (string)
+    - `minfo` (json string)
+
+Can add additional groups in the future for additional processing steps. For
+example, we might want to do a first pass looking at files to see "is this a PDF
+or not", which would output some status (and maybe certainty).
+
+The Heritrix schema is fixed by the existing implementation. We could
+patch/extend heritrix to use the `file` schema in the future if we decide
+it's worth it. There are some important pieces of metadata missing, so at
+least to start I think we should keep `f` and `file` distinct. We could merge
+them later. `f` info will be populated by crawlers; `file` info should be
+populated when back-filling or processing CDX lines.
+
+If we wanted to support multiple CDX rows as part of the same row (eg, as
+alternate locations), we could use HBase's versions feature, which can
+automatically cap the number of versions stored.
+
+If we had enough RAM resources, we could store `f` (and maybe `file`) metadata
+in memory for faster access.
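
The notes above describe the row key both as the raw 20-byte sha1 and as the sha1 in base32, and they ask that column names stay unique across column families. A minimal Python sketch of both ideas follows. It is not part of the commit; the helper names `sha1_row_key`, `sha1_base32`, and `qualifier_collisions` are hypothetical, and the collision check is deliberately strict, so it also flags "orthogonal" pairs such as `grobid0:status` and `grobid1:status` that the notes say *might* be OK.

```python
import base64
import hashlib


def sha1_row_key(blob: bytes) -> bytes:
    """Raw SHA-1 digest of the file contents (always 20 bytes)."""
    return hashlib.sha1(blob).digest()


def sha1_base32(blob: bytes) -> str:
    """Base32 text form of the same digest.

    20 bytes is 160 bits, which is exactly 32 base32 characters,
    so the result never carries '=' padding.
    """
    return base64.b32encode(sha1_row_key(blob)).decode("ascii")


def qualifier_collisions(columns):
    """Map each qualifier used by more than one family to those families.

    `columns` is an iterable of "family:qualifier" strings. This is a
    hypothetical lint helper, not something HBase itself enforces.
    """
    families_by_qualifier = {}
    for column in columns:
        family, qualifier = column.split(":", 1)
        families_by_qualifier.setdefault(qualifier, set()).add(family)
    return {
        qualifier: sorted(families)
        for qualifier, families in families_by_qualifier.items()
        if len(families) > 1
    }


if __name__ == "__main__":
    pdf_bytes = b"%PDF-1.4 example"  # stand-in for real file contents
    print(len(sha1_row_key(pdf_bytes)), sha1_base32(pdf_bytes))
    print(qualifier_collisions(["grobid0:status", "match0:status", "file:size"]))
```

With the schema as written (`grobid0:status` alongside `match0:mstatus`), the check should return an empty dict; renaming `mstatus` back to `status` would make it report the clash.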