From 67755e366bcc1df455a9d75710a11030c3e2cc52 Mon Sep 17 00:00:00 2001
From: Bryan Newbold <bnewbold@archive.org>
Date: Fri, 24 Aug 2018 13:39:21 -0700
Subject: move HBase schema and notes from journal-infra repo

---
 hbase/schema_design.md | 71 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 71 insertions(+)
 create mode 100644 hbase/schema_design.md

(limited to 'hbase/schema_design.md')
diff --git a/hbase/schema_design.md b/hbase/schema_design.md
new file mode 100644
index 0000000..67a940f
--- /dev/null
+++ b/hbase/schema_design.md
@@ -0,0 +1,71 @@
+
+## PDF Table
+
+Table name: `wbgrp-journal-extract-<version>-<env>`
+
+Eg: `wbgrp-journal-extract-0-prod`
+
+Key is the sha1 of the file, as raw bytes (20 bytes).
+
+Could conceivably need to handle, eg, postscript files, JATS XML, or even HTML
+in the future? If possible be filetype-agnostic, but only "fulltext" file types
+will end up in here, and don't bend over backwards.
+
+Keep only a single version (do we need `VERSIONS => 1`, or is 1 the default?)
+
+IMPORTANT: column names should be unique across column families. Eg, should not
+have both `grobid0:status` and `match0:status`. HBase and some client libraries
+don't care, but some map/reduce frameworks (eg, Scalding) can have name
+collisions. Differences between "orthogonal" columns *might* be OK (eg,
+`grobid0:status` and `grobid1:status`).
+
+Column families:
+
+- `key`: sha1 of the file in base32 (not a column or column family)
+- `f`: heritrix HBaseContentDigestHistory de-dupe
+    - `c`: (json string)
+        - `u`: original URL (required)
+        - `d`: original date (required; ISO 8601:1988)
+        - `f`: warc filename (recommend)
+        - `o`: warc offset (recommend)
+        - `c`: dupe count (optional)
+        - `i`: warc record ID (optional)
+- `file`: crawl and file metadata
+    - `size` (uint64), uncompressed (not in CDX)
+    - `mime` (string; might do postscript in the future; normalized)
+    - `cdx` (json string) with all as strings
+        - `surt`
+        - `url`
+        - `dt`
+        - `warc` (item and file name)
+        - `offset`
+        - `c_size` (compressed size)
+- `grobid0`: processing status, version, XML and JSON fulltext, JSON metadata. timestamp. Should be compressed! `COMPRESSION => SNAPPY`
+    - `status_code` (signed int; HTTP status from grobid)
+    - `quality` (int or string; we define the meaning ("good"/"marginal"/"bad")
+    - `status` (json string from grobid)
+    - `tei_xml` (xml string from grobid)
+    - `tei_json` (json string with fulltext)
+    - `metadata` (json string with author, title, abstract, citations, etc)
+- `match0`: status of identification against "the catalog"
+    - `mstatus` (string; did it match?)
+    - `doi` (string)
+    - `minfo` (json string)
+
+Can add additional groups in the future for additional processing steps. For
+example, we might want to do first pass looking at files to see "is this a PDF
+or not", which out output some status (and maybe certainty).
+
+The Heritrix schema is fixed by the existing implementation. We could
+patch/extend heritrix to use the `file` schema in the future if we decide
+it's worth it. There are some important pieces of metadata missing, so at
+least to start I think we should keep `f` and `file` distinct. We could merge
+them later. `f` info will be populated by crawlers; `file` info should be
+populated when back-filling or processing CDX lines.
+
+If we wanted to support multiple CDX rows as part of the same row (eg, as
+alternate locations), we could use HBase's versions feature, which can
+automatically cap the number of versions stored.
+
+If we had enough RAM resources, we could store `f` (and maybe `file`) metadata
+in memory for faster access.
-- 
cgit v1.2.3