aboutsummaryrefslogtreecommitdiffstats
path: root/hbase/schema_design.md
diff options
context:
space:
mode:
authorBryan Newbold <bnewbold@archive.org>2018-08-24 13:39:21 -0700
committerBryan Newbold <bnewbold@archive.org>2018-08-24 13:39:21 -0700
commit67755e366bcc1df455a9d75710a11030c3e2cc52 (patch)
tree9e46155ba3290634e9a328fc10fc3362789d448d /hbase/schema_design.md
parent1ae7fd2f0c5661560b15be86614c2c4d41b21205 (diff)
downloadsandcrawler-67755e366bcc1df455a9d75710a11030c3e2cc52.tar.gz
sandcrawler-67755e366bcc1df455a9d75710a11030c3e2cc52.zip
move HBase schema and notes from journal-infra repo
Diffstat (limited to 'hbase/schema_design.md')
-rw-r--r--hbase/schema_design.md71
1 files changed, 71 insertions, 0 deletions
diff --git a/hbase/schema_design.md b/hbase/schema_design.md
new file mode 100644
index 0000000..67a940f
--- /dev/null
+++ b/hbase/schema_design.md
@@ -0,0 +1,71 @@
+
+## PDF Table
+
+Table name: `wbgrp-journal-extract-<version>-<env>`
+
+Eg: `wbgrp-journal-extract-0-prod`
+
+Key is the sha1 of the file, as raw bytes (20 bytes).
+
+Could conceivably need to handle, eg, postscript files, JATS XML, or even HTML
+in the future? If possible be filetype-agnostic, but only "fulltext" file types
+will end up in here, and don't bend over backwards.
+
+Keep only a single version (do we need `VERSIONS => 1`, or is 1 the default?)
+
+IMPORTANT: column names should be unique across column families. Eg, should not
+have both `grobid0:status` and `match0:status`. HBase and some client libraries
+don't care, but some map/reduce frameworks (eg, Scalding) can have name
+collisions. Differences between "orthogonal" columns *might* be OK (eg,
+`grobid0:status` and `grobid1:status`).
+
+Column families:
+
+- `key`: sha1 of the file in base32 (not a column or column family)
+- `f`: heritrix HBaseContentDigestHistory de-dupe
+ - `c`: (json string)
+ - `u`: original URL (required)
+ - `d`: original date (required; ISO 8601:1988)
+ - `f`: warc filename (recommend)
+ - `o`: warc offset (recommend)
+ - `c`: dupe count (optional)
+ - `i`: warc record ID (optional)
+- `file`: crawl and file metadata
+ - `size` (uint64), uncompressed (not in CDX)
+ - `mime` (string; might do postscript in the future; normalized)
+ - `cdx` (json string) with all as strings
+ - `surt`
+ - `url`
+ - `dt`
+ - `warc` (item and file name)
+ - `offset`
+ - `c_size` (compressed size)
+- `grobid0`: processing status, version, XML and JSON fulltext, JSON metadata. timestamp. Should be compressed! `COMPRESSION => SNAPPY`
+ - `status_code` (signed int; HTTP status from grobid)
+ - `quality` (int or string; we define the meaning ("good"/"marginal"/"bad")
+ - `status` (json string from grobid)
+ - `tei_xml` (xml string from grobid)
+ - `tei_json` (json string with fulltext)
+ - `metadata` (json string with author, title, abstract, citations, etc)
+- `match0`: status of identification against "the catalog"
+ - `mstatus` (string; did it match?)
+ - `doi` (string)
+ - `minfo` (json string)
+
+Can add additional groups in the future for additional processing steps. For
+example, we might want to do first pass looking at files to see "is this a PDF
+or not", which out output some status (and maybe certainty).
+
+The Heritrix schema is fixed by the existing implementation. We could
+patch/extend heritrix to use the `file` schema in the future if we decide
+it's worth it. There are some important pieces of metadata missing, so at
+least to start I think we should keep `f` and `file` distinct. We could merge
+them later. `f` info will be populated by crawlers; `file` info should be
+populated when back-filling or processing CDX lines.
+
+If we wanted to support multiple CDX rows as part of the same row (eg, as
+alternate locations), we could use HBase's versions feature, which can
+automatically cap the number of versions stored.
+
+If we had enough RAM resources, we could store `f` (and maybe `file`) metadata
+in memory for faster access.