move a bunch of top-level files/directories to ./extra/

author: Bryan Newbold <bnewbold@archive.org> 2022-12-23 15:52:02 -0800
committer: Bryan Newbold <bnewbold@archive.org> 2022-12-23 15:52:02 -0800
commit: f3a721a9dce8801b78f7bc31e88dc912b0ec1dba (patch)
tree: fdae9373e78671d0031f83045e6c76de9ad616cf /extra/hbase/howto.md
parent: 8c2c354a74064f2d66644af8f4e44d74bf322e1f (diff)
download: sandcrawler-f3a721a9dce8801b78f7bc31e88dc912b0ec1dba.tar.gz
sandcrawler-f3a721a9dce8801b78f7bc31e88dc912b0ec1dba.zip
1 files changed, 42 insertions, 0 deletions
diff --git a/extra/hbase/howto.md b/extra/hbase/howto.md
new file mode 100644
index 0000000..26d33f4
--- /dev/null
+++ b/extra/hbase/howto.md
@@ -0,0 +1,42 @@
+
+Commands can be run from any cluster machine that has the Hadoop environment
+config set up. Most of these commands are run from the HBase shell (start it
+with `hbase shell`). There is only one AIT/Webgroup HBase instance/namespace;
+there may be QA/prod tables, but there are no separate QA/prod clusters.
+
+## Create Table
+
+Create a table and its column families (not the individual columns) with something like:
+
+    create 'wbgrp-journal-extract-0-qa', 'f', 'file', {NAME => 'grobid0', COMPRESSION => 'snappy'}
+
+## Run Thrift Server Informally
+
+The Thrift server can technically be run from any old cluster machine that has
+Hadoop client stuff set up, using:
+
+    hbase thrift start -nonblocking -c
+
+Note that this will run version 0.96, while the actual HBase service seems to
+be running 0.98.
+
+To connect to a server started this way, use a matching happybase (Python) config:
+
+    import happybase
+    conn = happybase.Connection("bnewbold-dev.us.archive.org", transport="framed", protocol="compact")
+    conn.tables()  # test the connection by listing tables
+
+## Queries From Shell
+
+Fetch all columns for a single row:
+
+    hbase> get 'wbgrp-journal-extract-0-qa', 'sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ'
+
+Fetch all columns in specific column families (here `f` and `file`) for a single row:
+
+    hbase> get 'wbgrp-journal-extract-0-qa', 'sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ', 'f', 'file'
+
+Scan a fixed number of rows (here 5), all columns, starting at the first row
+key at or after a given value (here `sha1:A`):
+
+    hbase> scan 'wbgrp-journal-extract-0-qa',{LIMIT=>5,STARTROW=>'sha1:A'}
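
The shell queries above could probably also be issued from Python through the Thrift server, using the same happybase connection settings shown earlier. The following is a minimal sketch, not part of the committed how-to: the helper names `connect`, `fetch_row`, and `scan_rows` are hypothetical, and it assumes happybase's `Connection.table()`, `Table.row()`, and `Table.scan()` calls with row keys and column family names passed as bytes.

```python
def connect(host):
    """Open a happybase connection matching `hbase thrift start -nonblocking -c`.

    happybase is imported lazily so the query helpers below can be exercised
    without the library installed.
    """
    import happybase
    return happybase.Connection(host, transport="framed", protocol="compact")


def fetch_row(conn, table_name, row_key, families=None):
    """Rough equivalent of the shell `get`: fetch one row.

    If `families` is given (e.g. ["f", "file"]), only columns in those
    column families are requested; otherwise all columns are fetched.
    """
    table = conn.table(table_name)
    columns = [fam.encode("utf-8") for fam in families] if families else None
    return table.row(row_key.encode("utf-8"), columns=columns)


def scan_rows(conn, table_name, start_row, limit):
    """Rough equivalent of the shell `scan` with STARTROW and LIMIT.

    Returns a list of (row_key, columns_dict) tuples, at most `limit` long,
    starting at the first row key at or after `start_row`.
    """
    table = conn.table(table_name)
    return list(table.scan(row_start=start_row.encode("utf-8"), limit=limit))
```

With a Thrift server running as described, usage would look something like `conn = connect("bnewbold-dev.us.archive.org")`, then `fetch_row(conn, "wbgrp-journal-extract-0-qa", "sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ", ["f", "file"])` or `scan_rows(conn, "wbgrp-journal-extract-0-qa", "sha1:A", 5)`.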