author | Bryan Newbold <bnewbold@archive.org> | 2022-12-23 15:52:02 -0800 |
---|---|---|
committer | Bryan Newbold <bnewbold@archive.org> | 2022-12-23 15:52:02 -0800 |
commit | f3a721a9dce8801b78f7bc31e88dc912b0ec1dba (patch) | |
tree | fdae9373e78671d0031f83045e6c76de9ad616cf /extra/hbase/howto.md | |
parent | 8c2c354a74064f2d66644af8f4e44d74bf322e1f (diff) | |
move a bunch of top-level files/directories to ./extra/
Diffstat (limited to 'extra/hbase/howto.md')
-rw-r--r-- | extra/hbase/howto.md | 42 |
1 files changed, 42 insertions, 0 deletions
diff --git a/extra/hbase/howto.md b/extra/hbase/howto.md
new file mode 100644
index 0000000..26d33f4
--- /dev/null
+++ b/extra/hbase/howto.md
@@ -0,0 +1,42 @@
+
+Commands can be run from any cluster machine with hadoop environment config
+set up. Most of these commands are run from the shell (start with `hbase
+shell`). There is only one AIT/Webgroup HBase instance/namespace; there may be
+QA/prod tables, but there are not QA/prod clusters.
+
+## Create Table
+
+Create column families (note: not all individual columns) with something like:
+
+    create 'wbgrp-journal-extract-0-qa', 'f', 'file', {NAME => 'grobid0', COMPRESSION => 'snappy'}
+
+## Run Thrift Server Informally
+
+The Thrift server can technically be run from any old cluster machine that has
+Hadoop client stuff set up, using:
+
+    hbase thrift start -nonblocking -c
+
+Note that this will run version 0.96, while the actual HBase service seems to
+be running 0.98.
+
+To interact with this config, use happybase (python) config:
+
+    conn = happybase.Connection("bnewbold-dev.us.archive.org", transport="framed", protocol="compact")
+    # Test connection
+    conn.tables()
+
+## Queries From Shell
+
+Fetch all columns for a single row:
+
+    hbase> get 'wbgrp-journal-extract-0-qa', 'sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ'
+
+Fetch multiple columns for a single row, using column families:
+
+    hbase> get 'wbgrp-journal-extract-0-qa', 'sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ', 'f', 'file'
+
+Scan a fixed number of rows (here 5) starting at a specific key prefix, all
+columns:
+
+    hbase> scan 'wbgrp-journal-extract-0-qa',{LIMIT=>5,STARTROW=>'sha1:A'}
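
Not part of the commit: the shell queries in the howto can probably also be issued through the Thrift server with happybase. The following is a minimal sketch under that assumption. The helper names (`connect`, `get_row`, `scan_from`) are hypothetical, and the connection settings are copied from the howto's own happybase example; only `Connection`, `Table.row`, and `Table.scan` are happybase calls.

```python
def connect(host):
    """Open a connection matching `hbase thrift start -nonblocking -c`."""
    # happybase is a third-party package; it is imported lazily so that the
    # helpers below can be exercised without it installed.
    import happybase
    return happybase.Connection(host, transport="framed", protocol="compact")


def get_row(table, row_key, families=None):
    """Rough equivalent of: get '<table>', '<row_key>'[, 'f', 'file'].

    With families=None all columns are returned; otherwise only columns in
    the named column families.
    """
    columns = [f.encode("ascii") for f in families] if families else None
    return table.row(row_key.encode("ascii"), columns=columns)


def scan_from(table, start_row, limit=5):
    """Rough equivalent of: scan '<table>', {LIMIT => 5, STARTROW => '<start_row>'}."""
    return list(table.scan(row_start=start_row.encode("ascii"), limit=limit))
```

Assuming the Thrift server from the howto is reachable, usage would look like `table = connect("bnewbold-dev.us.archive.org").table("wbgrp-journal-extract-0-qa")`, followed by `get_row(table, "sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ", ["f", "file"])` or `scan_from(table, "sha1:A")`. Row keys and family names are passed as bytes, which happybase accepts on Python 3.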