path: root/mapreduce/README.md
Diffstat (limited to 'mapreduce/README.md')
-rw-r--r-- mapreduce/README.md 25
1 files changed, 21 insertions, 4 deletions
diff --git a/mapreduce/README.md b/mapreduce/README.md
index 99dd4f9..ed4067e 100644
--- a/mapreduce/README.md
+++ b/mapreduce/README.md
@@ -16,7 +16,8 @@ Check test coverage with:
pytest --cov --cov-report html
# open ./htmlcov/index.html in a browser
-TODO: GROBID and HBase during development?
+TODO: Persistent GROBID and HBase during development? Or just use live
+resources?
## Extraction Task
@@ -26,9 +27,25 @@ running on a devbox and GROBID running on a dedicated machine:
./extraction_cdx_grobid.py \
--hbase-table wbgrp-journal-extract-0-qa \
--hbase-host bnewbold-dev.us.archive.org \
- --grobid-uri http://wbgrp-svc096.us.archive.org:8070
+ --grobid-uri http://wbgrp-svc096.us.archive.org:8070 \
tests/files/example.cdx
+Running from the cluster:
+
+ # Create tarball of virtualenv
+ pipenv shell
+ export VENVSHORT=`basename $VIRTUAL_ENV`
+ tar -czf $VENVSHORT.tar.gz -C /home/bnewbold/.local/share/virtualenvs/$VENVSHORT .
+
+ ./extraction_cdx_grobid.py \
+ --hbase-table wbgrp-journal-extract-0-qa \
+ --hbase-host bnewbold-dev.us.archive.org \
+ --grobid-uri http://wbgrp-svc096.us.archive.org:8070 \
+ -r hadoop \
+ -c mrjob.conf \
+ --archive $VENVSHORT.tar.gz#venv \
+ hdfs:///user/bnewbold/journal_crawl_cdx/citeseerx_crawl_2017.cdx
+
## Backfill Task
An example actually connecting to HBase from a local machine, with thrift
@@ -36,7 +53,7 @@ running on a devbox:
./backfill_hbase_from_cdx.py \
--hbase-table wbgrp-journal-extract-0-qa \
- --hbase-host bnewbold-dev.us.archive.org
+ --hbase-host bnewbold-dev.us.archive.org \
tests/files/example.cdx
Actual invocation to run on Hadoop cluster (running on an IA devbox, where
@@ -52,5 +69,5 @@ hadoop environment is configured):
--hbase-table wbgrp-journal-extract-0-qa \
-r hadoop \
-c mrjob.conf \
- --archive $VENVSHORT#venv \
+ --archive $VENVSHORT.tar.gz#venv \
hdfs:///user/bnewbold/journal_crawl_cdx/citeseerx_crawl_2017.cdx
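
The `--archive $VENVSHORT.tar.gz#venv` fix in the last hunk depends on how `VENVSHORT` and the tarball are produced in the "Running from the cluster" steps. The following is a minimal sketch of that naming chain, not the author's actual procedure: it builds a throwaway directory in place of a real pipenv virtualenv, and the path and the name `mapreduce-AbCd1234` are hypothetical stand-ins for whatever `pipenv shell` exports as `$VIRTUAL_ENV`. As far as I understand mrjob's convention, the `#venv` suffix is the name the archive is unpacked under in each task's working directory.

```shell
# Throwaway working area; a real run would use the pipenv-managed virtualenv.
WORKDIR=$(mktemp -d)

# Hypothetical stand-in for the $VIRTUAL_ENV that `pipenv shell` exports.
VIRTUAL_ENV="$WORKDIR/virtualenvs/mapreduce-AbCd1234"
mkdir -p "$VIRTUAL_ENV/bin"
touch "$VIRTUAL_ENV/bin/python"

# Same derivation as in the README: short name is the last path component.
VENVSHORT=$(basename "$VIRTUAL_ENV")

# -C changes into the virtualenv first, so archive paths are relative to it.
tar -czf "$WORKDIR/$VENVSHORT.tar.gz" -C "$VIRTUAL_ENV" .

# The value passed to --archive: tarball file name plus the '#venv' alias.
ARCHIVE_ARG="$VENVSHORT.tar.gz#venv"

tar -tzf "$WORKDIR/$VENVSHORT.tar.gz"
echo "--archive $ARCHIVE_ARG"
```

Because the tarball is written as `$VENVSHORT.tar.gz`, passing the bare `$VENVSHORT#venv` (the pre-fix form) would point at a file that does not exist, which is what the final hunk corrects.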