configs and README updates

author: Bryan Newbold <bnewbold@archive.org> 2018-04-07 00:55:02 +0000
committer: Bryan Newbold <bnewbold@archive.org> 2018-04-07 00:55:02 +0000
commit: 683844f6bb26d867ea6bd2fd89d7669ace45075a (patch)
tree: f013ac467116f5982409bf2a05eb4bf354dafecf
parent: 1b7f579a881777a8e6fe517e9ee860ff875fe51f (diff)
download: sandcrawler-683844f6bb26d867ea6bd2fd89d7669ace45075a.tar.gz
sandcrawler-683844f6bb26d867ea6bd2fd89d7669ace45075a.zip
4 files changed, 27 insertions, 5 deletions
diff --git a/TODO b/TODO
index 5a5ae5f..e998728 100644
--- a/TODO
+++ b/TODO
@@ -3,6 +3,7 @@ Will probably eventually refactor into top-level plus modules. Eg, "common"
 directory, "backfill" and "extraction" as sub-directories. Downside of this is
 single giant pipenv venv with all dependencies?
 
+- how to get argument (like --hbase-table) into mrjob.conf, or similar?
 - fix pig gitlab-ci tests (JAVA_HOME). also make fetch_deps *way* more quiet
 
 sentry:
diff --git a/mapreduce/.pylintrc b/mapreduce/.pylintrc
index 5dc3ce0..78e9e7f 100644
--- a/mapreduce/.pylintrc
+++ b/mapreduce/.pylintrc
@@ -7,4 +7,4 @@ include-ids=yes
 
 [MISCELLANEOUS]
 # List of note tags to take in consideration, separated by a comma.
-notes=FIXME,XXX
+notes=FIXME,XXX,DELETEME
diff --git a/mapreduce/README.md b/mapreduce/README.md
index 99dd4f9..ed4067e 100644
--- a/mapreduce/README.md
+++ b/mapreduce/README.md
@@ -16,7 +16,8 @@ Check test coverage with:
     pytest --cov --cov-report html
     # open ./htmlcov/index.html in a browser
 
-TODO: GROBID and HBase during development?
+TODO: Persistent GROBID and HBase during development? Or just use live
+resources?
 
 ## Extraction Task
 
@@ -26,9 +27,25 @@ running on a devbox and GROBID running on a dedicated machine:
     ./extraction_cdx_grobid.py \
         --hbase-table wbgrp-journal-extract-0-qa \
         --hbase-host bnewbold-dev.us.archive.org \
-        --grobid-uri http://wbgrp-svc096.us.archive.org:8070
+        --grobid-uri http://wbgrp-svc096.us.archive.org:8070 \
         tests/files/example.cdx
 
+Running from the cluster:
+
+    # Create tarball of virtualenv
+    pipenv shell
+    export VENVSHORT=`basename $VIRTUAL_ENV`
+    tar -czf $VENVSHORT.tar.gz -C /home/bnewbold/.local/share/virtualenvs/$VENVSHORT .
+
+    ./extraction_cdx_grobid.py \
+        --hbase-table wbgrp-journal-extract-0-qa \
+        --hbase-host bnewbold-dev.us.archive.org \
+        --grobid-uri http://wbgrp-svc096.us.archive.org:8070 \
+        -r hadoop \
+        -c mrjob.conf \
+        --archive $VENVSHORT.tar.gz#venv \
+        hdfs:///user/bnewbold/journal_crawl_cdx/citeseerx_crawl_2017.cdx
+
 ## Backfill Task
 
 An example actually connecting to HBase from a local machine, with thrift
@@ -36,7 +53,7 @@ running on a devbox:
 
     ./backfill_hbase_from_cdx.py \
         --hbase-table wbgrp-journal-extract-0-qa \
-        --hbase-host bnewbold-dev.us.archive.org
+        --hbase-host bnewbold-dev.us.archive.org \
         tests/files/example.cdx
 
 Actual invocation to run on Hadoop cluster (running on an IA devbox, where
@@ -52,5 +69,5 @@ hadoop environment is configured):
         --hbase-table wbgrp-journal-extract-0-qa \
         -r hadoop \
         -c mrjob.conf \
-        --archive $VENVSHORT#venv \
+        --archive $VENVSHORT.tar.gz#venv \
         hdfs:///user/bnewbold/journal_crawl_cdx/citeseerx_crawl_2017.cdx
diff --git a/mapreduce/mrjob.conf b/mapreduce/mrjob.conf
index cb286f1..66724cb 100644
--- a/mapreduce/mrjob.conf
+++ b/mapreduce/mrjob.conf
@@ -1,4 +1,8 @@
 runners:
   hadoop:
+    no_output: true
+    upload_files:
+      - common.py
+      - grobid2json.py
     setup:
       - export PYTHONPATH=$PYTHONPATH:venv/lib/python3.5/site-packages/
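
The README hunks build the virtualenv tarball with `tar -czf ... -C <venv dir> .` and pass it as `--archive $VENVSHORT.tar.gz#venv`, while `mrjob.conf` adds `venv/lib/python3.5/site-packages/` to `PYTHONPATH`. For that path to resolve, `lib/python3.5/site-packages/` has to sit at the top level of the tarball. A minimal sketch of a pre-submit check follows; the helper name `check_venv_tarball` is hypothetical and not part of this repository.

```shell
#!/bin/sh
# Hypothetical helper (not part of sandcrawler): verify that a virtualenv
# tarball has the layout mrjob.conf expects once unpacked under "venv".
check_venv_tarball() {
    tarball="$1"
    # "tar -C <venv> ." stores members as "./lib/...", so strip the
    # leading "./" before matching against the expected prefix.
    if tar -tzf "$tarball" | sed 's|^\./||' \
            | grep -q '^lib/python3\.5/site-packages/'; then
        echo "ok: $tarball contains lib/python3.5/site-packages/"
    else
        echo "missing lib/python3.5/site-packages/ in $tarball" >&2
        return 1
    fi
}
```

Assuming the tarball was created as in the README, it could be run as `check_venv_tarball "$VENVSHORT.tar.gz"` before launching the job with `-r hadoop`.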