author | Bryan Newbold <bnewbold@archive.org> | 2018-04-07 00:55:02 +0000 |
---|---|---|
committer | Bryan Newbold <bnewbold@archive.org> | 2018-04-07 00:55:02 +0000 |
commit | 683844f6bb26d867ea6bd2fd89d7669ace45075a (patch) | |
tree | f013ac467116f5982409bf2a05eb4bf354dafecf | |
parent | 1b7f579a881777a8e6fe517e9ee860ff875fe51f (diff) | |
configs and README updates
-rw-r--r-- | TODO | 1 |
-rw-r--r-- | mapreduce/.pylintrc | 2 |
-rw-r--r-- | mapreduce/README.md | 25 |
-rw-r--r-- | mapreduce/mrjob.conf | 4 |
4 files changed, 27 insertions, 5 deletions
@@ -3,6 +3,7 @@ Will probably eventually refactor into top-level plus modules. Eg, "common"
 directory, "backfill" and "extraction" as sub-directories. Downside of this is
 single giant pipenv venv with all dependencies?

+- how to get argument (like --hbase-table) into mrjob.conf, or similar?
 - fix pig gitlab-ci tests (JAVA_HOME). also make fetch_deps *way* more quiet

 sentry:
diff --git a/mapreduce/.pylintrc b/mapreduce/.pylintrc
index 5dc3ce0..78e9e7f 100644
--- a/mapreduce/.pylintrc
+++ b/mapreduce/.pylintrc
@@ -7,4 +7,4 @@ include-ids=yes

 [MISCELLANEOUS]
 # List of note tags to take in consideration, separated by a comma.
-notes=FIXME,XXX
+notes=FIXME,XXX,DELETEME
diff --git a/mapreduce/README.md b/mapreduce/README.md
index 99dd4f9..ed4067e 100644
--- a/mapreduce/README.md
+++ b/mapreduce/README.md
@@ -16,7 +16,8 @@ Check test coverage with:
     pytest --cov --cov-report html
     # open ./htmlcov/index.html in a browser

-TODO: GROBID and HBase during development?
+TODO: Persistant GROBID and HBase during development? Or just use live
+resources?

 ## Extraction Task

@@ -26,9 +27,25 @@ running on a devbox and GROBID running on a dedicated machine:
     ./extraction_cdx_grobid.py \
         --hbase-table wbgrp-journal-extract-0-qa \
         --hbase-host bnewbold-dev.us.archive.org \
-        --grobid-uri http://wbgrp-svc096.us.archive.org:8070
+        --grobid-uri http://wbgrp-svc096.us.archive.org:8070 \
         tests/files/example.cdx

+Running from the cluster:
+
+    # Create tarball of virtualenv
+    pipenv shell
+    export VENVSHORT=`basename $VIRTUAL_ENV`
+    tar -czf $VENVSHORT.tar.gz -C /home/bnewbold/.local/share/virtualenvs/$VENVSHORT .
+
+    ./extraction_cdx_grobid.py \
+        --hbase-table wbgrp-journal-extract-0-qa \
+        --hbase-host bnewbold-dev.us.archive.org \
+        --grobid-uri http://wbgrp-svc096.us.archive.org:8070 \
+        -r hadoop \
+        -c mrjob.conf \
+        --archive $VENVSHORT.tar.gz#venv \
+        hdfs:///user/bnewbold/journal_crawl_cdx/citeseerx_crawl_2017.cdx
+
 ## Backfill Task

 An example actually connecting to HBase from a local machine, with thrift
@@ -36,7 +53,7 @@ running on a devbox:

     ./backfill_hbase_from_cdx.py \
         --hbase-table wbgrp-journal-extract-0-qa \
-        --hbase-host bnewbold-dev.us.archive.org
+        --hbase-host bnewbold-dev.us.archive.org \
         tests/files/example.cdx

 Actual invocation to run on Hadoop cluster (running on an IA devbox, where
@@ -52,5 +69,5 @@ hadoop environment is configured):
         --hbase-table wbgrp-journal-extract-0-qa \
         -r hadoop \
         -c mrjob.conf \
-        --archive $VENVSHORT#venv \
+        --archive $VENVSHORT.tar.gz#venv \
         hdfs:///user/bnewbold/journal_crawl_cdx/citeseerx_crawl_2017.cdx
diff --git a/mapreduce/mrjob.conf b/mapreduce/mrjob.conf
index cb286f1..66724cb 100644
--- a/mapreduce/mrjob.conf
+++ b/mapreduce/mrjob.conf
@@ -1,4 +1,8 @@
 runners:
   hadoop:
+    no_output: true
+    upload_files:
+      - common.py
+      - grobid2json.py
     setup:
       - export PYTHONPATH=$PYTHONPATH:venv/lib/python3.5/site-packages/
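
Note on the tarball step: the README change packs the virtualenv with `tar -czf $VENVSHORT.tar.gz -C <venv dir> .`, so the archive root holds `lib/...` directly and the `#venv` alias plus the `venv/lib/python3.5/site-packages/` path in mrjob.conf line up. A minimal Python sketch of the same packing, assuming only the standard library; `make_venv_tarball` is a hypothetical helper for illustration and is not part of this repository:

```python
import os
import tarfile


def make_venv_tarball(venv_dir, out_path=None):
    """Pack the contents of venv_dir into a gzipped tarball.

    Roughly mirrors `tar -czf NAME.tar.gz -C venv_dir .`: entries are stored
    relative to venv_dir, so unpacking under an alias such as `venv` gives
    venv/lib/... rather than venv/<venv name>/lib/...
    """
    venv_dir = os.path.abspath(venv_dir)
    if out_path is None:
        # Same naming idea as VENVSHORT in the README: basename of the venv path.
        # Assumption: the current directory is not inside venv_dir.
        out_path = os.path.basename(venv_dir) + ".tar.gz"
    with tarfile.open(out_path, "w:gz") as tar:
        # arcname="." places the venv contents at the archive root
        tar.add(venv_dir, arcname=".")
    return out_path
```

The resulting file would be passed to the job the same way as in the README, via `--archive <name>.tar.gz#venv`.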