Hadoop streaming map/reduce jobs written in Python using the mrjob library.

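For orientation, a minimal mrjob streaming job looks roughly like the
following. This is an illustrative sketch, not one of the jobs in this
directory, and it assumes the common 11-column CDX layout (HTTP status code
in the fifth field):

    from mrjob.job import MRJob

    class MRCountStatus(MRJob):
        """Count HTTP status codes in a CDX file (hypothetical example)."""

        def mapper(self, _, line):
            # CDX lines are space-separated; the fifth field is the HTTP status
            fields = line.split()
            if len(fields) >= 5:
                yield fields[4], 1

        def reducer(self, status, counts):
            yield status, sum(counts)

    if __name__ == '__main__':
        MRCountStatus.run()
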
## Development and Testing

System dependencies in addition to `../README.md`:

- `libjpeg-dev` (for wayback libraries)

Run the tests with:

    pipenv run pytest

Check test coverage with:

    pytest --cov --cov-report html
    # open ./htmlcov/index.html in a browser

TODO: Persistent GROBID and HBase during development? Or just use live
resources?

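Individual jobs can also be exercised in-process with mrjob's inline runner,
with no Hadoop cluster needed. A hypothetical sketch, assuming mrjob 0.6+ and
that `extraction_cdx_grobid.py` exposes a job class named
`MRExtractCdxGrobid`:

    from extraction_cdx_grobid import MRExtractCdxGrobid  # class name assumed

    def test_extraction_runs_inline():
        # Run the job in-process against the example CDX file; the real job
        # may also need its --hbase-* and --grobid-uri flags (or mocks)
        job = MRExtractCdxGrobid(['-r', 'inline', 'tests/files/example.cdx'])
        with job.make_runner() as runner:
            runner.run()
            output = list(job.parse_output(runner.cat_output()))
        assert output
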
## Extraction Task

An example run that connects to HBase from a local machine, with thrift
running on a devbox and GROBID running on a dedicated machine:

    ./extraction_cdx_grobid.py \
        --hbase-table wbgrp-journal-extract-0-qa \
        --hbase-host wbgrp-svc263.us.archive.org \
        --grobid-uri http://wbgrp-svc096.us.archive.org:8070 \
        tests/files/example.cdx
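Per input record, the mapper conceptually fetches a PDF, posts it to GROBID,
and writes the TEI-XML result into HBase. A minimal sketch of that flow using
`requests` and `happybase`; the `grobid0:tei_xml` column is an assumption,
and the real logic (wayback fetching, error handling) lives in
`extraction_cdx_grobid.py`:

    import happybase
    import requests

    def extract_and_store(row_key, pdf_bytes, grobid_uri, hbase_host, hbase_table):
        # POST the PDF to GROBID's standard fulltext extraction endpoint
        resp = requests.post(
            grobid_uri + '/api/processFulltextDocument',
            files={'input': pdf_bytes})
        resp.raise_for_status()
        # Write the TEI-XML result to HBase through the thrift gateway
        conn = happybase.Connection(hbase_host)
        try:
            conn.table(hbase_table).put(row_key, {b'grobid0:tei_xml': resp.content})
        finally:
            conn.close()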

Running the same job on the Hadoop cluster:

    # Create a tarball of the virtualenv
    export PIPENV_VENV_IN_PROJECT=1
    pipenv shell
    export VENVSHORT=`basename $VIRTUAL_ENV`
    tar -czf $VENVSHORT.tar.gz -C /home/bnewbold/.local/share/virtualenvs/$VENVSHORT .

    ./extraction_cdx_grobid.py \
        --hbase-table wbgrp-journal-extract-0-qa \
        --hbase-host wbgrp-svc263.us.archive.org \
        --grobid-uri http://wbgrp-svc096.us.archive.org:8070 \
        -r hadoop \
        -c mrjob.conf \
        --archive $VENVSHORT.tar.gz#venv \
        hdfs:///user/bnewbold/journal_crawl_cdx/citeseerx_crawl_2017.cdx
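The `mrjob.conf` passed via `-c` is not reproduced here. A minimal config
consistent with the `--archive ...#venv` trick above might look like the
following YAML (an assumption, not the repo's actual file):

    # Hypothetical mrjob.conf: activate the shipped virtualenv on each task node
    runners:
      hadoop:
        setup:
          - . venv/bin/activate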

## Backfill Task

An example run that connects to HBase from a local machine, with thrift
running on a devbox:

    ./backfill_hbase_from_cdx.py \
        --hbase-table wbgrp-journal-extract-0-qa \
        --hbase-host wbgrp-svc263.us.archive.org \
        tests/files/example.cdx
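Conceptually, the backfill just parses each CDX line and writes one HBase row
per capture. A minimal sketch, again assuming the common 11-column CDX
layout; the `file:` column family and row-key scheme are assumptions (see
`backfill_hbase_from_cdx.py` for the real schema):

    import happybase

    def backfill_line(table, cdx_line):
        # Fields: urlkey timestamp url mimetype status digest ... warc-filename
        fields = cdx_line.split()
        row_key = '{}-{}'.format(fields[0], fields[1]).encode('utf-8')
        table.put(row_key, {
            b'file:mime': fields[3].encode('utf-8'),
            b'file:cdx': cdx_line.encode('utf-8'),
        })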

An actual invocation to run on the Hadoop cluster (launched from an IA devbox
where the Hadoop environment is configured):

    # Create a tarball of the virtualenv
    export PIPENV_VENV_IN_PROJECT=1
    pipenv install --deploy
    tar -czf venv-current.tar.gz -C .venv .

    ./backfill_hbase_from_cdx.py \
        --hbase-host wbgrp-svc263.us.archive.org \
        --hbase-table wbgrp-journal-extract-0-qa \
        -r hadoop \
        -c mrjob.conf \
        --archive venv-current.tar.gz#venv \
        hdfs:///user/bnewbold/journal_crawl_cdx/citeseerx_crawl_2017.cdx