author | Bryan Newbold <bnewbold@archive.org> | 2018-04-04 12:06:38 -0700 |
---|---|---|
committer | Bryan Newbold <bnewbold@archive.org> | 2018-04-04 12:06:38 -0700 |
commit | 1dad0d9e54bfae93eebea47f8a3cb291cdd645c5 (patch) | |
tree | 97a8c9bcaf93734e2dbd8f431d37213520b55fbd /mapreduce/TODO | |
parent | 427dd875958c8a6d2d791d55f9dda300ebdc853b (diff) | |
extraction -> mapreduce
Diffstat (limited to 'mapreduce/TODO')
-rw-r--r-- | mapreduce/TODO | 6 |
1 files changed, 6 insertions, 0 deletions
diff --git a/mapreduce/TODO b/mapreduce/TODO
new file mode 100644
index 0000000..3459752
--- /dev/null
+++ b/mapreduce/TODO
@@ -0,0 +1,6 @@
+- better test coverage (actually check coverage!)
+- use pre-mapper command to filter down, eg, by status type?
+- automation/docs for bundling virtualenv along
+- think about speedups
+- abstract CDX line reading and HBase stuff out into a common library
+- actual GROBID_SERVER="http://wbgrp-svc096.us.archive.org:8070"