-rw-r--r-- TODO 8
-rw-r--r-- mapreduce/TODO 6
-rwxr-xr-x mapreduce/backfill_hbase_from_cdx.py 7
3 files changed, 3 insertions, 18 deletions
diff --git a/TODO b/TODO
index e998728..c52ab17 100644
--- a/TODO
+++ b/TODO
@@ -1,13 +1,7 @@
-Will probably eventually refactor into top-level plus modules. Eg, "common"
-directory, "backfill" and "extraction" as sub-directories. Downside of this is
-single giant pipenv venv with all dependencies?
-
 - how to get argument (like --hbase-table) into mrjob.conf, or similar?
 - fix pig gitlab-ci tests (JAVA_HOME). also make fetch_deps *way* more quiet
-
-sentry:
-- https://github.com/getsentry/raven-python
+- sentry: https://github.com/getsentry/raven-python
 potential helpers:
 - https://github.com/martinblech/xmltodict
diff --git a/mapreduce/TODO b/mapreduce/TODO
index 3459752..4f4db16 100644
--- a/mapreduce/TODO
+++ b/mapreduce/TODO
@@ -1,6 +1,4 @@
-- better test coverage (actually check coverage!)
-- use pre-mapper command to filter down, eg, by status type?
+- quality scoring (of JSON output)
+- use pre-mapper `grep` command to filter down, eg, by status?
 - automation/docs for bundling virtualenv along
 - think about speedups
-- abstract CDX line reading and HBase stuff out into a common library
-- actual GROBID_SERVER="http://wbgrp-svc096.us.archive.org:8070"
diff --git a/mapreduce/backfill_hbase_from_cdx.py b/mapreduce/backfill_hbase_from_cdx.py
index 72331b0..6b2ec0b 100755
--- a/mapreduce/backfill_hbase_from_cdx.py
+++ b/mapreduce/backfill_hbase_from_cdx.py
@@ -7,13 +7,6 @@ formats.
 Requires:
 - happybase
 - mrjob
-
-TODO:
-- argparse
-- refactor into an object
-- tests in separate file
-- nose tests
-- sentry integration for error reporting
 """
 import json