From a0be9706997182b18e48000375c462856aafc5ef Mon Sep 17 00:00:00 2001
From: Bryan Newbold
Date: Tue, 10 Apr 2018 19:13:43 -0700
Subject: TODO updates

---
 TODO                                 | 8 +-------
 mapreduce/TODO                       | 6 ++----
 mapreduce/backfill_hbase_from_cdx.py | 7 -------
 3 files changed, 3 insertions(+), 18 deletions(-)

diff --git a/TODO b/TODO
index e998728..c52ab17 100644
--- a/TODO
+++ b/TODO
@@ -1,13 +1,7 @@
-Will probably eventually refactor into top-level plus modules. Eg, "common"
-directory, "backfill" and "extraction" as sub-directories. Downside of this is
-single giant pipenv venv with all dependencies?
-
 - how to get argument (like --hbase-table) into mrjob.conf, or similar?
 - fix pig gitlab-ci tests (JAVA_HOME). also make fetch_deps *way* more quiet
-
-sentry:
-- https://github.com/getsentry/raven-python
+- sentry: https://github.com/getsentry/raven-python
 
 potential helpers:
 - https://github.com/martinblech/xmltodict
diff --git a/mapreduce/TODO b/mapreduce/TODO
index 3459752..4f4db16 100644
--- a/mapreduce/TODO
+++ b/mapreduce/TODO
@@ -1,6 +1,4 @@
-- better test coverage (actually check coverage!)
-- use pre-mapper command to filter down, eg, by status type?
+- quality scoring (of JSON output)
+- use pre-mapper `grep` command to filter down, eg, by status?
 - automation/docs for bundling virtualenv along
 - think about speedups
-- abstract CDX line reading and HBase stuff out into a common library
-- actual GROBID_SERVER="http://wbgrp-svc096.us.archive.org:8070"
diff --git a/mapreduce/backfill_hbase_from_cdx.py b/mapreduce/backfill_hbase_from_cdx.py
index 72331b0..6b2ec0b 100755
--- a/mapreduce/backfill_hbase_from_cdx.py
+++ b/mapreduce/backfill_hbase_from_cdx.py
@@ -7,13 +7,6 @@ formats.
 Requires:
 - happybase
 - mrjob
-
-TODO:
-- argparse
-- refactor into an object
-- tests in separate file
-- nose tests
-- sentry integration for error reporting
 """
 
 import json
-- 
cgit v1.2.3