Diffstat (limited to 'TODO')
-rw-r--r--  TODO  57
1 file changed, 43 insertions(+), 14 deletions(-)
diff --git a/TODO b/TODO
index 5e9220b..77b48c9 100644
--- a/TODO
+++ b/TODO
@@ -1,22 +1,51 @@
+## Kafka Pipelines
+
+- after the network split, mass-restarting import/harvest stuff seemed to
+  completely reset consumer groups (!); saw a bunch of LeaderNotFoundError
+  => change/update consumer group config
+  => ensure we are recording timestamps to allow timestamp-based resets
+     (see offsets_at_timestamp() in the monitoring sketch below)
+- refactor python kafka clients (slack convo with kenji+dvd)
+  => try librdkafka?
+  => switch to kafka-python?
+- monitoring/alerting of consumer group offsets
+  => start with crude python script? (monitoring sketch below, after the
+     kafka-manager note)
+- document: need to restart all consumers after brokers restart
+- operate on batches, using threads/async, and reduce worker (process)
+  counts dramatically (rough sketch just below)
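+
+rough sketch (untested) of the batch+threads idea above: poll record
+batches and fan them out to a thread pool, so one consumer process can
+replace many single-record workers. assumes kafka-python; broker, topic,
+and process_record() here are placeholders, not our actual code:
+
+```python
+from concurrent.futures import ThreadPoolExecutor
+from kafka import KafkaConsumer
+
+def process_record(record):
+    # stand-in for the real per-message work (grobid, hbase insert, etc)
+    pass
+
+def run_worker(topic, group_id):
+    consumer = KafkaConsumer(
+        topic,
+        bootstrap_servers="localhost:9092",  # placeholder broker
+        group_id=group_id,
+        enable_auto_commit=False)
+    with ThreadPoolExecutor(max_workers=8) as pool:
+        while True:
+            # poll() returns {TopicPartition: [records]}
+            batches = consumer.poll(timeout_ms=1000, max_records=100)
+            for _tp, records in batches.items():
+                # process the whole batch concurrently
+                list(pool.map(process_record, records))
+            # commit offsets only once the batch is fully processed
+            consumer.commit()
+```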
+
+source of kafka-manager weirdness?
+ Dec 02 01:05:40 wbgrp-svc263.us.archive.org kafka-manager[7032]: org.apache.kafka.common.protocol.types.SchemaException: Error reading field 'user_data': java.nio.BufferUnderflowException
+ Dec 02 01:05:40 wbgrp-svc263.us.archive.org kafka-manager[7032]: [error] k.m.a.c.KafkaManagedOffsetCache - Failed to get member metadata from group summary and member summary : grobid-hbase-insert : MemberSummary(pykafka-8128e0be-4952-4e79-8644-a52987421259,pykafka,/207.241.225.228,[B@6c368f37,[B@2b007e01)
+
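+rough sketch of the "crude python script" for consumer group offset
+monitoring, plus an offsets_for_times() lookup for the timestamp-based
+resets mentioned above (untested; assumes kafka-python, and the broker
+address is a placeholder):
+
+```python
+from kafka import KafkaAdminClient, KafkaConsumer, TopicPartition
+
+BROKERS = "localhost:9092"  # placeholder
+
+def print_group_lag(group_id):
+    # committed offsets per partition for this group
+    admin = KafkaAdminClient(bootstrap_servers=BROKERS)
+    committed = admin.list_consumer_group_offsets(group_id)
+    # current log-end offsets for the same partitions
+    consumer = KafkaConsumer(bootstrap_servers=BROKERS)
+    ends = consumer.end_offsets(list(committed.keys()))
+    for tp, meta in sorted(committed.items()):
+        lag = ends[tp] - meta.offset
+        print("{}[{}] lag={}".format(tp.topic, tp.partition, lag))
+        # alerting hook (threshold -> email/irc) would go here
+
+def offsets_at_timestamp(topic, timestamp_ms):
+    # earliest offset at-or-after the timestamp, per partition; seek()-ing
+    # a consumer to these offsets is the timestamp-based reset
+    consumer = KafkaConsumer(bootstrap_servers=BROKERS)
+    parts = [TopicPartition(topic, p)
+             for p in consumer.partitions_for_topic(topic)]
+    return consumer.offsets_for_times({tp: timestamp_ms for tp in parts})
+```
+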
+## Other
+
+- paper match heuristic: include 10.1007%2F978-3-319-49304-6_18
+  (URL-escaped slash; normalization sketch after this list)
+- catch EOFFail fetching from wayback
+- "author counts match" in scoring
+- refactor "scorable" to "matchable"
+- look at refactoring to reduce JSON serializations
+- QA tool for matches (PDF + Crossref JSON + landing page?)
+ => python; talks directly to HBase
+- author counts should match (+/- one?)
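+
+sketch of the URL-escaped-slash normalization for the paper match
+heuristic above (untested; the helper name is made up). unquote() turns
+"10.1007%2F978-3-319-49304-6_18" into "10.1007/978-3-319-49304-6_18":
+
+```python
+from urllib.parse import unquote
+
+def normalize_doi(raw):
+    # un-escape and lower-case; DOIs are case-insensitive
+    doi = unquote(raw.strip()).lower()
+    return doi if doi.startswith("10.") else None
+```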
+
+match strategies (hbase columns):
+- legacy_doi
+- url_doi
+- grobid_crossref (doi)
+- grobid_fatcat (fatcat ID)
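+
+rough sketch of recording one of these matches via happybase; the table
+name, column family, and row key scheme here are placeholders, not
+necessarily the real schema:
+
+```python
+import happybase
+
+conn = happybase.Connection("hbase-host")  # placeholder host
+table = conn.table("journal-extract")      # placeholder table name
+
+def record_match(sha1_hex, strategy, value):
+    # strategy: legacy_doi / url_doi / grobid_crossref / grobid_fatcat
+    col = "match:{}".format(strategy).encode("utf-8")
+    table.put(sha1_hex.encode("utf-8"), {col: value.encode("utf-8")})
+```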
+
+scalding:
+- better JSON library
+- less verbose sbt test output (set log level to WARN)
+- auto-formatting: addSbtPlugin("com.geirsson" % "sbt-scalafmt" % "1.6.0-RC3")
+
pig:
- potentially want to *not* de-dupe CDX lines by uniq sha1 in all cases; run
this as a second-stage filter? for example, may want many URL links in fatcat
  for a single file (different links, different policies); rough sketch
  after this list
+- fix pig gitlab-ci tests (JAVA_HOME)
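+
+python sketch (for illustration only; the real job would be pig) of the
+second-stage de-dupe idea above: key on (sha1, url) instead of sha1
+alone, so multiple URL links per file survive. CDX field positions here
+are placeholders:
+
+```python
+def dedupe_cdx(lines):
+    # keep one CDX line per (sha1, url) pair, not per sha1
+    seen = set()
+    for line in lines:
+        fields = line.split()
+        url, sha1 = fields[2], fields[5]  # placeholder field indexes
+        if (sha1, url) not in seen:
+            seen.add((sha1, url))
+            yield line
+```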
+
+python:
- include input file name (and chunk? and CDX?) in sentry context
  (raven sketch at the end of this diff)
-- play with test image on older releases (eg, trusty)
-
- how to get argument (like --hbase-table) into mrjob.conf, or similar?
-- fix pig gitlab-ci tests (JAVA_HOME). also make fetch_deps *way* more quiet
-- sentry: https://github.com/getsentry/raven-python
-
-potential helpers:
-- https://github.com/martinblech/xmltodict
-- https://github.com/trananhkma/fucking-awesome-python#text-processing
-- https://github.com/blaze/blaze (for catalog/analytics)
-- validation: https://github.com/pyeve/cerberus
-- testing (to replace nose):
- - https://github.com/CleanCut/green
- - pytest
- - mamba ("behavior driven")
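+
+rough sketch of the sentry-context item from the python list above
+(untested; assumes raven-python, and the field names are made up):
+
+```python
+from raven import Client
+
+client = Client()  # picks up DSN from the SENTRY_DSN env var
+
+def process_chunk(input_file, chunk_id):
+    # attach per-chunk context so any captured exception carries it
+    client.extra_context({"input_file": input_file, "chunk": chunk_id})
+    # ... actual per-chunk work here ...
+```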