author Bryan Newbold <bnewbold@archive.org> 2019-04-12 13:50:55 -0700
committer Bryan Newbold <bnewbold@archive.org> 2019-04-12 14:19:29 -0700
commit b23455fcc90416be370c4396c1f1e4bbe36b93d6
tree b8ed69b91c2dffc9b3331cb68125c872dbd2b292
parent d93ebaa691f8b200a5761850b4533a153cb457ee
update TODO
-rw-r--r-- TODO 23
1 files changed, 22 insertions, 1 deletions
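
Note on the "monitoring/alerting of consumergroup offsets" TODO item added in this commit: one possible starting point for the "crude python script" is sketched here. It is not taken from the sandcrawler code. It assumes the committed offsets and the log end offsets have already been fetched by whichever Kafka client is in use, and it only does the lag arithmetic. The function names, the threshold, and the topic names in the example are hypothetical.

```python
from typing import Dict, List, Tuple

# (topic name, partition number); the names used below are hypothetical.
Partition = Tuple[str, int]


def compute_lag(
    committed: Dict[Partition, int],
    end_offsets: Dict[Partition, int],
) -> Dict[Partition, int]:
    """Return per-partition lag (end offset minus committed offset).

    Assumption: both dicts were fetched elsewhere with a Kafka client;
    this function talks to no broker.
    """
    lag = {}
    for tp, end in end_offsets.items():
        offset = committed.get(tp)
        if offset is None or offset < 0:
            # No committed offset (for example, after a consumer group
            # reset): count everything up to the end offset as
            # unconsumed, which is an upper bound.
            lag[tp] = end
        else:
            # Clamp at zero in case the two snapshots are slightly stale.
            lag[tp] = max(end - offset, 0)
    return lag


def partitions_over_threshold(
    lag: Dict[Partition, int], threshold: int
) -> List[Partition]:
    """Return, in sorted order, the partitions whose lag exceeds threshold."""
    return sorted(tp for tp, n in lag.items() if n > threshold)


if __name__ == "__main__":
    # Hypothetical snapshot: partition 1 has no committed offset at all.
    committed = {("example-topic", 0): 90}
    end_offsets = {("example-topic", 0): 100, ("example-topic", 1): 50}
    lag = compute_lag(committed, end_offsets)
    for tp in partitions_over_threshold(lag, threshold=20):
        print("ALERT: lag", lag[tp], "on", tp)
```

Keeping the arithmetic separate from the broker calls means the alerting rule can be tested without a running cluster, and the client library can be swapped later without touching it.
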
diff --git a/TODO b/TODO
index 1f1c2b9..77b48c9 100644
--- a/TODO
+++ b/TODO
@@ -1,4 +1,25 @@
+## Kafka Pipelines
+
+- after network split, mass restarting import/harvest stuff seemed to
+ completely reset consumergroups (!). bunch of LeaderNotFoundError
+ => change/update consumer group config
+ => ensure we are recording timestamps to allow timestamp-based resets
+- refactor python kafka clients (slack convo with kenji+dvd)
+ => try librdkafka?
+ => switch to python-kafka?
+- monitoring/alerting of consumergroup offsets
+ => start with crude python script?
+- document: need to restart all consumers after brokers restart
+- operate on batches, using threads/async, and reduce worker (process) counts
+ dramatically
+
+source of kafka-manager weirdness?
+ Dec 02 01:05:40 wbgrp-svc263.us.archive.org kafka-manager[7032]: org.apache.kafka.common.protocol.types.SchemaException: Error reading field 'user_data': java.nio.BufferUnderflowException
+ Dec 02 01:05:40 wbgrp-svc263.us.archive.org kafka-manager[7032]: [error] k.m.a.c.KafkaManagedOffsetCache - Failed to get member metadata from group summary and member summary : grobid-hbase-insert : MemberSummary(pykafka-8128e0be-4952-4e79-8644-a52987421259,pykafka,/207.241.225.228,[B@6c368f37,[B@2b007e01)
+
+## Other
+
- paper match heuristic: include 10.1007%2F978-3-319-49304-6_18 (URL-escaped slash)
- catch EOFFail fetching from wayback
- "author counts match" in scoring
@@ -8,7 +29,7 @@
=> python; talks directly to HBase
- author counts should match (+/- one?)
-match strategies (hbase columns)
+match strategies (hbase columns):
- legacy_doi
- url_doi
- grobid_crossref (doi)
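
Note on the "paper match heuristic" TODO item above (DOI with a URL-escaped slash): a minimal sketch of one way to handle that case, assuming it is acceptable to percent-decode the whole input before looking for a DOI. The regex is a loose approximation of DOI syntax, not the matching code used in this repository, and `extract_doi` and the example.org URL are hypothetical.

```python
import re
from typing import Optional
from urllib.parse import unquote

# Loose DOI pattern (an assumption, not a full DOI grammar):
# "10.", a 4-9 digit registrant code, a slash, then a non-space suffix.
DOI_RE = re.compile(r"10\.\d{4,9}/[^\s\"<>]+")


def extract_doi(text: str) -> Optional[str]:
    """Return the first DOI-like string in text, or None if there is none.

    The input is percent-decoded first, so "10.1007%2F978-..." is
    treated the same as "10.1007/978-...".
    """
    decoded = unquote(text)
    match = DOI_RE.search(decoded)
    if match is None:
        return None
    # Heuristic: drop trailing punctuation that usually belongs to the
    # surrounding sentence rather than to the DOI itself.
    return match.group(0).rstrip(".,;")


if __name__ == "__main__":
    print(extract_doi("10.1007%2F978-3-319-49304-6_18"))
    print(extract_doi("https://example.org/chapter/10.1007%2F978-3-319-49304-6_18"))
```

Decoding before matching keeps a single pattern for both the escaped and unescaped forms, at the cost of also decoding any other percent-escapes that happen to appear in the input.
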