Diffstat (limited to 'TODO')
-rw-r--r--  TODO | 57 +++++++++++++++++++++++++++++++++++++++++++++--------------
1 file changed, 43 insertions(+), 14 deletions(-)
@@ -1,22 +1,51 @@
+## Kafka Pipelines
+
+- after network split, mass restarting import/harvest stuff seemed to
+  completely reset consumergroups (!). bunch of LeaderNotFoundError
+  => change/update consumer group config
+  => ensure we are recording timestamps to allow timestamp-based resets
+- refactor python kafka clients (slack convo with kenji+dvd)
+  => try librdkafka?
+  => switch to python-kafka?
+- monitoring/alerting of consumergroup offsets
+  => start with crude python script?
+- document: need to restart all consumers after brokers restart
+- operate on batches, using threads/async, and reduce worker (process) counts
+  dramatically
+
+source of kafka-manager weirdness?
+    Dec 02 01:05:40 wbgrp-svc263.us.archive.org kafka-manager[7032]: org.apache.kafka.common.protocol.types.SchemaException: Error reading field 'user_data': java.nio.BufferUnderflowException
+    Dec 02 01:05:40 wbgrp-svc263.us.archive.org kafka-manager[7032]: [error] k.m.a.c.KafkaManagedOffsetCache - Failed to get member metadata from group summary and member summary : grobid-hbase-insert : MemberSummary(pykafka-8128e0be-4952-4e79-8644-a52987421259,pykafka,/207.241.225.228,[B@6c368f37,[B@2b007e01)
+
+## Other
+
+- paper match heuristic: include 10.1007%2F978-3-319-49304-6_18 (URL-escaped slash)
+- catch EOFFail fetching from wayback
+- "author counts match" in scoring
+- refactor "scorable" to "matchable"
+- look at refactoring to reduce JSON serializations
+- QA tool for matches (PDF + Crossref JSON + landing page?)
+  => python; talks directly to HBase
+- author counts should match (+/- one?)
+
+match strategies (hbase columns):
+- legacy_doi
+- url_doi
+- grobid_crossref (doi)
+- grobid_fatcat (fatcat ID)
+
+scalding:
+- better JSON library
+- less verbose sbt test output (set log level to WARN)
+- auto-formatting: addSbtPlugin("com.geirsson" % "sbt-scalafmt" % "1.6.0-RC3")
+
 pig:
 - potentially want to *not* de-dupe CDX lines by uniq sha1 in all cases; run
   this as a second-stage filter? for example, may want many URL links in
   fatcat for a single file (different links, different policies)
+- fix pig gitlab-ci tests (JAVA_HOME)
+
 python:
 - include input file name (and chunk? and CDX?) in sentry context
-- play with test image on older releases (eg, trusty)
-
 - how to get argument (like --hbase-table) into mrjob.conf, or similar?
-- fix pig gitlab-ci tests (JAVA_HOME). also make fetch_deps *way* more quiet
-- sentry: https://github.com/getsentry/raven-python
-
-potential helpers:
-- https://github.com/martinblech/xmltodict
-- https://github.com/trananhkma/fucking-awesome-python#text-processing
-- https://github.com/blaze/blaze (for catalog/analytics)
-- validation: https://github.com/pyeve/cerberus
-- testing (to replace nose):
-  - https://github.com/CleanCut/green
-  - pytest
-  - mamba ("behavior driven")
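
For the "timestamp-based resets" item: once message timestamps are being recorded, kafka-python's `offsets_for_times()` can drive the reset. A minimal sketch, assuming kafka-python; broker, group, topic, and the one-hour window are placeholders:

```python
# Sketch only: timestamp-based consumer-group reset via kafka-python's
# offsets_for_times(). Assumes message timestamps are recorded (see the
# TODO item above); all names below are placeholders.
import time
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    bootstrap_servers="localhost:9092",  # placeholder broker
    group_id="example-group",            # placeholder group
    enable_auto_commit=False,
)
consumer.subscribe(["example-topic"])    # placeholder topic
while not consumer.assignment():
    consumer.poll(timeout_ms=500)        # poll until partitions are assigned

cutoff_ms = int((time.time() - 3600) * 1000)  # e.g. rewind one hour
offsets = consumer.offsets_for_times(
    {tp: cutoff_ms for tp in consumer.assignment()})
for tp, ot in offsets.items():
    if ot is not None:                   # None if no message after cutoff
        consumer.seek(tp, ot.offset)
```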
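For the "crude python script" monitoring idea, per-partition lag is just the end offset minus the committed offset. A sketch assuming kafka-python; the broker address is a placeholder and the group name is borrowed from the kafka-manager log above:

```python
# Sketch of the "crude python script" for consumer-group lag monitoring.
# Assumes kafka-python; BROKERS is a placeholder, the group name comes
# from the kafka-manager log line in the TODO.
from kafka import KafkaAdminClient, KafkaConsumer

BROKERS = "localhost:9092"        # placeholder
GROUPS = ["grobid-hbase-insert"]

def group_lag(group_id):
    admin = KafkaAdminClient(bootstrap_servers=BROKERS)
    committed = admin.list_consumer_group_offsets(group_id)
    consumer = KafkaConsumer(bootstrap_servers=BROKERS)
    ends = consumer.end_offsets(list(committed))
    return {tp: ends[tp] - meta.offset for tp, meta in committed.items()}

if __name__ == "__main__":
    for group in GROUPS:
        for tp, lag in sorted(group_lag(group).items()):
            print("%s %s[%d] lag=%d" % (group, tp.topic, tp.partition, lag))
```

Alerting could then be a cron job that exits non-zero (or pings somewhere) when lag exceeds a threshold.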
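One reading of the "operate on batches, using threads/async" item: poll in batches and fan records out to a thread pool, so a single process replaces many single-threaded workers. A hypothetical kafka-python sketch; all names are placeholders:

```python
# Hypothetical sketch: batch consumption with a thread pool instead of
# many worker processes. Assumes kafka-python; names are placeholders.
from concurrent.futures import ThreadPoolExecutor
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "example-topic",                     # placeholder topic
    group_id="example-group",            # placeholder group
    bootstrap_servers="localhost:9092",  # placeholder broker
    enable_auto_commit=False,            # commit manually, after each batch
)

def handle(record):
    pass  # per-record work: wayback fetch, HBase insert, etc.

with ThreadPoolExecutor(max_workers=16) as pool:
    while True:
        batch = consumer.poll(timeout_ms=1000, max_records=500)
        for records in batch.values():
            list(pool.map(handle, records))  # block until the batch is done
        if batch:
            consumer.commit()  # only commit once the whole batch succeeded
```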
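For the URL-escaped-slash case in the paper match heuristic: percent-decode DOIs before comparing, so `10.1007%2F978-3-319-49304-6_18` and its decoded form match. Stdlib-only sketch; the lowercasing assumes the matcher already treats DOIs case-insensitively:

```python
# Sketch: normalize URL-escaped DOIs before matching. Stdlib only;
# lowercasing is an assumption (DOIs are case-insensitive by spec).
from urllib.parse import unquote

def normalize_doi(raw):
    return unquote(raw).strip().lower()

assert (normalize_doi("10.1007%2F978-3-319-49304-6_18")
        == "10.1007/978-3-319-49304-6_18")
```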