TODO


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53


Note: as of 2022 this file is ancient and need review
 
## Kafka Pipelines

- after network split, mass restarting import/harvest stuff seemed to
  completely reset consumergroups (!). bunch of LeaderNotFoundError
    => change/update consumer group config
    => ensure we are recording timestamps to allow timestamp-based resets
- refactor python kafka clients (slack convo with kenji+dvd)
    => try librdkafka?
    => switch to python-kafka?
- monitoring/alerting of consumergroup offsets
    => start with crude python script?
- document: need to restart all consumers after brokers restart
- operate on batches, using threads/async, and reduce worker (process) counts
  dramatically

source of kafka-manager weirdness?
    Dec 02 01:05:40 wbgrp-svc263.us.archive.org kafka-manager[7032]: org.apache.kafka.common.protocol.types.SchemaException: Error reading field 'user_data': java.nio.BufferUnderflowException
    Dec 02 01:05:40 wbgrp-svc263.us.archive.org kafka-manager[7032]: [error] k.m.a.c.KafkaManagedOffsetCache - Failed to get member metadata from group summary and member summary : grobid-hbase-insert : MemberSummary(pykafka-8128e0be-4952-4e79-8644-a52987421259,pykafka,/207.241.225.228,[B@6c368f37,[B@2b007e01)

## Other

- paper match heuristic: include 10.1007%2F978-3-319-49304-6_18 (URL-escaped slash)
- catch EOFFail fetching from wayback
- "author counts match" in scoring
- refactor "scorable" to "matchable"
- look at refactoring to reduce JSON serializations
- QA tool for matches (PDF + Crossref JSON + landing page?)
    => python; talks directly to HBase
- author counts should match (+/- one?)

match strategies (hbase columns):
- legacy_doi
- url_doi
- grobid_crossref (doi)
- grobid_fatcat (fatcat ID)

scalding:
- better JSON library
- less verbose sbt test output (set log level to WARN)
- auto-formatting: addSbtPlugin("com.geirsson" % "sbt-scalafmt" % "1.6.0-RC3")

pig:
- potentially want to *not* de-dupe CDX lines by uniq sha1 in all cases; run
  this as a second-stage filter? for example, may want many URL links in fatcat
  for a single file (different links, different policies)
- fix pig gitlab-ci tests (JAVA_HOME)

python:
- include input file name (and chunk? and CDX?) in sentry context
- how to get argument (like --hbase-table) into mrjob.conf, or similar?