



The goal is to receive a continual stream of updates from both fatcat and SIM
scanning, index the updated content, and push the results into elasticsearch.


## Filtering and Affordances

The `updated` and `fetched` timestamps are not immediately necessary or
implemented, but they can be used to filter updates. For example, after
re-loading from a bulk entity dump, the update pipeline could be "rolled back"
to process only the fatcat (work) updates that come after the changelog index
the bulk dump is stamped with.
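
A minimal sketch of that changelog-based filter, assuming update requests are
plain dicts shaped like the Scholar Update Request message. The function name
`filter_after_dump` and its `dump_changelog` parameter are hypothetical and not
part of any existing code:

```python
from typing import Any, Dict, Iterable, Iterator


def filter_after_dump(
    requests: Iterable[Dict[str, Any]], dump_changelog: int
) -> Iterator[Dict[str, Any]]:
    """Yield only the update requests still needed after a bulk re-load.

    Work updates at or before the changelog index of the dump are already
    reflected in the re-loaded data, so they are dropped. Everything else
    (SIM updates, or work updates without a changelog index) passes through.
    """
    for req in requests:
        if req.get("type") == "fatcat_work":
            changelog = req.get("fatcat_changelog")
            if changelog is not None and changelog <= dump_changelog:
                continue
        yield req
```

Requests that do not carry a `fatcat_changelog` value are kept, on the
assumption that re-processing a document is cheaper than missing an update.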

At least in theory, the `fetched` timestamp could be used to prevent re-updates
of existing documents in the ES index.
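
One way this check might look, as a sketch only. It assumes the indexing side
can cheaply look up when the existing ES document was last indexed (for
example via `doc_index_ts`); the function name `should_reindex` is
hypothetical:

```python
from datetime import datetime
from typing import Optional


def should_reindex(
    fetched: Optional[datetime], existing_index_ts: Optional[datetime]
) -> bool:
    """Decide whether a freshly fetched document should replace the ES copy.

    If either timestamp is missing we cannot prove the ES copy is current,
    so the safe default is to index the document.
    """
    if fetched is None or existing_index_ts is None:
        return True
    # Only index when the fetched content is newer than what ES already has.
    return fetched > existing_index_ts
```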

The `doc_index_ts` timestamp in the ES index could be used in a future
fetch-and-reindex worker to select documents for re-indexing, or to delete
old/stale documents (eg, after SIM issue re-indexing if there were spurious
"page" type documents remaining).

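As an illustration, a future fetch-and-reindex worker could select stale
documents with an elasticsearch range filter on `doc_index_ts`. The sketch
below only builds the query body; the `doc_type` field name and the helper
name `stale_docs_query` are assumptions, not taken from the existing schema:

```python
from datetime import datetime
from typing import Any, Dict, List, Optional


def stale_docs_query(
    cutoff: datetime, doc_type: Optional[str] = None
) -> Dict[str, Any]:
    """Build an ES query body matching documents indexed before `cutoff`."""
    filters: List[Dict[str, Any]] = [
        {"range": {"doc_index_ts": {"lt": cutoff.isoformat()}}}
    ]
    if doc_type is not None:
        # Hypothetical field name; adjust to the real ES mapping.
        filters.append({"term": {"doc_type": doc_type}})
    return {"query": {"bool": {"filter": filters}}}
```

The same body could be passed to a search (to re-index) or to a
delete-by-query request (to remove spurious leftovers).
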
## Message Types

Scholar Update Request JSON
- `key`: str
- `type`: str
    - `fatcat_work`
    - `sim_issue`
- `updated`: datetime, UTC, of event resulting in this request
- `work_ident`: str (works)
- `fatcat_changelog`: int (works)
- `sim_item`: str (items)
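
The request fields above could be modeled roughly as follows. This is a sketch
using a stdlib dataclass; the class name, the validation, and the choice to
omit unset optional fields from the JSON are assumptions rather than settled
design:

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime
from typing import Optional

VALID_TYPES = ("fatcat_work", "sim_issue")


@dataclass
class ScholarUpdateRequest:
    key: str
    type: str
    updated: datetime  # UTC, time of the event resulting in this request
    work_ident: Optional[str] = None  # works only
    fatcat_changelog: Optional[int] = None  # works only
    sim_item: Optional[str] = None  # SIM items only

    def __post_init__(self) -> None:
        if self.type not in VALID_TYPES:
            raise ValueError(f"unknown update request type: {self.type}")

    def to_json(self) -> str:
        """Serialize to JSON, leaving out optional fields that are unset."""
        doc = {k: v for k, v in asdict(self).items() if v is not None}
        doc["updated"] = self.updated.isoformat()
        return json.dumps(doc, sort_keys=True)
```

A work update would then be constructed with `type="fatcat_work"` plus
`work_ident` and `fatcat_changelog`, and a SIM update with `type="sim_issue"`
plus `sim_item`.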

"Heavy Intermediate" JSON (existing schema)
- `key`
- `fetched`: Optional[datetime], UTC, when this doc was collected

Scholar Fulltext ES JSON (existing schema)


## Kafka Topics

- `fatcat-ENV.work-ident-updates`
    - 6x, long retention, key compaction
    - key: doc ident
- `scholar-ENV.sim-updates`
    - 6x, long retention, key compaction
    - key: doc ident
- `scholar-ENV.update-docs`
    - 12x, short retention (2 months?)
    - key: doc ident
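
Keying every topic by doc ident matters for the compacted topics: with key
compaction, Kafka eventually retains only the most recent message per key, so
a replay sees at most one pending update per document. The toy function below
imitates that end state in plain Python for illustration; it is not how Kafka
implements compaction, and `compact` is a hypothetical name:

```python
from typing import Dict, Iterable, List, Tuple, TypeVar

V = TypeVar("V")


def compact(messages: Iterable[Tuple[str, V]]) -> List[Tuple[str, V]]:
    """Keep only the latest value per key, ordered by last occurrence."""
    latest: Dict[str, V] = {}
    for key, value in messages:
        # Remove any older entry so the key moves to the end of the dict.
        latest.pop(key, None)
        latest[key] = value
    return list(latest.items())
```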

## Workers

- `scholar-fetch-docs-worker`
    - consumes fatcat and/or sim update requests, individually
    - constructs heavy intermediate
    - publishes to update-docs topic
- `scholar-index-docs-worker`
    - consumes updated "heavy intermediate" documents, in batches
    - transforms to elasticsearch schema
    - updates elasticsearch
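
The two worker loops can be sketched independently of any Kafka or
elasticsearch client by injecting the I/O as callables. All names here
(`run_fetch_worker`, `run_index_worker`, `fetch_doc`, `publish`, `transform`,
`bulk_index`, and the default batch size) are hypothetical placeholders for
whatever the real workers would use:

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List, Optional

Doc = Dict[str, Any]


def batched(items: Iterable[Doc], size: int) -> Iterator[List[Doc]]:
    """Group items into lists of at most `size`, flushing the remainder."""
    batch: List[Doc] = []
    for item in items:
        batch.append(item)
        if len(batch) >= size:
            yield batch
            batch = []
    if batch:
        yield batch


def run_fetch_worker(
    requests: Iterable[Doc],
    fetch_doc: Callable[[Doc], Optional[Doc]],
    publish: Callable[[str, Doc], None],
) -> None:
    """Handle update requests one at a time, publishing heavy intermediates."""
    for req in requests:
        doc = fetch_doc(req)
        if doc is not None:
            # Re-use the request key so the output stays keyed by doc ident.
            publish(req["key"], doc)


def run_index_worker(
    docs: Iterable[Doc],
    transform: Callable[[Doc], Doc],
    bulk_index: Callable[[List[Doc]], None],
    batch_size: int = 50,
) -> None:
    """Consume heavy intermediates in batches and bulk-update the index."""
    for batch in batched(docs, batch_size):
        bulk_index([transform(doc) for doc in batch])
```

Separating the loops from the clients keeps the per-message versus per-batch
behavior easy to test with plain lists before wiring in real consumers.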