We want to receive a continual stream of updates from both fatcat and SIM
scanning; index the updated content; and push it into elasticsearch.
## Filtering and Affordances
The `updated` and `fetched` timestamps are not immediately necessary or
implemented, but they can be used to filter updates. For example, after
re-loading from a bulk entity dump, the update pipeline could be "rolled back"
to process only fatcat (work) updates newer than the changelog index that the
bulk dump is stamped with.
At least in theory, the `fetched` timestamp could be used to prevent re-updates
of existing documents in the ES index.
The `doc_index_ts` timestamp in the ES index could be used in a future
fetch-and-reindex worker to select documents for re-indexing, or to delete
old/stale documents (e.g., after SIM issue re-indexing, if there were spurious
"page" type documents remaining).
## Message Types
Scholar Update Request JSON
- `key`: str
- `type`: str
  - `fatcat_work`
  - `sim_issue`
- `updated`: datetime, UTC, of event resulting in this request
- `work_ident`: str (works)
- `fatcat_changelog`: int (works)
- `sim_item`: str (items)
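A minimal sketch of a work-type update request built from the fields above, assuming ISO-8601 serialization for `updated` and dropping the fields that don't apply; the class name and example identifier values are made up:

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class UpdateRequest:
    key: str
    type: str                      # "fatcat_work" or "sim_issue"
    updated: str                   # ISO-8601 UTC datetime of triggering event
    work_ident: Optional[str] = None        # works only
    fatcat_changelog: Optional[int] = None  # works only
    sim_item: Optional[str] = None          # items only

# Example values are hypothetical
req = UpdateRequest(
    key="work_exampleident",
    type="fatcat_work",
    updated=datetime(2021, 1, 15, tzinfo=timezone.utc).isoformat(),
    work_ident="exampleident",
    fatcat_changelog=4_123_456,
)
# Drop unset optional fields before publishing
msg = json.dumps({k: v for k, v in asdict(req).items() if v is not None})
```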
"Heavy Intermediate" JSON (existing schema)
- `key`
- `fetched`: Optional[datetime], UTC, when this doc was collected
Scholar Fulltext ES JSON (existing schema)
## Kafka Topics
- `fatcat-ENV.work-ident-updates`
  - 6x, long retention, key compaction
  - key: doc ident
- `scholar-ENV.sim-updates`
  - 6x, long retention, key compaction
  - key: doc ident
- `scholar-ENV.update-docs`
  - 12x, short retention (2 months?)
  - key: doc ident
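The per-environment topic names can be derived from these templates with a trivial helper (a sketch; the helper name and the `prod` environment string are assumptions, not established conventions from this doc):

```python
def topic_name(template: str, env: str) -> str:
    """Expand the ENV placeholder in a topic template, e.g.
    'scholar-ENV.update-docs' with env='prod' -> 'scholar-prod.update-docs'."""
    return template.replace("ENV", env)

# Topic templates from the list above
TOPIC_TEMPLATES = {
    "work_updates": "fatcat-ENV.work-ident-updates",
    "sim_updates": "scholar-ENV.sim-updates",
    "update_docs": "scholar-ENV.update-docs",
}
prod_topics = {name: topic_name(t, "prod") for name, t in TOPIC_TEMPLATES.items()}
```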
## Workers
- `scholar-fetch-docs-worker`
  - consumes fatcat and/or sim update requests, individually
  - constructs heavy intermediate
  - publishes to update-docs topic
- `scholar-index-docs-worker`
  - consumes updated "heavy intermediate" documents, in batches
  - transforms to elasticsearch schema
  - updates elasticsearch
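The fetch-docs worker loop might look like the following sketch, with hypothetical stub functions standing in for the real fetch and publish code; it also stamps the `fetched` timestamp used for the staleness filtering discussed earlier:

```python
from datetime import datetime, timezone

def fetch_heavy_intermediate(request: dict) -> dict:
    """Stub: would fetch fatcat entities and/or SIM files for this request."""
    return {"key": request["key"], "source": request["type"]}

def run_fetch_worker(requests, publish):
    """Consume update requests individually, construct a heavy intermediate
    doc for each, and publish it (keyed by doc ident) to update-docs."""
    for request in requests:
        doc = fetch_heavy_intermediate(request)
        doc["fetched"] = datetime.now(timezone.utc).isoformat()
        publish(doc["key"], doc)

# Usage example with an in-memory "topic" instead of Kafka
published = []
run_fetch_worker(
    [{"key": "work_exampleident", "type": "fatcat_work"}],
    publish=lambda key, doc: published.append((key, doc)),
)
```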