# Maintenance Notes

Possible maintenance improvements:

* [ ] have one code path for continuous and batch processing
* [ ] limit functionality in custom binaries (e.g. `skate-*`)
* [ ] push cleanup code upstream (data source, or some preprocessing)
* [ ] better documentation

## Continuous Update Ideas

Currently, we derive the graph from raw data blobs, e.g. references, the
fatcat database, an Open Library database dump, or a Wikipedia dump.

The goal would be to start a service and let the graph index (or whatever
data store we use) be updated as new data arrives.
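A minimal sketch of such a service loop, assuming (this is not from the
notes) that new publications arrive as JSON lines on stdin, e.g. from a
harvester or queue consumer; `handle_publication` is a hypothetical name and
only a placeholder here, the actual steps are outlined in the list and
sketch below.

```python
# Sketch of a continuous ingest loop; input format and handler name are
# assumptions for illustration, not the actual implementation.

import json
import sys


def handle_publication(pub: dict) -> None:
    """Placeholder for the steps below (reference extraction, candidate
    lookup, verification, index update, reference store append)."""
    pass


def main() -> None:
    # Each line is one newly arrived publication record.
    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        handle_publication(json.loads(line))


if __name__ == "__main__":
    main()
```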

For example:

1. a new publication (P) arrives
2. it references articles, web pages, books, etc.; we can get this information from the data itself or from GROBID
3. we look up the title of P in some existing data store; we look up the
   normalized title in some normalized data store; we could just exact or fuzzy
   match against Elasticsearch; we generate match candidates, e.g. for where all
   the references live (here: batch requires high performance, whereas
   continuous would be on the order of 100K per day); see the sketch after this list
4. we verify matches (here: batch needs to be fast again; 1M/min or the like)
5. we update the index and add new edges between documents
6. we add all references found to the "reference store"
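
A rough sketch of steps 3-6, filling in the `handle_publication` placeholder
from the loop above. Everything concrete here is an assumption for
illustration: the Elasticsearch endpoint and `releases` index, the
`title`/`ident` fields, the similarity-ratio verification, and the
dict/JSON-lines stand-ins for the graph index and the reference store.

```python
# Hedged sketch of candidate lookup, verification, edge update, and the
# reference store; not the project's actual code.

import json
import re
from difflib import SequenceMatcher

import requests

ES = "http://localhost:9200"   # assumed Elasticsearch endpoint
INDEX = "releases"             # hypothetical index name


def normalize_title(title: str) -> str:
    """Lowercase, drop punctuation, collapse whitespace (step 3: normalized lookup)."""
    title = re.sub(r"[^a-z0-9 ]", " ", title.lower())
    return re.sub(r"\s+", " ", title).strip()


def match_candidates(title: str, size: int = 10) -> list:
    """Step 3: exact or fuzzy match against Elasticsearch, returning candidates."""
    query = {
        "query": {"match": {"title": {"query": normalize_title(title), "fuzziness": "AUTO"}}},
        "size": size,
    }
    resp = requests.post(f"{ES}/{INDEX}/_search", json=query, timeout=10)
    resp.raise_for_status()
    return [hit["_source"] for hit in resp.json()["hits"]["hits"]]


def verify(ref_title: str, candidate: dict, threshold: float = 0.9) -> bool:
    """Step 4: accept a candidate if normalized titles are similar enough;
    the real verification would be stricter and needs to be fast in batch mode."""
    a, b = normalize_title(ref_title), normalize_title(candidate.get("title", ""))
    return SequenceMatcher(None, a, b).ratio() >= threshold


def handle_publication(pub: dict, edges: dict, refstore_path: str = "refs.jsonl") -> None:
    """Steps 5 and 6: add verified edges from P to matched documents and
    append every reference to the reference store."""
    with open(refstore_path, "a") as refstore:
        for ref in pub.get("refs", []):
            refstore.write(json.dumps(ref) + "\n")  # step 6: keep all references
            for cand in match_candidates(ref.get("title", "")):
                if verify(ref.get("title", ""), cand):
                    edges.setdefault(pub["ident"], set()).add(cand["ident"])  # step 5
```

For batch mode the same lookups would need to be bulked (e.g. via the
Elasticsearch multi-search API) rather than issued as one HTTP round trip per
reference; the continuous path at ~100K/day could live with per-reference
requests.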