# Maintenance Notes

Possible maintenance improvements:

* [ ] have one code path for continuous and batch processing
* [ ] limit functionality in custom binaries (e.g. `skate-*`)
* [ ] push cleanup code upstream (data source, or some preprocessing)
* [ ] better documentation

## Continuous Update Ideas

Currently, we derive the graph from raw data blobs, e.g. references, the fatcat
database, an Open Library database dump, and a Wikipedia dump.

The goal would be to start a service and let the graph index (or whatever data
store we use) be updated as new data arrives.

For example:

1. new publication (P) arrives
2. it references articles, web pages, books, etc.; we can get this information from the data or from GROBID
3. we look up the title on P in some existing data store; we look up the normalized
   title in some normalized data store; we could just exact or fuzzy match
   against elasticsearch; we generate match candidates, e.g. where all references live
4. we verify matches
5. we update the index and add new edges between documents
6. we add all references found into the "reference store"
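
The steps above could be sketched roughly as follows. This is a toy, in-memory illustration only, not how the current pipeline works: `GraphIndex`, `normalize_title`, `ingest` and the year-based `verify` check are all hypothetical names and choices made for this sketch. It also reads "the title on P" as the titles of the references found in P, and it replaces the elasticsearch lookup with a plain dictionary keyed by normalized title.

```python
import re
import unicodedata
from dataclasses import dataclass, field


def normalize_title(title: str) -> str:
    """Strip accents, lowercase, and collapse anything non-alphanumeric to a space."""
    decomposed = unicodedata.normalize("NFKD", title)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return re.sub(r"[^a-z0-9]+", " ", no_accents.lower()).strip()


@dataclass
class GraphIndex:
    """Hypothetical stand-in for the graph index plus the "reference store"."""

    docs: dict = field(default_factory=dict)       # doc id -> metadata
    by_title: dict = field(default_factory=dict)   # normalized title -> [doc ids]
    edges: set = field(default_factory=set)        # (citing id, cited id)
    reference_store: list = field(default_factory=list)

    def add_document(self, doc_id, title, year=None):
        self.docs[doc_id] = {"title": title, "year": year}
        self.by_title.setdefault(normalize_title(title), []).append(doc_id)

    def candidates(self, ref):
        # Step 3: exact match on the normalized title; a fuzzy match
        # (e.g. against a search index) would plug in here instead.
        return self.by_title.get(normalize_title(ref["title"]), [])

    def verify(self, ref, doc_id):
        # Step 4: assumed check, reject only if both years are known and differ.
        ref_year, doc_year = ref.get("year"), self.docs[doc_id]["year"]
        return ref_year is None or doc_year is None or ref_year == doc_year

    def ingest(self, doc_id, title, references, year=None):
        # Steps 1 and 2: a new publication arrives with its extracted references.
        for ref in references:
            for cand in self.candidates(ref):
                if self.verify(ref, cand):
                    self.edges.add((doc_id, cand))  # step 5: new edge
            # Step 6: keep every reference, matched or not.
            self.reference_store.append({"source": doc_id, **ref})
        self.add_document(doc_id, title, year)


index = GraphIndex()
index.add_document("a1", "Déjà Vu: A Study!", year=1999)
index.ingest(
    "p1",
    "New Paper",
    [{"title": "deja vu  a study", "year": 1999}, {"title": "Unknown Book"}],
)
```

Keeping unmatched references in the store means they can be re-checked later, when the document they point to eventually arrives.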