# Maintenance Notes

Possible maintenance improvements:

* [ ] have one code path for continuous and batch processing
* [ ] limit functionality in custom binaries (e.g. `skate-*`)
* [ ] push cleanup code upstream (data source, or some preprocessing)
* [ ] better documentation

## Continuous Update Ideas

Currently, we derive the graph from raw data blobs, e.g. references, the fatcat database, an Open Library database dump, a Wikipedia dump. The goal would be to start a service and let the graph index (or whatever data store backs it) be updated as new data arrives. For example:

1. A new publication (P) arrives.
2. It references articles, web pages, books, etc.; we can get this information from the data itself or from GROBID.
3. We look up the title of P in some existing data store and the normalized title in some normalized data store; we could just exact or fuzzy match against Elasticsearch to generate match candidates, e.g. from wherever all the references live (here, batch requires high throughput, whereas continuous would be on the order of 100K lookups per day); see the lookup sketch below.
4. We verify the matches (here, batch again needs to be fast; 1M/min or the like); see the verification sketch below.
5. We update the index and add new edges between documents.
6. We add all references found to the "reference store".
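For step 3, a minimal lookup sketch, assuming an Elasticsearch index with an analyzed `title` field; the endpoint, index name (`fatcat_release`), and field names are assumptions for illustration, not the actual schema:

```python
import re

import requests

# Hypothetical endpoint and index name; the real index layout may differ.
ES_URL = "http://localhost:9200/fatcat_release/_search"


def normalize_title(title: str) -> str:
    """Lowercase, drop punctuation, trim whitespace."""
    return re.sub(r"[^a-z0-9 ]", "", title.lower()).strip()


def match_candidates(title: str, fuzzy: bool = True, size: int = 5) -> list:
    """Return candidate records whose title matches exactly or fuzzily."""
    match = {"query": normalize_title(title)}
    if fuzzy:
        match["fuzziness"] = "AUTO"  # tolerate small edit distances
    body = {"query": {"match": {"title": match}}, "size": size}
    resp = requests.post(ES_URL, json=body, timeout=10)
    resp.raise_for_status()
    return [hit["_source"] for hit in resp.json()["hits"]["hits"]]
```

An exact match against a separate normalized store could use a `term` query on a keyword field instead; the fuzzy path above covers the "exact or fuzzy" case in one query.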
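For step 4, a deliberately simplified verification sketch; it reuses `normalize_title` from above, and the record fields (`title`, `year`, `release_year`) and status labels are assumptions. A real verifier would check many more signals (DOI, authors, container, pages).

```python
def verify_match(ref: dict, candidate: dict) -> str:
    """Classify a (reference, candidate) pair into a match status.

    Field names are assumptions about the record shape.
    """
    if normalize_title(ref.get("title", "")) != normalize_title(candidate.get("title", "")):
        return "different"
    ry, cy = ref.get("year"), candidate.get("release_year")
    if ry and cy and abs(int(ry) - int(cy)) > 1:
        return "ambiguous"  # same title, but inconsistent publication years
    return "strong"
```

Since batch verification should run at something like 1M/min, this step should stay a pure in-memory computation: fetch candidates in bulk, then verify locally without further network round trips.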
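Finally, a sketch of how the first maintenance item ("one code path for continuous and batch processing") could look: a single per-publication function reused by both a batch driver and a stream consumer. It builds on the hypothetical `match_candidates` and `verify_match` above; `add_edge`, `store_reference`, and the `refs` field are placeholders for steps 5 and 6.

```python
def add_edge(source: dict, target: dict) -> None:
    """Placeholder for the index update (step 5)."""
    print("edge:", source.get("title"), "->", target.get("title"))


def store_reference(ref: dict) -> None:
    """Placeholder for the reference store insert (step 6)."""


def process_publication(pub: dict) -> None:
    """Run steps 2-6 for one publication; shared by batch and continuous."""
    for ref in pub.get("refs", []):
        for candidate in match_candidates(ref.get("title", "")):
            if verify_match(ref, candidate) == "strong":
                add_edge(pub, candidate)
        store_reference(ref)


def run_batch(pubs) -> None:
    """Batch mode: iterate a dump; a service would call
    process_publication per incoming message instead."""
    for pub in pubs:
        process_publication(pub)
```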