# Maintenance Notes
Possible maintenance improvements:
* [ ] have one code path for continuous and batch processing
* [ ] limit functionality in custom binaries (e.g. `skate-*`)
* [ ] push cleanup code upstream (data source, or some preprocessing)
* [ ] better documentation
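One way to get a single code path for continuous and batch processing is to write the core logic against a generic iterable of records, so a batch run feeds it lines from a file while a continuous run feeds it messages from a queue. A minimal sketch (all names here, like `process`, are hypothetical and not part of skate):

```python
from typing import Iterable, Iterator


def process(records: Iterable[dict]) -> Iterator[dict]:
    """Apply the same cleanup logic to records from any source.

    The cleanup shown (stripping whitespace from string fields) is a
    placeholder for the real preprocessing.
    """
    for record in records:
        yield {
            k: v.strip() if isinstance(v, str) else v
            for k, v in record.items()
        }


# Batch mode: the source is a list (or a file iterator).
batch_result = list(process([{"title": " Hello "}, {"title": "World"}]))

# Continuous mode would pass a generator that yields messages as they
# arrive from a queue consumer; the processing code stays identical.
```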
## Continuous Update Ideas
Currently, we derive the graph from raw data blobs, e.g. references, the fatcat
database, an Open Library database dump, and a Wikipedia dump.
The goal would be to start a service and let the graph index (or whatever data
store backs it) be updated as new data arrives.
For example:
1. a new publication (P) arrives
2. it references articles, web pages, books, etc.; we can get this information from the data or from GROBID
3. we look up the title of P in some existing data store; we look up the
normalized title in some normalized data store; we could do exact or fuzzy
matching against Elasticsearch; we generate match candidates, e.g. where all
references live (here: batch requires high performance, whereas continuous
would be on the order of 100K per day)
4. we verify matches (here: batch needs to be fast again; 1M/min or the like)
5. we update the index and add new edges between documents
6. we add all references found to the "reference store"
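The matching steps above (normalize a title, generate candidates from a normalized store, record new edges) can be sketched as follows. This is an illustration only: the in-memory dict stands in for the normalized data store (in practice an Elasticsearch index), and all names are hypothetical:

```python
import unicodedata


def normalize_title(title: str) -> str:
    """Lowercase, strip accents, and drop non-alphanumerics so that
    near-identical titles map to the same exact-match key."""
    t = unicodedata.normalize("NFKD", title)
    t = "".join(c for c in t if not unicodedata.combining(c))
    return "".join(c for c in t.lower() if c.isalnum())


# Hypothetical stand-in for the normalized data store: key is the
# normalized title, value is the matched document identifier.
store = {normalize_title("Deep Learning"): "doc-123"}


def match_candidates(ref_title: str) -> list[str]:
    """Exact match on the normalized title; fuzzy matching (e.g. via
    Elasticsearch) would widen this candidate set."""
    key = normalize_title(ref_title)
    return [store[key]] if key in store else []


# For each reference of a new publication P, verified matches become
# new edges in the graph.
edges = []
for ref in ["Deep  LEARNING", "Unknown Paper"]:
    for doc_id in match_candidates(ref):
        edges.append(("P", doc_id))
```

Verification (step 4) would sit between candidate generation and edge creation; here every exact normalized match is accepted as-is.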