-rw-r--r-- | README.md | 20 |
1 files changed, 17 insertions, 3 deletions
@@ -4,13 +4,27 @@
 Scholarly citation graph related code; maintained by [martin@archive.org](mailto:martin@archive.org); multiple subprojects to keep all relevant code close.
 
-* python: mostly [luigi](https://github.com/spotify/luigi) tasks (using
+* [python](python): mostly [luigi](https://github.com/spotify/luigi) tasks (using
   [shiv](https://github.com/linkedin/shiv) for single-file deployments)
-* skate: various Go command line tools (packaged as deb)
+* [skate](skate): various Go command line tools (packaged as deb)
 
 Context: [fatcat](https://fatcat.wiki), "Mellon Grant" (20/21)
 
-We use informal, internal versioning for the graph currently v2, next will be v3.
+The high level goals are:
+
+* deriving a [citation graph](https://en.wikipedia.org/wiki/Citation_graph) dataset from scholarly metadata
+* beside paper-to-paper links the graph should also contain paper-to-book (open library) and paper-to-webpage (wayback machine) and other links
+* publication of this dataset in a suitable format, alongside a description of its content (e.g. as a technical report)
+
+The main challenges are:
+
+* currently 1.8B references documents (~800GB raw textual data); possibly going up to 2-4B (1-2TB raw textual data)
+* currently a single machine setup (aitio.us.archive.org, 16 cores, 16TB disk mounted at /magna)
+* very partial metadata
+* difficult data quality (e.g. need extra care to extract URLs, DOI, ISBN, etc.)
+* fuzzy matching and verification at scale (e.g. verifying 1M clustered documents per minute)
+
+We use informal, internal versioning for the graph currently v2, next will be v3/v4.
 
 ![](https://i.imgur.com/6dSaW2q.png)