-rw-r--r-- | README.md | 14 |
1 files changed, 7 insertions, 7 deletions
@@ -6,11 +6,11 @@ all relevant code close.
 * [python](python): mostly [luigi](https://github.com/spotify/luigi) tasks (using [shiv](https://github.com/linkedin/shiv) for single-file deployments)
-* [skate](skate): various Go command line tools (packaged as deb)
+* [skate](skate): various Go command line tools (packaged as deb) for extracting keys, cleanup, join and serialization tasks
-Context: [fatcat](https://fatcat.wiki), "Mellon Grant" (20/21)
+Context: [fatcat](https://fatcat.wiki)
-The high level goals are:
+The high level goals of this project are:
 * deriving a [citation graph](https://en.wikipedia.org/wiki/Citation_graph) dataset from scholarly metadata
 * beside paper-to-paper links the graph should also contain paper-to-book (open library) and paper-to-webpage (wayback machine) and other datasets (e.g. wikipedia)
@@ -19,12 +19,12 @@ The high level goals are:
 The main challenges are:
 * currently 1.8B references documents (~800GB raw textual data); possibly going up to 2-4B (1-2TB raw textual data)
-* currently a single machine setup (aitio.us.archive.org, 16 cores, 16TB disk mounted at /magna)
-* very partial metadata (requiring separate code paths)
-* difficult data quality (e.g. need extra care to extract URLs, DOI, ISBN, etc. since about 800M metadata docs come from ML based [PDF metadata extraction](https://grobid.readthedocs.io))
+* currently a single machine setup (16 cores, 16T disk; note: we compress with [zstd](https://github.com/facebook/zstd), which gives us about 5x the space)
+* partial metadata (requiring separate code paths)
+* data quality issues (e.g. need extra care to extract URLs, DOI, ISBN, etc. since about 800M metadata docs come from ML based [PDF metadata extraction](https://grobid.readthedocs.io))
 * fuzzy matching and verification at scale (e.g. verifying 1M clustered documents per minute)
-We use informal, internal versioning for the graph currently v2, next will be v3/v4.
+We use informal, internal versioning for the graph currently v3, next will be v4/v5.
 ![](https://i.imgur.com/6dSaW2q.png)