-rw-r--r-- README.md 20
1 files changed, 17 insertions, 3 deletions
diff --git a/README.md b/README.md
index b32e565..68b6649 100644
--- a/README.md
+++ b/README.md
@@ -4,13 +4,27 @@ Scholarly citation graph related code; maintained by
[martin@archive.org](mailto:martin@archive.org); multiple subprojects to keep
all relevant code close.
-* python: mostly [luigi](https://github.com/spotify/luigi) tasks (using
+* [python](python): mostly [luigi](https://github.com/spotify/luigi) tasks (using
[shiv](https://github.com/linkedin/shiv) for single-file deployments)
-* skate: various Go command line tools (packaged as deb)
+* [skate](skate): various Go command line tools (packaged as deb)
Context: [fatcat](https://fatcat.wiki), "Mellon Grant" (20/21)
-We use informal, internal versioning for the graph currently v2, next will be v3.
+The high level goals are:
+
+* deriving a [citation graph](https://en.wikipedia.org/wiki/Citation_graph) dataset from scholarly metadata
+* beside paper-to-paper links the graph should also contain paper-to-book (open library) and paper-to-webpage (wayback machine) and other links
+* publication of this dataset in a suitable format, alongside a description of its content (e.g. as a technical report)
+
+The main challenges are:
+
+* currently 1.8B reference documents (~800GB raw textual data); possibly going up to 2-4B (1-2TB raw textual data)
+* currently a single machine setup (aitio.us.archive.org, 16 cores, 16TB disk mounted at /magna)
+* very partial metadata
+* difficult data quality (e.g. need extra care to extract URLs, DOI, ISBN, etc.)
+* fuzzy matching and verification at scale (e.g. verifying 1M clustered documents per minute)
+
+We use informal, internal versioning for the graph: currently v2, next will be v3/v4.
![](https://i.imgur.com/6dSaW2q.png)
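
The "difficult data quality" item in the added README text mentions that extracting URLs, DOI, ISBN, etc. from references needs extra care. As a rough illustration only, here is a minimal Python sketch of pulling DOI and URL candidates out of a raw reference string. The function name `extract_identifiers`, the regular expressions, and the trailing-punctuation cleanup are assumptions made for this example; they are not taken from the repository, and the patterns are deliberately loose.

```python
import re

# Loose DOI candidate: "10.", a 4-9 digit registrant code, a slash, then any
# run of characters that are not whitespace, quotes, or angle brackets.
# (Assumption: this is an illustrative pattern, not the project's own.)
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s\"<>]+")

# Loose URL candidate: http(s) scheme followed by non-whitespace characters.
URL_PATTERN = re.compile(r"https?://[^\s\"<>]+")

# Punctuation that often trails an identifier in running reference text.
TRAILING = ".,;:)]}'"


def _clean(candidate: str) -> str:
    """Strip sentence punctuation that the loose patterns pick up."""
    return candidate.rstrip(TRAILING)


def extract_identifiers(raw: str) -> dict:
    """Return DOI and URL candidates found in a raw reference string.

    DOIs are lowercased, since DOI names are case-insensitive.
    """
    dois = [_clean(m).lower() for m in DOI_PATTERN.findall(raw)]
    urls = [_clean(m) for m in URL_PATTERN.findall(raw)]
    return {"doi": dois, "url": urls}


if __name__ == "__main__":
    ref = (
        "Smith, J. (2020). A title. doi:10.1234/ABC.def-5. "
        "Available at https://example.org/paper.pdf."
    )
    print(extract_identifiers(ref))
```

The cleanup step is where the "extra care" shows up: stripping a trailing `)` fixes references like `(doi:10.1234/x)` but would damage a DOI that legitimately ends in a parenthesis, so a real pipeline would likely need a verification step after extraction.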