add higher level description

author: Martin Czygan <martin.czygan@gmail.com> 2021-05-31 23:50:07 +0200
committer: Martin Czygan <martin.czygan@gmail.com> 2021-05-31 23:50:07 +0200
commit: 7fe8b79209a681a2f4d63a023922685b65bce203 (patch)
tree: 4a060c803eb234edf562f004086353b618984819 /README.md
parent: ef5e9e8b6787fdfd6af1226b48a4f6aeaf4cab54 (diff)
download: refcat-7fe8b79209a681a2f4d63a023922685b65bce203.tar.gz refcat-7fe8b79209a681a2f4d63a023922685b65bce203.zip
1 files changed, 17 insertions, 3 deletions
diff --git a/README.md b/README.md
index b32e565..68b6649 100644
--- a/README.md
+++ b/README.md
@@ -4,13 +4,27 @@ Scholarly citation graph related code; maintained by
 [martin@archive.org](mailto:martin@archive.org); multiple subprojects to keep
 all relevant code close.
 
-* python: mostly [luigi](https://github.com/spotify/luigi) tasks (using
+* [python](python): mostly [luigi](https://github.com/spotify/luigi) tasks (using
   [shiv](https://github.com/linkedin/shiv) for single-file deployments)
-* skate: various Go command line tools (packaged as deb)
+* [skate](skate): various Go command line tools (packaged as deb)
 
 Context: [fatcat](https://fatcat.wiki), "Mellon Grant" (20/21)
 
-We use informal, internal versioning for the graph currently v2, next will be v3.
+The high level goals are:
+
+* deriving a [citation graph](https://en.wikipedia.org/wiki/Citation_graph) dataset from scholarly metadata
+* besides paper-to-paper links, the graph should also contain paper-to-book (Open Library), paper-to-webpage (Wayback Machine), and other links
+* publication of this dataset in a suitable format, alongside a description of its content (e.g. as a technical report)
+
+The main challenges are:
+
+* currently 1.8B reference documents (~800GB raw textual data); possibly going up to 2-4B (1-2TB raw textual data)
+* currently a single machine setup (aitio.us.archive.org, 16 cores, 16TB disk mounted at /magna)
+* very partial metadata
+* difficult data quality (e.g. need extra care to extract URLs, DOI, ISBN, etc.)
+* fuzzy matching and verification at scale (e.g. verifying 1M clustered documents per minute)
+
+We use informal, internal versioning for the graph: currently v2, next will be v3/v4.
 
 ![](https://i.imgur.com/6dSaW2q.png)
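
The data quality challenge listed in the patch (extracting identifiers such as DOIs from messy reference strings) can be illustrated with a minimal sketch. This is not the project's implementation: the function name `extract_dois`, the regular expression, and the trailing-punctuation cleanup are assumptions made for illustration, and real reference data will need more careful handling (for example, some valid DOIs end in a closing parenthesis, which this sketch would strip).

```python
import re

# Hypothetical pattern: a "10." prefix, a 4-9 digit registrant code, a slash,
# then any run of characters that are not whitespace, quotes or angle brackets.
DOI_PATTERN = re.compile(r'10\.\d{4,9}/[^\s"<>]+')


def extract_dois(text):
    """Return candidate DOIs found in free text, lowercased and deduplicated.

    Trailing punctuation is stripped because reference strings often end a
    sentence or a parenthesis right after the identifier. This is a heuristic
    and may truncate DOIs that legitimately end in such a character.
    """
    found = []
    for match in DOI_PATTERN.finditer(text):
        doi = match.group(0).rstrip(".,;:)]}'").lower()
        if doi not in found:
            found.append(doi)
    return found


if __name__ == "__main__":
    raw = "See https://doi.org/10.1000/XYZ123. Also (doi:10.1234/abc.def)."
    print(extract_dois(raw))
```

Lowercasing is used here because DOIs are case-insensitive, which makes the extracted values easier to compare when matching references against catalog records.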