author Martin Czygan <martin.czygan@gmail.com> 2021-05-31 23:50:07 +0200
committer Martin Czygan <martin.czygan@gmail.com> 2021-05-31 23:50:07 +0200
commit 7fe8b79209a681a2f4d63a023922685b65bce203 (patch)
tree 4a060c803eb234edf562f004086353b618984819 /README.md
parent ef5e9e8b6787fdfd6af1226b48a4f6aeaf4cab54 (diff)
add higher level description
Diffstat (limited to 'README.md')
-rw-r--r-- README.md 20
1 files changed, 17 insertions, 3 deletions
diff --git a/README.md b/README.md
index b32e565..68b6649 100644
--- a/README.md
+++ b/README.md
@@ -4,13 +4,27 @@ Scholarly citation graph related code; maintained by
[martin@archive.org](mailto:martin@archive.org); multiple subprojects to keep
all relevant code close.
-* python: mostly [luigi](https://github.com/spotify/luigi) tasks (using
+* [python](python): mostly [luigi](https://github.com/spotify/luigi) tasks (using
[shiv](https://github.com/linkedin/shiv) for single-file deployments)
-* skate: various Go command line tools (packaged as deb)
+* [skate](skate): various Go command line tools (packaged as deb)
Context: [fatcat](https://fatcat.wiki), "Mellon Grant" (20/21)
-We use informal, internal versioning for the graph currently v2, next will be v3.
+The high level goals are:
+
+* deriving a [citation graph](https://en.wikipedia.org/wiki/Citation_graph) dataset from scholarly metadata
+* beside paper-to-paper links the graph should also contain paper-to-book (open library) and paper-to-webpage (wayback machine) and other links
+* publication of this dataset in a suitable format, alongside a description of its content (e.g. as a technical report)
+
+The main challenges are:
+
+* currently 1.8B references documents (~800GB raw textual data); possibly going up to 2-4B (1-2TB raw textual data)
+* currently a single machine setup (aitio.us.archive.org, 16 cores, 16TB disk mounted at /magna)
+* very partial metadata
+* difficult data quality (e.g. need extra care to extract URLs, DOI, ISBN, etc.)
+* fuzzy matching and verification at scale (e.g. verifying 1M clustered documents per minute)
+
+We use informal, internal versioning for the graph currently v2, next will be v3/v4.
![](https://i.imgur.com/6dSaW2q.png)
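The data quality challenge listed in the README (extra care to extract URLs, DOI, ISBN) can be illustrated with a minimal sketch for the DOI case. This is not taken from the refcat or skate code. The function name `extract_dois` is hypothetical, and the regular expression is an assumed, commonly used pattern for modern DOIs that will miss some legacy identifiers.

```python
import re

# Assumption: a commonly used pattern for modern DOIs ("10.", a 4-9 digit
# registrant prefix, a slash, then a suffix from a restricted character set).
# It is illustrative only and not the pattern used by refcat.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[-._;()/:a-z0-9]+", re.IGNORECASE)


def extract_dois(text):
    """Return unique, lowercased DOI candidates found in a raw reference string."""
    found = []
    for match in DOI_PATTERN.finditer(text):
        # Raw references often end a DOI with sentence punctuation, which the
        # suffix character class would otherwise swallow.
        candidate = match.group(0).rstrip(".,;:)").lower()
        if candidate not in found:
            found.append(candidate)
    return found


if __name__ == "__main__":
    ref = "Smith J. (2019). A title. https://doi.org/10.1000/XYZ123."
    print(extract_dois(ref))
```

Stripping trailing punctuation is a trade-off: it cleans up the common case of a DOI at the end of a sentence, but it would truncate the rare DOI that legitimately ends in a closing parenthesis. At the scale described above, such candidates would still need a separate verification step.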