author     Martin Czygan <martin.czygan@gmail.com>  2021-06-24 02:25:52 +0200
committer  Martin Czygan <martin.czygan@gmail.com>  2021-06-24 02:25:52 +0200
commit     426af6950afe0b25c5428cefee953ec345321319 (patch)
tree       4451d925c91cf8b5772e7567049cff890665b1ac /README.md
parent     22b8b7847e0d31eb50e5ce4d8653feae8011abaa (diff)
update notes
Diffstat (limited to 'README.md')
-rw-r--r--  README.md | 14
1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/README.md b/README.md
index eb07572..21828e4 100644
--- a/README.md
+++ b/README.md
@@ -6,11 +6,11 @@ all relevant code close.
* [python](python): mostly [luigi](https://github.com/spotify/luigi) tasks (using
[shiv](https://github.com/linkedin/shiv) for single-file deployments)
-* [skate](skate): various Go command line tools (packaged as deb)
+* [skate](skate): various Go command line tools (packaged as deb) for key extraction, cleanup, join, and serialization tasks
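
For orientation, here is a minimal sketch of the task pattern the [python](python) directory follows with [luigi](https://github.com/spotify/luigi); the task names, file paths, and key derivation below are illustrative assumptions, not code from this repository.

```python
import json

import luigi


class RefsInput(luigi.ExternalTask):
    """Points at an existing raw references dump (the path is an assumption)."""

    def output(self):
        return luigi.LocalTarget("data/refs.json")


class ExtractKeys(luigi.Task):
    """Derive one matching key per reference document."""

    def requires(self):
        return RefsInput()

    def output(self):
        return luigi.LocalTarget("data/refs-keys.tsv")

    def run(self):
        with self.input().open("r") as f, self.output().open("w") as out:
            for line in f:
                doc = json.loads(line)
                # a normalized title is one plausible matching key
                key = doc.get("title", "").lower().strip()
                out.write(key + "\t" + line)


if __name__ == "__main__":
    luigi.build([ExtractKeys()], local_scheduler=True)
```

Packaged with [shiv](https://github.com/linkedin/shiv), a pipeline like this ships to the processing machine as a single executable file.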
-Context: [fatcat](https://fatcat.wiki), "Mellon Grant" (20/21)
+Context: [fatcat](https://fatcat.wiki)
-The high level goals are:
+The high-level goals of this project are:
* deriving a [citation graph](https://en.wikipedia.org/wiki/Citation_graph) dataset from scholarly metadata
* besides paper-to-paper links, the graph should also contain paper-to-book (open library) and paper-to-webpage (wayback machine) links, as well as other datasets (e.g. wikipedia)
@@ -19,12 +19,12 @@ The high level goals are:
The main challenges are:
* currently 1.8B reference documents (~800GB raw textual data); possibly going up to 2-4B (1-2TB raw textual data)
-* currently a single machine setup (aitio.us.archive.org, 16 cores, 16TB disk mounted at /magna)
-* very partial metadata (requiring separate code paths)
-* difficult data quality (e.g. need extra care to extract URLs, DOI, ISBN, etc. since about 800M metadata docs come from ML based [PDF metadata extraction](https://grobid.readthedocs.io))
+* currently a single-machine setup (16 cores, 16TB disk; note: we compress with [zstd](https://github.com/facebook/zstd), which gives us about 5x the space)
+* partial metadata (requiring separate code paths)
+* data quality issues (e.g. extra care is needed to extract URLs, DOIs, ISBNs, etc., since about 800M metadata docs come from ML-based [PDF metadata extraction](https://grobid.readthedocs.io))
* fuzzy matching and verification at scale (e.g. verifying 1M clustered documents per minute)
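
To illustrate the kind of verification step the last point refers to, below is a hedged sketch that compares two clustered metadata documents by normalized title overlap; the tokenization, the Jaccard measure, and the threshold are assumptions for illustration, not the actual rules implemented in [skate](skate).

```python
import re


def tokens(s: str) -> set:
    """Lowercase alphanumeric tokens of a title."""
    return set(re.findall(r"[a-z0-9]+", s.lower()))


def verify_pair(a: dict, b: dict, threshold: float = 0.9) -> bool:
    """Accept a candidate match if the titles' token sets overlap enough."""
    ta, tb = tokens(a.get("title", "")), tokens(b.get("title", ""))
    if not ta or not tb:
        return False
    jaccard = len(ta & tb) / len(ta | tb)
    return jaccard >= threshold


# Two records that likely describe the same paper, e.g. one of them GROBID-derived.
left = {"title": "A Citation Graph Dataset From Scholarly Metadata"}
right = {"title": "A citation graph dataset from scholarly metadata."}
print(verify_pair(left, right))  # True
```

At the scale mentioned above, a cheap set-based check like this is attractive because it needs no external index and parallelizes trivially over clusters.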
-We use informal, internal versioning for the graph currently v2, next will be v3/v4.
+We use informal, internal versioning for the graph; currently v3, next will be v4/v5.
![](https://i.imgur.com/6dSaW2q.png)