author     Martin Czygan <martin.czygan@gmail.com>  2021-06-24 02:25:52 +0200
committer  Martin Czygan <martin.czygan@gmail.com>  2021-06-24 02:25:52 +0200
commit     426af6950afe0b25c5428cefee953ec345321319 (patch)
tree       4451d925c91cf8b5772e7567049cff890665b1ac /README.md
parent     22b8b7847e0d31eb50e5ce4d8653feae8011abaa (diff)
update notes
Diffstat (limited to 'README.md')
-rw-r--r--  README.md | 14
1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/README.md b/README.md
index eb07572..21828e4 100644
--- a/README.md
+++ b/README.md
@@ -6,11 +6,11 @@ all relevant code close.
* [python](python): mostly [luigi](https://github.com/spotify/luigi) tasks (using
[shiv](https://github.com/linkedin/shiv) for single-file deployments)
-* [skate](skate): various Go command line tools (packaged as deb)
+* [skate](skate): various Go command line tools (packaged as deb) for key extraction, cleanup, join, and serialization tasks
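
For orientation, here is a minimal sketch of the task pattern the [python](python) directory follows with [luigi](https://github.com/spotify/luigi); the task names, file paths, and key derivation below are illustrative assumptions, not code from this repository.

```python
import json

import luigi


class RefsInput(luigi.ExternalTask):
    """Points at an existing raw references dump (the path is an assumption)."""

    def output(self):
        return luigi.LocalTarget("data/refs.json")


class ExtractKeys(luigi.Task):
    """Derive one matching key per reference document."""

    def requires(self):
        return RefsInput()

    def output(self):
        return luigi.LocalTarget("data/refs-keys.tsv")

    def run(self):
        with self.input().open("r") as f, self.output().open("w") as out:
            for line in f:
                doc = json.loads(line)
                # a normalized title is one plausible matching key
                key = doc.get("title", "").lower().strip()
                out.write(key + "\t" + line)


if __name__ == "__main__":
    luigi.build([ExtractKeys()], local_scheduler=True)
```

Packaged with [shiv](https://github.com/linkedin/shiv), a pipeline like this ships to the processing machine as a single executable file.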
-Context: [fatcat](https://fatcat.wiki), "Mellon Grant" (20/21)
+Context: [fatcat](https://fatcat.wiki)
-The high level goals are:
+The high-level goals of this project are:
* deriving a [citation graph](https://en.wikipedia.org/wiki/Citation_graph) dataset from scholarly metadata
* besides paper-to-paper links, the graph should also contain paper-to-book (open library) and paper-to-webpage (wayback machine) links, as well as other datasets (e.g. wikipedia)
@@ -19,12 +19,12 @@ The high level goals are:
The main challenges are:
* currently 1.8B reference documents (~800GB raw textual data); possibly going up to 2-4B (1-2TB raw textual data)
-* currently a single machine setup (aitio.us.archive.org, 16 cores, 16TB disk mounted at /magna)
-* very partial metadata (requiring separate code paths)
-* difficult data quality (e.g. need extra care to extract URLs, DOI, ISBN, etc. since about 800M metadata docs come from ML based [PDF metadata extraction](https://grobid.readthedocs.io))
+* currently a single-machine setup (16 cores, 16TB disk; note: we compress with [zstd](https://github.com/facebook/zstd), which gives us about 5x the space)
+* partial metadata (requiring separate code paths)
+* data quality issues (e.g. extra care is needed to extract URLs, DOIs, ISBNs, etc., since about 800M metadata docs come from ML-based [PDF metadata extraction](https://grobid.readthedocs.io))
* fuzzy matching and verification at scale (e.g. verifying 1M clustered documents per minute)
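
To illustrate the kind of verification step the last point refers to, below is a hedged sketch that compares two clustered metadata documents by normalized title overlap; the tokenization, the Jaccard measure, and the threshold are assumptions for illustration, not the actual rules implemented in [skate](skate).

```python
import re


def tokens(s: str) -> set:
    """Lowercase alphanumeric tokens of a title."""
    return set(re.findall(r"[a-z0-9]+", s.lower()))


def verify_pair(a: dict, b: dict, threshold: float = 0.9) -> bool:
    """Accept a candidate match if the titles' token sets overlap enough."""
    ta, tb = tokens(a.get("title", "")), tokens(b.get("title", ""))
    if not ta or not tb:
        return False
    jaccard = len(ta & tb) / len(ta | tb)
    return jaccard >= threshold


# Two records that likely describe the same paper, e.g. one of them GROBID-derived.
left = {"title": "A Citation Graph Dataset From Scholarly Metadata"}
right = {"title": "A citation graph dataset from scholarly metadata."}
print(verify_pair(left, right))  # True
```

At the scale mentioned above, a cheap set-based check like this is attractive because it needs no external index and parallelizes trivially over clusters.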
-We use informal, internal versioning for the graph currently v2, next will be v3/v4.
+We use informal, internal versioning for the graph; currently v3, next will be v4/v5.
![](https://i.imgur.com/6dSaW2q.png)