From f5dafe6e3ceb588d7ab89bf3cbb11c5a579b6678 Mon Sep 17 00:00:00 2001
From: Martin Czygan <martin.czygan@gmail.com>
Date: Sat, 1 May 2021 14:23:06 +0200
Subject: start overview docs

---
 README.md         |  6 ++----
 notes/overview.md | 52 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 54 insertions(+), 4 deletions(-)
 create mode 100644 notes/overview.md

diff --git a/README.md b/README.md
index 528dd7d..b32e565 100644
--- a/README.md
+++ b/README.md
@@ -1,7 +1,5 @@
 # cgraph
 
-----
-
 Scholarly citation graph related code; maintained by
 [martin@archive.org](mailto:martin@archive.org); multiple subprojects to keep
 all relevant code close.
@@ -10,9 +8,9 @@ all relevant code close.
   [shiv](https://github.com/linkedin/shiv) for single-file deployments)
 * skate: various Go command line tools (packaged as deb)
 
-Context: [fatcat](https://fatcat.wiki), "Mellon Grant" (20/21).
+Context: [fatcat](https://fatcat.wiki), "Mellon Grant" (20/21)
 
-We use informal, internal versioning, currently v2, next will be v3.
+We use informal, internal versioning for the graph: currently v2, next will be v3.
 
 ![](https://i.imgur.com/6dSaW2q.png)
 
diff --git a/notes/overview.md b/notes/overview.md
new file mode 100644
index 0000000..8cb1200
--- /dev/null
+++ b/notes/overview.md
@@ -0,0 +1,52 @@
+# Overview
+
+## Data inputs
+
+Mostly JSON, but each input differs in form and quality.
+
+Core inputs:
+
+* refs schema, from metadata or grobid (1-4B)
+* fatcat release entities (100-200M)
+* open library solr export (10-50M)
+
+Other inputs:
+
+* researchgate sitemap, titles (10-30M)
+* oai-pmh harvest metadata (50-200M)
+* sim (serials in microfilm, "microfilm") metadata
+
+Inputs related to evaluation:
+
+* BASE md dump (200-300M)
+* Microsoft Academic, MAG (100-300M)
+
+Casually:
+
+* a single title, e.g. ILL related (1)
+* lists of titles (1-1M)
+
+## Targets
+
+### BiblioRef
+
+Most important high level target; basic schema for current setup; elasticsearch
+indexable, small JSON docs, allowing basic aggregations and lookups.
+
+This is not just a conversion, but may involve clustering, verification, etc.
+
+## Approach
+
+We may call it "local map-reduce", and we try to do it all in a single MR setup, e.g.
+
+* extract relevant fields and sort (map)
+* apply computation on groups (reduce)
+
+As we want performance and sometimes custom code (e.g. for finding information
+in unstructured data), we try to group code into a Go library with a suite of
+command line tools, which are easy to build and deploy.
+
+If the scaffolding is good, we can plug in mappers and reducers as we go, and
+expose them in the tools.
+
+
-- 
cgit v1.2.3