From f5dafe6e3ceb588d7ab89bf3cbb11c5a579b6678 Mon Sep 17 00:00:00 2001
From: Martin Czygan
Date: Sat, 1 May 2021 14:23:06 +0200
Subject: start overview docs

---
 README.md         |  6 ++----
 notes/overview.md | 52 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 54 insertions(+), 4 deletions(-)
 create mode 100644 notes/overview.md

diff --git a/README.md b/README.md
index 528dd7d..b32e565 100644
--- a/README.md
+++ b/README.md
@@ -1,7 +1,5 @@
 # cgraph
 
-----
-
 Scholarly citation graph related code; maintained by
 [martin@archive.org](mailto:martin@archive.org); multiple subprojects to keep
 all relevant code close.
@@ -10,9 +8,9 @@ all relevant code close.
 [shiv](https://github.com/linkedin/shiv) for single-file deployments)
 * skate: various Go command line tools (packaged as deb)
 
-Context: [fatcat](https://fatcat.wiki), "Mellon Grant" (20/21).
+Context: [fatcat](https://fatcat.wiki), "Mellon Grant" (20/21)
 
-We use informal, internal versioning, currently v2, next will be v3.
+We use informal, internal versioning for the graph, currently v2, next will be v3.
 
 ![](https://i.imgur.com/6dSaW2q.png)
 
diff --git a/notes/overview.md b/notes/overview.md
new file mode 100644
index 0000000..8cb1200
--- /dev/null
+++ b/notes/overview.md
@@ -0,0 +1,52 @@
+# Overview
+
+## Data inputs
+
+Mostly JSON, but each one different in form and quality.
+
+Core inputs:
+
+* refs schema, from metadata or grobid (1-4B)
+* fatcat release entities (100-200M)
+* open library solr export (10-50M)
+
+Other inputs:
+
+* researchgate sitemap, titles (10-30M)
+* oai-pmh harvest metadata (50-200M)
+* sim (serials in microfilm, "microfilm") metadata
+
+Inputs related to evaluation:
+
+* BASE md dump (200-300M)
+* Microsoft Academic, MAG (100-300M)
+
+Casually:
+
+* a single title, e.g. ILL related (1)
+* lists of titles (1-1M)
+
+## Targets
+
+### BiblioRef
+
+Most important high level target; basic schema for current setup; elasticsearch
+indexable, small JSON docs, allowing basic aggregations and lookups.
+
+This is not just a conversion, but may involve clustering, verification, etc.
+
+## Approach
+
+We may call it "local map-reduce", and we try to do it all in a single MR setup, e.g.
+
+* extract relevant fields and sort (map)
+* apply computation on groups (reduce)
+
+As we want performance and sometimes custom code (e.g. for finding information
+in unstructured data), we try to group code into a Go library with a suite of
+command line tools. Easy to build and deploy.
+
+If the scaffolding is good, we can plug in mappers and reducers as we go, and
+expose them in the tools.
+
+
--
cgit v1.2.3