# Overview

## Data inputs

Mostly JSON, but each input differs in form and quality.

Core inputs:

* refs schema, from metadata or grobid (1-4B)
* fatcat release entities (100-200M)
* open library solr export (10-50M)

Other inputs:

* researchgate sitemap, titles (10-30M)
* oai-pmh harvest metadata (50-200M)
* sim ("serials in microfilm") metadata

Inputs related to evaluation:

* BASE md dump (200-300M)
* Microsoft Academic, MAG (100-300M)

Ad-hoc inputs:

* a single title, e.g. ILL related (1)
* lists of titles (1-1M)

## Targets

### BiblioRef

The most important high-level target and the basic schema for the current
setup: elasticsearch-indexable, small JSON docs that allow basic aggregations
and lookups.

Producing these is not just a format conversion; it may involve clustering, verification, etc.
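
As a concrete illustration, a BiblioRef document could be modeled as a small
Go struct that serializes to a flat JSON doc. The field names below are
assumptions for the sketch, not the final schema:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// BiblioRef links a citing release to a cited release, with enough
// provenance for basic aggregations and lookups. Field names are
// illustrative assumptions, not a confirmed schema.
type BiblioRef struct {
	SourceReleaseIdent string `json:"source_release_ident"`           // citing fatcat release
	TargetReleaseIdent string `json:"target_release_ident,omitempty"` // cited release, if matched
	MatchProvenance    string `json:"match_provenance"`               // e.g. "grobid", "crossref"
	MatchStatus        string `json:"match_status"`                   // e.g. "exact", "fuzzy"
}

func main() {
	doc := BiblioRef{
		SourceReleaseIdent: "s6ewo3czavdwfdjsyxm6nex2cq",
		TargetReleaseIdent: "7vl54h5bqjcyzhgcgwcgu3uyfm",
		MatchProvenance:    "grobid",
		MatchStatus:        "fuzzy",
	}
	b, err := json.Marshal(doc)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```

Keeping the docs flat and small is what makes elasticsearch aggregations
(e.g. counts by match_provenance) cheap.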

## Approach

We might call this approach "local map-reduce": we try to do it all in a single MR-style setup (a minimal sketch follows the list), e.g.

* extract relevant fields and sort (map)
* apply computation on groups (reduce)
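
A minimal sketch of the pattern in Go, assuming tab-separated input with the
group key in the first column; the in-memory sort stands in for what would be
an external sort (e.g. GNU sort) at real data sizes:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"sort"
	"strings"
)

// mapper extracts a (key, value) pair from one input line; the
// tab-separated layout is an assumption for this sketch.
func mapper(line string) (key, value string) {
	parts := strings.SplitN(line, "\t", 2)
	if len(parts) == 2 {
		return parts[0], parts[1]
	}
	return line, ""
}

// reducer runs once per key group; here it just counts members.
func reducer(key string, values []string) {
	fmt.Printf("%s\t%d\n", key, len(values))
}

func main() {
	type kv struct{ k, v string }
	var pairs []kv

	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		k, v := mapper(scanner.Text())
		pairs = append(pairs, kv{k, v})
	}
	if err := scanner.Err(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// Sort so identical keys become adjacent (the "shuffle" phase).
	sort.Slice(pairs, func(i, j int) bool { return pairs[i].k < pairs[j].k })

	// Walk sorted pairs, emitting one reduce call per key group.
	for i := 0; i < len(pairs); {
		j := i
		var values []string
		for j < len(pairs) && pairs[j].k == pairs[i].k {
			values = append(values, pairs[j].v)
			j++
		}
		reducer(pairs[i].k, values)
		i = j
	}
}
```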

As we want performance and sometimes custom code (e.g. for finding information
in unstructured data), we group the code into a Go library with a suite of
command line tools, which is easy to build and deploy.

If the scaffolding is good, we can plug in mappers and reducers as we go and
expose them in the tools.
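
One plausible shape for that scaffolding, with hypothetical names: a registry
maps a command line flag value to a mapper function, so exposing a new mapper
in a tool is a one-entry addition.

```go
package main

import (
	"bufio"
	"flag"
	"fmt"
	"os"
	"strings"
)

// MapperFunc turns one input line into zero or more output lines.
type MapperFunc func(line string) []string

// mappers is the registry; plugging in a new mapper means adding one
// entry here. Both mapper names are hypothetical examples.
var mappers = map[string]MapperFunc{
	"identity": func(line string) []string { return []string{line} },
	"title-key": func(line string) []string {
		// e.g. normalize a raw title into a clustering key
		return []string{strings.ToLower(strings.TrimSpace(line))}
	},
}

func main() {
	name := flag.String("m", "identity", "mapper to apply")
	flag.Parse()

	m, ok := mappers[*name]
	if !ok {
		fmt.Fprintf(os.Stderr, "unknown mapper: %s\n", *name)
		os.Exit(1)
	}
	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		for _, out := range m(scanner.Text()) {
			fmt.Println(out)
		}
	}
	if err := scanner.Err(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

A tool built this way composes with the sort step via plain pipes, e.g.
`cat input.json | tool -m title-key | sort | tool-reduce ...`.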