# skate

A library and suite of command line tools related to generating a [citation
graph](https://en.wikipedia.org/wiki/Citation_graph).

> There is no standard format for the citations in bibliographies, and the
> record linkage of citations can be a time-consuming and complicated process.

## Background

Python was a bit too slow, even when parallelized (with GNU parallel), e.g. for
generating clusters of similar documents or for running verification. An option
for the future would be to resort to [Cython](https://cython.org/). Parts of
[fuzzycat](https://git.archive.org/webgroup/fuzzycat) have been ported into this
project for performance (we saw a 25x speedup for certain tasks).

![](static/zipkey.png)

## Overview

We follow a map-reduce style approach (on a single machine): we extract a
specific key from each data item, group items sharing the same *key* together,
and apply some computation on these groups.

A Mapper is defined as a function type, mapping a blob of data (e.g. a single
JSON object) to a number of fields (e.g. a key and a value).

```go
// Mapper maps a blob to an arbitrary number of fields, e.g. for (key,
// doc). We want fields, but we do not want to bake TSV into each function.
type Mapper func([]byte) ([][]byte, error)
```
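
For illustration, a minimal mapper might key each JSON document by its DOI.
This is only a sketch: the `doi` field name, the lowercasing, and the
`MapperDOI` name are assumptions for the example, not skate's actual
extraction rules.

```go
import (
	"encoding/json"
	"strings"
)

// MapperDOI is a hypothetical mapper: it keys each JSON document by a
// lowercased "doi" field and keeps the raw blob as the value.
func MapperDOI(p []byte) ([][]byte, error) {
	var doc struct {
		DOI string `json:"doi"`
	}
	if err := json.Unmarshal(p, &doc); err != nil {
		return nil, err
	}
	if doc.DOI == "" {
		return nil, nil // no identifier, nothing to emit
	}
	return [][]byte{[]byte(strings.ToLower(doc.DOI)), p}, nil
}
```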

We can attach a serialization method to this function type to emit TSV; this
way we have to deal with TSV in only one place.

```go
// AsTSV serializes the result of a field mapper as TSV. This is a slim
// adapter, e.g. to parallel.Processor, which expects this function signature.
// A newline will be appended, if not there already.
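// Note: bTab and bNewline are package-level separators, presumably
// []byte("\t") and []byte("\n").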
func (f Mapper) AsTSV(p []byte) ([]byte, error) {
        var (
                fields [][]byte
                err    error
                b      []byte
        )
        if fields, err = f(p); err != nil {
                return nil, err
        }
        if len(fields) == 0 {
                return nil, nil
        }
        b = bytes.Join(fields, bTab)
        if len(b) > 0 && !bytes.HasSuffix(b, bNewline) {
                b = append(b, bNewline...)
        }
        return b, nil
}
```
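
A mapper wrapped this way can then be fed to a line-oriented processor. The
sketch below assumes the processor from
[miku/parallel](https://github.com/miku/parallel), which the comment above
hints at, and reuses the hypothetical `MapperDOI` from earlier; treat it as a
rough shape, not the exact wiring used here.

```go
import (
	"log"
	"os"

	"github.com/miku/parallel"
)

func main() {
	m := Mapper(MapperDOI)
	// Read JSON lines from stdin, write (key, doc) TSV to stdout,
	// processing batches of lines in parallel.
	p := parallel.NewProcessor(os.Stdin, os.Stdout, m.AsTSV)
	if err := p.Run(); err != nil {
		log.Fatal(err)
	}
}
```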

Reducers typically take two sorted streams of (key, doc) lines, find all
documents sharing a key, then apply a function to each such group. This is made
a bit generic in subpackage [zipkey](zipkey).
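
A minimal sketch of the idea (the actual zipkey API differs): a merge join
over two key-sorted TSV streams, grouping lines per key and invoking a
callback for keys present in both streams. The empty string is used as an EOF
sentinel here for brevity.

```go
import (
	"bufio"
	"strings"
)

// key returns the first TSV column of a line.
func key(line string) string {
	if i := strings.IndexByte(line, '\t'); i >= 0 {
		return line[:i]
	}
	return line
}

// group reads consecutive lines from s sharing the key of pending; it
// returns the group and the next line ("" at EOF).
func group(s *bufio.Scanner, pending string) ([]string, string) {
	g := []string{pending}
	for s.Scan() {
		line := s.Text()
		if key(line) != key(pending) {
			return g, line
		}
		g = append(g, line)
	}
	return g, ""
}

// zipKey merge-joins two key-sorted streams and calls f once per key
// that appears in both, with the grouped lines from each side.
func zipKey(a, b *bufio.Scanner, f func(key string, ga, gb []string)) {
	var la, lb string
	if a.Scan() {
		la = a.Text()
	}
	if b.Scan() {
		lb = b.Text()
	}
	for la != "" && lb != "" {
		switch ka, kb := key(la), key(lb); {
		case ka < kb:
			_, la = group(a, la)
		case ka > kb:
			_, lb = group(b, lb)
		default:
			var ga, gb []string
			ga, la = group(a, la)
			gb, lb = group(b, lb)
			f(ka, ga, gb)
		}
	}
}
```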

### Example Map/Reduce

* extract DOIs (and other identifiers) and emit a "biblioref"
* extract normalized titles (or container titles), verify candidates, and emit a biblioref for exact and strong matches, e.g. between papers, or between papers and books (see the normalization sketch after this list)
* extract ids and find unmatched refs in the raw blob
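
As a sketch of what title normalization for grouping could look like
(the actual rules in skate and fuzzycat are more involved; the function
name and rules here are illustrative):

```go
import (
	"regexp"
	"strings"
)

var nonAlnum = regexp.MustCompile(`[^a-z0-9 ]+`)

// normalizeTitle lowercases a title, drops punctuation and collapses
// whitespace, so near-identical titles map to the same key.
func normalizeTitle(s string) string {
	s = strings.ToLower(s)
	s = nonAlnum.ReplaceAllString(s, "")
	return strings.Join(strings.Fields(s), " ")
}
```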

Scale: from a few million up to a few billion docs