# skate

A library and suite of command line tools related to generating a [citation
graph](https://en.wikipedia.org/wiki/Citation_graph).

> There is no standard format for the citations in bibliographies, and the
> record linkage of citations can be a time-consuming and complicated process.

## Background

Python was a bit too slow, even when parallelized (with GNU parallel), e.g. for
generating clusters of similar documents or for verification. An option for the
future would be to resort to [Cython](https://cython.org/). Parts of
[fuzzycat](https://git.archive.org/webgroup/fuzzycat) have been ported into this
project for performance (we saw a 25x speedup for certain tasks).

![](static/zipkey.png)

## Overview

We follow a map-reduce style approach (on a single machine): we extract
specific keys from the data, group items with the same *key* together (via
sort), and apply some computation to each group.

A Mapper is defined as a function type, mapping a blob of data (e.g. a single
JSON object) to a number of fields (e.g. key and value).

```go
// Mapper maps a blob to an arbitrary number of fields, e.g. for (key,
// doc). We want fields, but we do not want to bake in TSV into each function.
type Mapper func([]byte) ([][]byte, error)
```
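
For illustration, here is a minimal sketch of what a concrete mapper could look
like. This function is not part of the codebase; it assumes JSON documents with
a `doi` field and uses `bytes` and `encoding/json` from the standard library.

```go
// MapperDOI is a hypothetical mapper that extracts a lowercased DOI from a
// JSON document and emits (key, doc) as two fields. Documents without a DOI
// yield no fields and are skipped downstream.
func MapperDOI(p []byte) ([][]byte, error) {
	var doc struct {
		DOI string `json:"doi"`
	}
	if err := json.Unmarshal(p, &doc); err != nil {
		return nil, err
	}
	if doc.DOI == "" {
		return nil, nil
	}
	return [][]byte{bytes.ToLower([]byte(doc.DOI)), p}, nil
}
```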

We can attach a serialization method to this function type to emit TSV - this
way we only have to deal with TSV once.

```go
// AsTSV serializes the result of a field mapper as TSV. This is a slim
// adapter, e.g. to parallel.Processor, which expects this function signature.
// A newline will be appended, if not there already.
func (f Mapper) AsTSV(p []byte) ([]byte, error) {
        var (
                fields [][]byte
                err    error
                b      []byte
        )
        if fields, err = f(p); err != nil {
                return nil, err
        }
        if len(fields) == 0 {
                return nil, nil
        }
        b = bytes.Join(fields, bTab)
        if len(b) > 0 && !bytes.HasSuffix(b, bNewline) {
                b = append(b, bNewline...)
        }
        return b, nil
}
```
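
As a usage sketch (assuming a `MapperDOI` like the one above, plus `bufio`,
`io` and `os` imports), the adapter can be dropped into a plain line-by-line
loop; the actual pipeline runs the same function signature through a parallel
processor.

```go
// run streams newline-delimited JSON from stdin and writes (key, doc) TSV
// to stdout, one line per input document.
func run() error {
	br := bufio.NewReader(os.Stdin)
	bw := bufio.NewWriter(os.Stdout)
	defer bw.Flush()
	f := Mapper(MapperDOI)
	for {
		line, err := br.ReadBytes('\n')
		if len(line) > 0 {
			b, merr := f.AsTSV(line)
			if merr != nil {
				return merr
			}
			if _, werr := bw.Write(b); werr != nil {
				return werr
			}
		}
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
	}
}
```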

Reducers typically take two sorted streams of (key, doc) lines, find all
documents sharing a key, then apply a function to that group. This is made a
bit generic in the subpackage [zipkey](zipkey).
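
The core idea is a merge join over key-sorted streams. The following is a
simplified sketch of that pattern, not the actual zipkey API: whenever both
streams carry the same key, the two groups of lines are handed to a callback
(batching and error handling omitted).

```go
// zip advances two key-sorted streams of (key, doc) lines in lockstep; the
// key function extracts the key from a line (e.g. the first TSV column) and
// grp receives the two groups of lines sharing a key.
func zip(a, b *bufio.Scanner, key func(string) string, grp func(g0, g1 []string)) {
	okA, okB := a.Scan(), b.Scan()
	for okA && okB {
		ka, kb := key(a.Text()), key(b.Text())
		switch {
		case ka < kb:
			okA = a.Scan() // key only in a, skip
		case ka > kb:
			okB = b.Scan() // key only in b, skip
		default:
			var g0, g1 []string
			for okA && key(a.Text()) == ka {
				g0 = append(g0, a.Text())
				okA = a.Scan()
			}
			for okB && key(b.Text()) == kb {
				g1 = append(g1, b.Text())
				okB = b.Scan()
			}
			grp(g0, g1)
		}
	}
}
```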

### Example Map/Reduce

* extract DOI (and other identifiers) and emit "biblioref"
* extract normalized titles (or container titles), verify candidates and emit biblioref for exact and strong matches; e.g. between papers and between papers and books, etc.
* extract ids and find unmatched refs in the raw blob

Scale: a few million up to a few billion docs.
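
The second item above relies on a normalized title as the grouping key. A
naive sketch of such a normalization (the helper name is made up and the real
normalization is more involved) could be:

```go
// normalizeTitle lowercases a title and strips everything that is not a
// letter or digit, so that small formatting differences do not change the
// grouping key.
func normalizeTitle(s string) string {
	var b strings.Builder
	for _, r := range strings.ToLower(s) {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			b.WriteRune(r)
		}
	}
	return b.String()
}
```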

## TODO and Issues

+ [ ] a clearer way to deduplicate edges

Currently, we use the `source_release_ident` and `ref_index` as the
elasticsearch key. This means that reference docs coming from different
sources (e.g. crossref, grobid, etc.) but representing the same item must
match in their index, which we can neither guarantee nor make robust across
sources, where indices may change frequently.

A clearer path could be:

1. group all matches (and non-matches) by source work id (already the case)
2. generate a list of unique refs, unique by source-target, w/o any index
3. for any source-target pair with more than one occurrence, understand which one of these we want to include

Step 3 can then go into depth on understanding multiplicities, e.g. is it an
"ebd" or "ff" type reference? Does it come from a different source (in which
case we choose the one most likely to be correct), etc.
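
A minimal sketch of steps 2 and 3, with hypothetical type and field names:
group edge documents by their (source, target) pair, independent of any
reference index, and let a pick function decide which occurrence to keep.

```go
// BiblioRef is a hypothetical, trimmed-down edge document.
type BiblioRef struct {
	SourceReleaseIdent string
	TargetReleaseIdent string
	MatchProvenance    string // e.g. "crossref", "grobid", ...
}

// dedupeRefs keeps a single edge per (source, target) pair; pick encodes
// step 3, e.g. preferring the occurrence most likely to be correct.
func dedupeRefs(refs []BiblioRef, pick func(a, b BiblioRef) BiblioRef) []BiblioRef {
	seen := make(map[[2]string]BiblioRef)
	for _, r := range refs {
		k := [2]string{r.SourceReleaseIdent, r.TargetReleaseIdent}
		if prev, ok := seen[k]; ok {
			seen[k] = pick(prev, r)
		} else {
			seen[k] = r
		}
	}
	out := make([]BiblioRef, 0, len(seen))
	for _, r := range seen {
		out = append(out, r)
	}
	return out
}
```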