# skate
A library and suite of command line tools related to generating a [citation
graph](https://en.wikipedia.org/wiki/Citation_graph).
> There is no standard format for the citations in bibliographies, and the
> record linkage of citations can be a time-consuming and complicated process.
## Background
Python was a bit too slow, even when parallelized (with GNU parallel), e.g. for
generating clusters of similar documents or for verification. An option for
the future would be to resort to [Cython](https://cython.org/). Parts of
[fuzzycat](https://git.archive.org/webgroup/fuzzycat) have been ported into this
project for performance (we saw a 25x speedup for certain tasks).
![](static/zipkey.png)
## Overview
We follow a map-reduce style approach (on a single machine): we extract
specific keys from the data, group items with the same *key* together (via
sort), and apply some computation on these groups.
A Mapper is defined as a function type, mapping a blob of data (e.g. a single
JSON object) to a number of fields (e.g. key, value).
```go
// Mapper maps a blob to an arbitrary number of fields, e.g. for (key,
// doc). We want fields, but we do not want to bake in TSV into each function.
type Mapper func([]byte) ([][]byte, error)
```
We can attach a serialization method to this function type to emit TSV; this
way we have to deal with TSV only once.
```go
// AsTSV serializes the result of a field mapper as TSV. This is a slim
// adapter, e.g. to parallel.Processor, which expects this function signature.
// A newline will be appended, if not there already.
func (f Mapper) AsTSV(p []byte) ([]byte, error) {
var (
fields [][]byte
err error
b []byte
)
if fields, err = f(p); err != nil {
return nil, err
}
if len(fields) == 0 {
return nil, nil
}
b = bytes.Join(fields, bTab)
if len(b) > 0 && !bytes.HasSuffix(b, bNewline) {
b = append(b, bNewline...)
}
return b, nil
}
```
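To illustrate the shape of a mapper, here is a minimal sketch. It assumes the same package scope as the snippets above (plus `encoding/json` and `strings`); the `miniRelease` struct and its fields are made up for this example and are not the real release schema.
```go
// miniRelease is a made-up, minimal document for this sketch; the real
// release schema has many more fields.
type miniRelease struct {
	Ident string `json:"ident"`
	DOI   string `json:"doi"`
}

// MapperDOI maps a JSON blob to a (DOI, raw doc) pair, skipping documents
// without a DOI.
func MapperDOI(p []byte) ([][]byte, error) {
	var doc miniRelease
	if err := json.Unmarshal(p, &doc); err != nil {
		return nil, err
	}
	doi := strings.ToLower(strings.TrimSpace(doc.DOI))
	if doi == "" {
		return nil, nil // nothing to join on
	}
	return [][]byte{[]byte(doi), p}, nil
}

// Usage, yielding one "doi\t{...}\n" line per input blob:
//
//	line, err := Mapper(MapperDOI).AsTSV(blob)
```
Emitting the raw document as the second field keeps the reduce step schema-agnostic: it only needs the key to group and can unmarshal the payload as needed.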
Reducers typically take two sorted streams of (key, doc) lines and will find
all documents sharing a key, then apply a function on this group. This is made
a bit generic in subpackage [zipkey](zipkey).
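The zipkey API itself is not reproduced in this README; the following is only a rough sketch of the underlying idea, assuming key-sorted TSV input with the key as the first field and the same package scope as above (plus `bufio`, `bytes` and `io`). Scanner errors and oversized lines are not handled here.
```go
// groupScanner yields consecutive runs of lines sharing the same key (the
// first tab-separated field) from a key-sorted stream.
type groupScanner struct {
	s    *bufio.Scanner
	line []byte // one line of lookahead; nil when the stream is exhausted
}

func newGroupScanner(r io.Reader) *groupScanner {
	gs := &groupScanner{s: bufio.NewScanner(r)}
	gs.advance()
	return gs
}

// advance reads the next line into the lookahead buffer.
func (gs *groupScanner) advance() {
	if gs.s.Scan() {
		gs.line = append([]byte(nil), gs.s.Bytes()...)
	} else {
		gs.line = nil
	}
}

// keyOf returns the first tab-separated field of a line.
func keyOf(line []byte) string {
	if i := bytes.IndexByte(line, '\t'); i >= 0 {
		return string(line[:i])
	}
	return string(line)
}

// next returns the next key and all lines carrying it; ok is false at EOF.
func (gs *groupScanner) next() (key string, group [][]byte, ok bool) {
	if gs.line == nil {
		return "", nil, false
	}
	key = keyOf(gs.line)
	for gs.line != nil && keyOf(gs.line) == key {
		group = append(group, gs.line)
		gs.advance()
	}
	return key, group, true
}

// zipGroups merge-joins two key-sorted streams and applies f to every group
// of lines that share a key across both streams.
func zipGroups(r, s io.Reader, f func(key string, a, b [][]byte) error) error {
	gr, gs := newGroupScanner(r), newGroupScanner(s)
	ka, ga, aok := gr.next()
	kb, gb, bok := gs.next()
	for aok && bok {
		switch {
		case ka < kb:
			ka, ga, aok = gr.next()
		case ka > kb:
			kb, gb, bok = gs.next()
		default:
			if err := f(ka, ga, gb); err != nil {
				return err
			}
			ka, ga, aok = gr.next()
			kb, gb, bok = gs.next()
		}
	}
	return nil
}
```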
### Example Map/Reduce
* extract DOI (and other identifiers) and emit "biblioref"
* extract normalized titles (or container titles), verify candidates and emit biblioref for exact and strong matches; e.g. between papers and between papers and books, etc.
* extract ids and find unmatched refs in the raw blob
Scale: a few million up to a few billion docs.
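Continuing the sketch from above, a group function for the DOI case could emit one biblioref per (ref, release) pair sharing a key. The `BiblioRef` fields below are illustrative only, and for simplicity both payloads are assumed to carry an `ident` field; the real schemas differ.
```go
// BiblioRef is an illustrative target schema for this sketch; the real
// biblioref document has a different, richer layout.
type BiblioRef struct {
	SourceIdent string `json:"source_release_ident"`
	TargetIdent string `json:"target_release_ident"`
	MatchStatus string `json:"match_status"`
}

// identFromLine pulls the "ident" field out of the JSON payload of a
// "key\tdoc" line; errors are ignored for brevity.
func identFromLine(line []byte) string {
	var doc struct {
		Ident string `json:"ident"`
	}
	if i := bytes.IndexByte(line, '\t'); i >= 0 {
		json.Unmarshal(line[i+1:], &doc)
	}
	return doc.Ident
}

// emitBiblioRefs returns a group function for zipGroups that writes one
// biblioref per (ref, release) pair sharing a DOI key.
func emitBiblioRefs(w io.Writer) func(key string, refs, releases [][]byte) error {
	enc := json.NewEncoder(w)
	return func(key string, refs, releases [][]byte) error {
		for _, r := range refs {
			for _, t := range releases {
				br := BiblioRef{
					SourceIdent: identFromLine(r),
					TargetIdent: identFromLine(t),
					MatchStatus: "exact", // joined on DOI
				}
				if err := enc.Encode(br); err != nil {
					return err
				}
			}
		}
		return nil
	}
}
```
With the pieces above, something like `zipGroups(refs, releases, emitBiblioRefs(os.Stdout))` would stream biblioref documents for all exact DOI matches.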
## TODO and Issues
+ [ ] a clearer way to deduplicate edges
Currently, we use the `source_release_ident` and `ref_index` as the
elasticsearch key. This means that reference docs coming from different
sources (e.g. crossref, grobid, etc.) but representing the same item must
agree in their index, which we can neither guarantee nor enforce robustly
across sources, where indices may change frequently.
A clearer path could be:
1. group all matches (and non-matches) by source work id (already the case)
2. generate a list of unique refs, unique by source-target, w/o any index
3. for any source-target with more than one occurrence, understand which one of these we want to include
Step 3 can then go into depth on multiplicities, e.g. is it an "ebd", "ff"
type reference? Does it come from a different source (in which case choose
the one most likely to be correct), etc. A small sketch of step 2 follows below.
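A sketch of step 2, assuming biblioref-like documents with source and target idents (field names as in the illustrative `BiblioRef` struct above): keep one ref per (source, target) pair, independent of `ref_index`, while retaining the duplicates for step 3.
```go
// uniqueBySourceTarget keeps the first biblioref per (source, target) pair
// and collects the rest, so that step 3 can inspect multiplicities.
func uniqueBySourceTarget(refs []BiblioRef) (unique []BiblioRef, dupes map[[2]string][]BiblioRef) {
	seen := make(map[[2]string]bool)
	dupes = make(map[[2]string][]BiblioRef)
	for _, br := range refs {
		k := [2]string{br.SourceIdent, br.TargetIdent}
		if seen[k] {
			dupes[k] = append(dupes[k], br)
			continue
		}
		seen[k] = true
		unique = append(unique, br)
	}
	return unique, dupes
}
```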
### A simple pattern
We have a typical pattern:
* two medium-sized data sets with different schemas
* mapper functions per schema
* a reduce function streaming through the "mapped" files of both schemas
This is basically a `GROUP BY`, where we might want to group by a value that
we need to compute first (e.g. title normalization, SOUNDEX, NYSIIS, ...) and
where the aggregation is a full function, e.g. one able to generate documents
in a third schema (e.g. a biblioref document).
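For the computed grouping key, here is a sketch of a crude title normalization (SOUNDEX or NYSIIS would be drop-in alternatives); this is not the normalization skate actually uses.
```go
// normalizeTitle lowercases a title and strips everything but letters and
// digits, yielding a crude grouping key; real normalization would do more.
func normalizeTitle(s string) string {
	var b strings.Builder
	for _, r := range strings.ToLower(s) {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			b.WriteRune(r)
		}
	}
	return b.String()
}
```
A mapper would then emit `normalizeTitle(doc.Title)` as the key instead of the DOI, and the same grouping machinery applies.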
We could look into something like PostgreSQL and add custom functions: load
JSON files, load functions, run. Or keep the data at rest and try to implement
a performant scan over it manually.
What we want is some type that encapsulates schema, extraction and reduction
into a single, runnable entity.
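A sketch of what such a type could look like; neither the name nor the fields exist in skate as-is.
```go
// Job is an illustrative shape for such a type. A Run method would write
// Mapper.AsTSV output for both inputs to temporary files, sort them by key
// (e.g. "sort -k1,1"), and stream the sorted files through Reduce via a
// zipkey-style join.
type Job struct {
	A, B   io.Reader                             // raw inputs, one per schema
	MapA   Mapper                                // key extraction for schema A
	MapB   Mapper                                // key extraction for schema B
	Reduce func(key string, a, b [][]byte) error // applied to each key group
}
```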