## Transformation
We take JSON lines as input, extract the id, and derive a key. The resulting
file is a TSV of the shape:
```
ID KEY DOC
```
The output is then sorted by key (optional, but typical for the use case).
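For orientation, a minimal sketch of this step in Go; the `ident` and `title`
field names follow the example document below, and the lowercased-title key is
only a stand-in for the configurable key functions listed under Usage:
```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"strings"
)

func main() {
	scanner := bufio.NewScanner(os.Stdin)
	scanner.Buffer(make([]byte, 0, 1<<20), 1<<20) // allow long documents
	w := bufio.NewWriter(os.Stdout)
	defer w.Flush()
	for scanner.Scan() {
		line := scanner.Text()
		var doc struct {
			Ident string `json:"ident"`
			Title string `json:"title"`
		}
		if err := json.Unmarshal([]byte(line), &doc); err != nil {
			continue // skip lines that do not parse
		}
		// Stand-in key function: trimmed, lowercased title.
		key := strings.ToLower(strings.TrimSpace(doc.Title))
		fmt.Fprintf(w, "%s\t%s\t%s\n", doc.Ident, key, line)
	}
}
```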
## Why an extra command?
We had a Python program for this, which we parallelized with the great [GNU
parallel](https://www.gnu.org/software/parallel/). However, when sharding the
input with parallel, the program only saw a single chunk at a time and could
therefore miss clusters that span chunk boundaries (not a problem of parallel,
but of our code, but still).
## Usage
```
$ skate-derive-key < release_entities.jsonl | sort -k2,2 | skate-cluster > cluster.jsonl
```
A few options:
```
$ skate-derive-key -h
Usage of skate-derive-key:
-b int
batch size (default 50000)
-f string
key function name, other: title, tnorm, tnysi (default "tsand")
-verbose
show progress
-w int
number of workers (default 8)
```
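The key function names suggest title-based keys; as a rough sketch, a
`tnorm`-style normalization might look like the following (an illustration,
not the actual implementation):
```go
package keys

import (
	"strings"
	"unicode"
)

// keyTitleNormalized sketches a "tnorm"-style key: lowercase the title
// and drop everything but letters and digits, so near-identical titles
// map to the same key.
func keyTitleNormalized(title string) string {
	var b strings.Builder
	for _, r := range strings.ToLower(title) {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			b.WriteRune(r)
		}
	}
	return b.String()
}
```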
Clusters are JSON lines with:
* a single string as key `k`
* a list of documents as value `v`
The reason to include the complete documents is performance: for simplicity
and (typically) sequential reads, a "file" seems to be a good option.
```json
{
"k": "植字手引",
"v": [
{
"abstracts": [],
"refs": [],
"contribs": [
{
"index": 0,
"raw_name": "大久保, 猛雄",
"given_name": "大久保, 猛雄",
"role": "author"
}
],
"language": "ja",
"publisher": "広島植字研究会",
"ext_ids": {
"doi": "10.11501/1189671"
},
"release_year": 1929,
"release_stage": "published",
"release_type": "article-journal",
"webcaptures": [],
"filesets": [],
"files": [],
"work_id": "aaaab7poljf25dg4322ebsgism",
"title": "植字手引",
"state": "active",
"ident": "bc5mykteevcy3masrst3zjqgwq",
"revision": "97846ea8-41e5-40aa-9d41-e8c4b45f67e4",
"extra": {
"jalc": {}
}
}
]
}
```
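In Go, such a cluster line could be modeled with `json.RawMessage` values, so
documents pass through without a full decode; a sketch based on the `k`/`v`
layout above (the type name is made up):
```go
package skate

import "encoding/json"

// ClusterDoc models one cluster line: a key and the complete documents
// sharing it. json.RawMessage keeps the documents opaque, avoiding a
// costly re-decode of each value.
type ClusterDoc struct {
	Key    string            `json:"k"`
	Values []json.RawMessage `json:"v"`
}
```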
Options:
```
$ skate-cluster -h
Usage of skate-cluster:
-d int
which column contains the doc (default 3)
-k int
which column contains the key (one based) (default 2)
```
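Since the input arrives sorted by key, clustering reduces to grouping
consecutive rows; a simplified sketch of that core loop, with the column
layout hardcoded to the defaults (ID, KEY, DOC):
```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"strings"
)

func main() {
	scanner := bufio.NewScanner(os.Stdin)
	scanner.Buffer(make([]byte, 0, 1<<20), 1<<20) // allow long documents
	w := bufio.NewWriter(os.Stdout)
	defer w.Flush()
	var key string
	var docs []json.RawMessage
	flush := func() {
		if len(docs) == 0 {
			return
		}
		b, _ := json.Marshal(map[string]interface{}{"k": key, "v": docs})
		fmt.Fprintln(w, string(b))
		docs = nil
	}
	for scanner.Scan() {
		fields := strings.SplitN(scanner.Text(), "\t", 3)
		if len(fields) < 3 {
			continue // skip malformed rows
		}
		// A new key closes the previous cluster.
		if fields[1] != key {
			flush()
			key = fields[1]
		}
		docs = append(docs, json.RawMessage(fields[2]))
	}
	flush()
}
```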
## Performance notes
* key extraction with parallel jsoniter runs at about 130 MB/s
* setting up pipes in Go, on the shell, or not at all seems to make little difference
* a large sort buffer is key when using pipes; the default is only 1K
Note: we need to debug performance at some point; e.g.
```
$ zstdcat -T0 refs_titles.tsv.zst | TMPDIR=/fast/tmp LC_ALL=C sort -S20% | \
LC_ALL=C uniq -c | zstd -c9 > refs_titles_unique.tsv.zst
```
takes 46min, even though we can iterate over the data at 2-5M lines/s.
## Misc
The `skate-ref-to-release` command is a simple one-off schema converter (mostly
decode and encode); it runs over ~1.7B docs in 81min, about 349,794 docs/s.
The `skate-verify` command is a port of the fuzzycat.verify (Python)
implementation; it runs at about 70K verifications/s. For example, when running
over refs, we can verify 1M clusters in less than 3min (and a full 40M set in
less than 2h, about 25x faster than the Python/parallel version).
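For orientation, verification over a cluster is a pairwise operation; a
minimal sketch below, where the `Status` values and the function signature are
placeholders, not the ported fuzzycat API:
```go
package skate

// Status is a placeholder for a verification outcome; fuzzycat uses a
// richer set of statuses (e.g. exact, strong, different).
type Status string

// Release stands in for the release entity schema shown above.
type Release struct {
	Ident string `json:"ident"`
	Title string `json:"title"`
}

// verifyCluster applies a pairwise verification function to all
// document pairs in a cluster; O(n^2) in the cluster size.
func verifyCluster(docs []Release, verify func(a, b Release) Status) []Status {
	var out []Status
	for i := 0; i < len(docs); i++ {
		for j := i + 1; j < len(docs); j++ {
			out = append(out, verify(docs[i], docs[j]))
		}
	}
	return out
}
```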
![](static/skate.png)