## Transformation

We take JSON lines as input, extract the id and derive a key. The resulting
file is a TSV of the shape:

```
ID    KEY    DOC
```

The output is then sorted by key (optional, but typical for the use case).
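
The gist of the transformation fits in a few lines of Go. This is a sketch only, not the actual skate code: the field names `ident` and `title` are taken from the release schema, and `deriveKey` is a crude stand-in for the real key functions.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"strings"
)

// release holds just the fields we need from each input document.
type release struct {
	Ident string `json:"ident"`
	Title string `json:"title"`
}

// deriveKey stands in for the real key functions (title, tnorm, tnysi,
// tsand); here we only lowercase the title and squash whitespace.
func deriveKey(title string) string {
	return strings.Join(strings.Fields(strings.ToLower(title)), " ")
}

func main() {
	scanner := bufio.NewScanner(os.Stdin)
	scanner.Buffer(make([]byte, 64*1024), 16*1024*1024) // allow long documents
	bw := bufio.NewWriter(os.Stdout)
	defer bw.Flush()
	for scanner.Scan() {
		line := scanner.Bytes()
		var r release
		if err := json.Unmarshal(line, &r); err != nil {
			continue // skip unparseable lines
		}
		// ID, KEY, DOC; the complete document travels along in column three.
		fmt.Fprintf(bw, "%s\t%s\t%s\n", r.Ident, deriveKey(r.Title), line)
	}
}
```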

## Why an extra command?

We had a Python program for this, which we parallelized with the great [GNU
parallel](https://www.gnu.org/software/parallel/). However, when sharding the
input with parallel, the program only ever saw a single chunk, and could
therefore miss clusters that span chunk boundaries (not a problem of parallel,
but of our code, but still).

## Usage

```
$ skate-derive-key < release_entities.jsonl | sort -k2,2 | skate-cluster > cluster.jsonl
```

A few options:

```
$ skate-derive-key -h
Usage of skate-derive-key:
  -b int
        batch size (default 50000)
  -f string
        key function name, other: title, tnorm, tnysi (default "tsand")
  -verbose
        show progress
  -w int
        number of workers (default 8)
```

Clusters are JSON lines with:

* a single string as key `k`
* a list of documents as values `v`

The reason to include the complete documents is performance: for simplicity
and (typically) sequential reads, a plain file seems to be a good option.

```json
{
  "k": "植字手引",
  "v": [
    {
      "abstracts": [],
      "refs": [],
      "contribs": [
        {
          "index": 0,
          "raw_name": "大久保, 猛雄",
          "given_name": "大久保, 猛雄",
          "role": "author"
        }
      ],
      "language": "ja",
      "publisher": "広島植字研究会",
      "ext_ids": {
        "doi": "10.11501/1189671"
      },
      "release_year": 1929,
      "release_stage": "published",
      "release_type": "article-journal",
      "webcaptures": [],
      "filesets": [],
      "files": [],
      "work_id": "aaaab7poljf25dg4322ebsgism",
      "title": "植字手引",
      "state": "active",
      "ident": "bc5mykteevcy3masrst3zjqgwq",
      "revision": "97846ea8-41e5-40aa-9d41-e8c4b45f67e4",
      "extra": {
        "jalc": {}
      }
    }
  ]
}
```
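
In Go, this shape can be modeled with a small struct; keeping the values as `json.RawMessage` avoids decoding the documents when only the key is needed. A sketch that prints cluster sizes (the actual type in skate may differ):

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
)

// Cluster mirrors the JSON lines emitted by skate-cluster: a key and
// the complete documents that share it.
type Cluster struct {
	Key    string            `json:"k"`
	Values []json.RawMessage `json:"v"`
}

func main() {
	scanner := bufio.NewScanner(os.Stdin)
	scanner.Buffer(make([]byte, 64*1024), 64*1024*1024) // clusters can be large
	for scanner.Scan() {
		var c Cluster
		if err := json.Unmarshal(scanner.Bytes(), &c); err != nil {
			continue
		}
		fmt.Printf("%s\t%d\n", c.Key, len(c.Values))
	}
}
```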

Options:

```
$ skate-cluster -h
Usage of skate-cluster:
  -d int
        which column contains the doc (default 3)
  -k int
        which column contains the key (one based) (default 2)
```
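
Since the input is sorted by key, the clustering step reduces to grouping consecutive rows that share a key and emitting one JSON line per group. A minimal sketch, hardcoded to the default columns (`-k 2 -d 3`) and not the actual skate implementation:

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"strings"
)

func main() {
	var (
		prev string
		docs []json.RawMessage
	)
	bw := bufio.NewWriter(os.Stdout)
	defer bw.Flush()
	// flush emits the current group as {"k": ..., "v": [...]}.
	flush := func() {
		if len(docs) == 0 {
			return
		}
		b, _ := json.Marshal(map[string]interface{}{"k": prev, "v": docs})
		fmt.Fprintln(bw, string(b))
		docs = nil
	}
	scanner := bufio.NewScanner(os.Stdin)
	scanner.Buffer(make([]byte, 64*1024), 16*1024*1024) // long lines
	for scanner.Scan() {
		fields := strings.Split(scanner.Text(), "\t")
		if len(fields) < 3 {
			continue
		}
		key, doc := fields[1], fields[2] // defaults: -k 2, -d 3 (one based)
		if key != prev {
			flush()
			prev = key
		}
		docs = append(docs, json.RawMessage(doc))
	}
	flush()
}
```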

## Performance notes

* key extraction with parallel jsoniter runs at about 130 MB/s
* whether the pipes live in Go, on the shell, or not at all seems to make little difference
* having a large sort buffer is key, then using pipes; the default is 1K (see the sketch below)
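
For illustration, delegating the sort to an external `sort` with a large buffer from within a Go pipeline is straightforward; a sketch (flags and buffer size are examples, not skate's actual invocation):

```go
package main

import (
	"os"
	"os/exec"
)

func main() {
	// Delegate sorting to GNU sort; a generous in-memory buffer (-S)
	// and byte-wise comparisons (LC_ALL=C) make the biggest difference.
	cmd := exec.Command("sort", "-S", "25%", "-k", "2,2")
	cmd.Env = append(os.Environ(), "LC_ALL=C")
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		panic(err)
	}
}
```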

Note: need to debug performance at some point; e.g.

```
$ zstdcat -T0 refs_titles.tsv.zst | TMPDIR=/fast/tmp LC_ALL=C sort -S20% | \
    LC_ALL=C uniq -c | zstd -c9 > refs_titles_unique.tsv.zst
```

takes 46 min, and we can iterate over the result at 2-5M lines/s.

## Misc

The `skate-ref-to-release` command is a simple one-off schema converter (mostly
decode and encode), which runs over ~1.7B docs in 81 min, about 349794 docs/s.

The `skate-verify` command is a port of the fuzzycat.verify (Python)
implementation. It can run 70K verifications/s; e.g. when running over refs, we
can verify 1M clusters in less than 3 min (and a full 40M set in less than 2 h;
that's 25x faster than the Python/Parallel version).
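
One plausible shape of such a verification loop: for each cluster, compare document pairs and emit a status. A sketch with a placeholder check (the actual pairwise logic lives in the port):

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
)

// cluster is the same shape as above: a key plus complete documents.
type cluster struct {
	Key    string            `json:"k"`
	Values []json.RawMessage `json:"v"`
}

// verify stands in for the ported fuzzycat.verify logic; the real code
// compares two release documents and returns a match status.
func verify(a, b json.RawMessage) string {
	return "unknown" // placeholder
}

func main() {
	scanner := bufio.NewScanner(os.Stdin)
	scanner.Buffer(make([]byte, 64*1024), 64*1024*1024)
	for scanner.Scan() {
		var c cluster
		if err := json.Unmarshal(scanner.Bytes(), &c); err != nil {
			continue
		}
		// Compare every pair of documents within a cluster.
		for i := 0; i < len(c.Values); i++ {
			for j := i + 1; j < len(c.Values); j++ {
				fmt.Printf("%s\t%s\n", c.Key, verify(c.Values[i], c.Values[j]))
			}
		}
	}
}
```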

![](static/skate.png)