# skate
A library and suite of command line tools related to generating a [citation
graph](https://en.wikipedia.org/wiki/Citation_graph).
> There is no standard format for the citations in bibliographies, and the
> record linkage of citations can be a time-consuming and complicated process.
## Background
Python was a bit too slow, even when parallelized (e.g. with GNU parallel),
for tasks like generating clusters of similar documents or running
verification. One option for the future would be
[Cython](https://cython.org/). Parts of
[fuzzycat](https://git.archive.org/webgroup/fuzzycat) have been ported into
this project for performance (we saw a 25x speedup for certain tasks).
![](static/zipkey.png)
## Overview
First, generate a "sorted key file": for our purposes, a TSV containing a key
and the original document. Various mappers are implemented, and it is
relatively easy to add another one.
```
$ skate-map -m ts < file.jsonl | sort -k1,1 > map.tsv
```
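To illustrate the idea, here is a minimal mapper sketch in Go (the language
skate is written in): read JSON lines, derive a key, emit `key<TAB>doc`. The
key scheme used here (lowercased alphanumeric title) is an assumption for
illustration, not necessarily what any built-in mapper does.
```
// Sketch of a mapper: read JSON documents line by line, derive a key,
// and emit "key<TAB>doc" TSV ready for sort(1). The key scheme here
// (lowercased alphanumeric title) is an illustrative assumption.
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"strings"
	"unicode"
)

type doc struct {
	Title string `json:"title"`
}

// key reduces a title to its lowercase alphanumeric characters.
func key(title string) string {
	var b strings.Builder
	for _, r := range strings.ToLower(title) {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			b.WriteRune(r)
		}
	}
	return b.String()
}

func main() {
	scanner := bufio.NewScanner(os.Stdin)
	scanner.Buffer(make([]byte, 0, 1<<20), 1<<24) // JSON lines can be long
	w := bufio.NewWriter(os.Stdout)
	defer w.Flush()
	for scanner.Scan() {
		line := scanner.Text()
		var d doc
		if err := json.Unmarshal([]byte(line), &d); err != nil || d.Title == "" {
			continue // skip undecodable or keyless documents
		}
		fmt.Fprintf(w, "%s\t%s\n", key(d.Title), line)
	}
}
```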
Repeat the mapping for any file you want to compare against the catalog. Then,
decide which *reduce* mode is desired.
```
$ skate-reduce -r bref -f file.1 -g file.2
```
Depending on the reducer, the output is either a verification status or
documents in some export schema.
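Conceptually, the reduce step is a merge join over two key-sorted streams,
which is what the zipkey figure above depicts. Below is a minimal sketch of
that idea, assuming two `key<TAB>doc` files sorted by key; it is an
illustration, not skate-reduce's actual implementation.
```
// Sketch of the reduce idea: zip two key-sorted TSV streams and act on
// keys present in both. A real reducer would group runs of equal keys
// on each side; here we just advance both streams on a match.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// open returns a line scanner over a file, panicking on error (sketch only).
func open(name string) *bufio.Scanner {
	f, err := os.Open(name)
	if err != nil {
		panic(err)
	}
	return bufio.NewScanner(f)
}

// entry splits a "key\tdoc" line into key and doc.
func entry(line string) (string, string) {
	parts := strings.SplitN(line, "\t", 2)
	if len(parts) == 2 {
		return parts[0], parts[1]
	}
	return parts[0], ""
}

func main() {
	a, b := open("file.1"), open("file.2")
	aOK, bOK := a.Scan(), b.Scan()
	for aOK && bOK {
		ka, da := entry(a.Text())
		kb, db := entry(b.Text())
		switch {
		case ka < kb:
			aOK = a.Scan()
		case ka > kb:
			bOK = b.Scan()
		default:
			// Matching key: verification or export would happen here.
			fmt.Printf("match %s: %.40s %.40s\n", ka, da, db)
			aOK, bOK = a.Scan(), b.Scan()
		}
	}
}
```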
WIP: ...
## Core Utils
* `skate-derive-key` (to be renamed `skate-map`)
* `skate-cluster`
* `skate-verify-*`
The `skate-derive-key` tool derives a key from release entity JSON documents.
```
$ skate-derive-key < release_entities.jsonlines > docs.tsv
```
The result will be a three-column TSV (ident, key, doc).
```
---- ident --------------- ---- key --------- ---- doc ----------
4lzgf5wzljcptlebhyobccj7ru 2568diamagneticsus {"abstracts":[],...
```
After this step:
* sort by key, e.g. `LC_ALL=C sort -k2,2 -S 35% --parallel 6 --compress-program pzstd ...`
* cluster, e.g. `skate-cluster ...`
----
The `skate-cluster` tool converts sorted key output into JSON lines, one
cluster per line. For example, this:
```
id123 somekey123 {"a":"b", ...}
id391 somekey123 {"x":"y", ...}
```
would turn into a single line containing all docs that share the key:
```
{"k": "somekey123", "v": [{"a":"b", ...},{"x":"y",...}]}
```
A single-line cluster is easier to parallelize over (e.g. for verification).
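A minimal sketch of that grouping in Go, assuming `ident<TAB>key<TAB>doc`
input sorted by key (as produced by `skate-derive-key`); the real tool handles
more options and edge cases:
```
// Sketch of clustering: collapse consecutive lines that share a key
// (column 2 of "ident\tkey\tdoc", sorted by key) into one JSON line
// per cluster: {"k": key, "v": [doc, ...]}.
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"strings"
)

// emit writes one cluster as a single JSON line.
func emit(w *bufio.Writer, key string, docs []json.RawMessage) {
	if key == "" || len(docs) == 0 {
		return
	}
	b, _ := json.Marshal(map[string]interface{}{"k": key, "v": docs})
	fmt.Fprintln(w, string(b))
}

func main() {
	scanner := bufio.NewScanner(os.Stdin)
	scanner.Buffer(make([]byte, 0, 1<<20), 1<<24) // JSON lines can be long
	w := bufio.NewWriter(os.Stdout)
	defer w.Flush()
	var current string
	var docs []json.RawMessage
	for scanner.Scan() {
		parts := strings.SplitN(scanner.Text(), "\t", 3)
		if len(parts) != 3 {
			continue // skip malformed lines
		}
		if parts[1] != current {
			emit(w, current, docs) // key changed: flush previous cluster
			current, docs = parts[1], nil
		}
		docs = append(docs, json.RawMessage(parts[2]))
	}
	emit(w, current, docs) // flush the final cluster
}
```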
----
The `skate-verify-*` tools run various matching and verification algorithms.
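To give a flavor of what verification means here, a toy pairwise check is
sketched below. The status names loosely follow fuzzycat's verification scale;
the actual rules consider many more fields (contribs, identifiers, fuzzy title
similarity, ...).
```
// Toy sketch of pairwise verification: compare two release documents
// and return a match status. Real checks are far more involved.
package main

import (
	"fmt"
	"strings"
)

type release struct {
	Title string
	DOI   string
	Year  int
}

func verify(a, b release) string {
	switch {
	case a.DOI != "" && a.DOI == b.DOI:
		return "EXACT"
	case strings.EqualFold(a.Title, b.Title) && a.Year == b.Year:
		return "STRONG"
	case strings.EqualFold(a.Title, b.Title):
		return "WEAK"
	default:
		return "DIFFERENT"
	}
}

func main() {
	a := release{Title: "Rational point", DOI: "10.1515/crll.1988.386.32", Year: 1988}
	b := release{Title: "Rational Point", Year: 1988}
	fmt.Println(verify(a, b)) // STRONG
}
```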
## Extra
* `skate-wikipedia-doi`
> TSV (page title, DOI, doc) from Wikipedia refs.
```
$ parquet-tools cat --json minimal_dataset.parquet | skate-wikipedia-doi
Rational point 10.1515/crll.1988.386.32 {"type_of_citation" ...
Cubic surface 10.2140/ant.2007.1.393 {"type_of_citation" ...
```
* `skate-bref-id`
> Temporary helper to add a key to a biblioref document.
* `skate-from-unstructured`
> Takes a refs file and plucks out identifiers from the unstructured field.
* `skate-conv`
> Converts a ref (or Open Library) document to a release. Part of the first
> step, merging refs and releases.
* `skate-to-doi`
> Sanitizes DOIs in a tabular file.
## Misc
These tools handle on the order of a TB of JSON and billions of documents,
especially for the following use case:
* deriving a key from a document
* sorting documents by that key
* clustering and verifying documents within clusters
The main use case is match candidate generation and verification for fuzzy
matching, especially for building a citation graph dataset from
[fatcat](https://fatcat.wiki).
![](static/two_cluster_synopsis.png)