# skate

skate is a suite of command line tools written for various parts of the
citation graph pipeline.

## Tools

### skate-wikipedia-doi

Extracts a TSV of (page title, DOI, doc) from Wikipedia refs.

```
$ parquet-tools cat --json minimal_dataset.parquet | skate-wikipedia-doi
Rational point  10.1515/crll.1988.386.32        {"type_of_citation" ...
Cubic surface   10.2140/ant.2007.1.393          {"type_of_citation" ...
```

### skate-bref-id

Temporary helper to add a key to a biblioref document.
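
A usage sketch, assuming the tool reads biblioref JSON lines on stdin and
writes keyed documents to stdout (filenames are hypothetical):

```
$ skate-bref-id < biblioref.jsonlines > biblioref_keyed.jsonlines
```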

### skate-cluster

Converts sorted (ident, key, doc) lines into jsonlines clusters.

For example, this:

    id123    somekey123    {"a":"b", ...}
    id391    somekey123    {"x":"y", ...}

would turn into a single line containing all docs that share the same key:

    {"k": "somekey123", "v": [{"a":"b", ...},{"x":"y",...}]}

A single line per cluster is easier to process in parallel (e.g. for
verification).
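
A sketch of how this slots into the pipeline, assuming skate-cluster reads the
sorted three-column TSV on stdin (filenames are hypothetical):

```
$ LC_ALL=C sort -k2,2 docs.tsv | skate-cluster > clusters.jsonlines
```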

### skate-derive-key

skate-derive-key derives a key from release entity JSON documents.

```
$ skate-derive-key < release_entities.jsonlines > docs.tsv
```

The result is a three-column TSV (ident, key, doc).

```
---- ident --------------- ---- key --------- ---- doc ----------

4lzgf5wzljcptlebhyobccj7ru 2568diamagneticsus {"abstracts":[],...
```

After this step (a combined sketch follows the list):

* sort by key, e.g. `LC_ALL=C sort -k2,2 -S 35% --parallel 6 --compress-program pzstd ...`
* cluster, e.g. `skate-cluster ...`
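
A combined sketch with hypothetical filenames, assuming skate-cluster reads
the sorted TSV on stdin as above:

```
$ skate-derive-key < release_entities.jsonlines > docs.tsv
$ LC_ALL=C sort -k2,2 -S 35% --parallel 6 docs.tsv > docs_sorted.tsv
$ skate-cluster < docs_sorted.tsv > clusters.jsonlines
```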

### skate-from-unstructured

Takes a refs file and plucks out identifiers from the unstructured field.
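
A usage sketch, assuming refs are read as JSON lines on stdin (filenames are
hypothetical):

```
$ skate-from-unstructured < refs.jsonlines > refs_with_ids.jsonlines
```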

### skate-ref-to-release

Converts a ref document to a release. Part of the first run, which merges refs
and releases.
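
Under the same stdin/stdout assumption (filenames are hypothetical):

```
$ skate-ref-to-release < refs.jsonlines > releases.jsonlines
```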

### skate-to-doi

Sanitizes DOIs in a tabular file.
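
An illustrative before/after pair; the column layout and the exact
normalization rules (e.g. stripping a resolver prefix, lowercasing) are
assumptions, not confirmed behavior:

```
$ printf 'id123\thttps://doi.org/10.1515/CRLL.1988.386.32\n' | skate-to-doi
id123	10.1515/crll.1988.386.32
```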

### skate-verify

Runs various matching and verification algorithms.
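
A sketch, assuming skate-verify consumes cluster lines on stdin and emits one
result per match candidate (filenames are hypothetical):

```
$ skate-verify < clusters.jsonlines > verified.tsv
```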

## Problem

Handling a TB of JSON and billions of documents, especially for the following
workflow:

* deriving a key from a document
* sorting documents by (that) key
* clustering and verifying documents in clusters

The main use case is match candidate generation and verification for fuzzy
matching, especially for building a citation graph dataset from
[fatcat](https://fatcat.wiki).
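
Condensed into a single pipeline sketch (filenames hypothetical, tool flags
and stdin/stdout behavior as assumed above):

```
$ skate-derive-key < release_entities.jsonlines \
    | LC_ALL=C sort -k2,2 -S 35% --parallel 6 \
    | skate-cluster \
    | skate-verify > verified.tsv
```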

![](static/two_cluster_synopsis.png)