# skate

The skate suite of command-line tools has been written for various parts of the
citation graph pipeline.

## Tools

### skate-biblioref
### skate-biblioref-from-wikipedia
### skate-bref-id
### skate-cluster
### skate-cluster-stats
### skate-derive-key
### skate-from-unstructured
### skate-ref-to-release
### skate-to-doi
### skate-verify


Goal: make key extraction and comparisons fast for billions of records on a
single machine to support deduplication work for [fatcat](https://fatcat.wiki)
metadata.

## Problem

Handling a TB of JSON and billions of documents, especially for the following
tasks:

* deriving a key from a document
* sorting documents by that key
* clustering and verifying documents in clusters
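
The three steps above can be illustrated with a minimal Python sketch. It is
not how the skate tools are implemented. The field names (`id`, `title`,
`year`), the normalization in `derive_key`, and the year check in `verify` are
all assumptions made for the example, and the in-memory sort stands in for
whatever external sort a pipeline at this scale would need.

```python
import itertools
import json
import re
from operator import itemgetter

# Hypothetical newline-delimited JSON input; field names are assumptions.
SAMPLE = [
    '{"id": "a", "title": "Deep Learning", "year": 2015}',
    '{"id": "b", "title": "deep learning.", "year": 2015}',
    '{"id": "c", "title": "Deep-Learning", "year": 1999}',
    '{"id": "d", "title": "Graph Theory", "year": 1969}',
]


def derive_key(doc):
    """Derive a key: lowercase the title and keep only [a-z0-9] (assumed scheme)."""
    return re.sub(r"[^a-z0-9]", "", doc.get("title", "").lower())


def cluster(lines):
    """Yield (key, docs) for each group of documents that share a key."""
    keyed = [(derive_key(doc), doc) for doc in map(json.loads, lines)]
    # Sort by key only; groupby requires equal keys to be adjacent.
    keyed.sort(key=itemgetter(0))
    for key, group in itertools.groupby(keyed, key=itemgetter(0)):
        yield key, [doc for _, doc in group]


def verify(a, b):
    """Toy verification: accept a candidate pair only if the years agree."""
    return a.get("year") is not None and a.get("year") == b.get("year")


def verified_pairs(lines):
    """Yield id pairs that share a key (candidates) and pass verification."""
    for key, docs in cluster(lines):
        if not key:
            continue  # documents without a usable key are not compared
        for a, b in itertools.combinations(docs, 2):
            if verify(a, b):
                yield a["id"], b["id"]


matches = list(verified_pairs(SAMPLE))
print(matches)  # [('a', 'b')]
```

Grouping by key means only documents within the same cluster are compared
pairwise, so the expensive verification step never runs across the full
dataset.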

The main use case is match candidate generation and verification for fuzzy
matching, especially for building a citation graph dataset from
[fatcat](https://fatcat.wiki).

![](static/two_cluster_synopsis.png)