# fuzzycat (wip)

Fuzzy matching publications for [fatcat](https://fatcat.wiki).

* [fuzzycat](https://pypi.org/project/fuzzycat/)

Note: This is currently work-in-progress.

# Use cases

* [ ] take a release entity database dump as JSON lines and cluster releases
  (according to various algorithms)
* [ ] take cluster information and run a verification step (misc algorithms)

# Usage

Clustering takes release entities in JSON lines format (one JSON document per
line) as input.

```shell
$ cat data/sample.json | python -m fuzzycat.main cluster -t title > out.json
```

Clustering 100k records takes about 6s.
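
To illustrate what clustering by title means, here is a minimal sketch. It is not fuzzycat's implementation. It assumes each input line is a release entity with `ident` and `title` fields, that the cluster key is the stripped, lowercased title, and that groups with a single member are not reported. The function name `cluster_by_title` is hypothetical.

```python
import json
from collections import defaultdict


def cluster_by_title(lines):
    """Group release idents by a simple title key (illustrative only)."""
    groups = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue
        doc = json.loads(line)
        title = doc.get("title")
        ident = doc.get("ident")
        if not title or not ident:
            continue  # skip records without the assumed fields
        # Assumed normalization: trim whitespace and lowercase the title.
        groups[title.strip().lower()].append(ident)
    for key, idents in groups.items():
        if len(idents) > 1:  # assumption: singletons are not clusters
            yield {"k": key, "v": sorted(idents)}


if __name__ == "__main__":
    sample = [
        '{"ident": "b", "title": "Back Cover"}',
        '{"ident": "a", "title": " back cover "}',
        '{"ident": "c", "title": "Something Else"}',
    ]
    for cluster in cluster_by_title(sample):
        print(json.dumps(cluster))
```

The real command supports several key types via `-t`; this sketch only mirrors the general idea of grouping by a derived key.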

```shell
$ head -1 out.json
{
  "c": "release_key_title",
  "v": [
    "7ufkzsjywzejvjzsyegugradoa",
    "harjqexl5vagxc54zjfen5zlve",
    "i5jrdoxqmjfs3fk2dcpnqxqb2e",
    "i62bo63qqzggjjk7pf77z26djm",
    "omo3z5y7qvh6hbl7wjacinsfiq",
    "prkik3s5vzejnfe4u26g2vt2wu",
    "pyqss6ifnvgqjeqohlampswvkm",
    "spr2b23fk5asph7v6shrd6okt4",
    "togokylwfvcvzilhnx4jir2hfm",
    "us4artv2hbc5bljuwaopquicfu",
    "ycargjj4lzddnmyzbh2e22wsii"
  ],
  "k": "裏表紙"
}
```
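
The output can be consumed line by line. The sketch below assumes, based on the sample above, that every line is a JSON object with a key `k` and a list of release identifiers `v`. The helper name `read_clusters` is hypothetical and not part of fuzzycat.

```python
import io
import json


def read_clusters(lines):
    """Yield (key, idents) for every non-empty JSON line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        doc = json.loads(line)
        yield doc["k"], doc["v"]


if __name__ == "__main__":
    # Stand-in for open("out.json"), using a shortened version of the sample.
    sample = io.StringIO(
        '{"c": "release_key_title", '
        '"v": ["7ufkzsjywzejvjzsyegugradoa", "harjqexl5vagxc54zjfen5zlve"], '
        '"k": "裏表紙"}\n'
    )
    for key, idents in read_clusters(sample):
        print(key, len(idents))
```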

GNU parallel can speed this up.

```shell
$ cat data/sample.json | parallel -j 8 --pipe --roundrobin python -m fuzzycat.main cluster -t title
```

Interestingly, the parallel variant detects fewer clusters, because the input
is split into batches and clusters are only searched within each batch.
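
One possible way to recombine per-batch results is to merge clusters that share the same cluster type and key. This is a sketch under assumptions, not something fuzzycat provides: it assumes the output format shown earlier (`c`, `k`, `v`), and `merge_clusters` is a hypothetical helper. It can only recombine clusters that a batch actually emitted, so records that found no partner within their batch may still be missed.

```python
import json
from collections import defaultdict


def merge_clusters(lines):
    """Merge cluster documents that share the same (c, k) pair."""
    merged = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        doc = json.loads(line)
        merged[(doc["c"], doc["k"])].update(doc["v"])
    for (ctype, key), idents in merged.items():
        yield {"c": ctype, "v": sorted(idents), "k": key}


if __name__ == "__main__":
    # Two batches that each found part of the same cluster.
    batch_output = [
        '{"c": "release_key_title", "v": ["a", "b"], "k": "some title"}',
        '{"c": "release_key_title", "v": ["c", "d"], "k": "some title"}',
    ]
    for cluster in merge_clusters(batch_output):
        print(json.dumps(cluster))
```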