# fuzzycat (wip)

Fuzzy matching publications for [fatcat](https://fatcat.wiki).

* [fuzzycat](https://pypi.org/project/fuzzycat/)

Note: This is currently work-in-progress.

# Example Run

Run any clustering algorithm.

```
$ time python -m fuzzycat cluster -t tsandcrawler < data/sample10m.json | \
    zstd -c9 > sample_cluster.json.zst
2020-11-18 00:19:48.194 DEBUG __main__ - run_cluster:
    {"key_fail": 0, "key_ok": 9999938, "key_empty": 62, "key_denylist": 0, "num_clusters": 9040789}

real    75m23.045s
user    95m14.455s
sys     3m39.121s
```

Run verification.

```
$ time zstdcat -T0 sample_cluster.json.zst | python -m fuzzycat verify > sample_verify.txt

real    7m56.713s
user    8m50.703s
sys     0m29.262s
```


Example results over 10M docs:

```json
{
  "miss.appendix": 176,
  "miss.arxiv_version": 25,
  "miss.blacklisted": 12082,
  "miss.blacklisted_fragment": 5,
  "miss.book_chapter": 46733,
  "miss.component": 1567,
  "miss.contrib_intersection_empty": 47691,
  "miss.dataset_doi": 30806,
  "miss.num_diff": 1,
  "miss.release_type": 157718,
  "miss.short_title": 16263,
  "miss.subtitle": 6013,
  "miss.title_filename": 57,
  "miss.year": 148755,
  "ok.arxiv_version": 93,
  "ok.dummy": 88294,
  "ok.preprint_published": 110,
  "ok.slug_title_author_match": 15818,
  "ok.title_author_match": 93240,
  "skip.container_name_blacklist": 20,
  "skip.publisher_blacklist": 456,
  "skip.too_large": 7430,
  "skip.unique": 8808462,
  "total": 9481815
}
```
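The counter names above share a prefix (`ok`, `miss`, `skip`) followed by a reason. As a minimal sketch, assuming the results are available as a plain dictionary, the counts can be rolled up per prefix. The function name `summarize` is hypothetical and not part of fuzzycat.

```python
from collections import Counter


def summarize(counts):
    """Sum verification counters by the prefix before the first dot.

    The "total" entry is skipped, since it is an aggregate already.
    """
    totals = Counter()
    for name, value in counts.items():
        if name == "total":
            continue
        prefix, _, _ = name.partition(".")
        totals[prefix] += value
    return dict(totals)


# A few counters copied from the results above (not the full set).
sample = {
    "ok.arxiv_version": 93,
    "ok.preprint_published": 110,
    "miss.num_diff": 1,
    "skip.publisher_blacklist": 456,
    "total": 9481815,
}
print(summarize(sample))
```

Applied to the full result set, the `ok`, `miss` and `skip` groups add up to the reported `total` of 9481815.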


# Use cases

* [ ] take a release entity database dump as JSON lines and cluster releases
  (according to various algorithms)
* [ ] take cluster information and run a verification step (misc algorithms)
* [ ] create a dataset that contains grouping of releases under works
* [ ] command line tools to generate cache keys, e.g. to match reference
  strings to release titles (this needs some transparent setup, e.g. filling of
  a cache before ops)

# Usage

Clustering takes release entities as JSON lines on standard input.

```shell
$ cat data/sample.json | python -m fuzzycat cluster -t title > out.json
```

Clustering 1M records (single core) takes about 64s (15K docs/s).

```shell
$ head -1 out.json
{
  "c": "release_key_title",
  "v": [
    "7ufkzsjywzejvjzsyegugradoa",
    "harjqexl5vagxc54zjfen5zlve",
    "i5jrdoxqmjfs3fk2dcpnqxqb2e",
    "i62bo63qqzggjjk7pf77z26djm",
    "omo3z5y7qvh6hbl7wjacinsfiq",
    "prkik3s5vzejnfe4u26g2vt2wu",
    "pyqss6ifnvgqjeqohlampswvkm",
    "spr2b23fk5asph7v6shrd6okt4",
    "togokylwfvcvzilhnx4jir2hfm",
    "us4artv2hbc5bljuwaopquicfu",
    "ycargjj4lzddnmyzbh2e22wsii"
  ],
  "k": "裏表紙"
}
```
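A cluster document carries the key (`k`), the member release identifiers (`v`) and the key type (`c`). The following is a rough sketch for reading such output, assuming the file holds one JSON document per line (the document above is shown pretty-printed). The helper name `cluster_sizes` is hypothetical.

```python
import io
import json


def cluster_sizes(lines):
    """Yield (key, number of members) for each cluster document.

    Assumes one JSON object per line with the fields "k" and "v".
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        doc = json.loads(line)
        yield doc["k"], len(doc["v"])


# A shortened stand-in for out.json, using two identifiers from the example.
example = io.StringIO(
    '{"c": "release_key_title", '
    '"v": ["7ufkzsjywzejvjzsyegugradoa", "harjqexl5vagxc54zjfen5zlve"], '
    '"k": "裏表紙"}\n'
)
for key, size in cluster_sizes(example):
    print(key, size)
```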

Use GNU parallel to speed this up.

```shell
$ cat data/sample.json | parallel -j 8 --pipe --roundrobin python -m fuzzycat.main cluster -t title
```

Interestingly, the parallel variant detects fewer clusters, because the data is
split and clusters are searched only within each batch. TODO(miku): sort out sharding bug.
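The effect can be reproduced with a toy example. This is only a sketch of the general problem, not fuzzycat's implementation: it assumes records are `(ident, title)` pairs, that the cluster key is the raw title, and that a "cluster" means a group with at least two members. All function names are hypothetical.

```python
from collections import defaultdict


def group(records):
    """Group record identifiers by key (here: the title)."""
    groups = defaultdict(list)
    for ident, title in records:
        groups[title].append(ident)
    return groups


def round_robin(records, n):
    """Distribute records over n shards, one record at a time."""
    shards = [[] for _ in range(n)]
    for i, record in enumerate(records):
        shards[i % n].append(record)
    return shards


def merge(shard_groups):
    """Combine per-shard groups that share the same key."""
    merged = defaultdict(list)
    for groups in shard_groups:
        for key, idents in groups.items():
            merged[key].extend(idents)
    return merged


def count_multi(groups):
    """Count groups with more than one member."""
    return sum(1 for idents in groups.values() if len(idents) > 1)


records = [("a", "x"), ("b", "x"), ("c", "y"), ("d", "y")]
shards = round_robin(records, 2)
per_shard = [group(shard) for shard in shards]

print(count_multi(group(records)))                # single process
print(sum(count_multi(g) for g in per_shard))     # each shard on its own
print(count_multi(merge(per_shard)))              # after merging by key
```

Records with the same title land in different shards here, so no shard sees a group of two on its own. Merging the per-shard groups by key recovers them.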


## Cluster

```shell
usage: fuzzycat command [options] cluster [-h] [--prefix PREFIX]
                                          [--tmpdir TMPDIR] [-P] [-f FILES]
                                          [-t TYPE]
                                          {cluster,verify} ...

positional arguments:
  {cluster,verify}
    cluster             group entities
    verify              verify groups

optional arguments:
  -h, --help            show this help message and exit
  --prefix PREFIX       temp file prefix
  --tmpdir TMPDIR       temporary directory
  -P, --profile         profile program
  -f FILES, --files FILES
                        output files
  -t TYPE, --type TYPE  cluster algorithm: title, tnorm, tnysi
```