# fuzzycat (wip)
Fuzzy matching publications for [fatcat](https://fatcat.wiki).
* [fuzzycat](https://pypi.org/project/fuzzycat/)
Note: This is currently work-in-progress.
# Example Run
Run a clustering algorithm (here `tsandcrawler`) over a sample of 10M documents.
```
$ time python -m fuzzycat cluster -t tsandcrawler < data/sample10m.json | \
zstd -c9 > sample_cluster.json.zst
2020-11-18 00:19:48.194 DEBUG __main__ - run_cluster:
{"key_fail": 0, "key_ok": 9999938, "key_empty": 62, "key_denylist": 0, "num_clusters": 9040789}
real 75m23.045s
user 95m14.455s
sys 3m39.121s
```
Run verification.
```
$ time zstdcat -T0 sample_cluster.json.zst | python -m fuzzycat verify > sample_verify.txt
real 7m56.713s
user 8m50.703s
sys 0m29.262s
```
Example results over 10M docs:
```json
{
"miss.appendix": 176,
"miss.arxiv_version": 25,
"miss.blacklisted": 12082,
"miss.blacklisted_fragment": 5,
"miss.book_chapter": 46733,
"miss.component": 1567,
"miss.contrib_intersection_empty": 47691,
"miss.dataset_doi": 30806,
"miss.num_diff": 1,
"miss.release_type": 157718,
"miss.short_title": 16263,
"miss.subtitle": 6013,
"miss.title_filename": 57,
"miss.year": 148755,
"ok.arxiv_version": 93,
"ok.dummy": 88294,
"ok.preprint_published": 110,
"ok.slug_title_author_match": 15818,
"ok.title_author_match": 93240,
"skip.container_name_blacklist": 20,
"skip.publisher_blacklist": 456,
"skip.too_large": 7430,
"skip.unique": 8808462,
"total": 9481815
}
```
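The counter names share a `miss.`, `ok.` or `skip.` prefix, so a per-category breakdown is easy to derive. A minimal sketch, assuming the counters are available as a flat JSON object like the one above; `summarize` is a made-up helper name and not part of fuzzycat:

```python
import collections
import json


def summarize(counters):
    """Sum counters by the prefix before the first dot, ignoring "total"."""
    summary = collections.Counter()
    for name, count in counters.items():
        if name == "total":
            continue
        prefix = name.split(".", 1)[0]
        summary[prefix] += count
    return dict(summary)


if __name__ == "__main__":
    # Abbreviated input for illustration; paste the full object for real sums.
    sample = json.loads('{"miss.year": 148755, "ok.dummy": 88294, "skip.unique": 8808462}')
    print(summarize(sample))
```

Applied to the full result above, the sums are 467892 for `miss`, 197555 for `ok` and 8816368 for `skip`, which add up to the reported `total` of 9481815.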
# Use cases
* [ ] take a release entity database dump as JSON lines and cluster releases
(according to various algorithms)
* [ ] take cluster information and run a verification step (misc algorithms)
* [ ] create a dataset that contains grouping of releases under works
* [ ] command line tools to generate cache keys, e.g. to match reference
strings to release titles (this needs some transparent setup, e.g. filling of
a cache before ops)
# Usage
Clustering takes release entities as JSON lines on standard input.
```shell
$ cat data/sample.json | python -m fuzzycat cluster -t title > out.json
```
Clustering 1M records (single core) takes about 64s (15K docs/s).
```shell
$ head -1 out.json
{
"c": "release_key_title",
"v": [
"7ufkzsjywzejvjzsyegugradoa",
"harjqexl5vagxc54zjfen5zlve",
"i5jrdoxqmjfs3fk2dcpnqxqb2e",
"i62bo63qqzggjjk7pf77z26djm",
"omo3z5y7qvh6hbl7wjacinsfiq",
"prkik3s5vzejnfe4u26g2vt2wu",
"pyqss6ifnvgqjeqohlampswvkm",
"spr2b23fk5asph7v6shrd6okt4",
"togokylwfvcvzilhnx4jir2hfm",
"us4artv2hbc5bljuwaopquicfu",
"ycargjj4lzddnmyzbh2e22wsii"
],
"k": "裏表紙"
}
```
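Going by the example above, each cluster document carries the key type (`c`), the member release identifiers (`v`) and the key itself (`k`). A small sketch for inspecting such output, assuming one compact JSON document per line; `cluster_sizes` is a hypothetical helper, not part of fuzzycat:

```python
import collections
import json


def cluster_sizes(lines):
    """Return a Counter mapping cluster size to the number of clusters of that size."""
    sizes = collections.Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        doc = json.loads(line)
        sizes[len(doc["v"])] += 1
    return sizes


if __name__ == "__main__":
    # Two made-up cluster documents; the identifiers are placeholders.
    demo = [
        '{"c": "release_key_title", "v": ["id1", "id2"], "k": "some title"}',
        '{"c": "release_key_title", "v": ["id3"], "k": "another title"}',
    ]
    for size, count in sorted(cluster_sizes(demo).items()):
        print(size, count, sep="\t")
```

To run it over a real file, pass an open file handle (or `sys.stdin`) instead of the `demo` list.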
GNU parallel can make this faster.
```
$ cat data/sample.json | parallel -j 8 --pipe --roundrobin python -m fuzzycat.main cluster -t title
```
Interestingly, the parallel variant detects fewer clusters (because the data is
split and clusters are only searched within each batch). TODO(miku): sort out sharding bug.
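One possible workaround is to merge the shard outputs by key in a second pass. The sketch below assumes every shard emits documents with the `c`, `k` and `v` fields shown earlier. `merge_clusters` is an illustrative name, and this is not how fuzzycat handles the issue:

```python
import collections
import json


def merge_clusters(lines):
    """Merge cluster documents that share the same key type and key."""
    merged = collections.defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        doc = json.loads(line)
        # Identifiers from different shards end up in one set per (type, key).
        merged[(doc["c"], doc["k"])].update(doc["v"])
    for (ctype, key), ids in merged.items():
        yield {"c": ctype, "k": key, "v": sorted(ids)}


if __name__ == "__main__":
    # Made-up output of two shards that each saw part of the same cluster.
    shard_output = [
        '{"c": "release_key_title", "k": "some title", "v": ["id1", "id2"]}',
        '{"c": "release_key_title", "k": "some title", "v": ["id3"]}',
    ]
    for doc in merge_clusters(shard_output):
        print(json.dumps(doc))
```

This keeps all keys in memory, so for large inputs sorting by key first would be the safer route. It also only recombines fragments the shards actually emitted; anything a shard dropped would still be missing.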
## Cluster
```shell
usage: fuzzycat command [options] cluster [-h] [--prefix PREFIX]
                                          [--tmpdir TMPDIR] [-P] [-f FILES]
                                          [-t TYPE]
                                          {cluster,verify} ...

positional arguments:
  {cluster,verify}
    cluster             group entities
    verify              verify groups

optional arguments:
  -h, --help            show this help message and exit
  --prefix PREFIX       temp file prefix
  --tmpdir TMPDIR       temporary directory
  -P, --profile         profile program
  -f FILES, --files FILES
                        output files
  -t TYPE, --type TYPE  cluster algorithm: title, tnorm, tnysi
```
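The `-t` option selects how the grouping key is derived from a release. The general idea of grouping by a derived key can be sketched as follows. The normalization used here (lowercasing and dropping non-word characters) is an assumption for illustration and may differ from what `tnorm` actually does; `title_key` and `cluster` are hypothetical names, and the input is assumed to be release dicts with `ident` and `title` fields:

```python
import collections
import re


def title_key(title):
    """Hypothetical key: lowercase the title and drop all non-word characters."""
    return re.sub(r"\W+", "", title.lower())


def cluster(releases):
    """Group release identifiers by their derived title key."""
    groups = collections.defaultdict(list)
    for release in releases:
        key = title_key(release.get("title") or "")
        if not key:
            continue  # skip releases without a usable title
        groups[key].append(release["ident"])
    return dict(groups)


if __name__ == "__main__":
    # Made-up releases; real identifiers look like the ones in the Usage section.
    releases = [
        {"ident": "id1", "title": "Fuzzy Matching, Revisited"},
        {"ident": "id2", "title": "fuzzy matching revisited"},
        {"ident": "id3", "title": "Something else"},
    ]
    print(cluster(releases))
```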