notes/clustering.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102

# Clustering

Original dataset:

```
$ sha1sum release_export_expanded.json.zst
fa7ce335e27bbf6ccee227992ecd9b860e8e36af  release_export_expanded.json.zst

$ zstdcat -T0 release_export_expanded.json.zst | wc -l
```

Various clusters (title, title normalized, title nysiis (New York State
Identification and Intelligence System, ...):

```
$ zstdcat -T0 release_export_expanded.json.zst | fuzzycat-cluster -t title > cluster_title.json
```

Parallel (TODO: use `--pipepart`):

```
$ zstdcat -T0 release_export_expanded.json.zst | \
    parallel --tmpdir /bigger/tmp --roundrobin --pipe -j 16 \
    fuzzycat-cluster --tmpdir /bigger/tmp -t title > cluster_title.json
```

Numbers of clusters:

```
  141022216 cluster_title.json
  134709771 cluster_title_normalized.json
  119829458 cluster_title_nysiis.json
```

The number of duplicate record goes up as number of clusters go down:

```
   2858088 cluster_title_dups.json
   5818143 cluster_title_normalized_dups.json
   6274940 cluster_title_nysiis_dups.json
```

# Cluster numbers

Using normalized title as example:

* 4306860 have cluster size 2, 1511283 have cluster size 3 or larger

```
             size         len
count 5818143.000 5818143.000
mean        4.350      52.120
std       196.347      35.026
min         2.000       0.000
25%         2.000      24.000
50%         2.000      46.000
75%         3.000      72.000
max    151383.000   11686.000
```

Around 448170 clusters with size 5 or more (with some example titles):

```
Medical Notes
日本鉄鋼協会第97回講演大会講演概要
Boutades
Allergic Contact Dermatitis
Comité international
Incontinence
Efficient Uncertainty Minimization for Fuzzy Spectral Clustering
Early Intervention
CURRENT READINGS IN NUCLEAR MEDICINE
Nannocystis exedens
```

Grouping. API, hide.

* gnu parallel; top, htop; how much; "chunks"; read one line; "pipeart";
  batching; "read from a file"; scan a file; "chunking"

# TODO

* [ ] do a SS like clustering, using title and author ngrams
* [ ] cluster by doi without "vX" suffix

# Verification

* we only need to look at identified duplicates, which will be a few millions
* we want fast access to all release JSON blob via ident, maybe do a
  "fuzzycat-cache" that copies relevant files into the fs, e.g.
"~/.cache/fuzzycat/releases/d9/e4d4be49faafc750563351a126e7bafe29.json or via microblob (but http we do not need), or sqlite3 (https://www.sqlite.org/fasterthanfs.html)

For verification we need to have the cached json blobs in some fast,
thread-safe store. Estimated: 1K/s accesses, we still would need a few hours
for a run.

* [ ] find all ids we need, generate cache, maybe reduce number of fields
* [ ] run verification on each cluster; generate a file of same format of
  "verified" clusters; take note the clustering and verification method

Overall, we can combine various clustering and verification methods. We can
also put together a list of maybe 100-200 test cases and evaluate methods.