# Known issues
Neither the clustering nor the verification stage is perfect. Some known
problem cases are documented here.
# General observations
## One article included in different publications
A publisher (here DOI prefix 10.1210, The Endocrine Society) may include the same
document in different publications:
* https://fatcat.wiki/release/52lwj4ip3nbdbgrgk4uwolbjt4
* https://fatcat.wiki/release/6tbrmc3pq5axzf3yhqayq256a4
* https://fatcat.wiki/release/457lzlw7czeo7aspcyttccvyrq
## Book or Dataset
Sometimes, a lexicon entry is a "dataset", sometimes a "book", e.g. "Unold, Max"
* https://fatcat.wiki/release/7ah6efvk2ncjzgywch2cmtfumq
* https://fatcat.wiki/release/nj7v4e3cxbfybozjmdiuwqo4sm
## Variation in authors
* https://fatcat.wiki/release/2kpa6ynwjzhtbbokqyxcl25gmm
* https://fatcat.wiki/release/o4dh7w7nqvdknm4j336yrom4wy
## Article and Erratum
* https://fatcat.wiki/release/s5a6e6wnlvdelge256xpha6oqu
* https://fatcat.wiki/release/zoeto2mymzhi3l74fr2ps5qjyy
We think "EXACT", but are an erratum and an article an exact match? They should
belong to the same cluster, so that is probably ok.
# Ideas for fixes
* [x] when title and authors match, check the year, and maybe the DOI prefix;
DOIs with the same prefix may not be duplicates
* [x] detect arxiv versions directly
* [ ] if there are multiple authors, we may need to require more than one
overlapping name, e.g. "by Yuting Yao, Yuting Yao, Yuting Yao, Imperial College
London, Imperial College London" will overlap with any other author list
that includes "Imperial College London" -- we label `OK.SLUG_TITLE_AUTHOR_MATCH`,
https://fatcat.wiki/release/6qbne2adybegdf6plgb7dnly2a,
https://fatcat.wiki/release/v6cjc6kxzncztebmfgzxwov7ym
* [ ] "article-journal" and "article" `release_type` should be treated the same, https://fatcat.wiki/release/k5zdpb45ufcy7grrppqndtxxji, https://fatcat.wiki/release/ypyse6ff4nbzrfd44resyav25m
* [x] if title and publisher match, but DOI and year are different, assume
different, e.g. https://fatcat.wiki/release/k3hutukomngptcuwdys5omv2ty,
https://fatcat.wiki/release/xmkiqj4bizcwdaq5hljpglkzqe, or
https://fatcat.wiki/release/phuhxsj425fshp2jxfwlp5xnge and
https://fatcat.wiki/release/2ncazub5tngkjn5ncdk65jyr4u -- these might be repeatedly published
* [ ] article and "reply", https://pubmed.ncbi.nlm.nih.gov/5024865/, https://onlinelibrary.wiley.com/doi/abs/10.5694/j.1326-5377.1972.tb47249.x
* [ ] figshare uses versions, too, https://fatcat.wiki/release/zmivcpjvhba25ldkx27d24oefa, https://fatcat.wiki/release/mjapiqe2nzcy3fs3hriw253dye
* [ ] zenodo has no explicit versions, but ids might be close by, e.g.
https://fatcat.wiki/release/mbnr3nrdijerto6wfjnlsmfhga,
https://fatcat.wiki/release/mbnr3nrdijerto6wfjnlsmfhga
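
The "more than one overlap" idea from the list above could look roughly like the following. This is a minimal sketch, not fuzzycat's actual code: the function names, the `min_overlap` parameter and the second author list are hypothetical, and names are only normalized by lowercasing and collapsing whitespace.

```python
def normalize_name(name):
    """Lowercase and collapse whitespace; a deliberately simple normalization."""
    return " ".join(name.lower().split())


def enough_author_overlap(a_names, b_names, min_overlap=2):
    """
    Return True if the two author lists share enough *distinct* names.

    Repeated names (e.g. an affiliation listed several times) count once.
    If one side has fewer distinct names than `min_overlap`, all of its
    names must be shared instead.
    """
    a = {normalize_name(n) for n in a_names}
    b = {normalize_name(n) for n in b_names}
    if not a or not b:
        return False
    required = min(min_overlap, len(a), len(b))
    return len(a & b) >= required


# Author list taken from the example above; the second list is made up.
noisy = [
    "Yuting Yao", "Yuting Yao", "Yuting Yao",
    "Imperial College London", "Imperial College London",
]
other = ["Jane Doe", "Imperial College London"]

print(enough_author_overlap(noisy, other))  # only the affiliation is shared
```

With a single shared affiliation string the pair no longer counts as an author match, while single-author releases still match on their one name.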
# Clustering
# Verification
## A new approach to fault-tolerant wormhole routing for mesh-connected parallel computers
* https://fatcat.wiki/release/izaz6gjnfzhgnaetizf4bt2r24
* https://fatcat.wiki/release/vwfepcqcdzfwjnsoym7o5o75yu
## Book-Chapter yields VERSIONED DOI
```
$ python -m fuzzycat verify_single | jq .
{
"extra": {
"q": "https://fatcat.wiki/release/search?q=Beardmore"
},
"a": "https://fatcat.wiki/release/zrkabzp4vjbwfdixvjkohgeh3a",
"b": "https://fatcat.wiki/release/ojcucauvkvhg5cazfhzplcot7q",
"r": [
"strong",
"versioned_doi"
]
}
```
* https://fatcat.wiki/release/zrkabzp4vjbwfdixvjkohgeh3a (book)
* https://fatcat.wiki/release/ojcucauvkvhg5cazfhzplcot7q (chapter)
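
One way to keep book and chapter DOIs from being read as versions would be to accept only an explicit trailing version suffix. The sketch below assumes versions are written as `.vN` at the end of the DOI (as figshare does); it is not how fuzzycat currently decides `versioned_doi`, and the DOIs in the example are made up.

```python
import re

# Assumption: a version is an explicit ".v<digits>" suffix, nothing else.
VERSION_SUFFIX = re.compile(r"\.v(\d+)$", re.IGNORECASE)


def split_version(doi):
    """Split a DOI into (base, version); version is None if there is no suffix."""
    doi = doi.strip().lower()
    match = VERSION_SUFFIX.search(doi)
    if match is None:
        return doi, None
    return doi[:match.start()], int(match.group(1))


def is_versioned_doi_pair(a, b):
    """True if both DOIs share a base and differ only in the version suffix."""
    base_a, version_a = split_version(a)
    base_b, version_b = split_version(b)
    if base_a != base_b:
        return False
    # Identical DOIs are an exact match, not a version pair.
    return version_a != version_b


# Hypothetical DOIs for illustration only.
print(is_versioned_doi_pair("10.6084/m9.figshare.1234567.v1",
                            "10.6084/m9.figshare.1234567.v2"))
print(is_versioned_doi_pair("10.1000/example-book",
                            "10.1000/example-book.5"))
```

Under this rule a chapter DOI that merely extends the book DOI with a numeric part would not be treated as a version of the book.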
## "tokenized authors" is flaky
```
$ python -m fuzzycat verify_single | jq .
{
"extra": {
"q": "https://fatcat.wiki/release/search?q=cleaves"
},
"a": "https://fatcat.wiki/release/mi6y2jtl55egxi5qfhovswxcba",
"b": "https://fatcat.wiki/release/7hjisijl7nczhbghdd6l56n6py",
"r": [
"strong",
"tokenized_authors"
]
}
```
## "jaccard authors" can be too weak
```
$ python -m fuzzycat verify_single | jq .
{
"extra": {
"q": "https://fatcat.wiki/release/search?q=canes"
},
"a": "https://fatcat.wiki/release/ivhoiqvjt5cpxbdzzbuco7eciq",
"b": "https://fatcat.wiki/release/hprvn76ls5cbbkl2ypsyijojmu",
"r": [
"strong",
"jaccard_authors"
]
}
```
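
To make the weakness concrete, here is a small sketch of a Jaccard comparison over author name tokens. The tokenization (lowercase, drop punctuation and single-letter initials) and the author names are assumptions for illustration; fuzzycat's own tokenization and threshold may differ.

```python
def author_tokens(names):
    """Turn a list of author names into a set of lowercase tokens, dropping initials."""
    tokens = set()
    for name in names:
        cleaned = name.lower().replace(".", " ").replace(",", " ")
        tokens.update(tok for tok in cleaned.split() if len(tok) > 1)
    return tokens


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; defined as 0.0 if both sets are empty."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical author lists: two shared surnames, one extra author on one side.
a = ["J. Smith", "A. Lee"]
b = ["John Smith", "Anna Lee", "Bob Stone"]

score = jaccard(author_tokens(a), author_tokens(b))
print(round(score, 2))  # 2 shared tokens out of 6 distinct tokens
```

Common surnames alone can push the score above a low threshold, so a stricter threshold or an additional signal (year, container, DOI prefix) may be needed before calling such a pair "strong".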