# Known issues

Neither the clustering nor the verification stage is perfect. This document
records some known problem cases.

# General observations

## One article included in different publications

A publisher (e.g. DOI prefix 10.1210, The Endocrine Society) may choose to
include the same document in different publications:

* https://fatcat.wiki/release/52lwj4ip3nbdbgrgk4uwolbjt4
* https://fatcat.wiki/release/6tbrmc3pq5axzf3yhqayq256a4
* https://fatcat.wiki/release/457lzlw7czeo7aspcyttccvyrq

## Book or Dataset

The same lexicon entry is sometimes typed as a "dataset" and sometimes as a
"book", e.g. "Unold, Max":

* https://fatcat.wiki/release/7ah6efvk2ncjzgywch2cmtfumq
* https://fatcat.wiki/release/nj7v4e3cxbfybozjmdiuwqo4sm

## Variation in authors

* https://fatcat.wiki/release/2kpa6ynwjzhtbbokqyxcl25gmm
* https://fatcat.wiki/release/o4dh7w7nqvdknm4j336yrom4wy

## Article and Erratum

* https://fatcat.wiki/release/s5a6e6wnlvdelge256xpha6oqu
* https://fatcat.wiki/release/zoeto2mymzhi3l74fr2ps5qjyy

We currently say "EXACT", but are an erratum and its article really an exact
match? They should belong to the same cluster, so this is probably ok.

# Ideas for fixes

* [x] when title and authors match, check the year, and maybe the DOI prefix;
  DOIs with the same prefix may not be duplicates
* [x] detect arxiv versions directly
* [ ] if there are multiple authors, we may require more than one overlap, e.g.
  "by Yuting Yao, Yuting Yao, Yuting Yao, Imperial College London, Imperial
  College London" will overlap with any other author list including "Imperial
  College London"; we label this `OK.SLUG_TITLE_AUTHOR_MATCH`,
  https://fatcat.wiki/release/6qbne2adybegdf6plgb7dnly2a,
  https://fatcat.wiki/release/v6cjc6kxzncztebmfgzxwov7ym
* [ ] "article-journal" and "article" `release_type` should be treated the same, https://fatcat.wiki/release/k5zdpb45ufcy7grrppqndtxxji, https://fatcat.wiki/release/ypyse6ff4nbzrfd44resyav25m
* [x] if title and publisher match, but DOI and year are different, assume
  different, e.g. https://fatcat.wiki/release/k3hutukomngptcuwdys5omv2ty,
  https://fatcat.wiki/release/xmkiqj4bizcwdaq5hljpglkzqe, or
  https://fatcat.wiki/release/phuhxsj425fshp2jxfwlp5xnge and
  https://fatcat.wiki/release/2ncazub5tngkjn5ncdk65jyr4u -- these might be
  published repeatedly
* [ ] article and "reply", https://pubmed.ncbi.nlm.nih.gov/5024865/, https://onlinelibrary.wiley.com/doi/abs/10.5694/j.1326-5377.1972.tb47249.x
* [ ] figshare uses versions, too, https://fatcat.wiki/release/zmivcpjvhba25ldkx27d24oefa, https://fatcat.wiki/release/mjapiqe2nzcy3fs3hriw253dye
* [ ] zenodo has no explicit versions, but ids might be close by, e.g.
  https://fatcat.wiki/release/mbnr3nrdijerto6wfjnlsmfhga,
  https://fatcat.wiki/release/mbnr3nrdijerto6wfjnlsmfhga
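
The open "more than one overlap" idea from the list above could look roughly
like the following. This is only a sketch, not what fuzzycat currently does:
the function names, the `min_overlap` parameter and the lowercase/whitespace
normalization are assumptions, and "Jane Doe" is a made-up second author list.

```python
def normalize_name(name):
    """Lowercase and collapse whitespace so trivial variations compare equal."""
    return " ".join(name.lower().split())


def author_overlap_ok(a_authors, b_authors, min_overlap=2):
    """Return True if enough *distinct* author names are shared.

    Hypothetical rule: require `min_overlap` shared names, but never more
    than the number of distinct names on the shorter side, so single-author
    records can still match.
    """
    a = {normalize_name(n) for n in a_authors if n.strip()}
    b = {normalize_name(n) for n in b_authors if n.strip()}
    if not a or not b:
        return False
    required = min(min_overlap, len(a), len(b))
    return len(a & b) >= required


# The problematic author list from the notes; repeated entries collapse
# into two distinct names.
a = ["Yuting Yao", "Yuting Yao", "Yuting Yao",
     "Imperial College London", "Imperial College London"]
# Hypothetical other record sharing only the affiliation string.
b = ["Jane Doe", "Imperial College London"]

print(author_overlap_ok(a, b))
```

With these inputs only the affiliation string is shared, so the pair no longer
passes on author overlap alone.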

# Clustering

# Verification

## A new approach to fault-tolerant wormhole routing for mesh-connected parallel computers

* https://fatcat.wiki/release/izaz6gjnfzhgnaetizf4bt2r24
* https://fatcat.wiki/release/vwfepcqcdzfwjnsoym7o5o75yu

## Book-Chapter yields VERSIONED DOI

```
$ python -m fuzzycat verify_single | jq .
{
  "extra": {
    "q": "https://fatcat.wiki/release/search?q=Beardmore"
  },
  "a": "https://fatcat.wiki/release/zrkabzp4vjbwfdixvjkohgeh3a",
  "b": "https://fatcat.wiki/release/ojcucauvkvhg5cazfhzplcot7q",
  "r": [
    "strong",
    "versioned_doi"
  ]
}
```

* https://fatcat.wiki/release/zrkabzp4vjbwfdixvjkohgeh3a (book)
* https://fatcat.wiki/release/ojcucauvkvhg5cazfhzplcot7q (chapter)

## Tokenized authors is flaky

```
$ python -m fuzzycat verify_single | jq .
{
  "extra": {
    "q": "https://fatcat.wiki/release/search?q=cleaves"
  },
  "a": "https://fatcat.wiki/release/mi6y2jtl55egxi5qfhovswxcba",
  "b": "https://fatcat.wiki/release/7hjisijl7nczhbghdd6l56n6py",
  "r": [
    "strong",
    "tokenized_authors"
  ]
}
```

## "jaccard authors" can be too weak

```
$ python -m fuzzycat verify_single | jq .
{
  "extra": {
    "q": "https://fatcat.wiki/release/search?q=canes"
  },
  "a": "https://fatcat.wiki/release/ivhoiqvjt5cpxbdzzbuco7eciq",
  "b": "https://fatcat.wiki/release/hprvn76ls5cbbkl2ypsyijojmu",
  "r": [
    "strong",
    "jaccard_authors"
  ]
}
```
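
To illustrate why a Jaccard-style author comparison can be weak, here is a
small sketch. It is not fuzzycat's implementation; the helper names, the
whitespace tokenization and the example names are assumptions made for
illustration only.

```python
def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two collections; 0.0 if both are empty."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def author_tokens(authors):
    """Split author names into lowercase word tokens (assumed tokenization)."""
    return {tok for name in authors for tok in name.lower().split()}


# Hypothetical author lists: no person appears in both, but common
# first names and surnames are shared.
x = ["John Smith", "Anna Smith"]
y = ["Peter Smith", "John Miller"]

# Comparing whole names finds no overlap at all ...
print(jaccard(x, y))
# ... while comparing name tokens yields a non-trivial score.
print(jaccard(author_tokens(x), author_tokens(y)))
```

A threshold on the token-level score can therefore accept pairs whose author
lists do not actually share a single person.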