Diffstat (limited to 'notes')
-rw-r--r-- | notes/2020_11_testruns.md | 20
-rw-r--r-- | notes/known_issues.md | 46
2 files changed, 62 insertions, 4 deletions
diff --git a/notes/2020_11_testruns.md b/notes/2020_11_testruns.md
index ec186ac..9e700c9 100644
--- a/notes/2020_11_testruns.md
+++ b/notes/2020_11_testruns.md
@@ -1,6 +1,18 @@
-# Test runs
+# Test run notes
 
-## Using --min-cluster-size
+## 12/2020
+
+
+
+### More versions
+
+* https://fatcat.wiki/release/3n4enptukfg5hpsomskg7ebh2e
+* https://fatcat.wiki/release/b34keiknkvf5ril7fcajuzzt4a
+
+
+## 11/2020
+
+### Using --min-cluster-size
 
 Skipping writes of single element clusters cuts clustering from ~42h to ~22h.
 
@@ -98,7 +110,7 @@ Preliminary case distribution:
 4 OK.ARXIV_VERSION
 ```
 
-## Case Mining
+### Case Mining
 
 > "-" ignore, "x" done
 
@@ -257,7 +269,7 @@ Different reviews.
 
 More patterns:
 
-### Chapter vs Book
+Chapter vs Book
 
 * https://fatcat.wiki/release/ameuzneqizg3ff7ep4bmg4io6m
 * https://fatcat.wiki/release/s2thvzarsfbodd52w46zy2xple
diff --git a/notes/known_issues.md b/notes/known_issues.md
index 46403a0..e80acdc 100644
--- a/notes/known_issues.md
+++ b/notes/known_issues.md
@@ -3,6 +3,52 @@
 Both the clustering and verification stage are not perfect. Here, some known
 cases are documented.
 
+# General observations
+
+## One article included in different publications
+
+A DOI prefix (10.1210, The Endocrine Society) may choose to include the same
+document in different publications:
+
+* https://fatcat.wiki/release/52lwj4ip3nbdbgrgk4uwolbjt4
+* https://fatcat.wiki/release/6tbrmc3pq5axzf3yhqayq256a4
+* https://fatcat.wiki/release/457lzlw7czeo7aspcyttccvyrq
+
+## Book or Dataset
+
+Sometimes, a lexicon entry is a "dataset", sometimes a "book", e.g. "Unold, Max"
+
+* https://fatcat.wiki/release/7ah6efvk2ncjzgywch2cmtfumq
+* https://fatcat.wiki/release/nj7v4e3cxbfybozjmdiuwqo4sm
+
+## Variation in authors
+
+* https://fatcat.wiki/release/2kpa6ynwjzhtbbokqyxcl25gmm
+* https://fatcat.wiki/release/o4dh7w7nqvdknm4j336yrom4wy
+
+# Ideas for fixes
+
+* [x] when title and authors match, check the year, and maybe the doi prefix;
+ doi with the same prefix may not be duplicates
+* [x] detect arxiv versions directly
+* [ ] if multiple authors, may require more than one overlap, e.g. "by Yuting
+ Yao, Yuting Yao, Yuting Yao, Imperial College London, Imperial College
+London" - will overlap with any other author including "Imperial College
+London" -- we label `OK.SLUG_TITLE_AUTHOR_MATCH`,
+https://fatcat.wiki/release/6qbne2adybegdf6plgb7dnly2a,
+https://fatcat.wiki/release/v6cjc6kxzncztebmfgzxwov7ym
+* [ ] "article-journal" and "article" `release_type` should be treated the same, https://fatcat.wiki/release/k5zdpb45ufcy7grrppqndtxxji, https://fatcat.wiki/release/ypyse6ff4nbzrfd44resyav25m
+* [x] if title and publisher matches, but DOI and year is different, assume
+different, e.g. https://fatcat.wiki/release/k3hutukomngptcuwdys5omv2ty,
+https://fatcat.wiki/release/xmkiqj4bizcwdaq5hljpglkzqe, or
+https://fatcat.wiki/release/phuhxsj425fshp2jxfwlp5xnge and
+https://fatcat.wiki/release/2ncazub5tngkjn5ncdk65jyr4u -- these might be repeatedly published
+* [ ] article and "reply", https://pubmed.ncbi.nlm.nih.gov/5024865/, https://onlinelibrary.wiley.com/doi/abs/10.5694/j.1326-5377.1972.tb47249.x
+* [ ] figshare uses versions, too, https://fatcat.wiki/release/zmivcpjvhba25ldkx27d24oefa, https://fatcat.wiki/release/mjapiqe2nzcy3fs3hriw253dye
+* [ ] zenodo has no explicit versions, but ids might be closeby, e.g.
+ https://fatcat.wiki/release/mbnr3nrdijerto6wfjnlsmfhga,
+https://fatcat.wiki/release/mbnr3nrdijerto6wfjnlsmfhga
+
 # Clustering
 
 # Verification
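
Note on the open "if multiple authors, may require more than one overlap" item in the added `known_issues.md` text: a minimal sketch of how such a rule might look. It is not taken from the repository. The function names `normalize_author` and `authors_match`, the `min_overlap` parameter, and the second author list in the usage example are all hypothetical. The idea is to compare sets of distinct names, so that a repeated affiliation such as "Imperial College London" counts only once, and to require two shared names whenever both records have at least two distinct authors.

```python
from typing import Iterable


def normalize_author(name: str) -> str:
    """Lowercase and collapse whitespace so trivial variations compare equal."""
    return " ".join(name.lower().split())


def authors_match(a: Iterable[str], b: Iterable[str], min_overlap: int = 2) -> bool:
    """Hypothetical rule: require `min_overlap` distinct shared names,
    relaxed only when one side has fewer distinct names than that."""
    set_a = {normalize_author(n) for n in a}
    set_b = {normalize_author(n) for n in b}
    if not set_a or not set_b:
        return False
    # Never demand more overlap than the smaller author set can provide.
    required = min(min_overlap, len(set_a), len(set_b))
    return len(set_a & set_b) >= required


# Author list quoted in the notes, against an invented second record
# that shares only the affiliation string.
noisy = [
    "Yuting Yao", "Yuting Yao", "Yuting Yao",
    "Imperial College London", "Imperial College London",
]
other = ["Jane Doe", "Imperial College London"]  # hypothetical record
print(authors_match(noisy, other))
```

With these inputs the shared affiliation alone is not enough, so the pair would no longer be labelled a match on authors. Single-author records still match on one shared name, because the required overlap is capped by the smaller set.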