From aa1505172f85ecc434fd5d5b1aa7fc4521074e38 Mon Sep 17 00:00:00 2001
From: Martin Czygan
Date: Thu, 17 Dec 2020 00:11:33 +0100
Subject: wip: notes

---
 README.md                 | 91 ++---------------------------------------------
 notes/2020_11_testruns.md | 20 ++++++++---
 notes/known_issues.md     | 46 ++++++++++++++++++++++++
 3 files changed, 64 insertions(+), 93 deletions(-)

diff --git a/README.md b/README.md
index fbe144e..d095994 100644
--- a/README.md
+++ b/README.md
@@ -2,6 +2,8 @@

 Fuzzy matching publications for [fatcat](https://fatcat.wiki).

+![https://pypi.org/project/fuzzycat/](https://img.shields.io/pypi/v/fuzzycat?style=flat-square)
+
 # Example Run

 Run any clustering algorithm.
@@ -182,92 +184,3 @@ Notes on cadd28a version clustering (nysiis) and verification.
 4 OK.ARXIV_VERSION
 ```

-
-#### Cases
-
-* common title, "Books by Our Readers", https://fatcat.wiki/release/4uv5jsy5vnhdvnxvzmucqlksvq, https://fatcat.wiki/release/4uv5jsy5vnhdvnxvzmucqlksvq
-* common title, "The Future of Imprisonment"
-* common title, "In This Issue/Research Watch/News-in-Brief/News from the IASLC Tobacco Control Committee"
-* common title, "IEEE Transactions on Wireless Communications", same publisher, different year
-* common title, "ASMS News" (also different year)
-* common title, "AMERICAN INSTITUTE OF INSTRUCTION"
-* common title, "Contents lists"
-* common title, "Submissions"
-* same, except DOI, but maybe the same item, after all? https://fatcat.wiki/release/kxgsbh66v5bwhobcaiuh4i7dwy, https://fatcat.wiki/release/thl7o44z3jgk3njdypixwrdbve
-
-Authors may be messy:
-
-* IR and published, be we currently yield `Miss.CONTRIB_INTERSECTION_EMPTY`
-  - https://fatcat.wiki/release/2kpa6ynwjzhtbbokqyxcl25gmm,
-https://fatcat.wiki/release/o4dh7w7nqvdknm4j336yrom4wy - may need to tokenize authors
-
-A DOI prefix (10.1210, The Endocrine Society) may choose to include the same
-document in different publications:
-
-* https://fatcat.wiki/release/52lwj4ip3nbdbgrgk4uwolbjt4
-* https://fatcat.wiki/release/6tbrmc3pq5axzf3yhqayq256a4
-* https://fatcat.wiki/release/457lzlw7czeo7aspcyttccvyrq
-
-Sometimes, a lexicon entry is a "dataset", sometimes a "book", e.g.:
-
-* https://fatcat.wiki/release/7ah6efvk2ncjzgywch2cmtfumq
-* https://fatcat.wiki/release/nj7v4e3cxbfybozjmdiuwqo4sm
-
-#### Possible fixes
-
-* [ ] when title and authors match, check the year, and maybe the doi prefix; doi with the same prefix may not be duplicates
-* [x] detect arxiv versions directly
-* [ ] if multiple authors, may require more than one overlap, e.g. "by Yuting
-  Yao, Yuting Yao, Yuting Yao, Imperial College London, Imperial College
-London" - will overlap with any other author including "Imperial College
-London" -- we label `OK.SLUG_TITLE_AUTHOR_MATCH`,
-https://fatcat.wiki/release/6qbne2adybegdf6plgb7dnly2a,
-https://fatcat.wiki/release/v6cjc6kxzncztebmfgzxwov7ym
-* [ ] "article-journal" and "article" `release_type` should be treated the same, https://fatcat.wiki/release/k5zdpb45ufcy7grrppqndtxxji, https://fatcat.wiki/release/ypyse6ff4nbzrfd44resyav25m
-* [ ] if title and publisher matches, but DOI and year is different, assume
-different, e.g. https://fatcat.wiki/release/k3hutukomngptcuwdys5omv2ty,
-https://fatcat.wiki/release/xmkiqj4bizcwdaq5hljpglkzqe, or
-https://fatcat.wiki/release/phuhxsj425fshp2jxfwlp5xnge and
-https://fatcat.wiki/release/2ncazub5tngkjn5ncdk65jyr4u -- these might be repeatedly published
-* [ ] article and "reply", https://pubmed.ncbi.nlm.nih.gov/5024865/, https://onlinelibrary.wiley.com/doi/abs/10.5694/j.1326-5377.1972.tb47249.x
-* [ ] figshare uses versions, too, https://fatcat.wiki/release/zmivcpjvhba25ldkx27d24oefa, https://fatcat.wiki/release/mjapiqe2nzcy3fs3hriw253dye
-* [ ] zenodo has no explicit versions, but ids might be closeby, e.g. https://fatcat.wiki/release/mbnr3nrdijerto6wfjnlsmfhga, https://fatcat.wiki/release/mbnr3nrdijerto6wfjnlsmfhga
-
-#### 100 examples
-
-* accuracy at around 0.8
-* while the results look ok, the reasons are not always the ones that stand out
-  the most (while checking manually)
-
-```
-78 [x]
-11 [o]
-11 [ ]
-```
-
-Ok cases are now in [verify.csv](https://github.com/miku/fuzzycat/blob/master/tests/data/verify.csv).
-
-* [ ] https://fatcat.wiki/release/i2ziaqjrovh3rfrojcaf2xqidy https://fatcat.wiki/release/4rbsv4kplnf4tny22px5z35vty Status.DIFFERENT Miss.CONTRIB_INTERSECTION_EMPTY
-* [o] https://fatcat.wiki/release/65qk35lrxfbqxnpjfpra3ankxe https://fatcat.wiki/release/tovzgangzbfm5bc2qriyh2k6da Status.AMBIGUOUS OK.DUMMY
-* [ ] https://fatcat.wiki/release/qvlzvflp6vhojdm3uyvj2d6keq https://fatcat.wiki/release/vynqlyi2xjdexmf54a5yfidx6m Status.DIFFERENT Miss.RELEASE_TYPE
-* [o] https://fatcat.wiki/release/hfewgpty4ne3zn7rg32z5npdxy https://fatcat.wiki/release/3djtma4xrjh2pcxy4gu6pafqji Status.AMBIGUOUS OK.DUMMY
-* [ ] https://fatcat.wiki/release/ybxygpeypbaq5pfrztu3z2itw4 https://fatcat.wiki/release/2c2ztrtlkzdhfmzpf7fbindpjq Status.DIFFERENT Miss.DATASET_DOI
-* [o] https://fatcat.wiki/release/eyol2bjf6jawhjnote73ej5v24 https://fatcat.wiki/release/jowohxiuuncqbdidvqjrrb5324 Status.AMBIGUOUS OK.DUMMY
-* [ ] https://fatcat.wiki/release/d5bqydkylzelpmdfcks2v5th7q https://fatcat.wiki/release/lzcgl52npjaf3etfhhnb3d46da Status.DIFFERENT Miss.DATASET_DOI
-* [o] https://fatcat.wiki/release/5ysvoxjj4jcxbji42nnzapr6n4 https://fatcat.wiki/release/dx6wevs345cjfejokze2te6sia Status.AMBIGUOUS OK.DUMMY
-* [o] https://fatcat.wiki/release/xdclbyjgjnbehchrl7l2vi3274 https://fatcat.wiki/release/t3kqh6lfprfaff5zovh6qlodxy Status.AMBIGUOUS OK.DUMMY
-* [o] https://fatcat.wiki/release/aogvyiw67vdsnf26bufauy2rqa https://fatcat.wiki/release/aofedljjhbhajmx5doxfcv43fa Status.AMBIGUOUS OK.DUMMY
-* [o] https://fatcat.wiki/release/cjal2f6k5zesxcnrnyhc6ftg5e https://fatcat.wiki/release/oi5kzjlku5gpxjc247v6zjzosa Status.AMBIGUOUS OK.DUMMY
-* [o] https://fatcat.wiki/release/o6e6yf37y5bttbrpo4piska4gq https://fatcat.wiki/release/pchjd5fwqjdqfevphjff7ydeae Status.AMBIGUOUS OK.DUMMY
-* [ ] https://fatcat.wiki/release/l4fyyvsckneuxkq7d3y2zvkvbe https://fatcat.wiki/release/gf5hriyvuvarhcvttnooaffksi Status.DIFFERENT Miss.RELEASE_TYPE
-* [ ] https://fatcat.wiki/release/7nbcgsohrrak5cuyk6dnit6ega https://fatcat.wiki/release/q66xv7drk5fnph7enwwlkyuwqm Status.DIFFERENT Miss.CONTRIB_INTERSECTION_EMPTY
-* [ ] https://fatcat.wiki/release/2tzvdvx4t5hfxnqlnyt4rqenly https://fatcat.wiki/release/houszjo2ejbjhljxvxz23whgua Status.DIFFERENT Miss.DATASET_DOI
-* [ ] https://fatcat.wiki/release/qsxbwvreu5ehrbz65ngh2ghcra https://fatcat.wiki/release/xjvo37ynxvc3zm55bxoa545gvq Status.EXACT OK.TITLE_AUTHOR_MATCH
-* [ ] https://fatcat.wiki/release/ggzzwt6deneyrna5h65mvv7sfe https://fatcat.wiki/release/h4rnaxua75dndmq4x4snnw3qxe Status.AMBIGUOUS Miss.SHORT_TITLE
-* [ ] https://fatcat.wiki/release/skxiyp7qmraqhe2o4zvo7iq6sq https://fatcat.wiki/release/qyqre3mzgbha7hhfarn5absqnq Status.EXACT OK.TITLE_AUTHOR_MATCH
-* [o] https://fatcat.wiki/release/am53f7iyyvcjnjsgjbz7pu7dii https://fatcat.wiki/release/kdubht33hfb4dmghm2g27ck24i Status.AMBIGUOUS OK.DUMMY
-* [ ] https://fatcat.wiki/release/ofmeeajuovbqbhkgh4rujkd3xu https://fatcat.wiki/release/r6bvy6cglfe5xgafvdcokawkue Status.DIFFERENT Miss.RELEASE_TYPE
-* [o] https://fatcat.wiki/release/lezvxt2oong6xm3e3cgp47wsla https://fatcat.wiki/release/aad6r5am6vfxpbfwycmyudp2qe Status.AMBIGUOUS OK.DUMMY
-* [o] https://fatcat.wiki/release/5mzzswgebze2tk4apmbwjahp34 https://fatcat.wiki/release/vl7r3uewvvbo5i2gntocy3y2ey Status.AMBIGUOUS OK.DUMMY
-
-
diff --git a/notes/2020_11_testruns.md b/notes/2020_11_testruns.md
index ec186ac..9e700c9 100644
--- a/notes/2020_11_testruns.md
+++ b/notes/2020_11_testruns.md
@@ -1,6 +1,18 @@
-# Test runs
+# Test run notes

-## Using --min-cluster-size
+## 12/2020
+
+
+
+### More versions
+
+* https://fatcat.wiki/release/3n4enptukfg5hpsomskg7ebh2e
+* https://fatcat.wiki/release/b34keiknkvf5ril7fcajuzzt4a
+
+
+## 11/2020
+
+### Using --min-cluster-size

 Skipping writes of single element clusters cuts clustering from ~42h to ~22h.

@@ -98,7 +110,7 @@ Preliminary case distribution:
 4 OK.ARXIV_VERSION
 ```

-## Case Mining
+### Case Mining

 > "-" ignore, "x" done

@@ -257,7 +269,7 @@ Different reviews.

 More patterns:

-### Chapter vs Book
+Chapter vs Book

 * https://fatcat.wiki/release/ameuzneqizg3ff7ep4bmg4io6m
 * https://fatcat.wiki/release/s2thvzarsfbodd52w46zy2xple
diff --git a/notes/known_issues.md b/notes/known_issues.md
index 46403a0..e80acdc 100644
--- a/notes/known_issues.md
+++ b/notes/known_issues.md
@@ -3,6 +3,52 @@
 Both the clustering and verification stage are not perfect. Here, some known
 cases are documented.

+# General observations
+
+## One article included in different publications
+
+A DOI prefix (10.1210, The Endocrine Society) may choose to include the same
+document in different publications:
+
+* https://fatcat.wiki/release/52lwj4ip3nbdbgrgk4uwolbjt4
+* https://fatcat.wiki/release/6tbrmc3pq5axzf3yhqayq256a4
+* https://fatcat.wiki/release/457lzlw7czeo7aspcyttccvyrq
+
+## Book or Dataset
+
+Sometimes, a lexicon entry is a "dataset", sometimes a "book", e.g. "Unold, Max"
+
+* https://fatcat.wiki/release/7ah6efvk2ncjzgywch2cmtfumq
+* https://fatcat.wiki/release/nj7v4e3cxbfybozjmdiuwqo4sm
+
+## Variation in authors
+
+* https://fatcat.wiki/release/2kpa6ynwjzhtbbokqyxcl25gmm
+* https://fatcat.wiki/release/o4dh7w7nqvdknm4j336yrom4wy
+
+# Ideas for fixes
+
+* [x] when title and authors match, check the year, and maybe the doi prefix;
+  doi with the same prefix may not be duplicates
+* [x] detect arxiv versions directly
+* [ ] if multiple authors, may require more than one overlap, e.g. "by Yuting
+  Yao, Yuting Yao, Yuting Yao, Imperial College London, Imperial College
+London" - will overlap with any other author including "Imperial College
+London" -- we label `OK.SLUG_TITLE_AUTHOR_MATCH`,
+https://fatcat.wiki/release/6qbne2adybegdf6plgb7dnly2a,
+https://fatcat.wiki/release/v6cjc6kxzncztebmfgzxwov7ym
+* [ ] "article-journal" and "article" `release_type` should be treated the same, https://fatcat.wiki/release/k5zdpb45ufcy7grrppqndtxxji, https://fatcat.wiki/release/ypyse6ff4nbzrfd44resyav25m
+* [x] if title and publisher matches, but DOI and year is different, assume
+different, e.g. https://fatcat.wiki/release/k3hutukomngptcuwdys5omv2ty,
+https://fatcat.wiki/release/xmkiqj4bizcwdaq5hljpglkzqe, or
+https://fatcat.wiki/release/phuhxsj425fshp2jxfwlp5xnge and
+https://fatcat.wiki/release/2ncazub5tngkjn5ncdk65jyr4u -- these might be repeatedly published
+* [ ] article and "reply", https://pubmed.ncbi.nlm.nih.gov/5024865/, https://onlinelibrary.wiley.com/doi/abs/10.5694/j.1326-5377.1972.tb47249.x
+* [ ] figshare uses versions, too, https://fatcat.wiki/release/zmivcpjvhba25ldkx27d24oefa, https://fatcat.wiki/release/mjapiqe2nzcy3fs3hriw253dye
+* [ ] zenodo has no explicit versions, but ids might be closeby, e.g.
+  https://fatcat.wiki/release/mbnr3nrdijerto6wfjnlsmfhga,
+https://fatcat.wiki/release/mbnr3nrdijerto6wfjnlsmfhga
+
 # Clustering

 # Verification
--
cgit v1.2.3