author Martin Czygan <martin.czygan@gmail.com> 2020-12-17 00:11:33 +0100
committer Martin Czygan <martin.czygan@gmail.com> 2020-12-17 00:11:33 +0100
commit aa1505172f85ecc434fd5d5b1aa7fc4521074e38
tree 8c44c7a87fb9da313b8c16f1d2b8c1a146a02a8a
parent 34c8934b5d5204241dae38995781c932dd5eacf1
wip: notes
-rw-r--r-- README.md 91
-rw-r--r-- notes/2020_11_testruns.md 20
-rw-r--r-- notes/known_issues.md 46
3 files changed, 64 insertions, 93 deletions
diff --git a/README.md b/README.md
index fbe144e..d095994 100644
--- a/README.md
+++ b/README.md
@@ -2,6 +2,8 @@
Fuzzy matching publications for [fatcat](https://fatcat.wiki).
+![https://pypi.org/project/fuzzycat/](https://img.shields.io/pypi/v/fuzzycat?style=flat-square)
+
# Example Run
Run any clustering algorithm.
@@ -182,92 +184,3 @@ Notes on cadd28a version clustering (nysiis) and verification.
4 OK.ARXIV_VERSION
```
-
-#### Cases
-
-* common title, "Books by Our Readers", https://fatcat.wiki/release/4uv5jsy5vnhdvnxvzmucqlksvq, https://fatcat.wiki/release/4uv5jsy5vnhdvnxvzmucqlksvq
-* common title, "The Future of Imprisonment"
-* common title, "In This Issue/Research Watch/News-in-Brief/News from the IASLC Tobacco Control Committee"
-* common title, "IEEE Transactions on Wireless Communications", same publisher, different year
-* common title, "ASMS News" (also different year)
-* common title, "AMERICAN INSTITUTE OF INSTRUCTION"
-* common title, "Contents lists"
-* common title, "Submissions"
-* same, except DOI, but maybe the same item, after all? https://fatcat.wiki/release/kxgsbh66v5bwhobcaiuh4i7dwy, https://fatcat.wiki/release/thl7o44z3jgk3njdypixwrdbve
-
-Authors may be messy:
-
-* IR and published, be we currently yield `Miss.CONTRIB_INTERSECTION_EMPTY` -
- https://fatcat.wiki/release/2kpa6ynwjzhtbbokqyxcl25gmm,
-https://fatcat.wiki/release/o4dh7w7nqvdknm4j336yrom4wy - may need to tokenize authors
-
-A DOI prefix (10.1210, The Endocrine Society) may choose to include the same
-document in different publications:
-
-* https://fatcat.wiki/release/52lwj4ip3nbdbgrgk4uwolbjt4
-* https://fatcat.wiki/release/6tbrmc3pq5axzf3yhqayq256a4
-* https://fatcat.wiki/release/457lzlw7czeo7aspcyttccvyrq
-
-Sometimes, a lexicon entry is a "dataset", sometimes a "book", e.g.:
-
-* https://fatcat.wiki/release/7ah6efvk2ncjzgywch2cmtfumq
-* https://fatcat.wiki/release/nj7v4e3cxbfybozjmdiuwqo4sm
-
-#### Possible fixes
-
-* [ ] when title and authors match, check the year, and maybe the doi prefix; doi with the same prefix may not be duplicates
-* [x] detect arxiv versions directly
-* [ ] if multiple authors, may require more than one overlap, e.g. "by Yuting
- Yao, Yuting Yao, Yuting Yao, Imperial College London, Imperial College
-London" - will overlap with any other author including "Imperial College
-London" -- we label `OK.SLUG_TITLE_AUTHOR_MATCH`,
-https://fatcat.wiki/release/6qbne2adybegdf6plgb7dnly2a,
-https://fatcat.wiki/release/v6cjc6kxzncztebmfgzxwov7ym
-* [ ] "article-journal" and "article" `release_type` should be treated the same, https://fatcat.wiki/release/k5zdpb45ufcy7grrppqndtxxji, https://fatcat.wiki/release/ypyse6ff4nbzrfd44resyav25m
-* [ ] if title and publisher matches, but DOI and year is different, assume
-different, e.g. https://fatcat.wiki/release/k3hutukomngptcuwdys5omv2ty,
-https://fatcat.wiki/release/xmkiqj4bizcwdaq5hljpglkzqe, or
-https://fatcat.wiki/release/phuhxsj425fshp2jxfwlp5xnge and
-https://fatcat.wiki/release/2ncazub5tngkjn5ncdk65jyr4u -- these might be repeatedly published
-* [ ] article and "reply", https://pubmed.ncbi.nlm.nih.gov/5024865/, https://onlinelibrary.wiley.com/doi/abs/10.5694/j.1326-5377.1972.tb47249.x
-* [ ] figshare uses versions, too, https://fatcat.wiki/release/zmivcpjvhba25ldkx27d24oefa, https://fatcat.wiki/release/mjapiqe2nzcy3fs3hriw253dye
-* [ ] zenodo has no explicit versions, but ids might be closeby, e.g. https://fatcat.wiki/release/mbnr3nrdijerto6wfjnlsmfhga, https://fatcat.wiki/release/mbnr3nrdijerto6wfjnlsmfhga
-
-#### 100 examples
-
-* accuracy at around 0.8
-* while the results look ok, the reasons are not always the ones that stand out
- the most (while checking manually)
-
-```
-78 [x]
-11 [o]
-11 [ ]
-```
-
-Ok cases are now in [verify.csv](https://github.com/miku/fuzzycat/blob/master/tests/data/verify.csv).
-
-* [ ] https://fatcat.wiki/release/i2ziaqjrovh3rfrojcaf2xqidy https://fatcat.wiki/release/4rbsv4kplnf4tny22px5z35vty Status.DIFFERENT Miss.CONTRIB_INTERSECTION_EMPTY
-* [o] https://fatcat.wiki/release/65qk35lrxfbqxnpjfpra3ankxe https://fatcat.wiki/release/tovzgangzbfm5bc2qriyh2k6da Status.AMBIGUOUS OK.DUMMY
-* [ ] https://fatcat.wiki/release/qvlzvflp6vhojdm3uyvj2d6keq https://fatcat.wiki/release/vynqlyi2xjdexmf54a5yfidx6m Status.DIFFERENT Miss.RELEASE_TYPE
-* [o] https://fatcat.wiki/release/hfewgpty4ne3zn7rg32z5npdxy https://fatcat.wiki/release/3djtma4xrjh2pcxy4gu6pafqji Status.AMBIGUOUS OK.DUMMY
-* [ ] https://fatcat.wiki/release/ybxygpeypbaq5pfrztu3z2itw4 https://fatcat.wiki/release/2c2ztrtlkzdhfmzpf7fbindpjq Status.DIFFERENT Miss.DATASET_DOI
-* [o] https://fatcat.wiki/release/eyol2bjf6jawhjnote73ej5v24 https://fatcat.wiki/release/jowohxiuuncqbdidvqjrrb5324 Status.AMBIGUOUS OK.DUMMY
-* [ ] https://fatcat.wiki/release/d5bqydkylzelpmdfcks2v5th7q https://fatcat.wiki/release/lzcgl52npjaf3etfhhnb3d46da Status.DIFFERENT Miss.DATASET_DOI
-* [o] https://fatcat.wiki/release/5ysvoxjj4jcxbji42nnzapr6n4 https://fatcat.wiki/release/dx6wevs345cjfejokze2te6sia Status.AMBIGUOUS OK.DUMMY
-* [o] https://fatcat.wiki/release/xdclbyjgjnbehchrl7l2vi3274 https://fatcat.wiki/release/t3kqh6lfprfaff5zovh6qlodxy Status.AMBIGUOUS OK.DUMMY
-* [o] https://fatcat.wiki/release/aogvyiw67vdsnf26bufauy2rqa https://fatcat.wiki/release/aofedljjhbhajmx5doxfcv43fa Status.AMBIGUOUS OK.DUMMY
-* [o] https://fatcat.wiki/release/cjal2f6k5zesxcnrnyhc6ftg5e https://fatcat.wiki/release/oi5kzjlku5gpxjc247v6zjzosa Status.AMBIGUOUS OK.DUMMY
-* [o] https://fatcat.wiki/release/o6e6yf37y5bttbrpo4piska4gq https://fatcat.wiki/release/pchjd5fwqjdqfevphjff7ydeae Status.AMBIGUOUS OK.DUMMY
-* [ ] https://fatcat.wiki/release/l4fyyvsckneuxkq7d3y2zvkvbe https://fatcat.wiki/release/gf5hriyvuvarhcvttnooaffksi Status.DIFFERENT Miss.RELEASE_TYPE
-* [ ] https://fatcat.wiki/release/7nbcgsohrrak5cuyk6dnit6ega https://fatcat.wiki/release/q66xv7drk5fnph7enwwlkyuwqm Status.DIFFERENT Miss.CONTRIB_INTERSECTION_EMPTY
-* [ ] https://fatcat.wiki/release/2tzvdvx4t5hfxnqlnyt4rqenly https://fatcat.wiki/release/houszjo2ejbjhljxvxz23whgua Status.DIFFERENT Miss.DATASET_DOI
-* [ ] https://fatcat.wiki/release/qsxbwvreu5ehrbz65ngh2ghcra https://fatcat.wiki/release/xjvo37ynxvc3zm55bxoa545gvq Status.EXACT OK.TITLE_AUTHOR_MATCH
-* [ ] https://fatcat.wiki/release/ggzzwt6deneyrna5h65mvv7sfe https://fatcat.wiki/release/h4rnaxua75dndmq4x4snnw3qxe Status.AMBIGUOUS Miss.SHORT_TITLE
-* [ ] https://fatcat.wiki/release/skxiyp7qmraqhe2o4zvo7iq6sq https://fatcat.wiki/release/qyqre3mzgbha7hhfarn5absqnq Status.EXACT OK.TITLE_AUTHOR_MATCH
-* [o] https://fatcat.wiki/release/am53f7iyyvcjnjsgjbz7pu7dii https://fatcat.wiki/release/kdubht33hfb4dmghm2g27ck24i Status.AMBIGUOUS OK.DUMMY
-* [ ] https://fatcat.wiki/release/ofmeeajuovbqbhkgh4rujkd3xu https://fatcat.wiki/release/r6bvy6cglfe5xgafvdcokawkue Status.DIFFERENT Miss.RELEASE_TYPE
-* [o] https://fatcat.wiki/release/lezvxt2oong6xm3e3cgp47wsla https://fatcat.wiki/release/aad6r5am6vfxpbfwycmyudp2qe Status.AMBIGUOUS OK.DUMMY
-* [o] https://fatcat.wiki/release/5mzzswgebze2tk4apmbwjahp34 https://fatcat.wiki/release/vl7r3uewvvbo5i2gntocy3y2ey Status.AMBIGUOUS OK.DUMMY
-
-
diff --git a/notes/2020_11_testruns.md b/notes/2020_11_testruns.md
index ec186ac..9e700c9 100644
--- a/notes/2020_11_testruns.md
+++ b/notes/2020_11_testruns.md
@@ -1,6 +1,18 @@
-# Test runs
+# Test run notes
-## Using --min-cluster-size
+## 12/2020
+
+
+
+### More versions
+
+* https://fatcat.wiki/release/3n4enptukfg5hpsomskg7ebh2e
+* https://fatcat.wiki/release/b34keiknkvf5ril7fcajuzzt4a
+
+
+## 11/2020
+
+### Using --min-cluster-size
Skipping writes of single element clusters cuts clustering from ~42h to ~22h.
@@ -98,7 +110,7 @@ Preliminary case distribution:
4 OK.ARXIV_VERSION
```
-## Case Mining
+### Case Mining
> "-" ignore, "x" done
@@ -257,7 +269,7 @@ Different reviews.
More patterns:
-### Chapter vs Book
+Chapter vs Book
* https://fatcat.wiki/release/ameuzneqizg3ff7ep4bmg4io6m
* https://fatcat.wiki/release/s2thvzarsfbodd52w46zy2xple
diff --git a/notes/known_issues.md b/notes/known_issues.md
index 46403a0..e80acdc 100644
--- a/notes/known_issues.md
+++ b/notes/known_issues.md
@@ -3,6 +3,52 @@
Both the clustering and verification stage are not perfect. Here, some known
cases are documented.
+# General observations
+
+## One article included in different publications
+
+A DOI prefix (10.1210, The Endocrine Society) may choose to include the same
+document in different publications:
+
+* https://fatcat.wiki/release/52lwj4ip3nbdbgrgk4uwolbjt4
+* https://fatcat.wiki/release/6tbrmc3pq5axzf3yhqayq256a4
+* https://fatcat.wiki/release/457lzlw7czeo7aspcyttccvyrq
+
+## Book or Dataset
+
+Sometimes, a lexicon entry is a "dataset", sometimes a "book", e.g. "Unold, Max"
+
+* https://fatcat.wiki/release/7ah6efvk2ncjzgywch2cmtfumq
+* https://fatcat.wiki/release/nj7v4e3cxbfybozjmdiuwqo4sm
+
+## Variation in authors
+
+* https://fatcat.wiki/release/2kpa6ynwjzhtbbokqyxcl25gmm
+* https://fatcat.wiki/release/o4dh7w7nqvdknm4j336yrom4wy
+
+# Ideas for fixes
+
+* [x] when title and authors match, check the year, and maybe the doi prefix;
+ doi with the same prefix may not be duplicates
+* [x] detect arxiv versions directly
+* [ ] if multiple authors, may require more than one overlap, e.g. "by Yuting
+ Yao, Yuting Yao, Yuting Yao, Imperial College London, Imperial College
+London" - will overlap with any other author including "Imperial College
+London" -- we label `OK.SLUG_TITLE_AUTHOR_MATCH`,
+https://fatcat.wiki/release/6qbne2adybegdf6plgb7dnly2a,
+https://fatcat.wiki/release/v6cjc6kxzncztebmfgzxwov7ym
+* [ ] "article-journal" and "article" `release_type` should be treated the same, https://fatcat.wiki/release/k5zdpb45ufcy7grrppqndtxxji, https://fatcat.wiki/release/ypyse6ff4nbzrfd44resyav25m
+* [x] if title and publisher matches, but DOI and year is different, assume
+different, e.g. https://fatcat.wiki/release/k3hutukomngptcuwdys5omv2ty,
+https://fatcat.wiki/release/xmkiqj4bizcwdaq5hljpglkzqe, or
+https://fatcat.wiki/release/phuhxsj425fshp2jxfwlp5xnge and
+https://fatcat.wiki/release/2ncazub5tngkjn5ncdk65jyr4u -- these might be repeatedly published
+* [ ] article and "reply", https://pubmed.ncbi.nlm.nih.gov/5024865/, https://onlinelibrary.wiley.com/doi/abs/10.5694/j.1326-5377.1972.tb47249.x
+* [ ] figshare uses versions, too, https://fatcat.wiki/release/zmivcpjvhba25ldkx27d24oefa, https://fatcat.wiki/release/mjapiqe2nzcy3fs3hriw253dye
+* [ ] zenodo has no explicit versions, but ids might be closeby, e.g.
+ https://fatcat.wiki/release/mbnr3nrdijerto6wfjnlsmfhga,
+https://fatcat.wiki/release/mbnr3nrdijerto6wfjnlsmfhga
+
# Clustering
# Verification
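
Note on the open "more than one overlap" item in the fix list of this diff: the idea could be sketched roughly as below. This is a minimal illustration, not fuzzycat's actual verification code. The helper names `normalize_name` and `enough_overlap` and the `min_overlap` parameter are hypothetical, and contributor lists are assumed to be plain lists of name strings.

```python
import re


def normalize_name(name: str) -> str:
    """Lowercase a name and collapse punctuation and whitespace runs to single spaces."""
    return re.sub(r"[\W_]+", " ", name.lower()).strip()


def enough_overlap(a, b, min_overlap=2):
    """
    Hypothetical rule: two contributor lists only count as matching if they
    share at least `min_overlap` distinct names. When either side has fewer
    distinct names than that, the requirement drops to the size of the
    smaller side, so single-author releases can still match.
    """
    sa = {normalize_name(n) for n in a}
    sb = {normalize_name(n) for n in b}
    required = min(min_overlap, len(sa), len(sb))
    if required == 0:
        # No usable contributor names on at least one side.
        return False
    return len(sa & sb) >= required


# The case from the notes: repeated names plus an affiliation listed as a contributor.
a = ["Yuting Yao", "Yuting Yao", "Yuting Yao",
     "Imperial College London", "Imperial College London"]
b = ["Jane Doe", "Imperial College London"]  # made-up second contributor list
print(enough_overlap(a, b))
```

With this rule, a shared affiliation string alone is no longer enough to match two multi-contributor releases, while deduplicating names first keeps the repeated "Yuting Yao" entries from inflating the count.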