Diffstat (limited to 'notes')
-rw-r--r-- notes/2020_11_testruns.md 20
-rw-r--r-- notes/known_issues.md 46
2 files changed, 62 insertions, 4 deletions
diff --git a/notes/2020_11_testruns.md b/notes/2020_11_testruns.md
index ec186ac..9e700c9 100644
--- a/notes/2020_11_testruns.md
+++ b/notes/2020_11_testruns.md
@@ -1,6 +1,18 @@
-# Test runs
+# Test run notes
-## Using --min-cluster-size
+## 12/2020
+
+
+
+### More versions
+
+* https://fatcat.wiki/release/3n4enptukfg5hpsomskg7ebh2e
+* https://fatcat.wiki/release/b34keiknkvf5ril7fcajuzzt4a
+
+
+## 11/2020
+
+### Using --min-cluster-size
Skipping writes of single element clusters cuts clustering from ~42h to ~22h.
@@ -98,7 +110,7 @@ Preliminary case distribution:
4 OK.ARXIV_VERSION
```
-## Case Mining
+### Case Mining
> "-" ignore, "x" done
@@ -257,7 +269,7 @@ Different reviews.
More patterns:
-### Chapter vs Book
+Chapter vs Book
* https://fatcat.wiki/release/ameuzneqizg3ff7ep4bmg4io6m
* https://fatcat.wiki/release/s2thvzarsfbodd52w46zy2xple
diff --git a/notes/known_issues.md b/notes/known_issues.md
index 46403a0..e80acdc 100644
--- a/notes/known_issues.md
+++ b/notes/known_issues.md
@@ -3,6 +3,52 @@
Neither the clustering nor the verification stage is perfect. Here, some known
cases are documented.
+# General observations
+
+## One article included in different publications
+
+A publisher (DOI prefix 10.1210, The Endocrine Society) may include the same
+document in different publications:
+
+* https://fatcat.wiki/release/52lwj4ip3nbdbgrgk4uwolbjt4
+* https://fatcat.wiki/release/6tbrmc3pq5axzf3yhqayq256a4
+* https://fatcat.wiki/release/457lzlw7czeo7aspcyttccvyrq
+
+## Book or Dataset
+
+Sometimes, a lexicon entry is a "dataset", sometimes a "book", e.g. "Unold, Max"
+
+* https://fatcat.wiki/release/7ah6efvk2ncjzgywch2cmtfumq
+* https://fatcat.wiki/release/nj7v4e3cxbfybozjmdiuwqo4sm
+
+## Variation in authors
+
+* https://fatcat.wiki/release/2kpa6ynwjzhtbbokqyxcl25gmm
+* https://fatcat.wiki/release/o4dh7w7nqvdknm4j336yrom4wy
+
+# Ideas for fixes
+
+* [x] when title and authors match, check the year, and maybe the doi prefix;
+ doi with the same prefix may not be duplicates
+* [x] detect arxiv versions directly
+* [ ] if multiple authors, may require more than one overlap, e.g. "by Yuting
+ Yao, Yuting Yao, Yuting Yao, Imperial College London, Imperial College
+London" - will overlap with any other author including "Imperial College
+London" -- we label `OK.SLUG_TITLE_AUTHOR_MATCH`,
+https://fatcat.wiki/release/6qbne2adybegdf6plgb7dnly2a,
+https://fatcat.wiki/release/v6cjc6kxzncztebmfgzxwov7ym
+* [ ] "article-journal" and "article" `release_type` should be treated the same, https://fatcat.wiki/release/k5zdpb45ufcy7grrppqndtxxji, https://fatcat.wiki/release/ypyse6ff4nbzrfd44resyav25m
+* [x] if title and publisher match, but DOI and year differ, assume
+different, e.g. https://fatcat.wiki/release/k3hutukomngptcuwdys5omv2ty,
+https://fatcat.wiki/release/xmkiqj4bizcwdaq5hljpglkzqe, or
+https://fatcat.wiki/release/phuhxsj425fshp2jxfwlp5xnge and
+https://fatcat.wiki/release/2ncazub5tngkjn5ncdk65jyr4u -- these might be repeatedly published
+* [ ] article and "reply", https://pubmed.ncbi.nlm.nih.gov/5024865/, https://onlinelibrary.wiley.com/doi/abs/10.5694/j.1326-5377.1972.tb47249.x
+* [ ] figshare uses versions, too, https://fatcat.wiki/release/zmivcpjvhba25ldkx27d24oefa, https://fatcat.wiki/release/mjapiqe2nzcy3fs3hriw253dye
+* [ ] zenodo has no explicit versions, but ids might be close by, e.g.
+ https://fatcat.wiki/release/mbnr3nrdijerto6wfjnlsmfhga,
+https://fatcat.wiki/release/mbnr3nrdijerto6wfjnlsmfhga
+
# Clustering
# Verification
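
Some of the ideas listed under "Ideas for fixes" in the `known_issues.md` hunk above can be sketched in Python. This is a minimal sketch under stated assumptions, not the project's actual verification code: the function names (`doi_prefix`, `normalize_release_type`, `enough_author_overlap`) and the `min_overlap` default of 2 are hypothetical and chosen only for illustration.

```python
import re


def doi_prefix(doi):
    """Return the registrant prefix of a DOI (e.g. '10.1210'), or None."""
    if not doi:
        return None
    prefix, sep, _ = doi.strip().lower().partition("/")
    # A DOI needs a "/" separator and a prefix starting with "10."
    return prefix if sep and prefix.startswith("10.") else None


def normalize_release_type(release_type):
    """Treat "article-journal" and "article" as the same release type."""
    if release_type == "article-journal":
        return "article"
    return release_type


def author_set(names):
    """Lowercase, collapse whitespace, and deduplicate author strings."""
    return {
        re.sub(r"\s+", " ", name).strip().lower()
        for name in names
        if name and name.strip()
    }


def enough_author_overlap(a, b, min_overlap=2):
    """Require more than one shared author when both sides list several.

    Hypothetical rule: with multiple distinct authors on both sides, a
    single shared entry (such as an affiliation string) is not enough.
    """
    sa, sb = author_set(a), author_set(b)
    if not sa or not sb:
        return False
    # Never require more matches than the smaller side can provide.
    required = min(min_overlap, len(sa), len(sb))
    return len(sa & sb) >= required


if __name__ == "__main__":
    a = ["Yuting Yao", "Yuting Yao", "Yuting Yao",
         "Imperial College London", "Imperial College London"]
    b = ["Jane Doe", "Imperial College London"]  # made-up second record
    print(enough_author_overlap(a, b))
    print(doi_prefix("10.1210/example"))  # made-up DOI suffix
```

Deduplicating before counting matters for the "Yuting Yao" case: repeated names and a shared affiliation string would otherwise inflate the overlap. Comparing `doi_prefix` values alone cannot decide duplication, since the notes point out that DOIs with the same prefix may still refer to different documents.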