aboutsummaryrefslogtreecommitdiffstats
path: root/python/notes
diff options
context:
space:
mode:
authorMartin Czygan <martin.czygan@gmail.com>2021-06-15 18:12:41 +0200
committerMartin Czygan <martin.czygan@gmail.com>2021-06-15 18:12:41 +0200
commit4cdce8ce95df74c706394bbcd341283e4dee0525 (patch)
tree8709079bca30e27c49b3d1684ea24a5f20c6112a /python/notes
parent3a75214764505929884147e66ca0ac66eadfff05 (diff)
downloadrefcat-4cdce8ce95df74c706394bbcd341283e4dee0525.tar.gz
refcat-4cdce8ce95df74c706394bbcd341283e4dee0525.zip
notes: duplicates
Diffstat (limited to 'python/notes')
-rw-r--r--python/notes/version_4.md54
1 files changed, 54 insertions, 0 deletions
diff --git a/python/notes/version_4.md b/python/notes/version_4.md
index 10533fd..e504b2a 100644
--- a/python/notes/version_4.md
+++ b/python/notes/version_4.md
@@ -767,3 +767,57 @@ refs (2017 only)
</monogr>
</biblStruct>
```
+
+## Duplicates in combined dataset
+
+When we merge matches with refs, we find duplicates, e.g.:
+
+```
+{
+ "_id": "4kg2dejsgzaf3cszs2lt5hz4by_9",
+ "indexed_ts": "2021-06-15T15:30:42Z",
+ "source_release_ident": "4kg2dejsgzaf3cszs2lt5hz4by",
+ "source_work_ident": "2222jduvonfg3p2no5gvvf2sj4",
+ "source_year": "2011",
+ "ref_index": 9,
+ "ref_key": "ref9",
+ "target_release_ident": "itntzdjbczfmhhaynqvqcwp6wm",
+ "target_work_ident": "csff4o7yjzbz3mfszl4zvfkcua",
+ "match_provenance": "crossref",
+ "match_status": "exact",
+ "match_reason": "doi"
+}
+{
+ "_id": "4kg2dejsgzaf3cszs2lt5hz4by_9",
+ "indexed_ts": "2021-06-15T15:30:42Z",
+ "source_release_ident": "4kg2dejsgzaf3cszs2lt5hz4by",
+ "source_work_ident": "2222jduvonfg3p2no5gvvf2sj4",
+ "source_year": "2011",
+ "ref_index": 9,
+ "ref_key": "b9",
+ "match_status": "unmatched",
+ "match_reason": "unknown",
+ "target_unstructured": "Danaceau JP, Deering CE, Day JE, Smeal SJ, Johnson-Davis KL, et al. (2007) Persistence of tolerance to methamphetamine-induced monoamine deficits. Eur J Pharmacol 559: 46-54."
+}
+{
+ "_id": "4kg2dejsgzaf3cszs2lt5hz4by_9",
+ "indexed_ts": "2021-06-15T15:30:42Z",
+ "source_release_ident": "4kg2dejsgzaf3cszs2lt5hz4by",
+ "source_work_ident": "2222jduvonfg3p2no5gvvf2sj4",
+ "source_year": "2011",
+ "ref_index": 9,
+ "ref_key": "ref10",
+ "match_status": "unmatched",
+ "match_reason": "unknown"
+}
+```
+
+Here, the ref index is 9, but ref keys are different, which might come from a
+different grobid run. I feel, we should not depend on a value that we have
+little control over.
+
+As a mititgation, we'll run a final deduplication step; but that won't catch
+all duplicates, e.g. when the indices are different, but the reference is
+actually the same.
+
+Would need to "uniq" tool for the whole ref blob or something like that.