author: Martin Czygan <martin.czygan@gmail.com> 2021-06-21 20:03:45 +0200
committer: Martin Czygan <martin.czygan@gmail.com> 2021-06-21 20:03:45 +0200
commit: 1d24518ddd1b61d8291af2b8ca5b1a5ac7ef705b
tree: 05dad31420f1142f143209f0cfe36e2e8249f8a0 /python/notes
parent: e82983633b58ead6ed2ce82e36a17af227d5f5ed
update script, notes
Diffstat (limited to 'python/notes')
-rw-r--r-- python/notes/version_4.md | 20
1 file changed, 20 insertions, 0 deletions
diff --git a/python/notes/version_4.md b/python/notes/version_4.md
index e504b2a..2e273f8 100644
--- a/python/notes/version_4.md
+++ b/python/notes/version_4.md
@@ -821,3 +821,23 @@ all duplicates, e.g. when the indices are different, but the reference is
actually the same.
Would need to "uniq" tool for the whole ref blob or something like that.
+
+----
+
+## QA: duplicates
+
+There seem to be many self-links in the dataset:
+
+* sample: 25668733, duplicate rows: 1913155; about 7.5% (although only 145030 unique; many repetitions)
+
+```
+$ LC_ALL=C awk '$1 == $2' bref_tabs.tsv # ....
+56fbxcue6rdxlmxqto7vibg2xi 56fbxcue6rdxlmxqto7vibg2xi exact doi crossref
+o2juqzskxzdtpbait5gxg3yf4q o2juqzskxzdtpbait5gxg3yf4q exact doi crossref
+6mwdlhvbljgtdntz5qifywhsn4 6mwdlhvbljgtdntz5qifywhsn4 exact doi crossref
+t7vluqxmgbe4pipf4nkfcayedq t7vluqxmgbe4pipf4nkfcayedq exact doi crossref
+iofm6brptvczlnrys5vxw34x3i iofm6brptvczlnrys5vxw34x3i exact doi crossref
+soa44abzivcnfnsx4ymxvbyg44 soa44abzivcnfnsx4ymxvbyg44 exact doi crossref
+7fs4c3u2ofcmxie344o5e4wuxi 7fs4c3u2ofcmxie344o5e4wuxi exact doi crossref
+igyewr6er5epfozhk7dyfqa5tu igyewr6er5epfozhk7dyfqa5tu exact doi crossref
+```
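
Note (not part of the commit): the counts quoted in the QA section could be reproduced with something like the sketch below. It is a minimal illustration, not the script the author used. The function name `self_link_stats` is hypothetical, the input is assumed to be whitespace-separated with the two identifiers in the first two columns (as in the awk one-liner), and "uniq" is assumed to mean distinct self-linked identifiers.

```python
import io
from collections import Counter


def self_link_stats(lines):
    """Return (total_rows, self_link_rows, distinct_self_linked_ids, share)."""
    total = 0
    self_links = Counter()
    for line in lines:
        # Split on any whitespace, similar to awk's default field splitting.
        fields = line.split()
        if len(fields) < 2:
            # Skip blank or malformed rows (assumption: these are not counted).
            continue
        total += 1
        if fields[0] == fields[1]:
            # Same test as the awk expression `$1 == $2`.
            self_links[fields[0]] += 1
    self_link_rows = sum(self_links.values())
    share = self_link_rows / total if total else 0.0
    return total, self_link_rows, len(self_links), share


# Tiny made-up sample; the identifiers here are placeholders, not real data.
sample = io.StringIO(
    "aaa\taaa\texact\tdoi\tcrossref\n"
    "aaa\taaa\texact\tdoi\tcrossref\n"
    "bbb\tccc\texact\tdoi\tcrossref\n"
    "ddd\tddd\texact\tdoi\tcrossref\n"
)
total, rows, distinct, share = self_link_stats(sample)
print(total, rows, distinct, f"{share:.1%}")
```

On the real file one would pass an open file handle, e.g. `self_link_stats(open("bref_tabs.tsv"))`. With the figures from the notes, 1913155 / 25668733 comes out at roughly 7.5%.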