Diffstat (limited to 'python/notes/version_4.md')
-rw-r--r-- | python/notes/version_4.md | 20 |
1 files changed, 20 insertions, 0 deletions
diff --git a/python/notes/version_4.md b/python/notes/version_4.md
index e504b2a..2e273f8 100644
--- a/python/notes/version_4.md
+++ b/python/notes/version_4.md
@@ -821,3 +821,23 @@
 all duplicates, e.g. when the indices are different, but the reference is
 actually the same. Would need to "uniq" tool for the whole ref blob or
 something like that.
+
+----
+
+## QA: duplicates
+
+There seem to be many self-links in the dataset:
+
+* sample: 25668733, duplicate rows: 1913155; about 8% (although only 145030 uniq; many repetitions)
+
+```
+$ LC_ALL=C awk '$1 == $2' bref_tabs.tsv # ....
+56fbxcue6rdxlmxqto7vibg2xi 56fbxcue6rdxlmxqto7vibg2xi exact doi crossref
+o2juqzskxzdtpbait5gxg3yf4q o2juqzskxzdtpbait5gxg3yf4q exact doi crossref
+6mwdlhvbljgtdntz5qifywhsn4 6mwdlhvbljgtdntz5qifywhsn4 exact doi crossref
+t7vluqxmgbe4pipf4nkfcayedq t7vluqxmgbe4pipf4nkfcayedq exact doi crossref
+iofm6brptvczlnrys5vxw34x3i iofm6brptvczlnrys5vxw34x3i exact doi crossref
+soa44abzivcnfnsx4ymxvbyg44 soa44abzivcnfnsx4ymxvbyg44 exact doi crossref
+7fs4c3u2ofcmxie344o5e4wuxi 7fs4c3u2ofcmxie344o5e4wuxi exact doi crossref
+igyewr6er5epfozhk7dyfqa5tu igyewr6er5epfozhk7dyfqa5tu exact doi crossref
+```
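
The three figures in the added note (sample size, self-link rows, unique self-links) could probably be reproduced in one pass. The following is only a rough sketch, not the method used for the note: the function name `self_link_stats` is hypothetical, it assumes the first two whitespace-separated columns are the source and target identifiers (as in the awk output), and it assumes "uniq" refers to distinct self-linked identifiers rather than distinct full rows.

```python
from collections import Counter


def self_link_stats(lines):
    """Count rows whose first two whitespace-separated fields are equal.

    Hypothetical helper; mirrors the awk filter `$1 == $2` for string
    identifiers. Rows with fewer than two fields are skipped.
    """
    total = 0
    self_links = Counter()  # identifier -> number of self-link rows
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        total += 1
        if fields[0] == fields[1]:
            self_links[fields[0]] += 1
    dup_rows = sum(self_links.values())
    return {
        "total": total,
        "self_links": dup_rows,
        "unique": len(self_links),
        "share": dup_rows / total if total else 0.0,
    }


if __name__ == "__main__":
    # Made-up rows for illustration; for the real file, pass an open
    # file handle instead, e.g. open("bref_tabs.tsv").
    demo = [
        "aaa aaa exact doi crossref",
        "aaa aaa exact doi crossref",
        "aaa bbb exact doi crossref",
        "ccc ccc exact doi crossref",
    ]
    print(self_link_stats(demo))
```

Keeping a per-identifier `Counter` rather than a plain set also shows which identifiers account for the many repetitions, e.g. via `most_common()` if the counter were returned.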