author | Bryan Newbold <bnewbold@robocracy.org> | 2021-11-29 14:33:14 -0800 |
---|---|---|
committer | Bryan Newbold <bnewbold@robocracy.org> | 2021-11-29 14:33:14 -0800 |
commit | c5ea2dba358624f4c14da0a1a988ae14d0edfd59 (patch) | |
tree | 7d3934e4922439402f882a374fe477906fd41aae /notes/cleanups/double_slash_dois.md | |
parent | ec2809ef2ac51c992463839c1e3451927f5e1661 (diff) | |
move 'cleanups' directory from notes to extra/
Diffstat (limited to 'notes/cleanups/double_slash_dois.md')
-rw-r--r-- | notes/cleanups/double_slash_dois.md | 46 |
1 files changed, 0 insertions, 46 deletions
diff --git a/notes/cleanups/double_slash_dois.md b/notes/cleanups/double_slash_dois.md
deleted file mode 100644
index d4e9ded6..00000000
--- a/notes/cleanups/double_slash_dois.md
+++ /dev/null
@@ -1,46 +0,0 @@
-
-Relevant github issue: https://github.com/internetarchive/fatcat/issues/48
-
-
-## Investigate
-
-At least some of these DOIs actually seem valid, like
-`10.1026//1616-1041.3.2.86`. So shouldn't be re-writing them!
-
-    zcat release_extid.tsv.gz \
-        | cut -f1,3 \
-        | rg '\t10\.\d+//' \
-        | wc -l
-    # 59,904
-
-    zcat release_extid.tsv.gz \
-        | cut -f1,3 \
-        | rg '\t10\.\d+//' \
-        | pv -l \
-        > doubleslash_dois.tsv
-
-Which prefixes have the most double slashes?
-
-    cat doubleslash_dois.tsv | cut -f2 | cut -d/ -f1 | sort | uniq -c | sort -nr | head
-      51220 10.1037
-       2187 10.1026
-       1316 10.1024
-        826 10.1027
-        823 10.14505
-        443 10.17010
-        186 10.46925
-        163 10.37473
-        122 10.18376
-        118 10.29392
-    [...]
-
-All of the 10.1037 DOIs seem to be registered with Crossref, and at least some
-have redirects to the not-with-double-slash versions. Not all doi.org lookups
-include a redirect.
-
-I think the "correct thing to do" here is to add special-case handling for the
-pubmed and crossref importers, and in any other case allow double slashes.
-
-Not clear that there are any specific cleanups to be done for now. A broader
-"verify that DOIs are actually valid" push and cleanup would make sense; if
-that happens checking for mangled double-slash DOIs would make sense.
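
For readers without the shell tools at hand, the per-prefix count in the deleted note could be approximated in Python. This is only a sketch, not code from the fatcat repository: the helper name `count_double_slash_prefixes` is hypothetical, the input is assumed to be an iterable of bare DOI strings (the second column of `doubleslash_dois.tsv`), and every sample DOI except `10.1026//1616-1041.3.2.86` is made up for illustration.

```python
import re
from collections import Counter

# Same idea as the rg filter in the note, minus the leading tab: a DOI
# prefix ("10." plus digits) followed immediately by two slashes.
DOUBLE_SLASH_RE = re.compile(r"^10\.\d+//")


def count_double_slash_prefixes(dois):
    """Count double-slash DOIs per prefix (hypothetical helper)."""
    counts = Counter()
    for doi in dois:
        doi = doi.strip()
        if DOUBLE_SLASH_RE.match(doi):
            # Everything before the first slash is the prefix, e.g. "10.1037".
            counts[doi.split("/", 1)[0]] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "10.1026//1616-1041.3.2.86",  # taken from the note
        "10.1037//made-up-suffix-a",  # illustrative only
        "10.1037//made-up-suffix-b",  # illustrative only
        "10.1234/abc//def",           # double slash not after the prefix: skipped
    ]
    for prefix, n in count_double_slash_prefixes(sample).most_common():
        print(n, prefix)
```

Like the `rg` pattern, this only matches a double slash directly after the prefix, so a DOI with `//` later in its suffix is not counted.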