author | Bryan Newbold <bnewbold@robocracy.org> | 2021-11-09 18:14:58 -0800 |
---|---|---|
committer | Bryan Newbold <bnewbold@robocracy.org> | 2021-11-09 18:49:46 -0800 |
commit | 23fd36a3e8505c1ed6d13367a3fb62a8bf2242d7 (patch) | |
tree | a42efe1b89edb8c8205b9bdd480615510f7814ae /notes/cleanups | |
parent | 1024e688bb12d64648ceb638daf049d508f87561 (diff) | |
add notes about 'double slash in DOI' issue
Diffstat (limited to 'notes/cleanups')
-rw-r--r-- | notes/cleanups/double_slash_dois.md | 46 |
1 files changed, 46 insertions, 0 deletions
diff --git a/notes/cleanups/double_slash_dois.md b/notes/cleanups/double_slash_dois.md
new file mode 100644
index 00000000..d4e9ded6
--- /dev/null
+++ b/notes/cleanups/double_slash_dois.md
@@ -0,0 +1,46 @@
+
+Relevant github issue: https://github.com/internetarchive/fatcat/issues/48
+
+
+## Investigate
+
+At least some of these DOIs actually seem valid, like
+`10.1026//1616-1041.3.2.86`. So shouldn't be re-writing them!
+
+    zcat release_extid.tsv.gz \
+    | cut -f1,3 \
+    | rg '\t10\.\d+//' \
+    | wc -l
+    # 59,904
+
+    zcat release_extid.tsv.gz \
+    | cut -f1,3 \
+    | rg '\t10\.\d+//' \
+    | pv -l \
+    > doubleslash_dois.tsv
+
+Which prefixes have the most double slashes?
+
+    cat doubleslash_dois.tsv | cut -f2 | cut -d/ -f1 | sort | uniq -c | sort -nr | head
+    51220 10.1037
+    2187 10.1026
+    1316 10.1024
+    826 10.1027
+    823 10.14505
+    443 10.17010
+    186 10.46925
+    163 10.37473
+    122 10.18376
+    118 10.29392
+    [...]
+
+All of the 10.1037 DOIs seem to be registered with Crossref, and at least some
+have redirects to the not-with-double-slash versions. Not all doi.org lookups
+include a redirect.
+
+I think the "correct thing to do" here is to add special-case handling for the
+pubmed and crossref importers, and in any other case allow double slashes.
+
+Not clear that there are any specific cleanups to be done for now. A broader
+"verify that DOIs are actually valid" push and cleanup would make sense; if
+that happens checking for mangled double-slash DOIs would make sense.
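
The prefix tally from the shell pipeline in the notes can also be sketched in Python, which may be handy if the check is ever folded into a script. This is a minimal sketch and not code from the fatcat repository: the function name `count_double_slash_prefixes` is hypothetical, and it assumes the DOI sits in the third tab-separated column, as the `cut -f1,3` step implies.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Same pattern as the `rg` filter in the notes, applied to the DOI column alone.
# (In the shell version the leading tab only anchors the match to that column.)
DOUBLE_SLASH_DOI = re.compile(r"^10\.\d+//")


def count_double_slash_prefixes(rows: Iterable[str]) -> List[Tuple[str, int]]:
    """Tally registrant prefixes of DOIs with '//' right after the prefix.

    Assumption: each row is a tab-separated line with the DOI in column 3.
    Rows with fewer than three columns are skipped.
    """
    counts: Counter = Counter()
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        doi = fields[2]
        if DOUBLE_SLASH_DOI.match(doi):
            # Everything before the first slash is the registrant prefix.
            counts[doi.split("/", 1)[0]] += 1
    # Highest counts first, like `sort | uniq -c | sort -nr`.
    return counts.most_common()
```

To run it over the dump, pass a text-mode handle such as `gzip.open("release_extid.tsv.gz", "rt")` as `rows`; the result is a list of `(prefix, count)` pairs that can be truncated the way `head` does in the shell version.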