summaryrefslogtreecommitdiffstats
path: root/extra/cleanups/double_slash_dois.md
blob: d4e9ded683c1b7d01f088d4b5d5f01729eca073d (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46

Relevant github issue: https://github.com/internetarchive/fatcat/issues/48


## Investigate

At least some of these DOIs actually seem valid, like
`10.1026//1616-1041.3.2.86`. So shouldn't be re-writing them!

    zcat release_extid.tsv.gz \
        | cut -f1,3 \
        | rg '\t10\.\d+//' \
        | wc -l
    # 59,904

    zcat release_extid.tsv.gz \
        | cut -f1,3 \
        | rg '\t10\.\d+//' \
        | pv -l \
        > doubleslash_dois.tsv

Which prefixes have the most double slashes?

    cat doubleslash_dois.tsv | cut -f2 | cut -d/ -f1 | sort | uniq -c | sort -nr | head
      51220 10.1037
       2187 10.1026
       1316 10.1024
        826 10.1027
        823 10.14505
        443 10.17010
        186 10.46925
        163 10.37473
        122 10.18376
        118 10.29392
        [...]

All of the 10.1037 DOIs seem to be registered with Crossref, and at least some
have redirects to the not-with-double-slash versions. Not all doi.org lookups
include a redirect.

I think the "correct thing to do" here is to add special-case handling for the
pubmed and crossref importers, and in any other case allow double slashes.

Not clear that there are any specific cleanups to be done for now. A broader
"verify that DOIs are actually valid" push and cleanup would make sense; if
that happens checking for mangled double-slash DOIs would make sense.