notes/2021_openalex.txt


With raw_issn:

    Counter({'total': 49057, 'inserted': 36959, 'missing-issn': 11749, 'no-match': 198, 'duplicate': 151})

With issnl:

    Counter({'total': 49057, 'inserted': 36959, 'missing-issn': 11947, 'duplicate': 151})
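
Quick sanity check on the two runs, as a sketch: the buckets other than `total` should add up to `total` in both, and the only change between runs is which bucket 198 rows land in. The helper names `outcome_sum` and `changed_buckets` are made up for this example; the numbers are copied from the output above.

```python
from collections import Counter

# Counts copied from the two runs above.
raw_issn = Counter({'total': 49057, 'inserted': 36959, 'missing-issn': 11749,
                    'no-match': 198, 'duplicate': 151})
issnl = Counter({'total': 49057, 'inserted': 36959, 'missing-issn': 11947,
                 'duplicate': 151})


def outcome_sum(counts):
    """Sum every bucket except 'total'."""
    return sum(n for key, n in counts.items() if key != 'total')


def changed_buckets(before, after):
    """Return {bucket: after - before} for buckets whose count differs."""
    keys = set(before) | set(after)
    # Counter returns 0 for missing keys, so 'no-match' compares as 198 vs 0.
    return {key: after[key] - before[key] for key in keys if after[key] != before[key]}


print(outcome_sum(raw_issn), outcome_sum(issnl))  # 49057 49057
print(changed_buckets(raw_issn, issnl))
```

The difference is +198 for `missing-issn` and -198 for `no-match`, which is consistent with the 198 `no-match` rows from the raw_issn run being counted as `missing-issn` in the issnl run.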

---

Running some quick initial metadata quality checks on the OpenAlex journal
list. This is from the pre-release, dated 2021-10-11 in the file names (though
it was announced in late November 2021).

Looking for ISSN-L dupes:

    cat openalex-journals.txt | cut -f5 | rg '\-' | sort | uniq -d | wc -l
    # 146

    cat openalex-journals.txt | cut -f5 | rg '\-' | sort | uniq -D | wc -l
    # 293
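
The two numbers measure different things: `uniq -d` prints one line per repeated value, `uniq -D` prints every row that belongs to a repeated group. 293 rows over 146 values (146 * 2 = 292, plus 1) implies 145 pairs and one ISSN-L that appears three times. A rough Python equivalent, as a sketch; `duplicate_stats` and `read_column` are names invented here, and column index 4 assumes the same layout as `cut -f5`:

```python
from collections import Counter


def duplicate_stats(values):
    """Return (distinct values seen more than once, total rows holding such values)."""
    counts = Counter(values)
    repeated = {value: n for value, n in counts.items() if n > 1}
    return len(repeated), sum(repeated.values())


def read_column(path, index):
    """Yield one tab-separated column (0-based), keeping only values containing a hyphen."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if index < len(fields) and "-" in fields[index]:
                yield fields[index]


# Toy data (not from the dump): one pair and one triple.
toy = ["1234-5678", "1234-5678", "0000-0001",
       "2222-3333", "2222-3333", "2222-3333"]
print(duplicate_stats(toy))  # (2, 5)
```

Against the real file this would be `duplicate_stats(read_column("openalex-journals.txt", 4))`.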

Looking for ISSN-L not in a recent dump of ISSN-Ls from issn.org:

    cat openalex-journals.txt | cut -f5 | rg '\-' | sort -u > openalex-issnl.tsv
    cat ISSN-to-ISSN-L.txt | cut -f2 | rg -v "ISSN" | rg '\-' | sort -u > issnl.tsv

    comm -23 openalex-issnl.tsv issnl.tsv | wc -l
    # 249

    comm -23 openalex-issnl.tsv issnl.tsv > openalex_unknown_issnl.txt
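
One way to triage `openalex_unknown_issnl.txt` is to separate values that fail the standard ISSN check digit (likely typos) from values that are well-formed but simply not in the public list. A minimal sketch of the check digit test; `issn_checksum_ok` is a name made up here and this was not part of the original run:

```python
import re

# Four digits, hyphen, three digits, then a check character (digit or X).
ISSN_PATTERN = re.compile(r"^(\d{4})-(\d{3})([\dXx])$")


def issn_checksum_ok(issn):
    """Return True if the string is shaped like an ISSN and its check digit matches."""
    match = ISSN_PATTERN.match(issn.strip())
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    # The first seven digits are weighted 8, 7, ..., 2.
    total = sum(int(d) * w for d, w in zip(digits, range(8, 1, -1)))
    expected = (11 - total % 11) % 11
    expected_char = "X" if expected == 10 else str(expected)
    return match.group(3).upper() == expected_char


print(issn_checksum_ok("0378-5955"))  # True
print(issn_checksum_ok("0378-5954"))  # False
```

A passing checksum only says the value is well-formed, not that it is registered; the lookup against the issn.org dump is still needed for that.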

There are a few reasons that ISSNs might not be in the public list or available
through https://portal.issn.org (eg, sometimes there are typos which then
become widely used; or the ISSN is only partially registered). But if the
ISSN-L is used as a persistent identifier, it should be required to be valid
and publicly registered.

Looking for duplicate exact homepage URLs:
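
For the exact-duplicate homepage URL check, one possible approach (a sketch only, not the command that was originally run, and with no results implied). It assumes rows of (JournalId, Webpage) as produced by `cut -f1,10`; `exact_duplicate_urls` is a hypothetical helper and the sample rows are invented:

```python
from collections import defaultdict


def exact_duplicate_urls(rows):
    """Map each homepage URL shared by two or more journals to their ids."""
    by_url = defaultdict(list)
    for journal_id, url in rows:
        url = url.strip()
        if url:  # skip journals with no homepage recorded
            by_url[url].append(journal_id)
    return {url: ids for url, ids in by_url.items() if len(ids) > 1}


# Invented rows for illustration only.
sample_rows = [
    ("1", "http://example.org/journal-a"),
    ("2", "http://example.org/journal-b"),
    ("3", "http://example.org/journal-a"),
    ("4", ""),
]
print(exact_duplicate_urls(sample_rows))
```

This only catches byte-identical URLs; variants differing in scheme, `www.` prefix, or trailing slash would need normalization first.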

Look for "normalized name" duplicates:

    cat openalex-journals.txt | cut -f3 | sort | uniq -d | wc -l
    # 14

Not many, good.

Look for bogus homepage URLs:

    cat openalex-journals.txt | cut -f1,10 | rg -v '\t$' | rg -v '://'

    JournalId       Webpage
    2944001180      www.cjb-rcb.ca
    2764771476      123\
    2948018973      ores.su/en/journals/chinese-journal-of-ecology/
    2764943583      197\
    2946866068      www.kais99.org
    2764846895      518\
    2947334459      www.jasnaoe.or.jp/en/
    2764943300      65\
    2944560164      www.ijqf.org
    2765015668      10\
    2764518604      116\
    2764649715      430\
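
The output above mixes two kinds of problems: hostnames that only lack a scheme (fixable) and junk like `123\` (not fixable). A sketch of how those could be told apart; `classify_homepage` and the `HOSTLIKE` pattern are made up for this example and the pattern is deliberately loose:

```python
import re
from urllib.parse import urlsplit

# Loose "looks like host[/path]" pattern: dotted labels, optional path.
HOSTLIKE = re.compile(r"^[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+(/.*)?$")


def classify_homepage(value):
    """Return 'empty', 'ok', 'missing-scheme', or 'junk' for a Webpage value."""
    value = value.strip()
    if not value:
        return "empty"
    if "://" in value:
        parts = urlsplit(value)
        if parts.scheme in ("http", "https") and parts.netloc:
            return "ok"
        return "junk"
    if HOSTLIKE.match(value):
        return "missing-scheme"
    return "junk"


for value in ["www.cjb-rcb.ca", "123\\", "www.jasnaoe.or.jp/en/"]:
    print(repr(value), classify_homepage(value))
```

Values classified as `missing-scheme` could probably be repaired by prefixing a scheme, after checking that the site actually resolves.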

HTTP/HTTPS:

    cat openalex-journals.txt | cut -f10 | rg '://' | cut -f1 -d: | sort | uniq -c
       5483 http
        873 https

Probably a whole bunch of these could be `https://` instead of `http://`, which
would improve end-user security/privacy by default.
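
A sketch of generating `https://` candidates from the `http://` URLs. This is string rewriting only: whether the `https://` version actually works has to be confirmed with a real request, which is not done here. `https_candidate` is a hypothetical helper and the URLs are placeholders:

```python
from urllib.parse import urlsplit, urlunsplit


def https_candidate(url):
    """Return the https:// form of an http:// URL, or None for anything else."""
    parts = urlsplit(url.strip())
    if parts.scheme != "http":
        return None
    netloc = parts.netloc
    # An explicit :80 would point https at the plain-HTTP port, so drop it.
    if netloc.endswith(":80"):
        netloc = netloc[:-3]
    return urlunsplit(("https", netloc, parts.path, parts.query, parts.fragment))


print(https_candidate("http://www.example.org/journal?id=1"))
print(https_candidate("https://www.example.org/"))  # None
```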

Top domains:

    cat openalex-journals.txt | cut -f10 | rg '://' | cut -f3 -d/ | sed 's/www\.//g' | sort | uniq -c | sort -nr | head -n20
        463 journals.elsevier.com
        412 onlinelibrary.wiley.com
        304 springer.com
        286 sciencedirect.com
        183 sagepub.com
        183 elsevier.com
        169 tandfonline.com
         91 journals.cambridge.org
         75 worldscinet.com
         63 informahealthcare.com
         62 apa.org
         43 pubs.acs.org
         43 press.jhu.edu
         39 wiley.com
         35 pdcnet.org
         35 journals.uchicago.edu
         35 journals.lww.com
         34 degruyter.com
         33 uk.sagepub.com
         31 rsc.org

These look pretty good! Often catalogs have a bunch of URLs that just point to
aggregators, etc, but these seem like real homepage domains.
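
A rough Python version of the domain tally, as a sketch. It differs from the shell pipeline in small ways: only a leading `www.` is stripped (the `sed` expression removes `www.` anywhere in the host), and `urlsplit(...).hostname` lowercases the host and drops any port. `top_domains` is a name made up here:

```python
from collections import Counter
from urllib.parse import urlsplit


def top_domains(urls, n=20):
    """Count hostnames (minus a leading 'www.') and return the n most common."""
    hosts = Counter()
    for url in urls:
        if "://" not in url:
            continue  # same filter as rg '://'
        host = urlsplit(url.strip()).hostname
        if not host:
            continue
        if host.startswith("www."):
            host = host[4:]
        hosts[host] += 1
    return hosts.most_common(n)


# Placeholder URLs, not taken from the dump.
sample_urls = [
    "http://www.example.org/journal/1",
    "https://example.org/journal/2",
    "http://WWW.Example.com:80/",
    "www.no-scheme.example",
]
print(top_domains(sample_urls))  # [('example.org', 2), ('example.com', 1)]
```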

Any wayback URLs in there?

    cat openalex-journals.txt | cut -f1,10 | rg archive.org
    172099791       http://web.archive.org/web/20090803131854/http://www.rejecta.org:80/
    59114670        http://web.archive.org/web/http://www.multilingualarchive.com/ma/enwiki/es/botaniska_notiser

The first wayback URL seems reasonable (journal is defunct, but homepage was
captured).

The second wayback URL isn't good (we don't have a capture, and the URL
structure isn't complete), and there seems to be a live-web homepage for the
back catalog:

    https://journals.lub.lu.se/bn/index
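
The two wayback URLs differ in shape: the first carries a 14-digit capture timestamp, the second has none. A sketch for pulling the timestamp and original URL apart, which makes the incomplete ones easy to flag; `parse_wayback` and the `WAYBACK` pattern are made up for this example and only cover the `web.archive.org/web/...` form:

```python
import re

# /web/<timestamp>[modifier]/<original-url>, with the timestamp part optional.
WAYBACK = re.compile(r"^https?://web\.archive\.org/web/(?:(\d{4,14})[a-z_]*/)?(.+)$")


def parse_wayback(url):
    """Return (timestamp or None, original URL), or None if not a wayback URL."""
    match = WAYBACK.match(url.strip())
    if not match:
        return None
    return match.group(1), match.group(2)


first = "http://web.archive.org/web/20090803131854/http://www.rejecta.org:80/"
second = ("http://web.archive.org/web/"
          "http://www.multilingualarchive.com/ma/enwiki/es/botaniska_notiser")
print(parse_wayback(first))   # ('20090803131854', 'http://www.rejecta.org:80/')
print(parse_wayback(second))  # timestamp is None
```

A missing timestamp does not by itself prove there is no capture, so those URLs still need a manual look.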