| author | Bryan Newbold <bnewbold@archive.org> | 2023-01-04 21:32:09 -0800 | 
|---|---|---|
| committer | Bryan Newbold <bnewbold@archive.org> | 2023-01-04 21:32:09 -0800 | 
| commit | 101f47768822821c3e05d147c0c95ceef29f1769 (patch) | |
| tree | 226abef1a67a73a50aa6ba87da8f8394574410b8 | |
| parent | 619e26b353b9b4b16df494b087b2173a6ce06eec (diff) | |
| -rw-r--r-- | notes/2021_openalex.txt | 115 | 
1 files changed, 115 insertions, 0 deletions
diff --git a/notes/2021_openalex.txt b/notes/2021_openalex.txt
new file mode 100644
index 0000000..ff3f485
--- /dev/null
+++ b/notes/2021_openalex.txt
@@ -0,0 +1,115 @@
+
+With raw_issn:
+
+    Counter({'total': 49057, 'inserted': 36959, 'missing-issn': 11749, 'no-match': 198, 'duplicate': 151})
+
+With issnl:
+
+    Counter({'total': 49057, 'inserted': 36959, 'missing-issn': 11947, 'duplicate': 151})
+
+---
+
+Running some quick initial metadata quality checks on OpenAlex Journal list.
+This is from the pre-release, dated in file names as 2021-10-11 (but announced
+in late November 2021).
+
+Looking for ISSN-L dupes:
+
+    cat openalex-journals.txt | cut -f5 | rg '\-' | sort | uniq -d | wc -l
+    # 146
+
+    cat openalex-journals.txt | cut -f5 | rg '\-' | sort | uniq -D | wc -l
+    # 293
+
+Looking for ISSN-L not in a recent dump of ISSN-Ls from issn.org:
+
+    cat openalex-journals.txt | cut -f5 | rg '\-' | sort -u > openalex-issnl.tsv
+    cat ISSN-to-ISSN-L.txt | cut -f2 | rg -v "ISSN" | rg '\-' | sort -u > issnl.tsv
+
+    comm -23 openalex-issnl.tsv issnl.tsv | wc -l
+    # 249
+
+    comm -23 openalex-issnl.tsv issnl.tsv > openalex_unknown_issnl.txt
+
+Looking for duplicate exact homepage URLs:
+
+There are a few reasons that ISSNs might not be in the public list or available
+through https://portal.issn.org (eg, sometimes there are typos which then
+become widely used; or the ISSN is partially registered). But if using the
+ISSN-L as a persistent identifier, should require it to be valid and publicly
+registered.
+
+Look for "normalized name" duplicates:
+
+    cat openalex-journals.txt | cut -f3 | sort | uniq -d | wc -l
+    14
+
+Not many, good.
+
+Look for bogus homepage URLs:
+
+    cat openalex-journals.txt | cut -f1,10 | rg -v '\t$' | rg -v '://'
+
+    JournalId       Webpage
+    2944001180      www.cjb-rcb.ca
+    2764771476      123\
+    2948018973      ores.su/en/journals/chinese-journal-of-ecology/
+    2764943583      197\
+    2946866068      www.kais99.org
+    2764846895      518\
+    2947334459      www.jasnaoe.or.jp/en/
+    2764943300      65\
+    2944560164      www.ijqf.org
+    2765015668      10\
+    2764518604      116\
+    2764649715      430\
+
+HTTP/HTTPS:
+
+    cat openalex-journals.txt | cut -f10 | rg '://' | cut -f1 -d: | sort | uniq -c
+       5483 http
+        873 https
+
+Probably a whole bunch of these could be `https://` instead of `http://`, which
+would improve end-user security/privacy by default.
+
+Top domains:
+
+    cat openalex-journals.txt | cut -f10 | rg '://' | cut -f3 -d/ | sed 's/www\.//g' | sort | uniq -c | sort -nr | head -n20
+        463 journals.elsevier.com
+        412 onlinelibrary.wiley.com
+        304 springer.com
+        286 sciencedirect.com
+        183 sagepub.com
+        183 elsevier.com
+        169 tandfonline.com
+         91 journals.cambridge.org
+         75 worldscinet.com
+         63 informahealthcare.com
+         62 apa.org
+         43 pubs.acs.org
+         43 press.jhu.edu
+         39 wiley.com
+         35 pdcnet.org
+         35 journals.uchicago.edu
+         35 journals.lww.com
+         34 degruyter.com
+         33 uk.sagepub.com
+         31 rsc.org
+
+These look pretty good! Often catalogs have a bunch of URLs that just point to
+aggregators, etc, but these seem like real hompage domains.
+
+Any wayback URLs in there?
+
+    cat openalex-journals.txt | cut -f1,10 | rg archive.org
+    172099791       http://web.archive.org/web/20090803131854/http://www.rejecta.org:80/
+    59114670        http://web.archive.org/web/http://www.multilingualarchive.com/ma/enwiki/es/botaniska_notiser
+
+The first wayback URL seems reasonable (journal is defunct, but homepage was
+captured).
+
+The second wayback URL isn't good (we don't have a capture, and URL structure
+isn't complete) and there seems to be a live-web homepage for the backcatalog:
+
+    https://journals.lub.lu.se/bn/index
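
The homepage URL checks in the notes (no scheme, http vs https, wayback links) could also be done in a single pass. The following is a minimal Python sketch and is not part of this commit. The function names `classify_homepage` and `summarize` and the label strings are hypothetical, chosen here for illustration. It assumes the caller has already extracted the homepage column (field 10 in the `cut` commands) as plain strings.

```python
from collections import Counter
from urllib.parse import urlsplit


def classify_homepage(url: str) -> str:
    """Return a coarse label for one homepage URL value (hypothetical labels)."""
    url = url.strip()
    if not url:
        return "empty"
    if "://" not in url:
        # Same test as `rg -v '://'` in the notes: no scheme at all.
        return "no-scheme"
    try:
        parts = urlsplit(url)
    except ValueError:
        # urlsplit rejects a few malformed netlocs (e.g. bad IPv6 brackets).
        return "unparseable"
    host = (parts.hostname or "").lower()
    if host == "archive.org" or host.endswith(".archive.org"):
        # Wayback links are labeled separately from plain http/https.
        return "wayback"
    scheme = parts.scheme.lower()
    if scheme in ("http", "https"):
        return scheme
    return "other-scheme"


def summarize(urls):
    """Count labels over an iterable of homepage URL strings."""
    return Counter(classify_homepage(u) for u in urls)


if __name__ == "__main__":
    # Sample values copied from the notes above.
    samples = [
        "www.cjb-rcb.ca",
        "123\\",
        "http://web.archive.org/web/20090803131854/http://www.rejecta.org:80/",
        "https://journals.lub.lu.se/bn/index",
    ]
    print(summarize(samples))
```

One difference from the shell pipeline: here a wayback link is counted under its own label and not under `http`, so the http/https totals would not match the `uniq -c` numbers in the notes exactly.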
