Diffstat (limited to 'notes')
-rw-r--r-- | notes/2021_openalex.txt | 115 |
1 files changed, 115 insertions, 0 deletions
diff --git a/notes/2021_openalex.txt b/notes/2021_openalex.txt
new file mode 100644
index 0000000..ff3f485
--- /dev/null
+++ b/notes/2021_openalex.txt
@@ -0,0 +1,115 @@
+
+With raw_issn:
+
+    Counter({'total': 49057, 'inserted': 36959, 'missing-issn': 11749, 'no-match': 198, 'duplicate': 151})
+
+With issnl:
+
+    Counter({'total': 49057, 'inserted': 36959, 'missing-issn': 11947, 'duplicate': 151})
+
+---
+
+Running some quick initial metadata quality checks on OpenAlex Journal list.
+This is from the pre-release, dated in file names as 2021-10-11 (but announced
+in late November 2021).
+
+Looking for ISSN-L dupes:
+
+    cat openalex-journals.txt | cut -f5 | rg '\-' | sort | uniq -d | wc -l
+    # 146
+
+    cat openalex-journals.txt | cut -f5 | rg '\-' | sort | uniq -D | wc -l
+    # 293
+
+Looking for ISSN-L not in a recent dump of ISSN-Ls from issn.org:
+
+    cat openalex-journals.txt | cut -f5 | rg '\-' | sort -u > openalex-issnl.tsv
+    cat ISSN-to-ISSN-L.txt | cut -f2 | rg -v "ISSN" | rg '\-' | sort -u > issnl.tsv
+
+    comm -23 openalex-issnl.tsv issnl.tsv | wc -l
+    # 249
+
+    comm -23 openalex-issnl.tsv issnl.tsv > openalex_unknown_issnl.txt
+
+Looking for duplicate exact homepage URLs:
+
+There are a few reasons that ISSNs might not be in the public list or available
+through https://portal.issn.org (eg, sometimes there are typos which then
+become widely used; or the ISSN is partially registered). But if using the
+ISSN-L as a persistent identifier, should require it to be valid and publicly
+registered.
+
+Look for "normalized name" duplicates:
+
+    cat openalex-journals.txt | cut -f3 | sort | uniq -d | wc -l
+    14
+
+Not many, good.
+
+Look for bogus homepage URLs:
+
+    cat openalex-journals.txt | cut -f1,10 | rg -v '\t$' | rg -v '://'
+
+    JournalId   Webpage
+    2944001180  www.cjb-rcb.ca
+    2764771476  123\
+    2948018973  ores.su/en/journals/chinese-journal-of-ecology/
+    2764943583  197\
+    2946866068  www.kais99.org
+    2764846895  518\
+    2947334459  www.jasnaoe.or.jp/en/
+    2764943300  65\
+    2944560164  www.ijqf.org
+    2765015668  10\
+    2764518604  116\
+    2764649715  430\
+
+HTTP/HTTPS:
+
+    cat openalex-journals.txt | cut -f10 | rg '://' | cut -f1 -d: | sort | uniq -c
+    5483 http
+     873 https
+
+Probably a whole bunch of these could be `https://` instead of `http://`, which
+would improve end-user security/privacy by default.
+
+Top domains:
+
+    cat openalex-journals.txt | cut -f10 | rg '://' | cut -f3 -d/ | sed 's/www\.//g' | sort | uniq -c | sort -nr | head -n20
+    463 journals.elsevier.com
+    412 onlinelibrary.wiley.com
+    304 springer.com
+    286 sciencedirect.com
+    183 sagepub.com
+    183 elsevier.com
+    169 tandfonline.com
+     91 journals.cambridge.org
+     75 worldscinet.com
+     63 informahealthcare.com
+     62 apa.org
+     43 pubs.acs.org
+     43 press.jhu.edu
+     39 wiley.com
+     35 pdcnet.org
+     35 journals.uchicago.edu
+     35 journals.lww.com
+     34 degruyter.com
+     33 uk.sagepub.com
+     31 rsc.org
+
+These look pretty good! Often catalogs have a bunch of URLs that just point to
+aggregators, etc, but these seem like real homepage domains.
+
+Any wayback URLs in there?
+
+    cat openalex-journals.txt | cut -f1,10 | rg archive.org
+    172099791 http://web.archive.org/web/20090803131854/http://www.rejecta.org:80/
+    59114670 http://web.archive.org/web/http://www.multilingualarchive.com/ma/enwiki/es/botaniska_notiser
+
+The first wayback URL seems reasonable (journal is defunct, but homepage was
+captured).
+
+The second wayback URL isn't good (we don't have a capture, and URL structure
+isn't complete) and there seems to be a live-web homepage for the backcatalog:
+
+    https://journals.lub.lu.se/bn/index
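
The notes list "Looking for duplicate exact homepage URLs:" as a check but record no command or count for it. A minimal sketch of how that check might look follows, in the same style as the ISSN-L and normalized-name checks. It assumes column 10 holds the homepage URL (as in the other commands) and uses `grep` in place of `rg` so it runs without ripgrep. The `dup_homepages` helper and the sample rows are hypothetical and made up for illustration; no result for the real `openalex-journals.txt` is implied.

```shell
#!/bin/sh
# Sketch only: list homepage URLs that appear on more than one row.
# Assumption: field 10 of the tab-separated file is the homepage URL.
dup_homepages() {
    # $1: path to a tab-separated journal file
    cut -f10 "$1" | grep '://' | sort | uniq -d
}

# Made-up sample (id in field 1, URL in field 10) so the sketch is runnable
# without the real file; printf reuses the format for each id/URL pair.
sample=$(mktemp)
printf '%s\t\t\t\t\t\t\t\t\t%s\n' \
    1 'http://example.org/' \
    2 'http://example.org/' \
    3 'https://example.net/' \
    4 'www.example.com' > "$sample"

dups=$(dup_homepages "$sample")
dup_count=$(printf '%s\n' "$dups" | grep -c '://')
echo "duplicate homepage URLs: $dup_count"
rm -f "$sample"
```

Against the real file the call would be `dup_homepages openalex-journals.txt | wc -l`. This only catches byte-identical URLs, so `http://` vs `https://` or trailing-slash variants of the same homepage would not be counted as duplicates.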