1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
|
Process for generating a sqlite3 database file mapping between:
- DOI
- Wikidata QID
- CORE database id
- PubMed Central id
- PubMed id
This fast/indexed local database file, which is about a gigabyte compressed, is
useful for optimizing some fatcat 'release' entity imports and operations.
There is an example database at <https://archive.org/details/ia_journal_pid_map_munge_20180908>
## Data Sources
EuropePMC mapping (more works than the USA PubMedCentral mapping)
- <ftp://ftp.ebi.ac.uk/pub/databases/pmc/DOI/PMID_PMCID_DOI.csv.gz>
- <https://archive.org/details/europepmc-id-map-2018-08-31>
Wikicite data snapshot
- <https://archive.org/details/wikicite-biblio-data-20180903>
CORE dataset
- <https://core.ac.uk/services#api>
- <https://archive.org/download/core_oa_metadata_20180301>
## Wikidata Extract
Early query:
zcat wikidata-20180806-publications.ndjson.gz.crdownload | rg '"P356"' | jq '{qid: .id, doi: .claims.P356[0]}' -c | pv -l | head
Polished:
zcat wikidata-20180903-publications.ndjson.gz | rg '"P356"' | jq '[(.claims.P356[0] | ascii_downcase), .id] | @tsv' -cr | pv -l > doi_wikidata.20180903.tsv
# 13.6M 0:26:13 [8.65k/s]
Hrm, got this but seemed to continue:
jq: error (at <stdin>:455616): explode input must be a string
## CORE Extract
xzcat core_json/*.json.xz | jq -rc 'select(.doi != null) | [(.doi | ascii_downcase), .coreId] | @tsv' | pv -l > doi_core.20180301.tsv
# bnewbold@bnewbold-dev$ wc -l doi_core.20180301.tsv
# 28210779 doi_core.20180301.tsv
## PMCID/PMID
In a nice CSV format, no extract needed.
bnewbold@bnewbold-dev$ zcat PMID_PMCID_DOI.csv.gz | rg doi.org | wc -l
19441168
## sqlite schema
CREATE TABLE ids (doi text not null, core int, pmid int, pmcid text, wikidata text);
CREATE UNIQUE INDEX ids_doi on ids (doi);
Run CORE import first (largest mapping, thus fewest exception/update paths),
then pubmed, then wikidata.
## Run import
cat doi_core.20180301.tsv | ./load_core.py release_ids.db
=> read 28210000, wrote 28210000
=> overnight? but seemed to run fast
zcat doi_wikidata.20180903.tsv.gz | ./load_wikidata.py release_ids.db
=> uhoh, seems very slow. lots of IOWAIT. switching to host with SSD
=> wow, like 5-10x faster at least. sigh.
=> ran in a few hours
zcat PMID_PMCID_DOI.csv.gz | ./load_pmc.py release_ids.db
=> read 29692000, wrote 19441136
sqlite> select count(*) from ids;
24831337
# only 2.3 million works have all IDs
sqlite> select count(*) from ids where wikidata not null and core not null and pmid not null;
2314700
# almost (but not all) PubMedCentral items are PubMed items
sqlite> select count(*) from ids where pmid not null;
19328761
sqlite> select count(*) from ids where pmcid not null;
3739951
sqlite> select count(*) from ids where pmcid not null and pmid not null;
3682486
# wikidata DOIs seem to mostly come from pmid mapping
sqlite> select count(*) from ids where wikidata not null;
13608497
sqlite> select count(*) from ids where pmid not null and wikidata not null;
13179156
# core IDs are more independent (?)
sqlite> select count(*) from ids where core not null;
7903910
sqlite> select count(*) from ids where core not null and wikidata not null;
2372369
sqlite> select count(*) from ids where core not null and pmid not null;
2783344
|