summaryrefslogtreecommitdiffstats
path: root/extra/extid_map/README.md
blob: d972ae25f8db8cbfd4cd0ccb0118200630f2538c (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109

Process for generating a sqlite3 database file mapping between:

- DOI
- Wikidata QID
- CORE database id
- PubMed Central id
- PubMed id

This fast/indexed local database file, which is about a gigabyte compressed, is
useful for optimizing some fatcat 'release' entity imports and operations.

There is an example database at <https://archive.org/details/ia_journal_pid_map_munge_20180908>

## Data Sources

EuropePMC mapping (more works than the USA PubMedCentral mapping)
- <ftp://ftp.ebi.ac.uk/pub/databases/pmc/DOI/PMID_PMCID_DOI.csv.gz>
- <https://archive.org/details/europepmc-id-map-2018-08-31>

Wikicite data snapshot
- <https://archive.org/details/wikicite-biblio-data-20180903>

CORE dataset
- <https://core.ac.uk/services#api>
- <https://archive.org/download/core_oa_metadata_20180301>

## Wikidata Extract

Early query:

    zcat wikidata-20180806-publications.ndjson.gz.crdownload | rg '"P356"' | jq '{qid: .id, doi: .claims.P356[0]}' -c | pv -l | head

Polished:

    zcat wikidata-20180903-publications.ndjson.gz | rg '"P356"' | jq '[(.claims.P356[0] | ascii_downcase), .id] | @tsv' -cr | pv -l > doi_wikidata.20180903.tsv

    # 13.6M 0:26:13 [8.65k/s]

Hrm, got this but seemed to continue:

    jq: error (at <stdin>:455616): explode input must be a string

## CORE Extract

    xzcat core_json/*.json.xz | jq -rc 'select(.doi != null) | [(.doi | ascii_downcase), .coreId] | @tsv' | pv -l > doi_core.20180301.tsv

    # bnewbold@bnewbold-dev$ wc -l doi_core.20180301.tsv
    # 28210779 doi_core.20180301.tsv

## PMCID/PMID

In a nice CSV format, no extract needed.

    bnewbold@bnewbold-dev$ zcat PMID_PMCID_DOI.csv.gz | rg doi.org | wc -l
    19441168

## sqlite schema

    CREATE TABLE ids (doi text not null, core int, pmid int, pmcid text, wikidata text);
    CREATE UNIQUE INDEX ids_doi on ids (doi);

Run CORE import first (largest mapping, thus fewest exception/update paths),
then pubmed, then wikidata.

## Run import

    cat doi_core.20180301.tsv | ./load_core.py release_ids.db
    => read 28210000, wrote 28210000
    => overnight? but seemed to run fast

    zcat doi_wikidata.20180903.tsv.gz | ./load_wikidata.py release_ids.db
    => uhoh, seems very slow. lots of IOWAIT. switching to host with SSD
    => wow, like 5-10x faster at least. sigh.
    => ran in a few hours

    zcat PMID_PMCID_DOI.csv.gz | ./load_pmc.py release_ids.db
    => read 29692000, wrote 19441136


    sqlite> select count(*) from ids;
    24831337

    # only 2.3 million works have all IDs
    sqlite> select count(*) from ids where wikidata not null and core not null and pmid not null;
    2314700

    # almost (but not all) PubMedCentral items are PubMed items
    sqlite> select count(*) from ids where pmid not null;
    19328761
    sqlite> select count(*) from ids where pmcid not null;
    3739951
    sqlite> select count(*) from ids where pmcid not null and pmid not null;
    3682486

    # wikidata DOIs seem to mostly come from pmid mapping
    sqlite> select count(*) from ids where wikidata not null;
    13608497
    sqlite> select count(*) from ids where pmid not null and wikidata not null;
    13179156

    # core IDs are more independent (?)
    sqlite> select count(*) from ids where core not null;
    7903910
    sqlite> select count(*) from ids where core not null and wikidata not null;
    2372369
    sqlite> select count(*) from ids where core not null and pmid not null;
    2783344