proposals/20190510_release_ext_ids.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87


# Release External ID Refactor

Goal is to make the external identifier "namespace" (number of external
identifiers supported) less scarce, while still allowing fast lookups. Adding
new identifiers would still require API schema and rust code changes.

This change would also bring the number of columns on `release_rev` back under
32, which makes diesel happier.

Unclear whether existing extids (like DOI) should get merged and the old
columns dropped. This could be done consistently by, eg, a rust worker process
that re-writes tables, probably in multiple stages (eg, first copy, then update
code to read from new location, then drop old columns and indices). Perhaps for
very hot/popular columns (like DOI, and maybe ISBN13?) it is better to have
separate columns/indices.

It would be possible to roll this out as a SQL change first, with identifiers
still at the top level of the API schema, then later switch the API schema. Not
sure this is worth it though.

## New API

All identifers as text

    release_entity
        ext_ids (required)
            doi
            pmid
            pmcid
            wikidata_qid
            core
            isbn13
            arxiv
            jstor
            [...]

## New SQL Schema

Something like:

    release_rev_extid (
        release_rev_id  UUID foreign key to release_rev.id
        extid_type      TEXT
        extid_value     TEXT
    )
    PRIMARY KEY (release_rev_id, extid_type)
    INDEX (extid_type, extid_value)

### Existing prod Column Use

Queries like:

    fatcat_prod=# SELECT COUNT(*) FROM release_rev WHERE wikidata_qid IS NOT NULL;
    13460723

Results:

    wikidata_qid: 13460723
    isbn13: 1
    core_id: 7160477
    arxiv_id: 3
    jstor_id: 0
    pmcid: 3688945

Keep in SQL:

- `doi`
- `pmid`
- `pmcid`
- `wikidata_qid`
- `core_id`

Drop columns:

- `isbn13`
- `arxiv_id`
- `jstor_id`

In new table:

- isbn13
- arxiv
- jstor
- mag
- ark
- dblp