proposals/20190509_schema_tweaks.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142


# SQL (and API) schema changes

Intend to make these changes at the same time as bumping OpenAPI schema from
0.2 to 0.3, along with `20190510_editgroup_endpoint_prefix` and
`20190510_release_ext_ids`.

Also adding some indices to speed up entity edit history views, but those are
just a performance change, not visible in API schema.

### Structured Contrib Names

`creator` entities already have "structured" names: in addition to
`display_name`, there are `given_name` and `surname` fields. This change is to
add these two fields to release contribs as well (to join `raw_name`).

The two main motivations are:

1. make various representations (eg, citation formats) of release entities
   easier. CSL and many display formats require given/surname distinctions
2. improve algorithmic matching between release entities, raw metadata (eg,
   from GROBID), and citation strings. Eg, biblio-glutton wants "first author
   surname"; we can't provide this from existing `raw_name` field

The status quo is that many large metadata sources often include structured
names, and we munge them into a single name.

Some arguments against this change are:

1. should be "normalizing" this structure into creator entities. However,
   display/representation of a contributor might change between publications
2. structure isn't always deterministic from what is visible in published
   documents. AKA, raw name is unambiguous (it's what is "printed" on the
   document), but given/sur decomposition can be ambiguous (for individauls, or
   entire locales/cultures)
3. could just stash in contrib `extra_json`. However, seems common enough to
   include as full fields

Questions/Decisions:

- should contrib `raw_name` be changed to `display_name` for consistency with
  `creator`? `raw_name` should probably always be what is in/on the document
  itself, thus no.
- should we still munge a `raw_name` at insert time (we we only have structured
  names), or push this on to client code to always create something for
  display?

### Rename `release_status` to `release_stage`

Describes the field better. I think this is uncontroversial and not too
disruptive at this point.

### New release fields: subtitle, number, version

`subtitle`: mostly for books. could have a flat-out style guide policy against
use for articles? Already frequently add subtitle metadata as an `extra_json`
field.

`number`: intended to represent, eg, a report number ("RFC ..."). Not to be
confused with `container-number`, `chapter`, `edition`

`version`: intended to be a short string ("v3", "2", "third", "3.9") to
disambiguate which among multiple versions. CSL has a separate `edition` field.

These are somewhat hard to justify as dedicated fields vs. `extra_json`.

`subtitle` is a pretty core field for book metadata, but raises ambiguity for
other release types.

Excited to include many reports and memos (as grey lit), for which the number
is a pretty major field, and we probably want to include in elasticsearch but
not as part of the title field, and someday perhaps an index on `number`, so
that's easier to justify.

TODO:

- `version` maybe should be dropped. arXiv is one possible justification, as is
  sorting by this field in display.

### Withdrawn fields

As part of a plan to represent retractions and other "unpublishing", decided to
track when and whether a release has been "withdrawn", distinct from the
`release_stage`.

To motivate this, consider a work that has been retracted. There are multiple
releases of different stages; should not set the `release_stage` for all to
`withdrawn` or `retracted`, because then hard to disambiguate between the
release entities. Also maybe the pre-print hasn't been formally withdrawn and
is still in the pre-print server, or maybe only the pre-print was withdrawn
(for being partial/incorrect?) while the final version is still "active".

As with `release_date`, just `withdrawn_date` is insufficient, so we get
`withdrawn_year` also...  and `withdrawn_month` in the future? Also
`withdrawn_state` for cases where we don't know even the year. This could
probably be a bool (`is_withdrawn` or `withdrawn`), but the flexibility of a
TEXT/ENUM has been nice.

TODO:

- boolean (`is_withdrawn`, default False) or text (`withdrawn_status`). Let's
  keep text to allow evolution in the future; if the field is defined at all
  it's "withdrawn" (true), if not it isn't

### New release extids: `mag_id`, `ark_id`

See also: `20190510_release_ext_ids`.

- `mag_id`: Microsoft Academic Graph identifier.
- `ark_id`: ARK identifier.

These will likely be the last identifiers added as fields on `release`; a
future two-stage refactor will be to move these out to a child table (something
like `extid_type`, `extid_value`, with a UNIQ index for lookups).

Perhaps the `extid` table should be implemented now, starting with these
identifiers?

### Web Capture CDX `size_bytes`

Pretty straight-forward. 

Considered adding `extra_json` as well, to be consistent with other tables, but
feels too heavy for the CDX case. Can add later if there is an actual need;
adding fields easier than removing (for backwards compat).

### Object/Class Name Changes

TODO

### Rust/Python Library Name Changes

Do these as separate commits, after merging back in to master, for v0.3:

- rust `fatcat-api-spec` => `fatcat-openapi`
- python `fatcat_client` => `fatcat_openapi_client`

### More?

`release_month`: apprently pretty common to know the year and month but not
date. I have avoided so far, seems like unnecessary complexity. Could start
as an `extra_json` field?