path: root/proposals/2020-06-04_work_schema.md

## Top-Level

- type: `_doc` (aka, no type, `include_type_name=false`)
- `key`: keyword (same as `_id`)
- `collapse_key`: work ident, or SIM issue item (for collapsing/grouping search hits)
- `doc_type`: keyword (work or page)
- `doc_index_ts`: timestamp when document indexed
- `work_ident`: fatcat work ident (optional)
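The `collapse_key` rule above ("work ident, or SIM issue item") could be sketched roughly as follows. This is only an illustration: the function name, the precedence (work ident first), and the `work_`/`sim_` prefixes are assumptions and are not specified in this proposal.

```python
from typing import Optional


def collapse_key(work_ident: Optional[str], sim_issue_item: Optional[str]) -> str:
    """Pick the key used to group search hits.

    Assumption: a fatcat work ident wins when present, and the SIM issue
    item is the fallback for documents with no matched work. The prefixes
    are hypothetical; they only keep the two namespaces from colliding.
    """
    if work_ident:
        return f"work_{work_ident}"
    if sim_issue_item:
        return f"sim_{sim_issue_item}"
    raise ValueError("need either a work ident or a SIM issue item")
```

With a key like this, every document belonging to the same work (or the same microfilm issue) carries an identical keyword value, which is what a collapse/grouping query needs.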

- `biblio`: obj
- `fulltext`: obj
- `ia_sim`: obj
- `abstracts`: nested
    - `body`
    - `lang`
- `releases`: nested (TBD)
- `access`: obj (maybe nested later)
- `tags`: array of keywords
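As a rough illustration, the top-level fields listed above might translate into a typeless Elasticsearch mapping body like the one below. The field names come from this proposal; the concrete types for `doc_index_ts` (`date`) and `abstracts.body` (`text`) are assumptions, and the sub-fields of the object types are left out.

```python
import json

# Sketch of a typeless mapping body (the shape used with include_type_name=false).
# Types marked "assumed" are not stated in the proposal.
WORK_MAPPING = {
    "properties": {
        "key": {"type": "keyword"},
        "collapse_key": {"type": "keyword"},
        "doc_type": {"type": "keyword"},
        "doc_index_ts": {"type": "date"},  # assumed
        "work_ident": {"type": "keyword"},
        "biblio": {"type": "object"},
        "fulltext": {"type": "object"},
        "ia_sim": {"type": "object"},
        "abstracts": {
            "type": "nested",
            "properties": {
                "body": {"type": "text"},  # assumed
                "lang": {"type": "keyword"},  # assumed
            },
        },
        "releases": {"type": "nested"},  # TBD in the proposal
        "access": {"type": "object"},
        # Elasticsearch has no array type; a keyword field accepts a list.
        "tags": {"type": "keyword"},
    }
}

if __name__ == "__main__":
    print(json.dumps({"mappings": WORK_MAPPING}, indent=2))
```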

TODO:
- summary fields to index "everything" into?

## Biblio

Mostly matches existing `fatcat_release` schema.

- `release_id`
- `release_revision`
- `title`
- `subtitle`
- `original_title`
- `release_date`
- `release_year`
- `withdrawn_status`
- `language`
- `country_code`
- `volume` (etc)
- `volume_int` (etc)
- `first_page`
- `first_page_int`
- `pages`
- `doi` (etc)
- `number` (etc)

NEW:
- `preservation_status`

[etc]

- `license_slug`
- `publisher` (etc)
- `container_name` (etc)
- `container_id`
- `container_issnl`
- `container_wikidata_qid`
- `issns` (array)
- `contrib_names`
- `affiliations`
- `creator_ids`
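The `volume_int` and `first_page_int` fields listed earlier in this section presumably hold numeric versions of the corresponding string fields, for sorting and range queries. The proposal does not say how they are derived; one possible sketch, assuming only a leading run of digits counts, is below. The helper name `to_int_field` is hypothetical.

```python
import re
from typing import Optional

# Leading run of ASCII digits, allowing surrounding whitespace before it.
_LEADING_INT = re.compile(r"^\s*([0-9]+)")


def to_int_field(value: Optional[str]) -> Optional[int]:
    """Return the leading integer of a bibliographic string, or None.

    Assumption: values such as "e1234" or "iv" have no usable integer form
    and are simply left out of the *_int field.
    """
    if not value:
        return None
    match = _LEADING_INT.match(value)
    return int(match.group(1)) if match else None
```

For example, `to_int_field("123-130")` gives `123`, while `to_int_field("iv")` gives `None`, so the original string field remains the only record of non-numeric values.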

TODO: should all external identifiers go under `releases` instead of `biblio`? Or some duplicated?

## Fulltext

- `status`: web, sim, shadow
- `body`
- `lang`
- `file_mimetype`
- `file_sha1`
- `file_id`
- `thumbnail_url`

## Abstracts

Nested object with:

- `body`
- `lang`

For prototyping, perhaps just make it an object with `body` as an array.

Only index one abstract per language.
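The one-abstract-per-language rule could be applied at transform time with something like the sketch below. It assumes abstracts arrive as dicts with `body` and `lang` keys, that the first abstract seen for a language is the one to keep, and that abstracts with no `lang` are treated as a single group; none of these choices is fixed by this proposal.

```python
from typing import Dict, List, Optional, Set


def one_abstract_per_lang(abstracts: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only the first abstract for each language, preserving order."""
    seen: Set[Optional[str]] = set()
    kept: List[Dict[str, str]] = []
    for abstract in abstracts:
        lang = abstract.get("lang")  # None when the language is unknown
        if lang in seen:
            continue
        seen.add(lang)
        kept.append(abstract)
    return kept
```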

## SIM (Microfilm)

Enough detail to construct a link or do a lookup. Note that we might be doing
CDL status lookups on SERP pages.

- `issue_item`: str
- `pub_collection`: str
- `sim_pubid`: str
- `first_page`: str


Also pass through archive.org metadata here (collection-level and item-level).

## Access

Start with obj, but maybe later nested?

- `status`: direct, cdl, repository, publisher, loginwall, paywall, etc
- `mimetype`
- `access_url`
- `file_url`
- `file_id`
- `release_id`