proposals/2020_elasticsearch_schemas.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160


status: implemented

This document tracks "easy" elasticsearch schema and behavior changes that
could be made while being backwards compatible with the current v0.3 schema and
not requiring any API/database schema changes.

## Release Field Additions

Simple additions:

- volume
- issue
- pages
- `first_page` (parsed from pages) (?)
- number
- `doi_prefix`
- `doi_registrar` (based on extra)
- `first_author` (surname; for matching)

"Array" keyword types for reverse lookups:

- referenced releases idents
- contrib creator idents

Add affiliations, both as raw strings and ROR identifiers.


## Preservation Summary Field

To make facet/aggregate queries easier, propose summarizing the preservation
status (from `in_kbart`, `in_ia`, etc) to a `preservation_status` flag which
is:

- `bright`
- `dark`
- `shadows_only`
- `none`

Note that these don't align with OA color or work-level preservation (aka, no
"green"), it is a release-level status.

Filters like "papers only", "published only", "not stub", "single container"
would be overlaid in queries.


## OA Color Summary Field

Might not be ready for this yet, but for both releases and containers may be
able to do a better job of indicating OA status/policy for published works.

Not clear if this should be for "published" only, or whether we should try to
handle embargo time spans and dates.

Maybe also container `sherpa_romeo` color as a field?


## Release Merged Default Field

A current issue with searches is that term queries only match on a single
field, unless alternative fields are explicitly indicated. This breaks obvious
queries like "principa newton" which include both title terms and author terms,
or "coffee death bmj" which include a release title and journal title.

A partial solution to this is to index a field with multiple fields "copied"
into it, and have that be the default for term queries.

Fields to copy in include at least:

- `title`
- `subtitle`
- `original_title`
- `container_name`
- names of all authors (maybe not acronyms?)

May also want to include volume, issue, year, and any container acronyms or
aliases. If we did that, users could paste in citations and there is a better
chance the best match would be the exact cited paper.

This should be a pretty simple change. The biggest downside will be larger (up
to double?) index size.


## Partial Query Parsing

At some point we may want to build a proper query parser (see separate
proposal), but in the short term there is some low-hanging fruit simple token
parsing and re-writing we could do.

- strings like `N/A` which are parse bugs; auto-quote these
- pasting/searching for entire titles which include a word then colon ("Yummy
  Food: A Review"). We can detect that "food" is not a valid facet, and quote
  that single token
- ability to do an empty search (to get total count) (?)

This would require at least a simple escaped quotes tokenizer.


## Basic Filtering

This would be in the user interface, not schema.

At simple google-style filtering in release searches like:

- time span (last year, last 5, last 20, all)
- fulltext availability
- release type; stage; withdrawn
- language
- country/region

For containers:

- is OA
- stub (zero releases)

## Work Grouping

Release searches can be "grouped by" work identifier in the default responses,
to prevent the common situation where there are multiple release which are just
different versions of the same work.

Need to ensure this is performant.

Would need to update query UI/UX to display another line under hits ("also XYZ
other copies {including retraction or update} {having fulltext if this
hit does not}").


## Container Fields

- `issn` (all issns)
- `original_name`

The `release_count` would not be indexed (left null) by default, and would be
"patched" in to entities by a separate script (periodically?).


## Container Copied Fields

Like releases, container entities could have a merged biblio field to use as
default in term queries:

- `name`
- `original_name`
- `aliases` (in extra?)
- `publisher`

Maybe also language and country names?


## Background Reading

"Which Academic Search Systems are Suitable for Systematic Reviews or
Meta-Analyses?  Evaluating Retrieval Qualities of Google Scholar, PubMed and 26
other Resources"

https://musingsaboutlibrarianship.blogspot.com/2019/12/the-rise-of-open-discovery-indexes.html

"Scholarly Search Engine Comparison"
https://docs.google.com/spreadsheets/d/1ZiCUuKNse8dwHRFAyhFsZsl6kG0Fkgaj5gttdwdVZEM/edit#gid=1016151070