proposals/20190911_search_query_parsing.md


Status: brainstorm

## Search Query Parsing

The default "release" search on fatcat.wiki currently uses the elasticsearch
built-in `query_string` parser, which is explicitly not recommended for
public/production use.

The best way forward is likely a custom query parser (eg, PEG-generated parser)
that generates a complete elasticsearch query JSON structure.

A few of the search issues this would help with:

- better parsing of keywords (year, year-range, DOI, ISSN, etc) in complex
  queries and turning these into keyword term sub-queries
- queries including terms from multiple fields which aren't explicitly tagged
  (eg, "lovelace computer" vs. "author:lovelace title:computer")
- avoiding unsustainably expensive queries (eg, prefix wildcard, regex)
- handling single-character misspellings and synonyms
- collapsing multiple releases under the same work in search results
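
As a rough illustration of several of the points above, here is a minimal
sketch in Python. It uses a hand-rolled whitespace tokenizer as a stand-in for
a PEG-generated parser, and all index field names (`contrib_names`, `title`,
`release_year`, `doi`, `issn`, `work_id`) are assumptions for illustration,
not necessarily what the real index mapping uses. Misspellings and synonyms
are not covered here.

```python
import re

# Hypothetical mapping from user-facing tags to index fields (assumed names).
FIELD_MAP = {"author": "contrib_names", "title": "title"}

TAG_RE = re.compile(r"^(author|title):(.+)$")
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")
ISSN_RE = re.compile(r"^\d{4}-\d{3}[\dxX]$")
YEAR_RE = re.compile(r"^\d{4}$")
YEAR_RANGE_RE = re.compile(r"^(\d{4})-(\d{4})$")


class QueryError(ValueError):
    """Raised for queries we refuse to run (eg, too expensive)."""


def plausible_year(value: int) -> bool:
    return 1500 <= value <= 2099


def parse_query(raw: str) -> dict:
    """Turn a raw user query into an elasticsearch request body (a dict)."""
    must, filters, free_text = [], [], []
    for token in raw.split():
        tagged = TAG_RE.match(token)
        value = tagged.group(2) if tagged else token
        # Refuse prefix wildcards and regex-style terms up front.
        if value.startswith(("*", "/")):
            raise QueryError(f"unsupported expensive term: {token!r}")
        year_range = YEAR_RANGE_RE.match(token)
        if tagged:
            must.append({"match": {FIELD_MAP[tagged.group(1)]: value}})
        elif DOI_RE.match(token):
            filters.append({"term": {"doi": token.lower()}})
        elif (
            year_range
            and plausible_year(int(year_range.group(1)))
            and plausible_year(int(year_range.group(2)))
            and int(year_range.group(1)) <= int(year_range.group(2))
        ):
            # NNNN-NNNN is ambiguous with ISSN; this sketch prefers the
            # year-range reading whenever both halves look like years.
            start, end = (int(g) for g in year_range.groups())
            filters.append({"range": {"release_year": {"gte": start, "lte": end}}})
        elif ISSN_RE.match(token):
            filters.append({"term": {"issn": token.upper()}})
        elif YEAR_RE.match(token) and plausible_year(int(token)):
            filters.append({"term": {"release_year": int(token)}})
        else:
            free_text.append(token)
    if free_text:
        # Untagged terms may come from different fields ("lovelace computer").
        must.append({
            "multi_match": {
                "query": " ".join(free_text),
                "fields": list(FIELD_MAP.values()),
                "type": "cross_fields",
                "operator": "and",
            }
        })
    return {
        "query": {"bool": {"must": must, "filter": filters}},
        # Field collapsing: one hit per work (assumes a keyword `work_id` field).
        "collapse": {"field": "work_id"},
    }
```

For example, `parse_query("author:lovelace computer 1990-2000")` would produce
a `bool` query with a `match` clause on the author field, a `cross_fields`
`multi_match` for "computer", and a `range` filter on the year, while
`parse_query("*ovelace")` raises `QueryError`. A real grammar would also need
to handle quoted phrases and an explicit `issn:` tag to resolve the
year-range/ISSN ambiguity.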

In the near future, we may also create a fulltext search index, which will have
its own issues.

## Tech Changes

If we haven't already, we should also switch to using an elasticsearch client
library.