author | Bryan Newbold <bnewbold@robocracy.org> | 2019-11-12 18:57:24 -0800 |
---|---|---|
committer | Bryan Newbold <bnewbold@robocracy.org> | 2020-01-03 15:53:41 -0800 |
commit | d5e2af24cb6563eca91407425cea9b808a7d691c | |
tree | a11a9771baea5fdc78275d1a927a540d930e2d1d | |
parent | 6c95f23bfbef73f231ca94309031a130f12f2c32 | |
notes on search query parsing (WIP)
-rw-r--r-- | proposals/20190911_search_query_parsing.md | 22 |
1 files changed, 22 insertions, 0 deletions
diff --git a/proposals/20190911_search_query_parsing.md b/proposals/20190911_search_query_parsing.md
new file mode 100644
index 00000000..1e656fef
--- /dev/null
+++ b/proposals/20190911_search_query_parsing.md
@@ -0,0 +1,22 @@
+
+status: work-in-progress
+
+The default "release" search on fatcat.wiki currently uses the elasticsearch
+built-in `query_string` parser, which is explicitly not recommended for
+public/production use.
+
+The best way forward is likely a custom query parser (eg, PEG-generated parser)
+that generates a complete elasticsearch query JSON structure.
+
+A couple search issues this would help with:
+
+- better parsing of keywords (year, year-range, DOI, ISSN, etc) in complex
+  queries and turning these into keyword term sub-queries
+- queries including terms from multiple fields which aren't explicitly tagged
+  (eg, "lovelace computer" vs. "author:lovelace title:computer")
+- avoiding unsustainably expensive queries (eg, prefix wildcard, regex)
+- handling single-character misspellings and synonyms
+- collapsing multiple releases under the same work in search results
+
+In the near future, we may also create a fulltext search index, which will have
+its own issues.
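
As a rough illustration of the approach the proposal describes, the sketch below turns a raw search string into an elasticsearch query body. It is not the fatcat implementation: it uses hand-written regexes instead of a PEG-generated parser, and the function name `build_query`, the tag mapping, and the index field names (`release_year`, `doi`, `title`, `contrib_names`) are assumptions made for the example. It covers only the first three bullet points (keyword detection, untagged multi-field terms, rejecting expensive patterns). ISSNs are left out because they share the `NNNN-NNNN` shape with year ranges and would need extra disambiguation.

```python
import re

# All patterns, tags, and field names below are illustrative assumptions.
YEAR_RE = re.compile(r"\d{4}")
YEAR_RANGE_RE = re.compile(r"(\d{4})-(\d{4})")
DOI_RE = re.compile(r"10\.\d{4,9}/\S+")

# Hypothetical mapping from user-facing tags to index field names.
TAGGED_FIELDS = {"author": "contrib_names", "title": "title"}
DEFAULT_FIELDS = ["title", "contrib_names"]

# Leading wildcards and regex delimiters are expensive to evaluate.
EXPENSIVE_PREFIXES = ("*", "?", "/")


def build_query(raw):
    """Turn a raw search string into an elasticsearch query body (a dict)."""
    must = []         # scored full-text clauses
    filters = []      # exact keyword and range clauses (not scored)
    loose_terms = []  # untagged words, searched across several fields

    for token in raw.split():
        tag, sep, value = token.partition(":")
        tagged = bool(sep) and tag in TAGGED_FIELDS and value != ""
        word = value if tagged else token

        if word.startswith(EXPENSIVE_PREFIXES):
            raise ValueError("unsupported expensive pattern: " + token)

        range_match = YEAR_RANGE_RE.fullmatch(token)
        if YEAR_RE.fullmatch(token):
            filters.append({"term": {"release_year": int(token)}})
        elif range_match:
            first, last = sorted(int(y) for y in range_match.groups())
            filters.append(
                {"range": {"release_year": {"gte": first, "lte": last}}}
            )
        elif DOI_RE.fullmatch(token):
            # Assumes DOIs are stored lowercased in the index.
            filters.append({"term": {"doi": token.lower()}})
        elif tagged:
            must.append({"match": {TAGGED_FIELDS[tag]: word}})
        else:
            loose_terms.append(token)

    if loose_terms:
        # cross_fields lets each word match in any of the listed fields.
        must.append({
            "multi_match": {
                "query": " ".join(loose_terms),
                "fields": list(DEFAULT_FIELDS),
                "type": "cross_fields",
                "operator": "and",
            }
        })

    return {"query": {"bool": {"must": must, "filter": filters}}}
```

Calling `build_query("lovelace computer 1990-2000")` would yield a single `multi_match` clause over both default fields plus a `range` filter on the year, while `build_query("*ology")` raises `ValueError` instead of sending a prefix wildcard to the cluster. Putting the keyword clauses under `filter` rather than `must` keeps them out of relevance scoring, which seems like the natural fit for exact identifiers.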