notes on search query parsing (WIP)

author: Bryan Newbold <bnewbold@robocracy.org> 2019-11-12 18:57:24 -0800
committer: Bryan Newbold <bnewbold@robocracy.org> 2020-01-03 15:53:41 -0800
commit: d5e2af24cb6563eca91407425cea9b808a7d691c (patch)
tree: a11a9771baea5fdc78275d1a927a540d930e2d1d /proposals/20190911_search_query_parsing.md
parent: 6c95f23bfbef73f231ca94309031a130f12f2c32 (diff)
download: fatcat-d5e2af24cb6563eca91407425cea9b808a7d691c.tar.gz
fatcat-d5e2af24cb6563eca91407425cea9b808a7d691c.zip
1 files changed, 22 insertions, 0 deletions
diff --git a/proposals/20190911_search_query_parsing.md b/proposals/20190911_search_query_parsing.md
new file mode 100644
index 00000000..1e656fef
--- /dev/null
+++ b/proposals/20190911_search_query_parsing.md
@@ -0,0 +1,22 @@
+
+status: work-in-progress
+
+The default "release" search on fatcat.wiki currently uses the elasticsearch
+built-in `query_string` parser, which is explicitly not recommended for
+public/production use.
+
+The best way forward is likely a custom query parser (eg, PEG-generated parser)
+that generates a complete elasticsearch query JSON structure.
+
+A few search issues this would help with:
+
+- better parsing of keywords (year, year-range, DOI, ISSN, etc) in complex
+  queries and turning these into keyword term sub-queries
+- queries including terms from multiple fields which aren't explicitly tagged
+  (eg, "lovelace computer" vs. "author:lovelace title:computer")
+- avoiding unsustainably expensive queries (eg, prefix wildcard, regex)
+- handling single-character misspellings and synonyms
+- collapsing multiple releases under the same work in search results
+
+In the near future, we may also create a fulltext search index, which will have
+its own issues.
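
As a rough illustration of the keyword-parsing idea in the proposal, the sketch below splits a raw query into tokens, turns recognizable identifiers (DOI, ISSN, year, year range) into exact-match filter clauses, and sends the remaining words to a `multi_match` clause of type `cross_fields`, so that untagged terms such as "lovelace computer" may match across different fields. It is not a PEG grammar and not fatcat's implementation: the function name `build_query` and the index field names (`doi`, `issn`, `release_year`, `title`, `contrib_names`) are assumptions made for this example, and the regular expressions are deliberately simplistic.

```python
import re

# Simplified patterns; real DOI/ISSN validation is stricter than this.
DOI_RE = re.compile(r"10\.[0-9]{4,9}/\S+")
ISSN_RE = re.compile(r"[0-9]{4}-[0-9]{3}[0-9Xx]")
YEAR_RE = re.compile(r"[0-9]{4}")
YEAR_RANGE_RE = re.compile(r"([0-9]{4})-([0-9]{4})")


def is_year(text):
    """Return True for a four-digit string in an assumed plausible range."""
    return bool(YEAR_RE.fullmatch(text)) and 1500 <= int(text) <= 2099


def build_query(raw):
    """Build an elasticsearch query body (as a dict) from a raw query string.

    Field names are hypothetical and would need to match the real index.
    """
    filters = []
    free_text = []
    for token in raw.split():
        year_range = YEAR_RANGE_RE.fullmatch(token)
        if DOI_RE.fullmatch(token):
            filters.append({"term": {"doi": token.lower()}})
        elif (
            year_range
            and is_year(year_range.group(1))
            and is_year(year_range.group(2))
            and int(year_range.group(1)) <= int(year_range.group(2))
        ):
            # "1990-2000" also looks like an ISSN; this sketch prefers the
            # year-range reading whenever both halves are plausible years.
            filters.append({"range": {"release_year": {
                "gte": int(year_range.group(1)),
                "lte": int(year_range.group(2)),
            }}})
        elif ISSN_RE.fullmatch(token):
            filters.append({"term": {"issn": token.upper()}})
        elif is_year(token):
            filters.append({"term": {"release_year": int(token)}})
        else:
            free_text.append(token)

    bool_query = {"filter": filters}
    if free_text:
        # cross_fields lets each word match in any of the listed fields.
        bool_query["must"] = [{"multi_match": {
            "query": " ".join(free_text),
            "type": "cross_fields",
            "fields": ["title", "contrib_names"],
            "operator": "and",
        }}]
    return {"query": {"bool": bool_query}}
```

Because the parser only ever emits `term`, `range`, and `multi_match` clauses, user input cannot reach the wildcard or regex syntax that `query_string` accepts, which is one way to keep expensive queries out. Misspelling tolerance and collapsing releases under a work are not covered by this sketch.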