Commit message  Author  Age  Files  Lines
* scrub_text: single-token strings skippedBryan Newbold2020-08-062-1/+5
|
* strip ACKNOWLEDGEMENTS prefixBryan Newbold2020-08-061-0/+1
|
* fix acknowledgement highlighting (typo)Bryan Newbold2020-08-061-1/+1
|
* more notes on scalingBryan Newbold2020-08-061-0/+363
|
* reduce title boost; use only base query for highlightingBryan Newbold2020-08-061-1/+2
|
* special case '*' queriesBryan Newbold2020-08-061-6/+16
| | | | | More/better query parsing in the client could detect if this was a "filter only" query and do the same kind of optimization.
* remove 'title' from poor metadata scoringBryan Newbold2020-08-061-1/+0
|
* better time ranges (don't search future)Bryan Newbold2020-08-061-4/+7
|
* add title back to match queryBryan Newbold2020-08-061-0/+1
|
* enable index_phrases on everything, biblio_all, title_allBryan Newbold2020-08-061-5/+3
| | | | | Want phrase queries to be faster. Expect this to increase term index size, requiring more disk space.
* ES schema: do not index fulltext.body or fulltext.annex separately from 'everything'Bryan Newbold2020-08-061-3/+2
| | | | | | | | The goal here is to reduce term index size. This means that querying/matching only on these fields (distinct from "everything") will not work.
* ES schema: use smaller integer size (short) for most numbersBryan Newbold2020-08-061-5/+5
|
* ES schema: copy_to titles into single title_all fieldBryan Newbold2020-08-061-4/+4
|
* query fewer fields; highlight all fulltext fields regardless of matchBryan Newbold2020-08-061-3/+1
|
* fix typo in SERP page macroBryan Newbold2020-08-061-1/+1
|
* search tweaks to be forwards-compatible with ES 7.xBryan Newbold2020-08-061-2/+10
| | | | | | When we fully commit to ES 7.x we should upgrade the client library correspondingly, and then can remove these work-arounds. But for now we have one instance of ES 6.x and one ES 7.x.
* extend ES client timeout to 25 secondsBryan Newbold2020-08-061-1/+1
|
* fix display of papers missing fulltextBryan Newbold2020-08-061-1/+1
| | | | | | I think the bug happened now that we do not serialize the pydantic structures with empty values. A better solution might be to deserialize search hits into pydantic objects before rendering.
* Revert "remove duplicate fulltext search from query"Bryan Newbold2020-07-301-0/+1
| | | | | | This reverts commit 0d3fd83493c7307a2b9593c7add90b8b6f4b4152. Seems like we do need to query on this field for highlighting to work.
* transform: catch more cases of null extraBryan Newbold2020-07-301-10/+10
| | | | Also correctly pull issne/issnp from container.extra, not release.extra.
* include container_ident in metadata completeness boostBryan Newbold2020-07-281-0/+1
|
* search: smaller default result setBryan Newbold2020-07-272-1/+4
|
* pipeline: skip grobid/pdftext lookups when no URL; prefer GROBID to pdftextBryan Newbold2020-07-271-1/+3
|
* scaling notes (ES)Bryan Newbold2020-07-271-1/+71
|
* remove duplicate fulltext search from queryBryan Newbold2020-07-271-1/+0
| | | | | | may also remove the 'title' and 'abstracts' searches, though they currently help with boosting, and will want to measure actual performance difference before that change
* json: exclude None in output, and sort keysBryan Newbold2020-07-273-4/+4
| | | | | | | | | | These are both size/performance enhancements. Not including 'None' values will reduce document sizes on-disk and over network, particularly for intermediate objects. Sorting by key should improve compression ratios across multiple documents, both on-disk (gzip) and in elasticsearch itself: https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-disk-usage.html#_put_fields_in_the_same_order_in_documents
* search: tweak 'past week' date range to not include futureBryan Newbold2020-07-271-2/+4
|
* schema: 12 shards, 0 replicas, more compressionBryan Newbold2020-07-271-0/+3
|
* abstracts: more prefixes to ignoreBryan Newbold2020-07-271-0/+3
|
* more careful watermark removalBryan Newbold2020-07-222-0/+0
|
* hide overflow link domain text (for mobile SERPs)Bryan Newbold2020-07-211-1/+1
|
* gaudy placeholder vaporwave logoBryan Newbold2020-07-214-12/+11
|
* differentiate SERP card size from other card divsBryan Newbold2020-07-212-2/+2
|
* include fulltext acknowledgements in highlightingBryan Newbold2020-07-211-0/+1
|
* ensure SIM release date parses before assigningBryan Newbold2020-07-211-1/+6
|
* strip <em> tags explicitlyBryan Newbold2020-07-211-0/+1
|
* display Szczepanski as an OA quality labelBryan Newbold2020-07-211-1/+1
|
* load issue rows: handle empty metadataBryan Newbold2020-07-211-0/+2
|
* scale-up notesBryan Newbold2020-07-211-0/+26
|
* TODO itemsBryan Newbold2020-07-211-0/+4
|
* more notes on SIM/fatcat intersectionsBryan Newbold2020-07-211-1/+77
|
* schema: access as object (list), not nestedBryan Newbold2020-07-211-1/+1
| | | | | | Nested allows more precise filter queries, but it seems that simple "dot notation" filters/queries don't work. We don't have anything doing the sophisticated queries yet, so keep it simple.
* update README instructions for issue_db generationBryan Newbold2020-07-011-2/+3
|
* skip partial/stub issue itemsBryan Newbold2020-07-011-0/+2
|
* tweak CSS of last commit so it worksBryan Newbold2020-06-291-1/+1
|
* at full screen width, show full thumbnailsBryan Newbold2020-06-291-0/+3
|
* fix search filter bug (papers is default)Bryan Newbold2020-06-291-2/+2
|
* update COVID-19 ingest for refactorsBryan Newbold2020-06-291-2/+2
|
* handle large/bad 'first_page' metadataBryan Newbold2020-06-291-0/+3
| | | | This was causing elasticsearch indexing errors
* update plan docBryan Newbold2020-06-291-67/+2
|
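The "json: exclude None in output, and sort keys" commit above describes two size optimizations: dropping `None` values and emitting keys in a stable order. A minimal sketch of that idea using only the Python standard library follows. The helper names `drop_none` and `to_compact_json` and the sample document are hypothetical, and the commit itself may well rely on the serialization options of its model library rather than code like this.

```python
import json
from typing import Any


def drop_none(value: Any) -> Any:
    """Recursively remove dict entries whose value is None.

    None items inside lists are left in place, so list positions do not shift.
    """
    if isinstance(value, dict):
        return {k: drop_none(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [drop_none(v) for v in value]
    return value


def to_compact_json(doc: dict) -> str:
    # sort_keys=True gives every document the same field order, which the
    # commit message expects to help compression across many documents.
    return json.dumps(drop_none(doc), sort_keys=True)


# Hypothetical sample document, not taken from the repository.
doc = {"title": "Example", "abstract": None, "biblio": {"volume": None, "issue": "3"}}
print(to_compact_json(doc))
```

Whether this actually shrinks the index or the gzip output depends on the data, so it is worth measuring on real documents.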