Commit message | Author | Age | Files | Lines | |
---|---|---|---|---|---|
* | truncate indexed fulltext body at 1 MByte | Bryan Newbold | 2021-02-15 | 1 | -2/+13 |
| | | | | | | A large (~4 MByte) document (work_lumgqw4vqbgvha2ejbsbaepedq) with several megabytes of text was being indexed, and this was causing Elasticsearch indexing timeouts. | ||||
* | catch TEI-XML parsing exception | Bryan Newbold | 2021-01-30 | 1 | -12/+17 |
| | |||||
* | enable sentry exceptions for workers and pipelines | Bryan Newbold | 2021-01-30 | 1 | -1/+12 |
| | | | | It is otherwise difficult to debug multi-million record pipelines. | ||||
* | bugfix: try resolving lang_code list issue again | Bryan Newbold | 2021-01-30 | 1 | -5/+4 |
| | |||||
* | bugfix: lang_code sometimes a list | Bryan Newbold | 2021-01-29 | 1 | -2/+7 |
| | |||||
* | make fmt | Bryan Newbold | 2021-01-25 | 1 | -1/+4 |
| | |||||
* | basic support for excluding web content from index | Bryan Newbold | 2021-01-22 | 1 | -6/+45 |
| | | | | Based on particular patterns in metadata, or exclusion lists in settings | ||||
* | bug fix: more html_fulltext not getting processed | Bryan Newbold | 2021-01-22 | 1 | -0/+2 |
| | |||||
* | add container_sherpa_color field, and populate it | Bryan Newbold | 2021-01-22 | 1 | -0/+1 |
| | |||||
* | improve 'oa' tag calculation | Bryan Newbold | 2021-01-16 | 1 | -4/+4 |
| | |||||
* | small corrections to schema/transform | Bryan Newbold | 2021-01-16 | 1 | -2/+4 |
| | |||||
* | add support for new identifiers and size_bytes schema fields | Bryan Newbold | 2021-01-14 | 1 | -0/+3 |
| | |||||
* | basic HTML transform/index support | Bryan Newbold | 2020-11-18 | 1 | -2/+46 |
| | |||||
* | refs: extract fatcat crossref pages metadata | Bryan Newbold | 2020-11-13 | 1 | -1/+1 |
| | |||||
* | commands: show usage on empty command | Bryan Newbold | 2020-11-02 | 1 | -1/+1 |
| | |||||
* | more SIM metadata mappings | Bryan Newbold | 2020-10-19 | 1 | -3/+31 |
| | |||||
* | SIM pipeline: more language conversions | Bryan Newbold | 2020-10-16 | 1 | -2/+5 |
| | | | | | Not sure where these language strings are coming from, but these were from existing SIM item metadata in archive.org | ||||
* | transform: refactor tag generation out of transform heavy method | Bryan Newbold | 2020-10-16 | 1 | -28/+37 |
| | |||||
* | Upgrade Dynaconf to 3+ | Bruno Rocha | 2020-10-05 | 1 | -1/+1 |
| | | | | | | In Dynaconf 3+, using `from dynaconf import settings` is no longer recommended; the recommendation now is to create your own instance of the settings object based on the `Dynaconf` class. | ||||
* | refs and grobid2json bugfixes from testing | Bryan Newbold | 2020-09-14 | 1 | -3/+10 |
| | |||||
* | bugfix: release_year | Bryan Newbold | 2020-09-13 | 1 | -2/+2 |
| | |||||
* | refs transform: both GROBID and fatcat refs | Bryan Newbold | 2020-09-13 | 1 | -5/+12 |
| | |||||
* | ref transform: support more GROBID fields | Bryan Newbold | 2020-09-13 | 1 | -10/+16 |
| | |||||
* | fixes to refs transform (for non-str author fields) | Bryan Newbold | 2020-09-04 | 1 | -2/+6 |
| | |||||
* | heavy to refs command | Bryan Newbold | 2020-09-04 | 1 | -2/+142 |
| | |||||
* | use simple names, not domain names, for some platforms | Bryan Newbold | 2020-08-12 | 1 | -3/+3 |
| | |||||
* | biblio metadata hacks at transform time | Bryan Newbold | 2020-08-12 | 1 | -2/+98 |
| | |||||
* | don't index sim_page without issue_item and first_page | Bryan Newbold | 2020-08-06 | 1 | -0/+3 |
| | |||||
* | handle integer conversion and bounding for ES schema | Bryan Newbold | 2020-08-06 | 1 | -10/+13 |
| | |||||
* | json: exclude None in output, and sort keys | Bryan Newbold | 2020-07-27 | 1 | -1/+1 |
| | | | | | | | | | | These are both size/performance enhancements. Not including 'None' values will reduce document sizes on-disk and over network, particularly for intermediate objects. Sorting by key should improve compression ratios across multiple documents, both on-disk (gzip) and in elasticsearch itself: https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-disk-usage.html#_put_fields_in_the_same_order_in_documents | ||||
* | ensure SIM release date parses before assigning | Bryan Newbold | 2020-07-21 | 1 | -1/+6 |
| | |||||
* | make fmt | Bryan Newbold | 2020-06-29 | 1 | -8/+13 |
| | |||||
* | include GROBID-extracted abstracts in search documents | Bryan Newbold | 2020-06-29 | 1 | -10/+15 |
| | |||||
* | small improvements to SIM metadata maps | Bryan Newbold | 2020-06-29 | 1 | -6/+11 |
| | |||||
* | fixes for pdf_meta dict | Bryan Newbold | 2020-06-29 | 1 | -1/+2 |
| | |||||
* | remove old COVID19 thumbnail hack | Bryan Newbold | 2020-06-29 | 1 | -1/+2 |
| | |||||
* | fetch pdftotext and pdf_meta from blobs, postgrest | Bryan Newbold | 2020-06-29 | 1 | -21/+13 |
| | | | | | This replaces the temporary COVID-19 content hack with production content (text, thumbnail URLs) stored in postgrest and seaweedfs. | ||||
* | collapse pages by SIM issue | Bryan Newbold | 2020-06-04 | 1 | -0/+3 |
| | |||||
* | flake8-annotation linting | Bryan Newbold | 2020-06-03 | 1 | -3/+3 |
| | | | | Added some new annotations; need to finish more. | ||||
* | flake8 fixes (partial) | Bryan Newbold | 2020-06-03 | 1 | -11/+2 |
| | |||||
* | reformat python code with black | Bryan Newbold | 2020-06-03 | 1 | -109/+158 |
| | |||||
* | fixes from running pipeline | Bryan Newbold | 2020-06-03 | 1 | -1/+2 |
| | | | | Not caught by mypy/lint? Hrm. | ||||
* | compute and use tags | Bryan Newbold | 2020-06-03 | 1 | -0/+41 |
| | |||||
* | fixes from manual testing | Bryan Newbold | 2020-05-20 | 1 | -5/+4 |
| | |||||
* | fixes to release+sim pipeline | Bryan Newbold | 2020-05-20 | 1 | -1/+2 |
| | |||||
* | indexing tweaks | Bryan Newbold | 2020-05-20 | 1 | -3/+4 |
| | |||||
* | first pass transform from pipelines to ES schema | Bryan Newbold | 2020-05-20 | 1 | -0/+306 |
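
Two of the commits above describe size-related behavior in enough detail to sketch: "json: exclude None in output, and sort keys" (2020-07-27) and "truncate indexed fulltext body at 1 MByte" (2021-02-15). The following is a minimal illustration using only the Python standard library, not the code from those commits. The names `MAX_BODY_CHARS`, `truncate_body`, `drop_none`, and `to_compact_json` are hypothetical, and the choice to measure the limit in characters (2^20 of them) is an assumption, since the log does not say how "1 MByte" is counted.

```python
import json
from typing import Any

# Hypothetical limit: the commit message says "1 MByte" but not whether that
# means bytes or characters, or 10^6 versus 2^20. Assumed here: 2^20 characters.
MAX_BODY_CHARS = 1024 * 1024


def truncate_body(body: str, limit: int = MAX_BODY_CHARS) -> str:
    """Return the body cut to at most `limit` characters."""
    return body[:limit]


def drop_none(value: Any) -> Any:
    """Recursively remove dict entries whose value is None.

    None elements inside lists are kept, so list positions do not shift.
    """
    if isinstance(value, dict):
        return {k: drop_none(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [drop_none(v) for v in value]
    return value


def to_compact_json(doc: dict) -> str:
    """Serialize without None-valued fields and with keys in sorted order."""
    return json.dumps(drop_none(doc), sort_keys=True)
```

For example, `to_compact_json({"title": "x", "abstract": None})` omits the `abstract` key entirely. Emitting keys in a stable order is what the commit message credits with better compression across documents.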