Commit message | Author | Age | Files | Lines | |
---|---|---|---|---|---|
* | catch/ignore ChunkedEncoding errors in fetches | Bryan Newbold | 2021-06-11 | 1 | -0/+3 |
| | |||||
* | lint fixes, and run fmt | Bryan Newbold | 2021-06-02 | 1 | -7/+7 |
| | |||||
* | add 'crossref' hydration to work pipeline | Bryan Newbold | 2021-06-02 | 1 | -0/+35 |
| | | | | | | | | The immediate motivation is to include recent crossref refs in citation graph transforms. May also be valuable for researchers to have authoritative/publisher metadata in the bundle dumps. | ||||
* | schema: add 'crossref' to bundle schema, and add from_json() helper | Bryan Newbold | 2021-06-02 | 1 | -0/+1 |
| | | | | | from_json() refactor was an earlier TODO, to reduce duplication when updating fields on this class | ||||
* | Modernize Python syntax with pyupgrade --py38-plus **/*.py | Christian Clauss | 2021-02-23 | 1 | -1/+1 |
| | |||||
* | fmt and lint fixes (including one actual bug) | Bryan Newbold | 2021-02-15 | 1 | -1/+1 |
| | |||||
* | more seaweedfs hacks | Bryan Newbold | 2021-02-12 | 1 | -0/+8 |
| | |||||
* | enable sentry exceptions for workers and pipelines | Bryan Newbold | 2021-01-30 | 1 | -1/+10 |
| | | | | It is otherwise difficult to debug multi-million record pipelines. | ||||
* | work pipeline: hack to skip seaweedfs errors for now | Bryan Newbold | 2021-01-26 | 1 | -0/+5 |
| | | | | | This isn't great because it turns a lot of problems into silent failures. | ||||
* | sort keys in work pipeline (fix typo) | Bryan Newbold | 2021-01-22 | 1 | -1/+1 |
| | |||||
* | bug fix: actually fetch/include HTML fulltext | Bryan Newbold | 2021-01-22 | 1 | -1/+1 |
| | |||||
* | add basic html fulltext support to fetch pipeline | Bryan Newbold | 2020-11-18 | 1 | -2/+46 |
| | |||||
* | commands: show usage on empty command | Bryan Newbold | 2020-11-02 | 1 | -1/+1 |
| | |||||
* | work pipeline comparison fix | Bryan Newbold | 2020-10-28 | 1 | -0/+3 |
| | |||||
* | Upgrade Dynaconf to 3+ | Bruno Rocha | 2020-10-05 | 1 | -1/+1 |
| | | | | | | In dynaconf 3+ it is no longer recommended to use `from dynaconf import settings`; the recommendation now is to create your own instance of the settings object based on the Dynaconf class. | ||||
* | pipeline: skip grobid/pdftext lookups when no URL; prefer GROBID to pdftext | Bryan Newbold | 2020-07-27 | 1 | -1/+3 |
| | |||||
* | json: exclude None in output, and sort keys | Bryan Newbold | 2020-07-27 | 1 | -2/+2 |
| | | | | | | | | | | These are both size/performance enhancements. Not including 'None' values will reduce document sizes on-disk and over network, particularly for intermediate objects. Sorting by key should improve compression ratios across multiple documents, both on-disk (gzip) and in elasticsearch itself: https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-disk-usage.html#_put_fields_in_the_same_order_in_documents | ||||
* | fix lint errors (and some small bugs) | Bryan Newbold | 2020-06-29 | 1 | -6/+8 |
| | |||||
* | seaweedfs for S3 API; pull config from dynaconf | Bryan Newbold | 2020-06-29 | 1 | -11/+2 |
| | |||||
* | make fmt | Bryan Newbold | 2020-06-29 | 1 | -1/+3 |
| | |||||
* | fetch pdftotext and pdf_meta from blobs, postgrest | Bryan Newbold | 2020-06-29 | 1 | -18/+45 |
| | | | | | This replaces the temporary COVID-19 content hack with production content (text, thumbnail URLs) stored in postgrest and seaweedfs. | ||||
* | flake8 fixes (partial) | Bryan Newbold | 2020-06-03 | 1 | -5/+2 |
| | |||||
* | reformat python code with black | Bryan Newbold | 2020-06-03 | 1 | -68/+120 |
| | |||||
* | more petabox timeout handling | Bryan Newbold | 2020-05-21 | 1 | -0/+3 |
| | |||||
* | handle petabox read timeouts a bit | Bryan Newbold | 2020-05-21 | 1 | -1/+6 |
| | |||||
* | fix typo with UnicodeDecodeError catch | Bryan Newbold | 2020-05-21 | 1 | -1/+1 |
| | |||||
* | skip pdftotext loading on unicode error | Bryan Newbold | 2020-05-20 | 1 | -0/+2 |
| | |||||
* | skip SIM items w/o page_numbers (instead of asserting) | Bryan Newbold | 2020-05-20 | 1 | -1/+3 |
| | |||||
* | fixes from manual testing | Bryan Newbold | 2020-05-20 | 1 | -8/+13 |
| | |||||
* | local pdftotext cache dir hack | Bryan Newbold | 2020-05-20 | 1 | -1/+18 |
| | |||||
* | fixes to release+sim pipeline | Bryan Newbold | 2020-05-20 | 1 | -10/+16 |
| | |||||
* | first pass transform from pipelines to ES schema | Bryan Newbold | 2020-05-20 | 1 | -16/+1 |
| | |||||
* | WIP on SIM pipeline | Bryan Newbold | 2020-05-19 | 1 | -2/+2 |
| | |||||
* | WIP on release-to-sim fetching | Bryan Newbold | 2020-05-19 | 1 | -12/+49 |
| | |||||
* | initial progress on work pipeline | Bryan Newbold | 2020-05-16 | 1 | -0/+305 |
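
The 2020-07-27 commit "json: exclude None in output, and sort keys" describes two output choices: dropping `None` values to shrink documents, and sorting keys so that similar documents compress better. The sketch below illustrates that idea using only the Python standard library. It is not the repository's actual code; the helper names `strip_none` and `compact_json` and the sample document are hypothetical.

```python
import json
from typing import Any


def strip_none(value: Any) -> Any:
    """Recursively drop dict entries whose value is None (hypothetical helper)."""
    if isinstance(value, dict):
        return {k: strip_none(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [strip_none(v) for v in value]
    return value


def compact_json(doc: dict) -> str:
    """Serialize with None fields removed and keys in sorted order."""
    return json.dumps(strip_none(doc), sort_keys=True)


# Hypothetical sample document, for illustration only.
doc = {"title": "Example", "doi": None, "authors": [{"name": "A", "orcid": None}]}
print(compact_json(doc))
# prints {"authors": [{"name": "A"}], "title": "Example"}
```

Sorting keys gives every document the same field order, which is the property the commit message links to better gzip and Elasticsearch compression.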