path: root/fatcat_scholar/sim_pipeline.py
Commit message | Author | Age | Files | Lines
* schema: add 'crossref' to bundle schema, and add from_json() helper | Bryan Newbold | 2021-06-02 | 1 | -0/+1
    from_json() refactor was an earlier TODO, to reduce duplication when updating fields on this class
* sim: catch MaxRetryError | Bryan Newbold | 2021-01-31 | 1 | -0/+2
* enable sentry exceptions for workers and pipelines | Bryan Newbold | 2021-01-30 | 1 | -0/+10
    It is otherwise difficult to debug multi-million record pipelines.
* sim pipeline: improve exception catching | Bryan Newbold | 2021-01-27 | 1 | -4/+5
* sim indexing: new parallel fetch structure | Bryan Newbold | 2021-01-26 | 1 | -0/+65
* commands: show usage on empty command | Bryan Newbold | 2020-11-02 | 1 | -1/+1
* SIM pipeline: refactor issue item fetching and bundle conversion | Bryan Newbold | 2020-10-16 | 1 | -23/+32
* json: exclude None in output, and sort keys | Bryan Newbold | 2020-07-27 | 1 | -1/+1
    These are both size/performance enhancements.
    Not including 'None' values will reduce document sizes on-disk and over network, particularly for intermediate objects.
    Sorting by key should improve compression ratios across multiple documents, both on-disk (gzip) and in elasticsearch itself:
    https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-disk-usage.html#_put_fields_in_the_same_order_in_documents
* fix lint errors (and some small bugs) | Bryan Newbold | 2020-06-29 | 1 | -2/+2
* more flake8 | Bryan Newbold | 2020-06-03 | 1 | -1/+1
* flake8 fixes (partial) | Bryan Newbold | 2020-06-03 | 1 | -13/+4
* reformat python code with black | Bryan Newbold | 2020-06-03 | 1 | -45/+65
* more petabox timeout handling | Bryan Newbold | 2020-05-21 | 1 | -0/+3
* handle petabox read timeouts a bit | Bryan Newbold | 2020-05-21 | 1 | -1/+6
* skip SIM items w/o page_numbers (instead of asserting) | Bryan Newbold | 2020-05-20 | 1 | -1/+3
* first pass transform from pipelines to ES schema | Bryan Newbold | 2020-05-20 | 1 | -4/+8
* WIP on SIM pipeline | Bryan Newbold | 2020-05-19 | 1 | -0/+173
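
The 2020-07-27 commit message ("json: exclude None in output, and sort keys") describes two serialization choices. The following is a minimal standard-library sketch of that idea, not the code from the commit itself: the helper names `strip_none` and `to_compact_json` and the sample field names are hypothetical, and the actual change may rely on the options of whatever serialization library the project uses.

```python
import json
from typing import Any


def strip_none(value: Any) -> Any:
    """Recursively drop dict entries whose value is None.

    None items inside lists are kept, since removing them would
    shift the positions of the remaining items.
    """
    if isinstance(value, dict):
        return {k: strip_none(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [strip_none(v) for v in value]
    return value


def to_compact_json(record: dict) -> str:
    """Serialize a record without None fields and with keys in sorted order."""
    return json.dumps(strip_none(record), sort_keys=True)


# Hypothetical record; the field names are illustrative only.
example = {"title": "Some Issue", "crossref": None, "pages": {"last": None, "first": 12}}
compact = to_compact_json(example)
print(compact)
```

Emitting keys in a stable order means that consecutive documents share long identical byte sequences, which is the property the commit message expects to help gzip and Elasticsearch compression.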