path: root/fatcat_scholar/sim_pipeline.py
Commit message | Author | Age | Files | Lines
* json: exclude None in output, and sort keys | Bryan Newbold | 2020-07-27 | 1 | -1/+1
    These are both size/performance enhancements. Not including 'None' values will reduce document sizes on-disk and over network, particularly for intermediate objects. Sorting by key should improve compression ratios across multiple documents, both on-disk (gzip) and in elasticsearch itself: https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-disk-usage.html#_put_fields_in_the_same_order_in_documents
* fix lint errors (and some small bugs) | Bryan Newbold | 2020-06-29 | 1 | -2/+2
* more flake8 | Bryan Newbold | 2020-06-03 | 1 | -1/+1
* flake8 fixes (partial) | Bryan Newbold | 2020-06-03 | 1 | -13/+4
* reformat python code with black | Bryan Newbold | 2020-06-03 | 1 | -45/+65
* more petabox timeout handling | Bryan Newbold | 2020-05-21 | 1 | -0/+3
* handle petabox read timeouts a bit | Bryan Newbold | 2020-05-21 | 1 | -1/+6
* skip SIM items w/o page_numbers (instead of asserting) | Bryan Newbold | 2020-05-20 | 1 | -1/+3
* first pass transform from pipelines to ES schema | Bryan Newbold | 2020-05-20 | 1 | -4/+8
* WIP on SIM pipeline | Bryan Newbold | 2020-05-19 | 1 | -0/+173
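
The 2020-07-27 commit message describes two output tweaks: dropping 'None' values and sorting keys. The following is a minimal sketch of that idea using only the Python standard library; it is not the code from the commit itself, and the helper names `drop_none` and `compact_json` and the sample document are hypothetical, chosen for illustration.

```python
import json
from typing import Any


def drop_none(value: Any) -> Any:
    """Recursively remove dict entries whose value is None.

    None items inside lists are left in place, since removing them
    would change list positions. (Hypothetical helper, for illustration.)
    """
    if isinstance(value, dict):
        return {k: drop_none(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [drop_none(v) for v in value]
    return value


def compact_json(doc: dict) -> str:
    """Serialize with None-valued fields removed and keys in sorted order."""
    return json.dumps(drop_none(doc), sort_keys=True)


# Hypothetical sample document, not taken from the pipeline
doc = {"title": "Example", "issue": None, "pages": {"start": 1, "end": None}}
print(compact_json(doc))  # {"pages": {"start": 1}, "title": "Example"}
```

With keys in a fixed order, similar documents serialize to similar byte sequences, which is what the commit message expects to help gzip and Elasticsearch compress across documents.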