path: root/fatcat_scholar/work_pipeline.py
Commit message | Author | Age | Files | Lines
* sort keys in work pipeline (fix typo) | Bryan Newbold | 2021-01-22 | 1 | -1/+1
* bug fix: actually fetch/include HTML fulltext | Bryan Newbold | 2021-01-22 | 1 | -1/+1
* add basic html fulltext support to fetch pipeline | Bryan Newbold | 2020-11-18 | 1 | -2/+46
* commands: show usage on empty command | Bryan Newbold | 2020-11-02 | 1 | -1/+1
* work pipeline comparison fix | Bryan Newbold | 2020-10-28 | 1 | -0/+3
* Upgrade Dynaconf to 3+ | Bruno Rocha | 2020-10-05 | 1 | -1/+1
* pipeline: skip grobid/pdftext lookups when no URL; prefer GROBID to pdftext | Bryan Newbold | 2020-07-27 | 1 | -1/+3
* json: exclude None in output, and sort keys | Bryan Newbold | 2020-07-27 | 1 | -2/+2
* fix lint errors (and some small bugs) | Bryan Newbold | 2020-06-29 | 1 | -6/+8
* seaweedfs for S3 API; pull config from dynaconf | Bryan Newbold | 2020-06-29 | 1 | -11/+2
* make fmt | Bryan Newbold | 2020-06-29 | 1 | -1/+3
* fetch pdftotext and pdf_meta from blobs, postgrest | Bryan Newbold | 2020-06-29 | 1 | -18/+45
* flake8 fixes (partial) | Bryan Newbold | 2020-06-03 | 1 | -5/+2
* reformat python code with black | Bryan Newbold | 2020-06-03 | 1 | -68/+120
* more petabox timeout handling | Bryan Newbold | 2020-05-21 | 1 | -0/+3
* handle petabox read timeouts a bit | Bryan Newbold | 2020-05-21 | 1 | -1/+6
* fix typo with UnicodeDecodeError catch | Bryan Newbold | 2020-05-21 | 1 | -1/+1
* skip pdftotext loading on unicode error | Bryan Newbold | 2020-05-20 | 1 | -0/+2
* skip SIM items w/o page_numbers (instead of asserting) | Bryan Newbold | 2020-05-20 | 1 | -1/+3
* fixes from manual testing | Bryan Newbold | 2020-05-20 | 1 | -8/+13
* local pdftotext cache dir hack | Bryan Newbold | 2020-05-20 | 1 | -1/+18
* fixes to release+sim pipeline | Bryan Newbold | 2020-05-20 | 1 | -10/+16
* first pass transform from pipelines to ES schema | Bryan Newbold | 2020-05-20 | 1 | -16/+1
* WIP on SIM pipeline | Bryan Newbold | 2020-05-19 | 1 | -2/+2
* WIP on release-to-sim fetching | Bryan Newbold | 2020-05-19 | 1 | -12/+49
* initial progress on work pipeline | Bryan Newbold | 2020-05-16 | 1 | -0/+305