path: root/fatcat_scholar/work_pipeline.py
Commit message | Author | Age | Files | Lines
* Revert "pull GROBID refs along with crossref records into bundles" | Bryan Newbold | 2021-11-10 | 1 | -1/+0
|   This reverts commit c164970449a392b5165d903d213c2bb51f2a187f.
|   Didn't mean to merge this to master just yet.
* pull GROBID refs along with crossref records into bundles | Bryan Newbold | 2021-11-10 | 1 | -0/+1
|
* lint: small cleanups, mostly E711 and E713 | Bryan Newbold | 2021-10-27 | 1 | -2/+2
|
* lint: remove all 'import *' uses | Bryan Newbold | 2021-10-27 | 1 | -1/+1
|
* make fmt (black 21.9b0) | Bryan Newbold | 2021-10-27 | 1 | -5/+17
|
* re-style imports (isort) on all core python files | Bryan Newbold | 2021-10-27 | 1 | -14/+11
|
* catch/ignore ChunkedEncoding errors in fetches | Bryan Newbold | 2021-06-11 | 1 | -0/+3
|
* lint fixes, and run fmt | Bryan Newbold | 2021-06-02 | 1 | -7/+7
|
* add 'crossref' hydration to work pipeline | Bryan Newbold | 2021-06-02 | 1 | -0/+35
|   The immediate motivation is to include recent crossref refs in citation graph transforms.
|   May also be valuable for researchers to have authoritative/publisher metadata in the bundle dumps.
* schema: add 'crossref' to bundle schema, and add from_json() helper | Bryan Newbold | 2021-06-02 | 1 | -0/+1
|   from_json() refactor was an earlier TODO, to reduce duplication when updating fields on this class
* Modernize Python syntax with pyupgrade --py38-plus **/*.py | Christian Clauss | 2021-02-23 | 1 | -1/+1
|
* fmt and lint fixes (including one actual bug) | Bryan Newbold | 2021-02-15 | 1 | -1/+1
|
* more seaweedfs hacks | Bryan Newbold | 2021-02-12 | 1 | -0/+8
|
* enable sentry exceptions for workers and pipelines | Bryan Newbold | 2021-01-30 | 1 | -1/+10
|   It is otherwise difficult to debug multi-million record pipelines.
* work pipeline: hack to skip seaweedfs errors for now | Bryan Newbold | 2021-01-26 | 1 | -0/+5
|   This isn't great because it turns a lot of problems into silent failures.
* sort keys in work pipeline (fix typo) | Bryan Newbold | 2021-01-22 | 1 | -1/+1
|
* bug fix: actually fetch/include HTML fulltext | Bryan Newbold | 2021-01-22 | 1 | -1/+1
|
* add basic html fulltext support to fetch pipeline | Bryan Newbold | 2020-11-18 | 1 | -2/+46
|
* commands: show usage on empty command | Bryan Newbold | 2020-11-02 | 1 | -1/+1
|
* work pipeline comparison fix | Bryan Newbold | 2020-10-28 | 1 | -0/+3
|
* Upgrade Dynaconf to 3+ | Bruno Rocha | 2020-10-05 | 1 | -1/+1
|   In Dynaconf 3+, using `from dynaconf import settings` is no longer recommended.
|   The recommendation now is to create your own instance of the settings object based on the Dynaconf class.
* pipeline: skip grobid/pdftext lookups when no URL; prefer GROBID to pdftext | Bryan Newbold | 2020-07-27 | 1 | -1/+3
|
* json: exclude None in output, and sort keys | Bryan Newbold | 2020-07-27 | 1 | -2/+2
|   These are both size/performance enhancements.
|   Not including 'None' values will reduce document sizes on-disk and over network, particularly for intermediate objects.
|   Sorting by key should improve compression ratios across multiple documents, both on-disk (gzip) and in elasticsearch itself:
|   https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-disk-usage.html#_put_fields_in_the_same_order_in_documents
* fix lint errors (and some small bugs) | Bryan Newbold | 2020-06-29 | 1 | -6/+8
|
* seaweedfs for S3 API; pull config from dynaconf | Bryan Newbold | 2020-06-29 | 1 | -11/+2
|
* make fmt | Bryan Newbold | 2020-06-29 | 1 | -1/+3
|
* fetch pdftotext and pdf_meta from blobs, postgrest | Bryan Newbold | 2020-06-29 | 1 | -18/+45
|   This replaces the temporary COVID-19 content hack with production content (text, thumbnail URLs) stored in postgrest and seaweedfs.
* flake8 fixes (partial) | Bryan Newbold | 2020-06-03 | 1 | -5/+2
|
* reformat python code with black | Bryan Newbold | 2020-06-03 | 1 | -68/+120
|
* more petabox timeout handling | Bryan Newbold | 2020-05-21 | 1 | -0/+3
|
* handle petabox read timeouts a bit | Bryan Newbold | 2020-05-21 | 1 | -1/+6
|
* fix typo with UnicodeDecodeError catch | Bryan Newbold | 2020-05-21 | 1 | -1/+1
|
* skip pdftotext loading on unicode error | Bryan Newbold | 2020-05-20 | 1 | -0/+2
|
* skip SIM items w/o page_numbers (instead of asserting) | Bryan Newbold | 2020-05-20 | 1 | -1/+3
|
* fixes from manual testing | Bryan Newbold | 2020-05-20 | 1 | -8/+13
|
* local pdftotext cache dir hack | Bryan Newbold | 2020-05-20 | 1 | -1/+18
|
* fixes to release+sim pipeline | Bryan Newbold | 2020-05-20 | 1 | -10/+16
|
* first pass transform from pipelines to ES schema | Bryan Newbold | 2020-05-20 | 1 | -16/+1
|
* WIP on SIM pipeline | Bryan Newbold | 2020-05-19 | 1 | -2/+2
|
* WIP on release-to-sim fetching | Bryan Newbold | 2020-05-19 | 1 | -12/+49
|
* initial progress on work pipeline | Bryan Newbold | 2020-05-16 | 1 | -0/+305
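
The 2020-07-27 commit "json: exclude None in output, and sort keys" describes two output tweaks: dropping None values and emitting keys in sorted order. The following is a minimal sketch of that idea using only the Python standard library. The helper names `drop_none` and `dump_compact` and the sample record are hypothetical, chosen for illustration; the actual pipeline code may achieve the same effect differently (for example through its model classes' own serializer).

```python
import json
from typing import Any


def drop_none(value: Any) -> Any:
    """Recursively remove dict entries whose value is None.

    Lists are walked so nested dicts are cleaned too; a bare None inside
    a list is left in place, since removing it would shift positions.
    """
    if isinstance(value, dict):
        return {k: drop_none(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [drop_none(v) for v in value]
    return value


def dump_compact(record: dict) -> str:
    """Serialize one record as a single JSON line with sorted keys."""
    return json.dumps(drop_none(record), sort_keys=True)


# Hypothetical sample record, not taken from the real bundle schema.
record = {
    "title": "Example",
    "abstract": None,
    "ident": "abc123",
    "refs": [{"doi": None, "key": "r1"}],
}
line = dump_compact(record)
print(line)
```

With keys in a stable order, consecutive JSON lines share longer common byte runs, which is the property the commit message expects to help gzip and Elasticsearch storage.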