path: root/fatcat_scholar/transform.py
Commit message | Author | Age | Files | Lines
* Upgrade Dynaconf to 3+ | Bruno Rocha | 2020-10-05 | 1 | -1/+1
  In Dynaconf 3+, using `from dynaconf import settings` is no longer recommended; the recommendation now is to create your own instance of the settings object based on the `Dynaconf` class.
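The recommendation in this commit message can be sketched roughly as follows. This is a minimal illustration, not the change made in the commit itself: the file name `settings.toml` and the helper name `make_settings` are assumptions, and only the `Dynaconf` class and its `settings_files` and `environments` arguments come from the library.

```python
def make_settings():
    """Build a project-local settings object (Dynaconf 3+ style)."""
    # Imported inside the function so this sketch can be loaded even where
    # dynaconf is not installed; a real module would import at the top.
    from dynaconf import Dynaconf

    return Dynaconf(
        settings_files=["settings.toml"],  # hypothetical file name
        environments=True,  # enable [default]/[development]/... sections
    )
```

A project would typically call this once, for example `settings = make_settings()` in its own package, and import that `settings` object elsewhere in place of `from dynaconf import settings`.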
* refs and grobid2json bugfixes from testing | Bryan Newbold | 2020-09-14 | 1 | -3/+10
* bugfix: release_year | Bryan Newbold | 2020-09-13 | 1 | -2/+2
* refs transform: both GROBID and fatcat refs | Bryan Newbold | 2020-09-13 | 1 | -5/+12
* ref transform: support more GROBID fields | Bryan Newbold | 2020-09-13 | 1 | -10/+16
* fixes to refs transform (for non-str author fields) | Bryan Newbold | 2020-09-04 | 1 | -2/+6
* heavy to refs command | Bryan Newbold | 2020-09-04 | 1 | -2/+142
* use simple names, not domain names, for some platforms | Bryan Newbold | 2020-08-12 | 1 | -3/+3
* biblio metadata hacks at transform time | Bryan Newbold | 2020-08-12 | 1 | -2/+98
* don't index sim_page without issue_item and first_page | Bryan Newbold | 2020-08-06 | 1 | -0/+3
* handle integer conversion and bounding for ES schema | Bryan Newbold | 2020-08-06 | 1 | -10/+13
* json: exclude None in output, and sort keys | Bryan Newbold | 2020-07-27 | 1 | -1/+1
  These are both size/performance enhancements. Not including 'None' values will reduce document sizes on disk and over the network, particularly for intermediate objects. Sorting by key should improve compression ratios across multiple documents, both on disk (gzip) and in Elasticsearch itself: https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-disk-usage.html#_put_fields_in_the_same_order_in_documents
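The two output tweaks described in this commit message can be illustrated with the standard library alone. This is a sketch under assumptions, not the code from the commit: the helper names `drop_none` and `to_compact_json` and the sample document are hypothetical.

```python
import json
from typing import Any


def drop_none(value: Any) -> Any:
    """Recursively remove dict entries whose value is None.

    None items inside lists are kept; only dict keys are dropped.
    """
    if isinstance(value, dict):
        return {k: drop_none(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [drop_none(v) for v in value]
    return value


def to_compact_json(doc: dict) -> str:
    """Serialize without None-valued keys and with keys in sorted order."""
    return json.dumps(drop_none(doc), sort_keys=True)


# Hypothetical sample document, for illustration only.
doc = {"title": "Example", "abstract": None, "biblio": {"year": 2020, "volume": None}}
print(to_compact_json(doc))  # {"biblio": {"year": 2020}, "title": "Example"}
```

Emitting keys in the same order for every document means that similar documents share longer identical byte runs, which is what helps gzip and the storage layer compress them.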
* ensure SIM release date parses before assigning | Bryan Newbold | 2020-07-21 | 1 | -1/+6
* make fmt | Bryan Newbold | 2020-06-29 | 1 | -8/+13
* include GROBID-extracted abstracts in search documents | Bryan Newbold | 2020-06-29 | 1 | -10/+15
* small improvements to SIM metadata maps | Bryan Newbold | 2020-06-29 | 1 | -6/+11
* fixes for pdf_meta dict | Bryan Newbold | 2020-06-29 | 1 | -1/+2
* remove old COVID19 thumbnail hack | Bryan Newbold | 2020-06-29 | 1 | -1/+2
* fetch pdftotext and pdf_meta from blobs, postgrest | Bryan Newbold | 2020-06-29 | 1 | -21/+13
  This replaces the temporary COVID-19 content hack with production content (text, thumbnail URLs) stored in postgrest and seaweedfs.
* collapse pages by SIM issue | Bryan Newbold | 2020-06-04 | 1 | -0/+3
* flake8-annotation linting | Bryan Newbold | 2020-06-03 | 1 | -3/+3
  Added some new annotations; need to finish more.
* flake8 fixes (partial) | Bryan Newbold | 2020-06-03 | 1 | -11/+2
* reformat python code with black | Bryan Newbold | 2020-06-03 | 1 | -109/+158
* fixes from running pipeline | Bryan Newbold | 2020-06-03 | 1 | -1/+2
  Not caught by mypy/lint? Hrm.
* compute and use tags | Bryan Newbold | 2020-06-03 | 1 | -0/+41
* fixes from manual testing | Bryan Newbold | 2020-05-20 | 1 | -5/+4
* fixes to release+sim pipeline | Bryan Newbold | 2020-05-20 | 1 | -1/+2
* indexing tweaks | Bryan Newbold | 2020-05-20 | 1 | -3/+4
* first pass transform from pipelines to ES schema | Bryan Newbold | 2020-05-20 | 1 | -0/+306