path: root/fatcat_scholar/transform.py
Commit message | Author | Date | Files | Lines
* refs: clean up GROBID DOIs and PMCIDs | Bryan Newbold | 2021-07-01 | 1 | -2/+3
* HACK: don't parse TEI-XML for a specific paper/file | Bryan Newbold | 2021-06-30 | 1 | -2/+4
* refs: include (source) release_stage in output | Bryan Newbold | 2021-06-30 | 1 | -0/+1
* bugfix: pass full crossref obj, not just 'record' | Bryan Newbold | 2021-06-02 | 1 | -1/+1
* refs: use fatcat prefix for some sources | Bryan Newbold | 2021-06-02 | 1 | -5/+5
* integrate crossref references, and iterate on refs output logic | Bryan Newbold | 2021-06-02 | 1 | -7/+115
* schema: add 'crossref' to bundle schema, and add from_json() helper | Bryan Newbold | 2021-06-02 | 1 | -26/+4
* reduce max body size to 0.5M characters | Bryan Newbold | 2021-02-24 | 1 | -1/+1
* fix body size limit | Bryan Newbold | 2021-02-24 | 1 | -4/+4
* fmt and lint fixes (including one actual bug) | Bryan Newbold | 2021-02-15 | 1 | -2/+3
* truncate indexed fulltext body at 1 MByte | Bryan Newbold | 2021-02-15 | 1 | -2/+13
* catch TEI-XML parsing exception | Bryan Newbold | 2021-01-30 | 1 | -12/+17
* enable sentry exceptions for workers and pipelines | Bryan Newbold | 2021-01-30 | 1 | -1/+12
* bigfix: try resolving lang_code list issue again | Bryan Newbold | 2021-01-30 | 1 | -5/+4
* bugfix: lang_code sometimes a list | Bryan Newbold | 2021-01-29 | 1 | -2/+7
* make fmt | Bryan Newbold | 2021-01-25 | 1 | -1/+4
* basic support for excluding web content from index | Bryan Newbold | 2021-01-22 | 1 | -6/+45
* bug fix: more html_fulltext not getting processed | Bryan Newbold | 2021-01-22 | 1 | -0/+2
* add container_sherpa_color field, and populate it | Bryan Newbold | 2021-01-22 | 1 | -0/+1
* improve 'oa' tag calculation | Bryan Newbold | 2021-01-16 | 1 | -4/+4
* small corrections to schema/transform | Bryan Newbold | 2021-01-16 | 1 | -2/+4
* add support for new identifiers and size_bytes schema fields | Bryan Newbold | 2021-01-14 | 1 | -0/+3
* basic HTML transform/index support | Bryan Newbold | 2020-11-18 | 1 | -2/+46
* refs: extract fatcat crossref pages metadata | Bryan Newbold | 2020-11-13 | 1 | -1/+1
* commands: show usage on empty command | Bryan Newbold | 2020-11-02 | 1 | -1/+1
* more SIM metadata mappings | Bryan Newbold | 2020-10-19 | 1 | -3/+31
* SIM pipeline: more language conversions | Bryan Newbold | 2020-10-16 | 1 | -2/+5
* transform: refactor tag generation out of transform heavy method | Bryan Newbold | 2020-10-16 | 1 | -28/+37
* Upgrade Dynaconf to 3+ | Bruno Rocha | 2020-10-05 | 1 | -1/+1
* refs and grobid2json bugfixes from testing | Bryan Newbold | 2020-09-14 | 1 | -3/+10
* bugfix: release_year | Bryan Newbold | 2020-09-13 | 1 | -2/+2
* refs transform: both GROBID and fatcat refs | Bryan Newbold | 2020-09-13 | 1 | -5/+12
* ref transform: support more GROBID fields | Bryan Newbold | 2020-09-13 | 1 | -10/+16
* fixes to refs transform (for non-str author fields) | Bryan Newbold | 2020-09-04 | 1 | -2/+6
* heavy to refs command | Bryan Newbold | 2020-09-04 | 1 | -2/+142
* use simple names, not domain names, for some platforms | Bryan Newbold | 2020-08-12 | 1 | -3/+3
* biblio metadata hacks at transform time | Bryan Newbold | 2020-08-12 | 1 | -2/+98
* don't index sim_page without issue_item and first_page | Bryan Newbold | 2020-08-06 | 1 | -0/+3
* handle integer conversion and bounding for ES schema | Bryan Newbold | 2020-08-06 | 1 | -10/+13
* json: exclude None in output, and sort keys | Bryan Newbold | 2020-07-27 | 1 | -1/+1
* ensure SIM release date parses before assigning | Bryan Newbold | 2020-07-21 | 1 | -1/+6
* make fmt | Bryan Newbold | 2020-06-29 | 1 | -8/+13
* include GROBID-extracted abstracts in search documents | Bryan Newbold | 2020-06-29 | 1 | -10/+15
* small improvements to SIM metadata maps | Bryan Newbold | 2020-06-29 | 1 | -6/+11
* fixes for pdf_meta dict | Bryan Newbold | 2020-06-29 | 1 | -1/+2
* remove old COVID19 thumbnail hack | Bryan Newbold | 2020-06-29 | 1 | -1/+2
* fetch pdftotext and pdf_meta from blobs, postgrest | Bryan Newbold | 2020-06-29 | 1 | -21/+13
* collapse pages by SIM issue | Bryan Newbold | 2020-06-04 | 1 | -0/+3
* flake8-annotation linting | Bryan Newbold | 2020-06-03 | 1 | -3/+3
* flake8 fixes (partial) | Bryan Newbold | 2020-06-03 | 1 | -11/+2