path: root/fatcat_scholar/transform.py
Commit message    Author    Age    Files    Lines
* SIM transform: handle multiple publishers    Bryan Newbold    2022-01-06    1    -1/+5
|
* refs transform: handle rare missing ref 'id'    Bryan Newbold    2022-01-05    1    -1/+7
|     This impacted one single DOI in the most recent dump/transform
* move public domain wall to 1926 ('before 1927')    Bryan Newbold    2022-01-05    1    -1/+1
|
* refs: include GROBID-parsed crossref refs    Bryan Newbold    2021-12-06    1    -4/+52
|     This takes advantage of Crossref 'unstructured' refs which have been
|     parsed using GROBID and stored in the sandcrawler database, as part of
|     the sandcrawler crossref metadata pipeline.
* refactor use of grobid_tei_xml    Bryan Newbold    2021-10-27    1    -41/+39
|
* replace grobid2json with grobid_tei_xml    Bryan Newbold    2021-10-27    1    -3/+5
|     This first iteration uses the .to_legacy_dict() helpers for backwards
|     compatibility
* lint: small cleanups, mostly E711 and E713    Bryan Newbold    2021-10-27    1    -3/+3
|
* lint: remove all 'import *' uses    Bryan Newbold    2021-10-27    1    -2/+20
|
* make fmt (black 21.9b0)    Bryan Newbold    2021-10-27    1    -3/+10
|
* re-style imports (isort) on all core python files    Bryan Newbold    2021-10-27    1    -5/+5
|
* better parsing of year as integer in refs pipeline    Bryan Newbold    2021-07-26    1    -2/+2
|
* make fmt    Bryan Newbold    2021-07-26    1    -4/+10
|
* ref_key: hotfix for some corner cases    Bryan Newbold    2021-07-26    1    -8/+25
|
* transform: more clean_doi() calls    Bryan Newbold    2021-07-26    1    -3/+3
|
* refs transform: consolidate clean_ref_key() hacks    Bryan Newbold    2021-07-25    1    -17/+35
|
* refs transform: many fixes    Bryan Newbold    2021-07-25    1    -9/+34
|     - include year correctly (many cases)
|     - test coverage for Crossref transform
|     - pass-through 'edition' as 'version'
|     - series-title parsed in to title or container as appropriate
|     - missing release stage
|     - fix 0-index vs. 1-index ref index field
* refs transform: 1-index refs.index, not 0-index    Bryan Newbold    2021-07-25    1    -3/+11
|     This was not matching expectations/schema of downstream refs pipeline
|     (cgraph), and wasn't matching documented schema.
|
|     Note care required when checking if the index is set, to distinguish
|     between '0' and 'None' values.
* refs: clean up GROBID DOIs and PMCIDs    Bryan Newbold    2021-07-01    1    -2/+3
|
* HACK: don't parse TEI-XML for a specific paper/file    Bryan Newbold    2021-06-30    1    -2/+4
|     GROBID v0.5.5 returns TEI-XML for this one PDF which is not valid XML,
|     due to a text encoding issue.
* refs: include (source) release_stage in output    Bryan Newbold    2021-06-30    1    -0/+1
|
* bugfix: pass full crossref obj, not just 'record'    Bryan Newbold    2021-06-02    1    -1/+1
|
* refs: use fatcat prefix for some sources    Bryan Newbold    2021-06-02    1    -5/+5
|     This makes debugging what is going on much easier
* integrate crossref references, and iterate on refs output logic    Bryan Newbold    2021-06-02    1    -7/+115
|     Needs test coverage!
* schema: add 'crossref' to bundle schema, and add from_json() helper    Bryan Newbold    2021-06-02    1    -26/+4
| | | | | from_json() refactor was an earlier TODO, to reduce duplication when updating fields on this class
* reduce max body size to 0.5M characters    Bryan Newbold    2021-02-24    1    -1/+1
|
* fix body size limit    Bryan Newbold    2021-02-24    1    -4/+4
|
* fmt and lint fixes (including one actual bug)    Bryan Newbold    2021-02-15    1    -2/+3
|
* truncate indexed fulltext body at 1 MByte    Bryan Newbold    2021-02-15    1    -2/+13
|     There was a large ~4 MByte document getting indexed
|     (work_lumgqw4vqbgvha2ejbsbaepedq) with several megabytes of text, and
|     this was causing elasticsearch indexing timeouts.
* catch TEI-XML parsing exception    Bryan Newbold    2021-01-30    1    -12/+17
|
* enable sentry exceptions for workers and pipelines    Bryan Newbold    2021-01-30    1    -1/+12
|     It is otherwise difficult to debug multi-million record pipelines.
* bigfix: try resolving lang_code list issue again    Bryan Newbold    2021-01-30    1    -5/+4
|
* bugfix: lang_code sometimes a list    Bryan Newbold    2021-01-29    1    -2/+7
|
* make fmt    Bryan Newbold    2021-01-25    1    -1/+4
|
* basic support for excluding web content from index    Bryan Newbold    2021-01-22    1    -6/+45
|     Based on particular patterns in metadata, or exclusion lists in settings
* bug fix: more html_fulltext not getting processed    Bryan Newbold    2021-01-22    1    -0/+2
|
* add container_sherpa_color field, and populate it    Bryan Newbold    2021-01-22    1    -0/+1
|
* improve 'oa' tag calculation    Bryan Newbold    2021-01-16    1    -4/+4
|
* small corrections to schema/transform    Bryan Newbold    2021-01-16    1    -2/+4
|
* add support for new identifiers and size_bytes schema fields    Bryan Newbold    2021-01-14    1    -0/+3
|
* basic HTML transform/index support    Bryan Newbold    2020-11-18    1    -2/+46
|
* refs: extract fatcat crossref pages metadata    Bryan Newbold    2020-11-13    1    -1/+1
|
* commands: show usage on empty command    Bryan Newbold    2020-11-02    1    -1/+1
|
* more SIM metadata mappings    Bryan Newbold    2020-10-19    1    -3/+31
|
* SIM pipeline: more language conversions    Bryan Newbold    2020-10-16    1    -2/+5
|     Not sure where these language strings are coming from, but these were
|     from existing SIM item metadata in archive.org
* transform: refactor tag generation out of transform heavy method    Bryan Newbold    2020-10-16    1    -28/+37
|
* Upgrade Dynaconf to 3+    Bruno Rocha    2020-10-05    1    -1/+1
|     In dynaconf 3+ it is no more recommended to use
|     `from dynaconf import settings` now the recommendation is to create
|     your own instance of the settings object based on Dynaconf class.
* refs and grobid2json bugfixes from testing    Bryan Newbold    2020-09-14    1    -3/+10
|
* bugfix: release_year    Bryan Newbold    2020-09-13    1    -2/+2
|
* refs transform: both GROBID and fatcat refs    Bryan Newbold    2020-09-13    1    -5/+12
|
* ref transform: support more GROBID fields    Bryan Newbold    2020-09-13    1    -10/+16
|