path: root/fatcat_scholar/schema.py
Commit message | Author | Age | Files | Lines
* schema: use container redirect as ident if defined | Bryan Newbold | 2021-11-30 | 1 | -2/+2
    This is to handle containers which have been merged (redirected), but the release entities have not been updated to point to the new "primary" container yet.
* refactor use of grobid_tei_xml | Bryan Newbold | 2021-10-27 | 1 | -4/+5
* scrub_text: remove unused mimetype arg | Bryan Newbold | 2021-10-27 | 1 | -1/+1
    To resolve a warning caught by pytype
* replace classmethods with staticmethods | Bryan Newbold | 2021-10-27 | 1 | -2/+2
* lint: small cleanups, mostly E711 and E713 | Bryan Newbold | 2021-10-27 | 1 | -10/+10
* make fmt (black 21.9b0) | Bryan Newbold | 2021-10-27 | 1 | -1/+0
* re-style imports (isort) on all core python files | Bryan Newbold | 2021-10-27 | 1 | -6/+7
* better parsing of year as integer in refs pipeline | Bryan Newbold | 2021-07-26 | 1 | -2/+6
* bibref: add version field; isbn13 -> isbn | Bryan Newbold | 2021-07-25 | 1 | -1/+2
* refs transform: 1-index refs.index, not 0-index | Bryan Newbold | 2021-07-25 | 1 | -1/+1
    This was not matching expectations/schema of downstream refs pipeline (cgraph), and wasn't matching documented schema. Note care required when checking if the index is set, to distinguish between '0' and 'None' values.
* refs: include (source) release_stage in output | Bryan Newbold | 2021-06-30 | 1 | -0/+1
* schema: add 'crossref' to bundle schema, and add from_json() helper | Bryan Newbold | 2021-06-02 | 1 | -1/+20
    from_json() refactor was an earlier TODO, to reduce duplication when updating fields on this class
* indexing: defer to creator.display_name over contrib.raw_name | Bryan Newbold | 2021-04-12 | 1 | -1/+3
* catch HTML parsing error from within html (via bs4) | Bryan Newbold | 2021-02-01 | 1 | -2/+9
* bugfix: container_sherpa_color not defined | Bryan Newbold | 2021-01-29 | 1 | -1/+1
* make fmt | Bryan Newbold | 2021-01-25 | 1 | -1/+3
* basic support for excluding web content from index | Bryan Newbold | 2021-01-22 | 1 | -0/+14
    Based on particular patterns in metadata, or exclusion lists in settings
* add container_sherpa_color field, and populate it | Bryan Newbold | 2021-01-22 | 1 | -18/+18
* refactor DOI domain lookup into python code; expand table | Bryan Newbold | 2021-01-21 | 1 | -0/+14
* citation: fixes to generic hack; remove bibtex hack | Bryan Newbold | 2021-01-21 | 1 | -31/+6
* fixup: check for container.extra in indexing pipeline | Bryan Newbold | 2021-01-21 | 1 | -1/+3
* fix indexing bug (false-y publisher_type?) | Bryan Newbold | 2021-01-18 | 1 | -0/+2
* lint: fix small bugs and type annotations | Bryan Newbold | 2021-01-18 | 1 | -1/+2
* small corrections to schema/transform | Bryan Newbold | 2021-01-16 | 1 | -1/+4
* make fmt | Bryan Newbold | 2021-01-15 | 1 | -6/+6
* crude bibtex and citation formatting, as a demo | Bryan Newbold | 2021-01-14 | 1 | -0/+49
* schema: make fulltext body optional (eg, for search results) | Bryan Newbold | 2021-01-14 | 1 | -1/+1
* add support for new identifiers and size_bytes schema fields | Bryan Newbold | 2021-01-14 | 1 | -4/+13
* add basic html fulltext support to fetch pipeline | Bryan Newbold | 2020-11-18 | 1 | -0/+1
* schema: optional 'fetched' field on bundles | Bryan Newbold | 2020-10-16 | 1 | -0/+2
* make fmt | Bryan Newbold | 2020-09-13 | 1 | -6/+12
* ref transform: support more GROBID fields | Bryan Newbold | 2020-09-13 | 1 | -1/+4
* URL cleanup helper | Bryan Newbold | 2020-09-13 | 1 | -0/+28
* heavy to refs command | Bryan Newbold | 2020-09-04 | 1 | -0/+36
* handle small ints better (signed/unsigned abs size) | Bryan Newbold | 2020-08-12 | 1 | -1/+2
* transform: more string cleaning | Bryan Newbold | 2020-08-12 | 1 | -12/+59
* volume_int/issue_int as actual ints | Bryan Newbold | 2020-08-06 | 1 | -2/+2
* handle integer conversion and bounding for ES schema | Bryan Newbold | 2020-08-06 | 1 | -9/+22
* scrub_text: single-token strings skipped | Bryan Newbold | 2020-08-06 | 1 | -0/+4
* strip ACKNOWLEDGEMENTS prefix | Bryan Newbold | 2020-08-06 | 1 | -0/+1
* transform: catch more cases of null extra | Bryan Newbold | 2020-07-30 | 1 | -10/+10
    Also correctly pull issne/issnp from container.extra, not release.extra.
* abstracts: more prefixes to ignore | Bryan Newbold | 2020-07-27 | 1 | -0/+3
* strip <em> tags explicitly | Bryan Newbold | 2020-07-21 | 1 | -0/+1
* handle large/bad 'first_page' metadata | Bryan Newbold | 2020-06-29 | 1 | -0/+3
    This was causing elasticsearch indexing errors
* more conservative container_original_name | Bryan Newbold | 2020-06-29 | 1 | -0/+2
* fix lint errors (and some small bugs) | Bryan Newbold | 2020-06-29 | 1 | -2/+1
* fixes to schema parsing from prod | Bryan Newbold | 2020-06-29 | 1 | -9/+13
* include GROBID-extracted abstracts in search documents | Bryan Newbold | 2020-06-29 | 1 | -0/+8
* fetch pdftotext and pdf_meta from blobs, postgrest | Bryan Newbold | 2020-06-29 | 1 | -4/+5
    This replaces the temporary COVID-19 content hack with production content (text, thumbnail URLs) stored in postgrest and seaweedfs.
* commit production work-around (temporarily) | Bryan Newbold | 2020-06-04 | 1 | -1/+2
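
Note on the 2021-07-25 "refs transform: 1-index refs.index, not 0-index" entry: the caveat about distinguishing '0' from 'None' can be illustrated with a minimal sketch. The function name `to_one_indexed` is hypothetical and is not taken from schema.py; it only shows why an explicit `is None` test is needed when 0 is a valid index.

```python
from typing import Optional


def to_one_indexed(index: Optional[int]) -> Optional[int]:
    """Shift a 0-based reference index to 1-based, keeping 'unset' as None.

    Hypothetical helper for illustration; not the project's actual code.
    """
    # A plain `if index:` check would treat 0 as missing, because 0 is falsy.
    # Comparing against None keeps the first reference (index 0) intact.
    if index is None:
        return None
    return index + 1


print(to_one_indexed(0))     # the first reference, shifted to 1-based
print(to_one_indexed(None))  # an unset index is passed through unchanged
```

The same pattern applies wherever a numeric field may legitimately be zero.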