path: root/fatcat_scholar/schema.py
Commit message | Author | Date | Files | Lines
* catch HTML parsing error from withing html (via bs4) | Bryan Newbold | 2021-02-01 | 1 | -2/+9
* bugfix: container_sherpa_color not defined | Bryan Newbold | 2021-01-29 | 1 | -1/+1
* make fmt | Bryan Newbold | 2021-01-25 | 1 | -1/+3
* basic support for excluding web content from index | Bryan Newbold | 2021-01-22 | 1 | -0/+14
* add container_sherpa_color field, and populate it | Bryan Newbold | 2021-01-22 | 1 | -18/+18
* refactor DOI domain lookup into python code; expand table | Bryan Newbold | 2021-01-21 | 1 | -0/+14
* citation: fixes to generic hack; remove bibtex hack | Bryan Newbold | 2021-01-21 | 1 | -31/+6
* fixup: check for container.extra in indexing pipeline | Bryan Newbold | 2021-01-21 | 1 | -1/+3
* fix indexing bug (false-y publisher_type?) | Bryan Newbold | 2021-01-18 | 1 | -0/+2
* lint: fix small bugs and type annotations | Bryan Newbold | 2021-01-18 | 1 | -1/+2
* small corrections to schema/transform | Bryan Newbold | 2021-01-16 | 1 | -1/+4
* make fmt | Bryan Newbold | 2021-01-15 | 1 | -6/+6
* crude bibtex and citation formatting, as a demo | Bryan Newbold | 2021-01-14 | 1 | -0/+49
* schema: make fulltext body optional (eg, for search results) | Bryan Newbold | 2021-01-14 | 1 | -1/+1
* add support for new identifiers and size_bytes schema fields | Bryan Newbold | 2021-01-14 | 1 | -4/+13
* add basic html fulltext support to fetch pipeline | Bryan Newbold | 2020-11-18 | 1 | -0/+1
* schema: optional 'fetched' field on bundles | Bryan Newbold | 2020-10-16 | 1 | -0/+2
* make fmt | Bryan Newbold | 2020-09-13 | 1 | -6/+12
* ref transform: support more GROBID fields | Bryan Newbold | 2020-09-13 | 1 | -1/+4
* URL cleanup helper | Bryan Newbold | 2020-09-13 | 1 | -0/+28
* heavy to refs command | Bryan Newbold | 2020-09-04 | 1 | -0/+36
* handle small ints better (signed/unsigned abs size) | Bryan Newbold | 2020-08-12 | 1 | -1/+2
* transform: more string cleaning | Bryan Newbold | 2020-08-12 | 1 | -12/+59
* volume_int/issue_int as actual ints | Bryan Newbold | 2020-08-06 | 1 | -2/+2
* handle integer conversion and bounding for ES schema | Bryan Newbold | 2020-08-06 | 1 | -9/+22
* scrub_text: single-token strings skipped | Bryan Newbold | 2020-08-06 | 1 | -0/+4
* strip ACKNOWLEDGEMENTS prefix | Bryan Newbold | 2020-08-06 | 1 | -0/+1
* transform: catch more cases of null extra | Bryan Newbold | 2020-07-30 | 1 | -10/+10
* abstracts: more prefixes to ignore | Bryan Newbold | 2020-07-27 | 1 | -0/+3
* strip <em> tags explicitly | Bryan Newbold | 2020-07-21 | 1 | -0/+1
* handle large/bad 'first_page' metadata | Bryan Newbold | 2020-06-29 | 1 | -0/+3
* more conservative container_original_name | Bryan Newbold | 2020-06-29 | 1 | -0/+2
* fix lint errors (and some small bugs) | Bryan Newbold | 2020-06-29 | 1 | -2/+1
* fixes to schema parsing from prod | Bryan Newbold | 2020-06-29 | 1 | -9/+13
* include GROBID-extracted abstracts in search documents | Bryan Newbold | 2020-06-29 | 1 | -0/+8
* fetch pdftotext and pdf_meta from blobs, postgrest | Bryan Newbold | 2020-06-29 | 1 | -4/+5
* commit production work-around (temporarily) | Bryan Newbold | 2020-06-04 | 1 | -1/+2
* collapse pages by SIM issue | Bryan Newbold | 2020-06-04 | 1 | -0/+1
* fmt | Bryan Newbold | 2020-06-04 | 1 | -0/+2
* start some annotaition fixes for pytype | Bryan Newbold | 2020-06-03 | 1 | -1/+3
* more flake8 | Bryan Newbold | 2020-06-03 | 1 | -1/+1
* flake8 fixes (partial) | Bryan Newbold | 2020-06-03 | 1 | -1/+1
* reformat python code with black | Bryan Newbold | 2020-06-03 | 1 | -38/+64
* improve text scrubbing | Bryan Newbold | 2020-06-03 | 1 | -13/+21
* add prefix scrubing (esp. for abstracts) | Bryan Newbold | 2020-05-21 | 1 | -0/+18
* use beautiful soup for XML scrubing | Bryan Newbold | 2020-05-21 | 1 | -7/+6
* be more inclusive of author names | Bryan Newbold | 2020-05-21 | 1 | -4/+4
* fixes from manual testing | Bryan Newbold | 2020-05-20 | 1 | -7/+11
* first pass transform from pipelines to ES schema | Bryan Newbold | 2020-05-20 | 1 | -0/+334