path: root/fatcat_scholar/schema.py
Commit message | Author | Age | Files | Lines
* improve text scrubbing | Bryan Newbold | 2020-06-03 | 1 | -13/+21
    Was going to use textpipe, but dependency was too large and failed to install with halfway modern GCC (due to CLD2 issue): https://github.com/GregBowyer/cld2-cffi/issues/12
    So instead basically pulled out the clean_text function, which is quite short.
* add prefix scrubbing (esp. for abstracts) | Bryan Newbold | 2020-05-21 | 1 | -0/+18
* use beautiful soup for XML scrubbing | Bryan Newbold | 2020-05-21 | 1 | -7/+6
* be more inclusive of author names | Bryan Newbold | 2020-05-21 | 1 | -4/+4
* fixes from manual testing | Bryan Newbold | 2020-05-20 | 1 | -7/+11
* first pass transform from pipelines to ES schema | Bryan Newbold | 2020-05-20 | 1 | -0/+334
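
Several of the commits above concern text scrubbing (general cleanup, prefix removal for abstracts, markup removal). As a rough illustration of what such a step might look like, here is a minimal sketch using only the Python standard library. It is not the code in schema.py: the function name `scrub_text`, the prefix pattern, and the regex-based tag stripping are all assumptions made for this example (the log says the real code uses Beautiful Soup for XML and a `clean_text` function adapted from textpipe).

```python
import html
import re
import unicodedata

# Hypothetical pattern: a leading "Abstract" or "Summary" label followed by
# a colon or period. The real prefix list in schema.py may differ.
_PREFIX_RE = re.compile(r"^(?:abstract|summary)\s*[:.]\s*", re.IGNORECASE)


def scrub_text(raw):
    """Return a cleaned copy of `raw`, or None if nothing usable remains."""
    if not raw:
        return None
    # Crude tag removal (assumption: a real pipeline would use an XML/HTML parser).
    text = re.sub(r"<[^>]+>", " ", raw)
    # Decode entities after stripping tags, so "&lt;" is not mistaken for markup.
    text = html.unescape(text)
    # Fold compatibility characters such as ligatures into plain equivalents.
    text = unicodedata.normalize("NFKC", text)
    # Collapse runs of whitespace and trim the ends.
    text = re.sub(r"\s+", " ", text).strip()
    # Drop a leading "Abstract:" style label, if present.
    text = _PREFIX_RE.sub("", text)
    return text or None
```

For example, `scrub_text("<p>Abstract: Deep&amp;wide   nets</p>")` yields `"Deep&wide nets"`, while a title such as `"Abstraction layers"` is left unchanged because the pattern requires a separator after the label.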