path: root/fatcat_scholar/schema.py
Commit message | Author | Age | Files | Lines
* transform: more string cleaning | Bryan Newbold | 2020-08-12 | 1 | -12/+59
* volume_int/issue_int as actual ints | Bryan Newbold | 2020-08-06 | 1 | -2/+2
* handle integer conversion and bounding for ES schema | Bryan Newbold | 2020-08-06 | 1 | -9/+22
* scrub_text: single-token strings skipped | Bryan Newbold | 2020-08-06 | 1 | -0/+4
* strip ACKNOWLEDGEMENTS prefix | Bryan Newbold | 2020-08-06 | 1 | -0/+1
* transform: catch more cases of null extra | Bryan Newbold | 2020-07-30 | 1 | -10/+10
    Also correctly pull issne/issnp from container.extra, not release.extra.
* abstracts: more prefixes to ignore | Bryan Newbold | 2020-07-27 | 1 | -0/+3
* strip <em> tags explicitly | Bryan Newbold | 2020-07-21 | 1 | -0/+1
* handle large/bad 'first_page' metadata | Bryan Newbold | 2020-06-29 | 1 | -0/+3
    This was causing elasticsearch indexing errors
* more conservative container_original_name | Bryan Newbold | 2020-06-29 | 1 | -0/+2
* fix lint errors (and some small bugs) | Bryan Newbold | 2020-06-29 | 1 | -2/+1
* fixes to schema parsing from prod | Bryan Newbold | 2020-06-29 | 1 | -9/+13
* include GROBID-extracted abstracts in search documents | Bryan Newbold | 2020-06-29 | 1 | -0/+8
* fetch pdftotext and pdf_meta from blobs, postgrest | Bryan Newbold | 2020-06-29 | 1 | -4/+5
    This replaces the temporary COVID-19 content hack with production
    content (text, thumbnail URLs) stored in postgrest and seaweedfs.
* commit production work-around (temporarily) | Bryan Newbold | 2020-06-04 | 1 | -1/+2
* collapse pages by SIM issue | Bryan Newbold | 2020-06-04 | 1 | -0/+1
* fmt | Bryan Newbold | 2020-06-04 | 1 | -0/+2
* start some annotaition fixes for pytype | Bryan Newbold | 2020-06-03 | 1 | -1/+3
* more flake8 | Bryan Newbold | 2020-06-03 | 1 | -1/+1
* flake8 fixes (partial) | Bryan Newbold | 2020-06-03 | 1 | -1/+1
* reformat python code with black | Bryan Newbold | 2020-06-03 | 1 | -38/+64
* improve text scrubbing | Bryan Newbold | 2020-06-03 | 1 | -13/+21
    Was going to use textpipe, but dependency was too large and failed to
    install with halfway modern GCC (due to CLD2 issue):
    https://github.com/GregBowyer/cld2-cffi/issues/12
    So instead basically pulled out the clean_text function, which is
    quite short.
* add prefix scrubing (esp. for abstracts) | Bryan Newbold | 2020-05-21 | 1 | -0/+18
* use beautiful soup for XML scrubing | Bryan Newbold | 2020-05-21 | 1 | -7/+6
* be more inclusive of author names | Bryan Newbold | 2020-05-21 | 1 | -4/+4
* fixes from manual testing | Bryan Newbold | 2020-05-20 | 1 | -7/+11
* first pass transform from pipelines to ES schema | Bryan Newbold | 2020-05-20 | 1 | -0/+334
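
Illustrative sketch (not taken from schema.py): several entries in this log mention integer bounding for the ES schema, bad 'first_page' values that broke indexing, and prefix stripping. The code below shows roughly what such helpers could look like. The names bounded_int, strip_prefix, ES_INT_MAX and SKIP_PREFIXES are hypothetical, and the assumption that the target field is a signed 32-bit Elasticsearch "integer" is ours, not the log's.

```python
from typing import Optional

# Assumption: the field is mapped as an Elasticsearch "integer"
# (signed 32-bit), so larger values would be rejected at index time.
ES_INT_MAX = 2**31 - 1

# Hypothetical list; only the ACKNOWLEDGEMENTS prefix is named in the log.
SKIP_PREFIXES = ["ACKNOWLEDGEMENTS"]


def bounded_int(raw: Optional[str]) -> Optional[int]:
    """Parse raw as a non-negative int; return None if it is missing,
    not numeric, or outside the assumed 32-bit range."""
    if raw is None:
        return None
    try:
        value = int(raw.strip())
    except ValueError:
        return None
    if 0 <= value <= ES_INT_MAX:
        return value
    return None


def strip_prefix(text: str) -> str:
    """Drop one known leading prefix and any separator right after it."""
    for prefix in SKIP_PREFIXES:
        if text.startswith(prefix):
            return text[len(prefix):].lstrip(" :.-\n\t")
    return text
```

Under these assumptions, a 'first_page' value such as "e123" or "99999999999" would map to None instead of reaching the index, while "42" would be kept as the integer 42.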