fatcat-scholar

	Commit message (Collapse)	Author	Age	Files	Lines
*	refactor DOI domain lookup into python code; expand table	Bryan Newbold	2021-01-21	1	-0/+14
\|
*	citation: fixes to generic hack; remove bibtex hack	Bryan Newbold	2021-01-21	1	-31/+6
\|
*	fixup: check for container.extra in indexing pipeline	Bryan Newbold	2021-01-21	1	-1/+3
\|
*	fix indexing bug (false-y publisher_type?)	Bryan Newbold	2021-01-18	1	-0/+2
\|
*	lint: fix small bugs and type annotations	Bryan Newbold	2021-01-18	1	-1/+2
\|
*	small corrections to schema/transform	Bryan Newbold	2021-01-16	1	-1/+4
\|
*	make fmt	Bryan Newbold	2021-01-15	1	-6/+6
\|
*	crude bibtex and citation formatting, as a demo	Bryan Newbold	2021-01-14	1	-0/+49
\|
*	schema: make fulltext body optional (eg, for search results)	Bryan Newbold	2021-01-14	1	-1/+1
\|
*	add support for new identifiers and size_bytes schema fields	Bryan Newbold	2021-01-14	1	-4/+13
\|
*	add basic html fulltext support to fetch pipeline	Bryan Newbold	2020-11-18	1	-0/+1
\|
*	schema: optional 'fetched' field on bundles	Bryan Newbold	2020-10-16	1	-0/+2
\|
*	make fmt	Bryan Newbold	2020-09-13	1	-6/+12
\|
*	ref transform: support more GROBID fields	Bryan Newbold	2020-09-13	1	-1/+4
\|
*	URL cleanup helper	Bryan Newbold	2020-09-13	1	-0/+28
\|
*	heavy to refs command	Bryan Newbold	2020-09-04	1	-0/+36
\|
*	handle small ints better (signed/unsigned abs size)	Bryan Newbold	2020-08-12	1	-1/+2
\|
*	transform: more string cleaning	Bryan Newbold	2020-08-12	1	-12/+59
\|
*	volume_int/issue_int as actual ints	Bryan Newbold	2020-08-06	1	-2/+2
\|
*	handle integer conversion and bounding for ES schema	Bryan Newbold	2020-08-06	1	-9/+22
\|
*	scrub_text: single-token strings skipped	Bryan Newbold	2020-08-06	1	-0/+4
\|
*	strip ACKNOWLEDGEMENTS prefix	Bryan Newbold	2020-08-06	1	-0/+1
\|
*	transform: catch more cases of null extra	Bryan Newbold	2020-07-30	1	-10/+10
\|	Also correctly pull issne/issnp from container.extra, not release.extra.
*	abstracts: more prefixes to ignore	Bryan Newbold	2020-07-27	1	-0/+3
\|
*	strip <em> tags explicitly	Bryan Newbold	2020-07-21	1	-0/+1
\|
*	handle large/bad 'first_page' metadata	Bryan Newbold	2020-06-29	1	-0/+3
\|	This was causing elasticsearch indexing errors
*	more conservative container_original_name	Bryan Newbold	2020-06-29	1	-0/+2
\|
*	fix lint errors (and some small bugs)	Bryan Newbold	2020-06-29	1	-2/+1
\|
*	fixes to schema parsing from prod	Bryan Newbold	2020-06-29	1	-9/+13
\|
*	include GROBID-extracted abstracts in search documents	Bryan Newbold	2020-06-29	1	-0/+8
\|
*	fetch pdftotext and pdf_meta from blobs, postgrest	Bryan Newbold	2020-06-29	1	-4/+5
\|	This replaces the temporary COVID-19 content hack with production content (text, thumbnail URLs) stored in postgrest and seaweedfs.
*	commit production work-around (temporarily)	Bryan Newbold	2020-06-04	1	-1/+2
\|
*	collapse pages by SIM issue	Bryan Newbold	2020-06-04	1	-0/+1
\|
*	fmt	Bryan Newbold	2020-06-04	1	-0/+2
\|
*	start some annotation fixes for pytype	Bryan Newbold	2020-06-03	1	-1/+3
\|
*	more flake8	Bryan Newbold	2020-06-03	1	-1/+1
\|
*	flake8 fixes (partial)	Bryan Newbold	2020-06-03	1	-1/+1
\|
*	reformat python code with black	Bryan Newbold	2020-06-03	1	-38/+64
\|
*	improve text scrubbing	Bryan Newbold	2020-06-03	1	-13/+21
\|	Was going to use textpipe, but the dependency was too large and failed to install with a halfway modern GCC (due to a CLD2 issue: https://github.com/GregBowyer/cld2-cffi/issues/12). So instead basically pulled out the clean_text function, which is quite short.
*	add prefix scrubbing (esp. for abstracts)	Bryan Newbold	2020-05-21	1	-0/+18
\|
*	use beautiful soup for XML scrubbing	Bryan Newbold	2020-05-21	1	-7/+6
\|
*	be more inclusive of author names	Bryan Newbold	2020-05-21	1	-4/+4
\|
*	fixes from manual testing	Bryan Newbold	2020-05-20	1	-7/+11
\|
*	first pass transform from pipelines to ES schema	Bryan Newbold	2020-05-20	1	-0/+334