fatcat-scholar

	Commit message	Author	Age	Files	Lines
*	transform: more clean_doi() calls	Bryan Newbold	2021-07-26	1	-3/+3
\|
*	refs transform: consolidate clean_ref_key() hacks	Bryan Newbold	2021-07-25	1	-17/+35
\|
*	refs transform: many fixes	Bryan Newbold	2021-07-25	1	-9/+34
\|	- include year correctly (many cases)
\|	- test coverage for Crossref transform
\|	- pass-through 'edition' as 'version'
\|	- series-title parsed into title or container as appropriate
\|	- missing release stage
\|	- fix 0-index vs. 1-index ref index field
*	refs transform: 1-index refs.index, not 0-index	Bryan Newbold	2021-07-25	1	-3/+11
\|	This did not match the expectations/schema of the downstream refs pipeline (cgraph), or the documented schema. Note that care is required when checking whether the index is set, to distinguish between '0' and 'None' values.
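A minimal sketch of the index handling described in this commit message, assuming a hypothetical helper named `to_one_indexed()` (the name and signature are illustrative, not taken from the repository). The point is that `0` is a valid zero-based index but is falsy in Python, so the "is it set?" check has to be `is None` rather than a truthiness test.

```python
from typing import Optional


def to_one_indexed(index: Optional[int]) -> Optional[int]:
    """Convert a 0-based ref index to 1-based, preserving 'not set'.

    Hypothetical helper for illustration only.
    """
    # `if not index:` would wrongly treat 0 (the first reference) as unset,
    # so compare against None explicitly.
    if index is None:
        return None
    return index + 1
```

With this check, the first reference (`0`) becomes `1`, while an unset index stays `None`.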
*	refs: clean up GROBID DOIs and PMCIDs	Bryan Newbold	2021-07-01	1	-2/+3
\|
*	HACK: don't parse TEI-XML for a specific paper/file	Bryan Newbold	2021-06-30	1	-2/+4
\|	GROBID v0.5.5 returns TEI-XML for this one PDF which is not valid XML, due to a text encoding issue.
*	refs: include (source) release_stage in output	Bryan Newbold	2021-06-30	1	-0/+1
\|
*	bugfix: pass full crossref obj, not just 'record'	Bryan Newbold	2021-06-02	1	-1/+1
\|
*	refs: use fatcat prefix for some sources	Bryan Newbold	2021-06-02	1	-5/+5
\|	This makes debugging what is going on much easier
*	integrate crossref references, and iterate on refs output logic	Bryan Newbold	2021-06-02	1	-7/+115
\|	Needs test coverage!
*	schema: add 'crossref' to bundle schema, and add from_json() helper	Bryan Newbold	2021-06-02	1	-26/+4
\|	from_json() refactor was an earlier TODO, to reduce duplication when updating fields on this class
*	reduce max body size to 0.5M characters	Bryan Newbold	2021-02-24	1	-1/+1
\|
*	fix body size limit	Bryan Newbold	2021-02-24	1	-4/+4
\|
*	fmt and lint fixes (including one actual bug)	Bryan Newbold	2021-02-15	1	-2/+3
\|
*	truncate indexed fulltext body at 1 MByte	Bryan Newbold	2021-02-15	1	-2/+13
\|	There was a large ~4 MByte document getting indexed (work_lumgqw4vqbgvha2ejbsbaepedq) with several megabytes of text, and this was causing elasticsearch indexing timeouts.
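A sketch of this kind of body truncation, assuming a hypothetical constant `MAX_BODY_CHARS` and helper `truncate_body()`; neither name comes from the repository, and the real code may measure or apply the limit differently. Note that this version counts characters, which only approximates bytes for mostly-ASCII text.

```python
from typing import Optional

# Hypothetical limit for illustration: roughly 1 MByte of ASCII text.
MAX_BODY_CHARS = 1_000_000


def truncate_body(body: Optional[str], limit: int = MAX_BODY_CHARS) -> Optional[str]:
    """Return `body` cut to at most `limit` characters; None passes through."""
    if body is None:
        return None
    # Slicing is a no-op for strings already within the limit.
    return body[:limit]
```

Truncating before indexing bounds the size of any single document sent to the search index, at the cost of not indexing the tail of very long texts.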
*	catch TEI-XML parsing exception	Bryan Newbold	2021-01-30	1	-12/+17
\|
*	enable sentry exceptions for workers and pipelines	Bryan Newbold	2021-01-30	1	-1/+12
\|	It is otherwise difficult to debug multi-million record pipelines.
*	bugfix: try resolving lang_code list issue again	Bryan Newbold	2021-01-30	1	-5/+4
\|
*	bugfix: lang_code sometimes a list	Bryan Newbold	2021-01-29	1	-2/+7
\|
*	make fmt	Bryan Newbold	2021-01-25	1	-1/+4
\|
*	basic support for excluding web content from index	Bryan Newbold	2021-01-22	1	-6/+45
\|	Based on particular patterns in metadata, or exclusion lists in settings
*	bug fix: more html_fulltext not getting processed	Bryan Newbold	2021-01-22	1	-0/+2
\|
*	add container_sherpa_color field, and populate it	Bryan Newbold	2021-01-22	1	-0/+1
\|
*	improve 'oa' tag calculation	Bryan Newbold	2021-01-16	1	-4/+4
\|
*	small corrections to schema/transform	Bryan Newbold	2021-01-16	1	-2/+4
\|
*	add support for new identifiers and size_bytes schema fields	Bryan Newbold	2021-01-14	1	-0/+3
\|
*	basic HTML transform/index support	Bryan Newbold	2020-11-18	1	-2/+46
\|
*	refs: extract fatcat crossref pages metadata	Bryan Newbold	2020-11-13	1	-1/+1
\|
*	commands: show usage on empty command	Bryan Newbold	2020-11-02	1	-1/+1
\|
*	more SIM metadata mappings	Bryan Newbold	2020-10-19	1	-3/+31
\|
*	SIM pipeline: more language conversions	Bryan Newbold	2020-10-16	1	-2/+5
\|	Not sure where these language strings are coming from, but these were from existing SIM item metadata in archive.org
*	transform: refactor tag generation out of transform heavy method	Bryan Newbold	2020-10-16	1	-28/+37
\|
*	Upgrade Dynaconf to 3+	Bruno Rocha	2020-10-05	1	-1/+1
\|	In Dynaconf 3+, `from dynaconf import settings` is no longer recommended; the recommendation is now to create your own instance of the settings object based on the `Dynaconf` class.
*	refs and grobid2json bugfixes from testing	Bryan Newbold	2020-09-14	1	-3/+10
\|
*	bugfix: release_year	Bryan Newbold	2020-09-13	1	-2/+2
\|
*	refs transform: both GROBID and fatcat refs	Bryan Newbold	2020-09-13	1	-5/+12
\|
*	ref transform: support more GROBID fields	Bryan Newbold	2020-09-13	1	-10/+16
\|
*	fixes to refs transform (for non-str author fields)	Bryan Newbold	2020-09-04	1	-2/+6
\|
*	heavy to refs command	Bryan Newbold	2020-09-04	1	-2/+142
\|
*	use simple names, not domain names, for some platforms	Bryan Newbold	2020-08-12	1	-3/+3
\|
*	biblio metadata hacks at transform time	Bryan Newbold	2020-08-12	1	-2/+98
\|
*	don't index sim_page without issue_item and first_page	Bryan Newbold	2020-08-06	1	-0/+3
\|
*	handle integer conversion and bounding for ES schema	Bryan Newbold	2020-08-06	1	-10/+13
\|
*	json: exclude None in output, and sort keys	Bryan Newbold	2020-07-27	1	-1/+1
\|	These are both size/performance enhancements. Not including 'None' values will reduce document sizes on-disk and over network, particularly for intermediate objects. Sorting by key should improve compression ratios across multiple documents, both on-disk (gzip) and in elasticsearch itself: https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-disk-usage.html#_put_fields_in_the_same_order_in_documents
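The two output changes described in this commit message can be sketched with the standard library alone. This assumes a hypothetical helper `compact_json()` that works on plain dicts and drops only top-level `None` values; the repository's own serialization path may differ.

```python
import json
from typing import Any, Dict


def compact_json(doc: Dict[str, Any]) -> str:
    """Serialize a dict, dropping top-level None values and sorting keys.

    Hypothetical helper for illustration only.
    """
    # Omit keys whose value is None so they take no space in the output.
    cleaned = {key: value for key, value in doc.items() if value is not None}
    # A stable key order makes consecutive documents more similar,
    # which tends to help compression.
    return json.dumps(cleaned, sort_keys=True)
```

For example, `compact_json({"title": "x", "doi": None, "abstract": "y"})` returns `{"abstract": "y", "title": "x"}`.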
*	ensure SIM release date parses before assigning	Bryan Newbold	2020-07-21	1	-1/+6
\|
*	make fmt	Bryan Newbold	2020-06-29	1	-8/+13
\|
*	include GROBID-extracted abstracts in search documents	Bryan Newbold	2020-06-29	1	-10/+15
\|
*	small improvements to SIM metadata maps	Bryan Newbold	2020-06-29	1	-6/+11
\|
*	fixes for pdf_meta dict	Bryan Newbold	2020-06-29	1	-1/+2
\|
*	remove old COVID19 thumbnail hack	Bryan Newbold	2020-06-29	1	-1/+2
\|