fatcat-scholar

	Commit message	Author	Age	Files	Lines
*	SIM transform: handle multiple publishers	Bryan Newbold	2022-01-06	1	-1/+5
\|
*	refs transform: handle rare missing ref 'id'	Bryan Newbold	2022-01-05	1	-1/+7
\| \| \| \|	This impacted one single DOI in the most recent dump/transform
*	move public domain wall to 1926 ('before 1927')	Bryan Newbold	2022-01-05	1	-1/+1
\|
*	refs: include GROBID-parsed crossref refs	Bryan Newbold	2021-12-06	1	-4/+52
\| \| \| \| \| \|	This takes advantage of Crossref 'unstructured' refs which have been parsed using GROBID and stored in the sandcrawler database, as part of the sandcrawler crossref metadata pipeline.
*	refactor use of grobid_tei_xml	Bryan Newbold	2021-10-27	1	-41/+39
\|
*	replace grobid2json with grobid_tei_xml	Bryan Newbold	2021-10-27	1	-3/+5
\| \| \| \| \|	This first iteration uses the .to_legacy_dict() helpers for backwards compatibility
*	lint: small cleanups, mostly E711 and E713	Bryan Newbold	2021-10-27	1	-3/+3
\|
*	lint: remove all 'import *' uses	Bryan Newbold	2021-10-27	1	-2/+20
\|
*	make fmt (black 21.9b0)	Bryan Newbold	2021-10-27	1	-3/+10
\|
*	re-style imports (isort) on all core python files	Bryan Newbold	2021-10-27	1	-5/+5
\|
*	better parsing of year as integer in refs pipeline	Bryan Newbold	2021-07-26	1	-2/+2
\|
*	make fmt	Bryan Newbold	2021-07-26	1	-4/+10
\|
*	ref_key: hotfix for some corner cases	Bryan Newbold	2021-07-26	1	-8/+25
\|
*	transform: more clean_doi() calls	Bryan Newbold	2021-07-26	1	-3/+3
\|
*	refs transform: consolidate clean_ref_key() hacks	Bryan Newbold	2021-07-25	1	-17/+35
\|
*	refs transform: many fixes	Bryan Newbold	2021-07-25	1	-9/+34
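\| \| \| \| \| \| \| \| \|	- include year correctly (many cases)
\| \| \| \| \| \| \| \| \|	- test coverage for Crossref transform
\| \| \| \| \| \| \| \| \|	- pass-through 'edition' as 'version'
\| \| \| \| \| \| \| \| \|	- series-title parsed in to title or container as appropriate
\| \| \| \| \| \| \| \| \|	- missing release stage
\| \| \| \| \| \| \| \| \|	- fix 0-index vs. 1-index ref index field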
*	refs transform: 1-index refs.index, not 0-index	Bryan Newbold	2021-07-25	1	-3/+11
\| \| \| \| \| \| \| \|	This was not matching expectations/schema of downstream refs pipeline (cgraph), and wasn't matching documented schema. Note care required when checking if the index is set, to distinguish between '0' and 'None' values.
*	refs: clean up GROBID DOIs and PMCIDs	Bryan Newbold	2021-07-01	1	-2/+3
\|
*	HACK: don't parse TEI-XML for a specific paper/file	Bryan Newbold	2021-06-30	1	-2/+4
\| \| \| \| \|	GROBID v0.5.5 returns TEI-XML for this one PDF which is not valid XML, due to a text encoding issue.
*	refs: include (source) release_stage in output	Bryan Newbold	2021-06-30	1	-0/+1
\|
*	bugfix: pass full crossref obj, not just 'record'	Bryan Newbold	2021-06-02	1	-1/+1
\|
*	refs: use fatcat prefix for some sources	Bryan Newbold	2021-06-02	1	-5/+5
\| \| \| \|	This makes debugging what is going on much easier
*	integrate crossref references, and iterate on refs output logic	Bryan Newbold	2021-06-02	1	-7/+115
\| \| \| \|	Needs test coverage!
*	schema: add 'crossref' to bundle schema, and add from_json() helper	Bryan Newbold	2021-06-02	1	-26/+4
\| \| \| \| \|	from_json() refactor was an earlier TODO, to reduce duplication when updating fields on this class
*	reduce max body size to 0.5M characters	Bryan Newbold	2021-02-24	1	-1/+1
\|
*	fix body size limit	Bryan Newbold	2021-02-24	1	-4/+4
\|
*	fmt and lint fixes (including one actual bug)	Bryan Newbold	2021-02-15	1	-2/+3
\|
*	truncate indexed fulltext body at 1 MByte	Bryan Newbold	2021-02-15	1	-2/+13
\| \| \| \| \| \|	There was a large ~4 MByte document getting indexed (work_lumgqw4vqbgvha2ejbsbaepedq) with several megabytes of text, and this was causing elasticsearch indexing timeouts.
*	catch TEI-XML parsing exception	Bryan Newbold	2021-01-30	1	-12/+17
\|
*	enable sentry exceptions for workers and pipelines	Bryan Newbold	2021-01-30	1	-1/+12
\| \| \| \|	It is otherwise difficult to debug multi-million record pipelines.
*	bugfix: try resolving lang_code list issue again	Bryan Newbold	2021-01-30	1	-5/+4
\|
*	bugfix: lang_code sometimes a list	Bryan Newbold	2021-01-29	1	-2/+7
\|
*	make fmt	Bryan Newbold	2021-01-25	1	-1/+4
\|
*	basic support for excluding web content from index	Bryan Newbold	2021-01-22	1	-6/+45
\| \| \| \|	Based on particular patterns in metadata, or exclusion lists in settings
*	bug fix: more html_fulltext not getting processed	Bryan Newbold	2021-01-22	1	-0/+2
\|
*	add container_sherpa_color field, and populate it	Bryan Newbold	2021-01-22	1	-0/+1
\|
*	improve 'oa' tag calculation	Bryan Newbold	2021-01-16	1	-4/+4
\|
*	small corrections to schema/transform	Bryan Newbold	2021-01-16	1	-2/+4
\|
*	add support for new identifiers and size_bytes schema fields	Bryan Newbold	2021-01-14	1	-0/+3
\|
*	basic HTML transform/index support	Bryan Newbold	2020-11-18	1	-2/+46
\|
*	refs: extract fatcat crossref pages metadata	Bryan Newbold	2020-11-13	1	-1/+1
\|
*	commands: show usage on empty command	Bryan Newbold	2020-11-02	1	-1/+1
\|
*	more SIM metadata mappings	Bryan Newbold	2020-10-19	1	-3/+31
\|
*	SIM pipeline: more language conversions	Bryan Newbold	2020-10-16	1	-2/+5
\| \| \| \| \|	Not sure where these language strings are coming from, but these were from existing SIM item metadata in archive.org
*	transform: refactor tag generation out of transform heavy method	Bryan Newbold	2020-10-16	1	-28/+37
\|
*	Upgrade Dynaconf to 3+	Bruno Rocha	2020-10-05	1	-1/+1
\| \| \| \| \| \|	In dynaconf 3+ it is no longer recommended to use `from dynaconf import settings`; the recommendation now is to create your own instance of the settings object based on the Dynaconf class.
*	refs and grobid2json bugfixes from testing	Bryan Newbold	2020-09-14	1	-3/+10
\|
*	bugfix: release_year	Bryan Newbold	2020-09-13	1	-2/+2
\|
*	refs transform: both GROBID and fatcat refs	Bryan Newbold	2020-09-13	1	-5/+12
\|
*	ref transform: support more GROBID fields	Bryan Newbold	2020-09-13	1	-10/+16
\|
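
The note on the "1-index refs.index, not 0-index" commit mentions that care is needed to tell an index of 0 apart from a missing index. A minimal sketch of that check follows. The helper name `one_indexed` is hypothetical and is not taken from the fatcat-scholar code; it only illustrates the general Python pitfall.

```python
from typing import Optional


def one_indexed(raw_index: Optional[int]) -> Optional[int]:
    """Convert a 0-based ref position to a 1-based one (hypothetical helper).

    A missing index (None) stays None.
    """
    # `if raw_index:` would be wrong here: 0 is falsy in Python, so the
    # first reference would be treated the same as a missing index.
    if raw_index is None:
        return None
    return raw_index + 1
```

With this check, `one_indexed(0)` gives `1` while `one_indexed(None)` stays `None`; a plain truthiness test would have mapped both to "no index".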