fatcat-scholar

	Commit message	Author	Age	Files	Lines
*	pull GROBID refs along with crossref records into bundles	Bryan Newbold	2021-11-10	1	-0/+1
*	lint: small cleanups, mostly E711 and E713	Bryan Newbold	2021-10-27	1	-2/+2
*	lint: remove all 'import *' uses	Bryan Newbold	2021-10-27	1	-1/+1
*	make fmt (black 21.9b0)	Bryan Newbold	2021-10-27	1	-5/+17
*	re-style imports (isort) on all core python files	Bryan Newbold	2021-10-27	1	-14/+11
*	catch/ignore ChunkedEncoding errors in fetches	Bryan Newbold	2021-06-11	1	-0/+3
*	lint fixes, and run fmt	Bryan Newbold	2021-06-02	1	-7/+7
*	add 'crossref' hydration to work pipeline	Bryan Newbold	2021-06-02	1	-0/+35
	The immediate motivation is to include recent crossref refs in citation graph transforms. May also be valuable for researchers to have authoritative/publisher metadata in the bundle dumps.
*	schema: add 'crossref' to bundle schema, and add from_json() helper	Bryan Newbold	2021-06-02	1	-0/+1
	from_json() refactor was an earlier TODO, to reduce duplication when updating fields on this class
*	Modernize Python syntax with pyupgrade --py38-plus */.py	Christian Clauss	2021-02-23	1	-1/+1
*	fmt and lint fixes (including one actual bug)	Bryan Newbold	2021-02-15	1	-1/+1
*	more seaweedfs hacks	Bryan Newbold	2021-02-12	1	-0/+8
*	enable sentry exceptions for workers and pipelines	Bryan Newbold	2021-01-30	1	-1/+10
	It is otherwise difficult to debug multi-million record pipelines.
*	work pipeline: hack to skip seaweedfs errors for now	Bryan Newbold	2021-01-26	1	-0/+5
	This isn't great because it turns a lot of problems into silent failures.
*	sort keys in work pipeline (fix typo)	Bryan Newbold	2021-01-22	1	-1/+1
*	bug fix: actually fetch/include HTML fulltext	Bryan Newbold	2021-01-22	1	-1/+1
*	add basic html fulltext support to fetch pipeline	Bryan Newbold	2020-11-18	1	-2/+46
*	commands: show usage on empty command	Bryan Newbold	2020-11-02	1	-1/+1
*	work pipeline comparison fix	Bryan Newbold	2020-10-28	1	-0/+3
*	Upgrade Dynaconf to 3+	Bruno Rocha	2020-10-05	1	-1/+1
	In dynaconf 3+ it is no longer recommended to use `from dynaconf import settings`; the recommendation now is to create your own instance of the settings object based on the Dynaconf class.
*	pipeline: skip grobid/pdftext lookups when no URL; prefer GROBID to pdftext	Bryan Newbold	2020-07-27	1	-1/+3
*	json: exclude None in output, and sort keys	Bryan Newbold	2020-07-27	1	-2/+2
	These are both size/performance enhancements. Not including 'None' values will reduce document sizes on-disk and over network, particularly for intermediate objects. Sorting by key should improve compression ratios across multiple documents, both on-disk (gzip) and in elasticsearch itself: https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-disk-usage.html#_put_fields_in_the_same_order_in_documents
*	fix lint errors (and some small bugs)	Bryan Newbold	2020-06-29	1	-6/+8
*	seaweedfs for S3 API; pull config from dynaconf	Bryan Newbold	2020-06-29	1	-11/+2
*	make fmt	Bryan Newbold	2020-06-29	1	-1/+3
*	fetch pdftotext and pdf_meta from blobs, postgrest	Bryan Newbold	2020-06-29	1	-18/+45
	This replaces the temporary COVID-19 content hack with production content (text, thumbnail URLs) stored in postgrest and seaweedfs.
*	flake8 fixes (partial)	Bryan Newbold	2020-06-03	1	-5/+2
*	reformat python code with black	Bryan Newbold	2020-06-03	1	-68/+120
*	more petabox timeout handling	Bryan Newbold	2020-05-21	1	-0/+3
*	handle petabox read timeouts a bit	Bryan Newbold	2020-05-21	1	-1/+6
*	fix typo with UnicodeDecodeError catch	Bryan Newbold	2020-05-21	1	-1/+1
*	skip pdftotext loading on unicode error	Bryan Newbold	2020-05-20	1	-0/+2
*	skip SIM items w/o page_numbers (instead of asserting)	Bryan Newbold	2020-05-20	1	-1/+3
*	fixes from manual testing	Bryan Newbold	2020-05-20	1	-8/+13
*	local pdftotext cache dir hack	Bryan Newbold	2020-05-20	1	-1/+18
*	fixes to release+sim pipeline	Bryan Newbold	2020-05-20	1	-10/+16
*	first pass transform from pipelines to ES schema	Bryan Newbold	2020-05-20	1	-16/+1
*	WIP on SIM pipeline	Bryan Newbold	2020-05-19	1	-2/+2
*	WIP on release-to-sim fetching	Bryan Newbold	2020-05-19	1	-12/+49
*	initial progress on work pipeline	Bryan Newbold	2020-05-16	1	-0/+305
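
The 2020-07-27 commit "json: exclude None in output, and sort keys" describes two serialization choices: dropping None values to shrink documents, and emitting keys in a stable order so that similar documents compress better. A minimal sketch of that idea using only the Python standard library follows. It is not the repository's actual code; the helper names `drop_none` and `compact_json` and the sample record are hypothetical, chosen for illustration.

```python
import json
from typing import Any, Dict


def drop_none(obj: Any) -> Any:
    """Recursively remove dict keys whose value is None (hypothetical helper)."""
    if isinstance(obj, dict):
        return {k: drop_none(v) for k, v in obj.items() if v is not None}
    if isinstance(obj, list):
        return [drop_none(v) for v in obj]
    return obj


def compact_json(doc: Dict[str, Any]) -> str:
    """Serialize with None-valued keys dropped and keys in sorted order."""
    return json.dumps(drop_none(doc), sort_keys=True)


# Illustrative record only; the field names are assumptions, not the bundle schema.
doc = {"title": "Example", "abstract": None, "biblio": {"year": 2020, "doi": None}}
line = compact_json(doc)
print(line)  # {"biblio": {"year": 2020}, "title": "Example"}
```

Because every document then lists its keys in the same order, consecutive lines in a dump share longer common substrings, which is what the commit message expects to help gzip and Elasticsearch storage.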