sandcrawler - [no description]

	Commit message	Author	Age	Files	Lines
*	changes from prod	Bryan Newbold	2020-06-25	1	-4/+12
*	fixes and tweaks from testing locally	Bryan Newbold	2020-06-17	1	-11/+18
*	tweak kafka topic names and seaweedfs layout	Bryan Newbold	2020-06-17	1	-1/+2
*	add new pdf workers/persisters	Bryan Newbold	2020-06-17	1	-0/+99
*	workers: refactor to pass key to process()	Bryan Newbold	2020-06-17	1	-6/+6
*	persist: only GROBID updates file_meta, not file-result	Bryan Newbold	2020-04-16	1	-1/+1
	The hope here is to reduce deadlocks in production (on aitio). As context, we are only doing "updates" until the entire file_meta table is filled in with full metadata anyways; updates are wasteful of resources, and for most inserts we have seen the file before, so we should be doing "DO NOTHING" if the SHA1 is already in the table.
*	persist grobid: add option to skip S3 upload	Bryan Newbold	2020-03-19	1	-7/+10
	Motivation for this is that the current S3 target (minio) is overloaded, with too many files on a single partition (80 million+). Going to look into seaweedfs and other options, but for now stopping minio persist. Data is all stored in kafka anyways.
*	fixes to ingest-request persist	Bryan Newbold	2020-03-05	1	-3/+1
*	persist: ingest_request tool (with no ingest_file_result)	Bryan Newbold	2020-03-05	1	-0/+29
*	pdf_trio persist fixes from prod	Bryan Newbold	2020-02-19	1	-1/+5
*	include rel and oa_status in ingest request 'extra'	Bryan Newbold	2020-02-18	1	-1/+1
*	move pdf_trio results back under key in JSON/Kafka	Bryan Newbold	2020-02-13	1	-1/+9
*	pdftrio basic python code	Bryan Newbold	2020-02-12	1	-0/+21
	This is basically just a copy/paste of GROBID code, only simpler!
*	fix persist bug where ingest_request_source not saved	Bryan Newbold	2020-02-05	1	-0/+1
*	persist grobid: actually, status_code is required	Bryan Newbold	2020-01-21	1	-2/+9
	Instead of working around it when missing, force it to exist but skip it in the database insert section. Disk mode still needs to check if blank.
*	persist: work around GROBID timeouts with no status_code	Bryan Newbold	2020-01-21	1	-2/+2
*	persist worker: implement updated ingest result semantics	Bryan Newbold	2020-01-15	1	-11/+16
*	ingest persist skips 'existing' ingest results	Bryan Newbold	2020-01-14	1	-0/+3
*	handle grobid2json errors in calling code instead	Bryan Newbold	2020-01-02	1	-1/+7
*	db: move duplicate row filtering into DB insert helpers	Bryan Newbold	2020-01-02	1	-15/+1
*	remove unused filter in grobid worker	Bryan Newbold	2020-01-02	1	-1/+0
*	fix dict typo	Bryan Newbold	2020-01-02	1	-1/+1
*	improvements to grobid persist worker	Bryan Newbold	2020-01-02	1	-13/+16
*	filter ingest results to not have key conflicts within batch	Bryan Newbold	2020-01-02	1	-1/+16
	This handles a corner case with ON CONFLICT ... DO UPDATE where you can't do multiple such updates in the same batch transaction.
*	db: fancy insert/update separation using postgres xmax	Bryan Newbold	2020-01-02	1	-9/+15
*	add PersistGrobidDiskWorker	Bryan Newbold	2020-01-02	1	-0/+33
	To help with making dumps directly from Kafka (e.g., for partner delivery)
*	flush out minio helper, add to grobid persist	Bryan Newbold	2020-01-02	1	-9/+29
*	implement counts properly for persist workers	Bryan Newbold	2020-01-02	1	-15/+19
*	start work on persist workers and tool	Bryan Newbold	2020-01-02	1	-0/+223
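
Note on the "filter ingest results to not have key conflicts within batch" commit above: the corner case it describes is that a single batched upsert using ON CONFLICT ... DO UPDATE cannot update the same row more than once. A minimal sketch of one way to filter a batch so each key appears only once is shown below. The function name `dedupe_batch`, the dict-shaped rows, and the example key field are assumptions made for illustration; this is not the code from the commit.

```python
from typing import Any, Dict, List, Sequence


def dedupe_batch(rows: List[Dict[str, Any]], key_fields: Sequence[str]) -> List[Dict[str, Any]]:
    """Return the batch with at most one row per conflict key.

    Hypothetical helper: when several rows share a key, the last one wins,
    and keys keep the order in which they were first seen.
    """
    by_key: Dict[tuple, Dict[str, Any]] = {}
    for row in rows:
        # Build the conflict key from the chosen fields (assumed to exist in every row)
        key = tuple(row[field] for field in key_fields)
        # Re-assigning an existing dict key keeps its original position
        by_key[key] = row
    return list(by_key.values())


if __name__ == "__main__":
    batch = [
        {"sha1hex": "aaa", "status": "error"},
        {"sha1hex": "bbb", "status": "success"},
        {"sha1hex": "aaa", "status": "success"},
    ]
    for row in dedupe_batch(batch, ["sha1hex"]):
        print(row)
```

Keeping the last row per key is one possible policy; keeping the first, or merging the rows, would also avoid the conflict and may suit other tables better.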