sandcrawler - [no description]

	Commit message	Author	Age	Files	Lines
*	handle grobid2json errors in calling code instead	Bryan Newbold	2020-01-02	1	-1/+7
\|
*	db: move duplicate row filtering into DB insert helpers	Bryan Newbold	2020-01-02	2	-15/+26
\|
*	remove unused filter in grobid worker	Bryan Newbold	2020-01-02	1	-1/+0
\|
*	fix dict typo	Bryan Newbold	2020-01-02	1	-1/+1
\|
*	improvements to grobid persist worker	Bryan Newbold	2020-01-02	1	-13/+16
\|
*	set mimetype when PUT to minio	Bryan Newbold	2020-01-02	1	-0/+4
\|
*	fix DB import counting	Bryan Newbold	2020-01-02	1	-4/+5
\|
*	fix small errors found by pylint	Bryan Newbold	2020-01-02	2	-1/+2
\|
*	fix sandcrawler persist workers	Bryan Newbold	2020-01-02	1	-0/+1
\|
*	filter ingest results to not have key conflicts within batch	Bryan Newbold	2020-01-02	1	-1/+16
\| \| \| \| \|	This handles a corner case with ON CONFLICT ... DO UPDATE where you can't do multiple such updates in the same batch transaction.
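A minimal sketch of the kind of in-batch filtering described above, assuming rows are dicts and that the later row for a given key should win. The function name `dedupe_batch` and the `sha1hex` key field are hypothetical, not taken from the sandcrawler code:

```python
def dedupe_batch(rows, key_fields):
    """Keep a single row per conflict key so that one batch never
    touches the same key twice. Later rows replace earlier ones
    (an assumption; the real code may keep the first instead)."""
    by_key = {}
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        by_key[key] = row
    return list(by_key.values())


batch = [
    {"sha1hex": "a", "status": "old"},
    {"sha1hex": "b", "status": "ok"},
    {"sha1hex": "a", "status": "new"},
]
deduped = dedupe_batch(batch, ["sha1hex"])
```

Filtering before the insert keeps the statement itself simple, at the cost of silently dropping the superseded rows.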
*	db: fancy insert/update separation using postgres xmax	Bryan Newbold	2020-01-02	2	-24/+45
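One way to separate inserts from updates with `xmax`, sketched under assumptions: in PostgreSQL a row freshly inserted by an upsert reports `xmax` of 0, while a row rewritten by `DO UPDATE` reports a non-zero value. The table, columns, and helper name below are hypothetical, and the query is shown only as a string, not executed:

```python
# Hypothetical upsert; example_table and its columns are illustrative only.
UPSERT_SQL = """
    INSERT INTO example_table (sha1hex, status)
    VALUES (%s, %s)
    ON CONFLICT (sha1hex) DO UPDATE SET status = EXCLUDED.status
    RETURNING xmax;
"""


def count_inserts_updates(xmax_values):
    """Split the returned xmax values into (inserts, updates).
    Values may arrive as strings or ints depending on the driver."""
    inserts = sum(1 for x in xmax_values if int(x) == 0)
    updates = len(xmax_values) - inserts
    return inserts, updates


counts = count_inserts_updates(["0", "1234", 0])
```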
\|
*	add PersistGrobidDiskWorker	Bryan Newbold	2020-01-02	1	-0/+33
\| \| \| \|	To help with making dumps directly from Kafka (eg, for partner delivery)
*	flush out minio helper, add to grobid persist	Bryan Newbold	2020-01-02	2	-22/+71
\|
*	implement counts properly for persist workers	Bryan Newbold	2020-01-02	1	-15/+19
\|
*	improve DB helpers	Bryan Newbold	2020-01-02	1	-26/+81
\| \| \| \| \|	- return insert/update row counts
\| \| \| \| \|	- implement ON CONFLICT ... DO UPDATE on some tables
*	be more parsimonious with GROBID metadata	Bryan Newbold	2020-01-02	1	-2/+4
\| \| \| \| \|	Because these are getting persisted in database (as well as kafka), don't write out empty keys.
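A rough illustration of dropping empty keys before persisting, assuming the metadata is a flat dict. The helper name and the example fields are hypothetical, and which values count as "empty" is a guess:

```python
def drop_empty(metadata):
    """Return a copy without keys whose value is None or an empty
    string, list, or dict. Zero and False are kept deliberately."""
    return {k: v for k, v in metadata.items() if v not in (None, "", [], {})}


cleaned = drop_empty({"title": "A", "doi": None, "authors": [], "year": 0})
```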
*	start work on DB connector and minio client	Bryan Newbold	2020-01-02	2	-0/+200
\|
*	have JsonLinePusher continue on JSON decode errors (but count)	Bryan Newbold	2020-01-02	1	-1/+5
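The continue-but-count behavior could look roughly like the sketch below. This is not the actual `JsonLinePusher`; the function name, the counter keys, and the `handler` callback are assumptions for illustration:

```python
import json
from collections import Counter


def push_json_lines(lines, handler):
    """Feed each decoded JSON line to handler. Lines that fail to
    decode are counted and skipped instead of aborting the run."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        counts["total"] += 1
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            counts["error-json-decode"] += 1
            continue
        handler(record)
        counts["pushed"] += 1
    return counts


seen = []
counts = push_json_lines(['{"a": 1}', "not json", "", '{"b": 2}'], seen.append)
```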
\|
*	start work on persist workers and tool	Bryan Newbold	2020-01-02	1	-0/+223
\|
*	refactor: use print(..., file=sys.stderr)	Bryan Newbold	2019-12-18	3	-25/+27
\| \| \| \|	Should use logging soon, but this seems more idiomatic in the meanwhile.
*	fixes for large GROBID result skip	Bryan Newbold	2019-12-02	1	-2/+2
\|
*	count empty blobs as 'failed' instead of crashing	Bryan Newbold	2019-12-01	1	-1/+2
\| \| \| \|	Might be better to record an artificial kafka response instead?
*	cleanup unused import	Bryan Newbold	2019-12-01	1	-1/+0
\|
*	filter out very large GROBID XML bodies	Bryan Newbold	2019-12-01	1	-0/+6
\| \| \| \| \| \| \| \| \| \|	This is to prevent Kafka MSG_SIZE_TOO_LARGE publish errors. We should probably bump this in the future. Open problems: hand-coding this size number isn't good, need to update in two places. Shouldn't filter out for non-Kafka sinks. Might still exist a corner-case where JSON encoded XML is larger than XML character string, due to encoding (eg, for unicode characters).
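The encoding corner case mentioned above can be avoided by measuring the encoded payload rather than the XML character string. A sketch under assumptions: the limit value, the `tei_xml` field, and the function name are hypothetical, and the real broker limit is not stated in the log:

```python
import json

# Hypothetical limit; the real value depends on broker and producer config.
MAX_MESSAGE_BYTES = 12_000_000


def fits_in_message(record, max_bytes=MAX_MESSAGE_BYTES):
    """Check the size of the JSON payload as it would be sent.
    json.dumps escapes non-ASCII characters by default, so the
    payload can be several times longer than the source string."""
    payload = json.dumps(record).encode("utf-8")
    return len(payload) <= max_bytes


record = {"tei_xml": "\u00e9" * 10}  # 10 characters, far more bytes once escaped
small_enough = fits_in_message(record, max_bytes=1000)
too_big = not fits_in_message(record, max_bytes=20)
```

Checking the final payload also means the limit only has to be defined in one place.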
*	CI: make some jobs manual	Bryan Newbold	2019-11-15	1	-0/+2
\| \| \| \| \|	Scalding test is broken :( But we aren't even using that code much these days.
*	handle wayback fetch redirect loop in ingest code	Bryan Newbold	2019-11-14	1	-2/+5
\|
*	bump kafka max poll interval for consumers	Bryan Newbold	2019-11-14	1	-2/+2
\| \| \| \| \|	The ingest worker keeps timing out at just over 5 minutes, so bump it just a bit.
*	handle WaybackError during ingest	Bryan Newbold	2019-11-14	1	-0/+4
\|
*	handle SPNv1 redirect loop	Bryan Newbold	2019-11-14	1	-0/+2
\|
*	handle SPNv2 polling timeout	Bryan Newbold	2019-11-14	1	-6/+10
\|
*	update ingest-file batch size to 1	Bryan Newbold	2019-11-14	1	-3/+3
\| \| \| \| \| \| \| \|	Was defaulting to 100, which I think was resulting in lots of consumer group timeouts, resulting in UNKNOWN_MEMBER_ID errors. Will probably switch back to batches of 10 or so, but with multi-processing or some other concurrent dispatch/processing.
*	start of hrmars.com ingest support	Bryan Newbold	2019-11-14	2	-2/+7
\|
*	treat failure to get terminal capture as a SavePageNowError	Bryan Newbold	2019-11-13	1	-1/+1
\|
*	citation_pdf_url with host-relative URLs	Bryan Newbold	2019-11-13	1	-1/+3
\|
*	status_forcelist is on session, not request	Bryan Newbold	2019-11-13	1	-2/+2
\|
*	handle SPNv1 remote server HTTP status codes better	Bryan Newbold	2019-11-13	1	-8/+15
\|
*	handle requests (http) redirect loop from wayback	Bryan Newbold	2019-11-13	1	-1/+4
\|
*	handle wayback client return status correctly	Bryan Newbold	2019-11-13	1	-2/+2
\|
*	allow way more errors in SPN path	Bryan Newbold	2019-11-13	1	-2/+11
\|
*	clean up redirect-following CDX API path	Bryan Newbold	2019-11-13	1	-8/+15
\|
*	fix lint errors	Bryan Newbold	2019-11-13	1	-1/+1
\|
*	improve ingest worker remote failure behavior	Bryan Newbold	2019-11-13	1	-5/+12
\|
*	have SPN client differentiate between SPN and remote errors	Bryan Newbold	2019-11-13	2	-3/+11
\| \| \| \| \| \| \| \|	This is only a partial implementation. The requests client will still make way too many SPN requests trying to figure out if this is a real error or not (eg, if remote was a 502, we'll retry many times). We may just want to switch to SPNv2 for everything.
*	rename FileIngestWorker	Bryan Newbold	2019-11-13	2	-5/+10
\|
*	refactor consume_topic name out of make_kafka_consumer()	Bryan Newbold	2019-11-13	1	-5/+5
\| \| \| \|	Best to do this in wrapping code for full flexibility.
*	more progress on file ingest	Bryan Newbold	2019-11-13	3	-16/+73
\|
*	much progress on file ingest path	Bryan Newbold	2019-10-22	5	-15/+334
\|
*	we do actually want consolidateHeader=2, not 1	Bryan Newbold	2019-10-04	1	-3/+3
\|
*	grobid: consolidateHeaders typo	Bryan Newbold	2019-10-04	1	-1/+1
\|
*	disable citation consolidation by default	Bryan Newbold	2019-10-04	1	-1/+1
\| \| \| \| \| \| \|	with this consolidation enabled, the glutton_fatcat elasticsearch server was totally pegged over 90% CPU with only 10 PDF worker threads; the glutton load seemed to be the bottleneck even for this low degree of parallelism. Disabled for now, will debug with GROBID/glutton folks.