sandcrawler - [no description]

	Commit message	Author	Age	Files	Lines
*	implement counts properly for persist workers	Bryan Newbold	2020-01-02	1	-15/+19
\|
*	improve DB helpers	Bryan Newbold	2020-01-02	1	-26/+81
\|	- return insert/update row counts
\|	- implement ON CONFLICT ... DO UPDATE on some tables
*	be more parsimonious with GROBID metadata	Bryan Newbold	2020-01-02	2	-3/+20
\|	Because these are getting persisted in the database (as well as Kafka), don't write out empty keys.
*	start work on DB connector and minio client	Bryan Newbold	2020-01-02	2	-0/+200
\|
*	have JsonLinePusher continue on JSON decode errors (but count)	Bryan Newbold	2020-01-02	1	-1/+5
\|
*	start work on persist workers and tool	Bryan Newbold	2020-01-02	3	-5/+336
\|
*	update TODO	Bryan Newbold	2019-12-26	1	-1/+7
\|
*	basic arabesque2ingestrequest script	Bryan Newbold	2019-12-24	1	-0/+69
\|
*	commit grobid_tool transform mode	Bryan Newbold	2019-12-22	1	-0/+27
\|	Had some stale code on aitio with this change I forgot to commit. Oops!
*	refactor: use print(..., file=sys.stderr)	Bryan Newbold	2019-12-18	5	-32/+34
\|	Should use logging soon, but this seems more idiomatic in the meantime.
*	refactor: sort keys in JSON output	Bryan Newbold	2019-12-18	4	-6/+7
\|	This makes debugging by tailing Kafka topics a lot more readable.
*	refactor: improve argparse usage	Bryan Newbold	2019-12-18	5	-13/+27
\|	Use ArgumentDefaultsHelpFormatter and add help messages to all sub-commands.
*	update ingest proposal source/link naming	Bryan Newbold	2019-12-13	1	-1/+1
\|
*	fixes for large GROBID result skip	Bryan Newbold	2019-12-02	1	-2/+2
\|
*	count empty blobs as 'failed' instead of crashing	Bryan Newbold	2019-12-01	1	-1/+2
\|	Might be better to record an artificial Kafka response instead?
*	cleanup unused import	Bryan Newbold	2019-12-01	1	-1/+0
\|
*	filter out very large GROBID XML bodies	Bryan Newbold	2019-12-01	1	-0/+6
\|	This is to prevent Kafka MSG_SIZE_TOO_LARGE publish errors. We should probably bump this in the future.
\|	Open problems:
\|	- hand-coding this size number isn't good; it needs to be updated in two places
\|	- shouldn't filter out for non-Kafka sinks
\|	- there might still be a corner case where the JSON-encoded XML is larger than the XML character string, due to encoding (e.g., for Unicode characters)
*	CI: make some jobs manual	Bryan Newbold	2019-11-15	1	-0/+2
\|	Scalding test is broken :( But we aren't even using that code much these days.
*	handle wayback fetch redirect loop in ingest code	Bryan Newbold	2019-11-14	1	-2/+5
\|
*	bump kafka max poll interval for consumers	Bryan Newbold	2019-11-14	1	-2/+2
\|	The ingest worker keeps timing out at just over 5 minutes, so bump it just a bit.
*	handle WaybackError during ingest	Bryan Newbold	2019-11-14	1	-0/+4
\|
*	handle SPNv1 redirect loop	Bryan Newbold	2019-11-14	1	-0/+2
\|
*	handle SPNv2 polling timeout	Bryan Newbold	2019-11-14	1	-6/+10
\|
*	update ingest-file batch size to 1	Bryan Newbold	2019-11-14	2	-4/+4
\|	Was defaulting to 100, which I think was resulting in lots of consumer group timeouts, which in turn caused UNKNOWN_MEMBER_ID errors. Will probably switch back to batches of 10 or so, but with multi-processing or some other concurrent dispatch/processing.
*	start of hrmars.com ingest support	Bryan Newbold	2019-11-14	2	-2/+7
\|
*	treat failure to get terminal capture as a SavePageNowError	Bryan Newbold	2019-11-13	1	-1/+1
\|
*	citation_pdf_url with host-relative URLs	Bryan Newbold	2019-11-13	1	-1/+3
\|
*	status_forcelist is on session, not request	Bryan Newbold	2019-11-13	1	-2/+2
\|
*	handle SPNv1 remote server HTTP status codes better	Bryan Newbold	2019-11-13	1	-8/+15
\|
*	grobid2json: make lang detection flexible	Bryan Newbold	2019-11-13	1	-1/+2
\|
*	handle requests (http) redirect loop from wayback	Bryan Newbold	2019-11-13	1	-1/+4
\|
*	handle wayback client return status correctly	Bryan Newbold	2019-11-13	1	-2/+2
\|
*	allow way more errors in SPN path	Bryan Newbold	2019-11-13	1	-2/+11
\|
*	clean up redirect-following CDX API path	Bryan Newbold	2019-11-13	1	-8/+15
\|
*	fix lint errors	Bryan Newbold	2019-11-13	2	-6/+11
\|
*	improve ingest worker remote failure behavior	Bryan Newbold	2019-11-13	1	-5/+12
\|
*	have SPN client differentiate between SPN and remote errors	Bryan Newbold	2019-11-13	2	-3/+11
\|	This is only a partial implementation. The requests client will still make way too many SPN requests trying to figure out if this is a real error or not (e.g., if the remote returned a 502, we'll retry many times). We may just want to switch to SPNv2 for everything.
*	correct ingest-file consumer group	Bryan Newbold	2019-11-13	1	-1/+1
\|
*	add basic sandcrawler worker (kafka)	Bryan Newbold	2019-11-13	1	-0/+74
\|
*	note that kafka_grobid.py is deprecated	Bryan Newbold	2019-11-13	1	-0/+3
\|
*	rename FileIngestWorker	Bryan Newbold	2019-11-13	3	-10/+16
\|
*	refactor consume_topic name out of make_kafka_consumer()	Bryan Newbold	2019-11-13	1	-5/+5
\|	Best to do this in wrapping code for full flexibility.
*	more progress on file ingest	Bryan Newbold	2019-11-13	4	-17/+75
\|
*	much progress on file ingest path	Bryan Newbold	2019-10-22	6	-335/+338
\|
*	remove spurious debug print from grobid2json	Bryan Newbold	2019-10-22	1	-1/+1
\|
*	we do actually want consolidateHeader=2, not 1	Bryan Newbold	2019-10-04	2	-4/+4
\|
*	remove any trailing newline	Bryan Newbold	2019-10-04	1	-2/+2
\|
*	grobid: consolidateHeaders typo	Bryan Newbold	2019-10-04	1	-1/+1
\|
*	grobid_tool: don't wrap multiprocess if we don't need to	Bryan Newbold	2019-10-04	1	-2/+4
\|
*	disable citation consolidation by default	Bryan Newbold	2019-10-04	1	-1/+1
\|	With this consolidation enabled, the glutton_fatcat elasticsearch server was pegged at over 90% CPU with only 10 PDF worker threads; the glutton load seemed to be the bottleneck even at this low degree of parallelism. Disabled for now, will debug with GROBID/glutton folks.