path: root/python
Commit message (author, date, files changed, lines -/+)
* fixes for large GROBID result skip (Bryan Newbold, 2019-12-02, 1 file, -2/+2)
* count empty blobs as 'failed' instead of crashing (Bryan Newbold, 2019-12-01, 1 file, -1/+2)
  Might be better to record an artificial kafka response instead?
* cleanup unused import (Bryan Newbold, 2019-12-01, 1 file, -1/+0)
* filter out very large GROBID XML bodies (Bryan Newbold, 2019-12-01, 1 file, -0/+6)
  This is to prevent Kafka MSG_SIZE_TOO_LARGE publish errors. We should probably bump this limit in the future. Open problems: hand-coding this size number isn't good, and it needs to be updated in two places; we shouldn't filter for non-Kafka sinks; and there may still be a corner case where the JSON-encoded XML is larger than the raw XML string, due to encoding (eg, unicode characters).
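  A minimal sketch of this kind of size guard, assuming a hypothetical result dict with a 'tei_xml' field and an illustrative ~10 MB cap (the actual threshold and field names in sandcrawler may differ):

      # Hypothetical sketch: replace an oversized GROBID TEI-XML result with a
      # small error record before publishing to Kafka, avoiding MSG_SIZE_TOO_LARGE.
      MAX_BODY_SIZE_BYTES = 10 * 1024 * 1024  # assumption: illustrative cap

      def guard_large_result(result: dict) -> dict:
          tei_xml = result.get('tei_xml')
          if tei_xml and len(tei_xml.encode('utf-8')) > MAX_BODY_SIZE_BYTES:
              # The JSON-encoded message can still end up somewhat larger than the
              # raw XML string (escaping, unicode), so this check is approximate.
              return {
                  'key': result.get('key'),
                  'status': 'error',
                  'error_msg': 'GROBID XML body too large: {} bytes'.format(len(tei_xml)),
              }
          return result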
* CI: make some jobs manual (Bryan Newbold, 2019-11-15, 1 file, -0/+2)
  Scalding test is broken :( But we aren't even using that code much these days.
* handle wayback fetch redirect loop in ingest code (Bryan Newbold, 2019-11-14, 1 file, -2/+5)
* bump kafka max poll interval for consumers (Bryan Newbold, 2019-11-14, 1 file, -2/+2)
  The ingest worker keeps timing out at just over 5 minutes, so bump it just a bit.
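  For context, the default max.poll.interval.ms for Kafka consumers is 300000 ms (5 minutes), which matches the ~5-minute timeouts described above. A sketch of the kind of bump this commit describes, with illustrative values rather than the repository's actual config, assuming a confluent-kafka consumer:

      # Sketch only: raise the poll interval above the 5-minute default so a slow
      # ingest batch doesn't get the consumer evicted from its group.
      from confluent_kafka import Consumer

      consumer = Consumer({
          'bootstrap.servers': 'localhost:9092',   # assumption: broker address
          'group.id': 'ingest-file-workers',       # assumption: consumer group name
          'max.poll.interval.ms': 600000,          # 10 minutes instead of the default 5
          'enable.auto.commit': False,
      })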
* handle WaybackError during ingest (Bryan Newbold, 2019-11-14, 1 file, -0/+4)
* handle SPNv1 redirect loop (Bryan Newbold, 2019-11-14, 1 file, -0/+2)
* handle SPNv2 polling timeout (Bryan Newbold, 2019-11-14, 1 file, -6/+10)
* update ingest-file batch size to 1 (Bryan Newbold, 2019-11-14, 2 files, -4/+4)
  Was defaulting to 100, which I think was causing lots of consumer group timeouts and thus UNKNOWN_MEMBER_ID errors. Will probably switch back to batches of 10 or so, but with multi-processing or some other form of concurrent dispatch/processing.
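  A sketch of the conservative consume loop this implies: poll and process one message at a time, committing after each, trading throughput for staying inside the poll interval (worker and sink names are illustrative, not the actual sandcrawler worker API):

      # Illustrative one-message-at-a-time consume loop (confluent-kafka style).
      def consume_one_at_a_time(consumer, worker, sink):
          while True:
              msg = consumer.poll(timeout=1.0)
              if msg is None:
                  continue
              if msg.error():
                  raise RuntimeError(msg.error())
              result = worker.process(msg.value())  # slow step: fetch + ingest
              sink.push_record(result)
              consumer.commit(message=msg, asynchronous=False)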
* start of hrmars.com ingest support (Bryan Newbold, 2019-11-14, 2 files, -2/+7)
* treat failure to get terminal capture as a SavePageNowError (Bryan Newbold, 2019-11-13, 1 file, -1/+1)
* citation_pdf_url with host-relative URLs (Bryan Newbold, 2019-11-13, 1 file, -1/+3)
* status_forcelist is on session, not request (Bryan Newbold, 2019-11-13, 1 file, -2/+2)
* handle SPNv1 remote server HTTP status codes better (Bryan Newbold, 2019-11-13, 1 file, -8/+15)
* grobid2json: make lang detection flexible (Bryan Newbold, 2019-11-13, 1 file, -1/+2)
* handle requests (http) redirect loop from wayback (Bryan Newbold, 2019-11-13, 1 file, -1/+4)
* handle wayback client return status correctly (Bryan Newbold, 2019-11-13, 1 file, -2/+2)
* allow way more errors in SPN path (Bryan Newbold, 2019-11-13, 1 file, -2/+11)
* clean up redirect-following CDX API path (Bryan Newbold, 2019-11-13, 1 file, -8/+15)
* fix lint errors (Bryan Newbold, 2019-11-13, 2 files, -6/+11)
* improve ingest worker remote failure behavior (Bryan Newbold, 2019-11-13, 1 file, -5/+12)
* have SPN client differentiate between SPN and remote errors (Bryan Newbold, 2019-11-13, 2 files, -3/+11)
  This is only a partial implementation. The requests client will still make way too many SPN requests trying to figure out whether this is a real error or not (eg, if the remote server returned a 502, we'll retry many times). We may just want to switch to SPNv2 for everything.
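  A sketch of the error split described here: distinguish failures of the Save Page Now service itself from errors returned by the remote site being captured. SavePageNowError appears elsewhere in this log; the remote-error class name and the exact status handling below are assumptions:

      # Illustrative only: classify an SPNv1 result into "SPN is broken" vs
      # "the remote site is broken", so callers can retry the former without
      # hammering SPN over the latter.
      class SavePageNowError(Exception):
          """The SPN service itself failed (outage, rate limit, internal error)."""

      class SavePageNowRemoteError(Exception):
          """SPN worked, but the remote host returned an error (404, 5xx, ...)."""

      def classify_spn_result(spn_status: int, remote_status: int) -> None:
          if spn_status != 200:
              raise SavePageNowError("SPN request failed: HTTP {}".format(spn_status))
          if remote_status >= 400:
              # Retrying SPN won't help here; the remote site is the problem.
              raise SavePageNowRemoteError("remote returned HTTP {}".format(remote_status))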
* correct ingest-file consumer group (Bryan Newbold, 2019-11-13, 1 file, -1/+1)
* add basic sandcrawler worker (kafka) (Bryan Newbold, 2019-11-13, 1 file, -0/+74)
* note that kafka_grobid.py is deprecated (Bryan Newbold, 2019-11-13, 1 file, -0/+3)
* rename FileIngestWorker (Bryan Newbold, 2019-11-13, 3 files, -10/+16)
* refactor consume_topic name out of make_kafka_consumer() (Bryan Newbold, 2019-11-13, 1 file, -5/+5)
  Best to do this in wrapping code for full flexibility.
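  A sketch of what that refactor might look like: the factory takes the topic name as an argument, and the calling ("wrapping") code decides which topic each worker consumes. The signature, topic, and group names are illustrative assumptions:

      # Illustrative factory: no topic name hard-coded inside the helper.
      from confluent_kafka import Consumer

      def make_kafka_consumer(hosts: str, consume_topic: str, group: str) -> Consumer:
          consumer = Consumer({
              'bootstrap.servers': hosts,
              'group.id': group,
              'enable.auto.commit': False,
          })
          consumer.subscribe([consume_topic])
          return consumer

      # wrapping code picks the topic (names here are made up for the example)
      consumer = make_kafka_consumer('localhost:9092', 'sandcrawler-dev.ingest-file-requests', 'ingest-file')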
* more progress on file ingest (Bryan Newbold, 2019-11-13, 4 files, -17/+75)
* much progress on file ingest path (Bryan Newbold, 2019-10-22, 6 files, -335/+338)
* remove spurious debug print from grobid2json (Bryan Newbold, 2019-10-22, 1 file, -1/+1)
* we do actually want consolidateHeader=2, not 1 (Bryan Newbold, 2019-10-04, 2 files, -4/+4)
* remove any trailing newline (Bryan Newbold, 2019-10-04, 1 file, -2/+2)
* grobid: consolidateHeaders typo (Bryan Newbold, 2019-10-04, 1 file, -1/+1)
* grobid_tool: don't wrap multiprocess if we don't need to (Bryan Newbold, 2019-10-04, 1 file, -2/+4)
* disable citation consolidation by default (Bryan Newbold, 2019-10-04, 1 file, -1/+1)
  With this consolidation enabled, the glutton_fatcat elasticsearch server was totally pegged at over 90% CPU with only 10 PDF worker threads; the glutton load seemed to be the bottleneck even at this low degree of parallelism. Disabled for now; will debug with the GROBID/glutton folks.
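  For reference, a sketch of a GROBID fulltext request with citation consolidation disabled but header consolidation kept at 2 (per the consolidateHeader=2 commit above). The host, timeout, and wrapper function are assumptions; consolidateHeader and consolidateCitations are standard GROBID API form fields:

      # Illustrative GROBID call with citation consolidation turned off.
      import requests

      def process_fulltext(pdf_path: str, grobid_host: str = "http://localhost:8070") -> str:
          with open(pdf_path, 'rb') as pdf_file:
              resp = requests.post(
                  grobid_host + "/api/processFulltextDocument",
                  files={'input': pdf_file},
                  data={
                      'consolidateHeader': 2,     # keep header consolidation
                      'consolidateCitations': 0,  # disabled: glutton couldn't keep up
                  },
                  timeout=180.0,
              )
          resp.raise_for_status()
          return resp.text  # TEI-XML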
* grobid-output-pg, not grobid-output-json (Bryan Newbold, 2019-10-04, 1 file, -4/+2)
* grobid_tool: don't always insert multi wrapper (Bryan Newbold, 2019-10-04, 1 file, -6/+13)
* grobid2json: language_code (Bryan Newbold, 2019-10-04, 2 files, -1/+7)
* fix GROBID POST flags (Bryan Newbold, 2019-10-04, 1 file, -1/+3)
* workers: better generic batch-size arg handling (Bryan Newbold, 2019-10-03, 1 file, -0/+6)
* handle GROBID fetch empty blob condition (Bryan Newbold, 2019-10-03, 1 file, -1/+2)
* grobid_affiliations fix from prod, and usage example (Bryan Newbold, 2019-10-02, 1 file, -0/+5)
* deliver_dumpgrobid_to_s3: typo fix from old prod (Bryan Newbold, 2019-10-02, 1 file, -3/+4)
* grobid affiliation extractor (script) (Bryan Newbold, 2019-10-02, 1 file, -0/+47)
* python tests for pusher classes (Bryan Newbold, 2019-10-02, 2 files, -0/+28)
* have grobidworker error status indicate issues instead of bailing (Bryan Newbold, 2019-10-02, 1 file, -4/+13)
* grobid_tool.py example usage in docstring (Bryan Newbold, 2019-10-02, 1 file, -0/+6)
* add tests for affiliation extraction (Bryan Newbold, 2019-10-02, 2 files, -1/+25)