path: root/python/sandcrawler/grobid.py
Commit message [Author, Date, Files, Lines -/+]
* ingest: catch more ConnectionErrors (SPN, replay fetch, GROBID) [Bryan Newbold, 2022-05-16, 1 file, -0/+9]
* grobid: set a maximum file size (256 MByte) [Bryan Newbold, 2021-12-07, 1 file, -0/+8]
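  A minimal sketch of the kind of size guard this commit describes; the constant name and result fields are illustrative, not the actual code:

    from typing import Optional

    # hypothetical limit matching the commit message
    MAX_GROBID_BLOB_SIZE: int = 256 * 1024 * 1024  # 256 MByte

    def check_blob_size(blob: bytes) -> Optional[dict]:
        """Return an error-result dict if the blob is too large for GROBID."""
        if len(blob) > MAX_GROBID_BLOB_SIZE:
            return {
                "status": "blob-too-large",
                "error_msg": f"file size {len(blob)} exceeds {MAX_GROBID_BLOB_SIZE} bytes",
            }
        return None  # acceptable size; proceed with processing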
* make fmt [Bryan Newbold, 2021-11-16, 1 file, -1/+1]
* grobid: handle XML parsing errors, and have them recorded in sandcrawler-db [Bryan Newbold, 2021-11-12, 1 file, -1/+5]
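  A sketch of the pattern using the standard-library parser; the real code parses TEI via the grobid_tei_xml library, and the status string here is an assumption:

    import xml.etree.ElementTree as ET

    def parse_tei_safely(tei_xml: str) -> dict:
        # turn malformed XML into a recordable status instead of a crash,
        # so the failure can land in sandcrawler-db for later inspection
        try:
            root = ET.fromstring(tei_xml)
        except ET.ParseError as pe:
            return {"status": "bad-grobid-xml", "error_msg": str(pe)[:1000]}
        return {"status": "success", "root_tag": root.tag}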
* grobid: extract more metadata in document TEI-XML [Bryan Newbold, 2021-11-10, 1 file, -0/+5]
* grobid: update 'TODO' comment based on review [Bryan Newbold, 2021-11-04, 1 file, -3/+0]
* crossref grobid refs: another error case (ReadTimeout) [Bryan Newbold, 2021-11-04, 1 file, -4/+6]
  With this last exception handled, was able to get through millions of rows of references with only a few dozen errors (mostly invalid XML).
* grobid: use requests session [Bryan Newbold, 2021-11-04, 1 file, -3/+4]
  This should fix an embarrassing bug with exhausting local ports:
      requests.exceptions.ConnectionError: HTTPConnectionPool(host='wbgrp-svc096.us.archive.org', port=8070): Max retries exceeded with url: /api/processCitationList (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f8dfc24e250>: Failed to establish a new connection: [Errno 99] Cannot assign requested address'))
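  The fix is the standard session-reuse pattern; a sketch with illustrative class and parameter names (only the endpoint path comes from the traceback above):

    import requests

    class GrobidClient:
        def __init__(self, host_url: str = "http://localhost:8070"):
            self.host_url = host_url
            # one pooled Session reuses TCP connections (keep-alive), instead
            # of opening a new socket per requests.post() call and eventually
            # exhausting local ephemeral ports under sustained load
            self.session = requests.Session()

        def process_citation_list(self, citations: list) -> requests.Response:
            return self.session.post(
                self.host_url + "/api/processCitationList",
                data={"citations": citations},
                timeout=60.0,  # illustrative
            )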
* grobid crossref refs: try to handle HTTP 5xx and XML parse errors [Bryan Newbold, 2021-11-04, 1 file, -4/+24]
* grobid: handle weird whitespace in 'unstructured' from crossref [Bryan Newbold, 2021-11-04, 1 file, -1/+10]
  See also: https://github.com/kermitt2/grobid/issues/849
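  A minimal sketch of the whitespace normalization this commit implies; the actual cleaning rules in grobid.py may be more involved:

    import re

    def clean_unstructured(raw: str) -> str:
        # collapse newlines, tabs, and runs of spaces into single spaces
        # before sending the citation string to GROBID
        return re.sub(r"\s+", " ", raw).strip()

    assert clean_unstructured("Doe, J.\n  (2001).\tTitle.") == "Doe, J. (2001). Title."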
* iterated GROBID citation cleaning and processing [Bryan Newbold, 2021-11-04, 1 file, -27/+45]
  Switched to using just 'key'/'id' for downstream matching.
* grobid citations: first pass at cleaning unstructured [Bryan Newbold, 2021-11-04, 1 file, -2/+34]
* initial crossref-refs via GROBID helper routine [Bryan Newbold, 2021-11-04, 1 file, -4/+121]
* remove grobid2json helper file, replace with grobid_tei_xml [Bryan Newbold, 2021-10-27, 1 file, -3/+4]
* make fmt (black 21.9b0) [Bryan Newbold, 2021-10-27, 1 file, -50/+55]
* fix type annotations for petabox body fetch helper [Bryan Newbold, 2021-10-26, 1 file, -1/+2]
* more progress on type annotations [Bryan Newbold, 2021-10-26, 1 file, -1/+3]
* grobid: fix a bug with consolidate_mode header, exposed by type annotations [Bryan Newbold, 2021-10-26, 1 file, -1/+2]
* grobid: type annotations [Bryan Newbold, 2021-10-26, 1 file, -9/+19]
* start handling trivial lint cleanups: unused imports, 'is None', etc [Bryan Newbold, 2021-10-26, 1 file, -3/+1]
* make fmt [Bryan Newbold, 2021-10-26, 1 file, -13/+17]
* python: isort all imports [Bryan Newbold, 2021-10-26, 1 file, -1/+3]
* grobid: disable biblio-glutton consolidation [Bryan Newbold, 2021-04-07, 1 file, -3/+3]
* differentiate wayback-error from wayback-content-error [Bryan Newbold, 2020-10-21, 1 file, -1/+0]
  The motivation here is to distinguish errors due to the content stored in wayback (eg, in WARCs) from operational errors (eg, wayback machine is down, or network failures/disruption).
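  A sketch of the distinction; the exception types live in sandcrawler's ia module, and this classifier function is purely illustrative:

    class WaybackError(Exception):
        """Operational error: wayback down, network failure/disruption."""

    class WaybackContentError(Exception):
        """Fetch worked, but the content stored in wayback/WARCs is bad."""

    def classify_fetch(status_code: int, body: bytes) -> bytes:
        if status_code != 200:
            raise WaybackError(f"wayback replay returned HTTP {status_code}")
        if not body:
            raise WaybackContentError("empty body stored in WARC")
        return body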
* workers: refactor to pass key to process() [Bryan Newbold, 2020-06-17, 1 file, -2/+2]
* refactor worker fetch code into wrapper class [Bryan Newbold, 2020-06-16, 1 file, -60/+9]
* timeout message implementation for GROBID and ingest workers [Bryan Newbold, 2020-04-27, 1 file, -0/+9]
* grobid petabox: fix fetch body/content [Bryan Newbold, 2020-02-03, 1 file, -1/+1]
* grobid worker: catch PetaboxError also [Bryan Newbold, 2020-01-28, 1 file, -2/+2]
* grobid worker: always set a key in response [Bryan Newbold, 2020-01-28, 1 file, -4/+25]
  We have key-based compaction enabled for the GROBID output topic, which means it is an error to publish to that topic without a key set. Hopefully this change will end these errors, which look like:
      KafkaError{code=INVALID_MSG,val=2,str="Broker: Invalid message"}
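  A sketch of the invariant: compute the key first, so every response (success or error) carries one. The sha1-of-blob choice and field names are assumptions:

    import hashlib
    from typing import Optional

    def grobid_response(blob: bytes, tei_xml: Optional[str] = None,
                        error_msg: Optional[str] = None) -> dict:
        # key is set unconditionally, before any branching, so the
        # key-compacted Kafka topic never sees a keyless message
        result = {"key": hashlib.sha1(blob).hexdigest()}
        if error_msg is not None:
            result.update(status="error", error_msg=error_msg)
        else:
            result.update(status="success", tei_xml=tei_xml)
        return result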
* grobid: fix error_msg typo; set status_code for timeouts [Bryan Newbold, 2020-01-21, 1 file, -1/+2]
* add 200 second timeout to GROBID requests [Bryan Newbold, 2020-01-17, 1 file, -8/+15]
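  A sketch of the bounded call; the endpoint is GROBID's standard fulltext API, while the result fields are illustrative:

    import requests

    def process_fulltext(session: requests.Session, host_url: str, blob: bytes) -> dict:
        try:
            resp = session.post(
                host_url + "/api/processFulltextDocument",
                files={"input": blob},
                timeout=200.0,  # the 200-second figure from the commit message
            )
        except requests.exceptions.Timeout:
            # report a status instead of hanging the worker indefinitely
            return {"status": "error-timeout", "error_msg": "GROBID request timeout"}
        return {"status_code": resp.status_code, "tei_xml": resp.text}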
* grobid worker fixes for newer ia lib refactors [Bryan Newbold, 2020-01-14, 1 file, -3/+9]
* fix grobid tests for new wayback refactors [Bryan Newbold, 2020-01-09, 1 file, -3/+3]
* be more parsimonious with GROBID metadata [Bryan Newbold, 2020-01-02, 1 file, -2/+4]
  Because these are getting persisted in the database (as well as Kafka), don't write out empty keys.
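  A minimal sketch of the idea (the function name is illustrative): drop empty values before the metadata is persisted or published:

    def trim_empty_fields(metadata: dict) -> dict:
        # omit keys whose values are None or empty containers/strings
        return {k: v for k, v in metadata.items() if v not in (None, "", [], {})}

    # example: empty abstract and empty author list are not written out
    print(trim_empty_fields({"title": "A Paper", "abstract": "", "authors": []}))
    # {'title': 'A Paper'}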
* fixes for large GROBID result skip [Bryan Newbold, 2019-12-02, 1 file, -2/+2]
* count empty blobs as 'failed' instead of crashing [Bryan Newbold, 2019-12-01, 1 file, -1/+2]
  Might be better to record an artificial kafka response instead?
* cleanup unused import [Bryan Newbold, 2019-12-01, 1 file, -1/+0]
* filter out very large GROBID XML bodies [Bryan Newbold, 2019-12-01, 1 file, -0/+6]
  This is to prevent Kafka MSG_SIZE_TOO_LARGE publish errors. We should probably bump this limit in the future. Open problems: hand-coding this size number isn't good, and it needs to be updated in two places; the filter shouldn't apply to non-Kafka sinks; and there might still be a corner case where the JSON-encoded XML is larger than the XML character string, due to encoding (eg, of unicode characters).
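  A sketch of the filter; the threshold here is invented (the commit only says the hard-coded number lives in two places), and as noted above the JSON-encoded payload can still be somewhat larger than the raw XML string:

    # illustrative threshold, comfortably under a typical Kafka message limit
    MAX_XML_SIZE = 12_000_000

    def filter_large_result(result: dict) -> dict:
        tei_xml = result.get("tei_xml", "")
        if tei_xml and len(tei_xml) > MAX_XML_SIZE:
            # keep the key so compaction still works; drop the oversized body
            return {
                "key": result.get("key"),
                "status": "error",
                "error_msg": f"response XML too large: {len(tei_xml)} bytes",
            }
        return result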
* much progress on file ingest path [Bryan Newbold, 2019-10-22, 1 file, -0/+14]
* we do actually want consolidateHeader=2, not 1 [Bryan Newbold, 2019-10-04, 1 file, -3/+3]
* grobid: fix 'consolidateHeaders' typo [Bryan Newbold, 2019-10-04, 1 file, -1/+1]
* disable citation consolidation by default [Bryan Newbold, 2019-10-04, 1 file, -1/+1]
  With this consolidation enabled, the glutton_fatcat elasticsearch server was totally pegged at over 90% CPU with only 10 PDF worker threads; the glutton load seemed to be the bottleneck even at this low degree of parallelism. Disabled for now; will debug with the GROBID/glutton folks.
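  These consolidation settings are GROBID form parameters on the processing endpoints; a sketch of the defaults the surrounding commits converge on (the helper name is illustrative):

    def grobid_post_params(consolidate_citations: bool = False) -> dict:
        return {
            # '2' per the "we do actually want consolidateHeader=2" commit
            "consolidateHeader": "2",
            # citation consolidation off by default: with it enabled, the
            # biblio-glutton/elasticsearch backend pegged above 90% CPU
            "consolidateCitations": "1" if consolidate_citations else "0",
        }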
* fix GROBID POST flags [Bryan Newbold, 2019-10-04, 1 file, -1/+3]
* handle GROBID fetch empty blob condition [Bryan Newbold, 2019-10-03, 1 file, -1/+2]
* have grobidworker error status indicate issues instead of bailing [Bryan Newbold, 2019-10-02, 1 file, -4/+13]
* more counts and bugfixes in grobid_tool [Bryan Newbold, 2019-09-26, 1 file, -4/+0]
* small improvements to GROBID tool [Bryan Newbold, 2019-09-26, 1 file, -0/+4]
* lots of grobid tool implementation (still WIP) [Bryan Newbold, 2019-09-26, 1 file, -3/+63]
* start refactoring sandcrawler python common code [Bryan Newbold, 2019-09-23, 1 file, -0/+44]