sandcrawler - [no description]

	Commit message	Author	Age	Files	Lines
*	ingest: bulk workers don't hit SPNv2	Bryan Newbold	2020-02-13	1	-0/+2
\|
*	pdftrio fixes from testing	Bryan Newbold	2020-02-13	1	-3/+9
\|
*	move pdf_trio results back under key in JSON/Kafka	Bryan Newbold	2020-02-13	2	-7/+31
\|
*	pdftrio: small fixes from testing	Bryan Newbold	2020-02-12	1	-2/+2
\|
*	pdftrio basic python code	Bryan Newbold	2020-02-12	7	-1/+393
\|	This is basically just a copy/paste of GROBID code, only simpler!
*	add ingestrequest_row2json.py	Bryan Newbold	2020-02-05	1	-0/+48
\|
*	fix persist bug where ingest_request_source not saved	Bryan Newbold	2020-02-05	1	-0/+1
\|
*	fix bug where ingest_request extra fields not persisted	Bryan Newbold	2020-02-05	1	-1/+2
\|
*	handle alternative dt format in WARC headers	Bryan Newbold	2020-02-05	1	-2/+4
\|	If there is a UTC timestamp, with trailing 'Z' indicating timezone, that is valid but increases string length by one.
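A minimal sketch of tolerating both timestamp forms, assuming the header value is an ISO 8601 datetime (`YYYY-MM-DDTHH:MM:SS`, optionally followed by `Z`) and that a 14-digit timestamp is the desired result. The function name `parse_warc_datetime` is hypothetical and is not taken from the sandcrawler code.

```python
from datetime import datetime


def parse_warc_datetime(raw: str) -> str:
    """Normalize a WARC-style datetime to a 14-digit timestamp.

    Accepts 'YYYY-MM-DDTHH:MM:SS' with or without a trailing 'Z'
    (assumed input format; the real header handling may differ).
    """
    value = raw.strip()
    # A trailing 'Z' marks UTC and makes the string one character longer,
    # so strip it instead of relying on a fixed string length.
    if value.endswith("Z"):
        value = value[:-1]
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S")
    return parsed.strftime("%Y%m%d%H%M%S")
```

Stripping the suffix before parsing avoids any check that depends on the exact length of the string.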
*	decrease SPNv2 polling timeout to 3 minutes	Bryan Newbold	2020-02-05	1	-2/+2
\|
*	improvements to reliability from prod testing	Bryan Newbold	2020-02-03	2	-7/+20
\|
*	hack-y backoff ingest attempt	Bryan Newbold	2020-02-03	2	-3/+26
\|	The goal here is to have SPNv2 requests back off when we get back-pressure (usually caused by some sessions taking too long). Lack of proper back-pressure is making it hard to turn up parallelism.
\|	This is a hack because we still time out and drop the slow request. A better way is probably to have a background thread run, while the KafkaPusher thread does polling. Maybe with timeouts to detect slow processing (greater than 30 seconds?) and only pause/resume in that case. This would also make taking batches easier. Unlike the existing code, however, the parallelism needs to happen at the Pusher level to do the polling (Kafka) and "await" (for all worker threads to complete) correctly.
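One way to picture the backoff idea is a retry loop that sleeps for longer after each back-pressure signal. This is only a sketch under assumptions: the commit does not say the delay is exponential, and `BackPressureError`, `call_with_backoff`, and all parameters are hypothetical names rather than sandcrawler's own.

```python
import time


class BackPressureError(Exception):
    """Hypothetical signal that the upstream service asked us to slow down."""


def call_with_backoff(func, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func(), doubling the wait after each back-pressure error.

    Re-raises the last BackPressureError once max_attempts is reached.
    The sleep function is injectable so the loop can be tested quickly.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except BackPressureError:
            if attempt == max_attempts:
                raise
            sleep(delay)
            delay *= 2
```

Like the hack described above, this still blocks the calling thread while it waits; it does not implement the background-thread design the message proposes.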
*	grobid petabox: fix fetch body/content	Bryan Newbold	2020-02-03	1	-1/+1
\|
*	wayback: try to resolve HTTPException due to many HTTP headers	Bryan Newbold	2020-02-02	1	-1/+9
\|	This is within the GWB wayback code. Trying two things:
\|	- bump the default max headers from 100 to 1000 in the (global?) http.client module itself. I didn't think through whether we would expect this to actually work
\|	- catch the exception, record it, move on
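The two mitigations might look roughly like the sketch below. It assumes CPython, where the header limit is the private module attribute `http.client._MAXHEADERS` (default 100); because the attribute is private, it may change between Python versions. The helper `fetch_or_record` and its result fields are hypothetical.

```python
import http.client

# Private CPython attribute (an assumption that it stays available):
# raises the per-response header limit for every user of http.client.
http.client._MAXHEADERS = 1000


def fetch_or_record(fetch, url):
    """Run fetch(url); on HTTPException, record the error and move on.

    `fetch` is any callable that returns the response body.
    """
    try:
        return {"status": "success", "body": fetch(url)}
    except http.client.HTTPException as exc:
        # Keep the exception type and message so the failure can be reviewed later.
        return {
            "status": "error",
            "error_msg": "{}: {}".format(type(exc).__name__, exc),
        }
```

Because the limit is set on the module, it also affects any other library in the same process that is built on `http.client`.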
*	sandcrawler_worker: ingest worker distinct consumer groups	Bryan Newbold	2020-01-29	1	-1/+3
\|	I'm in the process of resetting these consumer groups, so might as well take the opportunity to split by topic and use the new canonical naming format.
*	grobid worker: catch PetaboxError also	Bryan Newbold	2020-01-28	1	-2/+2
\|
*	worker kafka setting tweaks	Bryan Newbold	2020-01-28	1	-2/+4
\|	These are all attempts to get Kafka workers operating more smoothly.
*	make grobid-extract worker batch size 1	Bryan Newbold	2020-01-28	1	-0/+1
\|	This is part of attempts to fix Kafka errors that look like they might be timeouts.
*	workers: yes, poll is necessary	Bryan Newbold	2020-01-28	1	-1/+1
\|
*	grobid worker: always set a key in response	Bryan Newbold	2020-01-28	1	-4/+25
\|	We have key-based compaction enabled for the GROBID output topic. This means it is an error to publish to that topic without a key set. Hopefully this change will end these errors, which look like: KafkaError{code=INVALID_MSG,val=2,str="Broker: Invalid message"}
*	fix kafka worker partition-specific error	Bryan Newbold	2020-01-28	1	-1/+1
\|
*	fix WaybackError exception formatting	Bryan Newbold	2020-01-28	1	-1/+1
\|
*	fix elif syntax error	Bryan Newbold	2020-01-28	1	-1/+1
\|
*	block springer page-one domain	Bryan Newbold	2020-01-28	1	-0/+3
\|
*	clarify petabox fetch behavior	Bryan Newbold	2020-01-28	1	-3/+6
\|
*	re-enable figshare and zenodo crawling	Bryan Newbold	2020-01-21	1	-8/+0
\|	For daily imports
*	persist grobid: actually, status_code is required	Bryan Newbold	2020-01-21	2	-3/+10
\|	Instead of working around a missing status_code, force it to exist but skip such records in the database insert section. Disk mode still needs to check whether it is blank.
*	ingest: check for null-body before file_meta	Bryan Newbold	2020-01-21	1	-0/+3
\|	gen_file_metadata raises an assert error if body is None (or falsy in general)
*	wayback: replay redirects have X-Archive-Redirect-Reason	Bryan Newbold	2020-01-21	1	-2/+4
\|
*	persist: work around GROBID timeouts with no status_code	Bryan Newbold	2020-01-21	2	-3/+3
\|
*	grobid: fix error_msg typo; set status_code for timeouts	Bryan Newbold	2020-01-21	1	-1/+2
\|
*	add 200 second timeout to GROBID requests	Bryan Newbold	2020-01-17	1	-8/+15
\|
*	add SKIP log line for skip-url-blocklist path	Bryan Newbold	2020-01-17	1	-0/+1
\|
*	ingest: add URL blocklist feature	Bryan Newbold	2020-01-17	2	-4/+49
\|	And, temporarily, block zenodo and figshare.
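A URL blocklist check can be as small as a substring match run before any fetch is attempted. The sketch below assumes that approach; the patterns, the function name `check_blocklist`, and the result fields other than the `skip-url-blocklist` status are hypothetical.

```python
# Hypothetical patterns; the actual blocklist entries are not shown in this log.
URL_BLOCKLIST = [
    "://zenodo.org/",
    "://figshare.com/",
]


def check_blocklist(url, blocklist=URL_BLOCKLIST):
    """Return a skip result if url matches the blocklist, else None."""
    for pattern in blocklist:
        if pattern in url:
            # Status name taken from the 'skip-url-blocklist' path in the log.
            return {"hit": False, "status": "skip-url-blocklist", "url": url}
    return None
```

A caller would return the skip result immediately when it is not `None`, and continue with the normal ingest path otherwise.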
*	handle UnicodeDecodeError in the other GET instance	Bryan Newbold	2020-01-15	1	-0/+2
\|
*	increase SPNv2 polling timeout to 4 minutes	Bryan Newbold	2020-01-15	1	-1/+3
\|
*	make failed replay fetch an error, not assert error	Bryan Newbold	2020-01-15	1	-1/+2
\|
*	improve sentry reporting with 'release' git hash	Bryan Newbold	2020-01-15	2	-2/+5
\|
*	wayback replay: catch UnicodeDecodeError	Bryan Newbold	2020-01-15	1	-0/+2
\|	In prod, ran into a redirect URL like: b'/web/20200116043630id_/https://mediarep.org/bitstream/handle/doc/1127/Barth\xe9l\xe9my_2015_Life_and_Technology.pdf;jsessionid=A9EFB2798846F5E14A8473BBFD6AB46C?sequence=1' which broke requests.
*	persist: fix dupe field copying	Bryan Newbold	2020-01-15	1	-1/+8
\|	In testing hit: AttributeError: 'str' object has no attribute 'get'
*	persist worker: implement updated ingest result semantics	Bryan Newbold	2020-01-15	2	-12/+17
\|
*	clarify ingest result schema and semantics	Bryan Newbold	2020-01-15	3	-7/+32
\|
*	pass through revisit_cdx	Bryan Newbold	2020-01-15	2	-5/+21
\|
*	fix revisit resolution	Bryan Newbold	2020-01-15	1	-4/+12
\|	Returns the original CDX record, but keeps the terminal_url and terminal_sha1hex info.
*	add postgrest checks to test mocks	Bryan Newbold	2020-01-14	1	-1/+9
\|
*	tests: don't use localhost as a responses mock host	Bryan Newbold	2020-01-14	2	-6/+6
\|
*	bulk ingest file request topic support	Bryan Newbold	2020-01-14	1	-1/+7
\|
*	ingest: sketch out more of how 'existing' path would work	Bryan Newbold	2020-01-14	1	-8/+22
\|
*	ingest: check existing GROBID; also push results to sink	Bryan Newbold	2020-01-14	1	-4/+22
\|
*	ingest persist skips 'existing' ingest results	Bryan Newbold	2020-01-14	1	-0/+3
\|