Commit message (Author, Date, Files changed, Lines -/+)
* more random sandcrawler-db queries (Bryan Newbold, 2020-02-03, 2 files, -32/+62)
* grobid petabox: fix fetch body/content (Bryan Newbold, 2020-02-03, 1 file, -1/+1)
* more SQL commands (Bryan Newbold, 2020-02-02, 1 file, -0/+15)
* wayback: try to resolve HTTPException due to many HTTP headers (Bryan Newbold, 2020-02-02, 1 file, -1/+9)
  This is within GWB wayback code. Trying two things:
  - bump default max headers from 100 to 1000 in the (global?) http.client module itself. I didn't think through whether we would expect this to actually work
  - catch the exception, record it, move on
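  A minimal sketch of those two mitigations, assuming the fetch ultimately goes through Python's http.client (the _MAXHEADERS constant is a private CPython detail, and the fetch_replay/fetcher names are illustrative, not sandcrawler's actual code):

      import http.client

      # Hack: raise the private module-level header limit (default 100) so that
      # responses with very many HTTP headers no longer raise HTTPException.
      http.client._MAXHEADERS = 1000

      def fetch_replay(url, fetcher):
          # fetcher stands in for the GWB wayback client call
          try:
              return fetcher(url)
          except http.client.HTTPException as e:
              # Fallback: record the failure and move on instead of crashing the worker
              return {"status": "wayback-error", "error_msg": "HTTPException: {}".format(e)}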
* sandcrawler_worker: ingest worker distinct consumer groups (Bryan Newbold, 2020-01-29, 1 file, -1/+3)
  I'm in the process of resetting these consumer groups, so might as well take the opportunity to split by topic and use the new canonical naming format.
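  A hedged sketch of giving each worker a distinct, per-topic consumer group with confluent-kafka; the group.id pattern is hypothetical, since the commit does not spell out the canonical naming format:

      from confluent_kafka import Consumer

      def make_consumer(kafka_hosts, topic):
          conf = {
              'bootstrap.servers': kafka_hosts,
              # hypothetical per-topic group name: one consumer group per topic
              'group.id': '{}-ingest-worker'.format(topic),
              'auto.offset.reset': 'latest',
          }
          consumer = Consumer(conf)
          consumer.subscribe([topic])
          return consumer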
* 2020q1 fulltext ingest plans (Bryan Newbold, 2020-01-29, 1 file, -0/+272)
* grobid worker: catch PetaboxError also (Bryan Newbold, 2020-01-28, 1 file, -2/+2)
* worker kafka setting tweaks (Bryan Newbold, 2020-01-28, 1 file, -2/+4)
  These are all attempts to get kafka workers operating more smoothly.
* make grobid-extract worker batch size 1 (Bryan Newbold, 2020-01-28, 1 file, -0/+1)
  This is part of attempts to fix Kafka errors that look like they might be timeouts.
* sql stats: typo fix (Bryan Newbold, 2020-01-28, 1 file, -1/+1)
* sql howto: database dumps (Bryan Newbold, 2020-01-28, 1 file, -0/+7)
* workers: yes, poll is necessary (Bryan Newbold, 2020-01-28, 1 file, -1/+1)
* grobid worker: always set a key in response (Bryan Newbold, 2020-01-28, 1 file, -4/+25)
  We have key-based compaction enabled for the GROBID output topic. This means it is an error to publish to that topic without a key set. Hopefully this change will end these errors, which look like:
  KafkaError{code=INVALID_MSG,val=2,str="Broker: Invalid message"}
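  A sketch of what "always set a key" implies when producing to a compacted topic, using confluent-kafka; the topic name and the SHA-1-based key derivation are assumptions:

      import json
      from confluent_kafka import Producer

      producer = Producer({'bootstrap.servers': 'localhost:9092'})

      def publish_grobid_result(record):
          # Compacted topics reject keyless messages ("Broker: Invalid message"),
          # so derive a key even for error/timeout results.
          key = record.get('key') or record.get('sha1hex') or 'unknown'
          producer.produce(
              'sandcrawler-prod.grobid-output-pg',  # illustrative topic name
              key=key.encode('utf-8'),
              value=json.dumps(record).encode('utf-8'),
          )
          producer.poll(0)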
* fix kafka worker partition-specific error (Bryan Newbold, 2020-01-28, 1 file, -1/+1)
* fix WaybackError exception formatting (Bryan Newbold, 2020-01-28, 1 file, -1/+1)
* fix elif syntax error (Bryan Newbold, 2020-01-28, 1 file, -1/+1)
* block springer page-one domain (Bryan Newbold, 2020-01-28, 1 file, -0/+3)
* clarify petabox fetch behavior (Bryan Newbold, 2020-01-28, 1 file, -3/+6)
* re-enable figshare and zenodo crawling (Bryan Newbold, 2020-01-21, 1 file, -8/+0)
  For daily imports
* persist grobid: actually, status_code is required (Bryan Newbold, 2020-01-21, 2 files, -3/+10)
  Instead of working around a missing status_code, force it to exist but skip such records in the database insert section. Disk mode still needs to check whether it is blank.
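  A small sketch of that skip-on-insert behavior, assuming a psycopg2-style cursor; the table and column names are assumptions, not the actual sandcrawler schema:

      def persist_grobid_batch(cur, records):
          # status_code is now required; rows that still lack one are skipped at
          # insert time rather than papered over earlier in the pipeline.
          rows = [
              (r['key'], r['status_code'], r.get('status'), r.get('tei_xml'))
              for r in records
              if r.get('status_code') is not None
          ]
          cur.executemany(
              "INSERT INTO grobid (sha1hex, status_code, status, tei_xml) "
              "VALUES (%s, %s, %s, %s) ON CONFLICT DO NOTHING",
              rows,
          )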
* ingest: check for null-body before file_meta (Bryan Newbold, 2020-01-21, 1 file, -0/+3)
  gen_file_metadata raises an assert error if body is None (or false-y in general)
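  A sketch of that ordering guard: turn a missing body into a structured result before gen_file_metadata can assert on it (the surrounding ingest code and field names are assumed):

      def process_hit(resource):
          # gen_file_metadata() asserts the body is truthy, so return a structured
          # 'null-body' result instead of letting an AssertionError escape.
          if not resource.body:
              return {'status': 'null-body', 'hit': False}
          file_meta = gen_file_metadata(resource.body)
          return {'status': 'success', 'hit': True, 'file_meta': file_meta}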
* wayback: replay redirects have X-Archive-Redirect-Reason (Bryan Newbold, 2020-01-21, 1 file, -2/+4)
* persist: work around GROBID timeouts with no status_code (Bryan Newbold, 2020-01-21, 2 files, -3/+3)
* grobid: fix error_msg typo; set status_code for timeouts (Bryan Newbold, 2020-01-21, 1 file, -1/+2)
* add 200 second timeout to GROBID requests (Bryan Newbold, 2020-01-17, 1 file, -8/+15)
* add SKIP log line for skip-url-blocklist path (Bryan Newbold, 2020-01-17, 1 file, -0/+1)
* ingest: add URL blocklist feature (Bryan Newbold, 2020-01-17, 2 files, -4/+49)
  And, temporarily, block zenodo and figshare.
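  A minimal sketch of a domain blocklist check like the one described; the helper name and exact list contents are illustrative:

      from urllib.parse import urlparse

      # Temporarily blocked; these platforms get handled via other import paths.
      DOMAIN_BLOCKLIST = [
          'zenodo.org',
          'figshare.com',
      ]

      def url_is_blocked(url):
          domain = urlparse(url).netloc.lower().split(':')[0]
          return any(domain == d or domain.endswith('.' + d) for d in DOMAIN_BLOCKLIST)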
* handle UnicodeDecodeError in the other GET instance (Bryan Newbold, 2020-01-15, 1 file, -0/+2)
* increase SPNv2 polling timeout to 4 minutes (Bryan Newbold, 2020-01-15, 1 file, -1/+3)
* make failed replay fetch an error, not assert error (Bryan Newbold, 2020-01-15, 1 file, -1/+2)
* kafka config: actually we do want large bulk ingest request topic (Bryan Newbold, 2020-01-15, 1 file, -1/+1)
* improve sentry reporting with 'release' git hash (Bryan Newbold, 2020-01-15, 2 files, -2/+5)
* wayback replay: catch UnicodeDecodeError (Bryan Newbold, 2020-01-15, 1 file, -0/+2)
  In prod, ran into a redirect URL like:
  b'/web/20200116043630id_/https://mediarep.org/bitstream/handle/doc/1127/Barth\xe9l\xe9my_2015_Life_and_Technology.pdf;jsessionid=A9EFB2798846F5E14A8473BBFD6AB46C?sequence=1'
  which broke requests.
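  A hedged sketch of the fix: catch the decode failure around the replay fetch and turn it into a recorded error (the requests-based call and result shape are assumptions based on the description above):

      import requests

      def fetch_replay_body(session, url):
          # session is a requests.Session pointed at the wayback replay host
          try:
              resp = session.get(url, allow_redirects=True)
          except UnicodeDecodeError as e:
              # Some replayed redirect Location headers contain raw non-UTF-8 bytes
              # (see the mediarep.org example above); record the error instead of
              # letting it crash the worker.
              return {'status': 'wayback-error', 'error_msg': 'UnicodeDecodeError: {}'.format(e)}
          return {'status': 'success', 'body': resp.content}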
* persist: fix dupe field copying (Bryan Newbold, 2020-01-15, 1 file, -1/+8)
  In testing hit: AttributeError: 'str' object has no attribute 'get'
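  A sketch of the kind of defensive copy that avoids that AttributeError: only treat the nested field as a dict if it actually is one (field names are illustrative):

      def copy_request_fields(raw):
          result = dict(raw)
          # 'request' sometimes arrives as an already-serialized JSON string rather
          # than a dict; calling .get() on a str raises the AttributeError above.
          request = raw.get('request')
          if isinstance(request, dict):
              result['ingest_type'] = request.get('ingest_type')
              result['link_source'] = request.get('link_source')
          return result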
* persist worker: implement updated ingest result semantics (Bryan Newbold, 2020-01-15, 2 files, -12/+17)
* clarify ingest result schema and semantics (Bryan Newbold, 2020-01-15, 5 files, -30/+82)
* pass through revisit_cdx (Bryan Newbold, 2020-01-15, 2 files, -5/+21)
* fix revisit resolution (Bryan Newbold, 2020-01-15, 1 file, -4/+12)
  Returns the *original* CDX record, but keeps the terminal_url and terminal_sha1hex info.
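  A sketch of that revisit-resolution behavior: return the CDX row of the original capture (matched by digest), while carrying forward the terminal URL and SHA-1 of the capture actually fetched; the client method and field names are assumptions:

      def resolve_revisit(cdx_client, revisit_row, terminal_url, terminal_sha1hex):
          # A 'warc/revisit' capture has no body of its own, so look up the CDX row
          # of the original capture it points at (same digest, earlier datetime)...
          original = cdx_client.lookup_best(revisit_row['url'], sha1b32=revisit_row['digest'])
          # ...and return that record, keeping track of where the bytes were
          # actually fetched from.
          return {
              'cdx': original,
              'terminal_url': terminal_url,
              'terminal_sha1hex': terminal_sha1hex,
          }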
* database stats (Bryan Newbold, 2020-01-14, 2 files, -0/+289)
* add new bulk ingest request topic (Bryan Newbold, 2020-01-14, 1 file, -1/+6)
* add postgrest checks to test mocks (Bryan Newbold, 2020-01-14, 1 file, -1/+9)
* tests: don't use localhost as a responses mock host (Bryan Newbold, 2020-01-14, 2 files, -6/+6)
* bulk ingest file request topic support (Bryan Newbold, 2020-01-14, 1 file, -1/+7)
* ingest: sketch out more of how 'existing' path would work (Bryan Newbold, 2020-01-14, 1 file, -8/+22)
* ingest: check existing GROBID; also push results to sink (Bryan Newbold, 2020-01-14, 1 file, -4/+22)
* ingest persist skips 'existing' ingest results (Bryan Newbold, 2020-01-14, 1 file, -0/+3)
* grobid-to-kafka support in ingest worker (Bryan Newbold, 2020-01-14, 1 file, -0/+6)
* grobid worker fixes for newer ia lib refactors (Bryan Newbold, 2020-01-14, 1 file, -3/+9)
* small fixups to SandcrawlerPostgrestClient (Bryan Newbold, 2020-01-14, 2 files, -1/+11)
* filter out archive.org and web.archive.org (until implemented) (Bryan Newbold, 2020-01-14, 1 file, -1/+12)