sandcrawler - [no description]

	Commit message	Author	Age	Files	Lines
*	grobid persist: if status_code is not set, default to 0 (bnewbold-persist-grobid-errors)	Bryan Newbold	2020-01-28	3	-7/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We have to set something currently because of a NOT NULL constraint on the table. Originally I thought we would just not record rows if there was an error, and that is still sort of a valid stance. However, when doing bulk GROBID-ing from cdx table, there exist some "bad" CDX rows which cause wayback or petabox errors. We should fix bugs or delete these rows as a cleanup, but until that happens we should record the error state so we don't loop forever. One danger of this commit is that we can clobber existing good rows with new errors rapidly if there is wayback downtime or something like that.
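A minimal sketch of the defaulting described in this commit, assuming each GROBID result is a plain dict. The helper name `with_default_status_code` and the record layout are illustrative and are not taken from the sandcrawler code.

```python
from typing import Any, Dict


def with_default_status_code(record: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `record` whose status_code is never missing or None.

    Hypothetical helper: 0 is used as a placeholder meaning "no HTTP status
    was recorded", so a NOT NULL column can still accept the row.
    """
    out = dict(record)
    if out.get("status_code") is None:
        out["status_code"] = 0
    return out
```

An error record such as `{"status": "error-wayback"}` would come back with `status_code` set to 0, while a record that already carries a status code is returned unchanged.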
*	sql stats: typo fix	Bryan Newbold	2020-01-28	1	-1/+1
\|
*	sql howto: database dumps	Bryan Newbold	2020-01-28	1	-0/+7
\|
*	workers: yes, poll is necessary	Bryan Newbold	2020-01-28	1	-1/+1
\|
*	grobid worker: always set a key in response	Bryan Newbold	2020-01-28	1	-4/+25
\| \| \| \| \| \| \| \| \|	We have key-based compaction enabled for the GROBID output topic. This means it is an error to publish to that topic without a key set. Hopefully this change will end these errors, which look like: KafkaError{code=INVALID_MSG,val=2,str="Broker: Invalid message"}
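A rough sketch of the "always set a key" idea, assuming the worker's response is a dict and that a fallback key (for example the request's own key) is available. The names `ensure_key` and `encode_key` are hypothetical and do not come from the sandcrawler code.

```python
from typing import Any, Dict


def ensure_key(result: Dict[str, Any], fallback_key: str) -> Dict[str, Any]:
    """Return a copy of `result` that always carries a non-empty 'key'."""
    out = dict(result)
    if not out.get("key"):
        # Error paths may not have filled in a key; reuse the fallback.
        out["key"] = fallback_key
    return out


def encode_key(result: Dict[str, Any]) -> bytes:
    """Encode the message key as bytes; fail loudly if it is absent."""
    key = result.get("key")
    if not key:
        raise ValueError("refusing to publish to a compacted topic without a key")
    return key.encode("utf-8")
```

The encoded key would then be passed along with the message value when producing to the compacted topic, so the broker never sees a keyless message.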
*	fix kafka worker partition-specific error	Bryan Newbold	2020-01-28	1	-1/+1
\|
*	fix WaybackError exception formatting	Bryan Newbold	2020-01-28	1	-1/+1
\|
*	fix elif syntax error	Bryan Newbold	2020-01-28	1	-1/+1
\|
*	block springer page-one domain	Bryan Newbold	2020-01-28	1	-0/+3
\|
*	clarify petabox fetch behavior	Bryan Newbold	2020-01-28	1	-3/+6
\|
*	re-enable figshare and zenodo crawling	Bryan Newbold	2020-01-21	1	-8/+0
\| \| \| \|	For daily imports
*	persist grobid: actually, status_code is required	Bryan Newbold	2020-01-21	2	-3/+10
\| \| \| \| \| \| \|	Instead of working around when missing, force it to exist but skip in database insert section. Disk mode still needs to check if blank.
*	ingest: check for null-body before file_meta	Bryan Newbold	2020-01-21	1	-0/+3
\| \| \| \| \|	gen_file_metadata raises an assert error if body is None (or false-y in general)
*	wayback: replay redirects have X-Archive-Redirect-Reason	Bryan Newbold	2020-01-21	1	-2/+4
\|
*	persist: work around GROBID timeouts with no status_code	Bryan Newbold	2020-01-21	2	-3/+3
\|
*	grobid: fix error_msg typo; set status_code for timeouts	Bryan Newbold	2020-01-21	1	-1/+2
\|
*	add 200 second timeout to GROBID requests	Bryan Newbold	2020-01-17	1	-8/+15
\|
*	add SKIP log line for skip-url-blocklist path	Bryan Newbold	2020-01-17	1	-0/+1
\|
*	ingest: add URL blocklist feature	Bryan Newbold	2020-01-17	2	-4/+49
\| \| \| \|	And, temporarily, block zenodo and figshare.
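One way a URL blocklist check could look, as a sketch only. The list contents, the name `is_blocked`, and the choice to match on hostname are assumptions for illustration; the actual feature may match URLs differently.

```python
from typing import Iterable
from urllib.parse import urlparse

# Illustrative entries only; the real blocklist lives in the ingest code.
EXAMPLE_BLOCKLIST = ("zenodo.org", "figshare.com")


def is_blocked(url: str, blocklist: Iterable[str] = EXAMPLE_BLOCKLIST) -> bool:
    """Return True if the URL's host is a blocked domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in blocklist)
```

Matching on the parsed hostname rather than on a substring of the whole URL avoids blocking unrelated URLs that merely mention a blocked domain in their path.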
*	handle UnicodeDecodeError in the other GET instance	Bryan Newbold	2020-01-15	1	-0/+2
\|
*	increase SPNv2 polling timeout to 4 minutes	Bryan Newbold	2020-01-15	1	-1/+3
\|
*	make failed replay fetch an error, not assert error	Bryan Newbold	2020-01-15	1	-1/+2
\|
*	kafka config: actually we do want large bulk ingest request topic	Bryan Newbold	2020-01-15	1	-1/+1
\|
*	improve sentry reporting with 'release' git hash	Bryan Newbold	2020-01-15	2	-2/+5
\|
*	wayback replay: catch UnicodeDecodeError	Bryan Newbold	2020-01-15	1	-0/+2
\| \| \| \| \| \| \| \|	In prod, ran into a redirect URL like: b'/web/20200116043630id_/https://mediarep.org/bitstream/handle/doc/1127/Barth\xe9l\xe9my_2015_Life_and_Technology.pdf;jsessionid=A9EFB2798846F5E14A8473BBFD6AB46C?sequence=1' which broke requests.
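To illustrate why a URL like this fails: the byte `\xe9` is Latin-1 for "é", but in UTF-8 it starts a multi-byte sequence that the following `l` does not continue. The sketch below only shows the decode failure and one way to guard against it. The helper name `safe_decode_location` is hypothetical, and the actual commit catches the exception around the replay fetch rather than decoding by hand.

```python
from typing import Optional


def safe_decode_location(raw: bytes) -> Optional[str]:
    """Decode a redirect location as UTF-8, returning None if it is not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Treat an undecodable redirect target as a failed fetch, not a crash.
        return None
```

A caller would check for `None` and record an error status for that capture instead of letting the exception propagate.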
*	persist: fix dupe field copying	Bryan Newbold	2020-01-15	1	-1/+8
\| \| \| \| \| \|	In testing hit: AttributeError: 'str' object has no attribute 'get'
*	persist worker: implement updated ingest result semantics	Bryan Newbold	2020-01-15	2	-12/+17
\|
*	clarify ingest result schema and semantics	Bryan Newbold	2020-01-15	5	-30/+82
\|
*	pass through revisit_cdx	Bryan Newbold	2020-01-15	2	-5/+21
\|
*	fix revisit resolution	Bryan Newbold	2020-01-15	1	-4/+12
\| \| \| \| \|	Returns the original CDX record, but keeps the terminal_url and terminal_sha1hex info.
*	database stats	Bryan Newbold	2020-01-14	2	-0/+289
\|
*	add new bulk ingest request topic	Bryan Newbold	2020-01-14	1	-1/+6
\|
*	add postgrest checks to test mocks	Bryan Newbold	2020-01-14	1	-1/+9
\|
*	tests: don't use localhost as a responses mock host	Bryan Newbold	2020-01-14	2	-6/+6
\|
*	bulk ingest file request topic support	Bryan Newbold	2020-01-14	1	-1/+7
\|
*	ingest: sketch out more of how 'existing' path would work	Bryan Newbold	2020-01-14	1	-8/+22
\|
*	ingest: check existing GROBID; also push results to sink	Bryan Newbold	2020-01-14	1	-4/+22
\|
*	ingest persist skips 'existing' ingest results	Bryan Newbold	2020-01-14	1	-0/+3
\|
*	grobid-to-kafka support in ingest worker	Bryan Newbold	2020-01-14	1	-0/+6
\|
*	grobid worker fixes for newer ia lib refactors	Bryan Newbold	2020-01-14	1	-3/+9
\|
*	small fixups to SandcrawlerPostgrestClient	Bryan Newbold	2020-01-14	2	-1/+11
\|
*	filter out archive.org and web.archive.org (until implemented)	Bryan Newbold	2020-01-14	1	-1/+12
\|
*	clarify pmc/pmcid pairing	Bryan Newbold	2020-01-14	1	-3/+3
\|
*	SPNv2 doesn't support FTP; add a live test for non-revisit FTP	Bryan Newbold	2020-01-14	2	-0/+26
\|
*	more ftp status 226 support	Bryan Newbold	2020-01-14	5	-9/+23
\|
*	add live tests for ftp, revisits	Bryan Newbold	2020-01-14	1	-1/+36
\|
*	basic FTP ingest support; revisit record resolution	Bryan Newbold	2020-01-14	2	-35/+78
\| \| \| \| \| \| \|	- supporting revisits means more wayback hits (fewer crawls) => faster
	- ... but this is only partial support. will also need to work through sandcrawler db schema, etc. current status should be safe to merge/use.
	- ftp support via treating an ftp hit as a 200
*	arabesque2ingestrequest: ingest type flag	Bryan Newbold	2020-01-14	1	-1/+4
\|
*	better print() output	Bryan Newbold	2020-01-10	2	-4/+4
\|
*	fix trivial typo	Bryan Newbold	2020-01-10	1	-1/+1
\|