path: root/python/sandcrawler
Commit message [Author, Date; files changed, lines -/+]
...
* many publisher-specific ingest improvements [Bryan Newbold, 2020-01-10; 1 file, -4/+96]
* improve ingest robustness (for legacy requests) [Bryan Newbold, 2020-01-10; 1 file, -6/+12]
* support forwarding url types other than pdf_url [Bryan Newbold, 2020-01-09; 1 file, -4/+5]
* wayback: datetime mismatch as an error [Bryan Newbold, 2020-01-09; 1 file, -1/+2]
* fill in more html extraction techniques [Bryan Newbold, 2020-01-09; 1 file, -7/+6]
* refactor ingest to a loop, allowing multiple hops [Bryan Newbold, 2020-01-09; 1 file, -25/+48]
* lots of progress on wayback refactoring [Bryan Newbold, 2020-01-09; 2 files, -50/+138]
  - too much to list
  - canonical flags to control crawling
  - cdx_to_dict helper
* location comes as a string, not list [Bryan Newbold, 2020-01-09; 1 file, -1/+1]
* fix http/https issue with GlobalWayback library [Bryan Newbold, 2020-01-09; 1 file, -1/+2]
* wayback fetch via replay; confirm hashes in crawl_resource() [Bryan Newbold, 2020-01-09; 1 file, -5/+40]
* wrap up basic (locally testable) ingest refactor [Bryan Newbold, 2020-01-09; 2 files, -178/+219]
* fix grobid tests for new wayback refactors [Bryan Newbold, 2020-01-09; 1 file, -3/+3]
* more wayback and SPN tests and fixes [Bryan Newbold, 2020-01-09; 2 files, -39/+153]
* refactor CdxApiClient, add tests [Bryan Newbold, 2020-01-08; 1 file, -40/+130]
  - always use auth token and get full CDX rows
  - simplify to "fetch" (exact url/dt match) and "lookup_best" methods
  - all redirect handling will be moved to a higher level
* refactor SavePaperNowClient and add test [Bryan Newbold, 2020-01-07; 1 file, -28/+154]
  - response as a namedtuple
  - "remote" errors (aka, the SPN API returned HTTP 200 but reported an error) aren't an exception
* remove SPNv1 code paths [Bryan Newbold, 2020-01-07; 2 files, -65/+25]
* handle grobid2json errors in calling code instead [Bryan Newbold, 2020-01-02; 1 file, -1/+7]
* db: move duplicate row filtering into DB insert helpers [Bryan Newbold, 2020-01-02; 2 files, -15/+26]
* remove unused filter in grobid worker [Bryan Newbold, 2020-01-02; 1 file, -1/+0]
* fix dict typo [Bryan Newbold, 2020-01-02; 1 file, -1/+1]
* improvements to grobid persist worker [Bryan Newbold, 2020-01-02; 1 file, -13/+16]
* set mimetype when PUT to minio [Bryan Newbold, 2020-01-02; 1 file, -0/+4]
* fix DB import counting [Bryan Newbold, 2020-01-02; 1 file, -4/+5]
* fix small errors found by pylint [Bryan Newbold, 2020-01-02; 2 files, -1/+2]
* fix sandcrawler persist workers [Bryan Newbold, 2020-01-02; 1 file, -0/+1]
* filter ingest results to not have key conflicts within batch [Bryan Newbold, 2020-01-02; 1 file, -1/+16]
  This handles a corner case with ON CONFLICT ... DO UPDATE, where a single batch transaction can't apply multiple such updates to the same row.
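The commit above describes de-duplicating a batch before insert, because postgres rejects a statement whose ON CONFLICT ... DO UPDATE would touch the same row twice. A minimal sketch of that kind of within-batch filtering (function and field names are illustrative, not the actual sandcrawler helpers):

```python
def dedupe_batch(batch, key_fields=("sha1hex",)):
    """Keep only the last row per unique key, so a single
    INSERT ... ON CONFLICT DO UPDATE statement never affects
    the same row a second time (which postgres rejects)."""
    seen = {}
    for row in batch:
        key = tuple(row[f] for f in key_fields)
        seen[key] = row  # later rows win
    return list(seen.values())

rows = [
    {"sha1hex": "aaa", "status": "success"},
    {"sha1hex": "bbb", "status": "error"},
    {"sha1hex": "aaa", "status": "error"},
]
deduped = dedupe_batch(rows)
```

Keeping the last occurrence mirrors "most recent result wins" semantics; keeping the first would be an equally simple policy choice.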
* db: fancy insert/update separation using postgres xmax [Bryan Newbold, 2020-01-02; 2 files, -24/+45]
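The commit title names the postgres xmax trick: the system column `xmax` is 0 for a freshly inserted row and non-zero when ON CONFLICT ... DO UPDATE rewrote an existing row, so one RETURNING clause can report inserts and updates separately. A sketch of how this is commonly applied (the table and column names here are illustrative, not necessarily sandcrawler's actual schema):

```python
# Upsert that tags each returned row with whether it was a fresh insert.
UPSERT_SQL = """
    INSERT INTO grobid (sha1hex, status_code, metadata)
    VALUES %s
    ON CONFLICT (sha1hex) DO UPDATE
    SET status_code = EXCLUDED.status_code,
        metadata = EXCLUDED.metadata
    RETURNING xmax = 0 AS inserted
"""

def count_insert_update(returned_rows):
    """Split RETURNING results into (inserted, updated) counts.
    Each row is a 1-tuple of the boolean `xmax = 0` expression."""
    inserted = sum(1 for (is_insert,) in returned_rows if is_insert)
    return inserted, len(returned_rows) - inserted

# simulate what a cursor.fetchall() on the RETURNING clause might yield
counts = count_insert_update([(True,), (False,), (True,)])
```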
* add PersistGrobidDiskWorker [Bryan Newbold, 2020-01-02; 1 file, -0/+33]
  To help with making dumps directly from Kafka (eg, for partner delivery).
* flush out minio helper, add to grobid persist [Bryan Newbold, 2020-01-02; 2 files, -22/+71]
* implement counts properly for persist workers [Bryan Newbold, 2020-01-02; 1 file, -15/+19]
* improve DB helpers [Bryan Newbold, 2020-01-02; 1 file, -26/+81]
  - return insert/update row counts
  - implement ON CONFLICT ... DO UPDATE on some tables
* be more parsimonious with GROBID metadata [Bryan Newbold, 2020-01-02; 1 file, -2/+4]
  Because these are getting persisted in the database (as well as Kafka), don't write out empty keys.
* start work on DB connector and minio client [Bryan Newbold, 2020-01-02; 2 files, -0/+200]
* have JsonLinePusher continue on JSON decode errors (but count them) [Bryan Newbold, 2020-01-02; 1 file, -1/+5]
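The JsonLinePusher change above (skip malformed lines but count them) follows a common worker pattern. A self-contained sketch of that behavior, assuming a simplified interface rather than the actual sandcrawler class:

```python
import json

def push_json_lines(lines):
    """Parse JSON lines one at a time; skip (but count) any line that
    fails to decode, instead of letting one bad line crash the run."""
    counts = {"pushed": 0, "invalid-json": 0}
    records = []
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines entirely
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            counts["invalid-json"] += 1
            continue
        records.append(record)
        counts["pushed"] += 1
    return records, counts

records, counts = push_json_lines(['{"a": 1}', "not json", '{"b": 2}'])
```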
* start work on persist workers and tool [Bryan Newbold, 2020-01-02; 1 file, -0/+223]
* refactor: use print(..., file=sys.stderr) [Bryan Newbold, 2019-12-18; 3 files, -25/+27]
  Should use logging soon, but this seems more idiomatic in the meanwhile.
* fixes for large GROBID result skip [Bryan Newbold, 2019-12-02; 1 file, -2/+2]
* count empty blobs as 'failed' instead of crashing [Bryan Newbold, 2019-12-01; 1 file, -1/+2]
  Might be better to record an artificial kafka response instead?
* cleanup unused import [Bryan Newbold, 2019-12-01; 1 file, -1/+0]
* filter out very large GROBID XML bodies [Bryan Newbold, 2019-12-01; 1 file, -0/+6]
  This is to prevent Kafka MSG_SIZE_TOO_LARGE publish errors. We should probably bump this limit in the future. Open problems: hand-coding the size number isn't good, and it needs to be updated in two places; the filter shouldn't apply to non-Kafka sinks; and a corner case may remain where the JSON-encoded XML is larger than the raw XML character string, due to encoding (eg, of unicode characters).
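The size-filter commit above can be sketched as a simple pre-publish check. This is an illustrative reconstruction, not the actual sandcrawler code; the threshold and result-dict fields are assumptions. Measuring the UTF-8 encoded byte length (rather than `len()` of the string) addresses the encoding corner case the commit body mentions:

```python
# Kafka brokers reject messages over their configured limit with
# MSG_SIZE_TOO_LARGE, so oversized GROBID results are converted to
# error records instead of being published. Threshold is illustrative.
MAX_BLOB_SIZE = 1_000_000  # bytes; must stay below the broker's limit

def maybe_skip_large_result(result):
    """Replace a too-large GROBID result with a small error record."""
    tei_xml = result.get("tei_xml")
    if tei_xml and len(tei_xml.encode("utf-8")) > MAX_BLOB_SIZE:
        return {
            "status": "error",
            "error_msg": "response XML too large: {} bytes".format(
                len(tei_xml.encode("utf-8"))),
        }
    return result

small = maybe_skip_large_result({"status": "success", "tei_xml": "<TEI/>"})
big = maybe_skip_large_result({"status": "success", "tei_xml": "x" * 2_000_000})
```

As the commit body notes, a check like this belongs only on the Kafka sink path, not on disk or other sinks.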
* CI: make some jobs manual [Bryan Newbold, 2019-11-15; 1 file, -0/+2]
  Scalding test is broken :( But we aren't even using that code much these days.
* handle wayback fetch redirect loop in ingest code [Bryan Newbold, 2019-11-14; 1 file, -2/+5]
* bump kafka max poll interval for consumers [Bryan Newbold, 2019-11-14; 1 file, -2/+2]
  The ingest worker keeps timing out at just over 5 minutes, so bump it just a bit.
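The "just over 5 minutes" in the commit body matches Kafka's default `max.poll.interval.ms` of 300000 ms: a consumer that goes longer than that between poll() calls is evicted from its group. A hedged sketch of the kind of consumer config involved (broker address, group id, and the exact bumped value are assumptions, not the repo's real settings):

```python
# librdkafka/confluent-kafka style consumer configuration.
consumer_config = {
    "bootstrap.servers": "localhost:9092",
    "group.id": "ingest-file-workers",
    # default is 300000 (5 minutes); raised so slow wayback/SPN fetches
    # don't trigger a group rebalance mid-batch
    "max.poll.interval.ms": 360000,
    "enable.auto.commit": False,
}
```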
* handle WaybackError during ingest [Bryan Newbold, 2019-11-14; 1 file, -0/+4]
* handle SPNv1 redirect loop [Bryan Newbold, 2019-11-14; 1 file, -0/+2]
* handle SPNv2 polling timeout [Bryan Newbold, 2019-11-14; 1 file, -6/+10]
* update ingest-file batch size to 1 [Bryan Newbold, 2019-11-14; 1 file, -3/+3]
  Was defaulting to 100, which I think was resulting in lots of consumer group timeouts and thus UNKNOWN_MEMBER_ID errors. Will probably switch back to batches of 10 or so, but with multi-processing or some other concurrent dispatch/processing.
* start of hrmars.com ingest support [Bryan Newbold, 2019-11-14; 2 files, -2/+7]
* treat failure to get terminal capture as a SavePageNowError [Bryan Newbold, 2019-11-13; 1 file, -1/+1]
* citation_pdf_url with host-relative URLs [Bryan Newbold, 2019-11-13; 1 file, -1/+3]