path: root/python/sandcrawler/__init__.py
Commit message | Author | Age | Files | Lines
* local-file version of gen_file_metadata (Bryan Newbold, 2021-10-15; 1 file, -1/+1)
* refactoring; progress on filesets (Bryan Newbold, 2021-10-15; 1 file, -1/+2)
* differentiate wayback-error from wayback-content-error (Bryan Newbold, 2020-10-21; 1 file, -1/+1)
  The motivation here is to distinguish errors caused by the content stored in wayback (eg, in WARCs) from operational errors (eg, the wayback machine being down, or network failures/disruption).
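The split described in this commit might be sketched as two exception classes. The class names come from the commit title; the raising logic below is a hypothetical illustration, not sandcrawler's actual code:

```python
class WaybackError(Exception):
    """Operational failure reaching wayback (service down, network disruption)."""


class WaybackContentError(Exception):
    """The content stored in wayback itself (eg, in a WARC) is bad or unexpected."""


def classify_wayback_response(status_code, body):
    """Hypothetical helper: map a fetch result onto the two error classes."""
    if status_code in (502, 503, 504):
        # upstream/operational problem: wayback or the network, not the content
        raise WaybackError("wayback operational error: HTTP {}".format(status_code))
    if status_code == 200 and not body:
        # the archived record itself is broken
        raise WaybackContentError("empty WARC record body")
    return body
```

Callers can then retry operational errors while treating content errors as permanent failures for that record.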
* lint fixes (Bryan Newbold, 2020-06-17; 1 file, -1/+1)
* add new pdf workers/persisters (Bryan Newbold, 2020-06-17; 1 file, -2/+2)
* initial work on PDF extraction worker (Bryan Newbold, 2020-06-16; 1 file, -1/+1)
  This worker fetches full PDFs, then extracts thumbnails, raw text, and PDF metadata, similar to the GROBID worker.
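The worker's output described above might be modeled as a simple result container. The field names here are guesses inferred from the commit description, not sandcrawler's actual schema:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PdfExtractResult:
    # hypothetical result shape: one record per fetched PDF
    sha1hex: str                        # identifier of the source PDF
    status: str                         # eg "success" or an error label
    text: Optional[str] = None          # raw extracted text
    thumbnail: Optional[bytes] = None   # small preview image of the first page
    pdf_meta: Optional[dict] = None     # embedded PDF metadata (title, author, ...)
```

A record like this can be serialized and pushed to the same kinds of sinks the GROBID worker uses.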
* rename KafkaGrobidSink -> KafkaCompressSink (Bryan Newbold, 2020-06-16; 1 file, -1/+1)
* url cleaning (canonicalization) for ingest base_url (Bryan Newbold, 2020-03-10; 1 file, -1/+1)
  As mentioned in a comment, this first version does not re-write the URL in the `base_url` field. If we did so, then ingest_request rows would not SQL JOIN to ingest_file_result rows, which we wouldn't want. In the future, the behaviour should perhaps be to refuse to process URLs that aren't clean (eg, if base_url != clean_url(base_url)) and return a 'bad-url' status or something. Then we would only accept clean URLs in both tables, and clear out all old/bad URLs with a cleanup script.
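The future behaviour hinted at in this commit could look like the following minimal sketch. The `clean_url` canonicalization here is a stand-in built on `urllib.parse` only (lowercase host, drop fragment), not sandcrawler's actual implementation, and `check_base_url` is a hypothetical helper name:

```python
from urllib.parse import urlsplit, urlunsplit


def clean_url(url):
    """Minimal canonicalization sketch: lowercase scheme and host, drop fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path, parts.query, ""))


def check_base_url(base_url):
    """Refuse URLs that are not already clean, per the behaviour described in
    the commit message; returns a status string, or None if the URL is fine."""
    if base_url != clean_url(base_url):
        return "bad-url"
    return None
```

With a check like this in front of both tables, only clean URLs would ever be inserted, so the SQL JOIN between ingest_request and ingest_file_result keeps working.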
* persist: ingest_request tool (with no ingest_file_result) (Bryan Newbold, 2020-03-05; 1 file, -1/+1)
* pdftrio basic python code (Bryan Newbold, 2020-02-12; 1 file, -1/+2)
  This is basically just a copy/paste of the GROBID code, only simpler!
* small fixups to SandcrawlerPostgrestClient (Bryan Newbold, 2020-01-14; 1 file, -0/+1)
* more wayback and SPN tests and fixes (Bryan Newbold, 2020-01-09; 1 file, -1/+1)
* fix sandcrawler persist workers (Bryan Newbold, 2020-01-02; 1 file, -0/+1)
* have SPN client differentiate between SPN and remote errors (Bryan Newbold, 2019-11-13; 1 file, -1/+1)
  This is only a partial implementation. The requests client will still make far too many SPN requests trying to figure out whether this is a real error or not (eg, if the remote returned a 502, we'll retry many times). We may just want to switch to SPNv2 for everything.
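The retry behaviour this commit complains about might be sketched as follows. Everything here is hypothetical illustration: `do_capture` is a stand-in callable returning the HTTP status the remote site gave Save-Page-Now, and the status strings are invented, not sandcrawler's real ones:

```python
def save_page_now(url, do_capture, max_retries=3):
    """Sketch: retry remote 5xx errors a bounded number of times, but treat
    other failures as SPN-side errors and give up immediately."""
    last_status = None
    for _ in range(max_retries):
        status = do_capture(url)
        if status == 200:
            return "success"
        if 500 <= status < 600:
            # remote-side error: this is the loop the commit says fires
            # far too many times before concluding the error is real
            last_status = status
            continue
        # anything else is treated as an SPN-side error; retrying won't help
        return "spn-error"
    return "remote-error-{}".format(last_status)
```

Bounding `max_retries` (or moving to SPNv2, as the commit suggests) is what keeps a persistently-502ing remote from consuming the SPN request budget.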
* rename FileIngestWorker (Bryan Newbold, 2019-11-13; 1 file, -0/+1)
* lots of grobid tool implementation (still WIP) (Bryan Newbold, 2019-09-26; 1 file, -1/+4)
* re-write parse_cdx_line for sandcrawler lib (Bryan Newbold, 2019-09-25; 1 file, -1/+1)
* start refactoring sandcrawler python common code (Bryan Newbold, 2019-09-23; 1 file, -0/+3)