sandcrawler - [no description]

	Commit message	Author	Age	Files	Lines
*	more progress on type annotations and linting	Bryan Newbold	2021-10-26	1	-1/+1
\|
*	start handling trivial lint cleanups: unused imports, 'is None', etc	Bryan Newbold	2021-10-26	1	-1/+1
\|
*	make fmt	Bryan Newbold	2021-10-26	1	-8/+14
\|
*	python: isort all imports	Bryan Newbold	2021-10-26	1	-1/+2
\|
*	local-file version of gen_file_metadata	Bryan Newbold	2021-10-15	1	-1/+13
\|
*	url cleaning (canonicalization) for ingest base_url	Bryan Newbold	2020-03-10	1	-1/+7
\|	As mentioned in comment, this first version does not re-write the URL in the `base_url` field. If we did so, then ingest_request rows would not SQL JOIN to ingest_file_result rows, which we wouldn't want. In the future, behaviour should maybe be to refuse to process URLs that aren't clean (eg, if base_url != clean_url(base_url)) and return a 'bad-url' status or something. Then we would only accept clean URLs in both tables, and clear out all old/bad URLs with a cleanup script.
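The possible future behaviour described in that commit message (refuse any `base_url` that is not already in clean form) could look roughly like the sketch below. This is only an illustration: the `clean_url` here is a simplified standard-library stand-in and not sandcrawler's actual canonicalization, and the `check_base_url` helper and its `'ok'` return value are hypothetical names introduced for the example. Only the `'bad-url'` status and the `base_url != clean_url(base_url)` comparison come from the message itself.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


def clean_url(url: str) -> Optional[str]:
    """Simplified stand-in for a URL canonicalizer (assumption, not the
    project's real implementation). Returns None if the URL is unusable."""
    url = url.strip()
    try:
        parts = urlsplit(url)
    except ValueError:
        # urlsplit raises ValueError on some malformed inputs (eg, bad IPv6)
        return None
    if parts.scheme.lower() not in ("http", "https") or not parts.netloc:
        return None
    # Lowercase the scheme and network location, default an empty path
    # to "/", keep the query string, and drop the fragment.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


def check_base_url(base_url: str) -> str:
    """Hypothetical gate: accept only URLs that are already clean.

    The stored base_url is never rewritten, so rows keyed on it in
    different tables would still match each other.
    """
    if clean_url(base_url) != base_url:
        return "bad-url"
    return "ok"
```

Because the check only compares and never rewrites, a request either passes through with its original `base_url` intact or is rejected outright, which is what keeps the two tables joinable on that field.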
*	lots of grobid tool implementation (still WIP)	Bryan Newbold	2019-09-26	1	-3/+3
\|
*	re-write parse_cdx_line for sandcrawler lib	Bryan Newbold	2019-09-25	1	-1/+31
\|
*	start refactoring sandcrawler python common code	Bryan Newbold	2019-09-23	1	-0/+41