sandcrawler - [no description]

	Commit message	Author	Age	Files	Lines
*	ingest: clean_url() in more places	Bryan Newbold	2020-03-23	1	-0/+1
	Some 'cdx-error' results were due to URLs with ':' after the hostname or trailing newline ("\n") characters in the URL. This attempts to work around this category of error.
*	url cleaning (canonicalization) for ingest base_url	Bryan Newbold	2020-03-10	1	-0/+7
	As mentioned in the comment, this first version does not re-write the URL in the `base_url` field. If we did so, then ingest_request rows would not SQL JOIN to ingest_file_result rows, which we wouldn't want. In the future, behaviour should maybe be to refuse to process URLs that aren't clean (eg, if base_url != clean_url(base_url)) and return a 'bad-url' status or something. Then we would only accept clean URLs in both tables, and clear out all old/bad URLs with a cleanup script.
*	more mime normalization	Bryan Newbold	2020-02-27	1	-1/+18
*	much progress on file ingest path	Bryan Newbold	2019-10-22	1	-0/+24
*	lots of grobid tool implementation (still WIP)	Bryan Newbold	2019-09-26	1	-5/+11
*	re-write parse_cdx_line for sandcrawler lib	Bryan Newbold	2019-09-25	1	-0/+84
*	start refactoring sandcrawler python common code	Bryan Newbold	2019-09-23	1	-0/+43
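
The 2020-03-10 and 2020-03-23 commit messages describe two ideas: cleaning URLs that carry a stray ':' after the hostname or a trailing newline, and possibly refusing unclean URLs with a 'bad-url' status in the future. The sketch below illustrates both using only the Python standard library. It is not sandcrawler's actual `clean_url()`; the names `clean_url_sketch` and `check_base_url` are hypothetical, and treating the stray ':' as an empty port is an assumption.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


def clean_url_sketch(url: str) -> Optional[str]:
    """Hypothetical stand-in for clean_url(); not sandcrawler's real code."""
    # Drop surrounding whitespace, including a trailing "\n".
    url = url.strip()
    if not url:
        return None
    parts = urlsplit(url)
    # Anything without a scheme and a host is treated as unusable here.
    if not parts.scheme or not parts.netloc:
        return None
    netloc = parts.netloc
    # Assumption: "':' after the hostname" means an empty port ("example.com:").
    if netloc.endswith(":"):
        netloc = netloc[:-1]
    return urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
    )


def check_base_url(base_url: str) -> Optional[str]:
    """Return 'bad-url' when base_url is not already clean, else None."""
    # Mirrors the proposed check: base_url != clean_url(base_url).
    if clean_url_sketch(base_url) != base_url:
        return "bad-url"
    return None
```

Because the check compares rather than rewrites, `base_url` is left unchanged, so ingest_request rows would still join to ingest_file_result rows on it.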