sandcrawler - [no description]

	Commit message	Author	Age	Files	Lines
*	local-file version of gen_file_metadata	Bryan Newbold	2021-10-15	1	-1/+42
\|
*	move fuzzy URL match method to misc	Bryan Newbold	2020-11-08	1	-0/+17
\|
*	html: try to detect and mark XHTML (vs. HTML or XML)	Bryan Newbold	2020-11-08	1	-2/+4
\|
*	gen_file_metadata: allow empty/null bodies (if flag set)	Bryan Newbold	2020-11-08	1	-2/+4
\|	This is for HTML sub-resources, which can validly be empty (I think)
*	gen_file_metadata: detect JATS XML and use application/jats+xml	Bryan Newbold	2020-11-03	1	-0/+4
\|
*	cdx datetime parsing improvements	Bryan Newbold	2020-10-30	1	-0/+11
\|
*	misc: type annotations, fix parse_cdx_datetime	Bryan Newbold	2020-10-29	1	-14/+18
\|
*	ingest: clean_url() in more places	Bryan Newbold	2020-03-23	1	-0/+1
\|	Some 'cdx-error' results were due to URLs with ':' after the hostname or trailing newline ("\n") characters in the URL. This attempts to work around this category of error.
*	url cleaning (canonicalization) for ingest base_url	Bryan Newbold	2020-03-10	1	-0/+7
\|	As mentioned in comment, this first version does not re-write the URL in the `base_url` field. If we did so, then ingest_request rows would not SQL JOIN to ingest_file_result rows, which we wouldn't want. In the future, behaviour should maybe be to refuse to process URLs that aren't clean (eg, if base_url != clean_url(base_url)) and return a 'bad-url' status or something. Then we would only accept clean URLs in both tables, and clear out all old/bad URLs with a cleanup script.
*	more mime normalization	Bryan Newbold	2020-02-27	1	-1/+18
\|
*	much progress on file ingest path	Bryan Newbold	2019-10-22	1	-0/+24
\|
*	lots of grobid tool implementation (still WIP)	Bryan Newbold	2019-09-26	1	-5/+11
\|
*	re-write parse_cdx_line for sandcrawler lib	Bryan Newbold	2019-09-25	1	-0/+84
\|
*	start refactoring sandcrawler python common code	Bryan Newbold	2019-09-23	1	-0/+43
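
The 2020-03-10 and 2020-03-23 commit messages above describe URL cleaning and a possible future check of the form `base_url != clean_url(base_url)`. The following is only a rough sketch of that idea, not the code in this repository: `clean_url_sketch` and `reject_unclean` are hypothetical names, and the cleaning shown (stripping whitespace and dropping a bare ':' after the hostname) is an assumption based solely on the two error cases named in the 2020-03-23 message.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


def clean_url_sketch(raw: str) -> Optional[str]:
    """Hypothetical stand-in for clean_url(); NOT sandcrawler's implementation."""
    url = raw.strip()  # removes a trailing "\n" and surrounding whitespace
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return None  # not a URL this sketch knows how to clean
    netloc = parts.netloc
    if netloc.endswith(":"):
        netloc = netloc[:-1]  # "example.com:" -> "example.com"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


def reject_unclean(base_url: str) -> Optional[str]:
    """Return 'bad-url' if base_url is not already clean, else None (proceed)."""
    if clean_url_sketch(base_url) != base_url:
        return "bad-url"
    return None


if __name__ == "__main__":
    print(clean_url_sketch("http://example.com:/paper.pdf\n"))
    print(reject_unclean("http://example.com:/paper.pdf"))
```

Because the check compares the stored value against its cleaned form rather than rewriting it, existing `base_url` values stay byte-identical, which is the property the commit message says is needed for ingest_request rows to keep joining to ingest_file_result rows.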