path: root/python/tests/test_misc.py
Commit message | Author | Age | Files | Lines
* url cleaning (canonicalization) for ingest base_url | Bryan Newbold | 2020-03-10 | 1 | -1/+7
  As mentioned in the comment, this first version does not rewrite the URL in the `base_url` field. If it did, ingest_request rows would no longer SQL JOIN to ingest_file_result rows, which we don't want. In the future, the behaviour should maybe be to refuse to process URLs that aren't clean (e.g., if base_url != clean_url(base_url)) and return a 'bad-url' status or something. Then we would only accept clean URLs in both tables, and clear out all old/bad URLs with a cleanup script.
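
The possible future behaviour described in this commit message could look roughly like the sketch below. It is only an illustration under stated assumptions: the `clean_url` here is a simplified stand-in built on the standard library, not the canonicalizer actually used in sandcrawler, and the `check_ingest_request` helper and the `accepted` status are hypothetical names. Only the `base_url` field, the `base_url != clean_url(base_url)` comparison, and the `bad-url` status come from the message itself.

```python
from urllib.parse import urlsplit, urlunsplit


def clean_url(url: str) -> str:
    """Simplified stand-in canonicalizer (assumption, not the real clean_url).

    Strips surrounding whitespace, lowercases the scheme and host part,
    and uses "/" when the path is empty.
    """
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    # Note: this lowercases the whole netloc, including any userinfo.
    netloc = parts.netloc.lower()
    path = parts.path or "/"
    return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))


def check_ingest_request(request: dict) -> dict:
    """Hypothetical gate: refuse requests whose base_url is not already clean.

    The URL is never rewritten here, so stored base_url values stay
    identical across tables and can still be joined on.
    """
    base_url = request["base_url"]
    if base_url != clean_url(base_url):
        return {"status": "bad-url", "base_url": base_url}
    # "accepted" is a placeholder status for this sketch.
    return {"status": "accepted", "base_url": base_url}


if __name__ == "__main__":
    print(check_ingest_request({"base_url": "https://example.com/paper.pdf"}))
    print(check_ingest_request({"base_url": "HTTPS://Example.com/paper.pdf"}))
```

Rejecting unclean URLs rather than silently rewriting them keeps the `base_url` value identical in both tables, which is the join concern the message raises.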
* lots of grobid tool implementation (still WIP) | Bryan Newbold | 2019-09-26 | 1 | -3/+3
* re-write parse_cdx_line for sandcrawler lib | Bryan Newbold | 2019-09-25 | 1 | -1/+31
* start refactoring sandcrawler python common code | Bryan Newbold | 2019-09-23 | 1 | -0/+41