path: root/python/sandcrawler/misc.py
| Commit message | Author | Age | Files | Lines |
| --- | --- | --- | --- | --- |
| cdx datetime parsing improvements | Bryan Newbold | 2020-10-30 | 1 | -0/+11 |
| misc: type annotations, fix parse_cdx_datetime | Bryan Newbold | 2020-10-29 | 1 | -14/+18 |
| ingest: clean_url() in more places | Bryan Newbold | 2020-03-23 | 1 | -0/+1 |
| url cleaning (canonicalization) for ingest base_url | Bryan Newbold | 2020-03-10 | 1 | -0/+7 |
| more mime normalization | Bryan Newbold | 2020-02-27 | 1 | -1/+18 |
| much progress on file ingest path | Bryan Newbold | 2019-10-22 | 1 | -0/+24 |
| lots of grobid tool implementation (still WIP) | Bryan Newbold | 2019-09-26 | 1 | -5/+11 |
| re-write parse_cdx_line for sandcrawler lib | Bryan Newbold | 2019-09-25 | 1 | -0/+84 |
| start refactoring sandcrawler python common code | Bryan Newbold | 2019-09-23 | 1 | -0/+43 |