path: root/python/sandcrawler/misc.py
Commit message | Author | Age | Files | Lines
* local-file version of gen_file_metadata | Bryan Newbold | 2021-10-15 | 1 | -1/+42
* move fuzzy URL match method to misc | Bryan Newbold | 2020-11-08 | 1 | -0/+17
* html: try to detect and mark XHTML (vs. HTML or XML) | Bryan Newbold | 2020-11-08 | 1 | -2/+4
* gen_file_metadata: allow empty/null bodies (if flag set) | Bryan Newbold | 2020-11-08 | 1 | -2/+4
* gen_file_metadata: detect JATS XML and use application/jats+xml | Bryan Newbold | 2020-11-03 | 1 | -0/+4
* cdx datetime parsing improvements | Bryan Newbold | 2020-10-30 | 1 | -0/+11
* misc: type annotations, fix parse_cdx_datetime | Bryan Newbold | 2020-10-29 | 1 | -14/+18
* ingest: clean_url() in more places | Bryan Newbold | 2020-03-23 | 1 | -0/+1
* url cleaning (canonicalization) for ingest base_url | Bryan Newbold | 2020-03-10 | 1 | -0/+7
* more mime normalization | Bryan Newbold | 2020-02-27 | 1 | -1/+18
* much progress on file ingest path | Bryan Newbold | 2019-10-22 | 1 | -0/+24
* lots of grobid tool implementation (still WIP) | Bryan Newbold | 2019-09-26 | 1 | -5/+11
* re-write parse_cdx_line for sandcrawler lib | Bryan Newbold | 2019-09-25 | 1 | -0/+84
* start refactoring sandcrawler python common code | Bryan Newbold | 2019-09-23 | 1 | -0/+43