path: root/python/sandcrawler/misc.py
| Commit message | Author | Age | Files | Lines |
|---|---|---|---|---|
| mypy lint fixes | Bryan Newbold | 2023-01-04 | 1 | -1/+1 |
| shorten default HTTP backoff factor | Bryan Newbold | 2022-07-13 | 1 | -1/+1 |
| make fmt (black 21.9b0) | Bryan Newbold | 2021-10-27 | 1 | -51/+64 |
| lint collection membership (last lint for now) | Bryan Newbold | 2021-10-26 | 1 | -2/+2 |
| more progress on type annotations and linting | Bryan Newbold | 2021-10-26 | 1 | -8/+8 |
| start handling trivial lint cleanups: unused imports, 'is None', etc | Bryan Newbold | 2021-10-26 | 1 | -4/+4 |
| make fmt | Bryan Newbold | 2021-10-26 | 1 | -19/+45 |
| python: isort all imports | Bryan Newbold | 2021-10-26 | 1 | -5/+5 |
| improve fileset ingest integration with file ingest | Bryan Newbold | 2021-10-15 | 1 | -0/+15 |
| local-file version of gen_file_metadata | Bryan Newbold | 2021-10-15 | 1 | -1/+42 |
| move fuzzy URL match method to misc | Bryan Newbold | 2020-11-08 | 1 | -0/+17 |
| html: try to detect and mark XHTML (vs. HTML or XML) | Bryan Newbold | 2020-11-08 | 1 | -2/+4 |
| gen_file_metadata: allow empty/null bodies (if flag set) | Bryan Newbold | 2020-11-08 | 1 | -2/+4 |
| gen_file_metadata: detect JATS XML and use application/jats+xml | Bryan Newbold | 2020-11-03 | 1 | -0/+4 |
| cdx datetime parsing improvements | Bryan Newbold | 2020-10-30 | 1 | -0/+11 |
| misc: type annotations, fix parse_cdx_datetime | Bryan Newbold | 2020-10-29 | 1 | -14/+18 |
| ingest: clean_url() in more places | Bryan Newbold | 2020-03-23 | 1 | -0/+1 |
| url cleaning (canonicalization) for ingest base_url | Bryan Newbold | 2020-03-10 | 1 | -0/+7 |
| more mime normalization | Bryan Newbold | 2020-02-27 | 1 | -1/+18 |
| much progress on file ingest path | Bryan Newbold | 2019-10-22 | 1 | -0/+24 |
| lots of grobid tool implementation (still WIP) | Bryan Newbold | 2019-09-26 | 1 | -5/+11 |
| re-write parse_cdx_line for sandcrawler lib | Bryan Newbold | 2019-09-25 | 1 | -0/+84 |
| start refactoring sandcrawler python common code | Bryan Newbold | 2019-09-23 | 1 | -0/+43 |