path: root/python/sandcrawler/ingest_html.py
Commit message | Author | Date | Files | Lines
* html: pubpub platform detection | Bryan Newbold | 2022-10-24 | 1 | -0/+2
* ingest: record bad GZIP transfer decode, instead of crashing (HTML) | Bryan Newbold | 2022-07-18 | 1 | -1/+4
* html ingest: allow fuzzy CDX sha1 match based on encoding/not-encoding | Bryan Newbold | 2022-07-16 | 1 | -3/+10
* make fmt (black 21.9b0) | Bryan Newbold | 2021-10-27 | 1 | -57/+82
* start handling trivial lint cleanups: unused imports, 'is None', etc | Bryan Newbold | 2021-10-26 | 1 | -5/+4
* make fmt | Bryan Newbold | 2021-10-26 | 1 | -49/+71
* ingest_html: update trafilatura TEI-XML output kwarg | Bryan Newbold | 2021-10-26 | 1 | -1/+1
* python: isort all imports | Bryan Newbold | 2021-10-26 | 1 | -9/+9
* wrap up previous renaming work | Bryan Newbold | 2021-10-15 | 1 | -1/+1
* rename some python files for clarity | Bryan Newbold | 2021-10-15 | 1 | -0/+441