path: root/python/sandcrawler/html.py
Commit message                                     Author         Age         Files  Lines
html extract: protocols.io, fix americanarchivist  Bryan Newbold  2020-01-10  1      -1/+7
more ingest HTML extraction hacks                  Bryan Newbold  2020-01-10  1      -6/+46
many publisher-specific ingest improvements        Bryan Newbold  2020-01-10  1      -4/+96
fill in more html extraction techniques            Bryan Newbold  2020-01-09  1      -7/+6
refactor: use print(..., file=sys.stderr)          Bryan Newbold  2019-12-18  1      -1/+1
start of hrmars.com ingest support                 Bryan Newbold  2019-11-14  1      -0/+2
citation_pdf_url with host-relative URLs           Bryan Newbold  2019-11-13  1      -1/+3
more progress on file ingest                       Bryan Newbold  2019-11-13  1      -0/+19
much progress on file ingest path                  Bryan Newbold  2019-10-22  1      -0/+73