path: root/python/sandcrawler/__init__.py
| Commit message | Author | Age | Files | Lines |
|---|---|---|---|---|
| differential wayback-error from wayback-content-error | Bryan Newbold | 2020-10-21 | 1 | -1/+1 |
| lint fixes | Bryan Newbold | 2020-06-17 | 1 | -1/+1 |
| add new pdf workers/persisters | Bryan Newbold | 2020-06-17 | 1 | -2/+2 |
| initial work on PDF extraction worker | Bryan Newbold | 2020-06-16 | 1 | -1/+1 |
| rename KafkaGrobidSink -> KafkaCompressSink | Bryan Newbold | 2020-06-16 | 1 | -1/+1 |
| url cleaning (canonicalization) for ingest base_url | Bryan Newbold | 2020-03-10 | 1 | -1/+1 |
| persist: ingest_request tool (with no ingest_file_result) | Bryan Newbold | 2020-03-05 | 1 | -1/+1 |
| pdftrio basic python code | Bryan Newbold | 2020-02-12 | 1 | -1/+2 |
| small fixups to SandcrawlerPostgrestClient | Bryan Newbold | 2020-01-14 | 1 | -0/+1 |
| more wayback and SPN tests and fixes | Bryan Newbold | 2020-01-09 | 1 | -1/+1 |
| fix sandcrawler persist workers | Bryan Newbold | 2020-01-02 | 1 | -0/+1 |
| have SPN client differentiate between SPN and remote errors | Bryan Newbold | 2019-11-13 | 1 | -1/+1 |
| rename FileIngestWorker | Bryan Newbold | 2019-11-13 | 1 | -0/+1 |
| lots of grobid tool implementation (still WIP) | Bryan Newbold | 2019-09-26 | 1 | -1/+4 |
| re-write parse_cdx_line for sandcrawler lib | Bryan Newbold | 2019-09-25 | 1 | -1/+1 |
| start refactoring sandcrawler python common code | Bryan Newbold | 2019-09-23 | 1 | -0/+3 |