path: root/python/sandcrawler/__init__.py
Commit message | Author | Age | Files | Lines
* local-file version of gen_file_metadata (Bryan Newbold, 2021-10-15; 1 file, -1/+1)
* refactoring; progress on filesets (Bryan Newbold, 2021-10-15; 1 file, -1/+2)
* differentiate wayback-error from wayback-content-error (Bryan Newbold, 2020-10-21; 1 file, -1/+1)
  The motivation here is to distinguish errors caused by the content stored in wayback (eg, in WARCs) from operational errors (eg, the wayback machine being down, or network failures/disruption).
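The split described in this commit might be sketched as two exception classes. The class names come from the commit title; the raising logic below is a hypothetical illustration, not sandcrawler's actual code:

```python
class WaybackError(Exception):
    """Operational failure reaching wayback (service down, network disruption)."""


class WaybackContentError(Exception):
    """The content stored in wayback itself (eg, in a WARC) is bad or unexpected."""


def classify_wayback_response(status_code, body):
    """Hypothetical helper: map a fetch result onto the two error classes."""
    if status_code in (502, 503, 504):
        # upstream/operational problem: wayback or the network, not the content
        raise WaybackError("wayback operational error: HTTP {}".format(status_code))
    if status_code == 200 and not body:
        # the archived record itself is broken
        raise WaybackContentError("empty WARC record body")
    return body
```

Callers can then retry operational errors while treating content errors as permanent failures for that record.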
* lint fixes (Bryan Newbold, 2020-06-17; 1 file, -1/+1)
* add new pdf workers/persisters (Bryan Newbold, 2020-06-17; 1 file, -2/+2)
* initial work on PDF extraction worker (Bryan Newbold, 2020-06-16; 1 file, -1/+1)
  This worker fetches full PDFs, then extracts thumbnails, raw text, and PDF metadata, similar to the GROBID worker.
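The worker's output described above might be modeled as a simple result container. The field names here are guesses inferred from the commit description, not sandcrawler's actual schema:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PdfExtractResult:
    # hypothetical result shape: one record per fetched PDF
    sha1hex: str                        # identifier of the source PDF
    status: str                         # eg "success" or an error label
    text: Optional[str] = None          # raw extracted text
    thumbnail: Optional[bytes] = None   # small preview image of the first page
    pdf_meta: Optional[dict] = None     # embedded PDF metadata (title, author, ...)
```

A record like this can be serialized and pushed to the same kinds of sinks the GROBID worker uses.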
* rename KafkaGrobidSink -> KafkaCompressSink (Bryan Newbold, 2020-06-16; 1 file, -1/+1)
* url cleaning (canonicalization) for ingest base_url (Bryan Newbold, 2020-03-10; 1 file, -1/+1)
  As mentioned in a comment, this first version does not re-write the URL in the `base_url` field. If we did so, then ingest_request rows would not SQL JOIN to ingest_file_result rows, which we wouldn't want. In the future, the behaviour should perhaps be to refuse to process URLs that aren't clean (eg, if base_url != clean_url(base_url)) and return a 'bad-url' status or something. Then we would only accept clean URLs in both tables, and clear out all old/bad URLs with a cleanup script.
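The future behaviour hinted at in this commit could look like the following minimal sketch. The `clean_url` canonicalization here is a stand-in built on `urllib.parse` only (lowercase host, drop fragment), not sandcrawler's actual implementation, and `check_base_url` is a hypothetical helper name:

```python
from urllib.parse import urlsplit, urlunsplit


def clean_url(url):
    """Minimal canonicalization sketch: lowercase scheme and host, drop fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path, parts.query, ""))


def check_base_url(base_url):
    """Refuse URLs that are not already clean, per the behaviour described in
    the commit message; returns a status string, or None if the URL is fine."""
    if base_url != clean_url(base_url):
        return "bad-url"
    return None
```

With a check like this in front of both tables, only clean URLs would ever be inserted, so the SQL JOIN between ingest_request and ingest_file_result keeps working.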
* persist: ingest_request tool (with no ingest_file_result) (Bryan Newbold, 2020-03-05; 1 file, -1/+1)
* pdftrio basic python code (Bryan Newbold, 2020-02-12; 1 file, -1/+2)
  This is basically just a copy/paste of the GROBID code, only simpler!
* small fixups to SandcrawlerPostgrestClient (Bryan Newbold, 2020-01-14; 1 file, -0/+1)
* more wayback and SPN tests and fixes (Bryan Newbold, 2020-01-09; 1 file, -1/+1)
* fix sandcrawler persist workers (Bryan Newbold, 2020-01-02; 1 file, -0/+1)
* have SPN client differentiate between SPN and remote errors (Bryan Newbold, 2019-11-13; 1 file, -1/+1)
  This is only a partial implementation. The requests client will still make far too many SPN requests trying to figure out whether this is a real error or not (eg, if the remote returned a 502, we'll retry many times). We may just want to switch to SPNv2 for everything.
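The retry behaviour this commit complains about might be sketched as follows. Everything here is hypothetical illustration: `do_capture` is a stand-in callable returning the HTTP status the remote site gave Save-Page-Now, and the status strings are invented, not sandcrawler's real ones:

```python
def save_page_now(url, do_capture, max_retries=3):
    """Sketch: retry remote 5xx errors a bounded number of times, but treat
    other failures as SPN-side errors and give up immediately."""
    last_status = None
    for _ in range(max_retries):
        status = do_capture(url)
        if status == 200:
            return "success"
        if 500 <= status < 600:
            # remote-side error: this is the loop the commit says fires
            # far too many times before concluding the error is real
            last_status = status
            continue
        # anything else is treated as an SPN-side error; retrying won't help
        return "spn-error"
    return "remote-error-{}".format(last_status)
```

Bounding `max_retries` (or moving to SPNv2, as the commit suggests) is what keeps a persistently-502ing remote from consuming the SPN request budget.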
* rename FileIngestWorker (Bryan Newbold, 2019-11-13; 1 file, -0/+1)
* lots of grobid tool implementation (still WIP) (Bryan Newbold, 2019-09-26; 1 file, -1/+4)
* re-write parse_cdx_line for sandcrawler lib (Bryan Newbold, 2019-09-25; 1 file, -1/+1)
* start refactoring sandcrawler python common code (Bryan Newbold, 2019-09-23; 1 file, -0/+3)