Commit message | Author | Age | Files | Lines |
---|---|---|---|---|---|
* | differentiate wayback-error from wayback-content-error | Bryan Newbold | 2020-10-21 | 1 | -1/+1 |
  The motivation here is to distinguish errors due to current content in wayback (eg, in WARCs) from operational errors (eg, wayback machine is down, or network failures/disruption).
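  As an illustration only (not the actual sandcrawler code): the split described above might look roughly like the sketch below. The exception class names mirror the status strings in the commit subject, and the replay URL handling is my own assumption.

```python
import requests

class WaybackError(Exception):
    """Operational failure: wayback/petabox outage, network disruption, timeout."""

class WaybackContentError(Exception):
    """The archived content itself is bad (eg, an empty or mangled WARC record)."""

def fetch_replay_body(url: str, timestamp: str) -> bytes:
    """Fetch one replay capture and map failure modes onto the two error types."""
    replay_url = "https://web.archive.org/web/{}id_/{}".format(timestamp, url)
    try:
        resp = requests.get(replay_url, timeout=60.0)
    except requests.exceptions.RequestException as e:
        # network failure or timeout talking to wayback: operational
        raise WaybackError("replay fetch failed: {}".format(e))
    if resp.status_code in (502, 503, 504):
        # wayback machine itself is down or overloaded: operational
        raise WaybackError("replay unavailable: HTTP {}".format(resp.status_code))
    if resp.status_code == 200 and not resp.content:
        # the capture resolved, but the stored content is empty/broken
        raise WaybackContentError("empty replay body for {}".format(url))
    resp.raise_for_status()
    return resp.content
```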
* | lint fixes | Bryan Newbold | 2020-06-17 | 1 | -1/+1 |
* | add new pdf workers/persisters | Bryan Newbold | 2020-06-17 | 1 | -2/+2 |
* | initial work on PDF extraction worker | Bryan Newbold | 2020-06-16 | 1 | -1/+1 |
  This worker fetches full PDFs, then extracts thumbnails, raw text, and PDF metadata. Similar to GROBID worker.
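  A rough sketch of that fetch-then-derive shape; the library choice (PyMuPDF) and function names here are mine for illustration, and the real worker may use different tooling entirely.

```python
import sys
import fitz  # PyMuPDF

def process_pdf_blob(pdf_bytes: bytes) -> dict:
    """Derive a thumbnail, raw text, and document metadata from one PDF blob."""
    doc = fitz.open(stream=pdf_bytes, filetype="pdf")
    # thumbnail: render the first page at half scale and encode as PNG
    pixmap = doc[0].get_pixmap(matrix=fitz.Matrix(0.5, 0.5))
    thumbnail_png = pixmap.tobytes("png")
    # raw text: concatenate text extracted from every page
    full_text = "\n".join(page.get_text() for page in doc)
    return {
        "page_count": doc.page_count,
        "pdf_metadata": doc.metadata,  # title, author, creation date, etc.
        "text": full_text,
        "thumbnail_png": thumbnail_png,
    }

if __name__ == "__main__":
    with open(sys.argv[1], "rb") as f:
        info = process_pdf_blob(f.read())
    print(info["page_count"], len(info["text"]), len(info["thumbnail_png"]))
```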
* | rename KafkaGrobidSink -> KafkaCompressSink | Bryan Newbold | 2020-06-16 | 1 | -1/+1 |
* | url cleaning (canonicalization) for ingest base_url | Bryan Newbold | 2020-03-10 | 1 | -1/+1 |
  As mentioned in comment, this first version does not re-write the URL in the `base_url` field. If we did so, then ingest_request rows would not SQL JOIN to ingest_file_result rows, which we wouldn't want. In the future, behaviour should maybe be to refuse to process URLs that aren't clean (eg, if base_url != clean_url(base_url)) and return a 'bad-url' status or something. Then we would only accept clean URLs in both tables, and clear out all old/bad URLs with a cleanup script.
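  A hedged sketch of that possible future behaviour, refusing requests whose `base_url` is not already canonical. `clean_url()` below is a standard-library stand-in, not sandcrawler's actual helper; only the 'bad-url' status string comes from the commit message.

```python
from urllib.parse import urlsplit, urlunsplit

def clean_url(url: str) -> str:
    """Rough canonicalization: lowercase scheme and host, drop fragment, default path."""
    parts = urlsplit(url.strip())
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path or "/",
        parts.query,
        "",  # drop any #fragment
    ))

def check_base_url(base_url: str) -> dict:
    """Return a 'bad-url' status for requests that are not already canonical."""
    if base_url != clean_url(base_url):
        return {"status": "bad-url", "base_url": base_url}
    # otherwise the request proceeds through normal ingest processing
    # (the 'ok' status string here is a placeholder, not from the commit)
    return {"status": "ok", "base_url": base_url}
```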
* | persist: ingest_request tool (with no ingest_file_result) | Bryan Newbold | 2020-03-05 | 1 | -1/+1 |
* | pdftrio basic python code | Bryan Newbold | 2020-02-12 | 1 | -1/+2 |
  This is basically just a copy/paste of GROBID code, only simpler!
* | small fixups to SandcrawlerPostgrestClient | Bryan Newbold | 2020-01-14 | 1 | -0/+1 |
* | more wayback and SPN tests and fixes | Bryan Newbold | 2020-01-09 | 1 | -1/+1 |
* | fix sandcrawler persist workers | Bryan Newbold | 2020-01-02 | 1 | -0/+1 |
* | have SPN client differentiate between SPN and remote errors | Bryan Newbold | 2019-11-13 | 1 | -1/+1 |
  This is only a partial implementation. The requests client will still make way too many SPN requests trying to figure out if this is a real error or not (eg, if remote was a 502, we'll retry many times). We may just want to switch to SPNv2 for everything.
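  To make the SPN-vs-remote distinction concrete, a toy classifier might look like the following; the parameter names and status strings are hypothetical, not the SPN API or sandcrawler's actual schema.

```python
from typing import Optional

def classify_spn_outcome(spn_status_code: int, remote_status_code: Optional[int]) -> str:
    """Separate failures of the SPN service itself from failures of the remote site."""
    if remote_status_code is not None and remote_status_code >= 500:
        # the origin site failed (eg, a 502); retrying SPN over and over won't help
        return "remote-server-error"
    if spn_status_code != 200:
        # the Save Page Now service itself errored
        return "spn-error"
    return "success"
```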
* | rename FileIngestWorker | Bryan Newbold | 2019-11-13 | 1 | -0/+1 |
* | lots of grobid tool implementation (still WIP) | Bryan Newbold | 2019-09-26 | 1 | -1/+4 |
* | re-write parse_cdx_line for sandcrawler lib | Bryan Newbold | 2019-09-25 | 1 | -1/+1 |
* | start refactoring sandcrawler python common code | Bryan Newbold | 2019-09-23 | 1 | -0/+3 |