| | Commit message | Author | Date | Files | Lines |
---|---|---|---|---|---|
* | ingest: clean_url() in more places | Bryan Newbold | 2020-03-23 | 1 | -0/+1 |
* | url cleaning (canonicalization) for ingest base_url | Bryan Newbold | 2020-03-10 | 1 | -0/+7 |
* | more mime normalization | Bryan Newbold | 2020-02-27 | 1 | -1/+18 |
* | much progress on file ingest path | Bryan Newbold | 2019-10-22 | 1 | -0/+24 |
* | lots of grobid tool implementation (still WIP) | Bryan Newbold | 2019-09-26 | 1 | -5/+11 |
* | re-write parse_cdx_line for sandcrawler lib | Bryan Newbold | 2019-09-25 | 1 | -0/+84 |
* | start refactoring sandcrawler python common code | Bryan Newbold | 2019-09-23 | 1 | -0/+43 |