| | Commit message | Author | Age | Files | Lines |
---|---|---|---|---|---|
* | python: isort all imports | Bryan Newbold | 2021-10-26 | 1 | -1/+2 |
* | local-file version of gen_file_metadata | Bryan Newbold | 2021-10-15 | 1 | -1/+13 |
* | url cleaning (canonicalization) for ingest base_url | Bryan Newbold | 2020-03-10 | 1 | -1/+7 |
| | As mentioned in the comment, this first version does not rewrite the URL in the `base_url` field. If it did, ingest_request rows would no longer SQL JOIN to ingest_file_result rows, which we wouldn't want. In the future, the behaviour should maybe be to refuse to process URLs that aren't clean (e.g., if base_url != clean_url(base_url)) and return a 'bad-url' status or something. Then we would only accept clean URLs in both tables, and clear out all old/bad URLs with a cleanup script. | | | | |
* | lots of grobid tool implementation (still WIP) | Bryan Newbold | 2019-09-26 | 1 | -3/+3 |
* | re-write parse_cdx_line for sandcrawler lib | Bryan Newbold | 2019-09-25 | 1 | -1/+31 |
* | start refactoring sandcrawler python common code | Bryan Newbold | 2019-09-23 | 1 | -0/+41 |
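
The 2020-03-10 commit message describes a possible future behaviour: refuse any request whose `base_url` is not already clean and report a 'bad-url' status, while never rewriting `base_url` itself so that rows in the two tables still JOIN. A minimal sketch of that check follows. It is an illustration only: this `clean_url` is a hypothetical stand-in built on the standard library (the real canonicalization rules are not shown in this log), and `precheck_request` and the shape of its return value are assumptions.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


def clean_url(url: str) -> str:
    """Hypothetical stand-in canonicalizer, NOT the project's implementation.

    Trims surrounding whitespace, lowercases the scheme and network location,
    uses "/" for an empty path, and drops the fragment.
    """
    parts = urlsplit(url.strip())
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


def precheck_request(request: dict) -> Optional[dict]:
    """Return a 'bad-url' result for an unclean base_url, else None.

    The original base_url is passed through untouched, so a result row
    would still JOIN to its request row on that field.
    """
    base_url = request["base_url"]
    if base_url != clean_url(base_url):
        return {"base_url": base_url, "status": "bad-url"}
    return None  # URL is already clean; normal processing would continue


if __name__ == "__main__":
    print(precheck_request({"base_url": "HTTP://Example.COM/Paper.pdf#page=2"}))
    print(precheck_request({"base_url": "http://example.com/Paper.pdf"}))
```

Comparing `base_url` against its cleaned form, rather than silently substituting the cleaned value, keeps the rejection decision explicit and leaves the stored key unchanged.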