Commit message | Author | Age | Files | Lines | |
---|---|---|---|---|---|
* | make fmt (black 21.9b0) | Bryan Newbold | 2021-10-27 | 1 | -51/+64 |
| | |||||
* | lint collection membership (last lint for now) | Bryan Newbold | 2021-10-26 | 1 | -2/+2 |
| | |||||
* | more progress on type annotations and linting | Bryan Newbold | 2021-10-26 | 1 | -8/+8 |
| | |||||
* | start handling trivial lint cleanups: unused imports, 'is None', etc | Bryan Newbold | 2021-10-26 | 1 | -4/+4 |
| | |||||
* | make fmt | Bryan Newbold | 2021-10-26 | 1 | -19/+45 |
| | |||||
* | python: isort all imports | Bryan Newbold | 2021-10-26 | 1 | -5/+5 |
| | |||||
* | improve fileset ingest integration with file ingest | Bryan Newbold | 2021-10-15 | 1 | -0/+15 |
| | |||||
* | local-file version of gen_file_metadata | Bryan Newbold | 2021-10-15 | 1 | -1/+42 |
| | |||||
* | move fuzzy URL match method to misc | Bryan Newbold | 2020-11-08 | 1 | -0/+17 |
| | |||||
* | html: try to detect and mark XHTML (vs. HTML or XML) | Bryan Newbold | 2020-11-08 | 1 | -2/+4 |
| | |||||
* | gen_file_metadata: allow empty/null bodies (if flag set) | Bryan Newbold | 2020-11-08 | 1 | -2/+4 |
| | | | | This is for HTML sub-resources, which can validly be empty (I think) | ||||
* | gen_file_metadata: detect JATS XML and use application/jats+xml | Bryan Newbold | 2020-11-03 | 1 | -0/+4 |
| | |||||
* | cdx datetime parsing improvements | Bryan Newbold | 2020-10-30 | 1 | -0/+11 |
| | |||||
* | misc: type annotations, fix parse_cdx_datetime | Bryan Newbold | 2020-10-29 | 1 | -14/+18 |
| | |||||
* | ingest: clean_url() in more places | Bryan Newbold | 2020-03-23 | 1 | -0/+1 |
| | | | | | | Some 'cdx-error' results were due to URLs with ':' after the hostname or trailing newline ("\n") characters in the URL. This attempts to work around this category of error. | ||||
* | url cleaning (canonicalization) for ingest base_url | Bryan Newbold | 2020-03-10 | 1 | -0/+7 |
| | | | | | | | | | | | As mentioned in comment, this first version does not re-write the URL in the `base_url` field. If we did so, then ingest_request rows would not SQL JOIN to ingest_file_result rows, which we wouldn't want. In the future, behaviour should maybe be to refuse to process URLs that aren't clean (eg, if base_url != clean_url(base_url)) and return a 'bad-url' status or something. Then we would only accept clean URLs in both tables, and clear out all old/bad URLs with a cleanup script. | ||||
* | more mime normalization | Bryan Newbold | 2020-02-27 | 1 | -1/+18 |
| | |||||
* | much progress on file ingest path | Bryan Newbold | 2019-10-22 | 1 | -0/+24 |
| | |||||
* | lots of grobid tool implementation (still WIP) | Bryan Newbold | 2019-09-26 | 1 | -5/+11 |
| | |||||
* | re-write parse_cdx_line for sandcrawler lib | Bryan Newbold | 2019-09-25 | 1 | -0/+84 |
| | |||||
* | start refactoring sandcrawler python common code | Bryan Newbold | 2019-09-23 | 1 | -0/+43 |
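
The 2020-03-10 and 2020-03-23 commit messages above describe cleaning URLs and a possible future check of the form `base_url != clean_url(base_url)`. A minimal sketch of that idea follows. It is an illustration only: `clean_url_sketch` and `check_base_url` are hypothetical names, the cleaning rules cover just the two cases named in the log (a stray ':' after the hostname and trailing newlines), and the `'ok'` status is an assumption. It is not the sandcrawler implementation of `clean_url()`.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


def clean_url_sketch(raw: str) -> Optional[str]:
    """Hypothetical stand-in for clean_url(); not the sandcrawler code."""
    # Drop surrounding whitespace, including trailing "\n" characters.
    url = raw.strip()
    parts = urlsplit(url)
    # Assumption: only absolute http(s) URLs are considered cleanable.
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return None
    netloc = parts.netloc
    # Drop a bare ':' after the hostname (an empty port).
    if netloc.endswith(":"):
        netloc = netloc[:-1]
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


def check_base_url(base_url: str) -> str:
    """Return 'bad-url' if base_url is not already clean, else 'ok' (assumed status)."""
    if clean_url_sketch(base_url) != base_url:
        return "bad-url"
    return "ok"
```

Under this sketch the stored `base_url` is never rewritten; a request is either accepted unchanged or rejected, so rows in the two tables would still join on the same string.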