path: root/python/sandcrawler/misc.py
Commit message | Author | Date | Files | Lines (-/+)
* local-file version of gen_file_metadata | Bryan Newbold | 2021-10-15 | 1 | -1/+42
* move fuzzy URL match method to misc | Bryan Newbold | 2020-11-08 | 1 | -0/+17
* html: try to detect and mark XHTML (vs. HTML or XML) | Bryan Newbold | 2020-11-08 | 1 | -2/+4
* gen_file_metadata: allow empty/null bodies (if flag set) | Bryan Newbold | 2020-11-08 | 1 | -2/+4
    This is for HTML sub-resources, which can validly be empty (I think)
* gen_file_metadata: detect JATS XML and use application/jats+xml | Bryan Newbold | 2020-11-03 | 1 | -0/+4
* cdx datetime parsing improvements | Bryan Newbold | 2020-10-30 | 1 | -0/+11
* misc: type annotations, fix parse_cdx_datetime | Bryan Newbold | 2020-10-29 | 1 | -14/+18
* ingest: clean_url() in more places | Bryan Newbold | 2020-03-23 | 1 | -0/+1
    Some 'cdx-error' results were due to URLs with ':' after the hostname or trailing newline ("\n") characters in the URL. This attempts to work around this category of error.
* url cleaning (canonicalization) for ingest base_url | Bryan Newbold | 2020-03-10 | 1 | -0/+7
    As mentioned in comment, this first version does not re-write the URL in the `base_url` field. If we did so, then ingest_request rows would not SQL JOIN to ingest_file_result rows, which we wouldn't want.
    In the future, behaviour should maybe be to refuse to process URLs that aren't clean (e.g., if base_url != clean_url(base_url)) and return a 'bad-url' status or something. Then we would only accept clean URLs in both tables, and clear out all old/bad URLs with a cleanup script.
* more mime normalization | Bryan Newbold | 2020-02-27 | 1 | -1/+18
* much progress on file ingest path | Bryan Newbold | 2019-10-22 | 1 | -0/+24
* lots of grobid tool implementation (still WIP) | Bryan Newbold | 2019-09-26 | 1 | -5/+11
* re-write parse_cdx_line for sandcrawler lib | Bryan Newbold | 2019-09-25 | 1 | -0/+84
* start refactoring sandcrawler python common code | Bryan Newbold | 2019-09-23 | 1 | -0/+43
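
The two URL-cleaning commits from March 2020 describe a possible future check: refuse any base_url that differs from its cleaned form and return a 'bad-url' status. A minimal sketch of that idea follows. It is not the code in misc.py: `clean_url_sketch` and `check_base_url` are hypothetical names, and the cleaning rules are an assumption covering only the two problems named in the log (a ':' after the hostname and trailing newline characters), built on the standard library's `urllib.parse`.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


def clean_url_sketch(raw: str) -> Optional[str]:
    """Hypothetical stand-in for clean_url(); NOT sandcrawler's implementation."""
    # Drop surrounding whitespace, including a trailing "\n".
    url = raw.strip()
    parts = urlsplit(url)
    # Assumption: only http(s) URLs with a hostname count as usable.
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return None
    netloc = parts.netloc
    # A bare ':' after the hostname (an empty port) is dropped.
    if netloc.endswith(":"):
        netloc = netloc[:-1]
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


def check_base_url(base_url: str) -> str:
    """Return 'bad-url' unless base_url is already in its cleaned form."""
    if clean_url_sketch(base_url) != base_url:
        return "bad-url"
    return "ok"
```

With this shape, a request whose base_url carries a stray ':' or newline would be rejected up front rather than silently rewritten, so the stored base_url stays identical in both tables and the SQL JOIN concern from the commit message does not arise.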