Commit message | Author | Age | Files | Lines | |
---|---|---|---|---|---|
* | remove grobid2json helper file, replace with grobid_tei_xml | Bryan Newbold | 2021-10-27 | 1 | -5/+9 |
* | make fmt (black 21.9b0) | Bryan Newbold | 2021-10-27 | 13 | -402/+595 |
* | more progress on type annotations and linting | Bryan Newbold | 2021-10-26 | 2 | -2/+2 |
* | live tests: FTP wayback replay now returns 200, not 226 | Bryan Newbold | 2021-10-26 | 1 | -2/+2 |
* | flake8 clean (with current settings) | Bryan Newbold | 2021-10-26 | 2 | -1/+2 |
* | start handling trivial lint cleanups: unused imports, 'is None', etc | Bryan Newbold | 2021-10-26 | 10 | -42/+26 |
* | make fmt | Bryan Newbold | 2021-10-26 | 13 | -194/+294 |
* | python: isort all imports | Bryan Newbold | 2021-10-26 | 12 | -20/+30 |
* | local-file version of gen_file_metadata | Bryan Newbold | 2021-10-15 | 1 | -1/+13 |
* | wrap up previous renaming work | Bryan Newbold | 2021-10-15 | 1 | -1/+1 |
* | refactor and expand wall/block/cookie URL patterns | Bryan Newbold | 2021-09-03 | 1 | -0/+14 |
* | move some PDF URL extraction into declarative format | Bryan Newbold | 2020-11-08 | 2 | -9/+3 |
* | xml: re-encode XML docs into UTF-8 for persisting | Bryan Newbold | 2020-11-03 | 2 | -0/+354 |
* | html: some refactoring | Bryan Newbold | 2020-11-03 | 1 | -1/+1 |
* | html: syntax fixes; resolve relative URLs; extract more XML fulltext URLs | Bryan Newbold | 2020-10-30 | 1 | -7/+8 |
* | html: work around firstmonday DOCTYPE issue | Bryan Newbold | 2020-10-30 | 2 | -0/+455 |
* | tests: fix conditional on poppler version check | Bryan Newbold | 2020-10-30 | 1 | -1/+1 |
* | improve test running and config | Bryan Newbold | 2020-10-29 | 1 | -0/+2 |
* | html: more metadata tests | Bryan Newbold | 2020-10-29 | 2 | -0/+2453 |
* | HTML metadata: fix type warnings | Bryan Newbold | 2020-10-27 | 1 | -1/+2 |
* | start HTML metadata extraction code | Bryan Newbold | 2020-10-27 | 5 | -0/+2628 |
* | check for simple URL patterns that are usually paywalls or loginwalls | Bryan Newbold | 2020-08-11 | 1 | -0/+18 |
* | fix tests passing str as HTML | Bryan Newbold | 2020-08-08 | 1 | -3/+3 |
* | another bad/non PDF test; catch correct error | Bryan Newbold | 2020-06-25 | 1 | -0/+5 |
| | This test doesn't actually catch the error. I'm not sure why type checks don't discover the "LockedDocumentError not part of poppler" issue though. | ||||
* | pdfextract support in ingest worker | Bryan Newbold | 2020-06-25 | 1 | -0/+7 |
* | fix tests for page0_height/width | Bryan Newbold | 2020-06-25 | 1 | -2/+2 |
* | lint fixes | Bryan Newbold | 2020-06-17 | 1 | -1/+1 |
* | rename pdf tools to pdfextract | Bryan Newbold | 2020-06-17 | 1 | -0/+0 |
* | partial test coverage of pdf extract worker | Bryan Newbold | 2020-06-17 | 1 | -0/+61 |
* | remove unused common.py | Bryan Newbold | 2020-06-17 | 1 | -40/+0 |
* | url cleaning (canonicalization) for ingest base_url | Bryan Newbold | 2020-03-10 | 1 | -1/+7 |
| | As mentioned in comment, this first version does not re-write the URL in the `base_url` field. If we did so, then ingest_request rows would not SQL JOIN to ingest_file_result rows, which we wouldn't want. In the future, behaviour should maybe be to refuse to process URLs that aren't clean (eg, if base_url != clean_url(base_url)) and return a 'bad-url' status or something. Then we would only accept clean URLs in both tables, and clear out all old/bad URLs with a cleanup script. | ||||
* | ingest: add URL blocklist feature | Bryan Newbold | 2020-01-17 | 1 | -0/+17 |
| | And, temporarily, block zenodo and figshare. | ||||
* | clarify ingest result schema and semantics | Bryan Newbold | 2020-01-15 | 2 | -3/+21 |
* | add postgrest checks to test mocks | Bryan Newbold | 2020-01-14 | 1 | -1/+9 |
* | tests: don't use localhost as a responses mock host | Bryan Newbold | 2020-01-14 | 2 | -6/+6 |
* | SPNv2 doesn't support FTP; add a live test for non-revisit FTP | Bryan Newbold | 2020-01-14 | 1 | -0/+16 |
* | more ftp status 226 support | Bryan Newbold | 2020-01-14 | 3 | -3/+9 |
* | add live tests for ftp, revisits | Bryan Newbold | 2020-01-14 | 1 | -1/+36 |
* | more live tests (for regressions) | Bryan Newbold | 2020-01-10 | 1 | -0/+41 |
* | refactor ingest to a loop, allowing multiple hops | Bryan Newbold | 2020-01-09 | 1 | -2/+9 |
* | add (skipped) live tests for wayback services | Bryan Newbold | 2020-01-09 | 1 | -0/+73 |
* | add ingest test file | Bryan Newbold | 2020-01-09 | 1 | -0/+120 |
| | Forgot to commit earlier! | ||||
* | lots of progress on wayback refactoring | Bryan Newbold | 2020-01-09 | 1 | -1/+7 |
| | Too much to list; canonical flags to control crawling; cdx_to_dict helper | ||||
* | location comes as a string, not list | Bryan Newbold | 2020-01-09 | 1 | -4/+4 |
* | wrap up basic (locally testable) ingest refactor | Bryan Newbold | 2020-01-09 | 1 | -4/+48 |
* | basic elife+plos extraction tests | Bryan Newbold | 2020-01-09 | 3 | -0/+4842 |
| | Ripped out some HTML, but these could have been minimized even further to keep repository from growing large. | ||||
* | fix grobid test (ISO-8859-1 encoding) | Bryan Newbold | 2020-01-09 | 1 | -6/+4 |
| | Also changes for wayback refactor | ||||
* | fix grobid tests for new wayback refactors | Bryan Newbold | 2020-01-09 | 2 | -12/+14 |
* | more wayback and SPN tests and fixes | Bryan Newbold | 2020-01-09 | 2 | -13/+67 |
* | refactor CdxApiClient, add tests | Bryan Newbold | 2020-01-08 | 1 | -0/+110 |
| | Always use auth token and get full CDX rows; simplify to "fetch" (exact url/dt match) and "lookup_best" methods; all redirect stuff will be moved to a higher level | ||||
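
The 2020-03-10 commit on URL cleaning floats a possible future behaviour: refuse ingest requests whose `base_url` is not already clean and return a 'bad-url' status. A minimal sketch of that check is below. The `clean_url` here is a hypothetical stand-in built on `urllib.parse` (the log does not show the project's actual canonicalization rules), and `check_base_url` is an invented name used only for illustration.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


def clean_url(url: str) -> str:
    """Hypothetical canonicalizer (an assumption, not the project's rules):
    strips surrounding whitespace, lowercases scheme and host, uses "/" for
    an empty path, and drops the fragment."""
    parts = urlsplit(url.strip())
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


def check_base_url(base_url: str) -> Optional[str]:
    """Return 'bad-url' when base_url is not already in clean form, else None.

    The URL is compared, never rewritten, so rows keyed on base_url in
    different tables would still join on the same string."""
    if base_url != clean_url(base_url):
        return "bad-url"
    return None
```

With this shape, a request for `HTTP://Example.com/paper.pdf#page=2` would be rejected with 'bad-url', while `http://example.com/paper.pdf` would pass through unchanged.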