| | Commit message | Author | Age | Files | Lines |
---|---|---|---|---|---|
* | wrap up previous renaming work | Bryan Newbold | 2021-10-15 | 1 | -1/+1 |
* | refactor and expand wall/block/cookie URL patterns | Bryan Newbold | 2021-09-03 | 1 | -0/+14 |
* | move some PDF URL extraction into declarative format | Bryan Newbold | 2020-11-08 | 2 | -9/+3 |
* | xml: re-encode XML docs into UTF-8 for persisting | Bryan Newbold | 2020-11-03 | 2 | -0/+354 |
* | html: some refactoring | Bryan Newbold | 2020-11-03 | 1 | -1/+1 |
* | html: syntax fixes; resolve relative URLs; extract more XML fulltext URLs | Bryan Newbold | 2020-10-30 | 1 | -7/+8 |
* | html: work around firstmonday DOCTYPE issue | Bryan Newbold | 2020-10-30 | 2 | -0/+455 |
* | tests: fix conditional on poppler version check | Bryan Newbold | 2020-10-30 | 1 | -1/+1 |
* | improve test running and config | Bryan Newbold | 2020-10-29 | 1 | -0/+2 |
* | html: more metadata tests | Bryan Newbold | 2020-10-29 | 2 | -0/+2453 |
* | HTML metadata: fix type warnings | Bryan Newbold | 2020-10-27 | 1 | -1/+2 |
* | start HTML metadata extraction code | Bryan Newbold | 2020-10-27 | 5 | -0/+2628 |
* | check for simple URL patterns that are usually paywalls or loginwalls | Bryan Newbold | 2020-08-11 | 1 | -0/+18 |
* | fix tests passing str as HTML | Bryan Newbold | 2020-08-08 | 1 | -3/+3 |
* | another bad/non PDF test; catch correct error | Bryan Newbold | 2020-06-25 | 1 | -0/+5 |
| | This test doesn't actually catch the error. I'm not sure why type checks don't discover the "LockedDocumentError not part of poppler" issue though. | | | | |
* | pdfextract support in ingest worker | Bryan Newbold | 2020-06-25 | 1 | -0/+7 |
* | fix tests for page0_height/width | Bryan Newbold | 2020-06-25 | 1 | -2/+2 |
* | lint fixes | Bryan Newbold | 2020-06-17 | 1 | -1/+1 |
* | rename pdf tools to pdfextract | Bryan Newbold | 2020-06-17 | 1 | -0/+0 |
* | partial test coverage of pdf extract worker | Bryan Newbold | 2020-06-17 | 1 | -0/+61 |
* | remove unused common.py | Bryan Newbold | 2020-06-17 | 1 | -40/+0 |
* | url cleaning (canonicalization) for ingest base_url | Bryan Newbold | 2020-03-10 | 1 | -1/+7 |
| | As mentioned in comment, this first version does not re-write the URL in the `base_url` field. If we did so, then ingest_request rows would not SQL JOIN to ingest_file_result rows, which we wouldn't want. In the future, behaviour should maybe be to refuse to process URLs that aren't clean (eg, if base_url != clean_url(base_url)) and return a 'bad-url' status or something. Then we would only accept clean URLs in both tables, and clear out all old/bad URLs with a cleanup script. | | | | |
* | ingest: add URL blocklist feature | Bryan Newbold | 2020-01-17 | 1 | -0/+17 |
| | And, temporarily, block zenodo and figshare. | | | | |
* | clarify ingest result schema and semantics | Bryan Newbold | 2020-01-15 | 2 | -3/+21 |
* | add postgrest checks to test mocks | Bryan Newbold | 2020-01-14 | 1 | -1/+9 |
* | tests: don't use localhost as a responses mock host | Bryan Newbold | 2020-01-14 | 2 | -6/+6 |
* | SPNv2 doesn't support FTP; add a live test for non-revisit FTP | Bryan Newbold | 2020-01-14 | 1 | -0/+16 |
* | more ftp status 226 support | Bryan Newbold | 2020-01-14 | 3 | -3/+9 |
* | add live tests for ftp, revisits | Bryan Newbold | 2020-01-14 | 1 | -1/+36 |
* | more live tests (for regressions) | Bryan Newbold | 2020-01-10 | 1 | -0/+41 |
* | refactor ingest to a loop, allowing multiple hops | Bryan Newbold | 2020-01-09 | 1 | -2/+9 |
* | add (skipped) live tests for wayback services | Bryan Newbold | 2020-01-09 | 1 | -0/+73 |
* | add ingest test file | Bryan Newbold | 2020-01-09 | 1 | -0/+120 |
| | Forgot to commit earlier! | | | | |
* | lots of progress on wayback refactoring | Bryan Newbold | 2020-01-09 | 1 | -1/+7 |
| | too much to list; canonical flags to control crawling; cdx_to_dict helper | | | | |
* | location comes as a string, not list | Bryan Newbold | 2020-01-09 | 1 | -4/+4 |
* | wrap up basic (locally testable) ingest refactor | Bryan Newbold | 2020-01-09 | 1 | -4/+48 |
* | basic elife+plos extraction tests | Bryan Newbold | 2020-01-09 | 3 | -0/+4842 |
| | Ripped out some HTML, but these could have been minimized even further to keep the repository from growing large. | | | | |
* | fix grobid test (ISO-8859-1 encoding) | Bryan Newbold | 2020-01-09 | 1 | -6/+4 |
| | Also changes for wayback refactor | | | | |
* | fix grobid tests for new wayback refactors | Bryan Newbold | 2020-01-09 | 2 | -12/+14 |
* | more wayback and SPN tests and fixes | Bryan Newbold | 2020-01-09 | 2 | -13/+67 |
* | refactor CdxApiClient, add tests | Bryan Newbold | 2020-01-08 | 1 | -0/+110 |
| | always use auth token and get full CDX rows; simplify to "fetch" (exact url/dt match) and "lookup_best" methods; all redirect stuff will be moved to a higher level | | | | |
* | refactor SavePageNowClient and add test | Bryan Newbold | 2020-01-07 | 1 | -0/+160 |
| | response as a namedtuple; "remote" errors (aka, SPN API was HTTP 200 but returned error) aren't an exception | | | | |
* | teixml2json test update for skipping null JSON keys | Bryan Newbold | 2020-01-02 | 1 | -10/+1 |
* | grobid2json: language_code | Bryan Newbold | 2019-10-04 | 1 | -1/+2 |
* | python tests for pusher classes | Bryan Newbold | 2019-10-02 | 2 | -0/+28 |
* | add tests for affiliation extraction | Bryan Newbold | 2019-10-02 | 2 | -1/+25 |
* | lots of grobid tool implementation (still WIP) | Bryan Newbold | 2019-09-26 | 2 | -7/+29 |
* | test of GROBID client | Bryan Newbold | 2019-09-25 | 1 | -0/+53 |
* | refactor old python hadoop code into new directory | Bryan Newbold | 2019-09-25 | 4 | -591/+0 |
* | re-write parse_cdx_line for sandcrawler lib | Bryan Newbold | 2019-09-25 | 1 | -1/+31 |