| | Commit message | Author | Age | Files | Lines |
|---|---|---|---|---|---|
| * | improve test running and config | Bryan Newbold | 2020-10-29 | 1 | -0/+2 | 
| | | |||||
| * | html: more metadata tests | Bryan Newbold | 2020-10-29 | 2 | -0/+2453 | 
| | | |||||
| * | HTML metadata: fix type warnings | Bryan Newbold | 2020-10-27 | 1 | -1/+2 | 
| | | |||||
| * | start HTML metadata extraction code | Bryan Newbold | 2020-10-27 | 5 | -0/+2628 | 
| | | |||||
| * | check for simple URL patterns that are usually paywalls or loginwalls | Bryan Newbold | 2020-08-11 | 1 | -0/+18 | 
| | | |||||
| * | fix tests passing str as HTML | Bryan Newbold | 2020-08-08 | 1 | -3/+3 | 
| | | |||||
| * | another bad/non PDF test; catch correct error | Bryan Newbold | 2020-06-25 | 1 | -0/+5 | 
| | This test doesn't actually catch the error. I'm not sure why type checks don't discover the "LockedDocumentError not part of poppler" issue though. | | | | |
| * | pdfextract support in ingest worker | Bryan Newbold | 2020-06-25 | 1 | -0/+7 | 
| | | |||||
| * | fix tests for page0_height/width | Bryan Newbold | 2020-06-25 | 1 | -2/+2 | 
| | | |||||
| * | lint fixes | Bryan Newbold | 2020-06-17 | 1 | -1/+1 | 
| | | |||||
| * | rename pdf tools to pdfextract | Bryan Newbold | 2020-06-17 | 1 | -0/+0 | 
| | | |||||
| * | partial test coverage of pdf extract worker | Bryan Newbold | 2020-06-17 | 1 | -0/+61 | 
| | | |||||
| * | remove unused common.py | Bryan Newbold | 2020-06-17 | 1 | -40/+0 | 
| | | |||||
| * | url cleaning (canonicalization) for ingest base_url | Bryan Newbold | 2020-03-10 | 1 | -1/+7 | 
| | As mentioned in the comment, this first version does not rewrite the URL in the `base_url` field. If we did so, `ingest_request` rows would no longer SQL JOIN to `ingest_file_result` rows, which we wouldn't want. In the future, the behaviour should maybe be to refuse to process URLs that aren't clean (e.g., if `base_url != clean_url(base_url)`) and return a 'bad-url' status or something. Then we would only accept clean URLs in both tables, and clear out all old/bad URLs with a cleanup script. | | | | |
| * | ingest: add URL blocklist feature | Bryan Newbold | 2020-01-17 | 1 | -0/+17 | 
| | And, temporarily, block zenodo and figshare. | | | | |
| * | clarify ingest result schema and semantics | Bryan Newbold | 2020-01-15 | 2 | -3/+21 | 
| | | |||||
| * | add postgrest checks to test mocks | Bryan Newbold | 2020-01-14 | 1 | -1/+9 | 
| | | |||||
| * | tests: don't use localhost as a responses mock host | Bryan Newbold | 2020-01-14 | 2 | -6/+6 | 
| | | |||||
| * | SPNv2 doesn't support FTP; add a live test for non-revisit FTP | Bryan Newbold | 2020-01-14 | 1 | -0/+16 |
| | | |||||
| * | more ftp status 226 support | Bryan Newbold | 2020-01-14 | 3 | -3/+9 | 
| | | |||||
| * | add live tests for ftp, revisits | Bryan Newbold | 2020-01-14 | 1 | -1/+36 | 
| | | |||||
| * | more live tests (for regressions) | Bryan Newbold | 2020-01-10 | 1 | -0/+41 | 
| | | |||||
| * | refactor ingest to a loop, allowing multiple hops | Bryan Newbold | 2020-01-09 | 1 | -2/+9 | 
| | | |||||
| * | add (skipped) live tests for wayback services | Bryan Newbold | 2020-01-09 | 1 | -0/+73 | 
| | | |||||
| * | add ingest test file | Bryan Newbold | 2020-01-09 | 1 | -0/+120 | 
| | Forgot to commit earlier! | | | | |
| * | lots of progress on wayback refactoring | Bryan Newbold | 2020-01-09 | 1 | -1/+7 | 
| | too much to list; canonical flags to control crawling; `cdx_to_dict` helper | | | | |
| * | location comes as a string, not list | Bryan Newbold | 2020-01-09 | 1 | -4/+4 | 
| | | |||||
| * | wrap up basic (locally testable) ingest refactor | Bryan Newbold | 2020-01-09 | 1 | -4/+48 | 
| | | |||||
| * | basic elife+plos extraction tests | Bryan Newbold | 2020-01-09 | 3 | -0/+4842 | 
| | Ripped out some HTML, but these could have been minimized even further to keep repository from growing large. | | | | |
| * | fix grobid test (ISO-8859-1 encoding) | Bryan Newbold | 2020-01-09 | 1 | -6/+4 | 
| | Also changes for wayback refactor | | | | |
| * | fix grobid tests for new wayback refactors | Bryan Newbold | 2020-01-09 | 2 | -12/+14 | 
| | | |||||
| * | more wayback and SPN tests and fixes | Bryan Newbold | 2020-01-09 | 2 | -13/+67 | 
| | | |||||
| * | refactor CdxApiClient, add tests | Bryan Newbold | 2020-01-08 | 1 | -0/+110 | 
| | always use auth token and get full CDX rows; simplify to "fetch" (exact url/dt match) and "lookup_best" methods; all redirect stuff will be moved to a higher level | | | | |
| * | refactor SavePaperNowClient and add test | Bryan Newbold | 2020-01-07 | 1 | -0/+160 | 
| | response as a namedtuple; "remote" errors (aka, SPN API was HTTP 200 but returned error) aren't an exception | | | | |
| * | teixml2json test update for skipping null JSON keys | Bryan Newbold | 2020-01-02 | 1 | -10/+1 | 
| | | |||||
| * | grobid2json: language_code | Bryan Newbold | 2019-10-04 | 1 | -1/+2 | 
| | | |||||
| * | python tests for pusher classes | Bryan Newbold | 2019-10-02 | 2 | -0/+28 | 
| | | |||||
| * | add tests for affiliation extraction | Bryan Newbold | 2019-10-02 | 2 | -1/+25 | 
| | | |||||
| * | lots of grobid tool implementation (still WIP) | Bryan Newbold | 2019-09-26 | 2 | -7/+29 | 
| | | |||||
| * | test of GROBID client | Bryan Newbold | 2019-09-25 | 1 | -0/+53 | 
| | | |||||
| * | refactor old python hadoop code into new directory | Bryan Newbold | 2019-09-25 | 4 | -591/+0 | 
| | | |||||
| * | re-write parse_cdx_line for sandcrawler lib | Bryan Newbold | 2019-09-25 | 1 | -1/+31 | 
| | | |||||
| * | fix grobid2json test | Bryan Newbold | 2019-09-25 | 1 | -1/+4 |
| | For new extra fields | | | | |
| * | start refactoring sandcrawler python common code | Bryan Newbold | 2019-09-23 | 2 | -0/+41 | 
| | | |||||
| * | update grobid2json to include given_name/surname | Bryan Newbold | 2019-05-13 | 1 | -3/+3 | 
| | | |||||
| * | python test fixes | Bryan Newbold | 2019-02-21 | 1 | -1/+1 | 
| | | |||||
| * | fix ungrobid extraction tests | Bryan Newbold | 2018-11-22 | 1 | -2/+4 | 
| | | |||||
| * | longtail grobid metadata parse/filter WIP | Bryan Newbold | 2018-09-22 | 1 | -0/+5 | 
| | | |||||
| * | WIP: ungrobided doesn't inherit (copypasta) | Bryan Newbold | 2018-08-25 | 1 | -4/+4 | 
| | | |||||
| * | ungrobided: example real output | Bryan Newbold | 2018-08-25 | 1 | -0/+20 | 
| | | |||||
