path: root/python/tests
Commit message | Author | Age | Files | Lines
* wrap up previous renaming work | Bryan Newbold | 2021-10-15 | 1 | -1/+1
* refactor and expand wall/block/cookie URL patterns | Bryan Newbold | 2021-09-03 | 1 | -0/+14
* move some PDF URL extraction into declarative format | Bryan Newbold | 2020-11-08 | 2 | -9/+3
* xml: re-encode XML docs into UTF-8 for persisting | Bryan Newbold | 2020-11-03 | 2 | -0/+354
* html: some refactoring | Bryan Newbold | 2020-11-03 | 1 | -1/+1
* html: syntax fixes; resolve relative URLs; extract more XML fulltext URLs | Bryan Newbold | 2020-10-30 | 1 | -7/+8
* html: work around firstmonday DOCTYPE issue | Bryan Newbold | 2020-10-30 | 2 | -0/+455
* tests: fix conditional on poppler version check | Bryan Newbold | 2020-10-30 | 1 | -1/+1
* improve test running and config | Bryan Newbold | 2020-10-29 | 1 | -0/+2
* html: more metadata tests | Bryan Newbold | 2020-10-29 | 2 | -0/+2453
* HTML metadata: fix type warnings | Bryan Newbold | 2020-10-27 | 1 | -1/+2
* start HTML metadata extraction code | Bryan Newbold | 2020-10-27 | 5 | -0/+2628
* check for simple URL patterns that are usually paywalls or loginwalls | Bryan Newbold | 2020-08-11 | 1 | -0/+18
* fix tests passing str as HTML | Bryan Newbold | 2020-08-08 | 1 | -3/+3
* another bad/non PDF test; catch correct error | Bryan Newbold | 2020-06-25 | 1 | -0/+5
    This test doesn't actually catch the error. I'm not sure why type checks don't discover the "LockedDocumentError not part of poppler" issue though.
* pdfextract support in ingest worker | Bryan Newbold | 2020-06-25 | 1 | -0/+7
* fix tests for page0_height/width | Bryan Newbold | 2020-06-25 | 1 | -2/+2
* lint fixes | Bryan Newbold | 2020-06-17 | 1 | -1/+1
* rename pdf tools to pdfextract | Bryan Newbold | 2020-06-17 | 1 | -0/+0
* partial test coverage of pdf extract worker | Bryan Newbold | 2020-06-17 | 1 | -0/+61
* remove unused common.py | Bryan Newbold | 2020-06-17 | 1 | -40/+0
* url cleaning (canonicalization) for ingest base_url | Bryan Newbold | 2020-03-10 | 1 | -1/+7
    As mentioned in comment, this first version does not re-write the URL in the `base_url` field. If we did so, then ingest_request rows would not SQL JOIN to ingest_file_result rows, which we wouldn't want.
    In the future, behaviour should maybe be to refuse to process URLs that aren't clean (eg, if base_url != clean_url(base_url)) and return a 'bad-url' status or something. Then we would only accept clean URLs in both tables, and clear out all old/bad URLs with a cleanup script.
* ingest: add URL blocklist feature | Bryan Newbold | 2020-01-17 | 1 | -0/+17
    And, temporarily, block zenodo and figshare.
* clarify ingest result schema and semantics | Bryan Newbold | 2020-01-15 | 2 | -3/+21
* add postgrest checks to test mocks | Bryan Newbold | 2020-01-14 | 1 | -1/+9
* tests: don't use localhost as a responses mock host | Bryan Newbold | 2020-01-14 | 2 | -6/+6
* SPNv2 doesn't support FTP; add a live test for non-revisit FTP | Bryan Newbold | 2020-01-14 | 1 | -0/+16
* more ftp status 226 support | Bryan Newbold | 2020-01-14 | 3 | -3/+9
* add live tests for ftp, revisits | Bryan Newbold | 2020-01-14 | 1 | -1/+36
* more live tests (for regressions) | Bryan Newbold | 2020-01-10 | 1 | -0/+41
* refactor ingest to a loop, allowing multiple hops | Bryan Newbold | 2020-01-09 | 1 | -2/+9
* add (skipped) live tests for wayback services | Bryan Newbold | 2020-01-09 | 1 | -0/+73
* add ingest test file | Bryan Newbold | 2020-01-09 | 1 | -0/+120
    Forgot to commit earlier!
* lots of progress on wayback refactoring | Bryan Newbold | 2020-01-09 | 1 | -1/+7
    - too much to list
    - canonical flags to control crawling
    - cdx_to_dict helper
* location comes as a string, not list | Bryan Newbold | 2020-01-09 | 1 | -4/+4
* wrap up basic (locally testable) ingest refactor | Bryan Newbold | 2020-01-09 | 1 | -4/+48
* basic elife+plos extraction tests | Bryan Newbold | 2020-01-09 | 3 | -0/+4842
    Ripped out some HTML, but these could have been minimized even further to keep repository from growing large.
* fix grobid test (ISO-8859-1 encoding) | Bryan Newbold | 2020-01-09 | 1 | -6/+4
    Also changes for wayback refactor
* fix grobid tests for new wayback refactors | Bryan Newbold | 2020-01-09 | 2 | -12/+14
* more wayback and SPN tests and fixes | Bryan Newbold | 2020-01-09 | 2 | -13/+67
* refactor CdxApiClient, add tests | Bryan Newbold | 2020-01-08 | 1 | -0/+110
    - always use auth token and get full CDX rows
    - simplify to "fetch" (exact url/dt match) and "lookup_best" methods
    - all redirect stuff will be moved to a higher level
* refactor SavePaperNowClient and add test | Bryan Newbold | 2020-01-07 | 1 | -0/+160
    - response as a namedtuple
    - "remote" errors (aka, SPN API was HTTP 200 but returned error) aren't an exception
* teixml2json test update for skipping null JSON keys | Bryan Newbold | 2020-01-02 | 1 | -10/+1
* grobid2json: language_code | Bryan Newbold | 2019-10-04 | 1 | -1/+2
* python tests for pusher classes | Bryan Newbold | 2019-10-02 | 2 | -0/+28
* add tests for affiliation extraction | Bryan Newbold | 2019-10-02 | 2 | -1/+25
* lots of grobid tool implementation (still WIP) | Bryan Newbold | 2019-09-26 | 2 | -7/+29
* test of GROBID client | Bryan Newbold | 2019-09-25 | 1 | -0/+53
* refactor old python hadoop code into new directory | Bryan Newbold | 2019-09-25 | 4 | -591/+0
* re-write parse_cdx_line for sandcrawler lib | Bryan Newbold | 2019-09-25 | 1 | -1/+31