path: root/python/tests
| Commit message | Author | Date | Files | Lines (-/+) |
|---|---|---|---|---|
| wrap up previous renaming work | Bryan Newbold | 2021-10-15 | 1 | -1/+1 |
| refactor and expand wall/block/cookie URL patterns | Bryan Newbold | 2021-09-03 | 1 | -0/+14 |
| move some PDF URL extraction into declarative format | Bryan Newbold | 2020-11-08 | 2 | -9/+3 |
| xml: re-encode XML docs into UTF-8 for persisting | Bryan Newbold | 2020-11-03 | 2 | -0/+354 |
| html: some refactoring | Bryan Newbold | 2020-11-03 | 1 | -1/+1 |
| html: syntax fixes; resolve relative URLs; extract more XML fulltext URLs | Bryan Newbold | 2020-10-30 | 1 | -7/+8 |
| html: work around firstmonday DOCTYPE issue | Bryan Newbold | 2020-10-30 | 2 | -0/+455 |
| tests: fix conditional on poppler version check | Bryan Newbold | 2020-10-30 | 1 | -1/+1 |
| improve test running and config | Bryan Newbold | 2020-10-29 | 1 | -0/+2 |
| html: more metadata tests | Bryan Newbold | 2020-10-29 | 2 | -0/+2453 |
| HTML metadata: fix type warnings | Bryan Newbold | 2020-10-27 | 1 | -1/+2 |
| start HTML metadata extraction code | Bryan Newbold | 2020-10-27 | 5 | -0/+2628 |
| check for simple URL patterns that are usually paywalls or loginwalls | Bryan Newbold | 2020-08-11 | 1 | -0/+18 |
| fix tests passing str as HTML | Bryan Newbold | 2020-08-08 | 1 | -3/+3 |
| another bad/non PDF test; catch correct error | Bryan Newbold | 2020-06-25 | 1 | -0/+5 |
| pdfextract support in ingest worker | Bryan Newbold | 2020-06-25 | 1 | -0/+7 |
| fix tests for page0_height/width | Bryan Newbold | 2020-06-25 | 1 | -2/+2 |
| lint fixes | Bryan Newbold | 2020-06-17 | 1 | -1/+1 |
| rename pdf tools to pdfextract | Bryan Newbold | 2020-06-17 | 1 | -0/+0 |
| partial test coverage of pdf extract worker | Bryan Newbold | 2020-06-17 | 1 | -0/+61 |
| remove unused common.py | Bryan Newbold | 2020-06-17 | 1 | -40/+0 |
| url cleaning (canonicalization) for ingest base_url | Bryan Newbold | 2020-03-10 | 1 | -1/+7 |
| ingest: add URL blocklist feature | Bryan Newbold | 2020-01-17 | 1 | -0/+17 |
| clarify ingest result schema and semantics | Bryan Newbold | 2020-01-15 | 2 | -3/+21 |
| add postgrest checks to test mocks | Bryan Newbold | 2020-01-14 | 1 | -1/+9 |
| tests: don't use localhost as a responses mock host | Bryan Newbold | 2020-01-14 | 2 | -6/+6 |
| SPNv2 doesn't support FTP; add a live test for non-revist FTP | Bryan Newbold | 2020-01-14 | 1 | -0/+16 |
| more ftp status 226 support | Bryan Newbold | 2020-01-14 | 3 | -3/+9 |
| add live tests for ftp, revisits | Bryan Newbold | 2020-01-14 | 1 | -1/+36 |
| more live tests (for regressions) | Bryan Newbold | 2020-01-10 | 1 | -0/+41 |
| refactor ingest to a loop, allowing multiple hops | Bryan Newbold | 2020-01-09 | 1 | -2/+9 |
| add (skipped) live tests for wayback services | Bryan Newbold | 2020-01-09 | 1 | -0/+73 |
| add ingest test file | Bryan Newbold | 2020-01-09 | 1 | -0/+120 |
| lots of progress on wayback refactoring | Bryan Newbold | 2020-01-09 | 1 | -1/+7 |
| location comes as a string, not list | Bryan Newbold | 2020-01-09 | 1 | -4/+4 |
| wrap up basic (locally testable) ingest refactor | Bryan Newbold | 2020-01-09 | 1 | -4/+48 |
| basic elife+plos extraction tests | Bryan Newbold | 2020-01-09 | 3 | -0/+4842 |
| fix grobid test (ISO-8859-1 encoding) | Bryan Newbold | 2020-01-09 | 1 | -6/+4 |
| fix grobid tests for new wayback refactors | Bryan Newbold | 2020-01-09 | 2 | -12/+14 |
| more wayback and SPN tests and fixes | Bryan Newbold | 2020-01-09 | 2 | -13/+67 |
| refactor CdxApiClient, add tests | Bryan Newbold | 2020-01-08 | 1 | -0/+110 |
| refactor SavePaperNowClient and add test | Bryan Newbold | 2020-01-07 | 1 | -0/+160 |
| teixml2json test update for skipping null JSON keys | Bryan Newbold | 2020-01-02 | 1 | -10/+1 |
| grobid2json: language_code | Bryan Newbold | 2019-10-04 | 1 | -1/+2 |
| python tests for pusher classes | Bryan Newbold | 2019-10-02 | 2 | -0/+28 |
| add tests for affiliation extraction | Bryan Newbold | 2019-10-02 | 2 | -1/+25 |
| lots of grobid tool implementation (still WIP) | Bryan Newbold | 2019-09-26 | 2 | -7/+29 |
| test of GROBID client | Bryan Newbold | 2019-09-25 | 1 | -0/+53 |
| refactor old python hadoop code into new directory | Bryan Newbold | 2019-09-25 | 4 | -591/+0 |
| re-write parse_cdx_line for sandcrawler lib | Bryan Newbold | 2019-09-25 | 1 | -1/+31 |