path: root/python/tests
| Commit message | Author | Date | Files | Lines (-/+) |
|---|---|---|---|---|
| initial crossref-refs via GROBID helper routine | Bryan Newbold | 2021-11-04 | 5 | -1/+698 |
| updates/corrections to old small.json GROBID metadata example file | Bryan Newbold | 2021-10-27 | 1 | -6/+1 |
| remove grobid2json helper file, replace with grobid_tei_xml | Bryan Newbold | 2021-10-27 | 1 | -5/+9 |
| make fmt (black 21.9b0) | Bryan Newbold | 2021-10-27 | 13 | -402/+595 |
| more progress on type annotations and linting | Bryan Newbold | 2021-10-26 | 2 | -2/+2 |
| live tests: FTP wayback replay now returns 200, not 226 | Bryan Newbold | 2021-10-26 | 1 | -2/+2 |
| flake8 clean (with current settings) | Bryan Newbold | 2021-10-26 | 2 | -1/+2 |
| start handling trivial lint cleanups: unused imports, 'is None', etc | Bryan Newbold | 2021-10-26 | 10 | -42/+26 |
| make fmt | Bryan Newbold | 2021-10-26 | 13 | -194/+294 |
| python: isort all imports | Bryan Newbold | 2021-10-26 | 12 | -20/+30 |
| local-file version of gen_file_metadata | Bryan Newbold | 2021-10-15 | 1 | -1/+13 |
| wrap up previous renaming work | Bryan Newbold | 2021-10-15 | 1 | -1/+1 |
| refactor and expand wall/block/cookie URL patterns | Bryan Newbold | 2021-09-03 | 1 | -0/+14 |
| move some PDF URL extraction into declarative format | Bryan Newbold | 2020-11-08 | 2 | -9/+3 |
| xml: re-encode XML docs into UTF-8 for persisting | Bryan Newbold | 2020-11-03 | 2 | -0/+354 |
| html: some refactoring | Bryan Newbold | 2020-11-03 | 1 | -1/+1 |
| html: syntax fixes; resolve relative URLs; extract more XML fulltext URLs | Bryan Newbold | 2020-10-30 | 1 | -7/+8 |
| html: work around firstmonday DOCTYPE issue | Bryan Newbold | 2020-10-30 | 2 | -0/+455 |
| tests: fix conditional on poppler version check | Bryan Newbold | 2020-10-30 | 1 | -1/+1 |
| improve test running and config | Bryan Newbold | 2020-10-29 | 1 | -0/+2 |
| html: more metadata tests | Bryan Newbold | 2020-10-29 | 2 | -0/+2453 |
| HTML metadata: fix type warnings | Bryan Newbold | 2020-10-27 | 1 | -1/+2 |
| start HTML metadata extraction code | Bryan Newbold | 2020-10-27 | 5 | -0/+2628 |
| check for simple URL patterns that are usually paywalls or loginwalls | Bryan Newbold | 2020-08-11 | 1 | -0/+18 |
| fix tests passing str as HTML | Bryan Newbold | 2020-08-08 | 1 | -3/+3 |
| another bad/non PDF test; catch correct error | Bryan Newbold | 2020-06-25 | 1 | -0/+5 |
| pdfextract support in ingest worker | Bryan Newbold | 2020-06-25 | 1 | -0/+7 |
| fix tests for page0_height/width | Bryan Newbold | 2020-06-25 | 1 | -2/+2 |
| lint fixes | Bryan Newbold | 2020-06-17 | 1 | -1/+1 |
| rename pdf tools to pdfextract | Bryan Newbold | 2020-06-17 | 1 | -0/+0 |
| partial test coverage of pdf extract worker | Bryan Newbold | 2020-06-17 | 1 | -0/+61 |
| remove unused common.py | Bryan Newbold | 2020-06-17 | 1 | -40/+0 |
| url cleaning (canonicalization) for ingest base_url | Bryan Newbold | 2020-03-10 | 1 | -1/+7 |
| ingest: add URL blocklist feature | Bryan Newbold | 2020-01-17 | 1 | -0/+17 |
| clarify ingest result schema and semantics | Bryan Newbold | 2020-01-15 | 2 | -3/+21 |
| add postgrest checks to test mocks | Bryan Newbold | 2020-01-14 | 1 | -1/+9 |
| tests: don't use localhost as a responses mock host | Bryan Newbold | 2020-01-14 | 2 | -6/+6 |
| SPNv2 doesn't support FTP; add a live test for non-revist FTP | Bryan Newbold | 2020-01-14 | 1 | -0/+16 |
| more ftp status 226 support | Bryan Newbold | 2020-01-14 | 3 | -3/+9 |
| add live tests for ftp, revisits | Bryan Newbold | 2020-01-14 | 1 | -1/+36 |
| more live tests (for regressions) | Bryan Newbold | 2020-01-10 | 1 | -0/+41 |
| refactor ingest to a loop, allowing multiple hops | Bryan Newbold | 2020-01-09 | 1 | -2/+9 |
| add (skipped) live tests for wayback services | Bryan Newbold | 2020-01-09 | 1 | -0/+73 |
| add ingest test file | Bryan Newbold | 2020-01-09 | 1 | -0/+120 |
| lots of progress on wayback refactoring | Bryan Newbold | 2020-01-09 | 1 | -1/+7 |
| location comes as a string, not list | Bryan Newbold | 2020-01-09 | 1 | -4/+4 |
| wrap up basic (locally testable) ingest refactor | Bryan Newbold | 2020-01-09 | 1 | -4/+48 |
| basic elife+plos extraction tests | Bryan Newbold | 2020-01-09 | 3 | -0/+4842 |
| fix grobid test (ISO-8859-1 encoding) | Bryan Newbold | 2020-01-09 | 1 | -6/+4 |
| fix grobid tests for new wayback refactors | Bryan Newbold | 2020-01-09 | 2 | -12/+14 |