path: root/python/tests
Commit message [Author, Age, Files, Lines]
* remove grobid2json helper file, replace with grobid_tei_xml [Bryan Newbold, 2021-10-27, 1 file, -5/+9]
|
* make fmt (black 21.9b0) [Bryan Newbold, 2021-10-27, 13 files, -402/+595]
|
* more progress on type annotations and linting [Bryan Newbold, 2021-10-26, 2 files, -2/+2]
|
* live tests: FTP wayback replay now returns 200, not 226 [Bryan Newbold, 2021-10-26, 1 file, -2/+2]
|
* flake8 clean (with current settings) [Bryan Newbold, 2021-10-26, 2 files, -1/+2]
|
* start handling trivial lint cleanups: unused imports, 'is None', etc [Bryan Newbold, 2021-10-26, 10 files, -42/+26]
|
* make fmt [Bryan Newbold, 2021-10-26, 13 files, -194/+294]
|
* python: isort all imports [Bryan Newbold, 2021-10-26, 12 files, -20/+30]
|
* local-file version of gen_file_metadata [Bryan Newbold, 2021-10-15, 1 file, -1/+13]
|
* wrap up previous renaming work [Bryan Newbold, 2021-10-15, 1 file, -1/+1]
|
* refactor and expand wall/block/cookie URL patterns [Bryan Newbold, 2021-09-03, 1 file, -0/+14]
|
* move some PDF URL extraction into declarative format [Bryan Newbold, 2020-11-08, 2 files, -9/+3]
|
* xml: re-encode XML docs into UTF-8 for persisting [Bryan Newbold, 2020-11-03, 2 files, -0/+354]
|
* html: some refactoring [Bryan Newbold, 2020-11-03, 1 file, -1/+1]
|
* html: syntax fixes; resolve relative URLs; extract more XML fulltext URLs [Bryan Newbold, 2020-10-30, 1 file, -7/+8]
|
* html: work around firstmonday DOCTYPE issue [Bryan Newbold, 2020-10-30, 2 files, -0/+455]
|
* tests: fix conditional on poppler version check [Bryan Newbold, 2020-10-30, 1 file, -1/+1]
|
* improve test running and config [Bryan Newbold, 2020-10-29, 1 file, -0/+2]
|
* html: more metadata tests [Bryan Newbold, 2020-10-29, 2 files, -0/+2453]
|
* HTML metadata: fix type warnings [Bryan Newbold, 2020-10-27, 1 file, -1/+2]
|
* start HTML metadata extraction code [Bryan Newbold, 2020-10-27, 5 files, -0/+2628]
|
* check for simple URL patterns that are usually paywalls or loginwalls [Bryan Newbold, 2020-08-11, 1 file, -0/+18]
|
* fix tests passing str as HTML [Bryan Newbold, 2020-08-08, 1 file, -3/+3]
|
* another bad/non PDF test; catch correct error [Bryan Newbold, 2020-06-25, 1 file, -0/+5]
| This test doesn't actually catch the error. I'm not sure why type checks don't discover the "LockedDocumentError not part of poppler" issue though.
* pdfextract support in ingest worker [Bryan Newbold, 2020-06-25, 1 file, -0/+7]
|
* fix tests for page0_height/width [Bryan Newbold, 2020-06-25, 1 file, -2/+2]
|
* lint fixes [Bryan Newbold, 2020-06-17, 1 file, -1/+1]
|
* rename pdf tools to pdfextract [Bryan Newbold, 2020-06-17, 1 file, -0/+0]
|
* partial test coverage of pdf extract worker [Bryan Newbold, 2020-06-17, 1 file, -0/+61]
|
* remove unused common.py [Bryan Newbold, 2020-06-17, 1 file, -40/+0]
|
* url cleaning (canonicalization) for ingest base_url [Bryan Newbold, 2020-03-10, 1 file, -1/+7]
| As mentioned in comment, this first version does not re-write the URL in the `base_url` field. If we did so, then ingest_request rows would not SQL JOIN to ingest_file_result rows, which we wouldn't want.
|
| In the future, behaviour should maybe be to refuse to process URLs that aren't clean (eg, if base_url != clean_url(base_url)) and return a 'bad-url' status or something. Then we would only accept clean URLs in both tables, and clear out all old/bad URLs with a cleanup script.
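The "refuse unclean URLs" idea in the message above could look roughly like the sketch below. It is illustrative only: this `clean_url` is a hypothetical stand-in built on the standard library, not the project's actual helper, and the `check_base_url` function and its 'ok' status are assumptions. Only the `base_url != clean_url(base_url)` comparison and the 'bad-url' status come from the commit message.

```python
from urllib.parse import urlsplit, urlunsplit


def clean_url(url: str) -> str:
    """Hypothetical stand-in for the project's URL cleaner.

    Strips surrounding whitespace, lowercases the scheme and the
    network location, and drops the fragment. The real helper may
    apply different rules.
    """
    parts = urlsplit(url.strip())
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, parts.query, "")
    )


def check_base_url(base_url: str) -> str:
    """Return 'bad-url' if base_url is not already clean, else 'ok'.

    The URL is never rewritten here, so a stored base_url stays
    byte-identical between request rows and result rows.
    """
    if base_url != clean_url(base_url):
        return "bad-url"
    return "ok"
```

Rejecting rather than rewriting keeps the value in `base_url` unchanged, which is what lets the two tables keep joining on it.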
* ingest: add URL blocklist feature [Bryan Newbold, 2020-01-17, 1 file, -0/+17]
| And, temporarily, block zenodo and figshare.
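A minimal sketch of what a domain-based blocklist check might look like follows. The commit names zenodo and figshare but does not show how they are matched, so the `BLOCKED_DOMAINS` tuple, the `is_blocked` name, and the host-suffix matching rule are all assumptions made for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical blocklist entries; the actual patterns used by the
# ingest code are not shown in this log.
BLOCKED_DOMAINS = ("zenodo.org", "figshare.com")


def is_blocked(url: str) -> bool:
    """Return True if the URL's host is a blocked domain or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    return any(
        host == domain or host.endswith("." + domain)
        for domain in BLOCKED_DOMAINS
    )
```

Matching on the parsed hostname rather than on a raw substring avoids accidentally blocking unrelated hosts whose names merely contain a blocked string.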
* clarify ingest result schema and semantics [Bryan Newbold, 2020-01-15, 2 files, -3/+21]
|
* add postgrest checks to test mocks [Bryan Newbold, 2020-01-14, 1 file, -1/+9]
|
* tests: don't use localhost as a responses mock host [Bryan Newbold, 2020-01-14, 2 files, -6/+6]
|
* SPNv2 doesn't support FTP; add a live test for non-revisit FTP [Bryan Newbold, 2020-01-14, 1 file, -0/+16]
|
* more ftp status 226 support [Bryan Newbold, 2020-01-14, 3 files, -3/+9]
|
* add live tests for ftp, revisits [Bryan Newbold, 2020-01-14, 1 file, -1/+36]
|
* more live tests (for regressions) [Bryan Newbold, 2020-01-10, 1 file, -0/+41]
|
* refactor ingest to a loop, allowing multiple hops [Bryan Newbold, 2020-01-09, 1 file, -2/+9]
|
* add (skipped) live tests for wayback services [Bryan Newbold, 2020-01-09, 1 file, -0/+73]
|
* add ingest test file [Bryan Newbold, 2020-01-09, 1 file, -0/+120]
| Forgot to commit earlier!
* lots of progress on wayback refactoring [Bryan Newbold, 2020-01-09, 1 file, -1/+7]
| - too much to list
| - canonical flags to control crawling
| - cdx_to_dict helper
* location comes as a string, not list [Bryan Newbold, 2020-01-09, 1 file, -4/+4]
|
* wrap up basic (locally testable) ingest refactor [Bryan Newbold, 2020-01-09, 1 file, -4/+48]
|
* basic elife+plos extraction tests [Bryan Newbold, 2020-01-09, 3 files, -0/+4842]
| Ripped out some HTML, but these could have been minimized even further to keep repository from growing large.
* fix grobid test (ISO-8859-1 encoding) [Bryan Newbold, 2020-01-09, 1 file, -6/+4]
| Also changes for wayback refactor
* fix grobid tests for new wayback refactors [Bryan Newbold, 2020-01-09, 2 files, -12/+14]
|
* more wayback and SPN tests and fixes [Bryan Newbold, 2020-01-09, 2 files, -13/+67]
|
* refactor CdxApiClient, add tests [Bryan Newbold, 2020-01-08, 1 file, -0/+110]
| - always use auth token and get full CDX rows
| - simplify to "fetch" (exact url/dt match) and "lookup_best" methods
| - all redirect stuff will be moved to a higher level