path: root/python/sandcrawler/ingest_file.py
Commit message | Author | Age | Files | Lines
* ingest: another wall pattern, and check for walls in more places | Bryan Newbold | 2022-10-24 | 1 | -1/+14
* html ingest: handle TEI-XML parse error | Bryan Newbold | 2022-07-28 | 1 | -1/+4
* ingest: bump max-hops from 6 to 8 | Bryan Newbold | 2022-07-20 | 1 | -1/+1
* ingest: more bogus domain patterns | Bryan Newbold | 2022-07-15 | 1 | -0/+3
* ingest: another form of cookie block URL | Bryan Newbold | 2022-07-15 | 1 | -0/+2
* ingest: doaj.org article landing page access links | Bryan Newbold | 2022-07-12 | 1 | -1/+0
* ingest: IEEE domain is blocking us | Bryan Newbold | 2022-07-07 | 1 | -1/+2
* ingest: skip arxiv.org DOIs, we already direct-ingest | Bryan Newbold | 2022-05-11 | 1 | -0/+1
* ingest: more loginwall patterns | Bryan Newbold | 2022-05-05 | 1 | -0/+3
* block isiarticles.com from future PDF crawls | Bryan Newbold | 2022-04-20 | 1 | -0/+2
* file ingest: don't 'backoff' on spn2 backoff error | Bryan Newbold | 2022-03-22 | 1 | -0/+7
* null-body -> empty-blob | Bryan Newbold | 2022-01-13 | 1 | -2/+2
* ingest_file: more efficient GROBID metadata copy | Bryan Newbold | 2021-11-12 | 1 | -3/+3
* ingest: start re-processing GROBID with newer version | Bryan Newbold | 2021-11-10 | 1 | -2/+6
* make fmt (black 21.9b0) | Bryan Newbold | 2021-10-27 | 1 | -216/+261
* bugfix: setting html_biblio on ingest results | Bryan Newbold | 2021-10-26 | 1 | -1/+1
* ingest file HTTP API: fixes from type checking | Bryan Newbold | 2021-10-26 | 1 | -3/+3
* more progress on type annotations | Bryan Newbold | 2021-10-26 | 1 | -12/+21
* more progress on type annotations and linting | Bryan Newbold | 2021-10-26 | 1 | -0/+2
* flake8 clean (with current settings) | Bryan Newbold | 2021-10-26 | 1 | -2/+2
* start handling trivial lint cleanups: unused imports, 'is None', etc | Bryan Newbold | 2021-10-26 | 1 | -13/+9
* make fmt | Bryan Newbold | 2021-10-26 | 1 | -64/+81
* python: isort all imports | Bryan Newbold | 2021-10-26 | 1 | -13/+13
* improve fileset ingest integration with file ingest | Bryan Newbold | 2021-10-15 | 1 | -4/+8
* move SPNv2 'simple_get' logic to SPN client | Bryan Newbold | 2021-10-15 | 1 | -27/+1
* component ingest support for dataverse files (individual) | Bryan Newbold | 2021-10-15 | 1 | -0/+4
* wrap up previous renaming work | Bryan Newbold | 2021-10-15 | 1 | -3/+1
* refactoring; progress on filesets | Bryan Newbold | 2021-10-15 | 1 | -0/+5
* rename some python files for clarity | Bryan Newbold | 2021-10-15 | 1 | -0/+833