path: root/python/sandcrawler/ingest_file.py
Commit message | Author | Age | Files | Lines
* ingest: another wall pattern, and check for walls in more places | Bryan Newbold | 2022-10-24 | 1 | -1/+14
* html ingest: handle TEI-XML parse error | Bryan Newbold | 2022-07-28 | 1 | -1/+4
* ingest: bump max-hops from 6 to 8 | Bryan Newbold | 2022-07-20 | 1 | -1/+1
* ingest: more bogus domain patterns | Bryan Newbold | 2022-07-15 | 1 | -0/+3
* ingest: another form of cookie block URL | Bryan Newbold | 2022-07-15 | 1 | -0/+2
    This still doesn't short-cut CDX lookup chain, because that is all pure redirects happening in ia.py.
* ingest: doaj.org article landing page access links | Bryan Newbold | 2022-07-12 | 1 | -1/+0
* ingest: IEEE domain is blocking us | Bryan Newbold | 2022-07-07 | 1 | -1/+2
* ingest: skip arxiv.org DOIs, we already direct-ingest | Bryan Newbold | 2022-05-11 | 1 | -0/+1
* ingest: more loginwall patterns | Bryan Newbold | 2022-05-05 | 1 | -0/+3
* block isiarticles.com from future PDF crawls | Bryan Newbold | 2022-04-20 | 1 | -0/+2
* file ingest: don't 'backoff' on spn2 backoff error | Bryan Newbold | 2022-03-22 | 1 | -0/+7
    The intent of this is to try and get through the daily ingest requests faster, so we can loop and retry if needed. A 200 second delay, usually resulting in a kafka topic reshuffle, really slows things down. This will presumably result in a bunch of spn2-backoff status requests, but we can just retry those.
* null-body -> empty-blob | Bryan Newbold | 2022-01-13 | 1 | -2/+2
* ingest_file: more efficient GROBID metadata copy | Bryan Newbold | 2021-11-12 | 1 | -3/+3
* ingest: start re-processing GROBID with newer version | Bryan Newbold | 2021-11-10 | 1 | -2/+6
* make fmt (black 21.9b0) | Bryan Newbold | 2021-10-27 | 1 | -216/+261
* bugfix: setting html_biblio on ingest results | Bryan Newbold | 2021-10-26 | 1 | -1/+1
    This was caught during lint cleanup
* ingest file HTTP API: fixes from type checking | Bryan Newbold | 2021-10-26 | 1 | -3/+3
    This code is deprecated and should be removed anyways, but still interesting to see the fixes
* more progress on type annotations | Bryan Newbold | 2021-10-26 | 1 | -12/+21
* more progress on type annotations and linting | Bryan Newbold | 2021-10-26 | 1 | -0/+2
* flake8 clean (with current settings) | Bryan Newbold | 2021-10-26 | 1 | -2/+2
* start handling trivial lint cleanups: unused imports, 'is None', etc | Bryan Newbold | 2021-10-26 | 1 | -13/+9
* make fmt | Bryan Newbold | 2021-10-26 | 1 | -64/+81
* python: isort all imports | Bryan Newbold | 2021-10-26 | 1 | -13/+13
* improve fileset ingest integration with file ingest | Bryan Newbold | 2021-10-15 | 1 | -4/+8
* move SPNv2 'simple_get' logic to SPN client | Bryan Newbold | 2021-10-15 | 1 | -27/+1
* component ingest support for dataverse files (individual) | Bryan Newbold | 2021-10-15 | 1 | -0/+4
* wrap up previous renaming work | Bryan Newbold | 2021-10-15 | 1 | -3/+1
* refactoring; progress on filesets | Bryan Newbold | 2021-10-15 | 1 | -0/+5
* rename some python files for clarity | Bryan Newbold | 2021-10-15 | 1 | -0/+833
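
The 2022-03-22 commit ("file ingest: don't 'backoff' on spn2 backoff error") describes returning quickly with an spn2-backoff status instead of waiting out a long delay, and retrying those requests later. A minimal sketch of that idea follows. It is not the code from ingest_file.py: `BackoffError`, `ingest_one`, `needs_retry`, and `fake_fetch` are hypothetical names invented for illustration, and only the `spn2-backoff` status string comes from the commit message.

```python
from typing import Any, Callable, Dict, List


class BackoffError(Exception):
    """Hypothetical stand-in for a 'please back off' error from an SPN2 client."""


def ingest_one(request: Dict[str, Any], fetch: Callable[[str], bytes]) -> Dict[str, Any]:
    """Run one ingest request, reporting a backoff as a status instead of sleeping."""
    result: Dict[str, Any] = {"request": request, "hit": False}
    try:
        body = fetch(request["base_url"])
    except BackoffError as e:
        # No long sleep here: return right away so the worker keeps consuming
        # requests, and leave the retry to a later pass.
        result["status"] = "spn2-backoff"
        result["error_message"] = str(e)
        return result
    result["hit"] = True
    result["status"] = "success"
    result["size_bytes"] = len(body)
    return result


def needs_retry(results: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Collect the original requests whose result was a backoff."""
    return [r["request"] for r in results if r["status"] == "spn2-backoff"]


def fake_fetch(url: str) -> bytes:
    """Toy fetcher (assumption): URLs containing 'busy' trigger a backoff."""
    if "busy" in url:
        raise BackoffError("too many active sessions")
    return b"%PDF-1.4"
```

Under these assumptions, a batch run would call `ingest_one` for each request, then feed `needs_retry(results)` back into a later loop. The trade-off named in the commit message still applies: more results come back as spn2-backoff, but the consumer never stalls long enough to trigger a kafka topic reshuffle.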