| | Commit message | Author | Age | Files | Lines |
---|---|---|---|---|---|
* | ingest: more bogus domain patterns | Bryan Newbold | 2022-07-15 | 1 | -0/+3 |
* | ingest: another form of cookie block URL | Bryan Newbold | 2022-07-15 | 1 | -0/+2 |
| | This still doesn't short-cut CDX lookup chain, because that is all pure redirects happening in ia.py. | | | | |
* | ingest: doaj.org article landing page access links | Bryan Newbold | 2022-07-12 | 1 | -1/+0 |
* | ingest: IEEE domain is blocking us | Bryan Newbold | 2022-07-07 | 1 | -1/+2 |
* | ingest: skip arxiv.org DOIs, we already direct-ingest | Bryan Newbold | 2022-05-11 | 1 | -0/+1 |
* | ingest: more loginwall patterns | Bryan Newbold | 2022-05-05 | 1 | -0/+3 |
* | block isiarticles.com from future PDF crawls | Bryan Newbold | 2022-04-20 | 1 | -0/+2 |
* | file ingest: don't 'backoff' on spn2 backoff error | Bryan Newbold | 2022-03-22 | 1 | -0/+7 |
| | The intent of this is to try and get through the daily ingest requests faster, so we can loop and retry if needed. A 200 second delay, usually resulting in a kafka topic reshuffle, really slows things down. This will presumably result in a bunch of spn2-backoff status requests, but we can just retry those. | | | | |
* | null-body -> empty-blob | Bryan Newbold | 2022-01-13 | 1 | -2/+2 |
* | ingest_file: more efficient GROBID metadata copy | Bryan Newbold | 2021-11-12 | 1 | -3/+3 |
* | ingest: start re-processing GROBID with newer version | Bryan Newbold | 2021-11-10 | 1 | -2/+6 |
* | make fmt (black 21.9b0) | Bryan Newbold | 2021-10-27 | 1 | -216/+261 |
* | bugfix: setting html_biblio on ingest results | Bryan Newbold | 2021-10-26 | 1 | -1/+1 |
| | This was caught during lint cleanup | | | | |
* | ingest file HTTP API: fixes from type checking | Bryan Newbold | 2021-10-26 | 1 | -3/+3 |
| | This code is deprecated and should be removed anyways, but still interesting to see the fixes | | | | |
* | more progress on type annotations | Bryan Newbold | 2021-10-26 | 1 | -12/+21 |
* | more progress on type annotations and linting | Bryan Newbold | 2021-10-26 | 1 | -0/+2 |
* | flake8 clean (with current settings) | Bryan Newbold | 2021-10-26 | 1 | -2/+2 |
* | start handling trivial lint cleanups: unused imports, 'is None', etc | Bryan Newbold | 2021-10-26 | 1 | -13/+9 |
* | make fmt | Bryan Newbold | 2021-10-26 | 1 | -64/+81 |
* | python: isort all imports | Bryan Newbold | 2021-10-26 | 1 | -13/+13 |
* | improve fileset ingest integration with file ingest | Bryan Newbold | 2021-10-15 | 1 | -4/+8 |
* | move SPNv2 'simple_get' logic to SPN client | Bryan Newbold | 2021-10-15 | 1 | -27/+1 |
* | component ingest support for dataverse files (individual) | Bryan Newbold | 2021-10-15 | 1 | -0/+4 |
* | wrap up previous renaming work | Bryan Newbold | 2021-10-15 | 1 | -3/+1 |
* | refactoring; progress on filesets | Bryan Newbold | 2021-10-15 | 1 | -0/+5 |
* | rename some python files for clarity | Bryan Newbold | 2021-10-15 | 1 | -0/+833 |