Commit message | Author | Age | Files | Lines | ||
---|---|---|---|---|---|---|
... | ||||||
* | refactoring; progress on filesets | Bryan Newbold | 2021-10-15 | 3 | -9/+27 | |
| | ||||||
* | rename some python files for clarity | Bryan Newbold | 2021-10-15 | 3 | -0/+0 | |
| | ||||||
* | pdf ingest: journals.uchicago.edu pattern | Bryan Newbold | 2021-10-11 | 1 | -0/+8 | |
| | ||||||
* | spn: avoid 'None' job_id | Bryan Newbold | 2021-10-11 | 1 | -2/+2 | |
| | | | | | | Thanks Vanglis for reporting these. Not sure this commit fixes *all* instances of the problem. | |||||
* | cdx_collection.py: minor lint issue | Bryan Newbold | 2021-10-04 | 1 | -1/+1 | |
| | ||||||
* | ingest: basic 'component' and 'src' support | Bryan Newbold | 2021-10-04 | 2 | -20/+84 | |
| | ||||||
* | html ingest: report dt with broken CDX records | Bryan Newbold | 2021-10-04 | 1 | -1/+1 | |
| | ||||||
* | allow through unknown-scope HTML ingests, for possible SPN import | Bryan Newbold | 2021-10-01 | 1 | -11/+5 | |
| | ||||||
* | html: fix logging of broken CDX URL | Bryan Newbold | 2021-10-01 | 1 | -1/+1 | |
| | ||||||
* | ingest CDX lookup: weigh year+month of capture against in-petabox-or-not | Bryan Newbold | 2021-09-30 | 1 | -0/+1 | |
| | | | | | | | | This is to try working around an issue where ingests fail because an SPN capture is much newer, but the old sorting preference ignored that. Note that the sorting logic is pretty busted anyways, and we should probably allow returning multiple matching files to try. | |||||
* | fix typo with spn_cdx_retry_sec arg | Bryan Newbold | 2021-09-30 | 1 | -1/+1 | |
| | ||||||
* | tune SPN CDX retry/wait depending on mode (priority vs daily) | Bryan Newbold | 2021-09-30 | 3 | -3/+9 | |
| | ||||||
* | yet another bad PDF sha1 | Bryan Newbold | 2021-09-30 | 1 | -0/+1 | |
| | ||||||
* | new 'daily' and 'priority' ingest request topics | Bryan Newbold | 2021-09-30 | 1 | -1/+7 | |
| | | | | | | | | | The old ingest request queue kept getting lopsided, probably because it was scaled up (additional partitions) at some point in the past. Hoping new topics will fix this. The new '-priority' queue is like '-bulk', but for smaller-volume SPN-like requests, e.g. interactive mode. | |||||
* | old HTML extractors: handle null tag | Bryan Newbold | 2021-09-08 | 1 | -8/+9 | |
| | ||||||
* | ingest: more block patterns, for huge databases | Bryan Newbold | 2021-09-08 | 1 | -1/+4 | |
| | ||||||
* | yet more PDF sha1 to skip | Bryan Newbold | 2021-09-03 | 1 | -0/+5 | |
| | ||||||
* | yet more PDF URL patterns | Bryan Newbold | 2021-09-03 | 1 | -0/+48 | |
| | ||||||
* | ingest: check URL blocklist again after redirects | Bryan Newbold | 2021-09-03 | 1 | -0/+7 | |
| | ||||||
* | refactor and expand wall/block/cookie URL patterns | Bryan Newbold | 2021-09-03 | 2 | -6/+39 | |
| | ||||||
* | HTML ingest: several more PDF fulltext URL patterns | Bryan Newbold | 2021-09-03 | 1 | -0/+87 | |
| | ||||||
* | HTML ingest: skip noisy print() statement | Bryan Newbold | 2021-09-03 | 1 | -1/+1 | |
| | ||||||
* | HTML ingest: more meta-URI prefixes | Bryan Newbold | 2021-08-24 | 1 | -2/+8 | |
| | ||||||
* | html ingest: detect some blog platforms, and allow lower wordcount threshold | Bryan Newbold | 2021-08-16 | 1 | -0/+6 | |
| | ||||||
* | html ingest: detect domain homepage (no path) as special case | Bryan Newbold | 2021-08-16 | 1 | -0/+8 | |
| | ||||||
* | html ingest: skip 'about:blank' | Bryan Newbold | 2021-08-16 | 1 | -0/+3 | |
| | | | | | Couldn't get adblock rule matcher to match this, for some reason. Maybe a special case? | |||||
* | more bad PDF hashes | Bryan Newbold | 2021-07-26 | 1 | -0/+2 | |
| | ||||||
* | ingest: fix postgrest lookup bug (double get of GROBID) | Bryan Newbold | 2021-07-26 | 1 | -1/+1 | |
| | ||||||
* | more blocked-cookie patterns; fix old typo | Bryan Newbold | 2021-07-14 | 1 | -2/+2 | |
| | ||||||
* | another bad PDF sha1 | Bryan Newbold | 2021-07-13 | 1 | -0/+1 | |
| | ||||||
* | crawl: SPN2 non-200 success code path | Bryan Newbold | 2021-07-13 | 1 | -11/+25 | |
| | ||||||
* | crawl: SPN self-redirect hack | Bryan Newbold | 2021-07-13 | 1 | -0/+9 | |
| | ||||||
* | crawl: small comment updates | Bryan Newbold | 2021-07-13 | 1 | -3/+6 | |
| | ||||||
* | another lowercase DOI in an (unused?) script | Bryan Newbold | 2021-07-13 | 1 | -1/+1 | |
| | ||||||
* | gitignore: samples/ | Bryan Newbold | 2021-07-13 | 1 | -0/+1 | |
| | ||||||
* | add crossref postgrest fetch support to python db helpers | Bryan Newbold | 2021-06-02 | 1 | -0/+9 | |
| | ||||||
* | python Makefile: fix test/*.py linting with newer pylint | Bryan Newbold | 2021-05-24 | 1 | -1/+1 | |
| | ||||||
* | ingest: fix html PDF extraction exception catch behavior | Bryan Newbold | 2021-05-24 | 1 | -3/+2 | |
| | ||||||
* | ingest PDF extraction updates | Bryan Newbold | 2021-05-21 | 3 | -2/+74 | |
| | ||||||
* | better OSF preprint download re-writing | Bryan Newbold | 2021-05-21 | 1 | -6/+23 | |
| | ||||||
* | html ingest: remove whitespace around relative URLs (eg, for d-lib) | Bryan Newbold | 2021-05-21 | 1 | -1/+1 | |
| | ||||||
* | add cdx_collection.py python script (from scratch repo) | Bryan Newbold | 2021-05-04 | 1 | -0/+80 | |
| | ||||||
* | ingest: cap max body size to ~128 MByte | Bryan Newbold | 2021-04-27 | 1 | -0/+6 | |
| | | | | Should resolve 'magic' OOM errors in production. | |||||
* | persist: skip very long URLs | Bryan Newbold | 2021-04-12 | 1 | -0/+4 | |
| | ||||||
* | update default postgrest ('db') API endpoint | Bryan Newbold | 2021-04-09 | 1 | -1/+1 | |
| | ||||||
* | grobid: disable biblio-glutton consolidation | Bryan Newbold | 2021-04-07 | 1 | -3/+3 | |
| | ||||||
* | ingest: handle current degruyter PDF link pattern | Bryan Newbold | 2021-03-26 | 1 | -0/+8 | |
| | ||||||
* | add missing dotfiles (due to gitignore oops) | Bryan Newbold | 2021-01-18 | 2 | -0/+12 | |
| | ||||||
* | pipenv: lock minio S3 library to <7.0.0 | Bryan Newbold | 2021-01-14 | 2 | -242/+196 | |
| | | | | | | | | | | | In this upstream commit: https://github.com/minio/minio-py/commit/b81883a98e6f8a09e2903609caabbf0956dd0ec9 The API for errors changes, which makes it harder for us to catch specific exceptions (such as "NoSuchKey" as a Not Found / 404 error). Instead of refactoring, just going to pin the library. We should probably remove this library for a non-implementation-specific S3 client at some point; minio seems simpler than, eg, boto3, but there is probably something even simpler out there. | |||||
* | more expansive python/.gitignore rules (all .gz) | Bryan Newbold | 2021-01-05 | 1 | -1/+1 | |
| |
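
The 2021-09-30 commit "ingest CDX lookup: weigh year+month of capture against in-petabox-or-not" describes a sort preference that is easier to follow with a small example. The following is a minimal sketch of that idea only, not the repository's actual code: the `CdxRow` type, the `in_petabox` heuristic (a `/` in the WARC path), and the exact ordering of the sort key are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CdxRow:
    """Hypothetical, simplified CDX capture record (not the project's real type)."""
    url: str
    datetime: str     # 14-digit wayback timestamp, e.g. "20210930101010"
    status_code: int
    warc_path: str


def in_petabox(row: CdxRow) -> bool:
    # Assumption for this sketch: petabox-hosted WARCs have an "item/file" path.
    return "/" in row.warc_path


def sort_key(row: CdxRow) -> Tuple[bool, int, bool, int]:
    # Tuples compare element by element, so earlier fields dominate later ones.
    return (
        row.status_code == 200,   # successful captures first
        int(row.datetime[:6]),    # then the newest year+month (YYYYMM)
        in_petabox(row),          # then prefer captures already in petabox
        int(row.datetime),        # finally the newest full timestamp
    )


def best_capture(rows: List[CdxRow]) -> CdxRow:
    """Return the preferred capture under the sketch's ordering."""
    return max(rows, key=sort_key)


if __name__ == "__main__":
    old = CdxRow("http://example.com/a.pdf", "20190102030405", 200, "ITEM-2019/a.warc.gz")
    spn = CdxRow("http://example.com/a.pdf", "20210930101010", 200, "spn-a.warc.gz")
    print(best_capture([old, spn]).datetime)
```

Because year+month sits ahead of the petabox flag in the key, a much newer SPN capture wins over an old petabox capture, while two captures from the same month still fall back to the petabox preference. As the commit message notes, returning several candidates to try might be a better design than picking a single winner.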