path: root/python
Commit message | Author | Age | Files | Lines
* scripts: example archiveorg-to-fileset importer | Bryan Newbold | 2021-10-15 | 1 | -0/+138
|
* refactoring; progress on filesets | Bryan Newbold | 2021-10-15 | 3 | -9/+27
|
* rename some python files for clarity | Bryan Newbold | 2021-10-15 | 3 | -0/+0
|
* pdf ingest: journals.uchicago.edu pattern | Bryan Newbold | 2021-10-11 | 1 | -0/+8
|
* spn: avoid 'None' job_id | Bryan Newbold | 2021-10-11 | 1 | -2/+2
|     Thanks Vanglis for reporting these. Not sure this commit fixes *all* instances of the problem.
* cdx_collection.py: minor lint issue | Bryan Newbold | 2021-10-04 | 1 | -1/+1
|
* ingest: basic 'component' and 'src' support | Bryan Newbold | 2021-10-04 | 2 | -20/+84
|
* html ingest: report dt with broken CDX records | Bryan Newbold | 2021-10-04 | 1 | -1/+1
|
* allow through unknown-scope HTML ingests, for possible SPN import | Bryan Newbold | 2021-10-01 | 1 | -11/+5
|
* html: fix logging of broken CDX URL | Bryan Newbold | 2021-10-01 | 1 | -1/+1
|
* ingest CDX lookup: weigh year+month of capture against in-petabox-or-not | Bryan Newbold | 2021-09-30 | 1 | -0/+1
|     This is to try working around an issue where ingests fail because an SPN capture is much newer, but the old sorting preference ignored that.
|     Note that the sorting logic is pretty busted anyways, and we should probably allow returning multiple matching files to try.
* fix typo with spn_cdx_retry_sec arg | Bryan Newbold | 2021-09-30 | 1 | -1/+1
|
* tune SPN CDX retry/wait depending on mode (priority vs daily) | Bryan Newbold | 2021-09-30 | 3 | -3/+9
|
* yet another bad PDF sha1 | Bryan Newbold | 2021-09-30 | 1 | -0/+1
|
* new 'daily' and 'priority' ingest request topics | Bryan Newbold | 2021-09-30 | 1 | -1/+7
|     The old ingest request queue was always getting lopsided, I suspect because it was scaled up (additional partitions) at some point in the past; hoping new topics will fix this.
|     New '-priority' queue is like '-bulk', but for smaller-volume SPN-like requests. Eg, interactive mode.
* old HTML extractors: handle null tag | Bryan Newbold | 2021-09-08 | 1 | -8/+9
|
* ingest: more block patterns, for huge databases | Bryan Newbold | 2021-09-08 | 1 | -1/+4
|
* yet more PDF sha1 to skip | Bryan Newbold | 2021-09-03 | 1 | -0/+5
|
* yet more PDF URL patterns | Bryan Newbold | 2021-09-03 | 1 | -0/+48
|
* ingest: check URL blocklist again after redirects | Bryan Newbold | 2021-09-03 | 1 | -0/+7
|
* refactor and expand wall/block/cookie URL patterns | Bryan Newbold | 2021-09-03 | 2 | -6/+39
|
* HTML ingest: several more PDF fulltext URL patterns | Bryan Newbold | 2021-09-03 | 1 | -0/+87
|
* HTML ingest: skip noisy print() statement | Bryan Newbold | 2021-09-03 | 1 | -1/+1
|
* HTML ingest: more meta-URI prefixes | Bryan Newbold | 2021-08-24 | 1 | -2/+8
|
* html ingest: detect some blog platforms, and allow lower wordcount threshold | Bryan Newbold | 2021-08-16 | 1 | -0/+6
|
* html ingest: detect domain homepage (no path) as special case | Bryan Newbold | 2021-08-16 | 1 | -0/+8
|
* html ingest: skip 'about:blank' | Bryan Newbold | 2021-08-16 | 1 | -0/+3
|     Couldn't get adblock rule matcher to match this, for some reason. Maybe a special case?
* more bad PDF hashes | Bryan Newbold | 2021-07-26 | 1 | -0/+2
|
* ingest: fix postgrest lookup bug (double get of GROBID) | Bryan Newbold | 2021-07-26 | 1 | -1/+1
|
* more blocked-cookie patterns; fix old typo | Bryan Newbold | 2021-07-14 | 1 | -2/+2
|
* another bad PDF sha1 | Bryan Newbold | 2021-07-13 | 1 | -0/+1
|
* crawl: SPN2 non-200 success code path | Bryan Newbold | 2021-07-13 | 1 | -11/+25
|
* crawl: SPN self-redirect hack | Bryan Newbold | 2021-07-13 | 1 | -0/+9
|
* crawl: small comment updates | Bryan Newbold | 2021-07-13 | 1 | -3/+6
|
* another lowercase DOI in an (unused?) script | Bryan Newbold | 2021-07-13 | 1 | -1/+1
|
* gitignore: samples/ | Bryan Newbold | 2021-07-13 | 1 | -0/+1
|
* add crossref postgrest fetch support to python db helpers | Bryan Newbold | 2021-06-02 | 1 | -0/+9
|
* python Makefile: fix test/*.py linting with newer pylint | Bryan Newbold | 2021-05-24 | 1 | -1/+1
|
* ingest: fix html PDF extraction exception catch behavior | Bryan Newbold | 2021-05-24 | 1 | -3/+2
|
* ingest PDF extraction updates | Bryan Newbold | 2021-05-21 | 3 | -2/+74
|
* better OSF preprint download re-writing | Bryan Newbold | 2021-05-21 | 1 | -6/+23
|
* html ingest: remove whitespace around relative URLs (eg, for d-lib) | Bryan Newbold | 2021-05-21 | 1 | -1/+1
|
* add cdx_collection.py python script (from scratch repo) | Bryan Newbold | 2021-05-04 | 1 | -0/+80
|
* ingest: cap max body size to ~128 MByte | Bryan Newbold | 2021-04-27 | 1 | -0/+6
|     Should resolve 'magic' OOM errors in production.
* persist: skip very long URLs | Bryan Newbold | 2021-04-12 | 1 | -0/+4
|
* update default postgrest ('db') API endpoint | Bryan Newbold | 2021-04-09 | 1 | -1/+1
|
* grobid: disable biblio-glutton consolidation | Bryan Newbold | 2021-04-07 | 1 | -3/+3
|
* ingest: handle current degruyter PDF link pattern | Bryan Newbold | 2021-03-26 | 1 | -0/+8
|
* add missing dotfiles (due to gitignore oops) | Bryan Newbold | 2021-01-18 | 2 | -0/+12
|
* pipenv: lock minio S3 library to <7.0.0 | Bryan Newbold | 2021-01-14 | 2 | -242/+196
|     In this upstream commit: https://github.com/minio/minio-py/commit/b81883a98e6f8a09e2903609caabbf0956dd0ec9
|     The API for errors changes, which makes it harder for us to catch specific exceptions (such as "NoSuchKey" as a Not Found / 404 error).
|     Instead of refactoring, just going to pin the library.
|     We should probably replace this library with a non-implementation-specific S3 client at some point; minio seems simpler than, eg, boto3, but there is probably something even simpler out there.
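
Illustration for the 2021-09-30 commit "ingest CDX lookup: weigh year+month of capture against in-petabox-or-not": the message describes ranking candidate CDX rows so that a much newer capture is not passed over merely because an older one lives in petabox. A minimal sketch of that idea follows; it is not the repository's actual code. The `CdxRow` fields, the `in_petabox` flag, and the helper names `sort_key` and `pick_best` are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CdxRow:
    # Hypothetical, simplified CDX row; field names are assumptions.
    url: str
    datetime: str      # 14-digit capture timestamp, eg "20210930123456"
    status_code: int
    mimetype: str
    in_petabox: bool   # assumed flag: capture is stored in a petabox item


def sort_key(row: CdxRow, best_mimetype: str = "application/pdf") -> tuple:
    # Tuples compare left to right, so earlier fields dominate later ones.
    return (
        int(row.status_code == 200),
        int(row.mimetype == best_mimetype),
        int(row.datetime[:6]),   # year+month is weighed before petabox status
        int(row.in_petabox),     # petabox only breaks ties within one month
        int(row.datetime),       # finally prefer the newest capture
    )


def pick_best(rows: List[CdxRow]) -> Optional[CdxRow]:
    # Return the highest-ranked row, or None when there are no candidates.
    return max(rows, key=sort_key) if rows else None
```

With this ordering, a non-petabox capture from a later month outranks an older petabox capture, while petabox status still decides between captures from the same month. The commit message itself notes that the real sorting logic has other problems and that returning multiple candidates might be a better fix.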