path: root/python
Commit message [Author, Age, Files, Lines]
* progress on dataset ingest [Bryan Newbold, 2021-10-15, 4, -122/+333]
|
* ingest tool: always require ingest type as part of 'single' command [Bryan Newbold, 2021-10-15, 1, -3/+3]
|
* wrap up previous renaming work [Bryan Newbold, 2021-10-15, 4, -6/+4]
|
* progress on fileset/dataset ingest [Bryan Newbold, 2021-10-15, 4, -0/+403]
|
* scripts: example archiveorg-to-fileset importer [Bryan Newbold, 2021-10-15, 1, -0/+138]
|
* refactoring; progress on filesets [Bryan Newbold, 2021-10-15, 3, -9/+27]
|
* rename some python files for clarity [Bryan Newbold, 2021-10-15, 3, -0/+0]
|
* pdf ingest: journals.uchicago.edu pattern [Bryan Newbold, 2021-10-11, 1, -0/+8]
|
* spn: avoid 'None' job_id [Bryan Newbold, 2021-10-11, 1, -2/+2]
|     Thanks Vanglis for reporting these. Not sure this commit fixes *all* instances of the problem.
* cdx_collection.py: minor lint issue [Bryan Newbold, 2021-10-04, 1, -1/+1]
|
* ingest: basic 'component' and 'src' support [Bryan Newbold, 2021-10-04, 2, -20/+84]
|
* html ingest: report dt with broken CDX records [Bryan Newbold, 2021-10-04, 1, -1/+1]
|
* allow through unknown-scope HTML ingests, for possible SPN import [Bryan Newbold, 2021-10-01, 1, -11/+5]
|
* html: fix logging of broken CDX URL [Bryan Newbold, 2021-10-01, 1, -1/+1]
|
* ingest CDX lookup: weigh year+month of capture against in-petabox-or-not [Bryan Newbold, 2021-09-30, 1, -0/+1]
|     This is to try working around an issue where ingests fail because an SPN capture is much newer, but the old sorting preference ignored that. Note that the sorting logic is pretty busted anyways, and we should probably allow returning multiple matching files to try.
* fix typo with spn_cdx_retry_sec arg [Bryan Newbold, 2021-09-30, 1, -1/+1]
|
* tune SPN CDX retry/wait depending on mode (priority vs daily) [Bryan Newbold, 2021-09-30, 3, -3/+9]
|
* yet another bad PDF sha1 [Bryan Newbold, 2021-09-30, 1, -0/+1]
|
* new 'daily' and 'priority' ingest request topics [Bryan Newbold, 2021-09-30, 1, -1/+7]
|     The old ingest request queue was always getting lopsided, suspect because it was scaled up (additional partitions) at some point in the past, hoping new topics will fix this. New '-priority' queue is like '-bulk', but for smaller-volume SPN-like requests. Eg, interactive mode.
* old HTML extractors: handle null tag [Bryan Newbold, 2021-09-08, 1, -8/+9]
|
* ingest: more block patterns, for huge databases [Bryan Newbold, 2021-09-08, 1, -1/+4]
|
* yet more PDF sha1 to skip [Bryan Newbold, 2021-09-03, 1, -0/+5]
|
* yet more PDF URL patterns [Bryan Newbold, 2021-09-03, 1, -0/+48]
|
* ingest: check URL blocklist again after redirects [Bryan Newbold, 2021-09-03, 1, -0/+7]
|
* refactor and expand wall/block/cookie URL patterns [Bryan Newbold, 2021-09-03, 2, -6/+39]
|
* HTML ingest: several more PDF fulltext URL patterns [Bryan Newbold, 2021-09-03, 1, -0/+87]
|
* HTML ingest: skip noisy print() statement [Bryan Newbold, 2021-09-03, 1, -1/+1]
|
* HTML ingest: more meta-URI prefixes [Bryan Newbold, 2021-08-24, 1, -2/+8]
|
* html ingest: detect some blog platforms, and allow lower wordcount threshold [Bryan Newbold, 2021-08-16, 1, -0/+6]
|
* html ingest: detect domain homepage (no path) as special case [Bryan Newbold, 2021-08-16, 1, -0/+8]
|
* html ingest: skip 'about:blank' [Bryan Newbold, 2021-08-16, 1, -0/+3]
|     Couldn't get adblock rule matcher to match this, for some reason. maybe a special case?
* more bad PDF hashes [Bryan Newbold, 2021-07-26, 1, -0/+2]
|
* ingest: fix postgrest lookup bug (double get of GROBID) [Bryan Newbold, 2021-07-26, 1, -1/+1]
|
* more blocked-cookie patterns; fix old typo [Bryan Newbold, 2021-07-14, 1, -2/+2]
|
* another bad PDF sha1 [Bryan Newbold, 2021-07-13, 1, -0/+1]
|
* crawl: SPN2 non-200 success code path [Bryan Newbold, 2021-07-13, 1, -11/+25]
|
* crawl: SPN self-redirect hack [Bryan Newbold, 2021-07-13, 1, -0/+9]
|
* crawl: small comment updates [Bryan Newbold, 2021-07-13, 1, -3/+6]
|
* another lowercase DOI in an (unused?) script [Bryan Newbold, 2021-07-13, 1, -1/+1]
|
* gitignore: samples/ [Bryan Newbold, 2021-07-13, 1, -0/+1]
|
* add crossref postgrest fetch support to python db helpers [Bryan Newbold, 2021-06-02, 1, -0/+9]
|
* python Makefile: fix test/*.py linting with newer pylint [Bryan Newbold, 2021-05-24, 1, -1/+1]
|
* ingest: fix html PDF extraction exception catch behavior [Bryan Newbold, 2021-05-24, 1, -3/+2]
|
* ingest PDF extraction updates [Bryan Newbold, 2021-05-21, 3, -2/+74]
|
* better OSF preprint download re-writing [Bryan Newbold, 2021-05-21, 1, -6/+23]
|
* html ingest: remove whitespace around relative URLs (eg, for d-lib) [Bryan Newbold, 2021-05-21, 1, -1/+1]
|
* add cdx_collection.py python script (from scratch repo) [Bryan Newbold, 2021-05-04, 1, -0/+80]
|
* ingest: cap max body size to ~128 MByte [Bryan Newbold, 2021-04-27, 1, -0/+6]
|     Should resolve 'magic' OOM errors in production.
* persist: skip very long URLs [Bryan Newbold, 2021-04-12, 1, -0/+4]
|
* update default postgrest ('db') API endpoint [Bryan Newbold, 2021-04-09, 1, -1/+1]
|