path: root/python
Commit message (Author, Age, Files, Lines)
* more fileset iteration (Bryan Newbold, 2021-10-15, 5 files, -45/+81)
|
* move SPNv2 'simple_get' logic to SPN client (Bryan Newbold, 2021-10-15, 3 files, -52/+31)
|
* filesets: iteration of implementation and docs (Bryan Newbold, 2021-10-15, 4 files, -82/+148)
|
* fileset ingest: improve platform parsing (Bryan Newbold, 2021-10-15, 1 file, -12/+196)
|
* fileset ingest: improve error handling (Bryan Newbold, 2021-10-15, 4 files, -48/+106)
|
* initial implementation of zenodo platform import (Bryan Newbold, 2021-10-15, 1 file, -0/+100)
|
* initial figshare platform helper (Bryan Newbold, 2021-10-15, 1 file, -0/+95)
|
* improvements to platform helpers (Bryan Newbold, 2021-10-15, 3 files, -34/+44)
|
* component ingest support for dataverse files (individual) (Bryan Newbold, 2021-10-15, 2 files, -13/+31)
|
* progress on web ingest strategy (Bryan Newbold, 2021-10-15, 3 files, -12/+121)
|
* fileset ingest progress for dataverse (Bryan Newbold, 2021-10-15, 4 files, -23/+291)
|
* local-file version of gen_file_metadata (Bryan Newbold, 2021-10-15, 3 files, -3/+56)
|
* progress on dataset ingest (Bryan Newbold, 2021-10-15, 4 files, -122/+333)
|
* ingest tool: always require ingest type as part of 'single' command (Bryan Newbold, 2021-10-15, 1 file, -3/+3)
|
* wrap up previous renaming work (Bryan Newbold, 2021-10-15, 4 files, -6/+4)
|
* progress on fileset/dataset ingest (Bryan Newbold, 2021-10-15, 4 files, -0/+403)
|
* scripts: example archiveorg-to-fileset importer (Bryan Newbold, 2021-10-15, 1 file, -0/+138)
|
* refactoring; progress on filesets (Bryan Newbold, 2021-10-15, 3 files, -9/+27)
|
* rename some python files for clarity (Bryan Newbold, 2021-10-15, 3 files, -0/+0)
|
* pdf ingest: journals.uchicago.edu pattern (Bryan Newbold, 2021-10-11, 1 file, -0/+8)
|
* spn: avoid 'None' job_id (Bryan Newbold, 2021-10-11, 1 file, -2/+2)
|
|     Thanks Vanglis for reporting these. Not sure this commit fixes *all*
|     instances of the problem.
|
* cdx_collection.py: minor lint issue (Bryan Newbold, 2021-10-04, 1 file, -1/+1)
|
* ingest: basic 'component' and 'src' support (Bryan Newbold, 2021-10-04, 2 files, -20/+84)
|
* html ingest: report dt with broken CDX records (Bryan Newbold, 2021-10-04, 1 file, -1/+1)
|
* allow through unknown-scope HTML ingests, for possible SPN import (Bryan Newbold, 2021-10-01, 1 file, -11/+5)
|
* html: fix logging of broken CDX URL (Bryan Newbold, 2021-10-01, 1 file, -1/+1)
|
* ingest CDX lookup: weigh year+month of capture against in-petabox-or-not (Bryan Newbold, 2021-09-30, 1 file, -0/+1)
|
|     This is to try working around an issue where ingests fail because an
|     SPN capture is much newer, but the old sorting preference ignored that.
|
|     Note that the sorting logic is pretty busted anyways, and we should
|     probably allow returning multiple matching files to try.
|
* fix typo with spn_cdx_retry_sec arg (Bryan Newbold, 2021-09-30, 1 file, -1/+1)
|
* tune SPN CDX retry/wait depending on mode (priority vs daily) (Bryan Newbold, 2021-09-30, 3 files, -3/+9)
|
* yet another bad PDF sha1 (Bryan Newbold, 2021-09-30, 1 file, -0/+1)
|
* new 'daily' and 'priority' ingest request topics (Bryan Newbold, 2021-09-30, 1 file, -1/+7)
|
|     The old ingest request queue was always getting lopsided, suspect
|     because it was scaled up (additional partitions) at some point in the
|     past, hoping new topics will fix this.
|
|     New '-priority' queue is like '-bulk', but for smaller-volume SPN-like
|     requests. Eg, interactive mode.
|
* old HTML extractors: handle null tag (Bryan Newbold, 2021-09-08, 1 file, -8/+9)
|
* ingest: more block patterns, for huge databases (Bryan Newbold, 2021-09-08, 1 file, -1/+4)
|
* yet more PDF sha1 to skip (Bryan Newbold, 2021-09-03, 1 file, -0/+5)
|
* yet more PDF URL patterns (Bryan Newbold, 2021-09-03, 1 file, -0/+48)
|
* ingest: check URL blocklist again after redirects (Bryan Newbold, 2021-09-03, 1 file, -0/+7)
|
* refactor and expand wall/block/cookie URL patterns (Bryan Newbold, 2021-09-03, 2 files, -6/+39)
|
* HTML ingest: several more PDF fulltext URL patterns (Bryan Newbold, 2021-09-03, 1 file, -0/+87)
|
* HTML ingest: skip noisy print() statement (Bryan Newbold, 2021-09-03, 1 file, -1/+1)
|
* HTML ingest: more meta-URI prefixes (Bryan Newbold, 2021-08-24, 1 file, -2/+8)
|
* html ingest: detect some blog platforms, and allow lower wordcount threshold (Bryan Newbold, 2021-08-16, 1 file, -0/+6)
|
* html ingest: detect domain homepage (no path) as special case (Bryan Newbold, 2021-08-16, 1 file, -0/+8)
|
* html ingest: skip 'about:blank' (Bryan Newbold, 2021-08-16, 1 file, -0/+3)
|
|     Couldn't get adblock rule matcher to match this, for some reason.
|     Maybe a special case?
|
* more bad PDF hashes (Bryan Newbold, 2021-07-26, 1 file, -0/+2)
|
* ingest: fix postgrest lookup bug (double get of GROBID) (Bryan Newbold, 2021-07-26, 1 file, -1/+1)
|
* more blocked-cookie patterns; fix old typo (Bryan Newbold, 2021-07-14, 1 file, -2/+2)
|
* another bad PDF sha1 (Bryan Newbold, 2021-07-13, 1 file, -0/+1)
|
* crawl: SPN2 non-200 success code path (Bryan Newbold, 2021-07-13, 1 file, -11/+25)
|
* crawl: SPN self-redirect hack (Bryan Newbold, 2021-07-13, 1 file, -0/+9)
|
* crawl: small comment updates (Bryan Newbold, 2021-07-13, 1 file, -3/+6)
|