path: root/python
| Commit message | Author | Date | Files | Lines (-/+) |
|---|---|---|---|---|
| ingest: random site PDF link pattern | Bryan Newbold | 2022-07-12 | 1 | -0/+7 |
| ingest: doaj.org article landing page access links | Bryan Newbold | 2022-07-12 | 2 | -1/+12 |
| ingest: IEEE domain is blocking us | Bryan Newbold | 2022-07-07 | 1 | -1/+2 |
| ingest: catch more ConnectionErrors (SPN, replay fetch, GROBID) | Bryan Newbold | 2022-05-16 | 2 | -4/+19 |
| ingest: skip arxiv.org DOIs, we already direct-ingest | Bryan Newbold | 2022-05-11 | 1 | -0/+1 |
| python make fmt | Bryan Newbold | 2022-05-05 | 1 | -3/+1 |
| ingest spn2: fix tests | Bryan Newbold | 2022-05-05 | 4 | -6/+108 |
| ingest: more loginwall patterns | Bryan Newbold | 2022-05-05 | 1 | -0/+3 |
| ingest_tool: fix arg parsing | Bryan Newbold | 2022-05-03 | 1 | -8/+8 |
| switch default kafka-broker host from wbgrp-svc263 to wbgrp-svc350 | Bryan Newbold | 2022-05-03 | 2 | -2/+2 |
| SPNv2: several fixes for prod throughput | Bryan Newbold | 2022-04-26 | 1 | -11/+34 |
| make fmt | Bryan Newbold | 2022-04-26 | 1 | -2/+5 |
| ingest_tool: spn-status command to check user's quota | Bryan Newbold | 2022-04-26 | 1 | -0/+19 |
| flake8: allow 'Any' types | Bryan Newbold | 2022-04-26 | 1 | -1/+2 |
| block isiarticles.com from future PDF crawls | Bryan Newbold | 2022-04-20 | 1 | -0/+2 |
| pipenv: update; newer devpi hostname | Bryan Newbold | 2022-04-06 | 2 | -781/+850 |
| ingest: drive.google.com ingest support | Bryan Newbold | 2022-04-04 | 1 | -0/+8 |
| filesets: fix archive.org path naming | Bryan Newbold | 2022-03-29 | 1 | -7/+8 |
| bugfix: sha1/md5 typo | Bryan Newbold | 2022-03-23 | 1 | -1/+1 |
| file ingest: don't 'backoff' on spn2 backoff error | Bryan Newbold | 2022-03-22 | 2 | -0/+8 |
| more sentry config changes | Bryan Newbold | 2022-02-25 | 5 | -5/+5 |
| small lint/typo/fmt fixes | Bryan Newbold | 2022-02-24 | 3 | -5/+5 |
| switch from 'raven' to 'sentry-sdk' | Bryan Newbold | 2022-02-24 | 5 | -37/+41 |
| another bad PDF sha1 | Bryan Newbold | 2022-02-23 | 1 | -0/+1 |
| ingest: fix mistakenly commented except block (?) | Bryan Newbold | 2022-02-18 | 1 | -4/+3 |
| ingest: handle more fileset failure modes | Bryan Newbold | 2022-02-18 | 2 | -3/+30 |
| sandcrawler_worker: add --skip-spn flag | Bryan Newbold | 2022-02-08 | 1 | -2/+7 |
| yet another bad PDF sha1 | Bryan Newbold | 2022-02-08 | 1 | -0/+1 |
| pipenv: update lock file | Bryan Newbold | 2022-02-03 | 1 | -592/+614 |
| pipenv: black (code style tool) has a stable release | Bryan Newbold | 2022-02-03 | 1 | -4/+1 |
| sandcrawler: additional extracts, mostly OJS | Bryan Newbold | 2022-01-13 | 1 | -1/+23 |
| filesets: more figshare URL patterns | Bryan Newbold | 2022-01-13 | 1 | -0/+13 |
| fileset ingest: better verification of resources | Bryan Newbold | 2022-01-13 | 1 | -7/+23 |
| ingest: PDF pattern for integrityresjournals.org | Bryan Newbold | 2022-01-13 | 1 | -0/+8 |
| null-body -> empty-blob | Bryan Newbold | 2022-01-13 | 3 | -4/+8 |
| spn: handle blocked-url (etc) better | Bryan Newbold | 2022-01-11 | 1 | -0/+10 |
| filesets: handle weird figshare link-only case better | Bryan Newbold | 2021-12-16 | 1 | -1/+4 |
| lint ('not in') | Bryan Newbold | 2021-12-15 | 1 | -2/+2 |
| lint: ignore unused 'sentry_client' | Bryan Newbold | 2021-12-15 | 1 | -1/+1 |
| fix type with --enable-sentry | Bryan Newbold | 2021-12-15 | 1 | -1/+1 |
| ingest tool: allow enabling sentry (for exception debugging) | Bryan Newbold | 2021-12-15 | 1 | -0/+13 |
| more fileset ingest tweaks | Bryan Newbold | 2021-12-15 | 2 | -0/+7 |
| fileset ingest: more requests timeouts, sessions | Bryan Newbold | 2021-12-15 | 3 | -37/+68 |
| fileset ingest: create tmp subdirectories if needed | Bryan Newbold | 2021-12-15 | 1 | -0/+5 |
| fileset ingest: configure IA session from env | Bryan Newbold | 2021-12-15 | 1 | -1/+6 |
| pipenv: add pymupdf; update trafilatura | Bryan Newbold | 2021-12-15 | 2 | -420/+644 |
| fileset ingest: actually use spn2 CLI flag | Bryan Newbold | 2021-12-11 | 2 | -3/+4 |
| grobid: set a maximum file size (256 MByte) | Bryan Newbold | 2021-12-07 | 1 | -0/+8 |
| worker: add kafka_group_suffix option | Bryan Newbold | 2021-12-07 | 1 | -3/+19 |
| ingest tool: allow configuration of GROBID endpoint | Bryan Newbold | 2021-12-07 | 1 | -0/+7 |