path: root/python
Commit message | Author | Age | Files | Lines
* block isiarticles.com from future PDF crawls | Bryan Newbold | 2022-04-20 | 1 | -0/+2
|
* pipenv: update; newer devpi hostname | Bryan Newbold | 2022-04-06 | 2 | -781/+850
|
* ingest: drive.google.com ingest support | Bryan Newbold | 2022-04-04 | 1 | -0/+8
|
* filesets: fix archive.org path naming | Bryan Newbold | 2022-03-29 | 1 | -7/+8
|
* bugfix: sha1/md5 typo | Bryan Newbold | 2022-03-23 | 1 | -1/+1
|   Caught this prepping to ingest in to fatcat. Derp!
* file ingest: don't 'backoff' on spn2 backoff error | Bryan Newbold | 2022-03-22 | 2 | -0/+8
|   The intent of this is to try and get through the daily ingest requests faster, so we can loop and retry if needed. A 200 second delay, usually resulting in a kafka topic reshuffle, really slows things down. This will presumably result in a bunch of spn2-backoff status requests, but we can just retry those.
* more sentry config changes | Bryan Newbold | 2022-02-25 | 5 | -5/+5
|
* small lint/typo/fmt fixes | Bryan Newbold | 2022-02-24 | 3 | -5/+5
|
* switch from 'raven' to 'sentry-sdk' | Bryan Newbold | 2022-02-24 | 5 | -37/+41
|
* another bad PDF sha1 | Bryan Newbold | 2022-02-23 | 1 | -0/+1
|
* ingest: fix mistakenly commented except block (?) | Bryan Newbold | 2022-02-18 | 1 | -4/+3
|
* ingest: handle more fileset failure modes | Bryan Newbold | 2022-02-18 | 2 | -3/+30
|
* sandcrawler_worker: add --skip-spn flag | Bryan Newbold | 2022-02-08 | 1 | -2/+7
|
* yet another bad PDF sha1 | Bryan Newbold | 2022-02-08 | 1 | -0/+1
|
* pipenv: update lock file | Bryan Newbold | 2022-02-03 | 1 | -592/+614
|
* pipenv: black (code style tool) has a stable release | Bryan Newbold | 2022-02-03 | 1 | -4/+1
|
* sandcrawler: additional extracts, mostly OJS | Bryan Newbold | 2022-01-13 | 1 | -1/+23
|
* filesets: more figshare URL patterns | Bryan Newbold | 2022-01-13 | 1 | -0/+13
|
* fileset ingest: better verification of resources | Bryan Newbold | 2022-01-13 | 1 | -7/+23
|
* ingest: PDF pattern for integrityresjournals.org | Bryan Newbold | 2022-01-13 | 1 | -0/+8
|
* null-body -> empty-blob | Bryan Newbold | 2022-01-13 | 3 | -4/+8
|
* spn: handle blocked-url (etc) better | Bryan Newbold | 2022-01-11 | 1 | -0/+10
|
* filesets: handle weird figshare link-only case better | Bryan Newbold | 2021-12-16 | 1 | -1/+4
|
* lint ('not in') | Bryan Newbold | 2021-12-15 | 1 | -2/+2
|
* lint: ignore unused 'sentry_client' | Bryan Newbold | 2021-12-15 | 1 | -1/+1
|
* fix type with --enable-sentry | Bryan Newbold | 2021-12-15 | 1 | -1/+1
|
* ingest tool: allow enabling sentry (for exception debugging) | Bryan Newbold | 2021-12-15 | 1 | -0/+13
|
* more fileset ingest tweaks | Bryan Newbold | 2021-12-15 | 2 | -0/+7
|
* fileset ingest: more requests timeouts, sessions | Bryan Newbold | 2021-12-15 | 3 | -37/+68
|
* fileset ingest: create tmp subdirectories if needed | Bryan Newbold | 2021-12-15 | 1 | -0/+5
|
* fileset ingest: configure IA session from env | Bryan Newbold | 2021-12-15 | 1 | -1/+6
|   Note that this doesn't currently work for `upload()`, and as a work-around I created `~/.config/ia.ini` manually on the worker VM.
* pipenv: add pymupdf; update trafilatura | Bryan Newbold | 2021-12-15 | 2 | -420/+644
|
* fileset ingest: actually use spn2 CLI flag | Bryan Newbold | 2021-12-11 | 2 | -3/+4
|
* grobid: set a maximum file size (256 MByte) | Bryan Newbold | 2021-12-07 | 1 | -0/+8
|
* worker: add kafka_group_suffix option | Bryan Newbold | 2021-12-07 | 1 | -3/+19
|
* ingest tool: allow configuration of GROBID endpoint | Bryan Newbold | 2021-12-07 | 1 | -0/+7
|
* Revert "pipenv: update deps" | Bryan Newbold | 2021-12-01 | 2 | -574/+382
|   This reverts commit 7a5b203dbb37958a452eb1be3bd1bf8ed94cbbce. There is a problem with `internetarchive` 2.2.0, so reverting for now.
* pipenv: update deps | Bryan Newbold | 2021-12-01 | 2 | -382/+574
|
* add CDX sha1hex lookup/fetch helper script | Bryan Newbold | 2021-11-30 | 1 | -0/+170
|
* codespell typos in python (comments) | Bryan Newbold | 2021-11-24 | 5 | -5/+5
|
* html_meta: actual typo in code (CSS selector) caught by codespell | Bryan Newbold | 2021-11-24 | 1 | -1/+1
|
* ingest tool: new backfill mode | Bryan Newbold | 2021-11-16 | 1 | -0/+76
|
* make fmt | Bryan Newbold | 2021-11-16 | 1 | -1/+1
|
* SPNv2: make 'resources' optional | Bryan Newbold | 2021-11-16 | 1 | -1/+1
|   This was always present previously. A change was made to SPNv2 API recently that borked it a bit, though in theory should be present on new captures. I'm not seeing it for some captures, so pushing this work around. It seems like we don't actually use this field anyways, at least for ingest pipeline.
* grobid: handle XML parsing errors, and have them recorded in sandcrawler-db | Bryan Newbold | 2021-11-12 | 1 | -1/+5
|
* ingest_file: more efficient GROBID metadata copy | Bryan Newbold | 2021-11-12 | 1 | -3/+3
|
* grobid_tool: helper to process a single file | Bryan Newbold | 2021-11-10 | 1 | -0/+15
|
* ingest: start re-processing GROBID with newer version | Bryan Newbold | 2021-11-10 | 1 | -2/+6
|
* simple persist worker/tool to backfill grobid_refs | Bryan Newbold | 2021-11-10 | 2 | -0/+62
|
* grobid: extract more metadata in document TEI-XML | Bryan Newbold | 2021-11-10 | 1 | -0/+5
|