Commit message | Author | Age | Files | Lines
* partial notes on .ua urgent crawling | Bryan Newbold | 2022-03-11 | 1 | -0/+196
* 2022 patch crawl bulk ingest notes | Bryan Newbold | 2022-03-02 | 1 | -0/+106
* update old OAI-PMH patch crawl notes | Bryan Newbold | 2022-02-28 | 1 | -1/+36
* more sentry config changes | Bryan Newbold | 2022-02-25 | 5 | -5/+5
* small lint/typo/fmt fixes | Bryan Newbold | 2022-02-24 | 3 | -5/+5
* switch from 'raven' to 'sentry-sdk' | Bryan Newbold | 2022-02-24 | 5 | -37/+41
* another bad PDF sha1 | Bryan Newbold | 2022-02-23 | 1 | -0/+1
* ingest: fix mistakenly commented except block (?) | Bryan Newbold | 2022-02-18 | 1 | -4/+3
* ingest: handle more fileset failure modes | Bryan Newbold | 2022-02-18 | 2 | -3/+30
* sandcrawler_worker: add --skip-spn flag | Bryan Newbold | 2022-02-08 | 1 | -2/+7
* yet another bad PDF sha1 | Bryan Newbold | 2022-02-08 | 1 | -0/+1
* more patch crawling | Bryan Newbold | 2022-02-08 | 2 | -9/+209
* OAI-PMH patch crawl more updates | Bryan Newbold | 2022-02-08 | 1 | -2/+71
* sql: script to reingest recent spn2 lookup failure in bulk mode | Bryan Newbold | 2022-02-08 | 5 | -18/+71
* pipenv: update lock file | Bryan Newbold | 2022-02-03 | 1 | -592/+614
* pipenv: black (code style tool) has a stable release | Bryan Newbold | 2022-02-03 | 1 | -4/+1
* 'trawling' proposal (in progress) | Bryan Newbold | 2022-01-27 | 1 | -0/+177
* ingest notes: various in-progress projects | Bryan Newbold | 2022-01-27 | 4 | -3/+800
* sandcrawler: additional extracts, mostly OJS | Bryan Newbold | 2022-01-13 | 1 | -1/+23
* filesets: more figshare URL patterns | Bryan Newbold | 2022-01-13 | 1 | -0/+13
* fileset ingest: better verification of resources | Bryan Newbold | 2022-01-13 | 1 | -7/+23
* ingest: PDF pattern for integrityresjournals.org | Bryan Newbold | 2022-01-13 | 1 | -0/+8
* null-body -> empty-blob | Bryan Newbold | 2022-01-13 | 3 | -4/+8
* spn: handle blocked-url (etc) better | Bryan Newbold | 2022-01-11 | 1 | -0/+10
* enqueue PLATFORM PDFs for crawl | Bryan Newbold | 2022-01-07 | 1 | -0/+23
* document progress on re-GROBID-ing | Bryan Newbold | 2022-01-05 | 1 | -0/+89
* filesets: handle weird figshare link-only case better | Bryan Newbold | 2021-12-16 | 1 | -1/+4
* lint ('not in') | Bryan Newbold | 2021-12-15 | 1 | -2/+2
* lint: ignore unused 'sentry_client' | Bryan Newbold | 2021-12-15 | 1 | -1/+1
* fix type with --enable-sentry | Bryan Newbold | 2021-12-15 | 1 | -1/+1
* ingest tool: allow enabling sentry (for exception debugging) | Bryan Newbold | 2021-12-15 | 1 | -0/+13
* more fileset ingest tweaks | Bryan Newbold | 2021-12-15 | 2 | -0/+7
* fileset ingest: more requests timeouts, sessions | Bryan Newbold | 2021-12-15 | 3 | -37/+68
* fileset ingest: create tmp subdirectories if needed | Bryan Newbold | 2021-12-15 | 1 | -0/+5
* fileset ingest: configure IA session from env | Bryan Newbold | 2021-12-15 | 1 | -1/+6
    Note that this doesn't currently work for `upload()`, and as a work-around I created `~/.config/ia.ini` manually on the worker VM.
* pipenv: add pymupdf; update trafilatura | Bryan Newbold | 2021-12-15 | 2 | -420/+644
* fileset ingest: actually use spn2 CLI flag | Bryan Newbold | 2021-12-11 | 2 | -3/+4
* notes on re-GROBID-ing (and re-extracting) some files [trawler] | Bryan Newbold | 2021-12-09 | 1 | -0/+289
* grobid: set a maximum file size (256 MByte) | Bryan Newbold | 2021-12-07 | 1 | -0/+8
* worker: add kafka_group_suffix option | Bryan Newbold | 2021-12-07 | 1 | -3/+19
* ingest tool: allow configuration of GROBID endpoint | Bryan Newbold | 2021-12-07 | 1 | -0/+7
* 2021-12-02 database table size stats | Bryan Newbold | 2021-12-07 | 1 | -0/+22
* sandcrawler SQL dump and upload updates | Bryan Newbold | 2021-12-07 | 1 | -4/+12
* update fatcat_file SQL table schema, and add backfill notes | Bryan Newbold | 2021-12-07 | 1 | -1/+3
* update fatcat_file SQL table schema, and add backfill notes | Bryan Newbold | 2021-12-01 | 1 | -0/+13
* commit old patch crawl notes | Bryan Newbold | 2021-12-01 | 1 | -0/+488
* Revert "pipenv: update deps" | Bryan Newbold | 2021-12-01 | 2 | -574/+382
    This reverts commit 7a5b203dbb37958a452eb1be3bd1bf8ed94cbbce. There is a problem with `internetarchive` 2.2.0, so reverting for now.
* pipenv: update deps | Bryan Newbold | 2021-12-01 | 2 | -382/+574
* add CDX sha1hex lookup/fetch helper script | Bryan Newbold | 2021-11-30 | 1 | -0/+170
* sandcrawler SQL stats | Bryan Newbold | 2021-11-27 | 2 | -12/+425