Commit message | Author | Age | Files | Lines
* some weekly crawl numbers (not very helpful) | Bryan Newbold | 2022-05-03 | 1 | -0/+191
|
* switch default kafka-broker host from wbgrp-svc263 to wbgrp-svc350 | Bryan Newbold | 2022-05-03 | 9 | -14/+14
|
* April 2022 sandcrawler DB stats | Bryan Newbold | 2022-04-27 | 1 | -0/+432
|
* more dataset crawl notes | Bryan Newbold | 2022-04-26 | 1 | -0/+53
|
* .ua crawling follow-up stats | Bryan Newbold | 2022-04-26 | 1 | -2/+2
|
* update HBase Thrift gateway host | Bryan Newbold | 2022-04-26 | 1 | -1/+1
|
* SPNv2: several fixes for prod throughput | Bryan Newbold | 2022-04-26 | 1 | -11/+34
|   Most importantly, for some API flags, if the value is not truthy, do not set the flag at all. Setting any flag was resulting in screenshots and outlinks actually getting created/captured, which was a huge slowdown. Also, check per-user SPNv2 slots available, using the API, before requesting an actual capture.
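The flag handling described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the function names `build_capture_params` and `can_request_capture`, the flag names used in the usage note, and the `"1"` encoding are all assumptions made for the example.

```python
from typing import Any, Dict


def build_capture_params(url: str, **flags: Any) -> Dict[str, str]:
    """Build request parameters, adding a flag only when its value is truthy.

    Hypothetical helper: falsy flags are left out entirely instead of being
    sent as "0", since (per the commit message) setting a flag at all was
    enough to trigger the expensive behavior.
    """
    params: Dict[str, str] = {"url": url}
    for name, value in flags.items():
        if value:
            params[name] = "1"
    return params


def can_request_capture(available_slots: int) -> bool:
    """Request a capture only when the user has at least one free slot.

    Hypothetical helper: the slot count would come from a status call to
    the API, which is not modeled here.
    """
    return available_slots > 0
```

For example, `build_capture_params("https://example.com/paper.pdf", capture_outlinks=0, force_get=1)` yields a dict containing only `url` and `force_get`; the falsy `capture_outlinks` key is omitted rather than sent with a zero value.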
* make fmt | Bryan Newbold | 2022-04-26 | 1 | -2/+5
|
* ingest_tool: spn-status command to check user's quota | Bryan Newbold | 2022-04-26 | 1 | -0/+19
|
* flake8: allow 'Any' types | Bryan Newbold | 2022-04-26 | 1 | -1/+2
|
* start notes on unpaywall and targeted/patch crawls | Bryan Newbold | 2022-04-20 | 2 | -0/+277
|
* block isiarticles.com from future PDF crawls | Bryan Newbold | 2022-04-20 | 1 | -0/+2
|
* pipenv: update; newer devpi hostname | Bryan Newbold | 2022-04-06 | 2 | -781/+850
|
* ingest: drive.google.com ingest support | Bryan Newbold | 2022-04-04 | 1 | -0/+8
|
* .ua ingest notes | Bryan Newbold | 2022-04-04 | 1 | -0/+29
|
* sql: add source/created index on ingest_request table | Bryan Newbold | 2022-04-04 | 1 | -0/+1
|
* sql: fix reingest query missing type on LEFT JOIN; wrap in read-only transaction | Bryan Newbold | 2022-04-04 | 5 | -5/+27
|
* filesets: fix archive.org path naming | Bryan Newbold | 2022-03-29 | 1 | -7/+8
|
* bugfix: sha1/md5 typo | Bryan Newbold | 2022-03-23 | 1 | -1/+1
|   Caught this prepping to ingest into fatcat. Derp!
* various ingest/task notes | Bryan Newbold | 2022-03-22 | 4 | -5/+97
|
* file ingest: don't 'backoff' on spn2 backoff error | Bryan Newbold | 2022-03-22 | 2 | -0/+8
|   The intent of this is to try and get through the daily ingest requests faster, so we can loop and retry if needed. A 200 second delay, usually resulting in a kafka topic reshuffle, really slows things down. This will presumably result in a bunch of spn2-backoff status requests, but we can just retry those.
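The behavior change described in this commit message might look something like the sketch below. It is an assumption-laden illustration rather than the worker's actual code: `handle_request`, `needs_retry`, and the `sleep_on_backoff` switch are hypothetical names, and only the `spn2-backoff` status string and the 200 second delay come from the message itself.

```python
import time
from typing import Any, Callable, Dict

BACKOFF_STATUS = "spn2-backoff"  # status name taken from the commit message


def handle_request(
    request: Dict[str, Any],
    capture: Callable[[Dict[str, Any]], Dict[str, Any]],
    sleep_on_backoff: bool = False,
    delay_sec: float = 200.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Run one ingest request and return its result.

    With sleep_on_backoff=False (the behavior the commit describes), a
    backoff result is returned immediately so the worker keeps consuming
    instead of stalling for delay_sec seconds.
    """
    result = capture(request)
    if result.get("status") == BACKOFF_STATUS and sleep_on_backoff:
        sleep(delay_sec)
    return result


def needs_retry(result: Dict[str, Any]) -> bool:
    """Mark results that should be re-queued in a later pass."""
    return result.get("status") == BACKOFF_STATUS
```

The trade-off is the one the message names: more results carry the `spn2-backoff` status, but they can be collected with something like `needs_retry` and re-submitted later, which avoids a long pause inside the consumer loop.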
* DOAJ ingest/crawl notes | Bryan Newbold | 2022-03-11 | 1 | -0/+266
|
* partial notes on .ua urgent crawling | Bryan Newbold | 2022-03-11 | 1 | -0/+196
|
* 2022 patch crawl bulk ingest notes | Bryan Newbold | 2022-03-02 | 1 | -0/+106
|
* update old OAI-PMH patch crawl notes | Bryan Newbold | 2022-02-28 | 1 | -1/+36
|
* more sentry config changes | Bryan Newbold | 2022-02-25 | 5 | -5/+5
|
* small lint/typo/fmt fixes | Bryan Newbold | 2022-02-24 | 3 | -5/+5
|
* switch from 'raven' to 'sentry-sdk' | Bryan Newbold | 2022-02-24 | 5 | -37/+41
|
* another bad PDF sha1 | Bryan Newbold | 2022-02-23 | 1 | -0/+1
|
* ingest: fix mistakenly commented except block (?) | Bryan Newbold | 2022-02-18 | 1 | -4/+3
|
* ingest: handle more fileset failure modes | Bryan Newbold | 2022-02-18 | 2 | -3/+30
|
* sandcrawler_worker: add --skip-spn flag | Bryan Newbold | 2022-02-08 | 1 | -2/+7
|
* yet another bad PDF sha1 | Bryan Newbold | 2022-02-08 | 1 | -0/+1
|
* more patch crawling | Bryan Newbold | 2022-02-08 | 2 | -9/+209
|
* OAI-PMH patch crawl more updates | Bryan Newbold | 2022-02-08 | 1 | -2/+71
|
* sql: script to reingest recent spn2 lookup failure in bulk mode | Bryan Newbold | 2022-02-08 | 5 | -18/+71
|
* pipenv: update lock file | Bryan Newbold | 2022-02-03 | 1 | -592/+614
|
* pipenv: black (code style tool) has a stable release | Bryan Newbold | 2022-02-03 | 1 | -4/+1
|
* 'trawling' proposal (in progress) | Bryan Newbold | 2022-01-27 | 1 | -0/+177
|
* ingest notes: various in-progress projects | Bryan Newbold | 2022-01-27 | 4 | -3/+800
|
* sandcrawler: additional extracts, mostly OJS | Bryan Newbold | 2022-01-13 | 1 | -1/+23
|
* filesets: more figshare URL patterns | Bryan Newbold | 2022-01-13 | 1 | -0/+13
|
* fileset ingest: better verification of resources | Bryan Newbold | 2022-01-13 | 1 | -7/+23
|
* ingest: PDF pattern for integrityresjournals.org | Bryan Newbold | 2022-01-13 | 1 | -0/+8
|
* null-body -> empty-blob | Bryan Newbold | 2022-01-13 | 3 | -4/+8
|
* spn: handle blocked-url (etc) better | Bryan Newbold | 2022-01-11 | 1 | -0/+10
|
* enqueue PLATFORM PDFs for crawl | Bryan Newbold | 2022-01-07 | 1 | -0/+23
|
* document progress on re-GROBID-ing | Bryan Newbold | 2022-01-05 | 1 | -0/+89
|
* filesets: handle weird figshare link-only case better | Bryan Newbold | 2021-12-16 | 1 | -1/+4
|
* lint ('not in') | Bryan Newbold | 2021-12-15 | 1 | -2/+2
|