Commit message | Author | Age | Files | Lines | ||
---|---|---|---|---|---|---|
... | ||||||
* | switch default kafka-broker host from wbgrp-svc263 to wbgrp-svc350 | Bryan Newbold | 2022-05-03 | 9 | -14/+14 | |
| | ||||||
* | April 2022 sandcrawler DB stats | Bryan Newbold | 2022-04-27 | 1 | -0/+432 | |
| | ||||||
* | more dataset crawl notes | Bryan Newbold | 2022-04-26 | 1 | -0/+53 | |
| | ||||||
* | .ua crawling follow-up stats | Bryan Newbold | 2022-04-26 | 1 | -2/+2 | |
| | ||||||
* | update HBase Thrift gateway host | Bryan Newbold | 2022-04-26 | 1 | -1/+1 | |
| | ||||||
* | SPNv2: several fixes for prod throughput | Bryan Newbold | 2022-04-26 | 1 | -11/+34 | |
| | Most importantly, for some API flags, if the value is not true-thy, do not set the flag at all. Setting any flag was resulting in screenshots and outlinks actually getting created/captured, which was a huge slowdown. Also, check per-user SPNv2 slots available, using API, before requesting an actual capture. | |||||
* | make fmt | Bryan Newbold | 2022-04-26 | 1 | -2/+5 | |
| | ||||||
* | ingest_tool: spn-status command to check user's quota | Bryan Newbold | 2022-04-26 | 1 | -0/+19 | |
| | ||||||
* | flake8: allow 'Any' types | Bryan Newbold | 2022-04-26 | 1 | -1/+2 | |
| | ||||||
* | start notes on unpaywall and targeted/patch crawls | Bryan Newbold | 2022-04-20 | 2 | -0/+277 | |
| | ||||||
* | block isiarticles.com from future PDF crawls | Bryan Newbold | 2022-04-20 | 1 | -0/+2 | |
| | ||||||
* | pipenv: update; newer devpi hostname | Bryan Newbold | 2022-04-06 | 2 | -781/+850 | |
| | ||||||
* | ingest: drive.google.com ingest support | Bryan Newbold | 2022-04-04 | 1 | -0/+8 | |
| | ||||||
* | .ua ingest notes | Bryan Newbold | 2022-04-04 | 1 | -0/+29 | |
| | ||||||
* | sql: add source/created index on ingest_request table | Bryan Newbold | 2022-04-04 | 1 | -0/+1 | |
| | ||||||
* | sql: fix reingest query missing type on LEFT JOIN; wrap in read-only transaction | Bryan Newbold | 2022-04-04 | 5 | -5/+27 | |
| | ||||||
* | filesets: fix archive.org path naming | Bryan Newbold | 2022-03-29 | 1 | -7/+8 | |
| | ||||||
* | bugfix: sha1/md5 typo | Bryan Newbold | 2022-03-23 | 1 | -1/+1 | |
| | Caught this prepping to ingest in to fatcat. Derp! | |||||
* | various ingest/task notes | Bryan Newbold | 2022-03-22 | 4 | -5/+97 | |
| | ||||||
* | file ingest: don't 'backoff' on spn2 backoff error | Bryan Newbold | 2022-03-22 | 2 | -0/+8 | |
| | The intent of this is to try and get through the daily ingest requests faster, so we can loop and retry if needed. A 200 second delay, usually resulting in a kafka topic reshuffle, really slows things down. This will presumably result in a bunch of spn2-backoff status requests, but we can just retry those. | |||||
* | DOAJ ingest/crawl notes | Bryan Newbold | 2022-03-11 | 1 | -0/+266 | |
| | ||||||
* | partial notes on .ua urgent crawling | Bryan Newbold | 2022-03-11 | 1 | -0/+196 | |
| | ||||||
* | 2022 patch crawl bulk ingest notes | Bryan Newbold | 2022-03-02 | 1 | -0/+106 | |
| | ||||||
* | update old OAI-PMH patch crawl notes | Bryan Newbold | 2022-02-28 | 1 | -1/+36 | |
| | ||||||
* | more sentry config changes | Bryan Newbold | 2022-02-25 | 5 | -5/+5 | |
| | ||||||
* | small lint/typo/fmt fixes | Bryan Newbold | 2022-02-24 | 3 | -5/+5 | |
| | ||||||
* | switch from 'raven' to 'sentry-sdk' | Bryan Newbold | 2022-02-24 | 5 | -37/+41 | |
| | ||||||
* | another bad PDF sha1 | Bryan Newbold | 2022-02-23 | 1 | -0/+1 | |
| | ||||||
* | ingest: fix mistakenly commented except block (?) | Bryan Newbold | 2022-02-18 | 1 | -4/+3 | |
| | ||||||
* | ingest: handle more fileset failure modes | Bryan Newbold | 2022-02-18 | 2 | -3/+30 | |
| | ||||||
* | sandcrawler_worker: add --skip-spn flag | Bryan Newbold | 2022-02-08 | 1 | -2/+7 | |
| | ||||||
* | yet another bad PDF sha1 | Bryan Newbold | 2022-02-08 | 1 | -0/+1 | |
| | ||||||
* | more patch crawling | Bryan Newbold | 2022-02-08 | 2 | -9/+209 | |
| | ||||||
* | OAI-PMH patch crawl more updates | Bryan Newbold | 2022-02-08 | 1 | -2/+71 | |
| | ||||||
* | sql: script to reingest recent spn2 lookup failure in bulk mode | Bryan Newbold | 2022-02-08 | 5 | -18/+71 | |
| | ||||||
* | pipenv: update lock file | Bryan Newbold | 2022-02-03 | 1 | -592/+614 | |
| | ||||||
* | pipenv: black (code style tool) has a stable release | Bryan Newbold | 2022-02-03 | 1 | -4/+1 | |
| | ||||||
* | 'trawling' proposal (in progress) | Bryan Newbold | 2022-01-27 | 1 | -0/+177 | |
| | ||||||
* | ingest notes: various in-progress projects | Bryan Newbold | 2022-01-27 | 4 | -3/+800 | |
| | ||||||
* | sandcrawler: additional extracts, mostly OJS | Bryan Newbold | 2022-01-13 | 1 | -1/+23 | |
| | ||||||
* | filesets: more figshare URL patterns | Bryan Newbold | 2022-01-13 | 1 | -0/+13 | |
| | ||||||
* | fileset ingest: better verification of resources | Bryan Newbold | 2022-01-13 | 1 | -7/+23 | |
| | ||||||
* | ingest: PDF pattern for integrityresjournals.org | Bryan Newbold | 2022-01-13 | 1 | -0/+8 | |
| | ||||||
* | null-body -> empty-blob | Bryan Newbold | 2022-01-13 | 3 | -4/+8 | |
| | ||||||
* | spn: handle blocked-url (etc) better | Bryan Newbold | 2022-01-11 | 1 | -0/+10 | |
| | ||||||
* | enqueue PLATFORM PDFs for crawl | Bryan Newbold | 2022-01-07 | 1 | -0/+23 | |
| | ||||||
* | document progress on re-GROBID-ing | Bryan Newbold | 2022-01-05 | 1 | -0/+89 | |
| | ||||||
* | filesets: handle weird figshare link-only case better | Bryan Newbold | 2021-12-16 | 1 | -1/+4 | |
| | ||||||
* | lint ('not in') | Bryan Newbold | 2021-12-15 | 1 | -2/+2 | |
| | ||||||
* | lint: ignore unused 'sentry_client' | Bryan Newbold | 2021-12-15 | 1 | -1/+1 | |
| |