Commit message | Author | Age | Files | Lines | ||
---|---|---|---|---|---|---|
* | 2022 patch crawl bulk ingest notes | Bryan Newbold | 2022-03-02 | 1 | -0/+106 | |
| | ||||||
* | update old OAI-PMH patch crawl notes | Bryan Newbold | 2022-02-28 | 1 | -1/+36 | |
| | ||||||
* | more sentry config changes | Bryan Newbold | 2022-02-25 | 5 | -5/+5 | |
| | ||||||
* | small lint/typo/fmt fixes | Bryan Newbold | 2022-02-24 | 3 | -5/+5 | |
| | ||||||
* | switch from 'raven' to 'sentry-sdk' | Bryan Newbold | 2022-02-24 | 5 | -37/+41 | |
| | ||||||
* | another bad PDF sha1 | Bryan Newbold | 2022-02-23 | 1 | -0/+1 | |
| | ||||||
* | ingest: fix mistakenly commented except block (?) | Bryan Newbold | 2022-02-18 | 1 | -4/+3 | |
| | ||||||
* | ingest: handle more fileset failure modes | Bryan Newbold | 2022-02-18 | 2 | -3/+30 | |
| | ||||||
* | sandcrawler_worker: add --skip-spn flag | Bryan Newbold | 2022-02-08 | 1 | -2/+7 | |
| | ||||||
* | yet another bad PDF sha1 | Bryan Newbold | 2022-02-08 | 1 | -0/+1 | |
| | ||||||
* | more patch crawling | Bryan Newbold | 2022-02-08 | 2 | -9/+209 | |
| | ||||||
* | OAI-PMH patch crawl more updates | Bryan Newbold | 2022-02-08 | 1 | -2/+71 | |
| | ||||||
* | sql: script to reingest recent spn2 lookup failure in bulk mode | Bryan Newbold | 2022-02-08 | 5 | -18/+71 | |
| | ||||||
* | pipenv: update lock file | Bryan Newbold | 2022-02-03 | 1 | -592/+614 | |
| | ||||||
* | pipenv: black (code style tool) has a stable release | Bryan Newbold | 2022-02-03 | 1 | -4/+1 | |
| | ||||||
* | 'trawling' proposal (in progress) | Bryan Newbold | 2022-01-27 | 1 | -0/+177 | |
| | ||||||
* | ingest notes: various in-progress projects | Bryan Newbold | 2022-01-27 | 4 | -3/+800 | |
| | ||||||
* | sandcrawler: additional extracts, mostly OJS | Bryan Newbold | 2022-01-13 | 1 | -1/+23 | |
| | ||||||
* | filesets: more figshare URL patterns | Bryan Newbold | 2022-01-13 | 1 | -0/+13 | |
| | ||||||
* | fileset ingest: better verification of resources | Bryan Newbold | 2022-01-13 | 1 | -7/+23 | |
| | ||||||
* | ingest: PDF pattern for integrityresjournals.org | Bryan Newbold | 2022-01-13 | 1 | -0/+8 | |
| | ||||||
* | null-body -> empty-blob | Bryan Newbold | 2022-01-13 | 3 | -4/+8 | |
| | ||||||
* | spn: handle blocked-url (etc) better | Bryan Newbold | 2022-01-11 | 1 | -0/+10 | |
| | ||||||
* | enqueue PLATFORM PDFs for crawl | Bryan Newbold | 2022-01-07 | 1 | -0/+23 | |
| | ||||||
* | document progress on re-GROBID-ing | Bryan Newbold | 2022-01-05 | 1 | -0/+89 | |
| | ||||||
* | filesets: handle weird figshare link-only case better | Bryan Newbold | 2021-12-16 | 1 | -1/+4 | |
| | ||||||
* | lint ('not in') | Bryan Newbold | 2021-12-15 | 1 | -2/+2 | |
| | ||||||
* | lint: ignore unused 'sentry_client' | Bryan Newbold | 2021-12-15 | 1 | -1/+1 | |
| | ||||||
* | fix type with --enable-sentry | Bryan Newbold | 2021-12-15 | 1 | -1/+1 | |
| | ||||||
* | ingest tool: allow enabling sentry (for exception debugging) | Bryan Newbold | 2021-12-15 | 1 | -0/+13 | |
| | ||||||
* | more fileset ingest tweaks | Bryan Newbold | 2021-12-15 | 2 | -0/+7 | |
| | ||||||
* | fileset ingest: more requests timeouts, sessions | Bryan Newbold | 2021-12-15 | 3 | -37/+68 | |
| | ||||||
* | fileset ingest: create tmp subdirectories if needed | Bryan Newbold | 2021-12-15 | 1 | -0/+5 | |
| | ||||||
* | fileset ingest: configure IA session from env | Bryan Newbold | 2021-12-15 | 1 | -1/+6 | |
| | | | | | Note that this doesn't currently work for `upload()`, and as a work-around I created `~/.config/ia.ini` manually on the worker VM. | |||||
* | pipenv: add pymupdf; update trafilatura | Bryan Newbold | 2021-12-15 | 2 | -420/+644 | |
| | ||||||
* | fileset ingest: actually use spn2 CLI flag | Bryan Newbold | 2021-12-11 | 2 | -3/+4 | |
| | ||||||
* | notes on re-GROBID-ing (and re-extracting) some files | Bryan Newbold | 2021-12-09 | 1 | -0/+289 | |
| | ||||||
* | grobid: set a maximum file size (256 MByte) | Bryan Newbold | 2021-12-07 | 1 | -0/+8 | |
| | ||||||
* | worker: add kafka_group_suffix option | Bryan Newbold | 2021-12-07 | 1 | -3/+19 | |
| | ||||||
* | ingest tool: allow configuration of GROBID endpoint | Bryan Newbold | 2021-12-07 | 1 | -0/+7 | |
| | ||||||
* | 2021-12-02 database table size stats | Bryan Newbold | 2021-12-07 | 1 | -0/+22 | |
| | ||||||
* | sandcrawler SQL dump and upload updates | Bryan Newbold | 2021-12-07 | 1 | -4/+12 | |
| | ||||||
* | update fatcat_file SQL table schema, and add backfill notes | Bryan Newbold | 2021-12-07 | 1 | -1/+3 | |
| | ||||||
* | update fatcat_file SQL table schema, and add backfill notes | Bryan Newbold | 2021-12-01 | 1 | -0/+13 | |
| | ||||||
* | commit old patch crawl notes | Bryan Newbold | 2021-12-01 | 1 | -0/+488 | |
| | ||||||
* | Revert "pipenv: update deps" | Bryan Newbold | 2021-12-01 | 2 | -574/+382 | |
| | | | | | | This reverts commit 7a5b203dbb37958a452eb1be3bd1bf8ed94cbbce. There is a problem with `internetarchive` 2.2.0, so reverting for now. | |||||
* | pipenv: update deps | Bryan Newbold | 2021-12-01 | 2 | -382/+574 | |
| | ||||||
* | add CDX sha1hex lookup/fetch helper script | Bryan Newbold | 2021-11-30 | 1 | -0/+170 | |
| | ||||||
* | sandcrawler SQL stats | Bryan Newbold | 2021-11-27 | 2 | -12/+425 | |
| | ||||||
* | codespell typos in README and original RFC | Bryan Newbold | 2021-11-24 | 2 | -2/+2 | |
| |