Commit message | Author | Date | Files | Lines
* ingest: PDF pattern for integrityresjournals.org | Bryan Newbold | 2022-01-13 | 1 | -0/+8
* null-body -> empty-blob | Bryan Newbold | 2022-01-13 | 3 | -4/+8
* spn: handle blocked-url (etc) better | Bryan Newbold | 2022-01-11 | 1 | -0/+10
* enqueue PLATFORM PDFs for crawl | Bryan Newbold | 2022-01-07 | 1 | -0/+23
* document progress on re-GROBID-ing | Bryan Newbold | 2022-01-05 | 1 | -0/+89
* filesets: handle weird figshare link-only case better | Bryan Newbold | 2021-12-16 | 1 | -1/+4
* lint ('not in') | Bryan Newbold | 2021-12-15 | 1 | -2/+2
* lint: ignore unused 'sentry_client' | Bryan Newbold | 2021-12-15 | 1 | -1/+1
* fix type with --enable-sentry | Bryan Newbold | 2021-12-15 | 1 | -1/+1
* ingest tool: allow enabling sentry (for exception debugging) | Bryan Newbold | 2021-12-15 | 1 | -0/+13
* more fileset ingest tweaks | Bryan Newbold | 2021-12-15 | 2 | -0/+7
* fileset ingest: more requests timeouts, sessions | Bryan Newbold | 2021-12-15 | 3 | -37/+68
* fileset ingest: create tmp subdirectories if needed | Bryan Newbold | 2021-12-15 | 1 | -0/+5
* fileset ingest: configure IA session from env | Bryan Newbold | 2021-12-15 | 1 | -1/+6
* pipenv: add pymupdf; update trafilatura | Bryan Newbold | 2021-12-15 | 2 | -420/+644
* fileset ingest: actually use spn2 CLI flag | Bryan Newbold | 2021-12-11 | 2 | -3/+4
* notes on re-GROBID-ing (and re-extracting) some filestrawler | Bryan Newbold | 2021-12-09 | 1 | -0/+289
* grobid: set a maximum file size (256 MByte) | Bryan Newbold | 2021-12-07 | 1 | -0/+8
* worker: add kafka_group_suffix option | Bryan Newbold | 2021-12-07 | 1 | -3/+19
* ingest tool: allow configuration of GROBID endpoint | Bryan Newbold | 2021-12-07 | 1 | -0/+7
* 2021-12-02 database table size stats | Bryan Newbold | 2021-12-07 | 1 | -0/+22
* sandcrawler SQL dump and upload updates | Bryan Newbold | 2021-12-07 | 1 | -4/+12
* update fatcat_file SQL table schema, and add backfill notes | Bryan Newbold | 2021-12-07 | 1 | -1/+3
* update fatcat_file SQL table schema, and add backfill notes | Bryan Newbold | 2021-12-01 | 1 | -0/+13
* commit old patch crawl notes | Bryan Newbold | 2021-12-01 | 1 | -0/+488
* Revert "pipenv: update deps" | Bryan Newbold | 2021-12-01 | 2 | -574/+382
* pipenv: update deps | Bryan Newbold | 2021-12-01 | 2 | -382/+574
* add CDX sha1hex lookup/fetch helper script | Bryan Newbold | 2021-11-30 | 1 | -0/+170
* sandcrawler SQL stats | Bryan Newbold | 2021-11-27 | 2 | -12/+425
* codespell typos in README and original RFC | Bryan Newbold | 2021-11-24 | 2 | -2/+2
* codespell typos in python (comments) | Bryan Newbold | 2021-11-24 | 5 | -5/+5
* html_meta: actual typo in code (CSS selector) caught by codespell | Bryan Newbold | 2021-11-24 | 1 | -1/+1
* codespell fixes in proposals | Bryan Newbold | 2021-11-24 | 8 | -16/+16
* ingest tool: new backfill mode | Bryan Newbold | 2021-11-16 | 1 | -0/+76
* make fmt | Bryan Newbold | 2021-11-16 | 1 | -1/+1
* SPNv2: make 'resources' optional | Bryan Newbold | 2021-11-16 | 1 | -1/+1
* grobid: handle XML parsing errors, and have them recorded in sandcrawler-db | Bryan Newbold | 2021-11-12 | 1 | -1/+5
* ingest_file: more efficient GROBID metadata copy | Bryan Newbold | 2021-11-12 | 1 | -3/+3
* wrap up crossref refs backfill notes | Bryan Newbold | 2021-11-10 | 1 | -0/+47
* grobid_tool: helper to process a single file | Bryan Newbold | 2021-11-10 | 1 | -0/+15
* ingest: start re-processing GROBID with newer version | Bryan Newbold | 2021-11-10 | 1 | -2/+6
* simple persist worker/tool to backfill grobid_refs | Bryan Newbold | 2021-11-10 | 2 | -0/+62
* grobid: extract more metadata in document TEI-XML | Bryan Newbold | 2021-11-10 | 1 | -0/+5
* grobid: update 'TODO' comment based on review | Bryan Newbold | 2021-11-04 | 1 | -3/+0
* update crossref/grobid refs generation notes | Bryan Newbold | 2021-11-04 | 1 | -4/+96
* crossref grobid refs: another error case (ReadTimeout) | Bryan Newbold | 2021-11-04 | 2 | -5/+11
* db (postgrest): actually use an HTTP session | Bryan Newbold | 2021-11-04 | 1 | -12/+24
* grobid: use requests session | Bryan Newbold | 2021-11-04 | 1 | -3/+4
* grobid crossref refs: try to handle HTTP 5xx and XML parse errors | Bryan Newbold | 2021-11-04 | 2 | -5/+33
* grobid: handle weird whitespace unstructured from crossref | Bryan Newbold | 2021-11-04 | 1 | -1/+10