path: root/python
Commit message | Author | Date | Files | Lines
* fileset ingest: create tmp subdirectories if needed | Bryan Newbold | 2021-12-15 | 1 | -0/+5
* fileset ingest: configure IA session from env | Bryan Newbold | 2021-12-15 | 1 | -1/+6
* pipenv: add pymupdf; update trafilatura | Bryan Newbold | 2021-12-15 | 2 | -420/+644
* fileset ingest: actually use spn2 CLI flag | Bryan Newbold | 2021-12-11 | 2 | -3/+4
* grobid: set a maximum file size (256 MByte) | Bryan Newbold | 2021-12-07 | 1 | -0/+8
* worker: add kafka_group_suffix option | Bryan Newbold | 2021-12-07 | 1 | -3/+19
* ingest tool: allow configuration of GROBID endpoint | Bryan Newbold | 2021-12-07 | 1 | -0/+7
* Revert "pipenv: update deps" | Bryan Newbold | 2021-12-01 | 2 | -574/+382
* pipenv: update deps | Bryan Newbold | 2021-12-01 | 2 | -382/+574
* add CDX sha1hex lookup/fetch helper script | Bryan Newbold | 2021-11-30 | 1 | -0/+170
* codespell typos in python (comments) | Bryan Newbold | 2021-11-24 | 5 | -5/+5
* html_meta: actual typo in code (CSS selector) caught by codespell | Bryan Newbold | 2021-11-24 | 1 | -1/+1
* ingest tool: new backfill mode | Bryan Newbold | 2021-11-16 | 1 | -0/+76
* make fmt | Bryan Newbold | 2021-11-16 | 1 | -1/+1
* SPNv2: make 'resources' optional | Bryan Newbold | 2021-11-16 | 1 | -1/+1
* grobid: handle XML parsing errors, and have them recorded in sandcrawler-db | Bryan Newbold | 2021-11-12 | 1 | -1/+5
* ingest_file: more efficient GROBID metadata copy | Bryan Newbold | 2021-11-12 | 1 | -3/+3
* grobid_tool: helper to process a single file | Bryan Newbold | 2021-11-10 | 1 | -0/+15
* ingest: start re-processing GROBID with newer version | Bryan Newbold | 2021-11-10 | 1 | -2/+6
* simple persist worker/tool to backfill grobid_refs | Bryan Newbold | 2021-11-10 | 2 | -0/+62
* grobid: extract more metadata in document TEI-XML | Bryan Newbold | 2021-11-10 | 1 | -0/+5
* grobid: update 'TODO' comment based on review | Bryan Newbold | 2021-11-04 | 1 | -3/+0
* crossref grobid refs: another error case (ReadTimeout) | Bryan Newbold | 2021-11-04 | 2 | -5/+11
* db (postgrest): actually use an HTTP session | Bryan Newbold | 2021-11-04 | 1 | -12/+24
* grobid: use requests session | Bryan Newbold | 2021-11-04 | 1 | -3/+4
* grobid crossref refs: try to handle HTTP 5xx and XML parse errors | Bryan Newbold | 2021-11-04 | 2 | -5/+33
* grobid: handle weird whitespace unstructured from crossref | Bryan Newbold | 2021-11-04 | 1 | -1/+10
* crossref persist: batch size depends on whether parsing refs | Bryan Newbold | 2021-11-04 | 2 | -2/+8
* crossref persist: make GROBID ref parsing an option (not default) | Bryan Newbold | 2021-11-04 | 3 | -9/+33
* glue, utils, and worker code for crossref and grobid_refs | Bryan Newbold | 2021-11-04 | 4 | -5/+212
* iterated GROBID citation cleaning and processing | Bryan Newbold | 2021-11-04 | 1 | -27/+45
* grobid citations: first pass at cleaning unstructured | Bryan Newbold | 2021-11-04 | 1 | -2/+34
* initial crossref-refs via GROBID helper routine | Bryan Newbold | 2021-11-04 | 7 | -6/+839
* pipenv: bump grobid_tei_xml version to 0.1.2 | Bryan Newbold | 2021-11-04 | 2 | -11/+11
* pdftrio client: use HTTP session for POSTs | Bryan Newbold | 2021-11-03 | 1 | -1/+1
* workers: use HTTP session for archive.org fetches | Bryan Newbold | 2021-11-03 | 1 | -3/+3
* IA (wayback): actually use an HTTP session for replay fetches | Bryan Newbold | 2021-11-03 | 1 | -2/+3
* updates/corrections to old small.json GROBID metadata example file | Bryan Newbold | 2021-10-27 | 1 | -6/+1
* remove grobid2json helper file, replace with grobid_tei_xml | Bryan Newbold | 2021-10-27 | 7 | -224/+22
* small type annotation things from additional packages | Bryan Newbold | 2021-10-27 | 2 | -5/+14
* toolchain config updates | Bryan Newbold | 2021-10-27 | 3 | -10/+6
* make fmt (black 21.9b0) | Bryan Newbold | 2021-10-27 | 57 | -3126/+3991
* pipenv: flipflop from yapf back to black; more type packages; bump grobid_tei... | Bryan Newbold | 2021-10-27 | 2 | -27/+112
* fileset: refactor out tables of helpers | Bryan Newbold | 2021-10-27 | 3 | -21/+19
* fix type annotations for petabox body fetch helper | Bryan Newbold | 2021-10-26 | 5 | -8/+11
* small type annotation hack | Bryan Newbold | 2021-10-26 | 1 | -1/+1
* fileset: fix field renaming bug (caught by mypy) | Bryan Newbold | 2021-10-26 | 2 | -2/+2
* fileset ingest: fix table name typo (via mypy) | Bryan Newbold | 2021-10-26 | 1 | -1/+1
* update 'XXX' notes from fileset ingest development | Bryan Newbold | 2021-10-26 | 2 | -9/+6
* bugfix: setting html_biblio on ingest results | Bryan Newbold | 2021-10-26 | 2 | -2/+2