Commit message | Author | Age | Files | Lines
...
* grobid crossref refs: try to handle HTTP 5xx and XML parse errors | Bryan Newbold | 2021-11-04 | 2 | -5/+33
* grobid: handle weird whitespace unstructured from crossref | Bryan Newbold | 2021-11-04 | 1 | -1/+10
* crossref persist: batch size depends on whether parsing refs | Bryan Newbold | 2021-11-04 | 2 | -2/+8
* sql: grobid_refs table JSON as 'JSON' not 'JSONB' | Bryan Newbold | 2021-11-04 | 2 | -3/+3
* grobid refs backfill progress | Bryan Newbold | 2021-11-04 | 1 | -1/+43
* record SQL table sizes at start of crossref re-ingest | Bryan Newbold | 2021-11-04 | 1 | -0/+19
* start notes on crossref refs backfill | Bryan Newbold | 2021-11-04 | 1 | -0/+54
* crossref persist: make GROBID ref parsing an option (not default) | Bryan Newbold | 2021-11-04 | 3 | -9/+33
* add grobid_refs and crossref_with_refs to sandcrawler-db SQL schema | Bryan Newbold | 2021-11-04 | 1 | -0/+21
* glue, utils, and worker code for crossref and grobid_refs | Bryan Newbold | 2021-11-04 | 4 | -5/+212
* update grobid refs proposal | Bryan Newbold | 2021-11-04 | 1 | -10/+72
* iterated GROBID citation cleaning and processing | Bryan Newbold | 2021-11-04 | 1 | -27/+45
* grobid citations: first pass at cleaning unstructured | Bryan Newbold | 2021-11-04 | 1 | -2/+34
* initial proposal for GROBID refs table and pipeline | Bryan Newbold | 2021-11-04 | 1 | -0/+63
* initial crossref-refs via GROBID helper routine | Bryan Newbold | 2021-11-04 | 7 | -6/+839
* pipenv: bump grobid_tei_xml version to 0.1.2 | Bryan Newbold | 2021-11-04 | 2 | -11/+11
* pdftrio client: use HTTP session for POSTs | Bryan Newbold | 2021-11-03 | 1 | -1/+1
* workers: use HTTP session for archive.org fetches | Bryan Newbold | 2021-11-03 | 1 | -3/+3
* IA (wayback): actually use an HTTP session for replay fetches | Bryan Newbold | 2021-11-03 | 1 | -2/+3
* SPN reingest: 6 hour minimum, 6 month max | Bryan Newbold | 2021-11-03 | 1 | -2/+2
* sql: fix typo in quarterly (not weekly) script | Bryan Newbold | 2021-11-03 | 1 | -1/+1
* sql: fixes to ingest_fileset_platform schema (from table creation) | Bryan Newbold | 2021-11-01 | 2 | -12/+12
* updates/corrections to old small.json GROBID metadata example file | Bryan Newbold | 2021-10-27 | 1 | -6/+1
* remove grobid2json helper file, replace with grobid_tei_xml | Bryan Newbold | 2021-10-27 | 7 | -224/+22
* small type annotation things from additional packages | Bryan Newbold | 2021-10-27 | 2 | -5/+14
* toolchain config updates | Bryan Newbold | 2021-10-27 | 3 | -10/+6
* make fmt (black 21.9b0) | Bryan Newbold | 2021-10-27 | 57 | -3126/+3991
* pipenv: flipflop from yapf back to black; more type packages; bump grobid_tei... | Bryan Newbold | 2021-10-27 | 2 | -27/+112
* fileset: refactor out tables of helpers | Bryan Newbold | 2021-10-27 | 3 | -21/+19
* gitlab-ci: copy env var in to place for tests | Bryan Newbold | 2021-10-27 | 1 | -0/+1
* fix type annotations for petabox body fetch helper | Bryan Newbold | 2021-10-26 | 5 | -8/+11
* small type annotation hack | Bryan Newbold | 2021-10-26 | 1 | -1/+1
* fileset: fix field renaming bug (caught by mypy) | Bryan Newbold | 2021-10-26 | 1 | -2/+2
* fileset ingest: fix table name typo (via mypy) | Bryan Newbold | 2021-10-26 | 1 | -1/+1
* update 'XXX' notes from fileset ingest development | Bryan Newbold | 2021-10-26 | 2 | -9/+6
* bugfix: setting html_biblio on ingest results | Bryan Newbold | 2021-10-26 | 2 | -2/+2
* lint collection membership (last lint for now) | Bryan Newbold | 2021-10-26 | 7 | -32/+32
* commit updated flake8 lint configuration | Bryan Newbold | 2021-10-26 | 1 | -6/+10
* ingest fileset: fix silly import typo | Bryan Newbold | 2021-10-26 | 1 | -1/+1
* type annotations for persist workers; required some work | Bryan Newbold | 2021-10-26 | 1 | -66/+59
* ingest file HTTP API: fixes from type checking | Bryan Newbold | 2021-10-26 | 1 | -3/+3
* more progress on type annotations | Bryan Newbold | 2021-10-26 | 8 | -34/+55
* grobid: fix a bug with consolidate_mode header, exposed by type annotations | Bryan Newbold | 2021-10-26 | 1 | -1/+2
* grobid: type annotations | Bryan Newbold | 2021-10-26 | 1 | -9/+19
* type annotations on SandcrawlerWorker | Bryan Newbold | 2021-10-26 | 1 | -46/+57
* more progress on type annotations and linting | Bryan Newbold | 2021-10-26 | 11 | -55/+87
* live tests: FTP wayback replay now returns 200, not 226 | Bryan Newbold | 2021-10-26 | 1 | -2/+2
* ia: more tweaks to delicate code to satisfy type checker | Bryan Newbold | 2021-10-26 | 1 | -10/+12
* ia helpers: enforce max_redirects count correctly | Bryan Newbold | 2021-10-26 | 1 | -1/+1
* set CDX request params are str, not int or datetime | Bryan Newbold | 2021-10-26 | 1 | -3/+6