Commit message | Author | Date | Files changed | Lines (-/+)
* grobid: handle XML parsing errors, and have them recorded in sandcrawler-db | Bryan Newbold | 2021-11-12 | 1 | -1/+5
* ingest_file: more efficient GROBID metadata copy | Bryan Newbold | 2021-11-12 | 1 | -3/+3
* wrap up crossref refs backfill notes | Bryan Newbold | 2021-11-10 | 1 | -0/+47
* grobid_tool: helper to process a single file | Bryan Newbold | 2021-11-10 | 1 | -0/+15
* ingest: start re-processing GROBID with newer version | Bryan Newbold | 2021-11-10 | 1 | -2/+6
* simple persist worker/tool to backfill grobid_refs | Bryan Newbold | 2021-11-10 | 2 | -0/+62
* grobid: extract more metadata in document TEI-XML | Bryan Newbold | 2021-11-10 | 1 | -0/+5
* grobid: update 'TODO' comment based on review | Bryan Newbold | 2021-11-04 | 1 | -3/+0
* update crossref/grobid refs generation notes | Bryan Newbold | 2021-11-04 | 1 | -4/+96
* crossref grobid refs: another error case (ReadTimeout) | Bryan Newbold | 2021-11-04 | 2 | -5/+11
  With this last exception handled, was able to get through millions of rows of references, with only a few dozen errors (mostly invalid XML).
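A minimal sketch of the handling this implies, assuming a requests-based call to GROBID's /api/processCitationList endpoint; the function and field names here are illustrative, not the repo's actual code:

    import requests

    def process_citation_list(session: requests.Session, grobid_url: str,
                              unstructured: list) -> dict:
        # Catch ReadTimeout (on top of the HTTP 5xx and XML parse errors
        # handled in the earlier commit) so one slow batch records an
        # error status instead of crashing the whole backfill.
        try:
            resp = session.post(
                grobid_url + "/api/processCitationList",
                data={"citations": unstructured},
                timeout=(30.0, 120.0),  # (connect, read) seconds; assumed values
            )
            resp.raise_for_status()
        except requests.exceptions.ReadTimeout:
            return {"status": "error-timeout"}
        return {"status": "success", "tei_xml": resp.text}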
* db (postgrest): actually use an HTTP session | Bryan Newbold | 2021-11-04 | 1 | -12/+24
  Not as important with GET as POST, I think, but still best practice.
* grobid: use requests session | Bryan Newbold | 2021-11-04 | 1 | -3/+4
  This should fix an embarrassing bug with exhausting local ports:
      requests.exceptions.ConnectionError: HTTPConnectionPool(host='wbgrp-svc096.us.archive.org', port=8070): Max retries exceeded with url: /api/processCitationList (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f8dfc24e250>: Failed to establish a new connection: [Errno 99] Cannot assign requested address'))
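The failure above comes from opening (and then discarding) a fresh TCP connection per request until local ephemeral ports run out; a requests.Session reuses connections via keep-alive. A minimal sketch, with illustrative class and method names:

    import requests

    class GrobidClient:
        def __init__(self, host_url: str):
            self.host_url = host_url
            # One shared Session keeps connections alive and reuses them,
            # instead of leaving a new socket in TIME_WAIT per request
            # ("[Errno 99] Cannot assign requested address").
            self.session = requests.Session()

        def process_citations(self, citations: list) -> str:
            resp = self.session.post(
                self.host_url + "/api/processCitationList",
                data={"citations": citations},
            )
            resp.raise_for_status()
            return resp.text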
* grobid crossref refs: try to handle HTTP 5xx and XML parse errors | Bryan Newbold | 2021-11-04 | 2 | -5/+33
* grobid: handle weird whitespace unstructured from crossref | Bryan Newbold | 2021-11-04 | 1 | -1/+10
  See also: https://github.com/kermitt2/grobid/issues/849
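A sketch of the kind of normalization this implies, assuming the cleanup happens on the crossref 'unstructured' string before it is sent to GROBID; the function name is illustrative:

    def clean_crossref_unstructured(raw: str) -> str:
        # Crossref 'unstructured' reference strings sometimes contain
        # newlines, tabs, and runs of spaces, which can confuse GROBID's
        # citation parser (see kermitt2/grobid#849). Collapse all
        # whitespace runs to single spaces and trim the ends.
        return " ".join(raw.split())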
* crossref persist: batch size depends on whether parsing refs | Bryan Newbold | 2021-11-04 | 2 | -2/+8
* sql: grobid_refs table JSON as 'JSON' not 'JSONB' | Bryan Newbold | 2021-11-04 | 2 | -3/+3
  I keep flip-flopping on this, but our disk usage is really large, and if 'JSON' is at all smaller than 'JSONB' in PostgreSQL, it is worth it.
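One way to check whether the plain-'JSON' column actually saves disk, sketched with psycopg2; the connection string is an assumption:

    import psycopg2

    # Compare the on-disk size of the table before and after the
    # JSONB -> JSON change; pg_total_relation_size() includes TOAST
    # storage and indexes, which is where large JSON values live.
    conn = psycopg2.connect("dbname=sandcrawler")
    cur = conn.cursor()
    cur.execute("SELECT pg_size_pretty(pg_total_relation_size('grobid_refs'))")
    print("grobid_refs total size:", cur.fetchone()[0])
    conn.close()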
* grobid refs backfill progress | Bryan Newbold | 2021-11-04 | 1 | -1/+43
* record SQL table sizes at start of crossref re-ingest | Bryan Newbold | 2021-11-04 | 1 | -0/+19
* start notes on crossref refs backfill | Bryan Newbold | 2021-11-04 | 1 | -0/+54
* crossref persist: make GROBID ref parsing an option (not default) | Bryan Newbold | 2021-11-04 | 3 | -9/+33
* add grobid_refs and crossref_with_refs to sandcrawler-db SQL schema | Bryan Newbold | 2021-11-04 | 1 | -0/+21
* glue, utils, and worker code for crossref and grobid_refs | Bryan Newbold | 2021-11-04 | 4 | -5/+212
* update grobid refs proposal | Bryan Newbold | 2021-11-04 | 1 | -10/+72
* iterated GROBID citation cleaning and processing | Bryan Newbold | 2021-11-04 | 1 | -27/+45
  Switched to using just 'key'/'id' for downstream matching.
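A hedged sketch of what 'key'/'id'-based downstream matching might look like; besides 'key' and 'id' (named in the commit message), all field and function names are illustrative assumptions:

    def match_refs(crossref_refs: list, grobid_refs: list) -> list:
        # Join GROBID-parsed citations back to the original crossref
        # reference entries by stable identifier rather than by fuzzy
        # comparison of the citation strings themselves.
        parsed_by_id = {r["id"]: r for r in grobid_refs if r.get("id")}
        return [
            {"crossref": ref, "grobid": parsed_by_id.get(ref.get("key"))}
            for ref in crossref_refs
        ]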
* grobid citations: first pass at cleaning unstructured | Bryan Newbold | 2021-11-04 | 1 | -2/+34
* initial proposal for GROBID refs table and pipeline | Bryan Newbold | 2021-11-04 | 1 | -0/+63
* initial crossref-refs via GROBID helper routine | Bryan Newbold | 2021-11-04 | 7 | -6/+839
* pipenv: bump grobid_tei_xml version to 0.1.2 | Bryan Newbold | 2021-11-04 | 2 | -11/+11
* pdftrio client: use HTTP session for POSTs | Bryan Newbold | 2021-11-03 | 1 | -1/+1
* workers: use HTTP session for archive.org fetches | Bryan Newbold | 2021-11-03 | 1 | -3/+3
* IA (wayback): actually use an HTTP session for replay fetches | Bryan Newbold | 2021-11-03 | 1 | -2/+3
  I am embarrassed this wasn't actually the case already! Looks like I had even instantiated a session but wasn't using it. Hopefully this change, which adds extra retries and better backoff behavior, will improve sandcrawler ingest throughput.
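The extra retries and backoff presumably come from mounting a retry-configured adapter on the session. A minimal sketch using urllib3's Retry; the specific retry count, backoff factor, and status list are assumptions:

    import requests
    from requests.adapters import HTTPAdapter
    from urllib3.util.retry import Retry

    def make_session() -> requests.Session:
        session = requests.Session()
        # Retry transient failures with exponential backoff between
        # attempts, instead of failing (or hammering) immediately.
        retries = Retry(
            total=3,
            backoff_factor=3.0,
            status_forcelist=[500, 502, 503, 504],
        )
        adapter = HTTPAdapter(max_retries=retries)
        session.mount("http://", adapter)
        session.mount("https://", adapter)
        return session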
* SPN reingest: 6 hour minimum, 6 month max | Bryan Newbold | 2021-11-03 | 1 | -2/+2
* sql: fix typo in quarterly (not weekly) script | Bryan Newbold | 2021-11-03 | 1 | -1/+1
* sql: fixes to ingest_fileset_platform schema (from table creation) | Bryan Newbold | 2021-11-01 | 2 | -12/+12
* updates/corrections to old small.json GROBID metadata example file | Bryan Newbold | 2021-10-27 | 1 | -6/+1
* remove grobid2json helper file, replace with grobid_tei_xml | Bryan Newbold | 2021-10-27 | 7 | -224/+22
* small type annotation things from additional packages | Bryan Newbold | 2021-10-27 | 2 | -5/+14
* toolchain config updates | Bryan Newbold | 2021-10-27 | 3 | -10/+6
* make fmt (black 21.9b0) | Bryan Newbold | 2021-10-27 | 57 | -3126/+3991
* pipenv: flipflop from yapf back to black; more type packages; bump grobid_tei_xml | Bryan Newbold | 2021-10-27 | 2 | -27/+112
* fileset: refactor out tables of helpers | Bryan Newbold | 2021-10-27 | 3 | -21/+19
  Having these objects instantiated in module-level tables resulted in a whole bunch of objects (including children) getting initialized, which seems like the wrong thing to do. Defer this until the actual ingest fileset worker is initialized.
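A sketch of the refactor described above: keep classes (not instances) in the module-level table and instantiate them only when the worker itself is constructed. All names are illustrative:

    # Stub helper classes, standing in for the real platform helpers.
    class DataverseHelper:
        pass

    class FigshareHelper:
        pass

    # Before: DATASET_PLATFORM_HELPER_TABLE = {"dataverse": DataverseHelper(), ...}
    # which instantiated every helper (and its children) at import time.
    DATASET_PLATFORM_HELPER_TABLE = {
        "dataverse": DataverseHelper,
        "figshare": FigshareHelper,
    }

    class IngestFilesetWorker:
        def __init__(self) -> None:
            # Helpers are only constructed when a worker is created,
            # not as a side effect of importing this module.
            self.helpers = {
                name: cls() for name, cls in DATASET_PLATFORM_HELPER_TABLE.items()
            }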
* gitlab-ci: copy env var into place for tests | Bryan Newbold | 2021-10-27 | 1 | -0/+1
* fix type annotations for petabox body fetch helper | Bryan Newbold | 2021-10-26 | 5 | -8/+11
* small type annotation hack | Bryan Newbold | 2021-10-26 | 1 | -1/+1
* fileset: fix field renaming bug (caught by mypy) | Bryan Newbold | 2021-10-26 | 1 | -2/+2
* fileset ingest: fix table name typo (via mypy) | Bryan Newbold | 2021-10-26 | 1 | -1/+1
* update 'XXX' notes from fileset ingest development | Bryan Newbold | 2021-10-26 | 2 | -9/+6
* bugfix: setting html_biblio on ingest results | Bryan Newbold | 2021-10-26 | 2 | -2/+2
  This was caught during lint cleanup.
* lint collection membership (last lint for now) | Bryan Newbold | 2021-10-26 | 7 | -32/+32
* commit updated flake8 lint configuration | Bryan Newbold | 2021-10-26 | 1 | -6/+10