path: root/python
Commit message | Author | Age | Files | Lines
* add CDX sha1hex lookup/fetch helper script | Bryan Newbold | 2021-11-30 | 1 | -0/+170
|
* codespell typos in python (comments) | Bryan Newbold | 2021-11-24 | 5 | -5/+5
|
* html_meta: actual typo in code (CSS selector) caught by codespell | Bryan Newbold | 2021-11-24 | 1 | -1/+1
|
* ingest tool: new backfill mode | Bryan Newbold | 2021-11-16 | 1 | -0/+76
|
* make fmt | Bryan Newbold | 2021-11-16 | 1 | -1/+1
|
* SPNv2: make 'resources' optional | Bryan Newbold | 2021-11-16 | 1 | -1/+1
|   This was always present previously. A recent change to the SPNv2 API borked it a bit, though in theory it should be present on new captures. I'm not seeing it for some captures, so pushing this workaround. It seems like we don't actually use this field anyways, at least for the ingest pipeline.
* grobid: handle XML parsing errors, and have them recorded in sandcrawler-db | Bryan Newbold | 2021-11-12 | 1 | -1/+5
|
* ingest_file: more efficient GROBID metadata copy | Bryan Newbold | 2021-11-12 | 1 | -3/+3
|
* grobid_tool: helper to process a single file | Bryan Newbold | 2021-11-10 | 1 | -0/+15
|
* ingest: start re-processing GROBID with newer version | Bryan Newbold | 2021-11-10 | 1 | -2/+6
|
* simple persist worker/tool to backfill grobid_refs | Bryan Newbold | 2021-11-10 | 2 | -0/+62
|
* grobid: extract more metadata in document TEI-XML | Bryan Newbold | 2021-11-10 | 1 | -0/+5
|
* grobid: update 'TODO' comment based on review | Bryan Newbold | 2021-11-04 | 1 | -3/+0
|
* crossref grobid refs: another error case (ReadTimeout) | Bryan Newbold | 2021-11-04 | 2 | -5/+11
|   With this last exception handled, was able to get through millions of rows of references, with only a few dozen errors (mostly invalid XML).
* db (postgrest): actually use an HTTP session | Bryan Newbold | 2021-11-04 | 1 | -12/+24
|   Not as important with GET as POST, I think, but still best practice.
* grobid: use requests session | Bryan Newbold | 2021-11-04 | 1 | -3/+4
|   This should fix an embarrassing bug with exhausting local ports:
|       requests.exceptions.ConnectionError: HTTPConnectionPool(host='wbgrp-svc096.us.archive.org', port=8070): Max retries exceeded with url: /api/processCitationList (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f8dfc24e250>: Failed to establish a new connection: [Errno 99] Cannot assign requested address'))
* grobid crossref refs: try to handle HTTP 5xx and XML parse errors | Bryan Newbold | 2021-11-04 | 2 | -5/+33
|
* grobid: handle weird whitespace unstructured from crossref | Bryan Newbold | 2021-11-04 | 1 | -1/+10
|   See also: https://github.com/kermitt2/grobid/issues/849
* crossref persist: batch size depends on whether parsing refs | Bryan Newbold | 2021-11-04 | 2 | -2/+8
|
* crossref persist: make GROBID ref parsing an option (not default) | Bryan Newbold | 2021-11-04 | 3 | -9/+33
|
* glue, utils, and worker code for crossref and grobid_refs | Bryan Newbold | 2021-11-04 | 4 | -5/+212
|
* iterated GROBID citation cleaning and processing | Bryan Newbold | 2021-11-04 | 1 | -27/+45
|   Switched to using just 'key'/'id' for downstream matching.
* grobid citations: first pass at cleaning unstructured | Bryan Newbold | 2021-11-04 | 1 | -2/+34
|
* initial crossref-refs via GROBID helper routine | Bryan Newbold | 2021-11-04 | 7 | -6/+839
|
* pipenv: bump grobid_tei_xml version to 0.1.2 | Bryan Newbold | 2021-11-04 | 2 | -11/+11
|
* pdftrio client: use HTTP session for POSTs | Bryan Newbold | 2021-11-03 | 1 | -1/+1
|
* workers: use HTTP session for archive.org fetches | Bryan Newbold | 2021-11-03 | 1 | -3/+3
|
* IA (wayback): actually use an HTTP session for replay fetches | Bryan Newbold | 2021-11-03 | 1 | -2/+3
|   I am embarrassed this wasn't actually the case already! Looks like I had even instantiated a session but wasn't using it. Hopefully this change, which adds extra retries and better backoff behavior, will improve sandcrawler ingest throughput.
* updates/corrections to old small.json GROBID metadata example file | Bryan Newbold | 2021-10-27 | 1 | -6/+1
|
* remove grobid2json helper file, replace with grobid_tei_xml | Bryan Newbold | 2021-10-27 | 7 | -224/+22
|
* small type annotation things from additional packages | Bryan Newbold | 2021-10-27 | 2 | -5/+14
|
* toolchain config updates | Bryan Newbold | 2021-10-27 | 3 | -10/+6
|
* make fmt (black 21.9b0) | Bryan Newbold | 2021-10-27 | 57 | -3126/+3991
|
* pipenv: flipflop from yapf back to black; more type packages; bump grobid_tei_xml | Bryan Newbold | 2021-10-27 | 2 | -27/+112
|
* fileset: refactor out tables of helpers | Bryan Newbold | 2021-10-27 | 3 | -21/+19
|   Having these objects invoked in tables resulted in a whole bunch of objects (including children) getting initialized, which seems like the wrong thing to do. Defer this until the actual ingest fileset worker is initialized.
* fix type annotations for petabox body fetch helper | Bryan Newbold | 2021-10-26 | 5 | -8/+11
|
* small type annotation hack | Bryan Newbold | 2021-10-26 | 1 | -1/+1
|
* fileset: fix field renaming bug (caught by mypy) | Bryan Newbold | 2021-10-26 | 1 | -2/+2
|
* fileset ingest: fix table name typo (via mypy) | Bryan Newbold | 2021-10-26 | 1 | -1/+1
|
* update 'XXX' notes from fileset ingest development | Bryan Newbold | 2021-10-26 | 2 | -9/+6
|
* bugfix: setting html_biblio on ingest results | Bryan Newbold | 2021-10-26 | 2 | -2/+2
|   This was caught during lint cleanup.
* lint collection membership (last lint for now) | Bryan Newbold | 2021-10-26 | 7 | -32/+32
|
* commit updated flake8 lint configuration | Bryan Newbold | 2021-10-26 | 1 | -6/+10
|
* ingest fileset: fix silly import typo | Bryan Newbold | 2021-10-26 | 1 | -1/+1
|
* type annotations for persist workers; required some work | Bryan Newbold | 2021-10-26 | 1 | -66/+59
|   Had to restructure and filter things a bit. Should be better behavior, but there might be some small changes.
* ingest file HTTP API: fixes from type checking | Bryan Newbold | 2021-10-26 | 1 | -3/+3
|   This code is deprecated and should be removed anyways, but it is still interesting to see the fixes.
* more progress on type annotations | Bryan Newbold | 2021-10-26 | 8 | -34/+55
|
* grobid: fix a bug with consolidate_mode header, exposed by type annotations | Bryan Newbold | 2021-10-26 | 1 | -1/+2
|
* grobid: type annotations | Bryan Newbold | 2021-10-26 | 1 | -9/+19
|
* type annotations on SandcrawlerWorker | Bryan Newbold | 2021-10-26 | 1 | -46/+57
|   These annotations have a broad impact! Being conservative to start: Any-to-Any for process(), etc.
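
Note on "fileset: refactor out tables of helpers" (2021-10-27): the message describes helper objects being constructed inside lookup tables, and defers that until the ingest fileset worker is initialized. The following is only a rough sketch of that general pattern, not the actual sandcrawler code. Every name in it (PlatformHelper, HELPER_NAMES, FilesetWorker, and the platform strings) is hypothetical and chosen for illustration.

```python
from typing import Dict, List


class PlatformHelper:
    """Hypothetical helper; imagine its constructor is expensive (builds clients, etc.)."""

    instances_created = 0  # counter used only to make the effect visible

    def __init__(self, name: str) -> None:
        self.name = name
        PlatformHelper.instances_created += 1


# Eager pattern (the kind of thing the commit message describes moving away from):
# a module-level table such as
#   HELPER_TABLE = {"dataverse": PlatformHelper("dataverse"), ...}
# constructs every helper as soon as the module is imported.

# Deferred pattern: the module only lists names; nothing is constructed yet.
HELPER_NAMES: List[str] = ["dataverse", "figshare", "zenodo"]  # illustrative values


class FilesetWorker:
    """Hypothetical worker that builds its helper table when it is initialized."""

    def __init__(self) -> None:
        self.helpers: Dict[str, PlatformHelper] = {
            name: PlatformHelper(name) for name in HELPER_NAMES
        }


count_before = PlatformHelper.instances_created  # no helpers built at import time
worker = FilesetWorker()
count_after = PlatformHelper.instances_created  # helpers built with the worker
```

With this layout, code that imports the module without creating a worker (a CLI help screen, a test of an unrelated function) pays no construction cost for the helpers.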