Commit message | Author | Age | Files | Lines | ||
---|---|---|---|---|---|---|
... | ||||||
* | grobid_tool: helper to process a single file | Bryan Newbold | 2021-11-10 | 1 | -0/+15 | |
| | ||||||
* | ingest: start re-processing GROBID with newer version | Bryan Newbold | 2021-11-10 | 1 | -2/+6 | |
| | ||||||
* | simple persist worker/tool to backfill grobid_refs | Bryan Newbold | 2021-11-10 | 2 | -0/+62 | |
| | ||||||
* | grobid: extract more metadata in document TEI-XML | Bryan Newbold | 2021-11-10 | 1 | -0/+5 | |
| | ||||||
* | grobid: update 'TODO' comment based on review | Bryan Newbold | 2021-11-04 | 1 | -3/+0 | |
| | ||||||
* | update crossref/grobid refs generation notes | Bryan Newbold | 2021-11-04 | 1 | -4/+96 | |
| | ||||||
* | crossref grobid refs: another error case (ReadTimeout) | Bryan Newbold | 2021-11-04 | 2 | -5/+11 | |
| | | | | | With this last exception handled, was able to get through millions of rows of references, with only a few dozen errors (mostly invalid XML). | |||||
* | db (postgrest): actually use an HTTP session | Bryan Newbold | 2021-11-04 | 1 | -12/+24 | |
| | | | | Not as important with GET as POST, I think, but still best practice. | |||||
* | grobid: use requests session | Bryan Newbold | 2021-11-04 | 1 | -3/+4 | |
| | | | | | | This should fix an embarrassing bug with exhausting local ports: requests.exceptions.ConnectionError: HTTPConnectionPool(host='wbgrp-svc096.us.archive.org', port=8070): Max retries exceeded with url: /api/processCitationList (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f8dfc24e250>: Failed to establish a new connection: [Errno 99] Cannot assign requested address')) | |||||
* | grobid crossref refs: try to handle HTTP 5xx and XML parse errors | Bryan Newbold | 2021-11-04 | 2 | -5/+33 | |
| | ||||||
* | grobid: handle weird whitespace unstructured from crossref | Bryan Newbold | 2021-11-04 | 1 | -1/+10 | |
| | | | | See also: https://github.com/kermitt2/grobid/issues/849 | |||||
* | crossref persist: batch size depends on whether parsing refs | Bryan Newbold | 2021-11-04 | 2 | -2/+8 | |
| | ||||||
* | sql: grobid_refs table JSON as 'JSON' not 'JSONB' | Bryan Newbold | 2021-11-04 | 2 | -3/+3 | |
| | | | | | I keep flip-flopping on this, but our disk usage is really large, and if 'JSON' is smaller than 'JSONB' in postgresql at all it is worth it. | |||||
* | grobid refs backfill progress | Bryan Newbold | 2021-11-04 | 1 | -1/+43 | |
| | ||||||
* | record SQL table sizes at start of crossref re-ingest | Bryan Newbold | 2021-11-04 | 1 | -0/+19 | |
| | ||||||
* | start notes on crossref refs backfill | Bryan Newbold | 2021-11-04 | 1 | -0/+54 | |
| | ||||||
* | crossref persist: make GROBID ref parsing an option (not default) | Bryan Newbold | 2021-11-04 | 3 | -9/+33 | |
| | ||||||
* | add grobid_refs and crossref_with_refs to sandcrawler-db SQL schema | Bryan Newbold | 2021-11-04 | 1 | -0/+21 | |
| | ||||||
* | glue, utils, and worker code for crossref and grobid_refs | Bryan Newbold | 2021-11-04 | 4 | -5/+212 | |
| | ||||||
* | update grobid refs proposal | Bryan Newbold | 2021-11-04 | 1 | -10/+72 | |
| | ||||||
* | iterated GROBID citation cleaning and processing | Bryan Newbold | 2021-11-04 | 1 | -27/+45 | |
| | | | | Switched to using just 'key'/'id' for downstream matching. | |||||
* | grobid citations: first pass at cleaning unstructured | Bryan Newbold | 2021-11-04 | 1 | -2/+34 | |
| | ||||||
* | initial proposal for GROBID refs table and pipeline | Bryan Newbold | 2021-11-04 | 1 | -0/+63 | |
| | ||||||
* | initial crossref-refs via GROBID helper routine | Bryan Newbold | 2021-11-04 | 7 | -6/+839 | |
| | ||||||
* | pipenv: bump grobid_tei_xml version to 0.1.2 | Bryan Newbold | 2021-11-04 | 2 | -11/+11 | |
| | ||||||
* | pdftrio client: use HTTP session for POSTs | Bryan Newbold | 2021-11-03 | 1 | -1/+1 | |
| | ||||||
* | workers: use HTTP session for archive.org fetches | Bryan Newbold | 2021-11-03 | 1 | -3/+3 | |
| | ||||||
* | IA (wayback): actually use an HTTP session for replay fetches | Bryan Newbold | 2021-11-03 | 1 | -2/+3 | |
| | | | | | | | | I am embarrassed this wasn't actually the case already! Looks like I had even instantiated a session but wasn't using it. Hopefully this change, which adds extra retries and better backoff behavior, will improve sandcrawler ingest throughput. | |||||
* | SPN reingest: 6 hour minimum, 6 month max | Bryan Newbold | 2021-11-03 | 1 | -2/+2 | |
| | ||||||
* | sql: fix typo in quarterly (not weekly) script | Bryan Newbold | 2021-11-03 | 1 | -1/+1 | |
| | ||||||
* | sql: fixes to ingest_fileset_platform schema (from table creation) | Bryan Newbold | 2021-11-01 | 2 | -12/+12 | |
| | ||||||
* | updates/corrections to old small.json GROBID metadata example file | Bryan Newbold | 2021-10-27 | 1 | -6/+1 | |
| | ||||||
* | remove grobid2json helper file, replace with grobid_tei_xml | Bryan Newbold | 2021-10-27 | 7 | -224/+22 | |
| | ||||||
* | small type annotation things from additional packages | Bryan Newbold | 2021-10-27 | 2 | -5/+14 | |
| | ||||||
* | toolchain config updates | Bryan Newbold | 2021-10-27 | 3 | -10/+6 | |
| | ||||||
* | make fmt (black 21.9b0) | Bryan Newbold | 2021-10-27 | 57 | -3126/+3991 | |
| | ||||||
* | pipenv: flipflop from yapf back to black; more type packages; bump grobid_tei_xml | Bryan Newbold | 2021-10-27 | 2 | -27/+112 | |
| | ||||||
* | fileset: refactor out tables of helpers | Bryan Newbold | 2021-10-27 | 3 | -21/+19 | |
| | | | | | | | Having these objects invoked in tables resulted in a whole bunch of objects (including children) getting initialized, which seems like the wrong thing to do. Defer this until the actual ingest fileset worker is initialized. | |||||
* | gitlab-ci: copy env var in to place for tests | Bryan Newbold | 2021-10-27 | 1 | -0/+1 | |
| | ||||||
* | fix type annotations for petabox body fetch helper | Bryan Newbold | 2021-10-26 | 5 | -8/+11 | |
| | ||||||
* | small type annotation hack | Bryan Newbold | 2021-10-26 | 1 | -1/+1 | |
| | ||||||
* | fileset: fix field renaming bug (caught by mypy) | Bryan Newbold | 2021-10-26 | 1 | -2/+2 | |
| | ||||||
* | fileset ingest: fix table name typo (via mypy) | Bryan Newbold | 2021-10-26 | 1 | -1/+1 | |
| | ||||||
* | update 'XXX' notes from fileset ingest development | Bryan Newbold | 2021-10-26 | 2 | -9/+6 | |
| | ||||||
* | bugfix: setting html_biblio on ingest results | Bryan Newbold | 2021-10-26 | 2 | -2/+2 | |
| | | | | This was caught during lint cleanup | |||||
* | lint collection membership (last lint for now) | Bryan Newbold | 2021-10-26 | 7 | -32/+32 | |
| | ||||||
* | commit updated flake8 lint configuration | Bryan Newbold | 2021-10-26 | 1 | -6/+10 | |
| | ||||||
* | ingest fileset: fix silly import typo | Bryan Newbold | 2021-10-26 | 1 | -1/+1 | |
| | ||||||
* | type annotations for persist workers; required some work | Bryan Newbold | 2021-10-26 | 1 | -66/+59 | |
| | | | | | Had to re-structure and filter things a bit. Should be better behavior, but there might be some small changes. | |||||
* | ingest file HTTP API: fixes from type checking | Bryan Newbold | 2021-10-26 | 1 | -3/+3 | |
| | | | | | This code is deprecated and should be removed anyways, but still interesting to see the fixes |