| | Commit message | Author | Age | Files | Lines |
|---|---|---|---|---|---|
| * | record SQL table sizes at start of crossref re-ingest | Bryan Newbold | 2021-11-04 | 1 | -0/+19 | 
| | | |||||
| * | start notes on crossref refs backfill | Bryan Newbold | 2021-11-04 | 1 | -0/+54 | 
| | | |||||
| * | crossref persist: make GROBID ref parsing an option (not default) | Bryan Newbold | 2021-11-04 | 3 | -9/+33 | 
| | | |||||
| * | add grobid_refs and crossref_with_refs to sandcrawler-db SQL schema | Bryan Newbold | 2021-11-04 | 1 | -0/+21 | 
| | | |||||
| * | glue, utils, and worker code for crossref and grobid_refs | Bryan Newbold | 2021-11-04 | 4 | -5/+212 | 
| | | |||||
| * | update grobid refs proposal | Bryan Newbold | 2021-11-04 | 1 | -10/+72 | 
| | | |||||
| * | iterated GROBID citation cleaning and processing | Bryan Newbold | 2021-11-04 | 1 | -27/+45 | 
| | | | | | Switched to using just 'key'/'id' for downstream matching. | ||||
| * | grobid citations: first pass at cleaning unstructured | Bryan Newbold | 2021-11-04 | 1 | -2/+34 | 
| | | |||||
| * | initial proposal for GROBID refs table and pipeline | Bryan Newbold | 2021-11-04 | 1 | -0/+63 | 
| | | |||||
| * | initial crossref-refs via GROBID helper routine | Bryan Newbold | 2021-11-04 | 7 | -6/+839 | 
| | | |||||
| * | pipenv: bump grobid_tei_xml version to 0.1.2 | Bryan Newbold | 2021-11-04 | 2 | -11/+11 | 
| | | |||||
| * | pdftrio client: use HTTP session for POSTs | Bryan Newbold | 2021-11-03 | 1 | -1/+1 | 
| | | |||||
| * | workers: use HTTP session for archive.org fetches | Bryan Newbold | 2021-11-03 | 1 | -3/+3 | 
| | | |||||
| * | IA (wayback): actually use an HTTP session for replay fetches | Bryan Newbold | 2021-11-03 | 1 | -2/+3 | 
| | | | | | | | | | I am embarrassed this wasn't actually the case already! Looks like I had even instantiated a session but wasn't using it. Hopefully this change, which adds extra retries and better backoff behavior, will improve sandcrawler ingest throughput. | ||||
| * | SPN reingest: 6 hour minimum, 6 month max | Bryan Newbold | 2021-11-03 | 1 | -2/+2 | 
| | | |||||
| * | sql: fix typo in quarterly (not weekly) script | Bryan Newbold | 2021-11-03 | 1 | -1/+1 | 
| | | |||||
| * | sql: fixes to ingest_fileset_platform schema (from table creation) | Bryan Newbold | 2021-11-01 | 2 | -12/+12 | 
| | | |||||
| * | updates/corrections to old small.json GROBID metadata example file | Bryan Newbold | 2021-10-27 | 1 | -6/+1 | 
| | | |||||
| * | remove grobid2json helper file, replace with grobid_tei_xml | Bryan Newbold | 2021-10-27 | 7 | -224/+22 | 
| | | |||||
| * | small type annotation things from additional packages | Bryan Newbold | 2021-10-27 | 2 | -5/+14 | 
| | | |||||
| * | toolchain config updates | Bryan Newbold | 2021-10-27 | 3 | -10/+6 | 
| | | |||||
| * | make fmt (black 21.9b0) | Bryan Newbold | 2021-10-27 | 57 | -3126/+3991 | 
| | | |||||
| * | pipenv: flipflop from yapf back to black; more type packages; bump grobid_tei_xml | Bryan Newbold | 2021-10-27 | 2 | -27/+112 | 
| | | |||||
| * | fileset: refactor out tables of helpers | Bryan Newbold | 2021-10-27 | 3 | -21/+19 | 
| | | | | | | | | Having these objects invoked in tables resulted in a whole bunch of objects (including children) getting initialized, which seems like the wrong thing to do. Defer this until the actual ingest fileset worker is initialized. | ||||
| * | gitlab-ci: copy env var in to place for tests | Bryan Newbold | 2021-10-27 | 1 | -0/+1 | 
| | | |||||
| * | fix type annotations for petabox body fetch helper | Bryan Newbold | 2021-10-26 | 5 | -8/+11 | 
| | | |||||
| * | small type annotation hack | Bryan Newbold | 2021-10-26 | 1 | -1/+1 | 
| | | |||||
| * | fileset: fix field renaming bug (caught by mypy) | Bryan Newbold | 2021-10-26 | 1 | -2/+2 | 
| | | |||||
| * | fileset ingest: fix table name typo (via mypy) | Bryan Newbold | 2021-10-26 | 1 | -1/+1 | 
| | | |||||
| * | update 'XXX' notes from fileset ingest development | Bryan Newbold | 2021-10-26 | 2 | -9/+6 | 
| | | |||||
| * | bugfix: setting html_biblio on ingest results | Bryan Newbold | 2021-10-26 | 2 | -2/+2 | 
| | | | | | This was caught during lint cleanup | ||||
| * | lint collection membership (last lint for now) | Bryan Newbold | 2021-10-26 | 7 | -32/+32 | 
| | | |||||
| * | commit updated flake8 lint configuration | Bryan Newbold | 2021-10-26 | 1 | -6/+10 | 
| | | |||||
| * | ingest fileset: fix silly import typo | Bryan Newbold | 2021-10-26 | 1 | -1/+1 | 
| | | |||||
| * | type annotations for persist workers; required some work | Bryan Newbold | 2021-10-26 | 1 | -66/+59 | 
| | | | | | | Had to re-structure and filter things a bit. Should be better behavior, but there might be some small changes. | ||||
| * | ingest file HTTP API: fixes from type checking | Bryan Newbold | 2021-10-26 | 1 | -3/+3 | 
| | | | | | | This code is deprecated and should be removed anyways, but still interesting to see the fixes | ||||
| * | more progress on type annotations | Bryan Newbold | 2021-10-26 | 8 | -34/+55 | 
| | | |||||
| * | grobid: fix a bug with consolidate_mode header, exposed by type annotations | Bryan Newbold | 2021-10-26 | 1 | -1/+2 | 
| | | |||||
| * | grobid: type annotations | Bryan Newbold | 2021-10-26 | 1 | -9/+19 | 
| | | |||||
| * | type annotations on SandcrawlerWorker | Bryan Newbold | 2021-10-26 | 1 | -46/+57 | 
| | | | | | | These annotations have a broad impact! Being conservative to start: Any-to-Any for process(), etc. | ||||
| * | more progress on type annotations and linting | Bryan Newbold | 2021-10-26 | 11 | -55/+87 | 
| | | |||||
| * | live tests: FTP wayback replay now returns 200, not 226 | Bryan Newbold | 2021-10-26 | 1 | -2/+2 | 
| | | |||||
| * | ia: more tweaks to delicate code to satisfy type checker | Bryan Newbold | 2021-10-26 | 1 | -10/+12 | 
| | | | | | | Ran the 'live' wayback tests after this commit as a check, and they worked (once the FTP status code behavior change is fixed) | ||||
| * | ia helpers: enforce max_redirects count correctly | Bryan Newbold | 2021-10-26 | 1 | -1/+1 | 
| | | | | | | AKA, should run fetch even if max_redirects = 0; the first loop iteration is not a redirect. | ||||
| * | set CDX request params as str, not int or datetime | Bryan Newbold | 2021-10-26 | 1 | -3/+6 | 
| | | | | | This might be a bugfix, changing CDX lookup behavior? | ||||
| * | bugfix: was setting 'from' parameter as a tuple, not a string | Bryan Newbold | 2021-10-26 | 1 | -1/+1 | 
| | | |||||
| * | start type annotating IA helper code | Bryan Newbold | 2021-10-26 | 1 | -37/+65 | 
| | | |||||
| * | start adding python type annotations to db and persist code | Bryan Newbold | 2021-10-26 | 2 | -97/+124 | 
| | | |||||
| * | Makefile: don't fail on isort error (consider these minor) | Bryan Newbold | 2021-10-26 | 1 | -1/+1 | 
| | | |||||
| * | tweak flake8 config | Bryan Newbold | 2021-10-26 | 1 | -2/+11 | 
| | | |||||
