| | Commit message | Author | Age | Files | Lines |
---|---|---|---|---|---|
* | more progress on type annotations | Bryan Newbold | 2021-10-26 | 8 | -34/+55 |
| | |||||
* | grobid: fix a bug with consolidate_mode header, exposed by type annotations | Bryan Newbold | 2021-10-26 | 1 | -1/+2 |
| | |||||
* | grobid: type annotations | Bryan Newbold | 2021-10-26 | 1 | -9/+19 |
| | |||||
* | type annotations on SandcrawlerWorker | Bryan Newbold | 2021-10-26 | 1 | -46/+57 |
| | | | | | These annotations have a broad impact! Being conservative to start: Any-to-Any for process(), etc. | ||||
* | more progress on type annotations and linting | Bryan Newbold | 2021-10-26 | 8 | -49/+80 |
| | |||||
* | ia: more tweaks to delicate code to satisfy type checker | Bryan Newbold | 2021-10-26 | 1 | -10/+12 |
| | | | | | Ran the 'live' wayback tests after this commit as a check, and they worked (once the FTP status code behavior change is fixed) | ||||
* | ia helpers: enforce max_redirects count correctly | Bryan Newbold | 2021-10-26 | 1 | -1/+1 |
| | | | | | AKA, should run fetch even if max_redirects = 0; the first loop iteration is not a redirect. | ||||
* | ensure CDX request params are str, not int or datetime | Bryan Newbold | 2021-10-26 | 1 | -3/+6 |
| | | | | This might be a bugfix, changing CDX lookup behavior? | ||||
* | bugfix: was setting 'from' parameter as a tuple, not a string | Bryan Newbold | 2021-10-26 | 1 | -1/+1 |
| | |||||
* | start type annotating IA helper code | Bryan Newbold | 2021-10-26 | 1 | -37/+65 |
| | |||||
* | start adding python type annotations to db and persist code | Bryan Newbold | 2021-10-26 | 2 | -97/+124 |
| | |||||
* | flake8 clean (with current settings) | Bryan Newbold | 2021-10-26 | 7 | -24/+22 |
| | |||||
* | start handling trivial lint cleanups: unused imports, 'is None', etc | Bryan Newbold | 2021-10-26 | 15 | -97/+57 |
| | |||||
* | make fmt | Bryan Newbold | 2021-10-26 | 19 | -571/+741 |
| | |||||
* | ingest_html: update trafilatura TEI-XML output kwarg | Bryan Newbold | 2021-10-26 | 1 | -1/+1 |
| | |||||
* | python: isort all imports | Bryan Newbold | 2021-10-26 | 18 | -99/+108 |
| | |||||
* | more small fileset ingest tweaks | Bryan Newbold | 2021-10-26 | 2 | -6/+21 |
| | |||||
* | persist support for ingest platform table, using existing persist worker | Bryan Newbold | 2021-10-15 | 2 | -2/+129 |
| | |||||
* | improve fileset ingest integration with file ingest | Bryan Newbold | 2021-10-15 | 3 | -5/+24 |
| | |||||
* | more fileset iteration | Bryan Newbold | 2021-10-15 | 4 | -45/+80 |
| | |||||
* | move SPNv2 'simple_get' logic to SPN client | Bryan Newbold | 2021-10-15 | 3 | -52/+31 |
| | |||||
* | filesets: iteration of implementation and docs | Bryan Newbold | 2021-10-15 | 4 | -82/+148 |
| | |||||
* | fileset ingest: improve platform parsing | Bryan Newbold | 2021-10-15 | 1 | -12/+196 |
| | |||||
* | fileset ingest: improve error handling | Bryan Newbold | 2021-10-15 | 4 | -48/+106 |
| | |||||
* | initial implementation of zenodo platform import | Bryan Newbold | 2021-10-15 | 1 | -0/+100 |
| | |||||
* | initial figshare platform helper | Bryan Newbold | 2021-10-15 | 1 | -0/+95 |
| | |||||
* | improvements to platform helpers | Bryan Newbold | 2021-10-15 | 3 | -34/+44 |
| | |||||
* | component ingest support for dataverse files (individual) | Bryan Newbold | 2021-10-15 | 2 | -13/+31 |
| | |||||
* | progress on web ingest strategy | Bryan Newbold | 2021-10-15 | 3 | -12/+121 |
| | |||||
* | fileset ingest progress for dataverse | Bryan Newbold | 2021-10-15 | 4 | -23/+291 |
| | |||||
* | local-file version of gen_file_metadata | Bryan Newbold | 2021-10-15 | 2 | -2/+43 |
| | |||||
* | progress on dataset ingest | Bryan Newbold | 2021-10-15 | 4 | -122/+333 |
| | |||||
* | wrap up previous renaming work | Bryan Newbold | 2021-10-15 | 3 | -5/+3 |
| | |||||
* | progress on fileset/dataset ingest | Bryan Newbold | 2021-10-15 | 4 | -0/+403 |
| | |||||
* | refactoring; progress on filesets | Bryan Newbold | 2021-10-15 | 2 | -1/+7 |
| | |||||
* | rename some python files for clarity | Bryan Newbold | 2021-10-15 | 2 | -0/+0 |
| | |||||
* | pdf ingest: journals.uchicago.edu pattern | Bryan Newbold | 2021-10-11 | 1 | -0/+8 |
| | |||||
* | spn: avoid 'None' job_id | Bryan Newbold | 2021-10-11 | 1 | -2/+2 |
| | | | | | | Thanks Vanglis for reporting these. Not sure this commit fixes *all* instances of the problem. | ||||
* | ingest: basic 'component' and 'src' support | Bryan Newbold | 2021-10-04 | 2 | -20/+84 |
| | |||||
* | html ingest: report dt with broken CDX records | Bryan Newbold | 2021-10-04 | 1 | -1/+1 |
| | |||||
* | allow through unknown-scope HTML ingests, for possible SPN import | Bryan Newbold | 2021-10-01 | 1 | -11/+5 |
| | |||||
* | html: fix logging of broken CDX URL | Bryan Newbold | 2021-10-01 | 1 | -1/+1 |
| | |||||
* | ingest CDX lookup: weigh year+month of capture against in-petabox-or-not | Bryan Newbold | 2021-09-30 | 1 | -0/+1 |
| | | | | | | | | This is to try working around an issue where ingests fail because an SPN capture is much newer, but the old sorting preference ignored that. Note that the sorting logic is pretty busted anyways, and we should probably allow returning multiple matching files to try. | ||||
* | fix typo with spn_cdx_retry_sec arg | Bryan Newbold | 2021-09-30 | 1 | -1/+1 |
| | |||||
* | tune SPN CDX retry/wait depending on mode (priority vs daily) | Bryan Newbold | 2021-09-30 | 2 | -3/+5 |
| | |||||
* | yet another bad PDF sha1 | Bryan Newbold | 2021-09-30 | 1 | -0/+1 |
| | |||||
* | old HTML extractors: handle null tag | Bryan Newbold | 2021-09-08 | 1 | -8/+9 |
| | |||||
* | ingest: more block patterns, for huge databases | Bryan Newbold | 2021-09-08 | 1 | -1/+4 |
| | |||||
* | yet more PDF sha1 to skip | Bryan Newbold | 2021-09-03 | 1 | -0/+5 |
| | |||||
* | yet more PDF URL patterns | Bryan Newbold | 2021-09-03 | 1 | -0/+48 |
| |
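
The "ia helpers: enforce max_redirects count correctly" entry above describes an off-by-one in a redirect loop: the first fetch is not itself a redirect, so a fetch should still happen when `max_redirects = 0`. A minimal sketch of that idea follows. It is not the project's actual code; `follow_redirects`, `fake_fetch`, `TooManyRedirects`, and the `(status, location)` return shape are all hypothetical names chosen for illustration.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical fetch signature: returns (status_code, redirect_target or None).
FetchFn = Callable[[str], Tuple[int, Optional[str]]]


class TooManyRedirects(Exception):
    """Raised when more than max_redirects hops would be needed."""


def follow_redirects(
    url: str, fetch: FetchFn, max_redirects: int = 25
) -> Tuple[str, int, List[str]]:
    """Fetch url, following at most max_redirects redirect hops."""
    visited: List[str] = []
    # max_redirects + 1 iterations: the first fetch is not a redirect,
    # so max_redirects = 0 still performs exactly one fetch.
    for _ in range(max_redirects + 1):
        visited.append(url)
        status, location = fetch(url)
        if location is None:
            return url, status, visited
        url = location
    raise TooManyRedirects(f"gave up after {max_redirects} redirects")


def fake_fetch(url: str) -> Tuple[int, Optional[str]]:
    """Stand-in for a real HTTP or wayback fetch (illustration only)."""
    hops = {"http://a.example/": "http://b.example/"}
    if url in hops:
        return 302, hops[url]
    return 200, None
```

With this shape, `follow_redirects("http://b.example/", fake_fetch, max_redirects=0)` still performs one fetch and returns, while a URL that redirects once needs `max_redirects >= 1`.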