Commit message | Author | Age | Files | Lines | |
---|---|---|---|---|---|
* | update pytest warning filters (they are pretty expansive) | Bryan Newbold | 2021-10-26 | 1 | -0/+3 |
| | |||||
* | ingest_html: update trafilatura TEI-XML output kwarg | Bryan Newbold | 2021-10-26 | 1 | -1/+1 |
| | |||||
* | python: isort all imports | Bryan Newbold | 2021-10-26 | 57 | -178/+207 |
| | |||||
* | add pyproject.toml (for isort and yapf config), and update 'lint' and 'fmt' make targets | Bryan Newbold | 2021-10-26 | 2 | -3/+13 |
| | |||||
* | pipenv: general update; add isort, yapf (over black), grobid_tei_xml | Bryan Newbold | 2021-10-26 | 2 | -730/+880 |
| | |||||
* | more small fileset ingest tweaks | Bryan Newbold | 2021-10-26 | 2 | -6/+21 |
| | |||||
* | python: more aggressive gitignore | Bryan Newbold | 2021-10-15 | 1 | -0/+3 |
| | |||||
* | persist support for ingest platform table, using existing persist worker | Bryan Newbold | 2021-10-15 | 2 | -2/+129 |
| | |||||
* | improve fileset ingest integration with file ingest | Bryan Newbold | 2021-10-15 | 4 | -5/+25 |
| | |||||
* | more fileset iteration | Bryan Newbold | 2021-10-15 | 5 | -45/+81 |
| | |||||
* | move SPNv2 'simple_get' logic to SPN client | Bryan Newbold | 2021-10-15 | 3 | -52/+31 |
| | |||||
* | filesets: iteration of implementation and docs | Bryan Newbold | 2021-10-15 | 4 | -82/+148 |
| | |||||
* | fileset ingest: improve platform parsing | Bryan Newbold | 2021-10-15 | 1 | -12/+196 |
| | |||||
* | fileset ingest: improve error handling | Bryan Newbold | 2021-10-15 | 4 | -48/+106 |
| | |||||
* | initial implementation of zenodo platform import | Bryan Newbold | 2021-10-15 | 1 | -0/+100 |
| | |||||
* | initial figshare platform helper | Bryan Newbold | 2021-10-15 | 1 | -0/+95 |
| | |||||
* | improvements to platform helpers | Bryan Newbold | 2021-10-15 | 3 | -34/+44 |
| | |||||
* | component ingest support for dataverse files (individual) | Bryan Newbold | 2021-10-15 | 2 | -13/+31 |
| | |||||
* | progress on web ingest strategy | Bryan Newbold | 2021-10-15 | 3 | -12/+121 |
| | |||||
* | fileset ingest progress for dataverse | Bryan Newbold | 2021-10-15 | 4 | -23/+291 |
| | |||||
* | local-file version of gen_file_metadata | Bryan Newbold | 2021-10-15 | 3 | -3/+56 |
| | |||||
* | progress on dataset ingest | Bryan Newbold | 2021-10-15 | 4 | -122/+333 |
| | |||||
* | ingest tool: always require ingest type as part of 'single' command | Bryan Newbold | 2021-10-15 | 1 | -3/+3 |
| | |||||
* | wrap up previous renaming work | Bryan Newbold | 2021-10-15 | 4 | -6/+4 |
| | |||||
* | progress on fileset/dataset ingest | Bryan Newbold | 2021-10-15 | 4 | -0/+403 |
| | |||||
* | scripts: example archiveorg-to-fileset importer | Bryan Newbold | 2021-10-15 | 1 | -0/+138 |
| | |||||
* | refactoring; progress on filesets | Bryan Newbold | 2021-10-15 | 3 | -9/+27 |
| | |||||
* | rename some python files for clarity | Bryan Newbold | 2021-10-15 | 3 | -0/+0 |
| | |||||
* | pdf ingest: journals.uchicago.edu pattern | Bryan Newbold | 2021-10-11 | 1 | -0/+8 |
| | |||||
* | spn: avoid 'None' job_id | Bryan Newbold | 2021-10-11 | 1 | -2/+2 |
| | Thanks Vanglis for reporting these. Not sure this commit fixes *all* instances of the problem. | ||||
* | cdx_collection.py: minor lint issue | Bryan Newbold | 2021-10-04 | 1 | -1/+1 |
| | |||||
* | ingest: basic 'component' and 'src' support | Bryan Newbold | 2021-10-04 | 2 | -20/+84 |
| | |||||
* | html ingest: report dt with broken CDX records | Bryan Newbold | 2021-10-04 | 1 | -1/+1 |
| | |||||
* | allow through unknown-scope HTML ingests, for possible SPN import | Bryan Newbold | 2021-10-01 | 1 | -11/+5 |
| | |||||
* | html: fix logging of broken CDX URL | Bryan Newbold | 2021-10-01 | 1 | -1/+1 |
| | |||||
* | ingest CDX lookup: weigh year+month of capture against in-petabox-or-not | Bryan Newbold | 2021-09-30 | 1 | -0/+1 |
| | This is to try working around an issue where ingests fail because an SPN capture is much newer, but the old sorting preference ignored that. Note that the sorting logic is pretty busted anyways, and we should probably allow returning multiple matching files to try. | ||||
* | fix typo with spn_cdx_retry_sec arg | Bryan Newbold | 2021-09-30 | 1 | -1/+1 |
| | |||||
* | tune SPN CDX retry/wait depending on mode (priority vs daily) | Bryan Newbold | 2021-09-30 | 3 | -3/+9 |
| | |||||
* | yet another bad PDF sha1 | Bryan Newbold | 2021-09-30 | 1 | -0/+1 |
| | |||||
* | new 'daily' and 'priority' ingest request topics | Bryan Newbold | 2021-09-30 | 1 | -1/+7 |
| | The old ingest request queue was always getting lopsided, suspect because it was scaled up (additional partitions) at some point in the past, hoping new topics will fix this. New '-priority' queue is like '-bulk', but for smaller-volume SPN-like requests. Eg, interactive mode. | ||||
* | old HTML extractors: handle null tag | Bryan Newbold | 2021-09-08 | 1 | -8/+9 |
| | |||||
* | ingest: more block patterns, for huge databases | Bryan Newbold | 2021-09-08 | 1 | -1/+4 |
| | |||||
* | yet more PDF sha1 to skip | Bryan Newbold | 2021-09-03 | 1 | -0/+5 |
| | |||||
* | yet more PDF URL patterns | Bryan Newbold | 2021-09-03 | 1 | -0/+48 |
| | |||||
* | ingest: check URL blocklist again after redirects | Bryan Newbold | 2021-09-03 | 1 | -0/+7 |
| | |||||
* | refactor and expand wall/block/cookie URL patterns | Bryan Newbold | 2021-09-03 | 2 | -6/+39 |
| | |||||
* | HTML ingest: several more PDF fulltext URL patterns | Bryan Newbold | 2021-09-03 | 1 | -0/+87 |
| | |||||
* | HTML ingest: skip noisy print() statement | Bryan Newbold | 2021-09-03 | 1 | -1/+1 |
| | |||||
* | HTML ingest: more meta-URI prefixes | Bryan Newbold | 2021-08-24 | 1 | -2/+8 |
| | |||||
* | html ingest: detect some blog platforms, and allow lower wordcount threshold | Bryan Newbold | 2021-08-16 | 1 | -0/+6 |
| |