Commit message | Author | Age | Files | Lines | |
---|---|---|---|---|---|
* | improve fileset ingest integration with file ingest | Bryan Newbold | 2021-10-15 | 4 | -5/+25 |
| | |||||
* | more fileset iteration | Bryan Newbold | 2021-10-15 | 5 | -45/+81 |
| | |||||
* | move SPNv2 'simple_get' logic to SPN client | Bryan Newbold | 2021-10-15 | 3 | -52/+31 |
| | |||||
* | filesets: iteration of implementation and docs | Bryan Newbold | 2021-10-15 | 5 | -96/+167 |
| | |||||
* | updates to fileset ingest proposal | Bryan Newbold | 2021-10-15 | 2 | -239/+337 |
| | |||||
* | fileset ingest notes | Bryan Newbold | 2021-10-15 | 1 | -3/+23 |
| | |||||
* | fileset ingest: improve platform parsing | Bryan Newbold | 2021-10-15 | 1 | -12/+196 |
| | |||||
* | fileset ingest: improve error handling | Bryan Newbold | 2021-10-15 | 4 | -48/+106 |
| | |||||
* | initial implementation of zenodo platform import | Bryan Newbold | 2021-10-15 | 1 | -0/+100 |
| | |||||
* | initial figshare platform helper | Bryan Newbold | 2021-10-15 | 1 | -0/+95 |
| | |||||
* | improvements to platform helpers | Bryan Newbold | 2021-10-15 | 3 | -34/+44 |
| | |||||
* | component ingest support for dataverse files (individual) | Bryan Newbold | 2021-10-15 | 2 | -13/+31 |
| | |||||
* | progress on web ingest strategy | Bryan Newbold | 2021-10-15 | 3 | -12/+121 |
| | |||||
* | fileset ingest progress for dataverse | Bryan Newbold | 2021-10-15 | 4 | -23/+291 |
| | |||||
* | local-file version of gen_file_metadata | Bryan Newbold | 2021-10-15 | 3 | -3/+56 |
| | |||||
* | progress on dataset ingest | Bryan Newbold | 2021-10-15 | 4 | -122/+333 |
| | |||||
* | dataset ingest: start enumerating examples | Bryan Newbold | 2021-10-15 | 1 | -0/+34 |
| | |||||
* | ingest tool: always require ingest type as part of 'single' command | Bryan Newbold | 2021-10-15 | 1 | -3/+3 |
| | |||||
* | wrap up previous renaming work | Bryan Newbold | 2021-10-15 | 4 | -6/+4 |
| | |||||
* | progress on fileset/dataset ingest | Bryan Newbold | 2021-10-15 | 4 | -0/+403 |
| | |||||
* | scripts: example archiveorg-to-fileset importer | Bryan Newbold | 2021-10-15 | 1 | -0/+138 |
| | |||||
* | initial dataset/fileset ingest proposal | Bryan Newbold | 2021-10-15 | 1 | -0/+185 |
| | |||||
* | sql: initial ingest fileset table | Bryan Newbold | 2021-10-15 | 1 | -0/+38 |
| | |||||
* | sql: fix typo in CHECK statement | Bryan Newbold | 2021-10-15 | 1 | -1/+1 |
| | |||||
* | refactoring; progress on filesets | Bryan Newbold | 2021-10-15 | 3 | -9/+27 |
| | |||||
* | rename some python files for clarity | Bryan Newbold | 2021-10-15 | 3 | -0/+0 |
| | |||||
* | pdf ingest: journals.uchicago.edu pattern | Bryan Newbold | 2021-10-11 | 1 | -0/+8 |
| | |||||
* | spn: avoid 'None' job_id | Bryan Newbold | 2021-10-11 | 1 | -2/+2 |
| | Thanks Vanglis for reporting these. Not sure this commit fixes *all* instances of the problem. |||||
* | Merge branch 'bnewbold-backfill' into 'master' | bnewbold | 2021-10-04 | 3 | -0/+384 |
|\ | CDX Backfill (scalding version). See merge request webgroup/sandcrawler!12 |||||
| * | temporary please option for scala backfill | Bryan Newbold | 2018-07-24 | 1 | -0/+22 |
| | | |||||
| * | small CdxBackfillJob refactor (code quality) | Bryan Newbold | 2018-07-24 | 1 | -5/+5 |
| | | |||||
| * | do sha1 pattern match correctly | Bryan Newbold | 2018-07-24 | 2 | -3/+18 |
| | | |||||
| * | more PDF mimetypes; fix return refactor | Bryan Newbold | 2018-07-24 | 1 | -2/+5 |
| | | |||||
| * | CdxBackfillJob: comment cleanup | Bryan Newbold | 2018-07-24 | 1 | -6/+0 |
| | | |||||
| * | CdxBackfillJob: scalastyle | Bryan Newbold | 2018-07-24 | 1 | -22/+14 |
| | | |||||
| * | address some (but not all) review comments | Bryan Newbold | 2018-07-24 | 1 | -20/+21 |
| | | |||||
| * | reference TDsl note in docs | Bryan Newbold | 2018-07-24 | 1 | -0/+16 |
| | | |||||
| * | fix CdxBackfillJob tests | Bryan Newbold | 2018-07-24 | 2 | -6/+13 |
| | | |||||
| * | some scalastyle on CdxBackfillJob | Bryan Newbold | 2018-07-24 | 1 | -7/+8 |
| | | |||||
| * | CdxBackfillJob: implement other fields | Bryan Newbold | 2018-07-24 | 2 | -19/+84 |
| | | |||||
| * | CdxBackfillJob back to HBase; tests work | Bryan Newbold | 2018-07-24 | 2 | -15/+13 |
| | | |||||
| * | variant of CdxBackfillJob that writes to TSV | Bryan Newbold | 2018-07-24 | 2 | -0/+286 |
| | | Has the same test failure ("java.lang.IndexOutOfBoundsException: Index: 1, Size: 1") |||||
* | | cdx_collection.py: minor lint issue | Bryan Newbold | 2021-10-04 | 1 | -1/+1 |
| | | |||||
* | | ingest: basic 'component' and 'src' support | Bryan Newbold | 2021-10-04 | 4 | -20/+251 |
| | | |||||
* | | old (2020) notes on pdfextract cleanup | Bryan Newbold | 2021-10-04 | 1 | -0/+74 |
| | | |||||
* | | notes on dumping PDF URL lists for partners | Bryan Newbold | 2021-10-04 | 1 | -0/+66 |
| | | |||||
* | | new SQL recent SPN request monitoring query | Bryan Newbold | 2021-10-04 | 1 | -0/+32 |
| | | |||||
* | | html ingest: report dt with broken CDX records | Bryan Newbold | 2021-10-04 | 1 | -1/+1 |
| | | |||||
* | | allow through unknown-scope HTML ingests, for possible SPN import | Bryan Newbold | 2021-10-01 | 1 | -11/+5 |
| | | |||||
* | | html: fix logging of broken CDX URL | Bryan Newbold | 2021-10-01 | 1 | -1/+1 |
| | |