Commit message  Author  Age  Files  Lines
* improve fileset ingest integration with file ingest  Bryan Newbold  2021-10-15  4  -5/+25
|
* more fileset iteration  Bryan Newbold  2021-10-15  5  -45/+81
|
* move SPNv2 'simple_get' logic to SPN client  Bryan Newbold  2021-10-15  3  -52/+31
|
* filesets: iteration of implementation and docs  Bryan Newbold  2021-10-15  5  -96/+167
|
* updates to fileset ingest proposal  Bryan Newbold  2021-10-15  2  -239/+337
|
* fileset ingest notes  Bryan Newbold  2021-10-15  1  -3/+23
|
* fileset ingest: improve platform parsing  Bryan Newbold  2021-10-15  1  -12/+196
|
* fileset ingest: improve error handling  Bryan Newbold  2021-10-15  4  -48/+106
|
* initial implementation of zenodo platform import  Bryan Newbold  2021-10-15  1  -0/+100
|
* initial figshare platform helper  Bryan Newbold  2021-10-15  1  -0/+95
|
* improvements to platform helpers  Bryan Newbold  2021-10-15  3  -34/+44
|
* component ingest support for dataverse files (individual)  Bryan Newbold  2021-10-15  2  -13/+31
|
* progress on web ingest strategy  Bryan Newbold  2021-10-15  3  -12/+121
|
* fileset ingest progress for dataverse  Bryan Newbold  2021-10-15  4  -23/+291
|
* local-file version of gen_file_metadata  Bryan Newbold  2021-10-15  3  -3/+56
|
* progress on dataset ingest  Bryan Newbold  2021-10-15  4  -122/+333
|
* dataset ingest: start enumerating examples  Bryan Newbold  2021-10-15  1  -0/+34
|
* ingest tool: always require ingest type as part of 'single' command  Bryan Newbold  2021-10-15  1  -3/+3
|
* wrap up previous renaming work  Bryan Newbold  2021-10-15  4  -6/+4
|
* progress on fileset/dataset ingest  Bryan Newbold  2021-10-15  4  -0/+403
|
* scripts: example archiveorg-to-fileset importer  Bryan Newbold  2021-10-15  1  -0/+138
|
* initial dataset/fileset ingest proposal  Bryan Newbold  2021-10-15  1  -0/+185
|
* sql: initial ingest fileset table  Bryan Newbold  2021-10-15  1  -0/+38
|
* sql: fix typo in CHECK statement  Bryan Newbold  2021-10-15  1  -1/+1
|
* refactoring; progress on filesets  Bryan Newbold  2021-10-15  3  -9/+27
|
* rename some python files for clarity  Bryan Newbold  2021-10-15  3  -0/+0
|
* pdf ingest: journals.uchicago.edu pattern  Bryan Newbold  2021-10-11  1  -0/+8
|
* spn: avoid 'None' job_id  Bryan Newbold  2021-10-11  1  -2/+2
|     Thanks Vanglis for reporting these. Not sure this commit fixes *all* instances of the problem.
* Merge branch 'bnewbold-backfill' into 'master'  bnewbold  2021-10-04  3  -0/+384
|\
| |     CDX Backfill (scalding version)
| |     See merge request webgroup/sandcrawler!12
| * temporary please option for scala backfill  Bryan Newbold  2018-07-24  1  -0/+22
| |
| * small CdxBackfillJob refactor (code quality)  Bryan Newbold  2018-07-24  1  -5/+5
| |
| * do sha1 pattern match correctly  Bryan Newbold  2018-07-24  2  -3/+18
| |
| * more PDF mimetypes; fix return refactor  Bryan Newbold  2018-07-24  1  -2/+5
| |
| * CdxBackfillJob: comment cleanup  Bryan Newbold  2018-07-24  1  -6/+0
| |
| * CdxBackfillJob: scalastyle  Bryan Newbold  2018-07-24  1  -22/+14
| |
| * address some (but not all) review comments  Bryan Newbold  2018-07-24  1  -20/+21
| |
| * reference TDsl note in docs  Bryan Newbold  2018-07-24  1  -0/+16
| |
| * fix CdxBackfillJob tests  Bryan Newbold  2018-07-24  2  -6/+13
| |
| * some scalastyle on CdxBackfillJob  Bryan Newbold  2018-07-24  1  -7/+8
| |
| * CdxBackfillJob: implement other fields  Bryan Newbold  2018-07-24  2  -19/+84
| |
| * CdxBackfillJob back to HBase; tests work  Bryan Newbold  2018-07-24  2  -15/+13
| |
| * variant of CdxBackfillJob that writes to TSV  Bryan Newbold  2018-07-24  2  -0/+286
| |     Has the same test failure ("java.lang.IndexOutOfBoundsException: Index: 1, Size: 1")
* | cdx_collection.py: minor lint issue  Bryan Newbold  2021-10-04  1  -1/+1
| |
* | ingest: basic 'component' and 'src' support  Bryan Newbold  2021-10-04  4  -20/+251
| |
* | old (2020) notes on pdfextract cleanup  Bryan Newbold  2021-10-04  1  -0/+74
| |
* | notes on dumping PDF URL lists for partners  Bryan Newbold  2021-10-04  1  -0/+66
| |
* | new SQL recent SPN request monitoring query  Bryan Newbold  2021-10-04  1  -0/+32
| |
* | html ingest: report dt with broken CDX records  Bryan Newbold  2021-10-04  1  -1/+1
| |
* | allow through unknown-scope HTML ingests, for possible SPN import  Bryan Newbold  2021-10-01  1  -11/+5
| |
* | html: fix logging of broken CDX URL  Bryan Newbold  2021-10-01  1  -1/+1
| |