path: root/python/sandcrawler
Commit message | Author | Age | Files | Lines
...
* fileset: refactor out tables of helpers | Bryan Newbold | 2021-10-27 | 3 | -21/+19
    Having these objects invoked in tables resulted in a whole bunch of objects (including children) getting initialized, which seems like the wrong thing to do. Defer this until the actual ingest fileset worker is initialized.
* fix type annotations for petabox body fetch helper | Bryan Newbold | 2021-10-26 | 5 | -8/+11
* small type annotation hack | Bryan Newbold | 2021-10-26 | 1 | -1/+1
* fileset: fix field renaming bug (caught by mypy) | Bryan Newbold | 2021-10-26 | 1 | -2/+2
* fileset ingest: fix table name typo (via mypy) | Bryan Newbold | 2021-10-26 | 1 | -1/+1
* update 'XXX' notes from fileset ingest development | Bryan Newbold | 2021-10-26 | 2 | -9/+6
* bugfix: setting html_biblio on ingest results | Bryan Newbold | 2021-10-26 | 2 | -2/+2
    This was caught during lint cleanup
* lint collection membership (last lint for now) | Bryan Newbold | 2021-10-26 | 7 | -32/+32
* ingest fileset: fix silly import typo | Bryan Newbold | 2021-10-26 | 1 | -1/+1
* type annotations for persist workers; required some work | Bryan Newbold | 2021-10-26 | 1 | -66/+59
    Had to re-structure and filter things a bit. Should be better behavior, but there might be some small changes.
* ingest file HTTP API: fixes from type checking | Bryan Newbold | 2021-10-26 | 1 | -3/+3
    This code is deprecated and should be removed anyways, but still interesting to see the fixes
* more progress on type annotations | Bryan Newbold | 2021-10-26 | 8 | -34/+55
* grobid: fix a bug with consolidate_mode header, exposed by type annotations | Bryan Newbold | 2021-10-26 | 1 | -1/+2
* grobid: type annotations | Bryan Newbold | 2021-10-26 | 1 | -9/+19
* type annotations on SandcrawlerWorker | Bryan Newbold | 2021-10-26 | 1 | -46/+57
    These annotations have a broad impact! Being conservative to start: Any-to-Any for process(), etc.
* more progress on type annotations and linting | Bryan Newbold | 2021-10-26 | 8 | -49/+80
* ia: more tweaks to delicate code to satisfy type checker | Bryan Newbold | 2021-10-26 | 1 | -10/+12
    Ran the 'live' wayback tests after this commit as a check, and worked (once FTP status code behavior change is fixed)
* ia helpers: enforce max_redirects count correctly | Bryan Newbold | 2021-10-26 | 1 | -1/+1
    AKA, should run fetch even if max_redirects = 0; the first loop iteration is not a redirect.
* set CDX request params are str, not int or datetime | Bryan Newbold | 2021-10-26 | 1 | -3/+6
    This might be a bugfix, changing CDX lookup behavior?
* bugfix: was setting 'from' parameter as a tuple, not a string | Bryan Newbold | 2021-10-26 | 1 | -1/+1
* start type annotating IA helper code | Bryan Newbold | 2021-10-26 | 1 | -37/+65
* start adding python type annotations to db and persist code | Bryan Newbold | 2021-10-26 | 2 | -97/+124
* flake8 clean (with current settings) | Bryan Newbold | 2021-10-26 | 7 | -24/+22
* start handling trivial lint cleanups: unused imports, 'is None', etc | Bryan Newbold | 2021-10-26 | 15 | -97/+57
* make fmt | Bryan Newbold | 2021-10-26 | 19 | -571/+741
* ingest_html: update trafilatura TEI-XML output kwarg | Bryan Newbold | 2021-10-26 | 1 | -1/+1
* python: isort all imports | Bryan Newbold | 2021-10-26 | 18 | -99/+108
* more small fileset ingest tweaks | Bryan Newbold | 2021-10-26 | 2 | -6/+21
* persist support for ingest platform table, using existing persist worker | Bryan Newbold | 2021-10-15 | 2 | -2/+129
* improve fileset ingest integration with file ingest | Bryan Newbold | 2021-10-15 | 3 | -5/+24
* more fileset iteration | Bryan Newbold | 2021-10-15 | 4 | -45/+80
* move SPNv2 'simple_get' logic to SPN client | Bryan Newbold | 2021-10-15 | 3 | -52/+31
* filesets: iteration of implementation and docs | Bryan Newbold | 2021-10-15 | 4 | -82/+148
* fileset ingest: improve platform parsing | Bryan Newbold | 2021-10-15 | 1 | -12/+196
* fileset ingest: improve error handling | Bryan Newbold | 2021-10-15 | 4 | -48/+106
* initial implementation of zenodo platform import | Bryan Newbold | 2021-10-15 | 1 | -0/+100
* initial figshare platform helper | Bryan Newbold | 2021-10-15 | 1 | -0/+95
* improvements to platform helpers | Bryan Newbold | 2021-10-15 | 3 | -34/+44
* component ingest support for dataverse files (individual) | Bryan Newbold | 2021-10-15 | 2 | -13/+31
* progress on web ingest strategy | Bryan Newbold | 2021-10-15 | 3 | -12/+121
* fileset ingest progress for dataverse | Bryan Newbold | 2021-10-15 | 4 | -23/+291
* local-file version of gen_file_metadata | Bryan Newbold | 2021-10-15 | 2 | -2/+43
* progress on dataset ingest | Bryan Newbold | 2021-10-15 | 4 | -122/+333
* wrap up previous renaming work | Bryan Newbold | 2021-10-15 | 3 | -5/+3
* progress on fileset/dataset ingest | Bryan Newbold | 2021-10-15 | 4 | -0/+403
* refactoring; progress on filesets | Bryan Newbold | 2021-10-15 | 2 | -1/+7
* rename some python files for clarity | Bryan Newbold | 2021-10-15 | 2 | -0/+0
* pdf ingest: journals.uchicago.edu pattern | Bryan Newbold | 2021-10-11 | 1 | -0/+8
* spn: avoid 'None' job_id | Bryan Newbold | 2021-10-11 | 1 | -2/+2
    Thanks Vanglis for reporting these. Not sure this commit fixes *all* instances of the problem.
* ingest: basic 'component' and 'src' support | Bryan Newbold | 2021-10-04 | 2 | -20/+84
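
Note on the "ia helpers: enforce max_redirects count correctly" entry above: the commit message's reasoning (the first loop iteration is a plain fetch, not a redirect, so a fetch should still happen when max_redirects = 0) can be illustrated with a minimal sketch. This is not the sandcrawler code; `follow_redirects`, `fetch`, and `HOPS` are hypothetical names made up for illustration, and the real change was a one-line fix that is not shown in this log.

```python
from typing import Callable, Dict, List, Optional


def follow_redirects(
    start_url: str,
    fetch: Callable[[str], Optional[str]],
    max_redirects: int,
) -> List[str]:
    """Hypothetical helper: fetch start_url, then follow at most max_redirects hops.

    `fetch` (an assumption for this sketch) returns the redirect target URL,
    or None when the response is terminal.
    """
    visited: List[str] = []
    url = start_url
    # max_redirects + 1 iterations: the first fetch is not a redirect,
    # so it must run even when max_redirects == 0.
    for _ in range(max_redirects + 1):
        visited.append(url)
        next_url = fetch(url)
        if next_url is None:
            return visited
        url = next_url
    raise RuntimeError(f"exceeded max_redirects={max_redirects}")


# Toy redirect table standing in for real HTTP lookups (illustrative only).
HOPS: Dict[str, Optional[str]] = {
    "http://a.example/": "http://b.example/",
    "http://b.example/": None,
}

if __name__ == "__main__":
    # One redirect allowed: the initial fetch plus one hop.
    print(follow_redirects("http://a.example/", HOPS.get, max_redirects=1))
```

Looping `range(max_redirects)` instead would skip the fetch entirely for `max_redirects = 0`, which is the off-by-one the commit message describes.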