path: root/python/sandcrawler
Commit message | Author | Age | Files | Lines
* crossref persist: make GROBID ref parsing an option (not default) | Bryan Newbold | 2021-11-04 | 1 | -7/+16
* glue, utils, and worker code for crossref and grobid_refs | Bryan Newbold | 2021-11-04 | 2 | -3/+151
* iterated GROBID citation cleaning and processing | Bryan Newbold | 2021-11-04 | 1 | -27/+45
* grobid citations: first pass at cleaning unstructured | Bryan Newbold | 2021-11-04 | 1 | -2/+34
* initial crossref-refs via GROBID helper routine | Bryan Newbold | 2021-11-04 | 1 | -4/+121
* pdftrio client: use HTTP session for POSTs | Bryan Newbold | 2021-11-03 | 1 | -1/+1
* workers: use HTTP session for archive.org fetches | Bryan Newbold | 2021-11-03 | 1 | -3/+3
* IA (wayback): actually use an HTTP session for replay fetches | Bryan Newbold | 2021-11-03 | 1 | -2/+3
* remove grobid2json helper file, replace with grobid_tei_xml | Bryan Newbold | 2021-10-27 | 2 | -4/+5
* small type annotation things from additional packages | Bryan Newbold | 2021-10-27 | 2 | -5/+14
* make fmt (black 21.9b0) | Bryan Newbold | 2021-10-27 | 18 | -1840/+2332
* fileset: refactor out tables of helpers | Bryan Newbold | 2021-10-27 | 3 | -21/+19
* fix type annotations for petabox body fetch helper | Bryan Newbold | 2021-10-26 | 5 | -8/+11
* small type annotation hack | Bryan Newbold | 2021-10-26 | 1 | -1/+1
* fileset: fix field renaming bug (caught by mypy) | Bryan Newbold | 2021-10-26 | 1 | -2/+2
* fileset ingest: fix table name typo (via mypy) | Bryan Newbold | 2021-10-26 | 1 | -1/+1
* update 'XXX' notes from fileset ingest development | Bryan Newbold | 2021-10-26 | 2 | -9/+6
* bugfix: setting html_biblio on ingest results | Bryan Newbold | 2021-10-26 | 2 | -2/+2
* lint collection membership (last lint for now) | Bryan Newbold | 2021-10-26 | 7 | -32/+32
* ingest fileset: fix silly import typo | Bryan Newbold | 2021-10-26 | 1 | -1/+1
* type annotations for persist workers; required some work | Bryan Newbold | 2021-10-26 | 1 | -66/+59
* ingest file HTTP API: fixes from type checking | Bryan Newbold | 2021-10-26 | 1 | -3/+3
* more progress on type annotations | Bryan Newbold | 2021-10-26 | 8 | -34/+55
* grobid: fix a bug with consolidate_mode header, exposed by type annotations | Bryan Newbold | 2021-10-26 | 1 | -1/+2
* grobid: type annotations | Bryan Newbold | 2021-10-26 | 1 | -9/+19
* type annotations on SandcrawlerWorker | Bryan Newbold | 2021-10-26 | 1 | -46/+57
* more progress on type annotations and linting | Bryan Newbold | 2021-10-26 | 8 | -49/+80
* ia: more tweaks to delicate code to satisfy type checker | Bryan Newbold | 2021-10-26 | 1 | -10/+12
* ia helpers: enforce max_redirects count correctly | Bryan Newbold | 2021-10-26 | 1 | -1/+1
* set CDX request params are str, not int or datetime | Bryan Newbold | 2021-10-26 | 1 | -3/+6
* bugfix: was setting 'from' parameter as a tuple, not a string | Bryan Newbold | 2021-10-26 | 1 | -1/+1
* start type annotating IA helper code | Bryan Newbold | 2021-10-26 | 1 | -37/+65
* start adding python type annotations to db and persist code | Bryan Newbold | 2021-10-26 | 2 | -97/+124
* flake8 clean (with current settings) | Bryan Newbold | 2021-10-26 | 7 | -24/+22
* start handling trivial lint cleanups: unused imports, 'is None', etc | Bryan Newbold | 2021-10-26 | 15 | -97/+57
* make fmt | Bryan Newbold | 2021-10-26 | 19 | -571/+741
* ingest_html: update trafilatura TEI-XML output kwarg | Bryan Newbold | 2021-10-26 | 1 | -1/+1
* python: isort all imports | Bryan Newbold | 2021-10-26 | 18 | -99/+108
* more small fileset ingest tweaks | Bryan Newbold | 2021-10-26 | 2 | -6/+21
* persist support for ingest platform table, using existing persist worker | Bryan Newbold | 2021-10-15 | 2 | -2/+129
* improve fileset ingest integration with file ingest | Bryan Newbold | 2021-10-15 | 3 | -5/+24
* more fileset iteration | Bryan Newbold | 2021-10-15 | 4 | -45/+80
* move SPNv2 'simple_get' logic to SPN client | Bryan Newbold | 2021-10-15 | 3 | -52/+31
* filesets: iteration of implementation and docs | Bryan Newbold | 2021-10-15 | 4 | -82/+148
* fileset ingest: improve platform parsing | Bryan Newbold | 2021-10-15 | 1 | -12/+196
* fileset ingest: improve error handling | Bryan Newbold | 2021-10-15 | 4 | -48/+106
* initial implementation of zenodo platform import | Bryan Newbold | 2021-10-15 | 1 | -0/+100
* initial figshare platform helper | Bryan Newbold | 2021-10-15 | 1 | -0/+95
* improvements to platform helpers | Bryan Newbold | 2021-10-15 | 3 | -34/+44
* component ingest support for dataverse files (individual) | Bryan Newbold | 2021-10-15 | 2 | -13/+31