path: root/python/sandcrawler
Commit message | Author | Age | Files | Lines
...
* block isiarticles.com from future PDF crawls | Bryan Newbold | 2022-04-20 | 1 | -0/+2
* ingest: drive.google.com ingest support | Bryan Newbold | 2022-04-04 | 1 | -0/+8
* filesets: fix archive.org path naming | Bryan Newbold | 2022-03-29 | 1 | -7/+8
* bugfix: sha1/md5 typo | Bryan Newbold | 2022-03-23 | 1 | -1/+1
* file ingest: don't 'backoff' on spn2 backoff error | Bryan Newbold | 2022-03-22 | 2 | -0/+8
* small lint/typo/fmt fixes | Bryan Newbold | 2022-02-24 | 3 | -5/+5
* another bad PDF sha1 | Bryan Newbold | 2022-02-23 | 1 | -0/+1
* ingest: fix mistakenly commented except block (?) | Bryan Newbold | 2022-02-18 | 1 | -4/+3
* ingest: handle more fileset failure modes | Bryan Newbold | 2022-02-18 | 2 | -3/+30
* yet another bad PDF sha1 | Bryan Newbold | 2022-02-08 | 1 | -0/+1
* sandcrawler: additional extracts, mostly OJS | Bryan Newbold | 2022-01-13 | 1 | -1/+23
* filesets: more figshare URL patterns | Bryan Newbold | 2022-01-13 | 1 | -0/+13
* fileset ingest: better verification of resources | Bryan Newbold | 2022-01-13 | 1 | -7/+23
* ingest: PDF pattern for integrityresjournals.org | Bryan Newbold | 2022-01-13 | 1 | -0/+8
* null-body -> empty-blob | Bryan Newbold | 2022-01-13 | 3 | -4/+8
* spn: handle blocked-url (etc) better | Bryan Newbold | 2022-01-11 | 1 | -0/+10
* filesets: handle weird figshare link-only case better | Bryan Newbold | 2021-12-16 | 1 | -1/+4
* lint ('not in') | Bryan Newbold | 2021-12-15 | 1 | -2/+2
* more fileset ingest tweaks | Bryan Newbold | 2021-12-15 | 2 | -0/+7
* fileset ingest: more requests timeouts, sessions | Bryan Newbold | 2021-12-15 | 3 | -37/+68
* fileset ingest: create tmp subdirectories if needed | Bryan Newbold | 2021-12-15 | 1 | -0/+5
* fileset ingest: configure IA session from env | Bryan Newbold | 2021-12-15 | 1 | -1/+6
* fileset ingest: actually use spn2 CLI flag | Bryan Newbold | 2021-12-11 | 2 | -3/+4
* grobid: set a maximum file size (256 MByte) | Bryan Newbold | 2021-12-07 | 1 | -0/+8
* codespell typos in python (comments) | Bryan Newbold | 2021-11-24 | 4 | -4/+4
* html_meta: actual typo in code (CSS selector) caught by codespell | Bryan Newbold | 2021-11-24 | 1 | -1/+1
* make fmt | Bryan Newbold | 2021-11-16 | 1 | -1/+1
* SPNv2: make 'resources' optional | Bryan Newbold | 2021-11-16 | 1 | -1/+1
* grobid: handle XML parsing errors, and have them recorded in sandcrawler-db | Bryan Newbold | 2021-11-12 | 1 | -1/+5
* ingest_file: more efficient GROBID metadata copy | Bryan Newbold | 2021-11-12 | 1 | -3/+3
* ingest: start re-processing GROBID with newer version | Bryan Newbold | 2021-11-10 | 1 | -2/+6
* simple persist worker/tool to backfill grobid_refs | Bryan Newbold | 2021-11-10 | 1 | -0/+40
* grobid: extract more metadata in document TEI-XML | Bryan Newbold | 2021-11-10 | 1 | -0/+5
* grobid: update 'TODO' comment based on review | Bryan Newbold | 2021-11-04 | 1 | -3/+0
* crossref grobid refs: another error case (ReadTimeout) | Bryan Newbold | 2021-11-04 | 2 | -5/+11
* db (postgrest): actually use an HTTP session | Bryan Newbold | 2021-11-04 | 1 | -12/+24
* grobid: use requests session | Bryan Newbold | 2021-11-04 | 1 | -3/+4
* grobid crossref refs: try to handle HTTP 5xx and XML parse errors | Bryan Newbold | 2021-11-04 | 2 | -5/+33
* grobid: handle weird whitespace unstructured from crossref | Bryan Newbold | 2021-11-04 | 1 | -1/+10
* crossref persist: make GROBID ref parsing an option (not default) | Bryan Newbold | 2021-11-04 | 1 | -7/+16
* glue, utils, and worker code for crossref and grobid_refs | Bryan Newbold | 2021-11-04 | 2 | -3/+151
* iterated GROBID citation cleaning and processing | Bryan Newbold | 2021-11-04 | 1 | -27/+45
* grobid citations: first pass at cleaning unstructured | Bryan Newbold | 2021-11-04 | 1 | -2/+34
* initial crossref-refs via GROBID helper routine | Bryan Newbold | 2021-11-04 | 1 | -4/+121
* pdftrio client: use HTTP session for POSTs | Bryan Newbold | 2021-11-03 | 1 | -1/+1
* workers: use HTTP session for archive.org fetches | Bryan Newbold | 2021-11-03 | 1 | -3/+3
* IA (wayback): actually use an HTTP session for replay fetches | Bryan Newbold | 2021-11-03 | 1 | -2/+3
* remove grobid2json helper file, replace with grobid_tei_xml | Bryan Newbold | 2021-10-27 | 2 | -4/+5
* small type annotation things from additional packages | Bryan Newbold | 2021-10-27 | 2 | -5/+14
* make fmt (black 21.9b0) | Bryan Newbold | 2021-10-27 | 18 | -1840/+2332