path: root/python/sandcrawler
Commit message | Author | Date | Files | Lines
* html ingest: handle TEI-XML parse error [HEAD, master] | Bryan Newbold | 2022-07-28 | 1 | -1/+4
* yet another bad PDF sha1 | Bryan Newbold | 2022-07-27 | 1 | -0/+1
* CDX: skip sha-256 digests | Bryan Newbold | 2022-07-25 | 1 | -1/+5
* yet another bad SHA1 PDF hash | Bryan Newbold | 2022-07-24 | 1 | -0/+1
* ingest: bump max-hops from 6 to 8 | Bryan Newbold | 2022-07-20 | 1 | -1/+1
* ingest: more PDF fulltext tricks | Bryan Newbold | 2022-07-20 | 2 | -0/+36
* ingest: more PDF fulltext URL patterns | Bryan Newbold | 2022-07-20 | 1 | -0/+42
* ingest: record bad GZIP transfer decode, instead of crashing (HTML) | Bryan Newbold | 2022-07-18 | 1 | -1/+4
* cdx: tweak CDX lookups and resolution (sort) | Bryan Newbold | 2022-07-16 | 1 | -4/+7
* html ingest: allow fuzzy CDX sha1 match based on encoding/not-encoding | Bryan Newbold | 2022-07-16 | 1 | -3/+10
* html: mangled JSON-in-URL pattern | Bryan Newbold | 2022-07-15 | 1 | -0/+1
* html: remove old citation_pdf_url code path | Bryan Newbold | 2022-07-15 | 1 | -32/+1
* wayback: use same 5xx/4xx-allowing tricks for replay body fetch as for replay... | Bryan Newbold | 2022-07-15 | 1 | -7/+7
* cdx api: add another allowable URL fuzzy-match pattern (double slashes) | Bryan Newbold | 2022-07-15 | 1 | -0/+9
* ingest: more bogus domain patterns | Bryan Newbold | 2022-07-15 | 1 | -0/+3
* spn2: handle case of re-attempting a recent crawl (race condition) | Bryan Newbold | 2022-07-15 | 1 | -0/+14
* html: fulltext URL prefixes to skip; also fix broken pattern matching | Bryan Newbold | 2022-07-15 | 1 | -4/+19
* ingest: another form of cookie block URL | Bryan Newbold | 2022-07-15 | 1 | -0/+2
* HTML ingest: most sub-resource patterns to skip | Bryan Newbold | 2022-07-15 | 1 | -1/+13
* cdx lookups: prioritize truely exact URL matches | Bryan Newbold | 2022-07-14 | 1 | -0/+1
* ingest: handle another type of wayback redirect | Bryan Newbold | 2022-07-14 | 1 | -2/+5
* yet another bad PDF | Bryan Newbold | 2022-07-13 | 1 | -0/+1
* wayback fetch: handle upstream 5xx replays | Bryan Newbold | 2022-07-13 | 1 | -4/+15
* shorten default HTTP backoff factor | Bryan Newbold | 2022-07-13 | 1 | -1/+1
* ingest: random site PDF link pattern | Bryan Newbold | 2022-07-12 | 1 | -0/+7
* ingest: doaj.org article landing page access links | Bryan Newbold | 2022-07-12 | 2 | -1/+12
* ingest: IEEE domain is blocking us | Bryan Newbold | 2022-07-07 | 1 | -1/+2
* ingest: catch more ConnectionErrors (SPN, replay fetch, GROBID) | Bryan Newbold | 2022-05-16 | 2 | -4/+19
* ingest: skip arxiv.org DOIs, we already direct-ingest | Bryan Newbold | 2022-05-11 | 1 | -0/+1
* ingest spn2: fix tests | Bryan Newbold | 2022-05-05 | 2 | -1/+2
* ingest: more loginwall patterns | Bryan Newbold | 2022-05-05 | 1 | -0/+3
* SPNv2: several fixes for prod throughput | Bryan Newbold | 2022-04-26 | 1 | -11/+34
* make fmt | Bryan Newbold | 2022-04-26 | 1 | -2/+5
* block isiarticles.com from future PDF crawls | Bryan Newbold | 2022-04-20 | 1 | -0/+2
* ingest: drive.google.com ingest support | Bryan Newbold | 2022-04-04 | 1 | -0/+8
* filesets: fix archive.org path naming | Bryan Newbold | 2022-03-29 | 1 | -7/+8
* bugfix: sha1/md5 typo | Bryan Newbold | 2022-03-23 | 1 | -1/+1
* file ingest: don't 'backoff' on spn2 backoff error | Bryan Newbold | 2022-03-22 | 2 | -0/+8
* small lint/typo/fmt fixes | Bryan Newbold | 2022-02-24 | 3 | -5/+5
* another bad PDF sha1 | Bryan Newbold | 2022-02-23 | 1 | -0/+1
* ingest: fix mistakenly commented except block (?) | Bryan Newbold | 2022-02-18 | 1 | -4/+3
* ingest: handle more fileset failure modes | Bryan Newbold | 2022-02-18 | 2 | -3/+30
* yet another bad PDF sha1 | Bryan Newbold | 2022-02-08 | 1 | -0/+1
* sandcrawler: additional extracts, mostly OJS | Bryan Newbold | 2022-01-13 | 1 | -1/+23
* filesets: more figshare URL patterns | Bryan Newbold | 2022-01-13 | 1 | -0/+13
* fileset ingest: better verification of resources | Bryan Newbold | 2022-01-13 | 1 | -7/+23
* ingest: PDF pattern for integrityresjournals.org | Bryan Newbold | 2022-01-13 | 1 | -0/+8
* null-body -> empty-blob | Bryan Newbold | 2022-01-13 | 3 | -4/+8
* spn: handle blocked-url (etc) better | Bryan Newbold | 2022-01-11 | 1 | -0/+10
* filesets: handle weird figshare link-only case better | Bryan Newbold | 2021-12-16 | 1 | -1/+4