path: root/python/sandcrawler
Commit message | Author | Age | Files | Lines
* mypy lint fixes | Bryan Newbold | 2023-01-04 | 4 | -5/+5
* bad pdf hash | Bryan Newbold | 2022-12-16 | 1 | -0/+1
* sandcrawler: try to handle weird CDX API response | Bryan Newbold | 2022-11-01 | 1 | -0/+5
  Hard to debug this because sentry is broken.
* ingest: more generic OJS support, including pre-prints | Bryan Newbold | 2022-10-24 | 1 | -6/+22
  There were some '/article/view/' patterns which can also be, eg, '/preprint/view/'.
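A minimal sketch of the kind of generalized OJS landing-page pattern that commit describes; the regex, helper name, and example URLs are illustrative assumptions, not the actual sandcrawler code:

    import re

    # Hypothetical pattern: accept both '/article/view/' and '/preprint/view/'
    # style OJS landing-page paths.
    OJS_VIEW_PATTERN = re.compile(r"/(article|preprint)/view/\d+")

    def looks_like_ojs_view_url(url: str) -> bool:
        """Return True if the URL looks like a generic OJS 'view' landing page."""
        return bool(OJS_VIEW_PATTERN.search(url))

    assert looks_like_ojs_view_url("https://journal.example.org/index.php/j/article/view/123")
    assert looks_like_ojs_view_url("https://journal.example.org/index.php/j/preprint/view/45")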
* ingest: more generic PDF fulltext URL patterns | Bryan Newbold | 2022-10-24 | 1 | -0/+14
* ingest: another wall pattern, and check for walls in more places | Bryan Newbold | 2022-10-24 | 1 | -1/+14
* ingest: don't prefer WARC over SPN so strongly | Bryan Newbold | 2022-10-24 | 1 | -1/+2
  We generally prefer an older WARC record over an SPN record, because the lookup is easier. But, this was causing problems with repeated ingest, so demote it. We may want to make this more configurable in the future, so things like HTML sub-resource lookups or bulk ingest won't prefer random new SPN captures.
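A rough sketch of that "demote, don't eliminate" preference, assuming candidate captures carry a source label, status, and timestamp; the field names and ordering are illustrative, not the project's actual logic:

    # Illustrative only: keep the WARC-vs-SPN preference, but as a late
    # tie-breaker rather than the dominant criterion it used to be.
    def capture_sort_key(capture: dict) -> tuple:
        is_error = capture["status_code"] != 200               # successful captures first
        source_rank = 0 if capture["source"] == "warc" else 1  # mild WARC preference, last
        # previously the ordering was roughly (is_error, source_rank, ...)
        return (is_error, capture["datetime"], source_rank)

    candidates = [
        {"source": "spn", "status_code": 200, "datetime": "20221024120000"},
        {"source": "warc", "status_code": 503, "datetime": "20190101000000"},
    ]
    best = min(candidates, key=capture_sort_key)  # the successful SPN capture wins here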
* html: worldscientific PDF URL extraction | Bryan Newbold | 2022-10-24 | 1 | -0/+16
* html: pubpub platform detection | Bryan Newbold | 2022-10-24 | 1 | -0/+2
* persist: skip huge URLs | Bryan Newbold | 2022-09-28 | 1 | -0/+4
  and fix some minor doc typos
* filesets: handle unknown file sizes (mypy lint fix) | Bryan Newbold | 2022-09-28 | 1 | -1/+1
* spn2: fix tests by not retrying on HTTP 500 | Bryan Newbold | 2022-09-14 | 1 | -1/+3
* catch poppler 'ValueError' when parsing PDFs | Bryan Newbold | 2022-09-14 | 1 | -1/+2
  Seeing a spike in bad PDFs in the past week or so, while processing old failed ingests. Should really switch from poppler to muPDF.
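A minimal sketch of catching that parser error, assuming the python-poppler bindings; the helper name and return shape are made up for illustration:

    from typing import Optional

    import poppler  # python-poppler bindings, assumed to be the parser in use

    def parse_pdf_safely(blob: bytes) -> Optional[dict]:
        """Parse a PDF, returning None on parse failure instead of crashing."""
        try:
            doc = poppler.load_from_data(blob)
        except ValueError:
            # seen with some malformed PDFs; treat as a normal parse failure
            return None
        return {"page_count": doc.pages}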
* bad PDF sha1 | Bryan Newbold | 2022-09-12 | 1 | -0/+4
* bad PDF sha1 | Bryan Newbold | 2022-09-11 | 1 | -0/+2
* another bad PDF sha1 | Bryan Newbold | 2022-09-09 | 1 | -0/+1
* yet more bad PDF hashes | Bryan Newbold | 2022-09-08 | 1 | -0/+4
* html ingest: handle TEI-XML parse error | Bryan Newbold | 2022-07-28 | 1 | -1/+4
* yet another bad PDF sha1 | Bryan Newbold | 2022-07-27 | 1 | -0/+1
* CDX: skip sha-256 digests | Bryan Newbold | 2022-07-25 | 1 | -1/+5
* yet another bad SHA1 PDF hash | Bryan Newbold | 2022-07-24 | 1 | -0/+1
* ingest: bump max-hops from 6 to 8 | Bryan Newbold | 2022-07-20 | 1 | -1/+1
* ingest: more PDF fulltext tricks | Bryan Newbold | 2022-07-20 | 2 | -0/+36
* ingest: more PDF fulltext URL patterns | Bryan Newbold | 2022-07-20 | 1 | -0/+42
* ingest: record bad GZIP transfer decode, instead of crashing (HTML) | Bryan Newbold | 2022-07-18 | 1 | -1/+4
* cdx: tweak CDX lookups and resolution (sort) | Bryan Newbold | 2022-07-16 | 1 | -4/+7
* html ingest: allow fuzzy CDX sha1 match based on encoding/not-encoding | Bryan Newbold | 2022-07-16 | 1 | -3/+10
* html: mangled JSON-in-URL pattern | Bryan Newbold | 2022-07-15 | 1 | -0/+1
* html: remove old citation_pdf_url code path | Bryan Newbold | 2022-07-15 | 1 | -32/+1
  This code path doesn't check for 'skip' patterns, resulting in a bunch of bad CDX checks/errors
* wayback: use same 5xx/4xx-allowing tricks for replay body fetch as for replay redirect | Bryan Newbold | 2022-07-15 | 1 | -7/+7
* cdx api: add another allowable URL fuzzy-match pattern (double slashes) | Bryan Newbold | 2022-07-15 | 1 | -0/+9
* ingest: more bogus domain patterns | Bryan Newbold | 2022-07-15 | 1 | -0/+3
* spn2: handle case of re-attempting a recent crawl (race condition) | Bryan Newbold | 2022-07-15 | 1 | -0/+14
* html: fulltext URL prefixes to skip; also fix broken pattern matching | Bryan Newbold | 2022-07-15 | 1 | -4/+19
  Due to both the 'continue-in-a-for-loop' and 'missing-trailing-commas', the existing pattern matching was not working.
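For context, the "missing-trailing-commas" part of that bug is the classic Python pitfall of adjacent string literals being silently concatenated. A hypothetical before/after sketch (the prefixes and helper are invented for illustration):

    # Broken: without a comma between them, the two literals merge into one
    # long string, so neither prefix ever matches on its own.
    BROKEN_SKIP_PREFIXES = [
        "https://example.com/accounts/login"
        "https://example.com/cookieAbsent",
    ]

    # Fixed: one prefix per element, trailing commas included.
    SKIP_PREFIXES = [
        "https://example.com/accounts/login",
        "https://example.com/cookieAbsent",
    ]

    def should_skip(url: str) -> bool:
        return any(url.startswith(prefix) for prefix in SKIP_PREFIXES)

    assert should_skip("https://example.com/cookieAbsent?target=whatever")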
* ingest: another form of cookie block URL | Bryan Newbold | 2022-07-15 | 1 | -0/+2
  This still doesn't short-cut CDX lookup chain, because that is all pure redirects happening in ia.py.
* HTML ingest: more sub-resource patterns to skip | Bryan Newbold | 2022-07-15 | 1 | -1/+13
* cdx lookups: prioritize truly exact URL matches | Bryan Newbold | 2022-07-14 | 1 | -0/+1
  This hopefully resolves an issue causing many apparent redirect loops, which were actually timing or HTTP status code near-loops with http/https fuzzy matching in the CDX API, despite "exact" API lookup semantics.
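A sketch of that "truly exact first" idea, assuming CDX rows carry the originally requested URL; the field names and weighting are illustrative, not the actual sandcrawler implementation:

    # Illustrative only: when ranking CDX rows from an "exact" lookup, put rows
    # whose URL is byte-for-byte identical to the requested URL ahead of
    # http/https (or other) fuzzy matches.
    def cdx_sort_key(row: dict, requested_url: str) -> tuple:
        exact_match = 0 if row["url"] == requested_url else 1
        is_error = row["status_code"] != 200
        return (exact_match, is_error, row["datetime"])

    requested = "http://example.com/paper.pdf"
    rows = [
        {"url": "https://example.com/paper.pdf", "status_code": 200, "datetime": "20220701000000"},
        {"url": "http://example.com/paper.pdf", "status_code": 200, "datetime": "20220601000000"},
    ]
    best = min(rows, key=lambda r: cdx_sort_key(r, requested))  # the exact http:// row wins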
* ingest: handle another type of wayback redirect | Bryan Newbold | 2022-07-14 | 1 | -2/+5
* yet another bad PDF | Bryan Newbold | 2022-07-13 | 1 | -0/+1
* wayback fetch: handle upstream 5xx replays | Bryan Newbold | 2022-07-13 | 1 | -4/+15
* shorten default HTTP backoff factor | Bryan Newbold | 2022-07-13 | 1 | -1/+1
  The existing factor was resulting in many-minute long backoffs, and Kafka timeouts
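For reference, with urllib3's Retry the sleep before retry N is roughly backoff_factor * 2**(N - 1), so a large factor quickly grows into multi-minute waits. A generic sketch of wiring a smaller factor into a requests session; the specific numbers are examples, not the project's actual values:

    import requests
    from requests.adapters import HTTPAdapter
    from urllib3.util.retry import Retry

    def make_session(backoff_factor: float = 1.0) -> requests.Session:
        retry = Retry(
            total=3,
            backoff_factor=backoff_factor,  # smaller factor => shorter sleeps between retries
            status_forcelist=[500, 502, 503, 504],
        )
        session = requests.Session()
        adapter = HTTPAdapter(max_retries=retry)
        session.mount("http://", adapter)
        session.mount("https://", adapter)
        return session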
* ingest: random site PDF link pattern | Bryan Newbold | 2022-07-12 | 1 | -0/+7
* ingest: doaj.org article landing page access links | Bryan Newbold | 2022-07-12 | 2 | -1/+12
* ingest: IEEE domain is blocking us | Bryan Newbold | 2022-07-07 | 1 | -1/+2
* ingest: catch more ConnectionErrors (SPN, replay fetch, GROBID) | Bryan Newbold | 2022-05-16 | 2 | -4/+19
* ingest: skip arxiv.org DOIs, we already direct-ingest | Bryan Newbold | 2022-05-11 | 1 | -0/+1
* ingest spn2: fix tests | Bryan Newbold | 2022-05-05 | 2 | -1/+2
* ingest: more loginwall patterns | Bryan Newbold | 2022-05-05 | 1 | -0/+3
* SPNv2: several fixes for prod throughput | Bryan Newbold | 2022-04-26 | 1 | -11/+34
  Most importantly, for some API flags, if the value is not truthy, do not set the flag at all. Setting any flag was resulting in screenshots and outlinks actually getting created/captured, which was a huge slowdown. Also, check per-user SPNv2 slots available, using the API, before requesting an actual capture.
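A sketch of the "only send truthy flags" fix described above; the parameter names mirror the public SPNv2 API but should be read as assumptions, not the exact sandcrawler request code:

    # Illustrative only: omit boolean capture flags entirely unless enabled,
    # since sending the key at all (even with a falsy value) was being treated
    # as a request to do the extra work.
    def build_spn_post_data(url: str, capture_outlinks: bool = False,
                            capture_screenshot: bool = False) -> dict:
        data = {"url": url}
        if capture_outlinks:
            data["capture_outlinks"] = 1
        if capture_screenshot:
            data["capture_screenshot"] = 1
        return data

    assert build_spn_post_data("https://example.com/paper.pdf") == {"url": "https://example.com/paper.pdf"}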
* make fmt | Bryan Newbold | 2022-04-26 | 1 | -2/+5