path: root/python
Commit message | Author | Age | Files | Lines
* mypy lint fixes | Bryan Newbold | 2023-01-04 | 4 | -5/+5
* python-specific README file | Bryan Newbold | 2023-01-02 | 2 | -7/+46
* bump python deps | Bryan Newbold | 2022-12-23 | 2 | -685/+700
* bad pdf hash | Bryan Newbold | 2022-12-16 | 1 | -0/+1
* sandcrawler: try to handle weird CDX API response | Bryan Newbold | 2022-11-01 | 1 | -0/+5
    Hard to debug this because sentry is broken.
* ingest: more generic OJS support, including pre-prints | Bryan Newbold | 2022-10-24 | 1 | -6/+22
    There were some '/article/view/' patterns which can also be, eg, '/preprint/view/'.
* ingest: more generic PDF fulltext URL patterns | Bryan Newbold | 2022-10-24 | 1 | -0/+14
* ingest: another wall pattern, and check for walls in more places | Bryan Newbold | 2022-10-24 | 1 | -1/+14
* ingest: don't prefer WARC over SPN so strongly | Bryan Newbold | 2022-10-24 | 1 | -1/+2
    We generally prefer an older WARC record over an SPN record, because the lookup is easier. But, this was causing problems with repeated ingest, so demote it. We may want to make this more configurable in the future, so things like HTML sub-resource lookups or bulk ingest won't prefer random new SPN captures.
* html: worldscientific PDF URL extraction | Bryan Newbold | 2022-10-24 | 1 | -0/+16
* html: pubpub platform detection | Bryan Newbold | 2022-10-24 | 1 | -0/+2
* persist: skip huge URLs | Bryan Newbold | 2022-09-28 | 1 | -0/+4
    and fix some minor doc typos
* filesets: handle unknown file sizes (mypy lint fix) | Bryan Newbold | 2022-09-28 | 1 | -1/+1
* update oai-pmh ingest request transform script | Bryan Newbold | 2022-09-28 | 1 | -2/+38
* pytest: supress another deprecationwarning | Bryan Newbold | 2022-09-14 | 1 | -0/+1
* spn2: fix tests by not retrying on HTTP 500 | Bryan Newbold | 2022-09-14 | 1 | -1/+3
* catch poppler 'ValueError' when parsing PDFs | Bryan Newbold | 2022-09-14 | 1 | -1/+2
    Seeing a spike in bad PDFs in the past week or so, while processing old failed ingests. Should really switch from poppler to muPDF.
* bad PDF sha1 | Bryan Newbold | 2022-09-12 | 1 | -0/+4
* bad PDF sha1 | Bryan Newbold | 2022-09-11 | 1 | -0/+2
* another bad PDF sha1 | Bryan Newbold | 2022-09-09 | 1 | -0/+1
* yet more bad PDF hashes | Bryan Newbold | 2022-09-08 | 1 | -0/+4
* pipenv: removed unused deps; re-lock deps | Bryan Newbold | 2022-09-07 | 2 | -783/+767
* html ingest: handle TEI-XML parse error | Bryan Newbold | 2022-07-28 | 1 | -1/+4
* yet another bad PDF sha1 | Bryan Newbold | 2022-07-27 | 1 | -0/+1
* CDX: skip sha-256 digests | Bryan Newbold | 2022-07-25 | 1 | -1/+5
* yet another bad SHA1 PDF hash | Bryan Newbold | 2022-07-24 | 1 | -0/+1
* ingest: bump max-hops from 6 to 8 | Bryan Newbold | 2022-07-20 | 1 | -1/+1
* ingest: more PDF fulltext tricks | Bryan Newbold | 2022-07-20 | 2 | -0/+36
* ingest: more PDF fulltext URL patterns | Bryan Newbold | 2022-07-20 | 1 | -0/+42
* doaj and unpaywall transforms: more domains to skip | Bryan Newbold | 2022-07-20 | 2 | -3/+1
* ingest: record bad GZIP transfer decode, instead of crashing (HTML) | Bryan Newbold | 2022-07-18 | 1 | -1/+4
* make fmt | Bryan Newbold | 2022-07-18 | 1 | -1/+0
* cdx: tweak CDX lookups and resolution (sort) | Bryan Newbold | 2022-07-16 | 1 | -4/+7
* html ingest: allow fuzzy CDX sha1 match based on encoding/not-encoding | Bryan Newbold | 2022-07-16 | 1 | -3/+10
* HTML: no longer extracting citation_pdf_url in main extract function | Bryan Newbold | 2022-07-16 | 1 | -24/+0
* html: mangled JSON-in-URL pattern | Bryan Newbold | 2022-07-15 | 1 | -0/+1
* html: remove old citation_pdf_url code path | Bryan Newbold | 2022-07-15 | 1 | -32/+1
    This code path doesn't check for 'skip' patterns, resulting in a bunch of bad CDX checks/errors
* wayback: use same 5xx/4xx-allowing tricks for replay body fetch as for replay redirect | Bryan Newbold | 2022-07-15 | 1 | -7/+7
* cdx api: add another allowable URL fuzzy-match pattern (double slashes) | Bryan Newbold | 2022-07-15 | 1 | -0/+9
* ingest: more bogus domain patterns | Bryan Newbold | 2022-07-15 | 1 | -0/+3
* spn2: handle case of re-attempting a recent crawl (race condition) | Bryan Newbold | 2022-07-15 | 1 | -0/+14
* html: fulltext URL prefixes to skip; also fix broken pattern matching | Bryan Newbold | 2022-07-15 | 1 | -4/+19
    Due to both the 'continue-in-a-for-loop' and 'missing-trailing-commas', the existing pattern matching was not working.
* row2json script: fix argument type | Bryan Newbold | 2022-07-15 | 1 | -1/+1
* row2json script: add flag to enable recrawling | Bryan Newbold | 2022-07-15 | 1 | -1/+8
* ingest: another form of cookie block URL | Bryan Newbold | 2022-07-15 | 1 | -0/+2
    This still doesn't short-cut CDX lookup chain, because that is all pure redirects happening in ia.py.
* HTML ingest: most sub-resource patterns to skip | Bryan Newbold | 2022-07-15 | 1 | -1/+13
* cdx lookups: prioritize truely exact URL matches | Bryan Newbold | 2022-07-14 | 1 | -0/+1
    This hopefully resolves an issue causing many apparent redirect loops, which were actually timing or HTTP status code near-loops with http/https fuzzy matching in CDX API. Despite "exact" API lookup semantics.
* ingest: handle another type of wayback redirect | Bryan Newbold | 2022-07-14 | 1 | -2/+5
* yet another bad PDF | Bryan Newbold | 2022-07-13 | 1 | -0/+1
* wayback fetch: handle upstream 5xx replays | Bryan Newbold | 2022-07-13 | 1 | -4/+15