Commit message | Author | Age | Files | Lines | |
---|---|---|---|---|---|
* | python-specific README file | Bryan Newbold | 2023-01-02 | 2 | -7/+46 |
| | |||||
* | bump python deps | Bryan Newbold | 2022-12-23 | 2 | -685/+700 |
| | |||||
* | bad pdf hash | Bryan Newbold | 2022-12-16 | 1 | -0/+1 |
| | |||||
* | sandcrawler: try to handle weird CDX API response | Bryan Newbold | 2022-11-01 | 1 | -0/+5 |
| | | | | Hard to debug this because sentry is broken. | ||||
* | ingest: more generic OJS support, including pre-prints | Bryan Newbold | 2022-10-24 | 1 | -6/+22 |
| | | | | | There were some '/article/view/' patterns which can also be, eg, '/preprint/view/'. | ||||
* | ingest: more generic PDF fulltext URL patterns | Bryan Newbold | 2022-10-24 | 1 | -0/+14 |
| | |||||
* | ingest: another wall pattern, and check for walls in more places | Bryan Newbold | 2022-10-24 | 1 | -1/+14 |
| | |||||
* | ingest: don't prefer WARC over SPN so strongly | Bryan Newbold | 2022-10-24 | 1 | -1/+2 |
| | | | | | | | | | | We generally prefer an older WARC record over an SPN record, because the lookup is easier. But, this was causing problems with repeated ingest, so demote it. We may want to make this more configurable in the future, so things like HTML sub-resource lookups or bulk ingest won't prefer random new SPN captures. | ||||
* | html: worldscientific PDF URL extraction | Bryan Newbold | 2022-10-24 | 1 | -0/+16 |
| | |||||
* | html: pubpub platform detection | Bryan Newbold | 2022-10-24 | 1 | -0/+2 |
| | |||||
* | persist: skip huge URLs | Bryan Newbold | 2022-09-28 | 1 | -0/+4 |
| | | | | and fix some minor doc typos | ||||
* | filesets: handle unknown file sizes (mypy lint fix) | Bryan Newbold | 2022-09-28 | 1 | -1/+1 |
| | |||||
* | update oai-pmh ingest request transform script | Bryan Newbold | 2022-09-28 | 1 | -2/+38 |
| | |||||
* | pytest: suppress another DeprecationWarning | Bryan Newbold | 2022-09-14 | 1 | -0/+1 |
| | |||||
* | spn2: fix tests by not retrying on HTTP 500 | Bryan Newbold | 2022-09-14 | 1 | -1/+3 |
| | |||||
* | catch poppler 'ValueError' when parsing PDFs | Bryan Newbold | 2022-09-14 | 1 | -1/+2 |
| | | | | | Seeing a spike in bad PDFs in the past week or so, while processing old failed ingests. Should really switch from poppler to muPDF. | ||||
* | bad PDF sha1 | Bryan Newbold | 2022-09-12 | 1 | -0/+4 |
| | |||||
* | bad PDF sha1 | Bryan Newbold | 2022-09-11 | 1 | -0/+2 |
| | |||||
* | another bad PDF sha1 | Bryan Newbold | 2022-09-09 | 1 | -0/+1 |
| | |||||
* | yet more bad PDF hashes | Bryan Newbold | 2022-09-08 | 1 | -0/+4 |
| | |||||
* | pipenv: removed unused deps; re-lock deps | Bryan Newbold | 2022-09-07 | 2 | -783/+767 |
| | |||||
* | html ingest: handle TEI-XML parse error | Bryan Newbold | 2022-07-28 | 1 | -1/+4 |
| | |||||
* | yet another bad PDF sha1 | Bryan Newbold | 2022-07-27 | 1 | -0/+1 |
| | |||||
* | CDX: skip sha-256 digests | Bryan Newbold | 2022-07-25 | 1 | -1/+5 |
| | |||||
* | yet another bad SHA1 PDF hash | Bryan Newbold | 2022-07-24 | 1 | -0/+1 |
| | |||||
* | ingest: bump max-hops from 6 to 8 | Bryan Newbold | 2022-07-20 | 1 | -1/+1 |
| | |||||
* | ingest: more PDF fulltext tricks | Bryan Newbold | 2022-07-20 | 2 | -0/+36 |
| | |||||
* | ingest: more PDF fulltext URL patterns | Bryan Newbold | 2022-07-20 | 1 | -0/+42 |
| | |||||
* | doaj and unpaywall transforms: more domains to skip | Bryan Newbold | 2022-07-20 | 2 | -3/+1 |
| | |||||
* | ingest: record bad GZIP transfer decode, instead of crashing (HTML) | Bryan Newbold | 2022-07-18 | 1 | -1/+4 |
| | |||||
* | make fmt | Bryan Newbold | 2022-07-18 | 1 | -1/+0 |
| | |||||
* | cdx: tweak CDX lookups and resolution (sort) | Bryan Newbold | 2022-07-16 | 1 | -4/+7 |
| | |||||
* | html ingest: allow fuzzy CDX sha1 match based on encoding/not-encoding | Bryan Newbold | 2022-07-16 | 1 | -3/+10 |
| | |||||
* | HTML: no longer extracting citation_pdf_url in main extract function | Bryan Newbold | 2022-07-16 | 1 | -24/+0 |
| | |||||
* | html: mangled JSON-in-URL pattern | Bryan Newbold | 2022-07-15 | 1 | -0/+1 |
| | |||||
* | html: remove old citation_pdf_url code path | Bryan Newbold | 2022-07-15 | 1 | -32/+1 |
| | | | | | This code path doesn't check for 'skip' patterns, resulting in a bunch of bad CDX checks/errors | ||||
* | wayback: use same 5xx/4xx-allowing tricks for replay body fetch as for replay redirect | Bryan Newbold | 2022-07-15 | 1 | -7/+7 |
| | |||||
* | cdx api: add another allowable URL fuzzy-match pattern (double slashes) | Bryan Newbold | 2022-07-15 | 1 | -0/+9 |
| | |||||
* | ingest: more bogus domain patterns | Bryan Newbold | 2022-07-15 | 1 | -0/+3 |
| | |||||
* | spn2: handle case of re-attempting a recent crawl (race condition) | Bryan Newbold | 2022-07-15 | 1 | -0/+14 |
| | |||||
* | html: fulltext URL prefixes to skip; also fix broken pattern matching | Bryan Newbold | 2022-07-15 | 1 | -4/+19 |
| | | | | | Due to both the 'continue-in-a-for-loop' and 'missing-trailing-commas', the existing pattern matching was not working. | ||||
* | row2json script: fix argument type | Bryan Newbold | 2022-07-15 | 1 | -1/+1 |
| | |||||
* | row2json script: add flag to enable recrawling | Bryan Newbold | 2022-07-15 | 1 | -1/+8 |
| | |||||
* | ingest: another form of cookie block URL | Bryan Newbold | 2022-07-15 | 1 | -0/+2 |
| | | | | | This still doesn't short-cut CDX lookup chain, because that is all pure redirects happening in ia.py. | ||||
* | HTML ingest: most sub-resource patterns to skip | Bryan Newbold | 2022-07-15 | 1 | -1/+13 |
| | |||||
* | cdx lookups: prioritize truly exact URL matches | Bryan Newbold | 2022-07-14 | 1 | -0/+1 |
| | | | | | | This hopefully resolves an issue causing many apparent redirect loops, which were actually timing or HTTP status code near-loops with http/https fuzzy matching in the CDX API, despite its "exact" lookup semantics. | ||||
* | ingest: handle another type of wayback redirect | Bryan Newbold | 2022-07-14 | 1 | -2/+5 |
| | |||||
* | yet another bad PDF | Bryan Newbold | 2022-07-13 | 1 | -0/+1 |
| | |||||
* | wayback fetch: handle upstream 5xx replays | Bryan Newbold | 2022-07-13 | 1 | -4/+15 |
| | |||||
* | shorten default HTTP backoff factor | Bryan Newbold | 2022-07-13 | 1 | -1/+1 |
| | | | | | The existing factor was resulting in many-minute long backoffs, and Kafka timeouts |
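
The last entry above says a large backoff factor produced many-minute waits. A minimal sketch of why, assuming the common exponential scheme where the delay before retry n is `factor * 2**(n-1)`, capped at a maximum. The function names, the cap, and the factor values are hypothetical illustrations; the log does not state the values or the retry library actually used.

```python
def backoff_delays(factor: float, retries: int, cap: float = 120.0) -> list[float]:
    """Delay in seconds before each retry, assuming factor * 2**(n-1), capped at `cap`."""
    return [min(cap, factor * 2 ** (n - 1)) for n in range(1, retries + 1)]


def total_wait(factor: float, retries: int, cap: float = 120.0) -> float:
    """Worst-case total time spent sleeping if every retry is needed."""
    return sum(backoff_delays(factor, retries, cap))


if __name__ == "__main__":
    # Hypothetical factors, chosen only to show how quickly the total grows.
    for factor in (30.0, 3.0):
        minutes = total_wait(factor, retries=5) / 60
        print(f"factor={factor}: {backoff_delays(factor, 5)} -> {minutes:.2f} min total")
```

Under these assumptions, a tenfold smaller factor cuts the worst-case wait from several minutes to under two, which is the kind of change that keeps a consumer inside a message-processing timeout.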