Commit message | Author | Age | Files | Lines
---|---|---|---|---
OAI-PMH updates | Bryan Newbold | 2022-10-07 | 3 | -2/+391
reingests: update scripts and SQL | Bryan Newbold | 2022-10-03 | 7 | -6/+127
persist: skip huge URLs (and fix some minor doc typos) | Bryan Newbold | 2022-09-28 | 1 | -0/+4
filesets: handle unknown file sizes (mypy lint fix) | Bryan Newbold | 2022-09-28 | 1 | -1/+1
update oai-pmh ingest request transform script | Bryan Newbold | 2022-09-28 | 1 | -2/+38
pytest: suppress another DeprecationWarning | Bryan Newbold | 2022-09-14 | 1 | -0/+1
spn2: fix tests by not retrying on HTTP 500 | Bryan Newbold | 2022-09-14 | 1 | -1/+3
catch poppler 'ValueError' when parsing PDFs (Seeing a spike in bad PDFs in the past week or so, while processing old failed ingests. Should really switch from poppler to muPDF. See the poppler sketch below the table.) | Bryan Newbold | 2022-09-14 | 1 | -1/+2
bad PDF sha1 | Bryan Newbold | 2022-09-12 | 1 | -0/+4
bad PDF sha1 | Bryan Newbold | 2022-09-11 | 1 | -0/+2
another bad PDF sha1 | Bryan Newbold | 2022-09-09 | 1 | -0/+1
yet more bad PDF hashes | Bryan Newbold | 2022-09-08 | 1 | -0/+4
pipenv: removed unused deps; re-lock deps | Bryan Newbold | 2022-09-07 | 2 | -783/+767
sandcrawler SQL-based status (sept 2022) | Bryan Newbold | 2022-09-07 | 1 | -0/+438
summer 2022 ingest notes | Bryan Newbold | 2022-09-06 | 3 | -0/+389
html ingest: handle TEI-XML parse error | Bryan Newbold | 2022-07-28 | 1 | -1/+4
yet another bad PDF sha1 | Bryan Newbold | 2022-07-27 | 1 | -0/+1
CDX: skip sha-256 digests | Bryan Newbold | 2022-07-25 | 1 | -1/+5
yet another bad SHA1 PDF hash | Bryan Newbold | 2022-07-24 | 1 | -0/+1
misc ingest fixes | Bryan Newbold | 2022-07-21 | 1 | -0/+831
ingest: bump max-hops from 6 to 8 | Bryan Newbold | 2022-07-20 | 1 | -1/+1
ingest: more PDF fulltext tricks | Bryan Newbold | 2022-07-20 | 2 | -0/+36
ingest: more PDF fulltext URL patterns | Bryan Newbold | 2022-07-20 | 1 | -0/+42
doaj and unpaywall transforms: more domains to skip | Bryan Newbold | 2022-07-20 | 2 | -3/+1
ingest: record bad GZIP transfer decode, instead of crashing (HTML) | Bryan Newbold | 2022-07-18 | 1 | -1/+4
make fmt | Bryan Newbold | 2022-07-18 | 1 | -1/+0
cdx: tweak CDX lookups and resolution (sort) | Bryan Newbold | 2022-07-16 | 1 | -4/+7
html ingest: allow fuzzy CDX sha1 match based on encoding/not-encoding | Bryan Newbold | 2022-07-16 | 1 | -3/+10
HTML: no longer extracting citation_pdf_url in main extract function | Bryan Newbold | 2022-07-16 | 1 | -24/+0
html: mangled JSON-in-URL pattern | Bryan Newbold | 2022-07-15 | 1 | -0/+1
html: remove old citation_pdf_url code path (This code path doesn't check for 'skip' patterns, resulting in a bunch of bad CDX checks/errors.) | Bryan Newbold | 2022-07-15 | 1 | -32/+1
wayback: use same 5xx/4xx-allowing tricks for replay body fetch as for replay redirect | Bryan Newbold | 2022-07-15 | 1 | -7/+7
cdx api: add another allowable URL fuzzy-match pattern (double slashes) | Bryan Newbold | 2022-07-15 | 1 | -0/+9
ingest: more bogus domain patterns | Bryan Newbold | 2022-07-15 | 1 | -0/+3
spn2: handle case of re-attempting a recent crawl (race condition) | Bryan Newbold | 2022-07-15 | 1 | -0/+14
html: fulltext URL prefixes to skip; also fix broken pattern matching (Due to both a 'continue'-in-a-for-loop bug and missing trailing commas, the existing pattern matching was not working. See the string-concatenation sketch below the table.) | Bryan Newbold | 2022-07-15 | 1 | -4/+19
row2json script: fix argument type | Bryan Newbold | 2022-07-15 | 1 | -1/+1
row2json script: add flag to enable recrawling | Bryan Newbold | 2022-07-15 | 1 | -1/+8
ingest: another form of cookie block URL (This still doesn't short-cut the CDX lookup chain, because that is all pure redirects happening in ia.py.) | Bryan Newbold | 2022-07-15 | 1 | -0/+2
HTML ingest: more sub-resource patterns to skip | Bryan Newbold | 2022-07-15 | 1 | -1/+13
cdx lookups: prioritize truly exact URL matches (This hopefully resolves an issue causing many apparent redirect loops, which were actually timing or HTTP-status-code near-loops caused by http/https fuzzy matching in the CDX API, despite "exact" lookup semantics. See the ranking sketch below the table.) | Bryan Newbold | 2022-07-14 | 1 | -0/+1
ingest: handle another type of wayback redirect | Bryan Newbold | 2022-07-14 | 1 | -2/+5
unpaywall crawl wrap-up notes | Bryan Newbold | 2022-07-14 | 1 | -2/+145
yet another bad PDF | Bryan Newbold | 2022-07-13 | 1 | -0/+1
wayback fetch: handle upstream 5xx replays | Bryan Newbold | 2022-07-13 | 1 | -4/+15
shorten default HTTP backoff factor (The existing factor was resulting in many-minute-long backoffs and Kafka timeouts. See the backoff sketch below the table.) | Bryan Newbold | 2022-07-13 | 1 | -1/+1
ingest: random site PDF link pattern | Bryan Newbold | 2022-07-12 | 1 | -0/+7
ingest: doaj.org article landing page access links | Bryan Newbold | 2022-07-12 | 2 | -1/+12
ingest: targeted 2022-04 notes | Bryan Newbold | 2022-07-07 | 1 | -1/+3
stats: may 2022 ingest-by-domain stats | Bryan Newbold | 2022-07-07 | 1 | -0/+410
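
A few of the commits above carry enough reasoning to illustrate. For the 2022-09-14 poppler commit: a minimal sketch of the catch-and-record pattern, assuming the python-poppler bindings. The `process_pdf` helper and status strings are hypothetical, not the repository's actual code.

```python
# Hedged sketch: record a poppler ValueError as a parse failure instead
# of letting it crash the worker. process_pdf() and the status dicts are
# illustrative assumptions; only poppler.load_from_data() is real API.
import poppler


def process_pdf(blob: bytes) -> dict:
    try:
        pdf = poppler.load_from_data(blob)
    except ValueError as ve:
        # corrupt PDFs can make poppler raise ValueError rather than
        # returning a document object
        return {"status": "parse-error", "error_msg": str(ve)}
    if pdf is None or pdf.pages < 1:
        return {"status": "empty-pdf"}
    return {"status": "success", "page_count": pdf.pages}
```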
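For the 2022-07-15 "fix broken pattern matching" commit: the missing-trailing-comma half of that bug is Python's implicit string-literal concatenation. The prefixes below are made up; the tuple behavior is the point.

```python
# Adjacent string literals are concatenated, so this tuple has ONE
# element and neither intended prefix ever matches on its own.
BROKEN_SKIP_PREFIXES = (
    "https://example.com/accounts/login"  # <- missing trailing comma
    "https://example.com/cookie-absent",
)
assert len(BROKEN_SKIP_PREFIXES) == 1

FIXED_SKIP_PREFIXES = (
    "https://example.com/accounts/login",
    "https://example.com/cookie-absent",
)


def should_skip(url: str) -> bool:
    # str.startswith() accepts a tuple of prefixes directly
    return url.startswith(FIXED_SKIP_PREFIXES)


assert should_skip("https://example.com/cookie-absent?next=/paper.pdf")
assert not should_skip("https://example.com/article/123/fulltext.pdf")
```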
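For the 2022-07-14 "prioritize truly exact URL matches" commit: one way to express the idea, with a hypothetical row shape and ranking; the actual sandcrawler CDX client logic may differ.

```python
# Hypothetical CDX row shape; the ranking tuple is the point: a row whose
# URL is byte-for-byte the requested one sorts ahead of the http/https
# fuzzy matches the CDX API can return even for "exact" lookups.
from typing import List, NamedTuple


class CdxRow(NamedTuple):
    url: str
    status_code: int
    datetime: str  # 14-digit wayback timestamp


def pick_best_row(request_url: str, rows: List[CdxRow]) -> CdxRow:
    def rank(row: CdxRow):
        return (
            row.url == request_url,  # exact URL match beats fuzzy match
            row.status_code == 200,  # then prefer successful captures
            row.datetime,            # then the most recent capture
        )

    return max(rows, key=rank)


rows = [
    CdxRow("http://example.com/paper.pdf", 200, "20220701000000"),
    CdxRow("https://example.com/paper.pdf", 200, "20220601000000"),
]
# the exact https match wins despite being the older capture
assert pick_best_row("https://example.com/paper.pdf", rows).url.startswith("https")
```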
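For the 2022-07-13 backoff-factor commit: illustrative arithmetic showing why a large urllib3 `backoff_factor` stalls a consumer. The factors and retry counts below are made up, not the repository's settings.

```python
# urllib3's Retry sleeps roughly backoff_factor * 2**n between attempts
# (the exact formula varies slightly across urllib3 versions; sleeps are
# capped, 120 seconds by default).
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def sleep_schedule(backoff_factor: float, attempts: int) -> list:
    return [min(backoff_factor * (2 ** n), 120.0) for n in range(attempts)]


print(sleep_schedule(3.0, 8))
# [3.0, 6.0, 12.0, 24.0, 48.0, 96.0, 120.0, 120.0]: ~7 minutes of sleep,
# easily blowing past a Kafka consumer poll timeout
print(sleep_schedule(0.5, 8))
# [0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0]: about 2 minutes total

# attaching a smaller factor to a requests session (values illustrative)
session = requests.Session()
session.mount("https://", HTTPAdapter(max_retries=Retry(total=3, backoff_factor=0.5)))
```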