Commit message | Author | Age | Files | Lines
* persist: skip huge URLs | Bryan Newbold | 2022-09-28 | 1 | -0/+4
|   and fix some minor doc typos
* filesets: handle unknown file sizes (mypy lint fix) | Bryan Newbold | 2022-09-28 | 1 | -1/+1
|
* update oai-pmh ingest request transform script | Bryan Newbold | 2022-09-28 | 1 | -2/+38
|
* pytest: suppress another deprecationwarning | Bryan Newbold | 2022-09-14 | 1 | -0/+1
|
* spn2: fix tests by not retrying on HTTP 500 | Bryan Newbold | 2022-09-14 | 1 | -1/+3
|
* catch poppler 'ValueError' when parsing PDFs | Bryan Newbold | 2022-09-14 | 1 | -1/+2
|   Seeing a spike in bad PDFs in the past week or so, while processing old failed ingests. Should really switch from poppler to muPDF.
* bad PDF sha1 | Bryan Newbold | 2022-09-12 | 1 | -0/+4
|
* bad PDF sha1 | Bryan Newbold | 2022-09-11 | 1 | -0/+2
|
* another bad PDF sha1 | Bryan Newbold | 2022-09-09 | 1 | -0/+1
|
* yet more bad PDF hashes | Bryan Newbold | 2022-09-08 | 1 | -0/+4
|
* pipenv: removed unused deps; re-lock deps | Bryan Newbold | 2022-09-07 | 2 | -783/+767
|
* sandcrawler SQL-based status (sept 2022) | Bryan Newbold | 2022-09-07 | 1 | -0/+438
|
* summer 2022 ingest notes | Bryan Newbold | 2022-09-06 | 3 | -0/+389
|
* html ingest: handle TEI-XML parse error | Bryan Newbold | 2022-07-28 | 1 | -1/+4
|
* yet another bad PDF sha1 | Bryan Newbold | 2022-07-27 | 1 | -0/+1
|
* CDX: skip sha-256 digests | Bryan Newbold | 2022-07-25 | 1 | -1/+5
|
* yet another bad SHA1 PDF hash | Bryan Newbold | 2022-07-24 | 1 | -0/+1
|
* misc ingest fixes | Bryan Newbold | 2022-07-21 | 1 | -0/+831
|
* ingest: bump max-hops from 6 to 8 | Bryan Newbold | 2022-07-20 | 1 | -1/+1
|
* ingest: more PDF fulltext tricks | Bryan Newbold | 2022-07-20 | 2 | -0/+36
|
* ingest: more PDF fulltext URL patterns | Bryan Newbold | 2022-07-20 | 1 | -0/+42
|
* doaj and unpaywall transforms: more domains to skip | Bryan Newbold | 2022-07-20 | 2 | -3/+1
|
* ingest: record bad GZIP transfer decode, instead of crashing (HTML) | Bryan Newbold | 2022-07-18 | 1 | -1/+4
|
* make fmt | Bryan Newbold | 2022-07-18 | 1 | -1/+0
|
* cdx: tweak CDX lookups and resolution (sort) | Bryan Newbold | 2022-07-16 | 1 | -4/+7
|
* html ingest: allow fuzzy CDX sha1 match based on encoding/not-encoding | Bryan Newbold | 2022-07-16 | 1 | -3/+10
|
* HTML: no longer extracting citation_pdf_url in main extract function | Bryan Newbold | 2022-07-16 | 1 | -24/+0
|
* html: mangled JSON-in-URL pattern | Bryan Newbold | 2022-07-15 | 1 | -0/+1
|
* html: remove old citation_pdf_url code path | Bryan Newbold | 2022-07-15 | 1 | -32/+1
|   This code path doesn't check for 'skip' patterns, resulting in a bunch of bad CDX checks/errors
* wayback: use same 5xx/4xx-allowing tricks for replay body fetch as for replay redirect | Bryan Newbold | 2022-07-15 | 1 | -7/+7
|
* cdx api: add another allowable URL fuzzy-match pattern (double slashes) | Bryan Newbold | 2022-07-15 | 1 | -0/+9
|
* ingest: more bogus domain patterns | Bryan Newbold | 2022-07-15 | 1 | -0/+3
|
* spn2: handle case of re-attempting a recent crawl (race condition) | Bryan Newbold | 2022-07-15 | 1 | -0/+14
|
* html: fulltext URL prefixes to skip; also fix broken pattern matching | Bryan Newbold | 2022-07-15 | 1 | -4/+19
|   Due to both the 'continue-in-a-for-loop' and 'missing-trailing-commas', the existing pattern matching was not working.
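|   Illustration (not taken from the commit diff): a minimal Python sketch of how the two bugs named above can each silently disable prefix skipping. The prefix values and function names are hypothetical; the original code may have looked different.

```python
# Hypothetical skip list. The missing trailing comma after the first entry
# makes Python join the two adjacent string literals into a single pattern.
BROKEN_SKIP_PREFIXES = [
    "https://example.com/static/"
    "https://example.com/ads/",
]

FIXED_SKIP_PREFIXES = [
    "https://example.com/static/",
    "https://example.com/ads/",
]


def filter_urls_broken(urls, prefixes):
    """Looks like it skips matching URLs, but never does."""
    kept = []
    for url in urls:
        for prefix in prefixes:
            if url.startswith(prefix):
                # 'continue' only advances the inner loop over prefixes;
                # the outer loop still reaches kept.append(url).
                continue
        kept.append(url)
    return kept


def filter_urls_fixed(urls, prefixes):
    """Drop any URL that starts with one of the prefixes."""
    # str.startswith accepts a tuple of candidate prefixes.
    return [url for url in urls if not url.startswith(tuple(prefixes))]


if __name__ == "__main__":
    urls = [
        "https://example.com/static/site.css",
        "https://example.com/paper.pdf",
    ]
    print(len(BROKEN_SKIP_PREFIXES))
    print(filter_urls_broken(urls, FIXED_SKIP_PREFIXES))
    print(filter_urls_fixed(urls, FIXED_SKIP_PREFIXES))
```

|   Either bug alone is enough to let every URL through, which is consistent with the commit's note that the existing matching "was not working".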
* row2json script: fix argument type | Bryan Newbold | 2022-07-15 | 1 | -1/+1
|
* row2json script: add flag to enable recrawling | Bryan Newbold | 2022-07-15 | 1 | -1/+8
|
* ingest: another form of cookie block URL | Bryan Newbold | 2022-07-15 | 1 | -0/+2
|   This still doesn't short-cut CDX lookup chain, because that is all pure redirects happening in ia.py.
* HTML ingest: most sub-resource patterns to skip | Bryan Newbold | 2022-07-15 | 1 | -1/+13
|
* cdx lookups: prioritize truly exact URL matches | Bryan Newbold | 2022-07-14 | 1 | -0/+1
|   This hopefully resolves an issue causing many apparent redirect loops, which were actually timing or HTTP status code near-loops caused by http/https fuzzy matching in the CDX API, despite its "exact" lookup semantics.
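|   Illustration (not taken from the commit diff): one way to put character-for-character URL matches ahead of fuzzy http/https matches. The row shape (a dict with a "url" key) and the function name are assumptions made for this sketch, not the project's actual data model.

```python
def prioritize_exact(rows, requested_url):
    """Return rows with exact URL matches first.

    sorted() is stable, so rows keep their existing relative order
    within the exact group and within the non-exact group.
    """
    # False sorts before True, so exact matches (key False) come first.
    return sorted(rows, key=lambda row: row["url"] != requested_url)


if __name__ == "__main__":
    rows = [
        {"url": "http://example.com/a", "status": 301},
        {"url": "https://example.com/a", "status": 200},
    ]
    for row in prioritize_exact(rows, "https://example.com/a"):
        print(row)
```

|   With the exact match first, a lookup for the https URL no longer resolves to the http capture that redirects back to it.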
* ingest: handle another type of wayback redirect | Bryan Newbold | 2022-07-14 | 1 | -2/+5
|
* unpaywall crawl wrap-up notes | Bryan Newbold | 2022-07-14 | 1 | -2/+145
|
* yet another bad PDF | Bryan Newbold | 2022-07-13 | 1 | -0/+1
|
* wayback fetch: handle upstream 5xx replays | Bryan Newbold | 2022-07-13 | 1 | -4/+15
|
* shorten default HTTP backoff factor | Bryan Newbold | 2022-07-13 | 1 | -1/+1
|   The existing factor was resulting in many-minute long backoffs, and Kafka timeouts
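|   Illustration (not taken from the commit diff): the arithmetic behind "many-minute long backoffs". The commit does not state the old or new factor or the retry count, so the numbers below are hypothetical. The sketch assumes a plain exponential schedule, delay = factor * 2**attempt, with no upper cap; real HTTP retry libraries may cap or offset the delays.

```python
from typing import List


def backoff_delays(factor: float, retries: int) -> List[float]:
    """Delay in seconds before each retry, doubling every attempt."""
    return [factor * (2 ** attempt) for attempt in range(retries)]


def total_backoff(factor: float, retries: int) -> float:
    """Total time spent sleeping if every retry is used."""
    return sum(backoff_delays(factor, retries))


if __name__ == "__main__":
    # Hypothetical factors and retry count, for comparison only.
    for factor in (3.0, 1.0):
        total = total_backoff(factor, retries=8)
        print(f"factor={factor}: {total:.0f}s total ({total / 60:.1f} min)")
```

|   Because the total scales linearly with the factor, shrinking the factor is a direct way to keep the worst case under a consumer timeout.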
* ingest: random site PDF link pattern | Bryan Newbold | 2022-07-12 | 1 | -0/+7
|
* ingest: doaj.org article landing page access links | Bryan Newbold | 2022-07-12 | 2 | -1/+12
|
* ingest: targeted 2022-04 notes | Bryan Newbold | 2022-07-07 | 1 | -1/+3
|
* stats: may 2022 ingest-by-domain stats | Bryan Newbold | 2022-07-07 | 1 | -0/+410
|
* ingest: IEEE domain is blocking us | Bryan Newbold | 2022-07-07 | 1 | -1/+2
|
* ingest: catch more ConnectionErrors (SPN, replay fetch, GROBID) | Bryan Newbold | 2022-05-16 | 2 | -4/+19
|