sandcrawler - [no description]

	Commit message	Author	Age	Files	Lines
*	python-specific README file	Bryan Newbold	2023-01-02	2	-7/+46
\|
*	bump python deps	Bryan Newbold	2022-12-23	2	-685/+700
\|
*	bad pdf hash	Bryan Newbold	2022-12-16	1	-0/+1
\|
*	sandcrawler: try to handle weird CDX API response	Bryan Newbold	2022-11-01	1	-0/+5
\|	Hard to debug this because sentry is broken.
*	ingest: more generic OJS support, including pre-prints	Bryan Newbold	2022-10-24	1	-6/+22
\|	There were some '/article/view/' patterns which can also be, e.g., '/preprint/view/'.
*	ingest: more generic PDF fulltext URL patterns	Bryan Newbold	2022-10-24	1	-0/+14
\|
*	ingest: another wall pattern, and check for walls in more places	Bryan Newbold	2022-10-24	1	-1/+14
\|
*	ingest: don't prefer WARC over SPN so strongly	Bryan Newbold	2022-10-24	1	-1/+2
\|	We generally prefer an older WARC record over an SPN record, because the lookup is easier. But, this was causing problems with repeated ingest, so demote it. We may want to make this more configurable in the future, so things like HTML sub-resource lookups or bulk ingest won't prefer random new SPN captures.
*	html: worldscientific PDF URL extraction	Bryan Newbold	2022-10-24	1	-0/+16
\|
*	html: pubpub platform detection	Bryan Newbold	2022-10-24	1	-0/+2
\|
*	persist: skip huge URLs	Bryan Newbold	2022-09-28	1	-0/+4
\|	and fix some minor doc typos
*	filesets: handle unknown file sizes (mypy lint fix)	Bryan Newbold	2022-09-28	1	-1/+1
\|
*	update oai-pmh ingest request transform script	Bryan Newbold	2022-09-28	1	-2/+38
\|
*	pytest: suppress another DeprecationWarning	Bryan Newbold	2022-09-14	1	-0/+1
\|
*	spn2: fix tests by not retrying on HTTP 500	Bryan Newbold	2022-09-14	1	-1/+3
\|
*	catch poppler 'ValueError' when parsing PDFs	Bryan Newbold	2022-09-14	1	-1/+2
\|	Seeing a spike in bad PDFs in the past week or so, while processing old failed ingests. Should really switch from poppler to MuPDF.
*	bad PDF sha1	Bryan Newbold	2022-09-12	1	-0/+4
\|
*	bad PDF sha1	Bryan Newbold	2022-09-11	1	-0/+2
\|
*	another bad PDF sha1	Bryan Newbold	2022-09-09	1	-0/+1
\|
*	yet more bad PDF hashes	Bryan Newbold	2022-09-08	1	-0/+4
\|
*	pipenv: removed unused deps; re-lock deps	Bryan Newbold	2022-09-07	2	-783/+767
\|
*	html ingest: handle TEI-XML parse error	Bryan Newbold	2022-07-28	1	-1/+4
\|
*	yet another bad PDF sha1	Bryan Newbold	2022-07-27	1	-0/+1
\|
*	CDX: skip sha-256 digests	Bryan Newbold	2022-07-25	1	-1/+5
\|
*	yet another bad SHA1 PDF hash	Bryan Newbold	2022-07-24	1	-0/+1
\|
*	ingest: bump max-hops from 6 to 8	Bryan Newbold	2022-07-20	1	-1/+1
\|
*	ingest: more PDF fulltext tricks	Bryan Newbold	2022-07-20	2	-0/+36
\|
*	ingest: more PDF fulltext URL patterns	Bryan Newbold	2022-07-20	1	-0/+42
\|
*	doaj and unpaywall transforms: more domains to skip	Bryan Newbold	2022-07-20	2	-3/+1
\|
*	ingest: record bad GZIP transfer decode, instead of crashing (HTML)	Bryan Newbold	2022-07-18	1	-1/+4
\|
*	make fmt	Bryan Newbold	2022-07-18	1	-1/+0
\|
*	cdx: tweak CDX lookups and resolution (sort)	Bryan Newbold	2022-07-16	1	-4/+7
\|
*	html ingest: allow fuzzy CDX sha1 match based on encoding/not-encoding	Bryan Newbold	2022-07-16	1	-3/+10
\|
*	HTML: no longer extracting citation_pdf_url in main extract function	Bryan Newbold	2022-07-16	1	-24/+0
\|
*	html: mangled JSON-in-URL pattern	Bryan Newbold	2022-07-15	1	-0/+1
\|
*	html: remove old citation_pdf_url code path	Bryan Newbold	2022-07-15	1	-32/+1
\|	This code path doesn't check for 'skip' patterns, resulting in a bunch of bad CDX checks/errors.
*	wayback: use same 5xx/4xx-allowing tricks for replay body fetch as for replay redirect	Bryan Newbold	2022-07-15	1	-7/+7
\|
*	cdx api: add another allowable URL fuzzy-match pattern (double slashes)	Bryan Newbold	2022-07-15	1	-0/+9
\|
*	ingest: more bogus domain patterns	Bryan Newbold	2022-07-15	1	-0/+3
\|
*	spn2: handle case of re-attempting a recent crawl (race condition)	Bryan Newbold	2022-07-15	1	-0/+14
\|
*	html: fulltext URL prefixes to skip; also fix broken pattern matching	Bryan Newbold	2022-07-15	1	-4/+19
\|	Due to both the 'continue-in-a-for-loop' and 'missing-trailing-commas', the existing pattern matching was not working.
*	row2json script: fix argument type	Bryan Newbold	2022-07-15	1	-1/+1
\|
*	row2json script: add flag to enable recrawling	Bryan Newbold	2022-07-15	1	-1/+8
\|
*	ingest: another form of cookie block URL	Bryan Newbold	2022-07-15	1	-0/+2
\|	This still doesn't short-cut the CDX lookup chain, because that is all pure redirects happening in ia.py.
*	HTML ingest: most sub-resource patterns to skip	Bryan Newbold	2022-07-15	1	-1/+13
\|
*	cdx lookups: prioritize truly exact URL matches	Bryan Newbold	2022-07-14	1	-0/+1
\|	This hopefully resolves an issue causing many apparent redirect loops, which were actually timing or HTTP status code near-loops with http/https fuzzy matching in the CDX API, despite its "exact" lookup semantics.
*	ingest: handle another type of wayback redirect	Bryan Newbold	2022-07-14	1	-2/+5
\|
*	yet another bad PDF	Bryan Newbold	2022-07-13	1	-0/+1
\|
*	wayback fetch: handle upstream 5xx replays	Bryan Newbold	2022-07-13	1	-4/+15
\|
*	shorten default HTTP backoff factor	Bryan Newbold	2022-07-13	1	-1/+1
\|	The existing factor was resulting in many-minute-long backoffs, and Kafka timeouts.