sandcrawler - [no description]

	Commit message	Author	Age	Files	Lines
*	catch poppler 'ValueError' when parsing PDFs	Bryan Newbold	2022-09-14	1	-1/+2
\| \| \| \| \|	Seeing a spike in bad PDFs in the past week or so, while processing old failed ingests. Should really switch from poppler to muPDF.
*	bad PDF sha1	Bryan Newbold	2022-09-12	1	-0/+4
\|
*	bad PDF sha1	Bryan Newbold	2022-09-11	1	-0/+2
\|
*	another bad PDF sha1	Bryan Newbold	2022-09-09	1	-0/+1
\|
*	yet more bad PDF hashes	Bryan Newbold	2022-09-08	1	-0/+4
\|
*	pipenv: removed unused deps; re-lock deps	Bryan Newbold	2022-09-07	2	-783/+767
\|
*	sandcrawler SQL-based status (sept 2022)	Bryan Newbold	2022-09-07	1	-0/+438
\|
*	summer 2022 ingest notes	Bryan Newbold	2022-09-06	3	-0/+389
\|
*	html ingest: handle TEI-XML parse error	Bryan Newbold	2022-07-28	1	-1/+4
\|
*	yet another bad PDF sha1	Bryan Newbold	2022-07-27	1	-0/+1
\|
*	CDX: skip sha-256 digests	Bryan Newbold	2022-07-25	1	-1/+5
\|
*	yet another bad SHA1 PDF hash	Bryan Newbold	2022-07-24	1	-0/+1
\|
*	misc ingest fixes	Bryan Newbold	2022-07-21	1	-0/+831
\|
*	ingest: bump max-hops from 6 to 8	Bryan Newbold	2022-07-20	1	-1/+1
\|
*	ingest: more PDF fulltext tricks	Bryan Newbold	2022-07-20	2	-0/+36
\|
*	ingest: more PDF fulltext URL patterns	Bryan Newbold	2022-07-20	1	-0/+42
\|
*	doaj and unpaywall transforms: more domains to skip	Bryan Newbold	2022-07-20	2	-3/+1
\|
*	ingest: record bad GZIP transfer decode, instead of crashing (HTML)	Bryan Newbold	2022-07-18	1	-1/+4
\|
*	make fmt	Bryan Newbold	2022-07-18	1	-1/+0
\|
*	cdx: tweak CDX lookups and resolution (sort)	Bryan Newbold	2022-07-16	1	-4/+7
\|
*	html ingest: allow fuzzy CDX sha1 match based on encoding/not-encoding	Bryan Newbold	2022-07-16	1	-3/+10
\|
*	HTML: no longer extracting citation_pdf_url in main extract function	Bryan Newbold	2022-07-16	1	-24/+0
\|
*	html: mangled JSON-in-URL pattern	Bryan Newbold	2022-07-15	1	-0/+1
\|
*	html: remove old citation_pdf_url code path	Bryan Newbold	2022-07-15	1	-32/+1
\| \| \| \| \|	This code path doesn't check for 'skip' patterns, resulting in a bunch of bad CDX checks/errors
*	wayback: use same 5xx/4xx-allowing tricks for replay body fetch as for replay redirect	Bryan Newbold	2022-07-15	1	-7/+7
\|
*	cdx api: add another allowable URL fuzzy-match pattern (double slashes)	Bryan Newbold	2022-07-15	1	-0/+9
\|
*	ingest: more bogus domain patterns	Bryan Newbold	2022-07-15	1	-0/+3
\|
*	spn2: handle case of re-attempting a recent crawl (race condition)	Bryan Newbold	2022-07-15	1	-0/+14
\|
*	html: fulltext URL prefixes to skip; also fix broken pattern matching	Bryan Newbold	2022-07-15	1	-4/+19
\| \| \| \| \|	Due to both the 'continue-in-a-for-loop' and 'missing-trailing-commas', the existing pattern matching was not working.
*	row2json script: fix argument type	Bryan Newbold	2022-07-15	1	-1/+1
\|
*	row2json script: add flag to enable recrawling	Bryan Newbold	2022-07-15	1	-1/+8
\|
*	ingest: another form of cookie block URL	Bryan Newbold	2022-07-15	1	-0/+2
\| \| \| \| \|	This still doesn't short-cut CDX lookup chain, because that is all pure redirects happening in ia.py.
*	HTML ingest: most sub-resource patterns to skip	Bryan Newbold	2022-07-15	1	-1/+13
\|
*	cdx lookups: prioritize truly exact URL matches	Bryan Newbold	2022-07-14	1	-0/+1
\| \| \| \| \| \|	This hopefully resolves an issue causing many apparent redirect loops, which were actually timing or HTTP status code near-loops with http/https fuzzy matching in CDX API, despite "exact" API lookup semantics.
*	ingest: handle another type of wayback redirect	Bryan Newbold	2022-07-14	1	-2/+5
\|
*	unpaywall crawl wrap-up notes	Bryan Newbold	2022-07-14	1	-2/+145
\|
*	yet another bad PDF	Bryan Newbold	2022-07-13	1	-0/+1
\|
*	wayback fetch: handle upstream 5xx replays	Bryan Newbold	2022-07-13	1	-4/+15
\|
*	shorten default HTTP backoff factor	Bryan Newbold	2022-07-13	1	-1/+1
\| \| \| \| \|	The existing factor was resulting in many-minute long backoffs, and Kafka timeouts
*	ingest: random site PDF link pattern	Bryan Newbold	2022-07-12	1	-0/+7
\|
*	ingest: doaj.org article landing page access links	Bryan Newbold	2022-07-12	2	-1/+12
\|
*	ingest: targeted 2022-04 notes	Bryan Newbold	2022-07-07	1	-1/+3
\|
*	stats: may 2022 ingest-by-domain stats	Bryan Newbold	2022-07-07	1	-0/+410
\|
*	ingest: IEEE domain is blocking us	Bryan Newbold	2022-07-07	1	-1/+2
\|
*	ingest: catch more ConnectionErrors (SPN, replay fetch, GROBID)	Bryan Newbold	2022-05-16	2	-4/+19
\|
*	ingest: skip arxiv.org DOIs, we already direct-ingest	Bryan Newbold	2022-05-11	1	-0/+1
\|
*	python make fmt	Bryan Newbold	2022-05-05	1	-3/+1
\|
*	ingest spn2: fix tests	Bryan Newbold	2022-05-05	4	-6/+108
\|
*	ingest: more loginwall patterns	Bryan Newbold	2022-05-05	1	-0/+3
\|
*	ingest_tool: fix arg parsing	Bryan Newbold	2022-05-03	1	-8/+8
\|
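
The 2022-07-15 commit "html: fulltext URL prefixes to skip; also fix broken pattern matching" names two causes: 'continue-in-a-for-loop' and 'missing-trailing-commas'. The log does not show the actual code, so the following is only a minimal sketch of how those two Python pitfalls can silently disable prefix matching. All names and URLs here (`BROKEN_PREFIXES`, `FIXED_PREFIXES`, `should_skip`, `filter_broken`, `filter_fixed`, the example.* domains) are hypothetical and not taken from sandcrawler.

```python
# Hypothetical skip list; NOT sandcrawler's real patterns.
# Pitfall 1: a missing trailing comma makes Python concatenate adjacent
# string literals, so two intended prefixes become one useless string.
BROKEN_PREFIXES = [
    "https://example.com/login"
    "https://example.org/cookie",
]

FIXED_PREFIXES = [
    "https://example.com/login",
    "https://example.org/cookie",
]


def should_skip(url, prefixes):
    """Return True if the URL starts with any of the given prefixes."""
    return any(url.startswith(prefix) for prefix in prefixes)


def filter_broken(urls, prefixes):
    """Pitfall 2: `continue` only advances the inner loop."""
    kept = []
    for url in urls:
        for prefix in prefixes:
            if url.startswith(prefix):
                continue  # skips to the next prefix, not the next url
        kept.append(url)  # always reached, so nothing is filtered out
    return kept


def filter_fixed(urls, prefixes):
    """Keep only URLs that match none of the skip prefixes."""
    return [url for url in urls if not should_skip(url, prefixes)]


if __name__ == "__main__":
    urls = ["https://example.com/login?next=/a", "https://example.net/paper.pdf"]
    print(len(BROKEN_PREFIXES), len(FIXED_PREFIXES))
    print(filter_broken(urls, FIXED_PREFIXES))
    print(filter_fixed(urls, FIXED_PREFIXES))
```

Either pitfall alone is enough to let every URL through, which would be consistent with the commit's description of the matching as "not working"; whether the original code looked like this is an assumption.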