sandcrawler - [no description]

	Commit message	Author	Age	Files	Lines
...
*	tune SPN CDX retry/wait depending on mode (priority vs daily)	Bryan Newbold	2021-09-30	2	-3/+5
\|
*	yet another bad PDF sha1	Bryan Newbold	2021-09-30	1	-0/+1
\|
*	old HTML extractors: handle null tag	Bryan Newbold	2021-09-08	1	-8/+9
\|
*	ingest: more block patterns, for huge databases	Bryan Newbold	2021-09-08	1	-1/+4
\|
*	yet more PDF sha1 to skip	Bryan Newbold	2021-09-03	1	-0/+5
\|
*	yet more PDF URL patterns	Bryan Newbold	2021-09-03	1	-0/+48
\|
*	ingest: check URL blocklist again after redirects	Bryan Newbold	2021-09-03	1	-0/+7
\|
*	refactor and expand wall/block/cookie URL patterns	Bryan Newbold	2021-09-03	1	-6/+25
\|
*	HTML ingest: several more PDF fulltext URL patterns	Bryan Newbold	2021-09-03	1	-0/+87
\|
*	HTML ingest: skip noisy print() statement	Bryan Newbold	2021-09-03	1	-1/+1
\|
*	HTML ingest: more meta-URI prefixes	Bryan Newbold	2021-08-24	1	-2/+8
\|
*	html ingest: detect some blog platforms, and allow lower wordcount threshold	Bryan Newbold	2021-08-16	1	-0/+6
\|
*	html ingest: detect domain homepage (no path) as special case	Bryan Newbold	2021-08-16	1	-0/+8
\|
*	html ingest: skip 'about:blank'	Bryan Newbold	2021-08-16	1	-0/+3
\|	Couldn't get adblock rule matcher to match this, for some reason. Maybe a special case?
*	more bad PDF hashes	Bryan Newbold	2021-07-26	1	-0/+2
\|
*	ingest: fix postgrest lookup bug (double get of GROBID)	Bryan Newbold	2021-07-26	1	-1/+1
\|
*	more blocked-cookie patterns; fix old typo	Bryan Newbold	2021-07-14	1	-2/+2
\|
*	another bad PDF sha1	Bryan Newbold	2021-07-13	1	-0/+1
\|
*	crawl: SPN2 non-200 success code path	Bryan Newbold	2021-07-13	1	-11/+25
\|
*	crawl: SPN self-redirect hack	Bryan Newbold	2021-07-13	1	-0/+9
\|
*	crawl: small comment updates	Bryan Newbold	2021-07-13	1	-3/+6
\|
*	add crossref postgrest fetch support to python db helpers	Bryan Newbold	2021-06-02	1	-0/+9
\|
*	ingest: fix html PDF extraction exception catch behavior	Bryan Newbold	2021-05-24	1	-3/+2
\|
*	ingest PDF extraction updates	Bryan Newbold	2021-05-21	3	-2/+74
\|
*	better OSF preprint download re-writing	Bryan Newbold	2021-05-21	1	-6/+23
\|
*	html ingest: remove whitespace around relative URLs (eg, for d-lib)	Bryan Newbold	2021-05-21	1	-1/+1
\|
*	ingest: cap max body size to ~128 MByte	Bryan Newbold	2021-04-27	1	-0/+6
\|	Should resolve 'magic' OOM errors in production.
*	persist: skip very long URLs	Bryan Newbold	2021-04-12	1	-0/+4
\|
*	update default postgrest ('db') API endpoint	Bryan Newbold	2021-04-09	1	-1/+1
\|
*	grobid: disable biblio-glutton consolidation	Bryan Newbold	2021-04-07	1	-3/+3
\|
*	ingest: handle current degruyter PDF link pattern	Bryan Newbold	2021-03-26	1	-0/+8
\|
*	pdf: yet more bad SHA1 (commiting lines from prod)	Bryan Newbold	2021-01-05	1	-0/+20
\|
*	ia CDX: handle bad CDX rows	Bryan Newbold	2021-01-05	1	-2/+4
\|
*	spn: more status codes	Bryan Newbold	2020-12-21	1	-1/+2
\|
*	persist: html_meta is ON CONFLICT DO UPDATE	Bryan Newbold	2020-12-15	1	-1/+1
\|
*	persist: don't expect HTML TEI-XML in result object	Bryan Newbold	2020-12-15	1	-1/+1
\|
*	handle more wayback error conditions	Bryan Newbold	2020-11-20	1	-0/+6
\|
*	html: more conservative parsing of element attr	Bryan Newbold	2020-11-20	1	-2/+4
\|
*	xml: catch parse error	Bryan Newbold	2020-11-19	1	-3/+8
\|
*	spn 'forbidden' status code	Bryan Newbold	2020-11-12	1	-1/+1
\|
*	html biblio: handle 'content not in attrs' case	Bryan Newbold	2020-11-12	1	-2/+2
\|
*	DOAJ and HTML ingest tweaks from QA run	Bryan Newbold	2020-11-10	1	-2/+2
\|
*	html: handle more traf error cases	Bryan Newbold	2020-11-08	1	-2/+2
\|
*	html: more adblock	Bryan Newbold	2020-11-08	1	-1/+3
\|
*	ingest: small html_bibli typo	Bryan Newbold	2020-11-08	1	-1/+1
\|
*	html: most small platform tweaks	Bryan Newbold	2020-11-08	1	-5/+4
\|
*	move fuzzy URL match method to misc	Bryan Newbold	2020-11-08	3	-19/+20
\|
*	move some PDF URL extraction into declarative format	Bryan Newbold	2020-11-08	3	-134/+174
\|
*	ingest: default to html_biblio for PDF URL extraction	Bryan Newbold	2020-11-08	1	-24/+17
\|
*	ingest: shorted scope+platform keys; use html_biblio extraction for PDFs	Bryan Newbold	2020-11-08	1	-15/+35
\|