sandcrawler

	Commit message	Author	Age	Files	Lines
*	handle UnboundLocalError in HTML parsing	Bryan Newbold	2020-05-19	1	-1/+4
*	hotfix for html meta extract codepath	Bryan Newbold	2020-05-03	1	-1/+1
*	ingest: handle partial citation_pdf_url tag	Bryan Newbold	2020-05-03	1	-0/+3
*	workers: add missing want() dataflow path	Bryan Newbold	2020-04-30	1	-0/+9
*	ingest: don't 'want' non-PDF ingest	Bryan Newbold	2020-04-30	1	-0/+5
*	timeouts: don't push through None error messages	Bryan Newbold	2020-04-29	1	-2/+2
*	timeout message implementation for GROBID and ingest workers	Bryan Newbold	2020-04-27	2	-0/+18
*	worker timeout wrapper, and use for kafka	Bryan Newbold	2020-04-27	1	-2/+40
*	fix KeyError in HTML PDF URL extraction	Bryan Newbold	2020-04-17	1	-1/+1
*	persist: only GROBID updates file_meta, not file-result	Bryan Newbold	2020-04-16	1	-1/+1
*	batch/multiprocess for ZipfilePusher	Bryan Newbold	2020-04-16	1	-3/+18
*	ingest: quick hack to capture CNKI outlinks	Bryan Newbold	2020-04-13	1	-2/+9
*	html: attempt at CNKI href extraction	Bryan Newbold	2020-04-13	1	-0/+11
*	ia: set User-Agent for replay fetch from wayback	Bryan Newbold	2020-03-29	1	-0/+5
*	ingest: block another large domain (and DOI prefix)	Bryan Newbold	2020-03-27	1	-0/+2
*	ingest: better spn2 pending error code	Bryan Newbold	2020-03-27	1	-0/+2
*	ingest: eurosurveillance PDF parser	Bryan Newbold	2020-03-25	1	-0/+11
*	ia: more conservative use of clean_url()	Bryan Newbold	2020-03-24	1	-3/+5
*	ingest: clean_url() in more places	Bryan Newbold	2020-03-23	3	-1/+6
*	persist grobid: add option to skip S3 upload	Bryan Newbold	2020-03-19	1	-7/+10
*	ingest: log every URL (from ia code side)	Bryan Newbold	2020-03-18	1	-0/+1
*	implement (unused) force_get flag for SPN2	Bryan Newbold	2020-03-18	2	-4/+19
*	work around local redirect (resource.location)	Bryan Newbold	2020-03-17	1	-1/+6
*	Merge branch 'martin-abstract-class-process' into 'master'	bnewbold	2020-03-12	1	-0/+6
|\
| *	workers: add explicit process to base class	Martin Czygan	2020-03-12	1	-0/+6
* |	url cleaning (canonicalization) for ingest base_url	Bryan Newbold	2020-03-10	3	-3/+14
|/
*	fixes to ingest-request persist	Bryan Newbold	2020-03-05	1	-3/+1
*	persist: ingest_request tool (with no ingest_file_result)	Bryan Newbold	2020-03-05	2	-1/+30
*	ia: catch wayback ChunkedEncodingError	Bryan Newbold	2020-03-05	1	-0/+3
*	ingest: make content-decoding more robust	Bryan Newbold	2020-03-03	1	-1/+2
*	make gzip content-encoding path more robust	Bryan Newbold	2020-03-03	1	-1/+10
*	ingest: crude content-encoding support	Bryan Newbold	2020-03-02	1	-1/+19
*	ingest: add force_recrawl flag to skip historical wayback lookup	Bryan Newbold	2020-03-02	1	-3/+5
*	remove protocols.io octet-stream hack	Bryan Newbold	2020-03-02	1	-6/+2
*	more mime normalization	Bryan Newbold	2020-02-27	1	-1/+18
*	ingest: narrow xhtml filter	Bryan Newbold	2020-02-25	1	-1/+1
*	pdftrio: tweaks to avoid connection errors	Bryan Newbold	2020-02-24	1	-1/+9
*	fix warc_offset -> offset	Bryan Newbold	2020-02-24	1	-1/+1
*	ingest: handle broken revisit records	Bryan Newbold	2020-02-24	1	-1/+4
*	ingest: handle missing chemrxvi tag	Bryan Newbold	2020-02-24	1	-1/+1
*	ingest: treat CDX lookup error as a wayback-error	Bryan Newbold	2020-02-24	1	-1/+4
*	ingest: more direct americanarchivist PDF url guess	Bryan Newbold	2020-02-24	1	-0/+4
*	ingest: make ehp.niehs.nih.gov rule more robust	Bryan Newbold	2020-02-24	1	-2/+3
*	small tweak to americanarchivist.org URL extraction	Bryan Newbold	2020-02-24	1	-1/+1
*	fetch_petabox_body: allow non-200 status code fetches	Bryan Newbold	2020-02-24	1	-2/+10
*	allow fuzzy revisit matches	Bryan Newbold	2020-02-24	1	-1/+26
*	ingest: more revisit fixes	Bryan Newbold	2020-02-22	1	-4/+4
*	html: more publisher-specific fulltext extraction tricks	Bryan Newbold	2020-02-22	1	-0/+47
*	ia: improve warc/revisit implementation	Bryan Newbold	2020-02-22	1	-26/+46
*	html: degruyter extraction; disabled journals.lww.com	Bryan Newbold	2020-02-22	1	-0/+19