sandcrawler - [no description]

	Commit message	Author	Age	Files	Lines
...
*	ingest: make content-decoding more robust	Bryan Newbold	2020-03-03	1	-1/+2
\|
*	make gzip content-encoding path more robust	Bryan Newbold	2020-03-03	1	-1/+10
\|
*	ingest: crude content-encoding support	Bryan Newbold	2020-03-02	1	-1/+19
\| \| \| \| \| \|	This perhaps should be handled in IA wrapper tool directly, instead of in ingest code. Or really, possibly a bug in wayback python library or SPN?
*	ingest: add force_recrawl flag to skip historical wayback lookup	Bryan Newbold	2020-03-02	1	-3/+5
\|
*	remove protocols.io octet-stream hack	Bryan Newbold	2020-03-02	1	-6/+2
\|
*	more mime normalization	Bryan Newbold	2020-02-27	1	-1/+18
\|
*	ingest: narrow xhtml filter	Bryan Newbold	2020-02-25	1	-1/+1
\|
*	pdftrio: tweaks to avoid connection errors	Bryan Newbold	2020-02-24	1	-1/+9
\|
*	fix warc_offset -> offset	Bryan Newbold	2020-02-24	1	-1/+1
\|
*	ingest: handle broken revisit records	Bryan Newbold	2020-02-24	1	-1/+4
\|
*	ingest: handle missing chemrxiv tag	Bryan Newbold	2020-02-24	1	-1/+1
\|
*	ingest: treat CDX lookup error as a wayback-error	Bryan Newbold	2020-02-24	1	-1/+4
\|
*	ingest: more direct americanarchivist PDF url guess	Bryan Newbold	2020-02-24	1	-0/+4
\|
*	ingest: make ehp.niehs.nih.gov rule more robust	Bryan Newbold	2020-02-24	1	-2/+3
\|
*	small tweak to americanarchivist.org URL extraction	Bryan Newbold	2020-02-24	1	-1/+1
\|
*	fetch_petabox_body: allow non-200 status code fetches	Bryan Newbold	2020-02-24	1	-2/+10
\| \| \| \| \| \|	But only if it matches what the revisit record indicated. This is mostly to enable better revisit fetching.
*	allow fuzzy revisit matches	Bryan Newbold	2020-02-24	1	-1/+26
\|
*	ingest: more revisit fixes	Bryan Newbold	2020-02-22	1	-4/+4
\|
*	html: more publisher-specific fulltext extraction tricks	Bryan Newbold	2020-02-22	1	-0/+47
\|
*	ia: improve warc/revisit implementation	Bryan Newbold	2020-02-22	1	-26/+46
\| \| \| \| \|	A lot of the terminal-bad-status seems to have been due to not handling revisits correctly. They have status_code = '-' or None.
*	html: degruyter extraction; disabled journals.lww.com	Bryan Newbold	2020-02-22	1	-0/+19
\|
*	ingest: include better terminal URL/status_code/dt	Bryan Newbold	2020-02-22	1	-0/+8
\| \| \| \|	Was getting a lot of "last hit" metadata for these columns.
*	ingest: skip more non-pdf, non-paper domains	Bryan Newbold	2020-02-22	1	-0/+9
\|
*	cdx: handle empty/null CDX response	Bryan Newbold	2020-02-22	1	-0/+2
\| \| \| \|	Sometimes seem to get empty string instead of empty JSON list
*	html: handle TypeError during bs4 parse	Bryan Newbold	2020-02-22	1	-1/+7
\|
*	filter out CDX rows missing WARC playback fields	Bryan Newbold	2020-02-19	1	-0/+4
\|
*	pdf_trio persist fixes from prod	Bryan Newbold	2020-02-19	2	-5/+9
\|
*	allow <meta property=citation_pdf_url>	Bryan Newbold	2020-02-18	1	-0/+3
\| \| \| \|	at least researchgate does this (!)
*	X-Archive-Src more robust than X-Archive-Redirect-Reason	Bryan Newbold	2020-02-18	1	-2/+3
\|
*	wayback: on bad redirects, log instead of assert	Bryan Newbold	2020-02-18	1	-2/+13
\| \| \| \|	This is a different form of mangled redirect.
*	attempt to work around corrupt ARC files from alexa issue	Bryan Newbold	2020-02-18	1	-0/+5
\|
*	unpaywall2ingestrequest transform script	Bryan Newbold	2020-02-18	2	-1/+104
\|
*	pdftrio: mode controlled by CLI arg	Bryan Newbold	2020-02-18	2	-10/+14
\|
*	pdftrio: fix error nesting in pdftrio key	Bryan Newbold	2020-02-18	1	-12/+20
\|
*	include rel and oa_status in ingest request 'extra'	Bryan Newbold	2020-02-18	2	-2/+2
\|
*	ingest: bulk workers don't hit SPNv2	Bryan Newbold	2020-02-13	1	-0/+2
\|
*	pdftrio fixes from testing	Bryan Newbold	2020-02-13	1	-3/+9
\|
*	move pdf_trio results back under key in JSON/Kafka	Bryan Newbold	2020-02-13	2	-7/+31
\|
*	pdftrio: small fixes from testing	Bryan Newbold	2020-02-12	1	-2/+2
\|
*	pdftrio basic python code	Bryan Newbold	2020-02-12	7	-1/+393
\| \| \| \|	This is basically just a copy/paste of GROBID code, only simpler!
*	add ingestrequest_row2json.py	Bryan Newbold	2020-02-05	1	-0/+48
\|
*	fix persist bug where ingest_request_source not saved	Bryan Newbold	2020-02-05	1	-0/+1
\|
*	fix bug where ingest_request extra fields not persisted	Bryan Newbold	2020-02-05	1	-1/+2
\|
*	handle alternative dt format in WARC headers	Bryan Newbold	2020-02-05	1	-2/+4
\| \| \| \| \|	If there is a UTC timestamp, with trailing 'Z' indicating timezone, that is valid but increases string length by one.
*	decrease SPNv2 polling timeout to 3 minutes	Bryan Newbold	2020-02-05	1	-2/+2
\|
*	improvements to reliability from prod testing	Bryan Newbold	2020-02-03	2	-7/+20
\|
*	hack-y backoff ingest attempt	Bryan Newbold	2020-02-03	2	-3/+26
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The goal here is to have SPNv2 requests back off when we get back-pressure (usually caused by some sessions taking too long). Lack of proper back-pressure is making it hard to turn up parallelism. This is a hack because we still time out and drop the slow request. A better way is probably to have a background thread run, while the KafkaPusher thread does polling. Maybe with timeouts to detect slow processing (greater than 30 seconds?) and only pause/resume in that case. This would also make taking batches easier. Unlike the existing code, however, the parallelism needs to happen at the Pusher level to do the polling (Kafka) and "await" (for all worker threads to complete) correctly.
*	grobid petabox: fix fetch body/content	Bryan Newbold	2020-02-03	1	-1/+1
\|
*	wayback: try to resolve HTTPException due to many HTTP headers	Bryan Newbold	2020-02-02	1	-1/+9
\| \| \| \| \| \| \| \| \|	This is within GWB wayback code. Trying two things: (1) bump default max headers from 100 to 1000 in the (global?) http.client module itself (I didn't think through whether we would expect this to actually work); (2) catch the exception, record it, move on.
*	sandcrawler_worker: ingest worker distinct consumer groups	Bryan Newbold	2020-01-29	1	-1/+3
\| \| \| \| \| \|	I'm in the process of resetting these consumer groups, so might as well take the opportunity to split by topic and use the new canonical naming format.