sandcrawler - [no description]

	Commit message (Collapse)	Author	Age	Files	Lines
*	pipenv: lock minio S3 library to <7.0.0	Bryan Newbold	2021-01-14	2	-242/+196
	In this upstream commit: https://github.com/minio/minio-py/commit/b81883a98e6f8a09e2903609caabbf0956dd0ec9 The API for errors changes, which makes it harder for us to catch specific exceptions (such as "NoSuchKey" as a Not Found / 404 error). Instead of refactoring, just going to pin the library. We should probably replace this library with a non-implementation-specific S3 client at some point; minio seems simpler than, e.g., boto3, but there is probably something even simpler out there.
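	The exception-handling concern described in this commit message can be sketched without importing any S3 library. This is a minimal, hypothetical illustration, not code from sandcrawler: the names `s3_error_code` and `is_not_found` are invented here, and it assumes (from the commit message only, not verified against minio-py) that one style of client raises an exception class named after the S3 error code, while another raises a generic error object carrying a `code` string.

```python
from typing import Optional


def s3_error_code(exc: BaseException) -> Optional[str]:
    """Return a best-effort S3 error code for an exception.

    Assumption: newer-style clients attach the S3 error code as a string
    attribute named `code`; older-style clients use one exception class
    per error, so the class name itself stands in for the code.
    """
    code = getattr(exc, "code", None)
    if isinstance(code, str) and code:
        return code
    # Fall back to the exception class name (e.g. "NoSuchKey").
    return type(exc).__name__


def is_not_found(exc: BaseException) -> bool:
    """True if the exception looks like an S3 "NoSuchKey" (404) error."""
    return s3_error_code(exc) == "NoSuchKey"
```

	A caller could wrap an object fetch in `try`/`except Exception as e` and treat `is_not_found(e)` as a 404, re-raising anything else. Pinning the dependency, as the commit does, avoids needing a shim like this at all.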
*	more expansive python/.gitignore rules (all .gz)	Bryan Newbold	2021-01-05	1	-1/+1
\|
*	doaj ingest request updates (from prod)	Bryan Newbold	2021-01-05	1	-1/+5
\|
*	python makefile: don't duplicate 'lint' commands during 'test'	Bryan Newbold	2021-01-05	1	-2/+0
\|
*	update to python3.8	Bryan Newbold	2021-01-05	2	-400/+413
\|
*	pdf: yet more bad SHA1 (committing lines from prod)	Bryan Newbold	2021-01-05	1	-0/+20
\|
*	ia CDX: handle bad CDX rows	Bryan Newbold	2021-01-05	1	-2/+4
\|
*	spn: more status codes	Bryan Newbold	2020-12-21	1	-1/+2
\|
*	persist: html_meta is ON CONFLICT DO UPDATE	Bryan Newbold	2020-12-15	1	-1/+1
\|
*	persist: don't expect HTML TEI-XML in result object	Bryan Newbold	2020-12-15	1	-1/+1
\|
*	handle more wayback error conditions	Bryan Newbold	2020-11-20	1	-0/+6
\|
*	html: more conservative parsing of element attr	Bryan Newbold	2020-11-20	1	-2/+4
\|
*	xml: catch parse error	Bryan Newbold	2020-11-19	1	-3/+8
\|
*	spn 'forbidden' status code	Bryan Newbold	2020-11-12	1	-1/+1
\|
*	html biblio: handle 'content not in attrs' case	Bryan Newbold	2020-11-12	1	-2/+2
\|
*	blacklist -> denylist	Bryan Newbold	2020-11-10	2	-9/+9
\|
*	pipenv: updates (mostly for trafilatura 0.6.0)	Bryan Newbold	2020-11-10	1	-25/+32
\|
*	DOAJ and HTML ingest tweaks from QA run	Bryan Newbold	2020-11-10	2	-3/+3
\|
*	html: handle more traf error cases	Bryan Newbold	2020-11-08	1	-2/+2
\|
*	html: more adblock	Bryan Newbold	2020-11-08	1	-1/+3
\|
*	ingest: small html_bibli typo	Bryan Newbold	2020-11-08	1	-1/+1
\|
*	html: most small platform tweaks	Bryan Newbold	2020-11-08	1	-5/+4
\|
*	move fuzzy URL match method to misc	Bryan Newbold	2020-11-08	3	-19/+20
\|
*	move some PDF URL extraction into declarative format	Bryan Newbold	2020-11-08	5	-143/+177
\|
*	basic DOAJ ingest request conversion script	Bryan Newbold	2020-11-08	1	-0/+139
\|
*	ingest: default to html_biblio for PDF URL extraction	Bryan Newbold	2020-11-08	1	-24/+17
\|
*	ingest: shorted scope+platform keys; use html_biblio extraction for PDFs	Bryan Newbold	2020-11-08	1	-15/+35
\|
*	html: more robust ingest; better platform and scope detection	Bryan Newbold	2020-11-08	1	-32/+96
\|
*	html: more extraction patterns; bugfix; skip more crossmark	Bryan Newbold	2020-11-08	1	-1/+24
\|
*	ingest html: return better status based on sniffed scope	Bryan Newbold	2020-11-08	1	-9/+31
\|
*	ingest tool: more ingest control args	Bryan Newbold	2020-11-08	1	-1/+10
\|
*	spn2-internal-server-error is a problem with remote server, not SPN2	Bryan Newbold	2020-11-08	1	-0/+2
\|
*	ingest: better non-full URL fixup	Bryan Newbold	2020-11-08	1	-4/+3
\|
*	html: small ingest improvements	Bryan Newbold	2020-11-08	2	-0/+19
\|
*	html: start improving scope detection	Bryan Newbold	2020-11-08	2	-5/+49
\|
*	ingest: retain html_biblio through hops; all ingest types	Bryan Newbold	2020-11-08	1	-1/+13
\|
*	ingest tool: flag for HTML quick mode (CDX-only)	Bryan Newbold	2020-11-08	2	-1/+6
\|
*	html: try to detect and mark XHTML (vs. HTML or XML)	Bryan Newbold	2020-11-08	2	-4/+6
\|
*	gen_file_metadata: allow empty/null bodies (if flag set)	Bryan Newbold	2020-11-08	2	-3/+5
	This is for HTML sub-resources, which can validly be empty (I think)
*	html: missing fetch is wayback-content-error, not wayback-error	Bryan Newbold	2020-11-08	1	-2/+2
\|
*	direct some more warnings to sys.stderr, not stdout	Bryan Newbold	2020-11-08	1	-2/+2
\|
*	html: handle no-capture for sub-resources	Bryan Newbold	2020-11-08	3	-9/+13
\|
*	ingest tool: consistency about ingest-type arg	Bryan Newbold	2020-11-08	1	-2/+2
\|
*	ingest: fix null-body case	Bryan Newbold	2020-11-08	2	-0/+6
	Broke this in earlier refactor.
*	remove unused pytype tool	Bryan Newbold	2020-11-06	3	-76/+25
	Having trouble getting this to install on Xenial, and we aren't even using it in tests/lint yet. Can revisit after Focal upgrade.
*	many bad PDF sha1 from prod	Bryan Newbold	2020-11-06	1	-0/+36
\|
*	html: catch and report exceptions at process_hit() stage	Bryan Newbold	2020-11-06	1	-4/+27
\|
*	html: pdf and html extract similar to XML	Bryan Newbold	2020-11-06	2	-22/+55
	Note that the primary PDF URL extraction path is a separate code path.
*	html: refactors/tweaks from testing	Bryan Newbold	2020-11-06	3	-17/+23
\|
*	ia: use newer gwb (petabox) loading class	Bryan Newbold	2020-11-04	1	-5/+8
	This fixes zstandard WARC reading.