sandcrawler - [no description]

	Commit message	Author	Age	Files	Lines
...
*	more blocked-cookie patterns; fix old typo	Bryan Newbold	2021-07-14	1	-2/+2
\|
*	another bad PDF sha1	Bryan Newbold	2021-07-13	1	-0/+1
\|
*	crawl: SPN2 non-200 success code path	Bryan Newbold	2021-07-13	1	-11/+25
\|
*	crawl: SPN self-redirect hack	Bryan Newbold	2021-07-13	1	-0/+9
\|
*	crawl: small comment updates	Bryan Newbold	2021-07-13	1	-3/+6
\|
*	another lowercase DOI in an (unused?) script	Bryan Newbold	2021-07-13	1	-1/+1
\|
*	gitignore: samples/	Bryan Newbold	2021-07-13	1	-0/+1
\|
*	add crossref postgrest fetch support to python db helpers	Bryan Newbold	2021-06-02	1	-0/+9
\|
*	python Makefile: fix test/*.py linting with newer pylint	Bryan Newbold	2021-05-24	1	-1/+1
\|
*	ingest: fix html PDF extraction exception catch behavior	Bryan Newbold	2021-05-24	1	-3/+2
\|
*	ingest PDF extraction updates	Bryan Newbold	2021-05-21	3	-2/+74
\|
*	better OSF preprint download re-writing	Bryan Newbold	2021-05-21	1	-6/+23
\|
*	html ingest: remove whitespace around relative URLs (eg, for d-lib)	Bryan Newbold	2021-05-21	1	-1/+1
\|
*	add cdx_collection.py python script (from scratch repo)	Bryan Newbold	2021-05-04	1	-0/+80
\|
*	ingest: cap max body size to ~128 MByte	Bryan Newbold	2021-04-27	1	-0/+6
\|	Should resolve 'magic' OOM errors in production.
*	persist: skip very long URLs	Bryan Newbold	2021-04-12	1	-0/+4
\|
*	update default postgrest ('db') API endpoint	Bryan Newbold	2021-04-09	1	-1/+1
\|
*	grobid: disable biblio-glutton consolidation	Bryan Newbold	2021-04-07	1	-3/+3
\|
*	ingest: handle current degruyter PDF link pattern	Bryan Newbold	2021-03-26	1	-0/+8
\|
*	add missing dotfiles (due to gitignore oops)	Bryan Newbold	2021-01-18	2	-0/+12
\|
*	pipenv: lock minio S3 library to <7.0.0	Bryan Newbold	2021-01-14	2	-242/+196
\|	In this upstream commit (https://github.com/minio/minio-py/commit/b81883a98e6f8a09e2903609caabbf0956dd0ec9), the API for errors changes, which makes it harder for us to catch specific exceptions (such as "NoSuchKey" as a Not Found / 404 error). Instead of refactoring, just going to pin the library. We should probably replace this library with a non-implementation-specific S3 client at some point; minio seems simpler than, eg, boto3, but there is probably something even simpler out there.
*	more expansive python/.gitignore rules (all .gz)	Bryan Newbold	2021-01-05	1	-1/+1
\|
*	doaj ingest request updates (from prod)	Bryan Newbold	2021-01-05	1	-1/+5
\|
*	python makefile: don't duplicate 'lint' commands during 'test'	Bryan Newbold	2021-01-05	1	-2/+0
\|
*	update to python3.8	Bryan Newbold	2021-01-05	2	-400/+413
\|
*	pdf: yet more bad SHA1 (committing lines from prod)	Bryan Newbold	2021-01-05	1	-0/+20
\|
*	ia CDX: handle bad CDX rows	Bryan Newbold	2021-01-05	1	-2/+4
\|
*	spn: more status codes	Bryan Newbold	2020-12-21	1	-1/+2
\|
*	persist: html_meta is ON CONFLICT DO UPDATE	Bryan Newbold	2020-12-15	1	-1/+1
\|
*	persist: don't expect HTML TEI-XML in result object	Bryan Newbold	2020-12-15	1	-1/+1
\|
*	handle more wayback error conditions	Bryan Newbold	2020-11-20	1	-0/+6
\|
*	html: more conservative parsing of element attr	Bryan Newbold	2020-11-20	1	-2/+4
\|
*	xml: catch parse error	Bryan Newbold	2020-11-19	1	-3/+8
\|
*	spn 'forbidden' status code	Bryan Newbold	2020-11-12	1	-1/+1
\|
*	html biblio: handle 'content not in attrs' case	Bryan Newbold	2020-11-12	1	-2/+2
\|
*	blacklist -> denylist	Bryan Newbold	2020-11-10	2	-9/+9
\|
*	pipenv: updates (mostly for trafilatura 0.6.0)	Bryan Newbold	2020-11-10	1	-25/+32
\|
*	DOAJ and HTML ingest tweaks from QA run	Bryan Newbold	2020-11-10	2	-3/+3
\|
*	html: handle more traf error cases	Bryan Newbold	2020-11-08	1	-2/+2
\|
*	html: more adblock	Bryan Newbold	2020-11-08	1	-1/+3
\|
*	ingest: small html_bibli typo	Bryan Newbold	2020-11-08	1	-1/+1
\|
*	html: more small platform tweaks	Bryan Newbold	2020-11-08	1	-5/+4
\|
*	move fuzzy URL match method to misc	Bryan Newbold	2020-11-08	3	-19/+20
\|
*	move some PDF URL extraction into declarative format	Bryan Newbold	2020-11-08	5	-143/+177
\|
*	basic DOAJ ingest request conversion script	Bryan Newbold	2020-11-08	1	-0/+139
\|
*	ingest: default to html_biblio for PDF URL extraction	Bryan Newbold	2020-11-08	1	-24/+17
\|
*	ingest: shortened scope+platform keys; use html_biblio extraction for PDFs	Bryan Newbold	2020-11-08	1	-15/+35
\|
*	html: more robust ingest; better platform and scope detection	Bryan Newbold	2020-11-08	1	-32/+96
\|
*	html: more extraction patterns; bugfix; skip more crossmark	Bryan Newbold	2020-11-08	1	-1/+24
\|
*	ingest html: return better status based on sniffed scope	Bryan Newbold	2020-11-08	1	-9/+31
\|