sandcrawler - [no description]

	Commit message (Collapse)	Author	Age	Files	Lines
*	scripts: example archiveorg-to-fileset importer	Bryan Newbold	2021-10-15	1	-0/+138
\|
*	refactoring; progress on filesets	Bryan Newbold	2021-10-15	3	-9/+27
\|
*	rename some python files for clarity	Bryan Newbold	2021-10-15	3	-0/+0
\|
*	pdf ingest: journals.uchicago.edu pattern	Bryan Newbold	2021-10-11	1	-0/+8
\|
*	spn: avoid 'None' job_id	Bryan Newbold	2021-10-11	1	-2/+2
\| \| \| \| \| \|	Thanks Vanglis for reporting these. Not sure this commit fixes all instances of the problem.
*	cdx_collection.py: minor lint issue	Bryan Newbold	2021-10-04	1	-1/+1
\|
*	ingest: basic 'component' and 'src' support	Bryan Newbold	2021-10-04	2	-20/+84
\|
*	html ingest: report dt with broken CDX records	Bryan Newbold	2021-10-04	1	-1/+1
\|
*	allow through unknown-scope HTML ingests, for possible SPN import	Bryan Newbold	2021-10-01	1	-11/+5
\|
*	html: fix logging of broken CDX URL	Bryan Newbold	2021-10-01	1	-1/+1
\|
*	ingest CDX lookup: weigh year+month of capture against in-petabox-or-not	Bryan Newbold	2021-09-30	1	-0/+1
\| \| \| \| \| \| \| \|	This is to try working around an issue where ingests fail because an SPN capture is much newer, but the old sorting preference ignored that. Note that the sorting logic is pretty busted anyways, and we should probably allow returning multiple matching files to try.
*	fix typo with spn_cdx_retry_sec arg	Bryan Newbold	2021-09-30	1	-1/+1
\|
*	tune SPN CDX retry/wait depending on mode (priority vs daily)	Bryan Newbold	2021-09-30	3	-3/+9
\|
*	yet another bad PDF sha1	Bryan Newbold	2021-09-30	1	-0/+1
\|
*	new 'daily' and 'priority' ingest request topics	Bryan Newbold	2021-09-30	1	-1/+7
\| \| \| \| \| \| \| \| \|	The old ingest request queue was always getting lopsided; I suspect this is because it was scaled up (additional partitions) at some point in the past. Hoping new topics will fix this. The new '-priority' queue is like '-bulk', but for smaller-volume SPN-like requests, eg, interactive mode.
*	old HTML extractors: handle null tag	Bryan Newbold	2021-09-08	1	-8/+9
\|
*	ingest: more block patterns, for huge databases	Bryan Newbold	2021-09-08	1	-1/+4
\|
*	yet more PDF sha1 to skip	Bryan Newbold	2021-09-03	1	-0/+5
\|
*	yet more PDF URL patterns	Bryan Newbold	2021-09-03	1	-0/+48
\|
*	ingest: check URL blocklist again after redirects	Bryan Newbold	2021-09-03	1	-0/+7
\|
*	refactor and expand wall/block/cookie URL patterns	Bryan Newbold	2021-09-03	2	-6/+39
\|
*	HTML ingest: several more PDF fulltext URL patterns	Bryan Newbold	2021-09-03	1	-0/+87
\|
*	HTML ingest: skip noisy print() statement	Bryan Newbold	2021-09-03	1	-1/+1
\|
*	HTML ingest: more meta-URI prefixes	Bryan Newbold	2021-08-24	1	-2/+8
\|
*	html ingest: detect some blog platforms, and allow lower wordcount threshold	Bryan Newbold	2021-08-16	1	-0/+6
\|
*	html ingest: detect domain homepage (no path) as special case	Bryan Newbold	2021-08-16	1	-0/+8
\|
*	html ingest: skip 'about:blank'	Bryan Newbold	2021-08-16	1	-0/+3
\| \| \| \| \|	Couldn't get the adblock rule matcher to match this, for some reason. Maybe a special case?
*	more bad PDF hashes	Bryan Newbold	2021-07-26	1	-0/+2
\|
*	ingest: fix postgrest lookup bug (double get of GROBID)	Bryan Newbold	2021-07-26	1	-1/+1
\|
*	more blocked-cookie patterns; fix old typo	Bryan Newbold	2021-07-14	1	-2/+2
\|
*	another bad PDF sha1	Bryan Newbold	2021-07-13	1	-0/+1
\|
*	crawl: SPN2 non-200 success code path	Bryan Newbold	2021-07-13	1	-11/+25
\|
*	crawl: SPN self-redirect hack	Bryan Newbold	2021-07-13	1	-0/+9
\|
*	crawl: small comment updates	Bryan Newbold	2021-07-13	1	-3/+6
\|
*	another lowercase DOI in an (unused?) script	Bryan Newbold	2021-07-13	1	-1/+1
\|
*	gitignore: samples/	Bryan Newbold	2021-07-13	1	-0/+1
\|
*	add crossref postgrest fetch support to python db helpers	Bryan Newbold	2021-06-02	1	-0/+9
\|
*	python Makefile: fix test/*.py linting with newer pylint	Bryan Newbold	2021-05-24	1	-1/+1
\|
*	ingest: fix html PDF extraction exception catch behavior	Bryan Newbold	2021-05-24	1	-3/+2
\|
*	ingest PDF extraction updates	Bryan Newbold	2021-05-21	3	-2/+74
\|
*	better OSF preprint download re-writing	Bryan Newbold	2021-05-21	1	-6/+23
\|
*	html ingest: remove whitespace around relative URLs (eg, for d-lib)	Bryan Newbold	2021-05-21	1	-1/+1
\|
*	add cdx_collection.py python script (from scratch repo)	Bryan Newbold	2021-05-04	1	-0/+80
\|
*	ingest: cap max body size to ~128 MByte	Bryan Newbold	2021-04-27	1	-0/+6
\| \| \| \|	Should resolve 'magic' OOM errors in production.
*	persist: skip very long URLs	Bryan Newbold	2021-04-12	1	-0/+4
\|
*	update default postgrest ('db') API endpoint	Bryan Newbold	2021-04-09	1	-1/+1
\|
*	grobid: disable biblio-glutton consolidation	Bryan Newbold	2021-04-07	1	-3/+3
\|
*	ingest: handle current degruyter PDF link pattern	Bryan Newbold	2021-03-26	1	-0/+8
\|
*	add missing dotfiles (due to gitignore oops)	Bryan Newbold	2021-01-18	2	-0/+12
\|
*	pipenv: lock minio S3 library to <7.0.0	Bryan Newbold	2021-01-14	2	-242/+196
\| \| \| \| \| \| \| \| \| \| \|	An upstream commit (https://github.com/minio/minio-py/commit/b81883a98e6f8a09e2903609caabbf0956dd0ec9) changes the API for errors, which makes it harder for us to catch specific exceptions (such as "NoSuchKey" as a Not Found / 404 error). Instead of refactoring, just going to pin the library. We should probably replace this library with a non-implementation-specific S3 client at some point; minio seems simpler than, eg, boto3, but there is probably something even simpler out there.
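
The 2021-09-30 entry "ingest CDX lookup: weigh year+month of capture against in-petabox-or-not" describes a sort preference in words only. The following is a minimal sketch of what such an ordering could look like; it is not the code from that commit. The `CdxRow` class, its field names, the `sort_key` and `pick_best` helpers, and the rule that a "/" in the WARC path means the capture is in petabox are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class CdxRow:
    # Hypothetical, simplified CDX record; field names are assumptions.
    url: str
    datetime: str     # 14-digit capture timestamp, e.g. "20210930120000"
    status_code: int
    warc_path: str    # assumption: a "/" in the path means "in petabox"


def sort_key(row: CdxRow) -> Tuple[int, int, int, int]:
    # Tuples compare element by element, so earlier entries dominate.
    return (
        int(row.status_code == 200),   # prefer successful captures
        int(row.datetime[:6]),         # year+month of capture (YYYYMM)
        int("/" in row.warc_path),     # then prefer in-petabox captures
        int(row.datetime),             # finally, the newest capture
    )


def pick_best(rows: List[CdxRow]) -> Optional[CdxRow]:
    # Return the highest-ranked row, or None if there are no candidates.
    if not rows:
        return None
    return max(rows, key=sort_key)
```

Because year+month is compared before the petabox flag, a much newer SPN capture outranks an old petabox capture, while captures from the same month still prefer the petabox copy. As the commit message notes, returning several candidates to try might be a better design than picking a single winner.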