sandcrawler

	Commit message	Author	Age	Files	Lines
*	wrap up previous renaming work	Bryan Newbold	2021-10-15	4	-6/+4
*	progress on fileset/dataset ingest	Bryan Newbold	2021-10-15	4	-0/+403
*	scripts: example archiveorg-to-fileset importer	Bryan Newbold	2021-10-15	1	-0/+138
*	refactoring; progress on filesets	Bryan Newbold	2021-10-15	3	-9/+27
*	rename some python files for clarity	Bryan Newbold	2021-10-15	3	-0/+0
*	pdf ingest: journals.uchicago.edu pattern	Bryan Newbold	2021-10-11	1	-0/+8
*	spn: avoid 'None' job_id	Bryan Newbold	2021-10-11	1	-2/+2
*	cdx_collection.py: minor lint issue	Bryan Newbold	2021-10-04	1	-1/+1
*	ingest: basic 'component' and 'src' support	Bryan Newbold	2021-10-04	2	-20/+84
*	html ingest: report dt with broken CDX records	Bryan Newbold	2021-10-04	1	-1/+1
*	allow through unknown-scope HTML ingests, for possible SPN import	Bryan Newbold	2021-10-01	1	-11/+5
*	html: fix logging of broken CDX URL	Bryan Newbold	2021-10-01	1	-1/+1
*	ingest CDX lookup: weigh year+month of capture against in-petabox-or-not	Bryan Newbold	2021-09-30	1	-0/+1
*	fix typo with spn_cdx_retry_sec arg	Bryan Newbold	2021-09-30	1	-1/+1
*	tune SPN CDX retry/wait depending on mode (priority vs daily)	Bryan Newbold	2021-09-30	3	-3/+9
*	yet another bad PDF sha1	Bryan Newbold	2021-09-30	1	-0/+1
*	new 'daily' and 'priority' ingest request topics	Bryan Newbold	2021-09-30	1	-1/+7
*	old HTML extractors: handle null tag	Bryan Newbold	2021-09-08	1	-8/+9
*	ingest: more block patterns, for huge databases	Bryan Newbold	2021-09-08	1	-1/+4
*	yet more PDF sha1 to skip	Bryan Newbold	2021-09-03	1	-0/+5
*	yet more PDF URL patterns	Bryan Newbold	2021-09-03	1	-0/+48
*	ingest: check URL blocklist again after redirects	Bryan Newbold	2021-09-03	1	-0/+7
*	refactor and expand wall/block/cookie URL patterns	Bryan Newbold	2021-09-03	2	-6/+39
*	HTML ingest: several more PDF fulltext URL patterns	Bryan Newbold	2021-09-03	1	-0/+87
*	HTML ingest: skip noisy print() statement	Bryan Newbold	2021-09-03	1	-1/+1
*	HTML ingest: more meta-URI prefixes	Bryan Newbold	2021-08-24	1	-2/+8
*	html ingest: detect some blog platforms, and allow lower wordcount threshold	Bryan Newbold	2021-08-16	1	-0/+6
*	html ingest: detect domain homepage (no path) as special case	Bryan Newbold	2021-08-16	1	-0/+8
*	html ingest: skip 'about:blank'	Bryan Newbold	2021-08-16	1	-0/+3
*	more bad PDF hashes	Bryan Newbold	2021-07-26	1	-0/+2
*	ingest: fix postgrest lookup bug (double get of GROBID)	Bryan Newbold	2021-07-26	1	-1/+1
*	more blocked-cookie patterns; fix old typo	Bryan Newbold	2021-07-14	1	-2/+2
*	another bad PDF sha1	Bryan Newbold	2021-07-13	1	-0/+1
*	crawl: SPN2 non-200 success code path	Bryan Newbold	2021-07-13	1	-11/+25
*	crawl: SPN self-redirect hack	Bryan Newbold	2021-07-13	1	-0/+9
*	crawl: small comment updates	Bryan Newbold	2021-07-13	1	-3/+6
*	another lowercase DOI in an (unused?) script	Bryan Newbold	2021-07-13	1	-1/+1
*	gitignore: samples/	Bryan Newbold	2021-07-13	1	-0/+1
*	add crossref postgrest fetch support to python db helpers	Bryan Newbold	2021-06-02	1	-0/+9
*	python Makefile: fix test/*.py linting with newer pylint	Bryan Newbold	2021-05-24	1	-1/+1
*	ingest: fix html PDF extraction exception catch behavior	Bryan Newbold	2021-05-24	1	-3/+2
*	ingest PDF extraction updates	Bryan Newbold	2021-05-21	3	-2/+74
*	better OSF preprint download re-writing	Bryan Newbold	2021-05-21	1	-6/+23
*	html ingest: remove whitespace around relative URLs (eg, for d-lib)	Bryan Newbold	2021-05-21	1	-1/+1
*	add cdx_collection.py python script (from scratch repo)	Bryan Newbold	2021-05-04	1	-0/+80
*	ingest: cap max body size to ~128 MByte	Bryan Newbold	2021-04-27	1	-0/+6
*	persist: skip very long URLs	Bryan Newbold	2021-04-12	1	-0/+4
*	update default postgrest ('db') API endpoint	Bryan Newbold	2021-04-09	1	-1/+1
*	grobid: disable biblio-glutton consolidation	Bryan Newbold	2021-04-07	1	-3/+3
*	ingest: handle current degruyter PDF link pattern	Bryan Newbold	2021-03-26	1	-0/+8