sandcrawler - [no description]

	Commit message	Author	Age	Files	Lines
*	ingest: more generic OJS support, including pre-prints	Bryan Newbold	2022-10-24	1	-6/+22
	There were some '/article/view/' patterns which can also be, eg, '/preprint/view/'.
*	ingest: more generic PDF fulltext URL patterns	Bryan Newbold	2022-10-24	1	-0/+14
*	html: worldscientific PDF URL extraction	Bryan Newbold	2022-10-24	1	-0/+16
*	ingest: more PDF fulltext tricks	Bryan Newbold	2022-07-20	1	-0/+29
*	ingest: more PDF fulltext URL patterns	Bryan Newbold	2022-07-20	1	-0/+42
*	html: mangled JSON-in-URL pattern	Bryan Newbold	2022-07-15	1	-0/+1
*	html: fulltext URL prefixes to skip; also fix broken pattern matching	Bryan Newbold	2022-07-15	1	-4/+19
	Due to both the 'continue-in-a-for-loop' and 'missing-trailing-commas', the existing pattern matching was not working.
*	HTML ingest: most sub-resource patterns to skip	Bryan Newbold	2022-07-15	1	-1/+13
*	ingest: random site PDF link pattern	Bryan Newbold	2022-07-12	1	-0/+7
*	ingest: doaj.org article landing page access links	Bryan Newbold	2022-07-12	1	-0/+12
*	sandcrawler: additional extracts, mostly OJS	Bryan Newbold	2022-01-13	1	-1/+23
*	ingest: PDF pattern for integrityresjournals.org	Bryan Newbold	2022-01-13	1	-0/+8
*	codespell typos in python (comments)	Bryan Newbold	2021-11-24	1	-1/+1
*	html_meta: actual typo in code (CSS selector) caught by codespell	Bryan Newbold	2021-11-24	1	-1/+1
*	make fmt (black 21.9b0)	Bryan Newbold	2021-10-27	1	-62/+71
*	lint collection membership (last lint for now)	Bryan Newbold	2021-10-26	1	-9/+9
*	more progress on type annotations and linting	Bryan Newbold	2021-10-26	1	-12/+13
*	start handling trivial lint cleanups: unused imports, 'is None', etc	Bryan Newbold	2021-10-26	1	-3/+3
*	make fmt	Bryan Newbold	2021-10-26	1	-21/+16
*	python: isort all imports	Bryan Newbold	2021-10-26	1	-5/+4
*	component ingest support for dataverse files (individual)	Bryan Newbold	2021-10-15	1	-13/+27
*	pdf ingest: journals.uchicago.edu pattern	Bryan Newbold	2021-10-11	1	-0/+8
*	ingest: basic 'component' and 'src' support	Bryan Newbold	2021-10-04	1	-0/+15
*	yet more PDF URL patterns	Bryan Newbold	2021-09-03	1	-0/+48
*	HTML ingest: several more PDF fulltext URL patterns	Bryan Newbold	2021-09-03	1	-0/+87
*	HTML ingest: skip noisy print() statement	Bryan Newbold	2021-09-03	1	-1/+1
*	HTML ingest: more meta-URI prefixes	Bryan Newbold	2021-08-24	1	-2/+8
*	html ingest: skip 'about:blank'	Bryan Newbold	2021-08-16	1	-0/+3
	Couldn't get adblock rule matcher to match this, for some reason. Maybe a special case?
*	ingest PDF extraction updates	Bryan Newbold	2021-05-21	1	-0/+54
*	html ingest: remove whitespace around relative URLs (eg, for d-lib)	Bryan Newbold	2021-05-21	1	-1/+1
*	ingest: handle current degruyter PDF link pattern	Bryan Newbold	2021-03-26	1	-0/+8
*	html: more conservative parsing of element attr	Bryan Newbold	2020-11-20	1	-2/+4
*	html biblio: handle 'content not in attrs' case	Bryan Newbold	2020-11-12	1	-2/+2
*	html: more adblock	Bryan Newbold	2020-11-08	1	-1/+3
*	move fuzzy URL match method to misc	Bryan Newbold	2020-11-08	1	-0/+2
*	move some PDF URL extraction into declarative format	Bryan Newbold	2020-11-08	1	-9/+149
*	html: more extraction patterns; bugfix; skip more crossmark	Bryan Newbold	2020-11-08	1	-1/+24
*	html: small ingest improvements	Bryan Newbold	2020-11-08	1	-0/+15
*	html: pdf and html extract similar to XML	Bryan Newbold	2020-11-06	1	-20/+30
	Note that the primary PDF URL extraction path is a separate code path.
*	initial implementation of HTML ingest in existing worker	Bryan Newbold	2020-11-04	1	-0/+5
*	html: improve XML fulltext extraction for scielo	Bryan Newbold	2020-11-03	1	-4/+17
*	html: some refactoring	Bryan Newbold	2020-11-03	1	-10/+40
*	html: syntax fixes; resolve relative URLs; extract more XML fulltext URLs	Bryan Newbold	2020-10-30	1	-5/+12
*	html: more ingest improvements	Bryan Newbold	2020-10-30	1	-0/+2
*	html: more biblio selectors; resource extraction	Bryan Newbold	2020-10-29	1	-0/+102
*	HTML meta: more from online hunting/research	Bryan Newbold	2020-10-27	1	-3/+54
*	HTML metadata: fix type warnings	Bryan Newbold	2020-10-27	1	-1/+3
*	start HTML metadata extraction code	Bryan Newbold	2020-10-27	1	-0/+230