sandcrawler - [no description]

	Commit message	Author	Age	Files	Lines
*	html: more conservative parsing of element attr	Bryan Newbold	2020-11-20	1	-2/+4
\|
*	html biblio: handle 'content not in attrs' case	Bryan Newbold	2020-11-12	1	-2/+2
\|
*	html: more adblock	Bryan Newbold	2020-11-08	1	-1/+3
\|
*	move fuzzy URL match method to misc	Bryan Newbold	2020-11-08	1	-0/+2
\|
*	move some PDF URL extraction into declarative format	Bryan Newbold	2020-11-08	1	-9/+149
\|
*	html: more extraction patterns; bugfix; skip more crossmark	Bryan Newbold	2020-11-08	1	-1/+24
\|
*	html: small ingest improvements	Bryan Newbold	2020-11-08	1	-0/+15
\|
*	html: pdf and html extract similar to XML	Bryan Newbold	2020-11-06	1	-20/+30
\|	Note that the primary PDF URL extraction path is a separate code path.
*	initial implementation of HTML ingest in existing worker	Bryan Newbold	2020-11-04	1	-0/+5
\|
*	html: improve XML fulltext extraction for scielo	Bryan Newbold	2020-11-03	1	-4/+17
\|
*	html: some refactoring	Bryan Newbold	2020-11-03	1	-10/+40
\|
*	html: syntax fixes; resolve relative URLs; extract more XML fulltext URLs	Bryan Newbold	2020-10-30	1	-5/+12
\|
*	html: more ingest improvements	Bryan Newbold	2020-10-30	1	-0/+2
\|
*	html: more biblio selectors; resource extraction	Bryan Newbold	2020-10-29	1	-0/+102
\|
*	HTML meta: more from online hunting/research	Bryan Newbold	2020-10-27	1	-3/+54
\|
*	HTML metadata: fix type warnings	Bryan Newbold	2020-10-27	1	-1/+3
\|
*	start HTML metadata extraction code	Bryan Newbold	2020-10-27	1	-0/+230