sandcrawler

	Commit message	Author	Age	Files	Lines
*	old HTML extractors: handle null tag	Bryan Newbold	2021-09-08	1	-8/+9
*	ingest: fix html PDF extraction exception catch behavior	Bryan Newbold	2021-05-24	1	-3/+2
*	ingest PDF extraction updates	Bryan Newbold	2021-05-21	1	-0/+17
*	better OSF preprint download re-writing	Bryan Newbold	2021-05-21	1	-6/+23
*	move some PDF URL extraction into declarative format	Bryan Newbold	2020-11-08	1	-116/+18
*	html: handle JMIR URL pattern	Bryan Newbold	2020-09-15	1	-0/+6
*	skip citation_pdf_url if it is a link loop	Bryan Newbold	2020-09-14	1	-2/+8
	This may help get around link-loop errors for a specific version of OJS
*	html parse: add another generic fulltext pattern	Bryan Newbold	2020-09-14	1	-1/+10
*	html: handle embed with mangled 'src' attribute	Bryan Newbold	2020-08-24	1	-1/+1
*	html: extract eprints PDF url (eg, ub.uni-heidelberg.de)	Bryan Newbold	2020-08-11	1	-0/+2
*	extract PDF urls for e-periodica.ch	Bryan Newbold	2020-08-10	1	-0/+6
*	add more HTML extraction tricks	Bryan Newbold	2020-08-08	1	-2/+29
*	rwth-aachen.de HTML extract, and a generic URL guess method	Bryan Newbold	2020-08-08	1	-0/+15
*	handle UnboundLocalError in HTML parsing	Bryan Newbold	2020-05-19	1	-1/+4
*	hotfix for html meta extract codepath	Bryan Newbold	2020-05-03	1	-1/+1
	Didn't test last commit before pushing; bad Bryan!
*	ingest: handle partial citation_pdf_url tag	Bryan Newbold	2020-05-03	1	-0/+3
	Eg: https://www.cureus.com/articles/29935-a-nomogram-for-the-rapid-prediction-of-hematocrit-following-blood-loss-and-fluid-shifts-in-neonates-infants-and-adults
	Has: <meta name="citation_pdf_url"/>
*	fix KeyError in HTML PDF URL extraction	Bryan Newbold	2020-04-17	1	-1/+1
*	html: attempt at CNKI href extraction	Bryan Newbold	2020-04-13	1	-0/+11
*	ingest: eurosurveillance PDF parser	Bryan Newbold	2020-03-25	1	-0/+11
*	ingest: handle missing chemrxiv tag	Bryan Newbold	2020-02-24	1	-1/+1
*	ingest: more direct americanarchivist PDF url guess	Bryan Newbold	2020-02-24	1	-0/+4
*	ingest: make ehp.niehs.nih.gov rule more robust	Bryan Newbold	2020-02-24	1	-2/+3
*	small tweak to americanarchivist.org URL extraction	Bryan Newbold	2020-02-24	1	-1/+1
*	html: more publisher-specific fulltext extraction tricks	Bryan Newbold	2020-02-22	1	-0/+47
*	html: degruyter extraction; disabled journals.lww.com	Bryan Newbold	2020-02-22	1	-0/+19
*	html: handle TypeError during bs4 parse	Bryan Newbold	2020-02-22	1	-1/+7
*	allow <meta property=citation_pdf_url>	Bryan Newbold	2020-02-18	1	-0/+3
	at least researchgate does this (!)
*	html extract: protocols.io, fix americanarchivist	Bryan Newbold	2020-01-10	1	-1/+7
*	more ingest HTML extraction hacks	Bryan Newbold	2020-01-10	1	-6/+46
*	many publisher-specific ingest improvements	Bryan Newbold	2020-01-10	1	-4/+96
*	fill in more html extraction techniques	Bryan Newbold	2020-01-09	1	-7/+6
*	refactor: use print(..., file=sys.stderr)	Bryan Newbold	2019-12-18	1	-1/+1
	Should use logging soon, but this seems more idiomatic in the meanwhile.
*	start of hrmars.com ingest support	Bryan Newbold	2019-11-14	1	-0/+2
*	citation_pdf_url with host-relative URLs	Bryan Newbold	2019-11-13	1	-1/+3
*	more progress on file ingest	Bryan Newbold	2019-11-13	1	-0/+19
*	much progress on file ingest path	Bryan Newbold	2019-10-22	1	-0/+73