path: root/python/sandcrawler/html.py
Commit message | Author | Date | Files | Lines
* old HTML extractors: handle null tag | Bryan Newbold | 2021-09-08 | 1 | -8/+9
* ingest: fix html PDF extraction exception catch behavior | Bryan Newbold | 2021-05-24 | 1 | -3/+2
* ingest PDF extraction updates | Bryan Newbold | 2021-05-21 | 1 | -0/+17
* better OSF preprint download re-writing | Bryan Newbold | 2021-05-21 | 1 | -6/+23
* move some PDF URL extraction into declarative format | Bryan Newbold | 2020-11-08 | 1 | -116/+18
* html: handle JMIR URL pattern | Bryan Newbold | 2020-09-15 | 1 | -0/+6
* skip citation_pdf_url if it is a link loop | Bryan Newbold | 2020-09-14 | 1 | -2/+8
* html parse: add another generic fulltext pattern | Bryan Newbold | 2020-09-14 | 1 | -1/+10
* html: handle embed with mangled 'src' attribute | Bryan Newbold | 2020-08-24 | 1 | -1/+1
* html: extract eprints PDF url (eg, ub.uni-heidelberg.de) | Bryan Newbold | 2020-08-11 | 1 | -0/+2
* extract PDF urls for e-periodica.ch | Bryan Newbold | 2020-08-10 | 1 | -0/+6
* add more HTML extraction tricks | Bryan Newbold | 2020-08-08 | 1 | -2/+29
* rwth-aachen.de HTML extract, and a generic URL guess method | Bryan Newbold | 2020-08-08 | 1 | -0/+15
* handle UnboundLocalError in HTML parsing | Bryan Newbold | 2020-05-19 | 1 | -1/+4
* hotfix for html meta extract codepath | Bryan Newbold | 2020-05-03 | 1 | -1/+1
* ingest: handle partial citation_pdf_url tag | Bryan Newbold | 2020-05-03 | 1 | -0/+3
* fix KeyError in HTML PDF URL extraction | Bryan Newbold | 2020-04-17 | 1 | -1/+1
* html: attempt at CNKI href extraction | Bryan Newbold | 2020-04-13 | 1 | -0/+11
* ingest: eurosurveillance PDF parser | Bryan Newbold | 2020-03-25 | 1 | -0/+11
* ingest: handle missing chemrxvi tag | Bryan Newbold | 2020-02-24 | 1 | -1/+1
* ingest: more direct americanarchivist PDF url guess | Bryan Newbold | 2020-02-24 | 1 | -0/+4
* ingest: make ehp.niehs.nih.gov rule more robust | Bryan Newbold | 2020-02-24 | 1 | -2/+3
* small tweak to americanarchivist.org URL extraction | Bryan Newbold | 2020-02-24 | 1 | -1/+1
* html: more publisher-specific fulltext extraction tricks | Bryan Newbold | 2020-02-22 | 1 | -0/+47
* html: degruyter extraction; disabled journals.lww.com | Bryan Newbold | 2020-02-22 | 1 | -0/+19
* html: handle TypeError during bs4 parse | Bryan Newbold | 2020-02-22 | 1 | -1/+7
* allow <meta property=citation_pdf_url> | Bryan Newbold | 2020-02-18 | 1 | -0/+3
* html extract: protocols.io, fix americanarchivist | Bryan Newbold | 2020-01-10 | 1 | -1/+7
* more ingest HTML extraction hacks | Bryan Newbold | 2020-01-10 | 1 | -6/+46
* many publisher-specific ingest improvements | Bryan Newbold | 2020-01-10 | 1 | -4/+96
* fill in more html extraction techniques | Bryan Newbold | 2020-01-09 | 1 | -7/+6
* refactor: use print(..., file=sys.stderr) | Bryan Newbold | 2019-12-18 | 1 | -1/+1
* start of hrmars.com ingest support | Bryan Newbold | 2019-11-14 | 1 | -0/+2
* citation_pdf_url with host-relative URLs | Bryan Newbold | 2019-11-13 | 1 | -1/+3
* more progress on file ingest | Bryan Newbold | 2019-11-13 | 1 | -0/+19
* much progress on file ingest path | Bryan Newbold | 2019-10-22 | 1 | -0/+73