path: root/python/sandcrawler/html.py
Commit message | Author | Date | Files | Lines
* old HTML extractors: handle null tag | Bryan Newbold | 2021-09-08 | 1 | -8/+9
* ingest: fix html PDF extraction exception catch behavior | Bryan Newbold | 2021-05-24 | 1 | -3/+2
* ingest PDF extraction updates | Bryan Newbold | 2021-05-21 | 1 | -0/+17
* better OSF preprint download re-writing | Bryan Newbold | 2021-05-21 | 1 | -6/+23
* move some PDF URL extraction into declarative format | Bryan Newbold | 2020-11-08 | 1 | -116/+18
* html: handle JMIR URL pattern | Bryan Newbold | 2020-09-15 | 1 | -0/+6
* skip citation_pdf_url if it is a link loop | Bryan Newbold | 2020-09-14 | 1 | -2/+8
    This may help get around link-loop errors for a specific version of OJS
* html parse: add another generic fulltext pattern | Bryan Newbold | 2020-09-14 | 1 | -1/+10
* html: handle embed with mangled 'src' attribute | Bryan Newbold | 2020-08-24 | 1 | -1/+1
* html: extract eprints PDF url (eg, ub.uni-heidelberg.de) | Bryan Newbold | 2020-08-11 | 1 | -0/+2
* extract PDF urls for e-periodica.ch | Bryan Newbold | 2020-08-10 | 1 | -0/+6
* add more HTML extraction tricks | Bryan Newbold | 2020-08-08 | 1 | -2/+29
* rwth-aachen.de HTML extract, and a generic URL guess method | Bryan Newbold | 2020-08-08 | 1 | -0/+15
* handle UnboundLocalError in HTML parsing | Bryan Newbold | 2020-05-19 | 1 | -1/+4
* hotfix for html meta extract codepath | Bryan Newbold | 2020-05-03 | 1 | -1/+1
    Didn't test last commit before pushing; bad Bryan!
* ingest: handle partial citation_pdf_url tag | Bryan Newbold | 2020-05-03 | 1 | -0/+3
    Eg: https://www.cureus.com/articles/29935-a-nomogram-for-the-rapid-prediction-of-hematocrit-following-blood-loss-and-fluid-shifts-in-neonates-infants-and-adults
    Has: <meta name="citation_pdf_url"/>
* fix KeyError in HTML PDF URL extraction | Bryan Newbold | 2020-04-17 | 1 | -1/+1
* html: attempt at CNKI href extraction | Bryan Newbold | 2020-04-13 | 1 | -0/+11
* ingest: eurosurveillance PDF parser | Bryan Newbold | 2020-03-25 | 1 | -0/+11
* ingest: handle missing chemrxvi tag | Bryan Newbold | 2020-02-24 | 1 | -1/+1
* ingest: more direct americanarchivist PDF url guess | Bryan Newbold | 2020-02-24 | 1 | -0/+4
* ingest: make ehp.niehs.nih.gov rule more robust | Bryan Newbold | 2020-02-24 | 1 | -2/+3
* small tweak to americanarchivist.org URL extraction | Bryan Newbold | 2020-02-24 | 1 | -1/+1
* html: more publisher-specific fulltext extraction tricks | Bryan Newbold | 2020-02-22 | 1 | -0/+47
* html: degruyter extraction; disabled journals.lww.com | Bryan Newbold | 2020-02-22 | 1 | -0/+19
* html: handle TypeError during bs4 parse | Bryan Newbold | 2020-02-22 | 1 | -1/+7
* allow <meta property=citation_pdf_url> | Bryan Newbold | 2020-02-18 | 1 | -0/+3
    at least researchgate does this (!)
* html extract: protocols.io, fix americanarchivist | Bryan Newbold | 2020-01-10 | 1 | -1/+7
* more ingest HTML extraction hacks | Bryan Newbold | 2020-01-10 | 1 | -6/+46
* many publisher-specific ingest improvements | Bryan Newbold | 2020-01-10 | 1 | -4/+96
* fill in more html extraction techniques | Bryan Newbold | 2020-01-09 | 1 | -7/+6
* refactor: use print(..., file=sys.stderr) | Bryan Newbold | 2019-12-18 | 1 | -1/+1
    Should use logging soon, but this seems more idiomatic in the meanwhile.
* start of hrmars.com ingest support | Bryan Newbold | 2019-11-14 | 1 | -0/+2
* citation_pdf_url with host-relative URLs | Bryan Newbold | 2019-11-13 | 1 | -1/+3
* more progress on file ingest | Bryan Newbold | 2019-11-13 | 1 | -0/+19
* much progress on file ingest path | Bryan Newbold | 2019-10-22 | 1 | -0/+73