path: root/python/sandcrawler/html.py
Commit message | Author | Date | Files | Lines (-/+)
* handle UnboundLocalError in HTML parsing | Bryan Newbold | 2020-05-19 | 1 | -1/+4
* hotfix for html meta extract codepath | Bryan Newbold | 2020-05-03 | 1 | -1/+1
    Didn't test last commit before pushing; bad Bryan!
* ingest: handle partial citation_pdf_url tag | Bryan Newbold | 2020-05-03 | 1 | -0/+3
    Eg: https://www.cureus.com/articles/29935-a-nomogram-for-the-rapid-prediction-of-hematocrit-following-blood-loss-and-fluid-shifts-in-neonates-infants-and-adults
    Has: <meta name="citation_pdf_url"/>
* fix KeyError in HTML PDF URL extraction | Bryan Newbold | 2020-04-17 | 1 | -1/+1
* html: attempt at CNKI href extraction | Bryan Newbold | 2020-04-13 | 1 | -0/+11
* ingest: eurosurveillance PDF parser | Bryan Newbold | 2020-03-25 | 1 | -0/+11
* ingest: handle missing chemrxvi tag | Bryan Newbold | 2020-02-24 | 1 | -1/+1
* ingest: more direct americanarchivist PDF url guess | Bryan Newbold | 2020-02-24 | 1 | -0/+4
* ingest: make ehp.niehs.nih.gov rule more robust | Bryan Newbold | 2020-02-24 | 1 | -2/+3
* small tweak to americanarchivist.org URL extraction | Bryan Newbold | 2020-02-24 | 1 | -1/+1
* html: more publisher-specific fulltext extraction tricks | Bryan Newbold | 2020-02-22 | 1 | -0/+47
* html: degruyter extraction; disabled journals.lww.com | Bryan Newbold | 2020-02-22 | 1 | -0/+19
* html: handle TypeError during bs4 parse | Bryan Newbold | 2020-02-22 | 1 | -1/+7
* allow <meta property=citation_pdf_url> | Bryan Newbold | 2020-02-18 | 1 | -0/+3
    at least researchgate does this (!)
* html extract: protocols.io, fix americanarchivist | Bryan Newbold | 2020-01-10 | 1 | -1/+7
* more ingest HTML extraction hacks | Bryan Newbold | 2020-01-10 | 1 | -6/+46
* many publisher-specific ingest improvements | Bryan Newbold | 2020-01-10 | 1 | -4/+96
* fill in more html extraction techniques | Bryan Newbold | 2020-01-09 | 1 | -7/+6
* refactor: use print(..., file=sys.stderr) | Bryan Newbold | 2019-12-18 | 1 | -1/+1
    Should use logging soon, but this seems more idiomatic in the meanwhile.
* start of hrmars.com ingest support | Bryan Newbold | 2019-11-14 | 1 | -0/+2
* citation_pdf_url with host-relative URLs | Bryan Newbold | 2019-11-13 | 1 | -1/+3
* more progress on file ingest | Bryan Newbold | 2019-11-13 | 1 | -0/+19
* much progress on file ingest path | Bryan Newbold | 2019-10-22 | 1 | -0/+73
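
Several entries in this log concern edge cases of the citation_pdf_url meta tag: a tag with no content attribute, property= used in place of name=, and host-relative URLs. The sketch below illustrates those three cases only; it is not the code from html.py. The log indicates the real module parses with bs4, while this example swaps in the standard-library html.parser to stay self-contained. The names extract_citation_pdf_url and _CitationPdfMetaParser are hypothetical.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin


class _CitationPdfMetaParser(HTMLParser):
    """Hypothetical helper: records the first usable citation_pdf_url value."""

    def __init__(self) -> None:
        super().__init__()
        self.pdf_url: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Also invoked for self-closing tags such as <meta ... />.
        if tag != "meta" or self.pdf_url is not None:
            return
        attr = dict(attrs)
        # Accept both <meta name=...> and <meta property=...>.
        key = attr.get("name") or attr.get("property")
        if key != "citation_pdf_url":
            return
        # A partial tag may lack a content attribute, or have an empty one.
        content = (attr.get("content") or "").strip()
        if content:
            self.pdf_url = content


def extract_citation_pdf_url(html: str, base_url: str) -> Optional[str]:
    """Return an absolute PDF URL from the meta tag, or None if unusable."""
    parser = _CitationPdfMetaParser()
    parser.feed(html)
    if parser.pdf_url is None:
        return None
    # urljoin resolves host-relative values like "/files/1.pdf" against the
    # page URL and leaves already-absolute URLs unchanged.
    return urljoin(base_url, parser.pdf_url)
```

Returning None for the partial tag lets a caller fall through to other extraction rules instead of raising.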