path: root/python/sandcrawler/html_metadata.py
Commit message | Author | Age | Files | Lines
* html: more conservative parsing of element attr | Bryan Newbold | 2020-11-20 | 1 | -2/+4
* html biblio: handle 'content not in attrs' case | Bryan Newbold | 2020-11-12 | 1 | -2/+2
* html: more adblock | Bryan Newbold | 2020-11-08 | 1 | -1/+3
* move fuzzy URL match method to misc | Bryan Newbold | 2020-11-08 | 1 | -0/+2
* move some PDF URL extraction into declarative format | Bryan Newbold | 2020-11-08 | 1 | -9/+149
* html: more extraction patterns; bugfix; skip more crossmark | Bryan Newbold | 2020-11-08 | 1 | -1/+24
* html: small ingest improvements | Bryan Newbold | 2020-11-08 | 1 | -0/+15
* html: pdf and html extract similar to XML | Bryan Newbold | 2020-11-06 | 1 | -20/+30
* initial implementation of HTML ingest in existing worker | Bryan Newbold | 2020-11-04 | 1 | -0/+5
* html: improve XML fulltext extraction for scielo | Bryan Newbold | 2020-11-03 | 1 | -4/+17
* html: some refactoring | Bryan Newbold | 2020-11-03 | 1 | -10/+40
* html: syntax fixes; resolve relative URLs; extract more XML fulltext URLs | Bryan Newbold | 2020-10-30 | 1 | -5/+12
* html: more ingest improvements | Bryan Newbold | 2020-10-30 | 1 | -0/+2
* html: more biblio selectors; resource extraction | Bryan Newbold | 2020-10-29 | 1 | -0/+102
* HTML meta: more from online hunting/research | Bryan Newbold | 2020-10-27 | 1 | -3/+54
* HTML metadata: fix type warnings | Bryan Newbold | 2020-10-27 | 1 | -1/+3
* start HTML metadata extraction code | Bryan Newbold | 2020-10-27 | 1 | -0/+230