path: root/python/sandcrawler/html_metadata.py
| Commit message | Author | Date | Files | Lines |
|---|---|---|---|---|
| component ingest support for dataverse files (individual) | Bryan Newbold | 2021-10-15 | 1 | -13/+27 |
| pdf ingest: journals.uchicago.edu pattern | Bryan Newbold | 2021-10-11 | 1 | -0/+8 |
| ingest: basic 'component' and 'src' support | Bryan Newbold | 2021-10-04 | 1 | -0/+15 |
| yet more PDF URL patterns | Bryan Newbold | 2021-09-03 | 1 | -0/+48 |
| HTML ingest: several more PDF fulltext URL patterns | Bryan Newbold | 2021-09-03 | 1 | -0/+87 |
| HTML ingest: skip noisy print() statement | Bryan Newbold | 2021-09-03 | 1 | -1/+1 |
| HTML ingest: more meta-URI prefixes | Bryan Newbold | 2021-08-24 | 1 | -2/+8 |
| html ingest: skip 'about:blank' | Bryan Newbold | 2021-08-16 | 1 | -0/+3 |
| ingest PDF extraction updates | Bryan Newbold | 2021-05-21 | 1 | -0/+54 |
| html ingest: remove whitespace around relative URLs (eg, for d-lib) | Bryan Newbold | 2021-05-21 | 1 | -1/+1 |
| ingest: handle current degruyter PDF link pattern | Bryan Newbold | 2021-03-26 | 1 | -0/+8 |
| html: more conservative parsing of element attr | Bryan Newbold | 2020-11-20 | 1 | -2/+4 |
| html biblio: handle 'content not in attrs' case | Bryan Newbold | 2020-11-12 | 1 | -2/+2 |
| html: more adblock | Bryan Newbold | 2020-11-08 | 1 | -1/+3 |
| move fuzzy URL match method to misc | Bryan Newbold | 2020-11-08 | 1 | -0/+2 |
| move some PDF URL extraction into declarative format | Bryan Newbold | 2020-11-08 | 1 | -9/+149 |
| html: more extraction patterns; bugfix; skip more crossmark | Bryan Newbold | 2020-11-08 | 1 | -1/+24 |
| html: small ingest improvements | Bryan Newbold | 2020-11-08 | 1 | -0/+15 |
| html: pdf and html extract similar to XML | Bryan Newbold | 2020-11-06 | 1 | -20/+30 |
| initial implementation of HTML ingest in existing worker | Bryan Newbold | 2020-11-04 | 1 | -0/+5 |
| html: improve XML fulltext extraction for scielo | Bryan Newbold | 2020-11-03 | 1 | -4/+17 |
| html: some refactoring | Bryan Newbold | 2020-11-03 | 1 | -10/+40 |
| html: syntax fixes; resolve relative URLs; extract more XML fulltext URLs | Bryan Newbold | 2020-10-30 | 1 | -5/+12 |
| html: more ingest improvements | Bryan Newbold | 2020-10-30 | 1 | -0/+2 |
| html: more biblio selectors; resource extraction | Bryan Newbold | 2020-10-29 | 1 | -0/+102 |
| HTML meta: more from online hunting/research | Bryan Newbold | 2020-10-27 | 1 | -3/+54 |
| HTML metadata: fix type warnings | Bryan Newbold | 2020-10-27 | 1 | -1/+3 |
| start HTML metadata extraction code | Bryan Newbold | 2020-10-27 | 1 | -0/+230 |