Commit message | Author | Age | Files | Lines | |
---|---|---|---|---|---|
* | html: pdf and html extract similar to XML | Bryan Newbold | 2020-11-06 | 1 | -20/+30 |
| | Note that the primary PDF URL extraction path is a separate code path. | | | | |
* | initial implementation of HTML ingest in existing worker | Bryan Newbold | 2020-11-04 | 1 | -0/+5 |
* | html: improve XML fulltext extraction for scielo | Bryan Newbold | 2020-11-03 | 1 | -4/+17 |
* | html: some refactoring | Bryan Newbold | 2020-11-03 | 1 | -10/+40 |
* | html: syntax fixes; resolve relative URLs; extract more XML fulltext URLs | Bryan Newbold | 2020-10-30 | 1 | -5/+12 |
* | html: more ingest improvements | Bryan Newbold | 2020-10-30 | 1 | -0/+2 |
* | html: more biblio selectors; resource extraction | Bryan Newbold | 2020-10-29 | 1 | -0/+102 |
* | HTML meta: more from online hunting/research | Bryan Newbold | 2020-10-27 | 1 | -3/+54 |
* | HTML metadata: fix type warnings | Bryan Newbold | 2020-10-27 | 1 | -1/+3 |
* | start HTML metadata extraction code | Bryan Newbold | 2020-10-27 | 1 | -0/+230 |