| | Commit message | Author | Age | Files | Lines |
---|---|---|---|---|---|
* | html: more adblock | Bryan Newbold | 2020-11-08 | 1 | -1/+3 |
* | move fuzzy URL match method to misc | Bryan Newbold | 2020-11-08 | 1 | -0/+2 |
* | move some PDF URL extraction into declarative format | Bryan Newbold | 2020-11-08 | 1 | -9/+149 |
* | html: more extraction patterns; bugfix; skip more crossmark | Bryan Newbold | 2020-11-08 | 1 | -1/+24 |
* | html: small ingest improvements | Bryan Newbold | 2020-11-08 | 1 | -0/+15 |
* | html: pdf and html extract similar to XML | Bryan Newbold | 2020-11-06 | 1 | -20/+30 |
| | Note that the primary PDF URL extraction path is a separate code path. | | | | |
* | initial implementation of HTML ingest in existing worker | Bryan Newbold | 2020-11-04 | 1 | -0/+5 |
* | html: improve XML fulltext extraction for scielo | Bryan Newbold | 2020-11-03 | 1 | -4/+17 |
* | html: some refactoring | Bryan Newbold | 2020-11-03 | 1 | -10/+40 |
* | html: syntax fixes; resolve relative URLs; extract more XML fulltext URLs | Bryan Newbold | 2020-10-30 | 1 | -5/+12 |
* | html: more ingest improvements | Bryan Newbold | 2020-10-30 | 1 | -0/+2 |
* | html: more biblio selectors; resource extraction | Bryan Newbold | 2020-10-29 | 1 | -0/+102 |
* | HTML meta: more from online hunting/research | Bryan Newbold | 2020-10-27 | 1 | -3/+54 |
* | HTML metadata: fix type warnings | Bryan Newbold | 2020-10-27 | 1 | -1/+3 |
* | start HTML metadata extraction code | Bryan Newbold | 2020-10-27 | 1 | -0/+230 |