Commit message | Author | Age | Files | Lines | |
---|---|---|---|---|---|
* | ingest: more generic OJS support, including pre-prints | Bryan Newbold | 2022-10-24 | 1 | -6/+22 |
| | | | | | There were some '/article/view/' patterns which can also be, eg, '/preprint/view/'. | ||||
* | ingest: more generic PDF fulltext URL patterns | Bryan Newbold | 2022-10-24 | 1 | -0/+14 |
| | |||||
* | html: worldscientific PDF URL extraction | Bryan Newbold | 2022-10-24 | 1 | -0/+16 |
| | |||||
* | ingest: more PDF fulltext tricks | Bryan Newbold | 2022-07-20 | 1 | -0/+29 |
| | |||||
* | ingest: more PDF fulltext URL patterns | Bryan Newbold | 2022-07-20 | 1 | -0/+42 |
| | |||||
* | html: mangled JSON-in-URL pattern | Bryan Newbold | 2022-07-15 | 1 | -0/+1 |
| | |||||
* | html: fulltext URL prefixes to skip; also fix broken pattern matching | Bryan Newbold | 2022-07-15 | 1 | -4/+19 |
| | | | | | Due to both the 'continue-in-a-for-loop' and 'missing-trailing-commas', the existing pattern matching was not working. | ||||
* | HTML ingest: most sub-resource patterns to skip | Bryan Newbold | 2022-07-15 | 1 | -1/+13 |
| | |||||
* | ingest: random site PDF link pattern | Bryan Newbold | 2022-07-12 | 1 | -0/+7 |
| | |||||
* | ingest: doaj.org article landing page access links | Bryan Newbold | 2022-07-12 | 1 | -0/+12 |
| | |||||
* | sandcrawler: additional extracts, mostly OJS | Bryan Newbold | 2022-01-13 | 1 | -1/+23 |
| | |||||
* | ingest: PDF pattern for integrityresjournals.org | Bryan Newbold | 2022-01-13 | 1 | -0/+8 |
| | |||||
* | codespell typos in python (comments) | Bryan Newbold | 2021-11-24 | 1 | -1/+1 |
| | |||||
* | html_meta: actual typo in code (CSS selector) caught by codespell | Bryan Newbold | 2021-11-24 | 1 | -1/+1 |
| | |||||
* | make fmt (black 21.9b0) | Bryan Newbold | 2021-10-27 | 1 | -62/+71 |
| | |||||
* | lint collection membership (last lint for now) | Bryan Newbold | 2021-10-26 | 1 | -9/+9 |
| | |||||
* | more progress on type annotations and linting | Bryan Newbold | 2021-10-26 | 1 | -12/+13 |
| | |||||
* | start handling trivial lint cleanups: unused imports, 'is None', etc | Bryan Newbold | 2021-10-26 | 1 | -3/+3 |
| | |||||
* | make fmt | Bryan Newbold | 2021-10-26 | 1 | -21/+16 |
| | |||||
* | python: isort all imports | Bryan Newbold | 2021-10-26 | 1 | -5/+4 |
| | |||||
* | component ingest support for dataverse files (individual) | Bryan Newbold | 2021-10-15 | 1 | -13/+27 |
| | |||||
* | pdf ingest: journals.uchicago.edu pattern | Bryan Newbold | 2021-10-11 | 1 | -0/+8 |
| | |||||
* | ingest: basic 'component' and 'src' support | Bryan Newbold | 2021-10-04 | 1 | -0/+15 |
| | |||||
* | yet more PDF URL patterns | Bryan Newbold | 2021-09-03 | 1 | -0/+48 |
| | |||||
* | HTML ingest: several more PDF fulltext URL patterns | Bryan Newbold | 2021-09-03 | 1 | -0/+87 |
| | |||||
* | HTML ingest: skip noisy print() statement | Bryan Newbold | 2021-09-03 | 1 | -1/+1 |
| | |||||
* | HTML ingest: more meta-URI prefixes | Bryan Newbold | 2021-08-24 | 1 | -2/+8 |
| | |||||
* | html ingest: skip 'about:blank' | Bryan Newbold | 2021-08-16 | 1 | -0/+3 |
| | | | | | Couldn't get adblock rule matcher to match this, for some reason. Maybe a special case? | ||||
* | ingest PDF extraction updates | Bryan Newbold | 2021-05-21 | 1 | -0/+54 |
| | |||||
* | html ingest: remove whitespace around relative URLs (eg, for d-lib) | Bryan Newbold | 2021-05-21 | 1 | -1/+1 |
| | |||||
* | ingest: handle current degruyter PDF link pattern | Bryan Newbold | 2021-03-26 | 1 | -0/+8 |
| | |||||
* | html: more conservative parsing of element attr | Bryan Newbold | 2020-11-20 | 1 | -2/+4 |
| | |||||
* | html biblio: handle 'content not in attrs' case | Bryan Newbold | 2020-11-12 | 1 | -2/+2 |
| | |||||
* | html: more adblock | Bryan Newbold | 2020-11-08 | 1 | -1/+3 |
| | |||||
* | move fuzzy URL match method to misc | Bryan Newbold | 2020-11-08 | 1 | -0/+2 |
| | |||||
* | move some PDF URL extraction into declarative format | Bryan Newbold | 2020-11-08 | 1 | -9/+149 |
| | |||||
* | html: more extraction patterns; bugfix; skip more crossmark | Bryan Newbold | 2020-11-08 | 1 | -1/+24 |
| | |||||
* | html: small ingest improvements | Bryan Newbold | 2020-11-08 | 1 | -0/+15 |
| | |||||
* | html: pdf and html extract similar to XML | Bryan Newbold | 2020-11-06 | 1 | -20/+30 |
| | | | | | Note that the primary PDF URL extraction path is a separate code path. | ||||
* | initial implementation of HTML ingest in existing worker | Bryan Newbold | 2020-11-04 | 1 | -0/+5 |
| | |||||
* | html: improve XML fulltext extraction for scielo | Bryan Newbold | 2020-11-03 | 1 | -4/+17 |
| | |||||
* | html: some refactoring | Bryan Newbold | 2020-11-03 | 1 | -10/+40 |
| | |||||
* | html: syntax fixes; resolve relative URLs; extract more XML fulltext URLs | Bryan Newbold | 2020-10-30 | 1 | -5/+12 |
| | |||||
* | html: more ingest improvements | Bryan Newbold | 2020-10-30 | 1 | -0/+2 |
| | |||||
* | html: more biblio selectors; resource extraction | Bryan Newbold | 2020-10-29 | 1 | -0/+102 |
| | |||||
* | HTML meta: more from online hunting/research | Bryan Newbold | 2020-10-27 | 1 | -3/+54 |
| | |||||
* | HTML metadata: fix type warnings | Bryan Newbold | 2020-10-27 | 1 | -1/+3 |
| | |||||
* | start HTML metadata extraction code | Bryan Newbold | 2020-10-27 | 1 | -0/+230 |
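
The 2022-07-15 commit "html: fulltext URL prefixes to skip; also fix broken pattern matching" attributes the bug to a 'continue-in-a-for-loop' and 'missing-trailing-commas'. The changed code is not shown in this log, so the following is only a minimal sketch of how those two Python pitfalls generally behave. The prefix lists, URLs, and function names are hypothetical and are not taken from the repository.

```python
# Hypothetical prefix lists for illustration; not the project's real patterns.
BROKEN_SKIP_PREFIXES = [
    "https://example.com/ads/"  # missing trailing comma: merges with the next literal
    "https://example.org/tracker/",
]
FIXED_SKIP_PREFIXES = [
    "https://example.com/ads/",
    "https://example.org/tracker/",
]


def filter_broken(urls, prefixes):
    """Buggy filter: `continue` only advances the inner loop."""
    kept = []
    for url in urls:
        for prefix in prefixes:
            if url.startswith(prefix):
                continue  # moves on to the next prefix, not the next URL
        kept.append(url)  # always reached, so nothing is ever skipped
    return kept


def filter_fixed(urls, prefixes):
    """Fixed filter: decide per URL, then skip it in the outer loop."""
    kept = []
    for url in urls:
        if any(url.startswith(prefix) for prefix in prefixes):
            continue
        kept.append(url)
    return kept


urls = [
    "https://example.com/ads/banner.js",
    "https://example.org/article/1.pdf",
]
print(len(BROKEN_SKIP_PREFIXES))  # adjacent string literals were concatenated
print(filter_broken(urls, FIXED_SKIP_PREFIXES))
print(filter_fixed(urls, FIXED_SKIP_PREFIXES))
```

Either pitfall on its own is enough to disable matching: the merged literal never matches a real URL, and the inner `continue` never prevents the append. Both fail silently, which may be why the problem went unnoticed until that commit.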