| | Commit message | Author | Age | Files | Lines |
---|---|---|---|---|---|
* | old HTML extractors: handle null tag | Bryan Newbold | 2021-09-08 | 1 | -8/+9 |
* | ingest: fix html PDF extraction exception catch behavior | Bryan Newbold | 2021-05-24 | 1 | -3/+2 |
* | ingest PDF extraction updates | Bryan Newbold | 2021-05-21 | 1 | -0/+17 |
* | better OSF preprint download re-writing | Bryan Newbold | 2021-05-21 | 1 | -6/+23 |
* | move some PDF URL extraction into declarative format | Bryan Newbold | 2020-11-08 | 1 | -116/+18 |
* | html: handle JMIR URL pattern | Bryan Newbold | 2020-09-15 | 1 | -0/+6 |
* | skip citation_pdf_url if it is a link loop | Bryan Newbold | 2020-09-14 | 1 | -2/+8 |
| | This may help get around link-loop errors for a specific version of OJS | | | | |
* | html parse: add another generic fulltext pattern | Bryan Newbold | 2020-09-14 | 1 | -1/+10 |
* | html: handle embed with mangled 'src' attribute | Bryan Newbold | 2020-08-24 | 1 | -1/+1 |
* | html: extract eprints PDF url (eg, ub.uni-heidelberg.de) | Bryan Newbold | 2020-08-11 | 1 | -0/+2 |
* | extract PDF urls for e-periodica.ch | Bryan Newbold | 2020-08-10 | 1 | -0/+6 |
* | add more HTML extraction tricks | Bryan Newbold | 2020-08-08 | 1 | -2/+29 |
* | rwth-aachen.de HTML extract, and a generic URL guess method | Bryan Newbold | 2020-08-08 | 1 | -0/+15 |
* | handle UnboundLocalError in HTML parsing | Bryan Newbold | 2020-05-19 | 1 | -1/+4 |
* | hotfix for html meta extract codepath | Bryan Newbold | 2020-05-03 | 1 | -1/+1 |
| | Didn't test last commit before pushing; bad Bryan! | | | | |
* | ingest: handle partial citation_pdf_url tag | Bryan Newbold | 2020-05-03 | 1 | -0/+3 |
| | Eg: https://www.cureus.com/articles/29935-a-nomogram-for-the-rapid-prediction-of-hematocrit-following-blood-loss-and-fluid-shifts-in-neonates-infants-and-adults Has: `<meta name="citation_pdf_url"/>` | | | | |
* | fix KeyError in HTML PDF URL extraction | Bryan Newbold | 2020-04-17 | 1 | -1/+1 |
* | html: attempt at CNKI href extraction | Bryan Newbold | 2020-04-13 | 1 | -0/+11 |
* | ingest: eurosurveillance PDF parser | Bryan Newbold | 2020-03-25 | 1 | -0/+11 |
* | ingest: handle missing chemrxiv tag | Bryan Newbold | 2020-02-24 | 1 | -1/+1 |
* | ingest: more direct americanarchivist PDF url guess | Bryan Newbold | 2020-02-24 | 1 | -0/+4 |
* | ingest: make ehp.niehs.nih.gov rule more robust | Bryan Newbold | 2020-02-24 | 1 | -2/+3 |
* | small tweak to americanarchivist.org URL extraction | Bryan Newbold | 2020-02-24 | 1 | -1/+1 |
* | html: more publisher-specific fulltext extraction tricks | Bryan Newbold | 2020-02-22 | 1 | -0/+47 |
* | html: degruyter extraction; disabled journals.lww.com | Bryan Newbold | 2020-02-22 | 1 | -0/+19 |
* | html: handle TypeError during bs4 parse | Bryan Newbold | 2020-02-22 | 1 | -1/+7 |
* | allow `<meta property=citation_pdf_url>` | Bryan Newbold | 2020-02-18 | 1 | -0/+3 |
| | at least researchgate does this (!) | | | | |
* | html extract: protocols.io, fix americanarchivist | Bryan Newbold | 2020-01-10 | 1 | -1/+7 |
* | more ingest HTML extraction hacks | Bryan Newbold | 2020-01-10 | 1 | -6/+46 |
* | many publisher-specific ingest improvements | Bryan Newbold | 2020-01-10 | 1 | -4/+96 |
* | fill in more html extraction techniques | Bryan Newbold | 2020-01-09 | 1 | -7/+6 |
* | refactor: use print(..., file=sys.stderr) | Bryan Newbold | 2019-12-18 | 1 | -1/+1 |
| | Should use logging soon, but this seems more idiomatic in the meanwhile. | | | | |
* | start of hrmars.com ingest support | Bryan Newbold | 2019-11-14 | 1 | -0/+2 |
* | citation_pdf_url with host-relative URLs | Bryan Newbold | 2019-11-13 | 1 | -1/+3 |
* | more progress on file ingest | Bryan Newbold | 2019-11-13 | 1 | -0/+19 |
* | much progress on file ingest path | Bryan Newbold | 2019-10-22 | 1 | -0/+73 |
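
Several commits in this log concern the `citation_pdf_url` meta tag: a partial tag with no value, the `property=` variant, host-relative URLs, and link loops. The following is a minimal sketch of how those cases might be handled together. It is not the repository's code. The log mentions bs4, but this sketch uses the standard library `html.parser` to stay self-contained. The names `_MetaPdfUrlParser` and `extract_citation_pdf_url` are hypothetical, and treating a "link loop" as a tag that points back at the page's own URL is an assumption.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin


class _MetaPdfUrlParser(HTMLParser):
    """Collects the first non-empty citation_pdf_url <meta> value (hypothetical helper)."""

    def __init__(self) -> None:
        super().__init__()
        self.pdf_url: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.pdf_url is not None:
            return
        attr = dict(attrs)
        # Accept both <meta name=...> and <meta property=...>
        key = attr.get("name") or attr.get("property")
        if key != "citation_pdf_url":
            return
        # A partial tag such as <meta name="citation_pdf_url"/> has no content
        content = (attr.get("content") or "").strip()
        if content:
            self.pdf_url = content


def extract_citation_pdf_url(html_body: str, base_url: str) -> Optional[str]:
    """Return an absolute PDF URL from the meta tag, or None if unusable."""
    parser = _MetaPdfUrlParser()
    parser.feed(html_body)
    if not parser.pdf_url:
        return None
    # Resolve host-relative values like "/files/1.pdf" against the page URL
    pdf_url = urljoin(base_url, parser.pdf_url)
    # Assumed link-loop check: the tag points back at the page itself
    if pdf_url == base_url:
        return None
    return pdf_url
```

A caller would pass the fetched HTML body together with the URL it was fetched from, and fall back to other extraction rules when the result is `None`.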