| | Commit message | Author | Age | Files | Lines |
---|---|---|---|---|---|
* | ingest: handle missing chemrxvi tag | Bryan Newbold | 2020-02-24 | 1 | -1/+1 |
* | ingest: more direct americanarchivist PDF url guess | Bryan Newbold | 2020-02-24 | 1 | -0/+4 |
* | ingest: make ehp.niehs.nih.gov rule more robust | Bryan Newbold | 2020-02-24 | 1 | -2/+3 |
* | small tweak to americanarchivist.org URL extraction | Bryan Newbold | 2020-02-24 | 1 | -1/+1 |
* | html: more publisher-specific fulltext extraction tricks | Bryan Newbold | 2020-02-22 | 1 | -0/+47 |
* | html: degruyter extraction; disabled journals.lww.com | Bryan Newbold | 2020-02-22 | 1 | -0/+19 |
* | html: handle TypeError during bs4 parse | Bryan Newbold | 2020-02-22 | 1 | -1/+7 |
* | allow <meta property=citation_pdf_url> | Bryan Newbold | 2020-02-18 | 1 | -0/+3 |
| | at least researchgate does this (!) | | | | |
* | html extract: protocols.io, fix americanarchivist | Bryan Newbold | 2020-01-10 | 1 | -1/+7 |
* | more ingest HTML extraction hacks | Bryan Newbold | 2020-01-10 | 1 | -6/+46 |
* | many publisher-specific ingest improvements | Bryan Newbold | 2020-01-10 | 1 | -4/+96 |
* | fill in more html extraction techniques | Bryan Newbold | 2020-01-09 | 1 | -7/+6 |
* | refactor: use print(..., file=sys.stderr) | Bryan Newbold | 2019-12-18 | 1 | -1/+1 |
| | Should use logging soon, but this seems more idiomatic in the meanwhile. | | | | |
* | start of hrmars.com ingest support | Bryan Newbold | 2019-11-14 | 1 | -0/+2 |
* | citation_pdf_url with host-relative URLs | Bryan Newbold | 2019-11-13 | 1 | -1/+3 |
* | more progress on file ingest | Bryan Newbold | 2019-11-13 | 1 | -0/+19 |
* | much progress on file ingest path | Bryan Newbold | 2019-10-22 | 1 | -0/+73 |
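
Two of the commits above concern the `citation_pdf_url` meta tag: accepting it when it is given as `<meta property=...>` rather than `<meta name=...>`, and resolving host-relative URLs. The following is a minimal sketch of that idea, not the code from these commits. The log mentions bs4; this sketch swaps in the standard-library `html.parser` to stay self-contained, and the names `extract_citation_pdf_url` and `_CitationPdfParser` are hypothetical.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin


class _CitationPdfParser(HTMLParser):
    """Hypothetical helper: records the first citation_pdf_url meta tag seen."""

    def __init__(self) -> None:
        super().__init__()
        self.pdf_url: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Only the first matching <meta> tag is kept.
        if tag != "meta" or self.pdf_url is not None:
            return
        attr = dict(attrs)
        # Accept both <meta name=...> and <meta property=...>.
        key = attr.get("name") or attr.get("property")
        if key == "citation_pdf_url" and attr.get("content"):
            self.pdf_url = attr["content"]


def extract_citation_pdf_url(html: str, base_url: str) -> Optional[str]:
    """Return an absolute PDF URL from the page's meta tags, or None."""
    parser = _CitationPdfParser()
    parser.feed(html)
    if parser.pdf_url is None:
        return None
    # urljoin resolves host-relative values ("/files/x.pdf") against the
    # page URL and leaves absolute URLs unchanged.
    return urljoin(base_url, parser.pdf_url)
```

For example, calling `extract_citation_pdf_url(page_html, "https://example.org/article/1")` on a page whose tag carries `content="/files/paper.pdf"` would yield a URL on `example.org`; the host and path here are placeholders.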