| | Commit message | Author | Age | Files | Lines |
---|---|---|---|---|---|
* | refs: include GROBID-parsed crossref refs | Bryan Newbold | 2021-12-06 | 1 | -0/+1 |
| | This takes advantage of Crossref 'unstructured' refs which have been parsed using GROBID and stored in the sandcrawler database, as part of the sandcrawler crossref metadata pipeline. | | | | |
* | fetch GROBID-parsed refs along with crossref metadata | Bryan Newbold | 2021-12-06 | 1 | -1/+2 |
* | Revert "pull GROBID refs along with crossref records into bundles" | Bryan Newbold | 2021-11-10 | 1 | -2/+1 |
| | This reverts commit c164970449a392b5165d903d213c2bb51f2a187f. Didn't mean to merge this to master just yet. | | | | |
* | lint: disallow 'import *' even in tests | Bryan Newbold | 2021-11-10 | 2 | -4/+14 |
* | pull GROBID refs along with crossref records into bundles | Bryan Newbold | 2021-11-10 | 1 | -1/+2 |
* | refactor use of grobid_tei_xml | Bryan Newbold | 2021-10-27 | 2 | -3/+33 |
* | replace grobid2json with grobid_tei_xml | Bryan Newbold | 2021-10-27 | 2 | -5/+11 |
| | This first iteration uses the .to_legacy_dict() helpers for backwards compatibility | | | | |
* | lint: small cleanups, mostly E711 and E713 | Bryan Newbold | 2021-10-27 | 2 | -2/+2 |
* | make fmt (black 21.9b0) | Bryan Newbold | 2021-10-27 | 2 | -3/+8 |
* | re-style imports (isort) on all core python files | Bryan Newbold | 2021-10-27 | 7 | -7/+9 |
* | web: access_redirect_fallback mechanism | Bryan Newbold | 2021-07-26 | 1 | -1/+102 |
| | This adds a helper code path that "tries harder" to find an access link, by querying the fatcat API directly to look for any file from any release associated with the work. If it finds a match, it does the redirect as usual (but does log the incident). If no match can be found, there is now a more helpful access-specific 404 error page. If the *work* is a 404, the generic error page is shown. | | | | |
* | make fmt | Bryan Newbold | 2021-07-26 | 1 | -5/+13 |
* | fix failing test after clean_doi() | Bryan Newbold | 2021-07-26 | 1 | -1/+1 |
* | refs transform: many fixes | Bryan Newbold | 2021-07-25 | 2 | -1/+274 |
| | include year correctly (many cases); test coverage for Crossref transform; pass-through 'edition' as 'version'; series-title parsed into title or container as appropriate; missing release stage; fix 0-index vs. 1-index ref index field | | | | |
* | refs transform: 1-index refs.index, not 0-index | Bryan Newbold | 2021-07-25 | 1 | -1/+1 |
| | This was not matching expectations/schema of downstream refs pipeline (cgraph), and wasn't matching documented schema. Note care required when checking if the index is set, to distinguish between '0' and 'None' values. | | | | |
* | refs: include (source) release_stage in output | Bryan Newbold | 2021-06-30 | 1 | -9/+18 |
* | commit missing elastic get example JSON files | Bryan Newbold | 2021-06-11 | 2 | -0/+174 |
* | update citation_pdf_url HTML meta tag to new access URL style | Bryan Newbold | 2021-06-11 | 1 | -0/+1 |
* | update access redirect URL endpoints | Bryan Newbold | 2021-06-11 | 1 | -19/+20 |
* | lint fixes, and run fmt | Bryan Newbold | 2021-06-02 | 1 | -4/+1 |
* | add 'crossref' hydration to work pipeline | Bryan Newbold | 2021-06-02 | 1 | -0/+16 |
| | The immediate motivation is to include recent crossref refs in citation graph transforms. May also be valuable for researchers to have authoritative/publisher metadata in the bundle dumps. | | | | |
* | web: fixes to access redirect endpoints | Bryan Newbold | 2021-05-19 | 1 | -0/+11 |
* | iterate on PDF redirect links | Bryan Newbold | 2021-05-17 | 1 | -3/+41 |
* | iterate on access redirects and landing page implementation | Bryan Newbold | 2021-04-27 | 2 | -0/+123 |
| | Small code refactors and minimal test coverage | | | | |
* | Revert undesirable changes | Christian Clauss | 2021-02-23 | 6 | -11/+11 |
* | Modernize Python syntax with pyupgrade --py38-plus **/*.py | Christian Clauss | 2021-02-23 | 6 | -11/+11 |
* | api: handle null 'q' parameter on search endpoint | Bryan Newbold | 2021-02-11 | 1 | -1/+5 |
* | refactor ES configuration setting names | Bryan Newbold | 2021-01-25 | 1 | -1/+1 |
* | api: fix /search test, and mypy error on implementation | Bryan Newbold | 2021-01-15 | 1 | -1/+11 |
* | add mocks to work pipeline test | Bryan Newbold | 2021-01-14 | 1 | -1/+63 |
* | add regression test for uvloop+httptools uvicorn problem | Bryan Newbold | 2021-01-05 | 1 | -0/+11 |
* | improve Accept-Language header parsing | Bryan Newbold | 2020-12-02 | 1 | -0/+4 |
* | fmt | Bryan Newbold | 2020-10-28 | 1 | -1/+0 |
* | fixes to issue_db tests | Bryan Newbold | 2020-10-23 | 1 | -6/+3 |
* | basic web search test | Bryan Newbold | 2020-10-23 | 2 | -1/+1701 |
* | basic test for issue-db pipeline | Bryan Newbold | 2020-10-23 | 3 | -0/+30 |
* | start test coverage for web interface | Bryan Newbold | 2020-10-22 | 2 | -0/+68 |
* | improve test coverage | Bryan Newbold | 2020-10-22 | 5 | -0/+72 |
* | minimum viable tests for GROBID XML parsing and refs transform | Bryan Newbold | 2020-09-14 | 3 | -0/+535 |
* | another clean_str() test case | Bryan Newbold | 2020-08-12 | 1 | -0/+4 |
* | transform: more string cleaning | Bryan Newbold | 2020-08-12 | 1 | -1/+19 |
* | scrub_text: single-token strings skipped | Bryan Newbold | 2020-08-06 | 1 | -1/+1 |
* | start some annotation fixes for pytype | Bryan Newbold | 2020-06-03 | 1 | -1/+1 |
* | flake8-annotation linting | Bryan Newbold | 2020-06-03 | 3 | -4/+4 |
| | Added some new annotations; need to finish more. | | | | |
* | flake8 fixes (partial) | Bryan Newbold | 2020-06-03 | 2 | -3/+0 |
* | reformat python code with black | Bryan Newbold | 2020-06-03 | 3 | -13/+19 |
* | improve text scrubbing | Bryan Newbold | 2020-06-03 | 1 | -0/+15 |
| | Was going to use textpipe, but dependency was too large and failed to install with halfway modern GCC (due to CLD2 issue): https://github.com/GregBowyer/cld2-cffi/issues/12. So instead basically pulled out the clean_text function, which is quite short. | | | | |
* | first pass transform from pipelines to ES schema | Bryan Newbold | 2020-05-20 | 1 | -1/+1 |
* | initial progress on work pipeline | Bryan Newbold | 2020-05-16 | 1 | -2/+2 |
* | crude djvu XML parsing | Bryan Newbold | 2020-05-16 | 2 | -0/+5158 |
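
The "refs transform: 1-index refs.index, not 0-index" entry above notes that care is required to distinguish '0' from 'None' when checking whether an index is set. A minimal sketch of that pitfall follows; the function name `to_one_indexed` is hypothetical and is not taken from the repository.

```python
from typing import Optional


def to_one_indexed(index: Optional[int]) -> Optional[int]:
    """Convert a 0-based ref index to 1-based, passing None through.

    Hypothetical helper for illustration only. A plain truthiness check
    (`if index:`) would treat the first ref (index 0) as "not set",
    so the comparison against None has to be explicit.
    """
    if index is None:
        return None
    return index + 1
```

With an explicit `is None` check, the first reference (0-based index 0) maps to 1 instead of being dropped as unset.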