| | Commit message | Author | Age | Files | Lines |
---|---|---|---|---|---|
* | refs: include GROBID-parsed crossref refs | Bryan Newbold | 2021-12-06 | 1 | -4/+52 |
| | This takes advantage of Crossref 'unstructured' refs which have been parsed using GROBID and stored in the sandcrawler database, as part of the sandcrawler crossref metadata pipeline. | | | | |
* | refactor use of grobid_tei_xml | Bryan Newbold | 2021-10-27 | 1 | -41/+39 |
* | replace grobid2json with grobid_tei_xml | Bryan Newbold | 2021-10-27 | 1 | -3/+5 |
| | This first iteration uses the .to_legacy_dict() helpers for backwards compatibility | | | | |
* | lint: small cleanups, mostly E711 and E713 | Bryan Newbold | 2021-10-27 | 1 | -3/+3 |
* | lint: remove all 'import *' uses | Bryan Newbold | 2021-10-27 | 1 | -2/+20 |
* | make fmt (black 21.9b0) | Bryan Newbold | 2021-10-27 | 1 | -3/+10 |
* | re-style imports (isort) on all core python files | Bryan Newbold | 2021-10-27 | 1 | -5/+5 |
* | better parsing of year as integer in refs pipeline | Bryan Newbold | 2021-07-26 | 1 | -2/+2 |
* | make fmt | Bryan Newbold | 2021-07-26 | 1 | -4/+10 |
* | ref_key: hotfix for some corner cases | Bryan Newbold | 2021-07-26 | 1 | -8/+25 |
* | transform: more clean_doi() calls | Bryan Newbold | 2021-07-26 | 1 | -3/+3 |
* | refs transform: consolidate clean_ref_key() hacks | Bryan Newbold | 2021-07-25 | 1 | -17/+35 |
* | refs transform: many fixes | Bryan Newbold | 2021-07-25 | 1 | -9/+34 |
| | - include year correctly (many cases) - test coverage for Crossref transform - pass-through 'edition' as 'version' - series-title parsed into title or container as appropriate - missing release stage - fix 0-index vs. 1-index ref index field | | | | |
* | refs transform: 1-index refs.index, not 0-index | Bryan Newbold | 2021-07-25 | 1 | -3/+11 |
| | This was not matching expectations/schema of downstream refs pipeline (cgraph), and wasn't matching documented schema. Note care required when checking if the index is set, to distinguish between '0' and 'None' values. | | | | |
* | refs: clean up GROBID DOIs and PMCIDs | Bryan Newbold | 2021-07-01 | 1 | -2/+3 |
* | HACK: don't parse TEI-XML for a specific paper/file | Bryan Newbold | 2021-06-30 | 1 | -2/+4 |
| | GROBID v0.5.5 returns TEI-XML for this one PDF which is not valid XML, due to a text encoding issue. | | | | |
* | refs: include (source) release_stage in output | Bryan Newbold | 2021-06-30 | 1 | -0/+1 |
* | bugfix: pass full crossref obj, not just 'record' | Bryan Newbold | 2021-06-02 | 1 | -1/+1 |
* | refs: use fatcat prefix for some sources | Bryan Newbold | 2021-06-02 | 1 | -5/+5 |
| | This makes debugging what is going on much easier | | | | |
* | integrate crossref references, and iterate on refs output logic | Bryan Newbold | 2021-06-02 | 1 | -7/+115 |
| | Needs test coverage! | | | | |
* | schema: add 'crossref' to bundle schema, and add from_json() helper | Bryan Newbold | 2021-06-02 | 1 | -26/+4 |
| | from_json() refactor was an earlier TODO, to reduce duplication when updating fields on this class | | | | |
* | reduce max body size to 0.5M characters | Bryan Newbold | 2021-02-24 | 1 | -1/+1 |
* | fix body size limit | Bryan Newbold | 2021-02-24 | 1 | -4/+4 |
* | fmt and lint fixes (including one actual bug) | Bryan Newbold | 2021-02-15 | 1 | -2/+3 |
* | truncate indexed fulltext body at 1 MByte | Bryan Newbold | 2021-02-15 | 1 | -2/+13 |
| | There was a large ~4 MByte document getting indexed (work_lumgqw4vqbgvha2ejbsbaepedq) with several megabytes of text, and this was causing elasticsearch indexing timeouts. | | | | |
* | catch TEI-XML parsing exception | Bryan Newbold | 2021-01-30 | 1 | -12/+17 |
* | enable sentry exceptions for workers and pipelines | Bryan Newbold | 2021-01-30 | 1 | -1/+12 |
| | It is otherwise difficult to debug multi-million record pipelines. | | | | |
* | bugfix: try resolving lang_code list issue again | Bryan Newbold | 2021-01-30 | 1 | -5/+4 |
* | bugfix: lang_code sometimes a list | Bryan Newbold | 2021-01-29 | 1 | -2/+7 |
* | make fmt | Bryan Newbold | 2021-01-25 | 1 | -1/+4 |
* | basic support for excluding web content from index | Bryan Newbold | 2021-01-22 | 1 | -6/+45 |
| | Based on particular patterns in metadata, or exclusion lists in settings | | | | |
* | bug fix: more html_fulltext not getting processed | Bryan Newbold | 2021-01-22 | 1 | -0/+2 |
* | add container_sherpa_color field, and populate it | Bryan Newbold | 2021-01-22 | 1 | -0/+1 |
* | improve 'oa' tag calculation | Bryan Newbold | 2021-01-16 | 1 | -4/+4 |
* | small corrections to schema/transform | Bryan Newbold | 2021-01-16 | 1 | -2/+4 |
* | add support for new identifiers and size_bytes schema fields | Bryan Newbold | 2021-01-14 | 1 | -0/+3 |
* | basic HTML transform/index support | Bryan Newbold | 2020-11-18 | 1 | -2/+46 |
* | refs: extract fatcat crossref pages metadata | Bryan Newbold | 2020-11-13 | 1 | -1/+1 |
* | commands: show usage on empty command | Bryan Newbold | 2020-11-02 | 1 | -1/+1 |
* | more SIM metadata mappings | Bryan Newbold | 2020-10-19 | 1 | -3/+31 |
* | SIM pipeline: more language conversions | Bryan Newbold | 2020-10-16 | 1 | -2/+5 |
| | Not sure where these language strings are coming from, but these were from existing SIM item metadata in archive.org | | | | |
* | transform: refactor tag generation out of transform heavy method | Bryan Newbold | 2020-10-16 | 1 | -28/+37 |
* | Upgrade Dynaconf to 3+ | Bruno Rocha | 2020-10-05 | 1 | -1/+1 |
| | In dynaconf 3+ it is no longer recommended to use `from dynaconf import settings`; the recommendation is now to create your own instance of the settings object based on the Dynaconf class. | | | | |
* | refs and grobid2json bugfixes from testing | Bryan Newbold | 2020-09-14 | 1 | -3/+10 |
* | bugfix: release_year | Bryan Newbold | 2020-09-13 | 1 | -2/+2 |
* | refs transform: both GROBID and fatcat refs | Bryan Newbold | 2020-09-13 | 1 | -5/+12 |
* | ref transform: support more GROBID fields | Bryan Newbold | 2020-09-13 | 1 | -10/+16 |
* | fixes to refs transform (for non-str author fields) | Bryan Newbold | 2020-09-04 | 1 | -2/+6 |
* | heavy to refs command | Bryan Newbold | 2020-09-04 | 1 | -2/+142 |
* | use simple names, not domain names, for some platforms | Bryan Newbold | 2020-08-12 | 1 | -3/+3 |
| |
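
Two notes in the log describe logic that is easy to get subtly wrong: the 2021-07-25 commit warns that checking whether a ref index is set must distinguish `0` from `None`, and the 2021-02-24 commit caps the indexed body at 0.5M characters. The sketch below illustrates both ideas only; it is not the repository's code. The names `one_indexed`, `truncate_body`, and `MAX_BODY_CHARS` are hypothetical, and it assumes the upstream position is 0-based and the body is a plain string.

```python
from typing import Optional

# Hypothetical constant; the log only says "0.5M characters".
MAX_BODY_CHARS = 500_000


def one_indexed(raw_index: Optional[int]) -> Optional[int]:
    """Convert a 0-based ref position to 1-based, keeping 'unset' as None."""
    # A truthiness check (`if raw_index:`) would be wrong here: 0 is falsy,
    # so the first reference would be treated as having no index at all.
    if raw_index is None:
        return None
    return raw_index + 1


def truncate_body(body: Optional[str], limit: int = MAX_BODY_CHARS) -> Optional[str]:
    """Cap the fulltext body at `limit` characters before indexing."""
    if body is None:
        return None
    # Slicing counts characters (code points), not encoded bytes.
    return body[:limit]


if __name__ == "__main__":
    print(one_indexed(0), one_indexed(None))
    print(len(truncate_body("x" * 600_000)))
```

Comparing against `None` explicitly is the point of the first function; the second slices by character count, so the encoded size in bytes can still exceed the limit for non-ASCII text.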