Commit message | Author | Age | Files | Lines | |
---|---|---|---|---|---|
* | transform: more clean_doi() calls | Bryan Newbold | 2021-07-26 | 1 | -3/+3 |
| | |||||
* | refs transform: consolidate clean_ref_key() hacks | Bryan Newbold | 2021-07-25 | 1 | -17/+35 |
| | |||||
* | refs transform: many fixes | Bryan Newbold | 2021-07-25 | 1 | -9/+34 |
| | | | | | | | | | include year correctly (many cases); test coverage for Crossref transform; pass-through 'edition' as 'version'; series-title parsed into title or container as appropriate; missing release stage; fix 0-index vs. 1-index ref index field | ||||
* | refs transform: 1-index refs.index, not 0-index | Bryan Newbold | 2021-07-25 | 1 | -3/+11 |
| | | | | | | | | The 0-based index did not match the expectations/schema of the downstream refs pipeline (cgraph) or the documented schema. Note that care is required when checking whether the index is set, to distinguish between '0' and 'None' values. | ||||
* | refs: clean up GROBID DOIs and PMCIDs | Bryan Newbold | 2021-07-01 | 1 | -2/+3 |
| | |||||
* | HACK: don't parse TEI-XML for a specific paper/file | Bryan Newbold | 2021-06-30 | 1 | -2/+4 |
| | | | | | GROBID v0.5.5 returns TEI-XML for this one PDF which is not valid XML, due to a text encoding issue. | ||||
* | refs: include (source) release_stage in output | Bryan Newbold | 2021-06-30 | 1 | -0/+1 |
| | |||||
* | bugfix: pass full crossref obj, not just 'record' | Bryan Newbold | 2021-06-02 | 1 | -1/+1 |
| | |||||
* | refs: use fatcat prefix for some sources | Bryan Newbold | 2021-06-02 | 1 | -5/+5 |
| | | | | This makes debugging what is going on much easier | ||||
* | integrate crossref references, and iterate on refs output logic | Bryan Newbold | 2021-06-02 | 1 | -7/+115 |
| | | | | Needs test coverage! | ||||
* | schema: add 'crossref' to bundle schema, and add from_json() helper | Bryan Newbold | 2021-06-02 | 1 | -26/+4 |
| | | | | | from_json() refactor was an earlier TODO, to reduce duplication when updating fields on this class | ||||
* | reduce max body size to 0.5M characters | Bryan Newbold | 2021-02-24 | 1 | -1/+1 |
| | |||||
* | fix body size limit | Bryan Newbold | 2021-02-24 | 1 | -4/+4 |
| | |||||
* | fmt and lint fixes (including one actual bug) | Bryan Newbold | 2021-02-15 | 1 | -2/+3 |
| | |||||
* | truncate indexed fulltext body at 1 MByte | Bryan Newbold | 2021-02-15 | 1 | -2/+13 |
| | | | | | | There was a large ~4 MByte document getting indexed (work_lumgqw4vqbgvha2ejbsbaepedq) with several megabytes of text, and this was causing elasticsearch indexing timeouts. | ||||
* | catch TEI-XML parsing exception | Bryan Newbold | 2021-01-30 | 1 | -12/+17 |
| | |||||
* | enable sentry exceptions for workers and pipelines | Bryan Newbold | 2021-01-30 | 1 | -1/+12 |
| | | | | It is otherwise difficult to debug multi-million record pipelines. | ||||
* | bugfix: try resolving lang_code list issue again | Bryan Newbold | 2021-01-30 | 1 | -5/+4 |
| | |||||
* | bugfix: lang_code sometimes a list | Bryan Newbold | 2021-01-29 | 1 | -2/+7 |
| | |||||
* | make fmt | Bryan Newbold | 2021-01-25 | 1 | -1/+4 |
| | |||||
* | basic support for excluding web content from index | Bryan Newbold | 2021-01-22 | 1 | -6/+45 |
| | | | | Based on particular patterns in metadata, or exclusion lists in settings | ||||
* | bug fix: more html_fulltext not getting processed | Bryan Newbold | 2021-01-22 | 1 | -0/+2 |
| | |||||
* | add container_sherpa_color field, and populate it | Bryan Newbold | 2021-01-22 | 1 | -0/+1 |
| | |||||
* | improve 'oa' tag calculation | Bryan Newbold | 2021-01-16 | 1 | -4/+4 |
| | |||||
* | small corrections to schema/transform | Bryan Newbold | 2021-01-16 | 1 | -2/+4 |
| | |||||
* | add support for new identifiers and size_bytes schema fields | Bryan Newbold | 2021-01-14 | 1 | -0/+3 |
| | |||||
* | basic HTML transform/index support | Bryan Newbold | 2020-11-18 | 1 | -2/+46 |
| | |||||
* | refs: extract fatcat crossref pages metadata | Bryan Newbold | 2020-11-13 | 1 | -1/+1 |
| | |||||
* | commands: show usage on empty command | Bryan Newbold | 2020-11-02 | 1 | -1/+1 |
| | |||||
* | more SIM metadata mappings | Bryan Newbold | 2020-10-19 | 1 | -3/+31 |
| | |||||
* | SIM pipeline: more language conversions | Bryan Newbold | 2020-10-16 | 1 | -2/+5 |
| | | | | | Not sure where these language strings are coming from, but these were from existing SIM item metadata in archive.org | ||||
* | transform: refactor tag generation out of transform heavy method | Bryan Newbold | 2020-10-16 | 1 | -28/+37 |
| | |||||
* | Upgrade Dynaconf to 3+ | Bruno Rocha | 2020-10-05 | 1 | -1/+1 |
| | | | | In Dynaconf 3+, using `from dynaconf import settings` is no longer recommended; the recommendation now is to create your own settings instance based on the `Dynaconf` class. | ||||
* | refs and grobid2json bugfixes from testing | Bryan Newbold | 2020-09-14 | 1 | -3/+10 |
| | |||||
* | bugfix: release_year | Bryan Newbold | 2020-09-13 | 1 | -2/+2 |
| | |||||
* | refs transform: both GROBID and fatcat refs | Bryan Newbold | 2020-09-13 | 1 | -5/+12 |
| | |||||
* | ref transform: support more GROBID fields | Bryan Newbold | 2020-09-13 | 1 | -10/+16 |
| | |||||
* | fixes to refs transform (for non-str author fields) | Bryan Newbold | 2020-09-04 | 1 | -2/+6 |
| | |||||
* | heavy to refs command | Bryan Newbold | 2020-09-04 | 1 | -2/+142 |
| | |||||
* | use simple names, not domain names, for some platforms | Bryan Newbold | 2020-08-12 | 1 | -3/+3 |
| | |||||
* | biblio metadata hacks at transform time | Bryan Newbold | 2020-08-12 | 1 | -2/+98 |
| | |||||
* | don't index sim_page without issue_item and first_page | Bryan Newbold | 2020-08-06 | 1 | -0/+3 |
| | |||||
* | handle integer conversion and bounding for ES schema | Bryan Newbold | 2020-08-06 | 1 | -10/+13 |
| | |||||
* | json: exclude None in output, and sort keys | Bryan Newbold | 2020-07-27 | 1 | -1/+1 |
| | | | | | | | | | | These are both size/performance enhancements. Not including 'None' values will reduce document sizes on-disk and over network, particularly for intermediate objects. Sorting by key should improve compression ratios across multiple documents, both on-disk (gzip) and in elasticsearch itself: https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-disk-usage.html#_put_fields_in_the_same_order_in_documents | ||||
* | ensure SIM release date parses before assigning | Bryan Newbold | 2020-07-21 | 1 | -1/+6 |
| | |||||
* | make fmt | Bryan Newbold | 2020-06-29 | 1 | -8/+13 |
| | |||||
* | include GROBID-extracted abstracts in search documents | Bryan Newbold | 2020-06-29 | 1 | -10/+15 |
| | |||||
* | small improvements to SIM metadata maps | Bryan Newbold | 2020-06-29 | 1 | -6/+11 |
| | |||||
* | fixes for pdf_meta dict | Bryan Newbold | 2020-06-29 | 1 | -1/+2 |
| | |||||
* | remove old COVID19 thumbnail hack | Bryan Newbold | 2020-06-29 | 1 | -1/+2 |
| |
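
The 2021-07-25 commit "refs transform: 1-index refs.index, not 0-index" notes that care is required to tell an index of '0' apart from 'None'. A minimal sketch of that check follows; the function name `to_one_indexed` is hypothetical and is not taken from the repository.

```python
from typing import Optional


def to_one_indexed(index: Optional[int]) -> Optional[int]:
    """Convert a 0-based reference index to 1-based; leave unset values alone.

    Hypothetical helper for illustration only.
    """
    # Compare against None explicitly. A truthiness test (`if index:`) would
    # treat the first reference (index 0) as "not set" and skip it.
    if index is None:
        return None
    return index + 1
```

The explicit `is None` comparison is the point of the sketch: `0` is falsy in Python, so a plain truthiness check would silently drop the first reference.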
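
The 2020-07-27 commit "json: exclude None in output, and sort keys" describes two size optimizations: dropping 'None' values and sorting keys. The sketch below shows the idea with the standard `json` module only; the actual change is a one-line edit and may rely on a serialization library's own options, so `drop_none` and `to_compact_json` are assumed names for illustration.

```python
import json
from typing import Any


def drop_none(value: Any) -> Any:
    """Recursively remove dict entries whose value is None.

    None items inside lists are kept, so list positions are unchanged.
    """
    if isinstance(value, dict):
        return {k: drop_none(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [drop_none(v) for v in value]
    return value


def to_compact_json(doc: dict) -> str:
    # sort_keys=True gives every document the same field order, which the
    # commit message expects to help compression across many documents.
    return json.dumps(drop_none(doc), sort_keys=True)
```

For example, `to_compact_json({"title": "x", "doi": None})` serializes only the `title` field.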
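
The February 2021 commits cap the indexed fulltext body, first at 1 MByte and then at 0.5M characters, after a ~4 MByte document caused elasticsearch indexing timeouts. A rough sketch of such a cap is shown here; the constant `MAX_BODY_CHARS` and the function `truncate_body` are hypothetical names, and the real code may count and cut differently.

```python
from typing import Optional

# Assumed limit, following the "0.5M characters" wording in the log.
MAX_BODY_CHARS = 500_000


def truncate_body(body: Optional[str], max_chars: int = MAX_BODY_CHARS) -> Optional[str]:
    """Return body cut to at most max_chars characters (not bytes)."""
    if body is None or len(body) <= max_chars:
        return body
    # Slicing a str counts Unicode code points, so the encoded size in
    # bytes can still be larger than max_chars for non-ASCII text.
    return body[:max_chars]
```

Counting characters rather than bytes keeps the cut from landing in the middle of a multi-byte character, at the cost of a less exact bound on document size.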