Commit message | Author | Age | Files | Lines | |
---|---|---|---|---|---|
* | improve is_oa flag accuracy | Bryan Newbold | 2020-02-26 | 2 | -10/+6 |
| | | | | | | In particular, the ezb=green match seems mostly incorrect. Note that a release with a pmcid assigned might still be in an embargo window. | ||||
* | update ES transform README | Bryan Newbold | 2020-02-26 | 1 | -2/+3 |
| | | | | | smaller batch sizes to prevent esbulk errors; file transform/index | ||||
* | fix fatcat_transform state filters | Bryan Newbold | 2020-02-26 | 1 | -4/+4 |
| | |||||
* | bulk ES transform: skip non-active entities | Bryan Newbold | 2020-02-26 | 1 | -0/+8 |
| | |||||
* | ES container last tweaks | Bryan Newbold | 2020-02-26 | 2 | -3/+7 |
| | |||||
* | ES release: last minor tweaks | Bryan Newbold | 2020-02-26 | 2 | -5/+7 |
| | |||||
* | ES updates: fix tests to accept archive.org in host/domain | Bryan Newbold | 2020-02-14 | 1 | -2/+3 |
| | |||||
* | release schema: do doc_value on DOIs | Bryan Newbold | 2020-02-13 | 1 | -1/+1 |
| | | | | | | Because DOIs are pseudo-structured (a prefix, and often further structure within the publisher-controlled area), I suspect we will in fact want to run analytics over these strings. | ||||
* | ES files: don't remove archive.org domains/hosts | Bryan Newbold | 2020-02-07 | 1 | -5/+0 |
| | |||||
* | ES release: actually do want doc_values for work_id | Bryan Newbold | 2020-02-05 | 1 | -1/+1 |
| | | | | Eg, for fast "unique count" | ||||
* | fix axiv/arxiv typo in release schema | Bryan Newbold | 2020-02-04 | 1 | -1/+1 |
| | |||||
* | ES release schema: fix typo | Bryan Newbold | 2020-01-31 | 1 | -1/+1 |
| | |||||
* | ES releases: host/domain fixes | Bryan Newbold | 2020-01-31 | 2 | -2/+5 |
| | |||||
* | pipenv: lock zipp version to work around python3.6 requirement | Bryan Newbold | 2020-01-30 | 2 | -7/+20 |
| | |||||
* | fix release es transform missing 'issue' | Bryan Newbold | 2020-01-30 | 1 | -0/+1 |
| | |||||
* | fix json typos in changelog schema | Bryan Newbold | 2020-01-30 | 1 | -2/+2 |
| | |||||
* | add upper-case work-around from kibana map join | Bryan Newbold | 2020-01-30 | 2 | -0/+2 |
| | |||||
* | JSON typo in release mapping | Bryan Newbold | 2020-01-30 | 1 | -1/+0 |
| | |||||
* | ES schemas: make keywords case-insensitive by default | Bryan Newbold | 2020-01-30 | 4 | -66/+115 |
| | | | | But not applying asciifolding; don't see any need to do so? | ||||
* | tweak file ES archive.org domain tracking | Bryan Newbold | 2020-01-30 | 2 | -0/+7 |
| | |||||
* | implement host+domain parsing for file ES transform | Bryan Newbold | 2020-01-30 | 2 | -13/+8 |
| | |||||
* | pipenv: add tldextract (url parser) and update deps | Bryan Newbold | 2020-01-30 | 2 | -136/+159 |
| | |||||
* | fix ES file schema plural field names | Bryan Newbold | 2020-01-29 | 2 | -5/+4 |
| | |||||
* | new biblio-only general search | Bryan Newbold | 2020-01-29 | 1 | -2/+2 |
| | | | | The other fields now use "copy_to" to feed the merged biblio field. | ||||
* | elastic schema fixes | Bryan Newbold | 2020-01-29 | 3 | -7/+12 |
| | |||||
* | add country to v03b release schema | Bryan Newbold | 2020-01-29 | 2 | -0/+3 |
| | |||||
* | update ES docs and proposal | Bryan Newbold | 2020-01-29 | 2 | -4/+6 |
| | |||||
* | actually implement changelog transform | Bryan Newbold | 2020-01-29 | 3 | -19/+78 |
| | |||||
* | fix some transform bugs, add some tests | Bryan Newbold | 2020-01-29 | 6 | -13/+48 |
| | |||||
* | ES release schema updates | Bryan Newbold | 2020-01-29 | 2 | -28/+122 |
| | |||||
* | container ES schema changes | Bryan Newbold | 2020-01-29 | 2 | -29/+38 |
| | |||||
* | first implementation of ES file schema | Bryan Newbold | 2020-01-29 | 4 | -3/+115 |
| | | | | | Includes a trivial test and transform, but not any workers or doc updates. | ||||
* | fix KafkaError worker reporting for partition errors | Bryan Newbold | 2020-01-29 | 3 | -3/+3 |
| | |||||
* | additional DOI prefix filters | Bryan Newbold | 2020-01-28 | 1 | -0/+8 |
| | | | | From martin, thanks. | ||||
* | increase kafka-pixy timeout to 25 seconds | Bryan Newbold | 2020-01-28 | 1 | -1/+1 |
| | |||||
* | apply ingest request filtering in entity worker | Bryan Newbold | 2020-01-28 | 1 | -3/+34 |
| | | | | | | | `ingest_oa_only` behavior, and other filters, now handled in the entity update worker, instead of in the transform function. Also add a DOI prefix blocklist feature. | ||||
* | remove 'oa_only' feature from ingest transform | Bryan Newbold | 2020-01-28 | 2 | -15/+1 |
| | | | | Refactoring to move this filter elsewhere | ||||
* | more TODO/proposal cleanup | Bryan Newbold | 2020-01-22 | 4 | -10/+34 |
| | |||||
* | more details on potential _edit table disk savings | Bryan Newbold | 2020-01-22 | 1 | -3/+23 |
| | |||||
* | proposal of ideas for reducing database size | Bryan Newbold | 2020-01-21 | 1 | -0/+154 |
| | |||||
* | cleanup some of old TODO list into proposals | Bryan Newbold | 2020-01-21 | 4 | -44/+269 |
| | |||||
* | refactor fatcat_import kafka group names | Bryan Newbold | 2020-01-21 | 1 | -13/+54 |
| | | | | | | | | | | | | | My current understanding is that consumer group names should be one-to-one with topic names. I previously thought offsets were stored on a {topic, group} key, but they seem to be mixed, and having too many workers in the same group is bad. In particular, we don't want cross-talk or load between QA and prod. All these topics are caught up in prod, so deploying this change and restarting workers should be safe. This commit does not update the elasticsearch or entity updates workers. | ||||
* | fix trivial typo in file importer | Bryan Newbold | 2020-01-20 | 1 | -1/+1 |
| | |||||
* | stats: remove internal PG table sizes from old dumps | Bryan Newbold | 2020-01-19 | 2 | -292/+0 |
| | | | | For ease of reading and comparison | ||||
* | update stats and table sizes | Bryan Newbold | 2020-01-19 | 4 | -0/+96 |
| | |||||
* | Merge branch 'martin-openapi-typo-exmaple' into 'master' | bnewbold | 2020-01-19 | 1 | -1/+1 |
|\ | | | | | | | | | fix a typo in openapi definition. See merge request webgroup/fatcat!20 | ||||
| * | fix a typo in openapi definition | Martin Czygan | 2020-01-18 | 1 | -1/+1 |
| | | |||||
* | | Merge branch 'martin-guide-typos-sentance' into 'master' | bnewbold | 2020-01-19 | 1 | -2/+2 |
|\ \ | | | | | | | | | | | | fix two typos in editing guide. See merge request webgroup/fatcat!21 | ||||
| * | | fix two typos in editing guide | Martin Czygan | 2020-01-18 | 1 | -2/+2 |
| |/ | |||||
* | | basic notes in bulk edit changelog | Bryan Newbold | 2020-01-19 | 1 | -0/+7 |
| | |
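
The 2020-01-28 commits above move the `ingest_oa_only` check out of the transform function and into the entity update worker, and add a DOI prefix blocklist. The log does not show the code, so the following is only a minimal sketch of what such a filter could look like. The function name `want_ingest_request`, the `is_oa` key on the release dict, and the blocklist entries are all hypothetical placeholders, not taken from the fatcat source.

```python
from typing import Iterable

# Hypothetical placeholder prefixes; the real blocklist is not shown in this log.
DOI_PREFIX_BLOCKLIST = ("10.0000/", "10.9999/example.")


def want_ingest_request(
    release: dict,
    ingest_oa_only: bool = True,
    blocklist: Iterable[str] = DOI_PREFIX_BLOCKLIST,
) -> bool:
    """Decide whether an ingest request should be emitted for a release.

    Assumes `release` is a plain dict with optional "doi" and "is_oa" keys
    (an illustrative shape, not the actual fatcat entity schema).
    """
    # DOIs are case-insensitive, so normalize before prefix matching.
    doi = (release.get("doi") or "").lower()
    if doi and any(doi.startswith(prefix) for prefix in blocklist):
        return False
    # When OA-only mode is on, skip releases not flagged as open access.
    if ingest_oa_only and not release.get("is_oa"):
        return False
    return True
```

Keeping the filter in the worker rather than the transform, as the commit message describes, lets the transform stay a pure entity-to-request mapping while deployment-specific policy (OA-only, blocklists) lives in one place.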