Commit message | Author | Age | Files | Lines | |
---|---|---|---|---|---|
* | release ES transform tweaks | Bryan Newbold | 2020-08-07 | 1 | -3/+5 |
| | Pass through publisher_type from container extra metadata (ES field already existed; this is from newer chocula metadata). Count arxiv and PMCID papers which haven't been crawled (by IA) as "dark", not "bright". | ||||
* | simplify in_kbart check statement | Bryan Newbold | 2020-07-23 | 1 | -1/+1 |
| | Thanks @martin | ||||
* | make in_kbart transform inclusive of last year | Bryan Newbold | 2020-07-23 | 1 | -0/+9 |
| | Frequently when looking at preservation coverage of journals, the current year shows as "un-preserved" when in fact there is robust KBART (keepers, eg CLOCKSS/Portico) coverage. This is partially because we don't update containers with KBART year spans very frequently (which is on us), and partially because KBART reports are often a bit out of date (eg, they don't show coverage for the current year). For that matter, they probably take a few months to update the previous year as well, but that is a larger time span to fudge over. This patch means Portico/LOCKSS/etc coverage for "last year" will also count as coverage of publications dated "this year". Note that for this to be effective/correct, it is assumed that we will update containers with coverage year spans at least once a year, and that we will re-index all releases at least once a year. | ||||
* | lint (flake8) tool python files | Bryan Newbold | 2020-07-01 | 1 | -7/+5 |
| | |||||
* | ES schema: add best_url to file schema | Bryan Newbold | 2020-06-04 | 1 | -0/+12 |
| | This will increase index size (URLs are often long in our corpus, and we have many file entities), but seems worth it. Initially added `ia_url` as a second field, guaranteed to always be an *.archive.org URL, but `best_url` defaults to that anyways so didn't seem worthwhile. | ||||
* | improve is_oa flag accuracy | Bryan Newbold | 2020-02-26 | 1 | -8/+4 |
| | Particularly, the ezb=green match seems mostly incorrect. Note that a release with a pmcid assigned could still be in an embargo window? | ||||
* | ES container last tweaks | Bryan Newbold | 2020-02-26 | 1 | -0/+3 |
| | |||||
* | ES release: last minor tweaks | Bryan Newbold | 2020-02-26 | 1 | -2/+2 |
| | |||||
* | ES files: don't remove archive.org domains/hosts | Bryan Newbold | 2020-02-07 | 1 | -5/+0 |
| | |||||
* | ES releases: host/domain fixes | Bryan Newbold | 2020-01-31 | 1 | -2/+2 |
| | |||||
* | fix release es transform missing 'issue' | Bryan Newbold | 2020-01-30 | 1 | -0/+1 |
| | |||||
* | add upper-case work-around from kibana map join | Bryan Newbold | 2020-01-30 | 1 | -0/+1 |
| | |||||
* | tweak file ES archive.org domain tracking | Bryan Newbold | 2020-01-30 | 1 | -0/+6 |
| | |||||
* | implement host+domain parsing for file ES transform | Bryan Newbold | 2020-01-30 | 1 | -9/+5 |
| | |||||
* | fix ES file schema plural field names | Bryan Newbold | 2020-01-29 | 1 | -4/+3 |
| | |||||
* | elastic schema fixes | Bryan Newbold | 2020-01-29 | 1 | -0/+5 |
| | |||||
* | add country to v03b release schema | Bryan Newbold | 2020-01-29 | 1 | -0/+2 |
| | |||||
* | actually implement changelog transform | Bryan Newbold | 2020-01-29 | 1 | -17/+45 |
| | |||||
* | fix some transform bugs, add some tests | Bryan Newbold | 2020-01-29 | 1 | -6/+8 |
| | |||||
* | ES release schema updates | Bryan Newbold | 2020-01-29 | 1 | -5/+76 |
| | |||||
* | container ES schema changes | Bryan Newbold | 2020-01-29 | 1 | -16/+18 |
| | |||||
* | first implementation of ES file schema | Bryan Newbold | 2020-01-29 | 1 | -0/+45 |
| | Includes a trivial test and transform, but not any workers or doc updates. | ||||
* | refactor all python source for client lib name | Bryan Newbold | 2019-09-05 | 1 | -1/+1 |
| | |||||
* | comment clarifying container.ident in ES release transform | Bryan Newbold | 2019-09-03 | 1 | -0/+2 |
| | |||||
* | fix previous fix (need tests) | Bryan Newbold | 2019-09-03 | 1 | -2/+2 |
| | |||||
* | fix typo bug in container ES transform | Bryan Newbold | 2019-09-03 | 1 | -2/+2 |
| | |||||
* | use EZB and szczepanski as OA signals (ES) | Bryan Newbold | 2019-09-03 | 1 | -0/+12 |
| | |||||
* | elasticsearch transform: fix url.url bug | Bryan Newbold | 2019-05-24 | 1 | -11/+11 |
| | |||||
* | add 'superceded' release extra flag to elastic schema | Bryan Newbold | 2019-05-23 | 1 | -0/+1 |
| | |||||
* | also track work_id in release elasticsearch table | Bryan Newbold | 2019-05-22 | 1 | -0/+1 |
| | |||||
* | count linked refs (not just raw refs) in elasticsearch | Bryan Newbold | 2019-05-22 | 1 | -0/+3 |
| | |||||
* | include creator_ids in release elastic schema | Bryan Newbold | 2019-05-20 | 1 | -0/+6 |
| | Intent is to allow fast creator search/lookup | ||||
* | elastic release schema update | Bryan Newbold | 2019-05-20 | 1 | -2/+5 |
| | |||||
* | fix elastic file pdf check | Bryan Newbold | 2019-05-16 | 1 | -1/+3 |
| | |||||
* | elastic transforms: work around missing pdf mimetypes | Bryan Newbold | 2019-05-15 | 1 | -1/+1 |
| | |||||
* | partial python impl of ext_id and release_stage refactors | Bryan Newbold | 2019-05-13 | 1 | -10/+11 |
| | |||||
* | handle null abstracts for release | Bryan Newbold | 2019-05-07 | 1 | -1/+1 |
| | |||||
* | improve test coverage | Bryan Newbold | 2019-04-04 | 1 | -0/+1 |
| | |||||
* | refactor transforms into sub-dir | Bryan Newbold | 2019-03-11 | 1 | -0/+327 |
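
The 2020-07-23 "make in_kbart transform inclusive of last year" entry describes a small rule that is easier to follow in code. The following is only a rough sketch of that rule, not the actual patch: the function name `in_kbart_span`, the `(start_year, end_year)` tuple format for coverage spans, and the `current_year` parameter are all assumptions made for illustration.

```python
import datetime
from typing import Iterable, Optional, Tuple


def in_kbart_span(
    release_year: int,
    year_spans: Iterable[Tuple[int, int]],
    current_year: Optional[int] = None,
) -> bool:
    """Return True if release_year is covered by any KBART year span.

    Hypothetical helper: spans are assumed to be inclusive
    (start_year, end_year) pairs taken from container metadata.
    """
    if current_year is None:
        current_year = datetime.date.today().year
    for start, end in year_spans:
        # Ordinary case: the release year falls inside the reported span.
        if start <= release_year <= end:
            return True
        # Fudge described in the commit message: coverage that runs through
        # "last year" is counted as covering releases dated "this year".
        if release_year == current_year and end == current_year - 1:
            return True
    return False
```

Under these assumptions, a release dated 2020 with a reported span of 2010 to 2019 counts as covered when the check runs in 2020, while a span ending in 2018 does not. As the commit message notes, this only stays correct if container year spans are refreshed and releases are re-indexed at least once a year.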