Commit message | Author | Age | Files | Lines | |
---|---|---|---|---|---|
* | container search schema: preservation stats, new fields | Bryan Newbold | 2021-04-06 | 1 | -2/+18 |
| | | | | Includes transform code updates and partial test coverage. | ||||
* | release ES: add discipline field | Bryan Newbold | 2021-04-06 | 1 | -0/+2 |
| | |||||
* | ES schemas: add doc_index_ts to all mappings | Bryan Newbold | 2021-04-06 | 1 | -0/+4 |
| | |||||
* | elasticsearch: simple new dblp and doaj fields | Bryan Newbold | 2021-01-20 | 1 | -0/+4 |
| | |||||
* | bug fix: is_preserved should always be bool | Bryan Newbold | 2020-12-17 | 1 | -2/+2 |
| | |||||
* | fix indentation | Bryan Newbold | 2020-12-16 | 1 | -2/+2 |
| | |||||
* | have release elasticsearch transform count webcaptures and filesets towards preservation | Bryan Newbold | 2020-12-16 | 1 | -26/+57 |
| | | | | These are simple/partial changes to have webcaptures and filesets show up in 'preservation', 'in_ia', and 'in_web' ES schema flags. A longer-term TODO is to update the ES schema to have more granular analytic flags. Also includes a small generalization refactor for URL object parsing into preservation status, shared across file+fileset+webcapture entity types (all have similar URL objects with url+rel fields). | ||||
* | small release_to_elasticsearch refactors | Bryan Newbold | 2020-12-16 | 1 | -7/+12 |
| | | | | | | | These should have almost no change in behavior, but improve code quality. The one behavior change is counting ftp URLs as "in_web" | ||||
* | refactor release_to_elasticsearch transform | Bryan Newbold | 2020-12-16 | 1 | -131/+148 |
| | | | | This method was huge and monolithic. This commit splits out the content and container specific sections into helper functions to make it more manageable. This involved refactoring to make many flags ("is_*" and "in_*") part of the output dict through the entire code path, allowing simple update() calls on the dict. Note that in the future this should be refactored to use a type-annotated class for the elasticsearch output object, perhaps something auto-generated from the ES schema itself (JSON files). | ||||
* | if a release has DOAJ article id, count as OA | Bryan Newbold | 2020-11-19 | 1 | -0/+3 |
| | |||||
* | elastic transform: more preservation keepers | Bryan Newbold | 2020-10-08 | 1 | -1/+2 |
| | |||||
* | release ES transform tweaks | Bryan Newbold | 2020-08-07 | 1 | -3/+5 |
| | | | | Pass through publisher_type from container extra metadata (ES field already existed; this is from newer chocula metadata). Count arxiv and PMCID papers which haven't been crawled (by IA) as "dark", not "bright". | ||||
* | simplify in_kbart check statement | Bryan Newbold | 2020-07-23 | 1 | -1/+1 |
| | | | | Thanks @martin | ||||
* | make in_kbart transform inclusive of last year | Bryan Newbold | 2020-07-23 | 1 | -0/+9 |
| | | | | Frequently when looking at preservation coverage of journals, the current year shows as "un-preserved" when in fact there is robust KBART (keepers, eg CLOCKSS/Portico) coverage. This is partially because we don't update containers with KBART year spans very frequently (which is on us), and partially because KBART reports are often a bit out of date (eg, they don't show coverage for the current year). For that matter, they probably take a few months to update the previous year as well, but that is a larger time span to fudge over. With this patch, Portico/LOCKSS/etc coverage for "last year" counts as coverage of publications dated "this year". Note that for this to be effective/correct, it is assumed that we will update containers with coverage year spans at least once a year, and that we will re-index all releases at least once a year. | ||||
* | lint (flake8) tool python files | Bryan Newbold | 2020-07-01 | 1 | -7/+5 |
| | |||||
* | ES schema: add best_url to file schema | Bryan Newbold | 2020-06-04 | 1 | -0/+12 |
| | | | | | | | | | This will increase index size (URLs are often long in our corpus, and we have many file entities), but seems worth it. Initially added `ia_url` as a second field, guaranteed to always be an *.archive.org URL, but `best_url` defaults to that anyways so didn't seem worthwhile. | ||||
* | improve is_oa flag accuracy | Bryan Newbold | 2020-02-26 | 1 | -8/+4 |
| | | | | | | Particularly, the ezb=green match seems mostly incorrect. Note that pmcid being assigned could still be in an embargo window? | ||||
* | ES container last tweaks | Bryan Newbold | 2020-02-26 | 1 | -0/+3 |
| | |||||
* | ES release: last minor tweaks | Bryan Newbold | 2020-02-26 | 1 | -2/+2 |
| | |||||
* | ES files: don't remove archive.org domains/hosts | Bryan Newbold | 2020-02-07 | 1 | -5/+0 |
| | |||||
* | ES releases: host/domain fixes | Bryan Newbold | 2020-01-31 | 1 | -2/+2 |
| | |||||
* | fix release es transform missing 'issue' | Bryan Newbold | 2020-01-30 | 1 | -0/+1 |
| | |||||
* | add upper-case work-around from kibana map join | Bryan Newbold | 2020-01-30 | 1 | -0/+1 |
| | |||||
* | tweak file ES archive.org domain tracking | Bryan Newbold | 2020-01-30 | 1 | -0/+6 |
| | |||||
* | implement host+domain parsing for file ES transform | Bryan Newbold | 2020-01-30 | 1 | -9/+5 |
| | |||||
* | fix ES file schema plural field names | Bryan Newbold | 2020-01-29 | 1 | -4/+3 |
| | |||||
* | elastic schema fixes | Bryan Newbold | 2020-01-29 | 1 | -0/+5 |
| | |||||
* | add country to v03b release schema | Bryan Newbold | 2020-01-29 | 1 | -0/+2 |
| | |||||
* | actually implement changelog transform | Bryan Newbold | 2020-01-29 | 1 | -17/+45 |
| | |||||
* | fix some transform bugs, add some tests | Bryan Newbold | 2020-01-29 | 1 | -6/+8 |
| | |||||
* | ES release schema updates | Bryan Newbold | 2020-01-29 | 1 | -5/+76 |
| | |||||
* | container ES schema changes | Bryan Newbold | 2020-01-29 | 1 | -16/+18 |
| | |||||
* | first implementation of ES file schema | Bryan Newbold | 2020-01-29 | 1 | -0/+45 |
| | | | | | Includes a trivial test and transform, but not any workers or doc updates. | ||||
* | refactor all python source for client lib name | Bryan Newbold | 2019-09-05 | 1 | -1/+1 |
| | |||||
* | comment clarifying container.ident in ES release transform | Bryan Newbold | 2019-09-03 | 1 | -0/+2 |
| | |||||
* | fix previous fix (need tests) | Bryan Newbold | 2019-09-03 | 1 | -2/+2 |
| | |||||
* | fix typo bug in container ES transform | Bryan Newbold | 2019-09-03 | 1 | -2/+2 |
| | |||||
* | use EZB and szczepanski as OA signals (ES) | Bryan Newbold | 2019-09-03 | 1 | -0/+12 |
| | |||||
* | elasticsearch transform: fix url.url bug | Bryan Newbold | 2019-05-24 | 1 | -11/+11 |
| | |||||
* | add 'superceded' release extra flag to elastic schema | Bryan Newbold | 2019-05-23 | 1 | -0/+1 |
| | |||||
* | also track work_id in release elasticsearch table | Bryan Newbold | 2019-05-22 | 1 | -0/+1 |
| | |||||
* | count linked refs (not just raw refs) in elasticsearch | Bryan Newbold | 2019-05-22 | 1 | -0/+3 |
| | |||||
* | include creator_ids in release elastic schema | Bryan Newbold | 2019-05-20 | 1 | -0/+6 |
| | | | | Intent is to allow fast creator search/lookup | ||||
* | elastic release schema update | Bryan Newbold | 2019-05-20 | 1 | -2/+5 |
| | |||||
* | fix elastic file pdf check | Bryan Newbold | 2019-05-16 | 1 | -1/+3 |
| | |||||
* | elastic transforms: work around missing pdf mimetypes | Bryan Newbold | 2019-05-15 | 1 | -1/+1 |
| | |||||
* | partial python impl of ext_id and release_stage refactors | Bryan Newbold | 2019-05-13 | 1 | -10/+11 |
| | |||||
* | handle null abstracts for release | Bryan Newbold | 2019-05-07 | 1 | -1/+1 |
| | |||||
* | improve test coverage | Bryan Newbold | 2019-04-04 | 1 | -0/+1 |
| | |||||
* | refactor transforms into sub-dir | Bryan Newbold | 2019-03-11 | 1 | -0/+327 |
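
The 2020-07-23 commit "make in_kbart transform inclusive of last year" describes a small rule: keeper coverage recorded for the previous year is treated as covering releases dated in the current year. The sketch below illustrates that rule only; it is not the code from the commit. The function name `in_kbart_span`, its parameters, and the representation of coverage as a list of inclusive `[start_year, end_year]` pairs are all assumptions made for illustration.

```python
from typing import List, Sequence


def in_kbart_span(
    release_year: int,
    year_spans: List[Sequence[int]],
    this_year: int,
) -> bool:
    """Hypothetical helper: is release_year covered by any KBART year span?

    year_spans is assumed to be a list of inclusive [start, end] year pairs.
    """

    def covered(year: int) -> bool:
        return any(start <= year <= end for start, end in year_spans)

    # Direct hit: the release year falls inside a reported span.
    if covered(release_year):
        return True

    # Fudge described in the commit message: reports lag behind, so coverage
    # of "last year" counts for releases dated "this year" (and only those).
    if release_year == this_year and covered(this_year - 1):
        return True

    return False


if __name__ == "__main__":
    spans = [[2010, 2019]]
    print(in_kbart_span(2020, spans, this_year=2020))
    print(in_kbart_span(2021, spans, this_year=2021))
```

Passing `this_year` explicitly, rather than reading the clock inside the function, keeps the sketch deterministic and easy to test. As the commit message notes, the rule only holds up if container year spans are refreshed and releases re-indexed at least once a year.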