path: root/python/fatcat_tools/transforms/elasticsearch.py
Commit message | Author | Age | Files | Lines
* ES release transform: handle redirected containers better | Bryan Newbold | 2021-11-24 | 1 | -1/+1
|   Despite the inline comment, we were not actually grabbing the "redirected" ident correctly, meaning some counts would not be accurate.
* content_scope: include in file ES schema and transform | Bryan Newbold | 2021-11-17 | 1 | -0/+1
|
* lint: resolve existing mypy type errors | Bryan Newbold | 2021-11-02 | 1 | -3/+6
|   Adds annotations and reworks dataflow to satisfy existing mypy issues, without adding any additional type annotations to, e.g., function signatures. There will probably be many more type errors when annotations are all added.
* fmt (black): fatcat_tools/ | Bryan Newbold | 2021-11-02 | 1 | -314/+354
|
* python: isort everything | Bryan Newbold | 2021-11-02 | 1 | -3/+8
|
* lint: simple, safe inline lint fixes | Bryan Newbold | 2021-11-02 | 1 | -5/+5
|   '==' vs 'is'; 'not a in b' vs 'a not in b'; etc
* small python tweaks for annotations, imports | Bryan Newbold | 2021-11-02 | 1 | -1/+1
|
* try some type annotations | Bryan Newbold | 2021-11-02 | 1 | -6/+6
|
* python: implement ES schema changes | Bryan Newbold | 2021-10-13 | 1 | -4/+17
|
* transforms: fix 'display_ame' typo | Bryan Newbold | 2021-04-19 | 1 | -2/+2
|
* prefer contrib.creator.display_name over contrib.raw_name | Bryan Newbold | 2021-04-12 | 1 | -1/+4
|   These will be getting updates from ORCID and are usually more complete and more correct for display, attribution, and search purposes. Might need to tweak fuzzycat code to handle multiple names at the verification stage.
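The preference described in this commit can be sketched as a small fallback. This is only an illustration, not the code from the commit: it assumes contributors are plain dicts with an optional nested "creator" dict, and the helper name `contrib_display_name` is hypothetical.

```python
def contrib_display_name(contrib):
    """Pick the name to index for one contributor (hypothetical helper).

    Assumes `contrib` is a dict with an optional "creator" dict and an
    optional "raw_name" string; the real transform may use entity objects.
    """
    creator = contrib.get("creator") or {}
    # Prefer the linked creator's display_name when it is present and non-empty,
    # otherwise fall back to the raw name recorded on the release.
    return creator.get("display_name") or contrib.get("raw_name")
```

With this shape, a contributor that has no linked creator simply keeps its raw_name.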
* ES schema updates: doc_index_ts as a str, not datetime | Bryan Newbold | 2021-04-06 | 1 | -4/+4
|   The schema field is a timestamp, but Python needs to serialize the document as JSON, and the JSON encoder does not handle datetime objects automatically.
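A minimal sketch of the issue, assuming a UTC ISO 8601 string is an acceptable value for the field; the helper name `doc_index_timestamp` and the exact string format are assumptions, not taken from the commit.

```python
import datetime
import json


def doc_index_timestamp(now=None):
    """Return a UTC timestamp as an ISO 8601 string (hypothetical helper).

    `now` should be a timezone-aware datetime; it defaults to the current time.
    """
    if now is None:
        now = datetime.datetime.now(datetime.timezone.utc)
    return now.astimezone(datetime.timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# A raw datetime value would make json.dumps() raise TypeError;
# a pre-formatted string serializes without a custom encoder.
doc = {"ident": "example", "doc_index_ts": doc_index_timestamp()}
serialized = json.dumps(doc)
```

Formatting the value up front keeps the output dict JSON-safe at every point in the transform.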
* container search schema: preservation stats, new fields | Bryan Newbold | 2021-04-06 | 1 | -2/+18
|   Includes transform code updates and partial test coverage.
* release ES: add discipline field | Bryan Newbold | 2021-04-06 | 1 | -0/+2
|
* ES schemas: add doc_index_ts to all mappings | Bryan Newbold | 2021-04-06 | 1 | -0/+4
|
* elasticsearch: simple new dblp and doaj fields | Bryan Newbold | 2021-01-20 | 1 | -0/+4
|
* bug fix: is_preserved should always be bool | Bryan Newbold | 2020-12-17 | 1 | -2/+2
|
* fix indentation | Bryan Newbold | 2020-12-16 | 1 | -2/+2
|
* have release elasticsearch transform count webcaptures and filesets towards preservation | Bryan Newbold | 2020-12-16 | 1 | -26/+57
|   These are simple/partial changes to have webcaptures and filesets show up in 'preservation', 'in_ia', and 'in_web' ES schema flags. A longer-term TODO is to update the ES schema to have more granular analytic flags.
|
|   Also includes a small generalization refactor for URL object parsing into preservation status, shared across file+fileset+webcapture entity types (all have similar URL objects with url+rel fields).
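The shared URL parsing could look roughly like the sketch below. It is an assumption-laden illustration rather than the actual helper: entities are modelled as dicts with a "urls" list of {"url", "rel"} dicts, the function names are hypothetical, and the rules (any http/https/ftp URL counts as "in_web", any archive.org host counts as "in_ia") are simplified guesses based on this log.

```python
from urllib.parse import urlparse


def url_flags(url_obj):
    """Map one URL object (dict with "url" and "rel") to preservation flags."""
    flags = {}
    parsed = urlparse(url_obj["url"])
    host = (parsed.hostname or "").lower()
    # Assumption: any http, https, or ftp URL counts as being on the web.
    if parsed.scheme in ("http", "https", "ftp"):
        flags["in_web"] = True
    # Assumption: any archive.org host counts as held by the Internet Archive.
    if host == "archive.org" or host.endswith(".archive.org"):
        flags["in_ia"] = True
    return flags


def preservation_flags(entities):
    """Merge flags across file, fileset, and webcapture entities alike."""
    flags = {"in_web": False, "in_ia": False}
    for entity in entities:
        for url_obj in entity.get("urls", []):
            # Only True values are ever returned, so update() never clears a flag.
            flags.update(url_flags(url_obj))
    return flags
```

Because every entity type exposes the same url+rel shape, one helper serves all three, and merging with update() keeps the flags in a single dict.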
* small release_to_elasticsearch refactors | Bryan Newbold | 2020-12-16 | 1 | -7/+12
|   These should have almost no change in behavior, but improve code quality. The one behavior change is counting ftp URLs as "in_web".
* refactor release_to_elasticsearch transform | Bryan Newbold | 2020-12-16 | 1 | -131/+148
|   This method was huge and monolithic. This commit splits out the content- and container-specific sections into helper functions to make it more manageable. This involved refactoring to make many flags ("is_*" and "in_*") part of the output dict through the entire code path, allowing simple update() calls on the dict.
|
|   Noting that in the future we should refactor to use a type-annotated class for the elasticsearch output object, perhaps something auto-generated from the ES schema itself (JSON files).
* if a release has DOAJ article id, count as OA | Bryan Newbold | 2020-11-19 | 1 | -0/+3
|
* elastic transform: more preservation keepers | Bryan Newbold | 2020-10-08 | 1 | -1/+2
|
* release ES transform tweaks | Bryan Newbold | 2020-08-07 | 1 | -3/+5
|   pass-through publisher_type from container extra metadata (ES field already existed; this is from newer chocula metadata)
|
|   count arxiv and PMCID papers which haven't been crawled (by IA) as "dark", not "bright"
* simplify in_kbart check statement | Bryan Newbold | 2020-07-23 | 1 | -1/+1
|   Thanks @martin
* make in_kbart transform inclusive of last year | Bryan Newbold | 2020-07-23 | 1 | -0/+9
|   Frequently when looking at preservation coverage of journals, the current year shows as "un-preserved" when in fact there is robust KBART (keepers, e.g. CLOCKSS/Portico) coverage. This is partially because we don't update containers with KBART year spans very frequently (which is on us), and partially because KBART reports are often a bit out of date (e.g., they don't show coverage for the current year). For that matter, they probably take a few months to update the previous year as well, but that is a larger time span to fudge over.
|
|   This patch means that Portico/LOCKSS/etc coverage for "last year" will count as coverage of publications dated "this year".
|
|   Note that for this to be effective/correct, it is assumed that we will update containers with coverage year spans at least once a year, and that we will re-index all releases at least once a year.
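The "last year counts for this year" rule can be sketched as follows. This is a minimal illustration under stated assumptions: year spans are modelled as inclusive (start, end) pairs, and both function names are hypothetical rather than taken from the patch.

```python
def year_in_spans(year, spans):
    """True if `year` falls inside any inclusive (start, end) year span."""
    return any(start <= year <= end for start, end in spans)


def in_kbart(release_year, spans, this_year):
    """Check keeper coverage, treating last year's coverage as current.

    Assumption: `spans` is a list of (start_year, end_year) tuples for one
    container, and `this_year` is passed in explicitly to keep this testable.
    """
    if year_in_spans(release_year, spans):
        return True
    # Reports lag behind, so for releases dated this year accept
    # coverage that extends through last year.
    if release_year == this_year:
        return year_in_spans(this_year - 1, spans)
    return False
```

For example, with spans of [(1995, 2019)] and this_year set to 2020, a release dated 2020 counts as covered, while spans ending in 2018 would not cover it.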
* lint (flake8) tool python files | Bryan Newbold | 2020-07-01 | 1 | -7/+5
|
* ES schema: add best_url to file schema | Bryan Newbold | 2020-06-04 | 1 | -0/+12
|   This will increase index size (URLs are often long in our corpus, and we have many file entities), but seems worth it.
|
|   Initially added `ia_url` as a second field, guaranteed to always be an *.archive.org URL, but `best_url` defaults to that anyways so didn't seem worthwhile.
* improve is_oa flag accuracy | Bryan Newbold | 2020-02-26 | 1 | -8/+4
|   Particularly, the ezb=green match seems mostly incorrect. Note that pmcid being assigned could still be in an embargo window?
* ES container last tweaks | Bryan Newbold | 2020-02-26 | 1 | -0/+3
|
* ES release: last minor tweaks | Bryan Newbold | 2020-02-26 | 1 | -2/+2
|
* ES files: don't remove archive.org domains/hosts | Bryan Newbold | 2020-02-07 | 1 | -5/+0
|
* ES releases: host/domain fixes | Bryan Newbold | 2020-01-31 | 1 | -2/+2
|
* fix release es transform missing 'issue' | Bryan Newbold | 2020-01-30 | 1 | -0/+1
|
* add upper-case work-around from kibana map join | Bryan Newbold | 2020-01-30 | 1 | -0/+1
|
* tweak file ES archive.org domain tracking | Bryan Newbold | 2020-01-30 | 1 | -0/+6
|
* implement host+domain parsing for file ES transform | Bryan Newbold | 2020-01-30 | 1 | -9/+5
|
* fix ES file schema plural field names | Bryan Newbold | 2020-01-29 | 1 | -4/+3
|
* elastic schema fixes | Bryan Newbold | 2020-01-29 | 1 | -0/+5
|
* add country to v03b release schema | Bryan Newbold | 2020-01-29 | 1 | -0/+2
|
* actually implement changelog transform | Bryan Newbold | 2020-01-29 | 1 | -17/+45
|
* fix some transform bugs, add some tests | Bryan Newbold | 2020-01-29 | 1 | -6/+8
|
* ES release schema updates | Bryan Newbold | 2020-01-29 | 1 | -5/+76
|
* container ES schema changes | Bryan Newbold | 2020-01-29 | 1 | -16/+18
|
* first implementation of ES file schema | Bryan Newbold | 2020-01-29 | 1 | -0/+45
|   Includes a trivial test and transform, but not any workers or doc updates.
* refactor all python source for client lib name | Bryan Newbold | 2019-09-05 | 1 | -1/+1
|
* comment clarifying container.ident in ES release transform | Bryan Newbold | 2019-09-03 | 1 | -0/+2
|
* fix previous fix (need tests) | Bryan Newbold | 2019-09-03 | 1 | -2/+2
|
* fix typo bug in container ES transform | Bryan Newbold | 2019-09-03 | 1 | -2/+2
|
* use EZB and szczepanski as OA signals (ES) | Bryan Newbold | 2019-09-03 | 1 | -0/+12
|