Commit message | Author | Age | Files | Lines | |
---|---|---|---|---|---|
* | ES schemas: add doc_index_ts to all mappings | Bryan Newbold | 2021-04-06 | 1 | -0/+4 |
| | |||||
* | elasticsearch: simple new dblp and doaj fields | Bryan Newbold | 2021-01-20 | 1 | -0/+4 |
| | |||||
* | bug fix: is_preserved should always be bool | Bryan Newbold | 2020-12-17 | 1 | -2/+2 |
| | |||||
* | fix indentation | Bryan Newbold | 2020-12-16 | 1 | -2/+2 |
| | |||||
* | have release elasticsearch transform count webcaptures and filesets towards preservation | Bryan Newbold | 2020-12-16 | 1 | -26/+57 |
| | | | | | | | | | | | | | These are simple/partial changes to have webcaptures and filesets show up in 'preservation', 'in_ia', and 'in_web' ES schema flags. A longer-term TODO is to update the ES schema to have more granular analytic flags. Also includes a small generalization refactor for URL object parsing into preservation status, shared across file+fileset+webcapture entity types (all have similar URL objects with url+rel fields). | ||||
* | small release_to_elasticsearch refactors | Bryan Newbold | 2020-12-16 | 1 | -7/+12 |
| | | | | | | | These should have almost no change in behavior, but improve code quality. The one behavior change is counting ftp URLs as "in_web" | ||||
* | refactor release_to_elasticsearch transform | Bryan Newbold | 2020-12-16 | 1 | -131/+148 |
| | | | | | | | | | | | | This method was huge and monolithic. This commit splits out the content- and container-specific sections into helper functions to make it more manageable. This involved refactoring to make many flags ("is_*" and "in_*") part of the output dict through the entire code path, allowing simple update() calls on the dict. Note that in the future we should refactor to use a type-annotated class for the elasticsearch output object, perhaps something auto-generated from the ES schema itself (JSON files). | ||||
* | if a release has DOAJ article id, count as OA | Bryan Newbold | 2020-11-19 | 1 | -0/+3 |
| | |||||
* | ingest tool: support for setting ingest type | Bryan Newbold | 2020-11-06 | 1 | -6/+6 |
| | |||||
* | elastic transform: more preservation keepers | Bryan Newbold | 2020-10-08 | 1 | -1/+2 |
| | |||||
* | release ES transform tweaks | Bryan Newbold | 2020-08-07 | 1 | -3/+5 |
| | | | | | | | | pass through publisher_type from container extra metadata (ES field already existed; this is from newer chocula metadata); count arxiv and PMCID papers which haven't been crawled (by IA) as "dark", not "bright" | ||||
* | basic toml transform helper | Bryan Newbold | 2020-07-30 | 2 | -4/+20 |
| | |||||
* | simplify in_kbart check statement | Bryan Newbold | 2020-07-23 | 1 | -1/+1 |
| | | | | Thanks @martin | ||||
* | make in_kbart transform inclusive of last year | Bryan Newbold | 2020-07-23 | 1 | -0/+9 |
| | | | | | | | | | | | | | | | | | Frequently when looking at preservation coverage of journals, the current year shows as "un-preserved" when in fact there is robust KBART (keepers, eg CLOCKSS/Portico) coverage. This is partially because we don't update containers with KBART year spans very frequently (which is on us), and partially because KBART reports are often a bit out of date (eg, they don't show coverage for the current year). For that matter, they probably take a few months to update the previous year as well, but that is a larger time span to fudge over. With this patch, Portico/LOCKSS/etc coverage for "last year" counts as coverage of publications dated "this year". Note that for this to be effective/correct, it is assumed that we will update containers with coverage year spans at least once a year, and that we will re-index all releases at least once a year. | ||||
* | lint (flake8) tool python files | Bryan Newbold | 2020-07-01 | 4 | -18/+10 |
| | |||||
* | ES schema: add best_url to file schema | Bryan Newbold | 2020-06-04 | 1 | -0/+12 |
| | | | | | | | | | This will increase index size (URLs are often long in our corpus, and we have many file entities), but seems worth it. Initially added `ia_url` as a second field, guaranteed to always be an *.archive.org URL, but `best_url` defaults to that anyways so didn't seem worthwhile. | ||||
* | improve citeproc/CSL web interface | Bryan Newbold | 2020-03-25 | 1 | -6/+12 |
| | | | | | | | | | | | | | | This tries to show the citeproc (bibtex, MLA, CSL-JSON) options for more releases, and not show the links when they would break. The primary motivation here is to work around two exceptions being thrown in prod every day (according to sentry): `KeyError: 'role'` and `ValueError: CLS requries some surname (family name)`. I'm guessing these are mostly coming from crawlers following the citeproc links on release landing pages. | ||||
* | Merge branch 'bnewbold-elastic-v03b' | Bryan Newbold | 2020-02-26 | 2 | -46/+198 |
|\ | |||||
| * | improve is_oa flag accuracy | Bryan Newbold | 2020-02-26 | 1 | -8/+4 |
| | | | | | | | | | | | | Particularly, the ezb=green match seems mostly incorrect. Note that a release with a pmcid assigned could still be in an embargo window? | ||||
| * | ES container last tweaks | Bryan Newbold | 2020-02-26 | 1 | -0/+3 |
| | | |||||
| * | ES release: last minor tweaks | Bryan Newbold | 2020-02-26 | 1 | -2/+2 |
| | | |||||
| * | ES files: don't remove archive.org domains/hosts | Bryan Newbold | 2020-02-07 | 1 | -5/+0 |
| | | |||||
| * | ES releases: host/domain fixes | Bryan Newbold | 2020-01-31 | 1 | -2/+2 |
| | | |||||
| * | fix release es transform missing 'issue' | Bryan Newbold | 2020-01-30 | 1 | -0/+1 |
| | | |||||
| * | add upper-case work-around from kibana map join | Bryan Newbold | 2020-01-30 | 1 | -0/+1 |
| | | |||||
| * | tweak file ES archive.org domain tracking | Bryan Newbold | 2020-01-30 | 1 | -0/+6 |
| | | |||||
| * | implement host+domain parsing for file ES transform | Bryan Newbold | 2020-01-30 | 1 | -9/+5 |
| | | |||||
| * | fix ES file schema plural field names | Bryan Newbold | 2020-01-29 | 1 | -4/+3 |
| | | |||||
| * | elastic schema fixes | Bryan Newbold | 2020-01-29 | 1 | -0/+5 |
| | | |||||
| * | add country to v03b release schema | Bryan Newbold | 2020-01-29 | 1 | -0/+2 |
| | | |||||
| * | actually implement changelog transform | Bryan Newbold | 2020-01-29 | 1 | -17/+45 |
| | | |||||
| * | fix some transform bugs, add some tests | Bryan Newbold | 2020-01-29 | 1 | -6/+8 |
| | | |||||
| * | ES release schema updates | Bryan Newbold | 2020-01-29 | 1 | -5/+76 |
| | | |||||
| * | container ES schema changes | Bryan Newbold | 2020-01-29 | 1 | -16/+18 |
| | | |||||
| * | first implementation of ES file schema | Bryan Newbold | 2020-01-29 | 2 | -1/+46 |
| | | | | | | | | | | Includes a trivial test and transform, but not any workers or doc updates. | ||||
* | | default to PMC ingest URLs over DOI | Bryan Newbold | 2020-02-04 | 1 | -4/+4 |
|/ | | | | | | For cases where there might be both PMC and DOI URLs, prefer the europmc.org PMC ones over the DOI option. May want to turn this into a config or command-line option in the future. | ||||
* | remove 'oa_only' feature from ingest transform | Bryan Newbold | 2020-01-28 | 1 | -14/+1 |
| | | | | Refactoring to move this filter elsewhere | ||||
* | transform ingests via pmc/pmcid, not pubmed/pmid | Bryan Newbold | 2019-12-24 | 1 | -4/+4 |
| | |||||
* | update ingest request schema | Bryan Newbold | 2019-12-13 | 1 | -5/+22 |
| | | | | | This is mostly changing ingest_type from 'file' to 'pdf', and adding 'link_source'/'link_source_id', plus some small cleanups. | ||||
* | tweaks to ingest-file transform | Bryan Newbold | 2019-12-12 | 1 | -13/+7 |
| | |||||
* | project -> ingest_request_source | Bryan Newbold | 2019-11-15 | 1 | -2/+2 |
| | |||||
* | fix release.pmcid typo | Bryan Newbold | 2019-11-15 | 1 | -2/+2 |
| | |||||
* | more ingest importer comments and counts | Bryan Newbold | 2019-11-15 | 1 | -1/+1 |
| | |||||
* | add ingest request transform (and test) | Bryan Newbold | 2019-11-15 | 2 | -0/+67 |
| | |||||
* | dict wrapper for entity_from_json() | Bryan Newbold | 2019-10-08 | 2 | -3/+7 |
| | |||||
* | refactor all python source for client lib name | Bryan Newbold | 2019-09-05 | 3 | -3/+3 |
| | |||||
* | comment clarifying container.ident in ES release transform | Bryan Newbold | 2019-09-03 | 1 | -0/+2 |
| | |||||
* | fix previous fix (need tests) | Bryan Newbold | 2019-09-03 | 1 | -2/+2 |
| | |||||
* | fix typo bug in container ES transform | Bryan Newbold | 2019-09-03 | 1 | -2/+2 |
| | |||||
* | use EZB and szczepanski as OA signals (ES) | Bryan Newbold | 2019-09-03 | 1 | -0/+12 |
| |
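
The "make in_kbart transform inclusive of last year" commit above describes counting keeper coverage for "last year" as coverage of publications dated "this year". A minimal sketch of that rule follows. It is not the repository's code: the function name `in_kbart_span`, its parameters, and the `[start, end]` year-span representation are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence


def in_kbart_span(
    release_year: Optional[int],
    year_spans: List[Sequence[int]],
    current_year: int,
) -> bool:
    """Hypothetical helper: is a release year covered by any keeper year span?

    Each span is assumed to be an inclusive [start, end] pair of years.
    Coverage that ends "last year" is also counted for releases dated
    "this year", since KBART reports tend to lag behind.
    """
    if release_year is None:
        return False
    for start, end in year_spans:
        # Ordinary case: the release year falls inside the span.
        if start <= release_year <= end:
            return True
        # Fudge: a span ending last year also covers a release dated this year.
        if release_year == current_year and start <= end == current_year - 1:
            return True
    return False
```

For example, with spans `[[2010, 2019]]` and `current_year=2020`, a release dated 2020 counts as covered, while a release dated 2021 evaluated with `current_year=2021` does not. As the commit notes, the fudge is only sound if container year spans are refreshed and releases re-indexed at least once a year.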