From 9d3abb010249576ddc6c86b4c7c4c5bbb6561ecb Mon Sep 17 00:00:00 2001
From: Bryan Newbold <bnewbold@robocracy.org>
Date: Tue, 21 Jan 2020 17:48:39 -0800
Subject: cleanup some of old TODO list into proposals

---
 proposals/20190911_v04_schema_tweaks.md |   2 +-
 proposals/2020_elasticsearch_schemas.md | 157 ++++++++++++++++++++++++++++++++
 proposals/2020_metadata_cleanups.md     | 109 ++++++++++++++++++++++
 3 files changed, 267 insertions(+), 1 deletion(-)
 create mode 100644 proposals/2020_elasticsearch_schemas.md
 create mode 100644 proposals/2020_metadata_cleanups.md

(limited to 'proposals')
diff --git a/proposals/20190911_v04_schema_tweaks.md b/proposals/20190911_v04_schema_tweaks.md
index 0e789ad1..916e8816 100644
--- a/proposals/20190911_v04_schema_tweaks.md
+++ b/proposals/20190911_v04_schema_tweaks.md
@@ -36,7 +36,7 @@ API endpoints:
 - `GET /editor/<ident>/bots` (?) endpoint to enumerate bots wrangled by a
   specific editor
 
-Elasticsearch schema:
+See `2020_search_improvements` for elasticsearch-only schema updates.
 
 - releases *may* need an "_all" field (or `biblio`?) containing most fields to
   make some search experiences work
diff --git a/proposals/2020_elasticsearch_schemas.md b/proposals/2020_elasticsearch_schemas.md
new file mode 100644
index 00000000..d931efd3
--- /dev/null
+++ b/proposals/2020_elasticsearch_schemas.md
@@ -0,0 +1,157 @@
+
+status: planning
+
+This document tracks "easy" elasticsearch schema and behavior changes that
+could be made while being backwards compatible with the current v0.3 schema and
+not requiring any API/database schema changes.
+
+## Release Field Additions
+
+Simple additions:
+
+- volume
+- issue
+- pages
+- `first_page` (parsed from pages) (?)
+- number
+- `in_shadow`
+- OA license slug (?)
+- `doi_prefix`
+- `doi_registrar` (based on extra)
+
+"Array" keyword types for reverse lookups:
+
+- referenced releases idents
+- contrib creator idents
+
+
+## Preservation Summary Field
+
+To make facet/aggregate queries easier, propose summarizing the preservation
+status (from `in_kbart`, `in_ia`, etc) to a `preservation_status` flag which
+is:
+
+- `bright`
+- `dark_only`
+- `shadow_only`
+- `none`
+
+Note that these don't align with OA color or work-level preservation (aka, no
+"green"), it is a release-level status.
+
+Filters like "papers only", "published only", "not stub", "single container"
+would be overlaid in queries.
+
+
+## OA Color Summary Field
+
+Might not be ready for this yet, but for both releases and containers may be
+able to do a better job of indicating OA status/policy for published works.
+
+Not clear if this should be for "published" only, or whether we should try to
+handle embargo time spans and dates.
+
+
+## Release Merged Default Field
+
+A current issue with searches is that term queries only match on a single
+field, unless alternative fields are explicitly indicated. This breaks obvious
+queries like "principa newton" which include both title terms and author terms,
+or "coffee death bmj" which include a release title and journal title.
+
+A partial solution to this is to index a field with multiple fields "copied"
+into it, and have that be the default for term queries.
+
+Fields to copy in include at least:
+
+- `title`
+- `subtitle`
+- `original_title`
+- `container_name`
+- names of all authors (maybe not acronyms?)
+
+May also want to include volume, issue, year, and any container acronyms or
+aliases. If we did that, users could paste in citations and there is a better
+chance the best match would be the exact cited paper.
+
+This should be a pretty simple change. The biggest downside will be larger (up
+to double?) index size.
+
+
+## Partial Query Parsing
+
+At some point we may want to build a proper query parser (see separate
+proposal), but in the short term there is some low-hanging fruit simple token
+parsing and re-writing we could do.
+
+- strings like `N/A` which are parse bugs; auto-quote these
+- pasting/searching for entire titles which include a word then colon ("Yummy
+  Food: A Review"). We can detect that "food" is not a valid facet, and quote
+  that single token
+- ability to do an empty search (to get total count) (?)
+
+This would require at least a simple escaped quotes tokenizer.
+
+
+## Basic Filtering
+
+This would be in the user interface, not schema.
+
+At simple google-style filtering in release searches like:
+
+- time span (last year, last 5, last 20, all)
+- fulltext availability
+- release type; stage; withdrawn
+- language
+- country/region
+
+For containers:
+
+- is OA
+- stub (zero releases)
+
+## Work Grouping
+
+Release searches can be "grouped by" work identifier in the default responses,
+to prevent the common situation where there are multiple release which are just
+different versions of the same work.
+
+Need to ensure this is performant.
+
+Would need to update query UI/UX to display another line under hits ("also XYZ
+other copies {including retraction or update} {having fulltext if this
+hit does not}").
+
+
+## Container Fields
+
+- `all_issns`
+- `release_count`
+
+The `release_count` would not be indexed (left null) by default, and would be
+"patched" in to entities by a separate script (periodically?).
+
+
+## Container Copied Fields
+
+Like releases, container entities could have a merged biblio field to use as
+default in term queries:
+
+- `name`
+- `original_name`
+- `aliases` (in extra?)
+- `publisher`
+
+Maybe also language and country names?
+
+
+## Background Reading
+
+"Which Academic Search Systems are Suitable for Systematic Reviews or
+Meta-Analyses?  Evaluating Retrieval Qualities of Google Scholar, PubMed and 26
+other Resources"
+
+https://musingsaboutlibrarianship.blogspot.com/2019/12/the-rise-of-open-discovery-indexes.html
+
+"Scholarly Search Engine Comparison"
+https://docs.google.com/spreadsheets/d/1ZiCUuKNse8dwHRFAyhFsZsl6kG0Fkgaj5gttdwdVZEM/edit#gid=1016151070
diff --git a/proposals/2020_metadata_cleanups.md b/proposals/2020_metadata_cleanups.md
new file mode 100644
index 00000000..e53c47d3
--- /dev/null
+++ b/proposals/2020_metadata_cleanups.md
@@ -0,0 +1,109 @@
+
+status: planning
+
+This proposal tracks a batch of catalog metadata cleanups planned for 2020.
+
+
+## File Hash Duplication
+
+There are at least a few dozen file entities with duplicate SHA-1.
+
+These should simply be merged via redirect. This is probably the simplest
+cleanup case, as the number of entities is low and the complexity of merging
+metadata is also low.
+
+
+## Release Identifier (DOI, PMID, PMCID, arxiv) Duplication
+
+At least a few thousand DOIs (some from Datacite import due to normalization
+behavior, some from previous Crossref issues), hundreds of thousands of PMIDs,
+and an unknown number of PMCIDs and arxiv ids have duplicate releases. This
+means, multiple releases exist with the same external identifier.
+
+The cleanup is same as with file hashes: the duplicate releases and works
+should be merged (via redirects).
+
+TODO: It is possible that works should be deleted instead of merged.
+
+
+## PDF File Metadata Completeness
+
+All PDF files should be "complete" over {SHA1, SHA256, MD5, size, mimetype},
+all of which metadata should be confirmed by calculating the values directly
+from the file.
+
+A good fraction of file entities have metadata from direct CDX imports, which
+did not include (uncompressed) size, hashes other than SHA-1, or confirmed
+mimetype. Additionally, the SHA-1 itself is not accurate for the "inner" file
+in a fraction of cases (at least thousands of files, possibly 1% or more) due
+to CDX/WARC behavior with transport compressed bodies (where the recorded SHA-1
+is of the compressed body, not the actual inner file).
+
+
+## File URL Cleanups
+
+The current file URL metadata has a few warts:
+
+- inconsistent or incorrect tagging of URL "rel" type. It is possible we should
+  just strip/skip this tag and always recompute from scratch
+- duplicate URLs (lack of normalization):
+    - `http://example.com/file.pdf`
+    - `http://example.com:80/file.pdf`
+    - `https://example.com/file.pdf`
+    - `http://www.example.com/file.pdf`
+- URLs with many and long query parameters, such as `jsessionid` or AWS token
+  parameters. These are necessary in wayback URLs (for replay), but meaningless
+  and ugly as regular URLs
+- possibly some remaining `https://web.archive.org/web/None/...` URLs, which
+  at best should be replaced with the actual capture timestamp or at least
+  deleted
+- some year-only wayback links (`https://web.archive.org/web/2016/...`)
+  basically same as `None`
+- many wayback links per file
+
+Some of these issues are partially user-interface driven. There is also a
+balance between wanting many URLs (and datetimes for wayback URLs) for
+diversity and as an archival signal, but there being diminishing returns for
+this kind of completeness.
+
+I would propose that one URL per host and the oldest wayback link per host and
+transport (treating http/https as same transport type, but ftp as distinct) is
+a reasonable constraint, but am open to other opinions. I think that all web
+URLs should be normalized for issues like `jsessionid` and `:80` port
+specification.
+
+In user interface we should limit to a single wayback link, and single link per domain.
+
+NOTE: "host" means the fully qualified domain hostname; domain means the
+"registered" part of the domain.
+
+
+## Container Metadata
+
+At some point, had many "NULL" publishers.
+
+"Type" coverage should be improved.
+
+"Publisher type" (infered in various ways in chocula tool) could be included in
+`extra` and end up in search faceting.
+
+Overall OA status should probably be more sophisticated: gold, green, etc.
+
+
+## Stub Hunting
+
+There are a lot of release entities which should probably be marked `stub` or
+in some other way indicated as unimportant or other (see also proposal to add
+new `release_types`). The main priority is to change the type of releases that
+are currently `published` and "paper-like", thus showing up in coverage stats.
+
+A partial list:
+
+- bad/weird titles
+    - "[Blank page]"
+    - "blank page"
+    - "Temporary Empty DOI 0"
+    - "ADVERTISEMENT"
+    - "Full title page with Editorial board (with Elsevier tree)"
+    - "Advisory Board Editorial Board"
+
-- 
cgit v1.2.3