cleanup some of old TODO list into proposals

author: Bryan Newbold <bnewbold@robocracy.org> 2020-01-21 17:48:39 -0800
committer: Bryan Newbold <bnewbold@robocracy.org> 2020-01-21 17:48:39 -0800
commit: 9d3abb010249576ddc6c86b4c7c4c5bbb6561ecb (patch)
tree: 500e3e593ae61fd3c22831c35cfe0a55741759fb
parent: 2fcc59388a4eb53a7e2370275366272459874e99 (diff)
4 files changed, 269 insertions, 44 deletions
diff --git a/TODO.md b/TODO.md
index 0c766204..9538e7ed 100644
--- a/TODO.md
+++ b/TODO.md
@@ -4,21 +4,9 @@
 
 ## Next Up
 
-- more/better identifier normalization in normalize.py
-    => then use this code in importers
-- update existing 1.5 mil longtail OA PDFs with container/ISSN-L
-- use collapsing fields in default release search
-    => start using elasticsearch-py
 
 ## Bugs
 
-- identifier and hash duplication
-    => couple dozen SHA-1
-    => couple thousand DOI
-    => 400k PMID (!)
-- did, somehow, end up with web.archive.org/web/None/ URLs (should remove)
-- searching 'N/A' is a bug, because not quoted; auto-quote it?
-- author (contrib) names not getting included in search (unless explicit)
 
 ## Next Full Release "Touch"
 
@@ -27,7 +15,7 @@ Want to minimize edit counts, so will bundle a bunch of changes
 
 - structured contrib names (given, sur)
 - reference linking (release-to-release), via crossref DOI refs
-- subtitle as string, not array
+- subtitle as field; remove from extra
 - remove crossref alt ids that are just the DOI (?)
 
 ## Production Public Launch Blockers
@@ -44,9 +32,9 @@ Want to minimize edit counts, so will bundle a bunch of changes
 
 ## Unsorted
 
+- broader use of external identifier normalizer functions
 - "delete entity" and "merge entity" webface flows
 - update editor, editgroup, changelog views?
-- ability to "edit edits" (update in-progress edits)
 - review bots:
     - tests
     - not actually processing work entities
@@ -79,12 +67,10 @@ Want to minimize edit counts, so will bundle a bunch of changes
         should `release_year` be of date type, instead of int?
     files: domain list; mimetype; release count; url count; web/publisher/etc;
         size; has_md5/sha256/sha1; in_ia, in_shadow
-- should elastic `release_year` be of date type, instead of int?
 - webface: still need to collapse links by domain better, and also vs. www.x/x
 - entity edit JSON objects could include `entity_type`
 - refactor 'fatcatd' to 'fatcat-api'
 - changelog elastic stuff (is there even a fatcat-export for this?)
-- container count "enrich"
 - 'hide' flag for exporter (eg, to skip abstracts and refs in some release dumps)
 - https://tech.labs.oliverwyman.com/blog/2019/01/14/serialising-rust-tests/
 - changelog elastic index (for stats)
@@ -121,20 +107,12 @@ Want to minimize edit counts, so will bundle a bunch of changes
 - crossref: many ISBNs not getting copied; use python library to convert?
 - remove 'first' from contrib crossref 'seq' (not helpful?)
 - should probably check for 'jats:' in abstract before setting mimetype, even from crossref
-- web.archive.org response not SHA1 match? => need /<dt>id_/ thing
 - XML etc in metadata
     => (python) tests for these!
     https://qa.fatcat.wiki/release/search?q=xmlns
     https://qa.fatcat.wiki/release/search?q=%24gt
-- bad/weird titles
-    "[Blank page]", "blank page"
-    "Temporary Empty DOI 0"
-    "ADVERTISEMENT"
-    "Full title page with Editorial board (with Elsevier tree)"
-    "Advisory Board Editorial Board"
 - better/complete reltypes probably good (eg, list of IRs, academic domain)
 - include crossref-capitalized DOI in extra
-- manifest: multiple URLs per SHA1
 - crossref: relations ("is-preprint-of")
 - crossref: two phase: no citations, then matched citations (via DOI table)
 - special "alias" DOIs... in crossref metadata?
@@ -148,21 +126,6 @@ new importers:
 
 - more+better terms+policies: https://tosdr.org/index.html
 
-## Fun Features
-
-- "save paper now"
-    => is it in GWB? if not, SPN
-    => get hash + url from GWB, verify mimetype acceptable
-    => is file in fatcat?
-    => what about HBase? GROBID?
-    => create edit, redirect user to editgroup submit page
-- python client tool and library in pypi
-    => or maybe rust?
-
-## Metadata Harvesting
-
-- datacite ingest seems to have failed... got a non-HTTP-200 status code, but also "got 50 (161950 of 21084)"
-
 ## Schema / Entity Fields
 
 - file+fileset "first seen" datetime
@@ -171,8 +134,6 @@ new importers:
 - `translation_of` field on releases (or similar/general). `retraction_of` to a
   specific release? `alias_of`/`duplicate_of`
 - 'part-of' relation for releases (release to release, eg for book chapters) and possibly containers
-- `container_type` for containers (journal, conference, book series, etc)
-    => in schema, needs vocabulary and implementation
 
 ## API Schema / Design
 
@@ -185,8 +146,6 @@ new importers:
 
 ## Other / Backburner
 
-- file entity full update with all hashes, file size, corrected/expanded wayback links
-    => some number of files *did* get inserted to fatcat with short (year) datetimes, from old manifest. also no file size.
 - regression test imports for missing orcid display and journal metadata name
 - try out beautifulsoup? (https://stackoverflow.com/a/34532382/4682349)
 - `doi` field for containers (at least for "journal" type; maybe for "series" as well?)
diff --git a/proposals/20190911_v04_schema_tweaks.md b/proposals/20190911_v04_schema_tweaks.md
index 0e789ad1..916e8816 100644
--- a/proposals/20190911_v04_schema_tweaks.md
+++ b/proposals/20190911_v04_schema_tweaks.md
@@ -36,7 +36,7 @@ API endpoints:
 - `GET /editor/<ident>/bots` (?) endpoint to enumerate bots wrangled by a
   specific editor
 
-Elasticsearch schema:
+See `2020_elasticsearch_schemas` for elasticsearch-only schema updates.
 
 - releases *may* need an "_all" field (or `biblio`?) containing most fields to
   make some search experiences work
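
The `biblio` idea above could be sketched with Elasticsearch's `copy_to` mapping parameter, which indexes the content of several source fields into one extra field. This is only an illustration, not the fatcat mapping: the source field list follows the "Release Merged Default Field" section of the new proposal, `contrib_names` is an assumed field name, and the typeless `{"mappings": {"properties": ...}}` layout assumes Elasticsearch 7 or later.

```python
# Hypothetical source fields to merge; `contrib_names` is an assumed name.
COPIED_FIELDS = [
    "title",
    "subtitle",
    "original_title",
    "container_name",
    "contrib_names",
]


def build_release_mapping(copied_fields=COPIED_FIELDS, target="biblio"):
    """Return a mapping body where each source field is copied into `target`."""
    properties = {
        name: {"type": "text", "copy_to": target} for name in copied_fields
    }
    # The merged field itself is an ordinary analyzed text field.
    properties[target] = {"type": "text"}
    return {"mappings": {"properties": properties}}


def build_default_query(user_query, target="biblio"):
    """Return a `query_string` search body that defaults to the merged field."""
    return {
        "query": {
            "query_string": {
                "query": user_query,
                "default_field": target,
                "default_operator": "AND",
            }
        }
    }
```

With a mapping like this, a query such as `coffee death bmj` would be matched against title and container name terms together, at the cost of the larger index size the proposal mentions.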
diff --git a/proposals/2020_elasticsearch_schemas.md b/proposals/2020_elasticsearch_schemas.md
new file mode 100644
index 00000000..d931efd3
--- /dev/null
+++ b/proposals/2020_elasticsearch_schemas.md
@@ -0,0 +1,157 @@
+
+status: planning
+
+This document tracks "easy" elasticsearch schema and behavior changes that
+could be made while being backwards compatible with the current v0.3 schema and
+not requiring any API/database schema changes.
+
+## Release Field Additions
+
+Simple additions:
+
+- volume
+- issue
+- pages
+- `first_page` (parsed from pages) (?)
+- number
+- `in_shadow`
+- OA license slug (?)
+- `doi_prefix`
+- `doi_registrar` (based on extra)
+
+"Array" keyword types for reverse lookups:
+
+- referenced releases idents
+- contrib creator idents
+
+
+## Preservation Summary Field
+
+To make facet/aggregate queries easier, propose summarizing the preservation
+status (from `in_kbart`, `in_ia`, etc) to a `preservation_status` flag which
+is:
+
+- `bright`
+- `dark_only`
+- `shadow_only`
+- `none`
+
+Note that these don't align with OA color or work-level preservation (i.e., no
+"green"); this is a release-level status.
+
+Filters like "papers only", "published only", "not stub", "single container"
+would be overlaid in queries.
+
+
+## OA Color Summary Field
+
+Might not be ready for this yet, but for both releases and containers may be
+able to do a better job of indicating OA status/policy for published works.
+
+Not clear if this should be for "published" only, or whether we should try to
+handle embargo time spans and dates.
+
+
+## Release Merged Default Field
+
+A current issue with searches is that term queries only match on a single
+field, unless alternative fields are explicitly indicated. This breaks obvious
+queries like "principia newton" which include both title terms and author terms,
+or "coffee death bmj" which include a release title and journal title.
+
+A partial solution to this is to index a field with multiple fields "copied"
+into it, and have that be the default for term queries.
+
+Fields to copy in include at least:
+
+- `title`
+- `subtitle`
+- `original_title`
+- `container_name`
+- names of all authors (maybe not acronyms?)
+
+May also want to include volume, issue, year, and any container acronyms or
+aliases. If we did that, users could paste in citations and there is a better
+chance the best match would be the exact cited paper.
+
+This should be a pretty simple change. The biggest downside will be larger (up
+to double?) index size.
+
+
+## Partial Query Parsing
+
+At some point we may want to build a proper query parser (see separate
+proposal), but in the short term there is some low-hanging fruit: simple token
+parsing and rewriting.
+
+- strings like `N/A` which are parse bugs; auto-quote these
+- pasting/searching for entire titles which include a word then colon ("Yummy
+  Food: A Review"). We can detect that "food" is not a valid facet, and quote
+  that single token
+- ability to do an empty search (to get total count) (?)
+
+This would require at least a simple escaped quotes tokenizer.
+
+
+## Basic Filtering
+
+This would be in the user interface, not schema.
+
+Add simple Google-style filtering in release searches, like:
+
+- time span (last year, last 5, last 20, all)
+- fulltext availability
+- release type; stage; withdrawn
+- language
+- country/region
+
+For containers:
+
+- is OA
+- stub (zero releases)
+
+## Work Grouping
+
+Release searches can be "grouped by" work identifier in the default responses,
+to prevent the common situation where there are multiple releases which are just
+different versions of the same work.
+
+Need to ensure this is performant.
+
+Would need to update query UI/UX to display another line under hits ("also XYZ
+other copies {including retraction or update} {having fulltext if this
+hit does not}").
+
+
+## Container Fields
+
+- `all_issns`
+- `release_count`
+
+The `release_count` would not be indexed (left null) by default, and would be
+"patched" in to entities by a separate script (periodically?).
+
+
+## Container Copied Fields
+
+Like releases, container entities could have a merged biblio field to use as
+default in term queries:
+
+- `name`
+- `original_name`
+- `aliases` (in extra?)
+- `publisher`
+
+Maybe also language and country names?
+
+
+## Background Reading
+
+"Which Academic Search Systems are Suitable for Systematic Reviews or
+Meta-Analyses?  Evaluating Retrieval Qualities of Google Scholar, PubMed and 26
+other Resources"
+
+https://musingsaboutlibrarianship.blogspot.com/2019/12/the-rise-of-open-discovery-indexes.html
+
+"Scholarly Search Engine Comparison"
+https://docs.google.com/spreadsheets/d/1ZiCUuKNse8dwHRFAyhFsZsl6kG0Fkgaj5gttdwdVZEM/edit#gid=1016151070
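
The token rewriting described under "Partial Query Parsing" in the proposal above might look roughly like the following. It is a minimal sketch under stated assumptions: `KNOWN_FACETS` is a hypothetical list of valid `field:value` prefixes, and the quoting rules (quote any bare token containing `/`, quote any `word:` token whose prefix is not a known facet) are one reading of the bullet points, not an existing fatcat function.

```python
import re

# Hypothetical set of field names accepted as `field:value` facets.
KNOWN_FACETS = {"title", "container_name", "doi", "release_year", "release_type"}

# A token is a run of non-space characters and/or double-quoted segments
# (backslash escapes are allowed inside the quotes).
TOKEN_RE = re.compile(r'(?:"(?:\\.|[^"\\])*"|[^\s"])+')


def quote(text):
    """Wrap text in double quotes, escaping backslashes and embedded quotes."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def rewrite_query(raw):
    """Quote tokens that would otherwise be mis-parsed by the query syntax."""
    out = []
    for tok in TOKEN_RE.findall(raw):
        if len(tok) >= 2 and tok.startswith('"') and tok.endswith('"'):
            out.append(tok)  # already a quoted phrase
            continue
        field, sep, rest = tok.partition(":")
        if sep and field in KNOWN_FACETS:
            # Real facet: keep the prefix, quote only a value containing "/".
            if "/" in rest and not rest.startswith('"'):
                tok = field + ":" + quote(rest)
            out.append(tok)
        elif "/" in tok or sep:
            # Strings like N/A, or a title word followed by a colon.
            out.append(quote(tok))
        else:
            out.append(tok)
    return " ".join(out)
```

For example, `rewrite_query("Yummy Food: A Review")` quotes only the `Food:` token, and `rewrite_query("N/A")` returns the quoted string `"N/A"`.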
diff --git a/proposals/2020_metadata_cleanups.md b/proposals/2020_metadata_cleanups.md
new file mode 100644
index 00000000..e53c47d3
--- /dev/null
+++ b/proposals/2020_metadata_cleanups.md
@@ -0,0 +1,109 @@
+
+status: planning
+
+This proposal tracks a batch of catalog metadata cleanups planned for 2020.
+
+
+## File Hash Duplication
+
+There are at least a few dozen file entities with duplicate SHA-1.
+
+These should simply be merged via redirect. This is probably the simplest
+cleanup case, as the number of entities is low and the complexity of merging
+metadata is also low.
+
+
+## Release Identifier (DOI, PMID, PMCID, arxiv) Duplication
+
+At least a few thousand DOIs (some from Datacite import due to normalization
+behavior, some from previous Crossref issues), hundreds of thousands of PMIDs,
+and an unknown number of PMCIDs and arxiv ids have duplicate releases. That
+is, multiple releases exist with the same external identifier.
+
+The cleanup is the same as with file hashes: the duplicate releases and works
+should be merged (via redirects).
+
+TODO: It is possible that works should be deleted instead of merged.
+
+
+## PDF File Metadata Completeness
+
+All PDF files should be "complete" over {SHA1, SHA256, MD5, size, mimetype},
+and each of these values should be confirmed by calculating it directly
+from the file.
+
+A good fraction of file entities have metadata from direct CDX imports, which
+did not include (uncompressed) size, hashes other than SHA-1, or confirmed
+mimetype. Additionally, the SHA-1 itself is not accurate for the "inner" file
+in a fraction of cases (at least thousands of files, possibly 1% or more) due
+to CDX/WARC behavior with transport compressed bodies (where the recorded SHA-1
+is of the compressed body, not the actual inner file).
+
+
+## File URL Cleanups
+
+The current file URL metadata has a few warts:
+
+- inconsistent or incorrect tagging of URL "rel" type. It is possible we should
+  just strip/skip this tag and always recompute from scratch
+- duplicate URLs (lack of normalization):
+    - `http://example.com/file.pdf`
+    - `http://example.com:80/file.pdf`
+    - `https://example.com/file.pdf`
+    - `http://www.example.com/file.pdf`
+- URLs with many and long query parameters, such as `jsessionid` or AWS token
+  parameters. These are necessary in wayback URLs (for replay), but meaningless
+  and ugly as regular URLs
+- possibly some remaining `https://web.archive.org/web/None/...` URLs, which
+  ideally should be replaced with the actual capture timestamp, or otherwise
+  deleted
+- some year-only wayback links (`https://web.archive.org/web/2016/...`), which
+  are basically the same as `None`
+- many wayback links per file
+
+Some of these issues are partially user-interface driven. There is also a
+balance between wanting many URLs (and datetimes for wayback URLs) for
+diversity and as an archival signal, but there being diminishing returns for
+this kind of completeness.
+
+I would propose that one URL per host and the oldest wayback link per host and
+transport (treating http/https as same transport type, but ftp as distinct) is
+a reasonable constraint, but am open to other opinions. I think that all web
+URLs should be normalized for issues like `jsessionid` and `:80` port
+specification.
+
+In user interface we should limit to a single wayback link, and single link per domain.
+
+NOTE: "host" means the fully qualified domain hostname; domain means the
+"registered" part of the domain.
+
+
+## Container Metadata
+
+At some point, we had many "NULL" publishers.
+
+"Type" coverage should be improved.
+
+"Publisher type" (inferred in various ways in the chocula tool) could be
+included in `extra` and end up in search faceting.
+
+Overall OA status should probably be more sophisticated: gold, green, etc.
+
+
+## Stub Hunting
+
+There are a lot of release entities which should probably be marked `stub` or
+in some other way indicated as unimportant or other (see also proposal to add
+new `release_types`). The main priority is to change the type of releases that
+are currently `published` and "paper-like", thus showing up in coverage stats.
+
+A partial list:
+
+- bad/weird titles
+    - "[Blank page]"
+    - "blank page"
+    - "Temporary Empty DOI 0"
+    - "ADVERTISEMENT"
+    - "Full title page with Editorial board (with Elsevier tree)"
+    - "Advisory Board Editorial Board"
+
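
The URL normalization proposed under "File URL Cleanups" above could be sketched as follows, using only the standard library. This is a sketch, not fatcat code: the `JUNK_PARAMS` list, the `www.` stripping, and the choice to pass `web.archive.org` URLs through untouched (because replay needs the original parameters) are assumptions drawn from the bullet points. The function also drops any userinfo in the URL and may re-encode query values.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}
# Assumed list of session parameters to drop; extend as needed.
JUNK_PARAMS = {"jsessionid"}
JSESSIONID_PATH_RE = re.compile(r";jsessionid=[^/?#;]*", re.IGNORECASE)


def normalize_url(url):
    """Lowercase scheme and host, drop default ports and session parameters."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    if host == "web.archive.org":
        return url  # wayback replay needs the original URL intact
    port = parts.port
    if port in (None, DEFAULT_PORTS.get(scheme)):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = JSESSIONID_PATH_RE.sub("", parts.path)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in JUNK_PARAMS
    ]
    return urlunsplit((scheme, netloc, path, urlencode(kept), parts.fragment))


def dedupe_key(url):
    """Key that treats http/https and www/non-www variants as the same URL."""
    parts = urlsplit(normalize_url(url))
    transport = "web" if parts.scheme in ("http", "https") else parts.scheme
    host = parts.netloc[4:] if parts.netloc.startswith("www.") else parts.netloc
    return (transport, host, parts.path, parts.query)


def dedupe_urls(urls):
    """Keep the first normalized URL seen for each dedupe key, in input order."""
    seen = {}
    for url in urls:
        seen.setdefault(dedupe_key(url), normalize_url(url))
    return list(seen.values())
```

Applied to the four `example.com/file.pdf` variants listed in the proposal, `dedupe_urls` keeps a single entry. Selecting one URL per host and the oldest wayback capture per host would be a separate step on top of this.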