From 9d3abb010249576ddc6c86b4c7c4c5bbb6561ecb Mon Sep 17 00:00:00 2001 From: Bryan Newbold Date: Tue, 21 Jan 2020 17:48:39 -0800 Subject: cleanup some of old TODO list into proposals --- TODO.md | 45 +-------- proposals/20190911_v04_schema_tweaks.md | 2 +- proposals/2020_elasticsearch_schemas.md | 157 ++++++++++++++++++++++++++++++++ proposals/2020_metadata_cleanups.md | 109 ++++++++++++++++++++++ 4 files changed, 269 insertions(+), 44 deletions(-) create mode 100644 proposals/2020_elasticsearch_schemas.md create mode 100644 proposals/2020_metadata_cleanups.md diff --git a/TODO.md b/TODO.md index 0c766204..9538e7ed 100644 --- a/TODO.md +++ b/TODO.md @@ -4,21 +4,9 @@ ## Next Up -- more/better identifier normalization in normalize.py - => then use this code in importers -- update existing 1.5 mil longtail OA PDFs with container/ISSN-L -- use collapsing fields in default release search - => start using elasticsearch-py ## Bugs -- identifier and hash duplication - => couple dozen SHA-1 - => couple thousand DOI - => 400k PMID (!) -- did, somehow, end up with web.archive.org/web/None/ URLs (should remove) -- searching 'N/A' is a bug, because not quoted; auto-quote it? -- author (contrib) names not getting included in search (unless explicit) ## Next Full Release "Touch" @@ -27,7 +15,7 @@ Want to minimize edit counts, so will bundle a bunch of changes - structured contrib names (given, sur) - reference linking (release-to-release), via crossref DOI refs -- subtitle as string, not array +- subtitle as field; remove from extra - remove crossref alt ids that are just the DOI (?) ## Production Public Launch Blockers @@ -44,9 +32,9 @@ Want to minimize edit counts, so will bundle a bunch of changes ## Unsorted +- broader use of external identifier normalizer functions - "delete entity" and "merge entity" webface flows - update editor, editgroup, changelog views? 
-- ability to "edit edits" (update in-progress edits) - review bots: - tests - not actually processing work entities @@ -79,12 +67,10 @@ Want to minimize edit counts, so will bundle a bunch of changes should `release_year` be of date type, instead of int? files: domain list; mimetype; release count; url count; web/publisher/etc; size; has_md5/sha256/sha1; in_ia, in_shadow -- should elastic `release_year` be of date type, instead of int? - webface: still need to collapse links by domain better, and also vs. www.x/x - entity edit JSON objects could include `entity_type` - refactor 'fatcatd' to 'fatcat-api' - changelog elastic stuff (is there even a fatcat-export for this?) -- container count "enrich" - 'hide' flag for exporter (eg, to skip abstracts and refs in some release dumps) - https://tech.labs.oliverwyman.com/blog/2019/01/14/serialising-rust-tests/ - changelog elastic index (for stats) @@ -121,20 +107,12 @@ Want to minimize edit counts, so will bundle a bunch of changes - crossref: many ISBNs not getting copied; use python library to convert? - remove 'first' from contrib crossref 'seq' (not helpful?) - should probably check for 'jats:' in abstract before setting mimetype, even from crossref -- web.archive.org response not SHA1 match? => need /
id_/ thing - XML etc in metadata => (python) tests for these! https://qa.fatcat.wiki/release/search?q=xmlns https://qa.fatcat.wiki/release/search?q=%24gt -- bad/weird titles - "[Blank page]", "blank page" - "Temporary Empty DOI 0" - "ADVERTISEMENT" - "Full title page with Editorial board (with Elsevier tree)" - "Advisory Board Editorial Board" - better/complete reltypes probably good (eg, list of IRs, academic domain) - include crossref-capitalized DOI in extra -- manifest: multiple URLs per SHA1 - crossref: relations ("is-preprint-of") - crossref: two phase: no citations, then matched citations (via DOI table) - special "alias" DOIs... in crossref metadata? @@ -148,21 +126,6 @@ new importers: - more+better terms+policies: https://tosdr.org/index.html -## Fun Features - -- "save paper now" - => is it in GWB? if not, SPN - => get hash + url from GWB, verify mimetype acceptable - => is file in fatcat? - => what about HBase? GROBID? - => create edit, redirect user to editgroup submit page -- python client tool and library in pypi - => or maybe rust? - -## Metadata Harvesting - -- datacite ingest seems to have failed... got a non-HTTP-200 status code, but also "got 50 (161950 of 21084)" - ## Schema / Entity Fields - file+fileset "first seen" datetime @@ -171,8 +134,6 @@ new importers: - `translation_of` field on releases (or similar/general). `retraction_of` to a specific release? `alias_of`/`duplicate_of` - 'part-of' relation for releases (release to release, eg for book chapters) and possibly containers -- `container_type` for containers (journal, conference, book series, etc) - => in schema, needs vocabulary and implementation ## API Schema / Design @@ -185,8 +146,6 @@ new importers: ## Other / Backburner -- file entity full update with all hashes, file size, corrected/expanded wayback links - => some number of files *did* get inserted to fatcat with short (year) datetimes, from old manifest. also no file size. 
- regression test imports for missing orcid display and journal metadata name - try out beautifulsoup? (https://stackoverflow.com/a/34532382/4682349) - `doi` field for containers (at least for "journal" type; maybe for "series" as well?) diff --git a/proposals/20190911_v04_schema_tweaks.md b/proposals/20190911_v04_schema_tweaks.md index 0e789ad1..916e8816 100644 --- a/proposals/20190911_v04_schema_tweaks.md +++ b/proposals/20190911_v04_schema_tweaks.md @@ -36,7 +36,7 @@ API endpoints: - `GET /editor//bots` (?) endpoint to enumerate bots wrangled by a specific editor -Elasticsearch schema: +See `2020_search_improvements` for elasticsearch-only schema updates. - releases *may* need an "_all" field (or `biblio`?) containing most fields to make some search experiences work diff --git a/proposals/2020_elasticsearch_schemas.md b/proposals/2020_elasticsearch_schemas.md new file mode 100644 index 00000000..d931efd3 --- /dev/null +++ b/proposals/2020_elasticsearch_schemas.md @@ -0,0 +1,157 @@ + +status: planning + +This document tracks "easy" elasticsearch schema and behavior changes that +could be made while being backwards compatible with the current v0.3 schema and +not requiring any API/database schema changes. + +## Release Field Additions + +Simple additions: + +- volume +- issue +- pages +- `first_page` (parsed from pages) (?) +- number +- `in_shadow` +- OA license slug (?) +- `doi_prefix` +- `doi_registrar` (based on extra) + +"Array" keyword types for reverse lookups: + +- referenced releases idents +- contrib creator idents + + +## Preservation Summary Field + +To make facet/aggregate queries easier, propose summarizing the preservation +status (from `in_kbart`, `in_ia`, etc) to a `preservation_status` flag which +is: + +- `bright` +- `dark_only` +- `shadow_only` +- `none` + +Note that these don't align with OA color or work-level preservation (aka, no +"green"), it is a release-level status. 
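The summary flag could be derived at index time roughly as follows. This is only a sketch: the function name, the precedence order (bright, then dark, then shadow), and the mapping of `in_ia` to `bright` and `in_kbart` to `dark_only` are assumptions for illustration, not something this proposal specifies.

```python
def preservation_status(in_ia: bool, in_kbart: bool, in_shadow: bool) -> str:
    """Collapse per-source preservation flags into one release-level value.

    Assumed precedence: any publicly accessible copy wins, then dark
    archives, then shadow libraries.
    """
    if in_ia:
        return "bright"
    if in_kbart:
        return "dark_only"
    if in_shadow:
        return "shadow_only"
    return "none"
```

A single keyword field like this would let facet/aggregate queries bucket on one value instead of combining several boolean fields.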
+ +Filters like "papers only", "published only", "not stub", "single container" +would be overlaid in queries. + + +## OA Color Summary Field + +Might not be ready for this yet, but for both releases and containers may be +able to do a better job of indicating OA status/policy for published works. + +Not clear if this should be for "published" only, or whether we should try to +handle embargo time spans and dates. + + +## Release Merged Default Field + +A current issue with searches is that term queries only match on a single +field, unless alternative fields are explicitly indicated. This breaks obvious +queries like "principia newton" which include both title terms and author terms, +or "coffee death bmj" which include a release title and journal title. + +A partial solution to this is to index a field with multiple fields "copied" +into it, and have that be the default for term queries. + +Fields to copy in include at least: + +- `title` +- `subtitle` +- `original_title` +- `container_name` +- names of all authors (maybe not acronyms?) + +May also want to include volume, issue, year, and any container acronyms or +aliases. If we did that, users could paste in citations and there is a better +chance the best match would be the exact cited paper. + +This should be a pretty simple change. The biggest downside will be larger (up +to double?) index size. + + +## Partial Query Parsing + +At some point we may want to build a proper query parser (see separate +proposal), but in the short term there is some low-hanging fruit: simple token +parsing and re-writing we could do. + +- strings like `N/A` which are parse bugs; auto-quote these +- pasting/searching for entire titles which include a word then colon ("Yummy + Food: A Review"). We can detect that "food" is not a valid facet, and quote + that single token +- ability to do an empty search (to get total count) (?) + +This would require at least a simple escaped quotes tokenizer. 
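A minimal sketch of the token rewriting described above, assuming a hypothetical `KNOWN_FACETS` set and a hypothetical `rewrite_query` helper (neither is part of the existing codebase). It keeps already-quoted phrases intact, including backslash-escaped quotes, and wraps any bare token that contains a slash or an unrecognized `word:` prefix in quotes.

```python
import re

# Hypothetical facet names, for illustration only; not the real search schema.
KNOWN_FACETS = {"title", "author", "year", "doi", "container_name"}

# A phrase is an optional `field:` prefix followed by a double-quoted string
# in which backslash escapes (such as \") are allowed.
PHRASE = r'(?:[^\s":]+:)?"(?:\\.|[^"\\])*"'
PHRASE_RE = re.compile(PHRASE)
# A token is either a phrase or a run of non-whitespace characters.
TOKEN_RE = re.compile(PHRASE + r"|\S+")


def rewrite_query(raw: str) -> str:
    """Quote bare tokens that the query parser would otherwise choke on."""
    out = []
    for token in TOKEN_RE.findall(raw):
        if PHRASE_RE.fullmatch(token):
            # Already quoted by the user; pass through untouched.
            out.append(token)
            continue
        field, sep, _rest = token.partition(":")
        bad_facet = bool(sep) and field.lower() not in KNOWN_FACETS
        if "/" in token or bad_facet:
            escaped = token.replace("\\", "\\\\").replace('"', '\\"')
            out.append(f'"{escaped}"')
        else:
            out.append(token)
    return " ".join(out)
```

With these assumptions, `rewrite_query("N/A")` yields `"N/A"` and `rewrite_query("Yummy Food: A Review")` yields `Yummy "Food:" A Review`, while a valid facet such as `year:2019` passes through unchanged. An empty input returns an empty string, which the caller could map to a match-all query.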
+ + +## Basic Filtering + +This would be in the user interface, not schema. + +Add simple Google-style filtering in release searches like: + +- time span (last year, last 5, last 20, all) +- fulltext availability +- release type; stage; withdrawn +- language +- country/region + +For containers: + +- is OA +- stub (zero releases) + +## Work Grouping + +Release searches can be "grouped by" work identifier in the default responses, +to prevent the common situation where there are multiple releases which are just +different versions of the same work. + +Need to ensure this is performant. + +Would need to update query UI/UX to display another line under hits ("also XYZ +other copies {including retraction or update} {having fulltext if this +hit does not}"). + + +## Container Fields + +- `all_issns` +- `release_count` + +The `release_count` would not be indexed (left null) by default, and would be +"patched" into entities by a separate script (periodically?). + + +## Container Copied Fields + +Like releases, container entities could have a merged biblio field to use as +default in term queries: + +- `name` +- `original_name` +- `aliases` (in extra?) +- `publisher` + +Maybe also language and country names? + + +## Background Reading + +"Which Academic Search Systems are Suitable for Systematic Reviews or +Meta-Analyses? Evaluating Retrieval Qualities of Google Scholar, PubMed and 26 +other Resources" + +https://musingsaboutlibrarianship.blogspot.com/2019/12/the-rise-of-open-discovery-indexes.html + +"Scholarly Search Engine Comparison" +https://docs.google.com/spreadsheets/d/1ZiCUuKNse8dwHRFAyhFsZsl6kG0Fkgaj5gttdwdVZEM/edit#gid=1016151070 diff --git a/proposals/2020_metadata_cleanups.md b/proposals/2020_metadata_cleanups.md new file mode 100644 index 00000000..e53c47d3 --- /dev/null +++ b/proposals/2020_metadata_cleanups.md @@ -0,0 +1,109 @@ + +status: planning + +This proposal tracks a batch of catalog metadata cleanups planned for 2020. 
+ + +## File Hash Duplication + +There are at least a few dozen file entities with duplicate SHA-1. + +These should simply be merged via redirect. This is probably the simplest +cleanup case, as the number of entities is low and the complexity of merging +metadata is also low. + + +## Release Identifier (DOI, PMID, PMCID, arxiv) Duplication + +At least a few thousand DOIs (some from Datacite import due to normalization +behavior, some from previous Crossref issues), hundreds of thousands of PMIDs, +and an unknown number of PMCIDs and arxiv ids have duplicate releases. This +means multiple releases exist with the same external identifier. + +The cleanup is the same as with file hashes: the duplicate releases and works +should be merged (via redirects). + +TODO: It is possible that works should be deleted instead of merged. + + +## PDF File Metadata Completeness + +All PDF files should be "complete" over {SHA1, SHA256, MD5, size, mimetype}, +all of which should be confirmed by calculating the values directly +from the file. + +A good fraction of file entities have metadata from direct CDX imports, which +did not include (uncompressed) size, hashes other than SHA-1, or confirmed +mimetype. Additionally, the SHA-1 itself is not accurate for the "inner" file +in a fraction of cases (at least thousands of files, possibly 1% or more) due +to CDX/WARC behavior with transport-compressed bodies (where the recorded SHA-1 +is of the compressed body, not the actual inner file). + + +## File URL Cleanups + +The current file URL metadata has a few warts: + +- inconsistent or incorrect tagging of URL "rel" type. It is possible we should + just strip/skip this tag and always recompute from scratch +- duplicate URLs (lack of normalization): + - `http://example.com/file.pdf` + - `http://example.com:80/file.pdf` + - `https://example.com/file.pdf` + - `http://www.example.com/file.pdf` +- URLs with many and long query parameters, such as `jsessionid` or AWS token + parameters. 
These are necessary in wayback URLs (for replay), but meaningless + and ugly as regular URLs +- possibly some remaining `https://web.archive.org/web/None/...` URLs, which + ideally should be replaced with the actual capture timestamp or at least + deleted +- some year-only wayback links (`https://web.archive.org/web/2016/...`) + basically the same as `None` +- many wayback links per file + +Some of these issues are partially user-interface driven. There is also a +balance between wanting many URLs (and datetimes for wayback URLs) for +diversity and as an archival signal, but there being diminishing returns for +this kind of completeness. + +I would propose that one URL per host and the oldest wayback link per host and +transport (treating http/https as same transport type, but ftp as distinct) is +a reasonable constraint, but am open to other opinions. I think that all web +URLs should be normalized for issues like `jsessionid` and `:80` port +specification. + +In the user interface we should limit to a single wayback link, and a single link per domain. + +NOTE: "host" means the fully qualified domain hostname; domain means the +"registered" part of the domain. + + +## Container Metadata + +At some point, we had many "NULL" publishers. + +"Type" coverage should be improved. + +"Publisher type" (inferred in various ways in the chocula tool) could be included in +`extra` and end up in search faceting. + +Overall OA status should probably be more sophisticated: gold, green, etc. + + +## Stub Hunting + +There are a lot of release entities which should probably be marked `stub` or +in some other way indicated as unimportant or other (see also proposal to add +new `release_types`). The main priority is to change the type of releases that +are currently `published` and "paper-like", thus showing up in coverage stats. 
+ +A partial list: + +- bad/weird titles + - "[Blank page]" + - "blank page" + - "Temporary Empty DOI 0" + - "ADVERTISEMENT" + - "Full title page with Editorial board (with Elsevier tree)" + - "Advisory Board Editorial Board" + -- cgit v1.2.3
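As a rough illustration of how the bad/weird titles in the partial list above might be flagged during cleanup, here is a minimal sketch. The `is_stub_title` helper is hypothetical, and exact matching after case and whitespace normalization is an assumption; a real cleanup pass would probably need fuzzier rules and human review.

```python
# Titles copied from the partial list in the proposal, stored lowercased.
STUB_TITLES = {
    "[blank page]",
    "blank page",
    "temporary empty doi 0",
    "advertisement",
    "full title page with editorial board (with elsevier tree)",
    "advisory board editorial board",
}


def is_stub_title(title: str) -> bool:
    """Return True if the title matches a known stub title.

    Whitespace is collapsed and case is ignored before comparing.
    """
    normalized = " ".join(title.split()).casefold()
    return normalized in STUB_TITLES
```

A cleanup bot could use a check like this to select `published`, paper-like releases as candidates for re-typing, rather than changing them automatically.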