From 9d3abb010249576ddc6c86b4c7c4c5bbb6561ecb Mon Sep 17 00:00:00 2001
From: Bryan Newbold
Date: Tue, 21 Jan 2020 17:48:39 -0800
Subject: cleanup some of old TODO list into proposals

---
 proposals/2020_metadata_cleanups.md | 109 ++++++++++++++++++++++++++++++++++++
 1 file changed, 109 insertions(+)
 create mode 100644 proposals/2020_metadata_cleanups.md

diff --git a/proposals/2020_metadata_cleanups.md b/proposals/2020_metadata_cleanups.md
new file mode 100644
index 00000000..e53c47d3
--- /dev/null
+++ b/proposals/2020_metadata_cleanups.md
@@ -0,0 +1,109 @@

status: planning

This proposal tracks a batch of catalog metadata cleanups planned for 2020.


## File Hash Duplication

There are at least a few dozen file entities with duplicate SHA-1 hashes.

These should simply be merged via redirect. This is probably the simplest
cleanup case, as the number of entities is low and the complexity of merging
metadata is also low.


## Release Identifier (DOI, PMID, PMCID, arxiv) Duplication

At least a few thousand DOIs (some from the Datacite import due to
normalization behavior, some from previous Crossref issues), hundreds of
thousands of PMIDs, and an unknown number of PMCIDs and arxiv ids have
duplicate releases. That is, multiple releases exist with the same external
identifier.

The cleanup is the same as with file hashes: the duplicate releases and works
should be merged (via redirects).

TODO: it is possible that works should be deleted instead of merged.


## PDF File Metadata Completeness

All PDF file entities should be "complete" over {SHA-1, SHA-256, MD5, size,
mimetype}, with every one of these values confirmed by calculating them
directly from the file.

A good fraction of file entities have metadata from direct CDX imports, which
did not include (uncompressed) size, hashes other than SHA-1, or a confirmed
mimetype. Additionally, in a fraction of cases (at least thousands of files,
possibly 1% or more) the SHA-1 itself is not accurate for the "inner" file,
due to CDX/WARC behavior with transport-compressed bodies (the recorded SHA-1
is of the compressed body, not of the actual inner file).


## File URL Cleanups

The current file URL metadata has a few warts:

- inconsistent or incorrect tagging of URL "rel" type. It is possible we
  should just strip/skip this tag and always recompute it from scratch
- duplicate URLs (lack of normalization):
    - `http://example.com/file.pdf`
    - `http://example.com:80/file.pdf`
    - `https://example.com/file.pdf`
    - `http://www.example.com/file.pdf`
- URLs with many and long query parameters, such as `jsessionid` or AWS token
  parameters. These are necessary in wayback URLs (for replay), but
  meaningless and ugly as regular URLs
- possibly some remaining `https://web.archive.org/web/None/...` URLs, which
  at best should be replaced with the actual capture timestamp, or at least
  deleted
- some year-only wayback links (`https://web.archive.org/web/2016/...`),
  basically the same problem as `None`
- many wayback links per file

Some of these issues are partially user-interface driven. There is also a
balance between wanting many URLs (and datetimes for wayback URLs) for
diversity and as an archival signal, and the diminishing returns of that kind
of completeness.

I would propose that one URL per host, plus the oldest wayback link per host
and transport (treating http/https as the same transport type, but ftp as
distinct), is a reasonable constraint, but am open to other opinions. I think
all web URLs should be normalized for issues like `jsessionid` and `:80` port
specification; see the sketch below.

In the user interface we should limit display to a single wayback link and a
single link per domain.

NOTE: "host" means the fully qualified domain hostname; "domain" means the
"registered" part of the domain.
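To make the intended normalization concrete, here is a rough Python sketch.
The helper names (`normalize_url`, `oldest_wayback_per_host`) and the query
parameter blocklist are illustrative assumptions for this proposal, not
existing code in this codebase:

```python
import urllib.parse

# Illustrative blocklist of session/token query parameters; an actual
# cleanup would need a vetted list (eg, the full set of AWS token params)
STRIP_PARAMS = {"jsessionid", "phpsessid", "token", "x-amz-signature"}

DEFAULT_PORTS = {("http", 80), ("https", 443), ("ftp", 21)}

def normalize_url(raw: str) -> str:
    """Normalize a direct (non-wayback) URL for de-duplication."""
    u = urllib.parse.urlsplit(raw)
    scheme = u.scheme.lower()
    netloc = (u.hostname or "").lower()
    # keep only non-default ports (this drops `:80` on http, etc)
    if u.port and (scheme, u.port) not in DEFAULT_PORTS:
        netloc = f"{netloc}:{u.port}"
    # `;jsessionid=...` can be glued onto the path as well as the query
    path = u.path.split(";jsessionid=", 1)[0]
    query = urllib.parse.urlencode(
        [(k, v) for k, v in urllib.parse.parse_qsl(u.query)
         if k.lower() not in STRIP_PARAMS]
    )
    return urllib.parse.urlunsplit((scheme, netloc, path, query, ""))

def transport(scheme: str) -> str:
    # http and https count as the same transport type; ftp stays distinct
    return "http" if scheme in ("http", "https") else scheme

def oldest_wayback_per_host(wayback_urls):
    """Keep the oldest wayback capture per (host, transport) of the
    embedded original URL; skips `None` and year-only timestamps."""
    best = {}
    for wb in wayback_urls:
        _, _, rest = wb.partition("/web/")
        timestamp, _, original = rest.partition("/")
        if len(timestamp) != 14 or not timestamp.isdigit():
            continue
        o = urllib.parse.urlsplit(original)
        key = ((o.hostname or "").lower(), transport(o.scheme.lower()))
        if key not in best or timestamp < best[key][0]:
            best[key] = (timestamp, wb)
    return [wb for _, wb in best.values()]
```

Note that `normalize_url` intentionally preserves http vs. https (and does
not strip `www.`), so collapsing the duplicate URL examples above would be a
merge decision layered on top of this normalization, not part of it.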
## Container Metadata

At some point, the catalog ended up with many "NULL" publishers; these should
be cleaned up.

"Type" coverage should be improved.

"Publisher type" (inferred in various ways in the chocula tool) could be
included in `extra` and end up in search faceting.

Overall OA status should probably be more sophisticated: gold, green, etc.


## Stub Hunting

There are a lot of release entities which should probably be marked `stub`,
or in some other way indicated as unimportant (see also the proposal to add
new `release_types`). The main priority is to change the type of releases
that are currently `published` and "paper-like", and thus show up in coverage
stats.

A partial list of signals (see the matching sketch after this list):

- bad/weird titles
    - "[Blank page]"
    - "blank page"
    - "Temporary Empty DOI 0"
    - "ADVERTISEMENT"
    - "Full title page with Editorial board (with Elsevier tree)"
    - "Advisory Board Editorial Board"
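As a starting point, a minimal sketch of what title-based stub flagging might
look like, using only the patterns listed above; the helper name and the
matching rules are illustrative, not an existing tool:

```python
import re

# Known-bad titles from the partial list above, in normalized form
STUB_TITLES = {
    "[blank page]",
    "blank page",
    "advertisement",
    "full title page with editorial board (with elsevier tree)",
    "advisory board editorial board",
}

def is_probable_stub(title: str) -> bool:
    """Return True if a release title matches a known stub pattern."""
    slug = re.sub(r"\s+", " ", title.strip().lower())
    # assume the trailing number in "Temporary Empty DOI 0" may vary
    if re.fullmatch(r"temporary empty doi \d+", slug):
        return True
    return slug in STUB_TITLES

assert is_probable_stub("ADVERTISEMENT")
assert not is_probable_stub("Advertising and Public Health")
```

In practice this check would run over a release entity dump and emit
`release_type` updates, but the matching core would look the same.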