authorBryan Newbold <bnewbold@robocracy.org>2020-01-21 17:48:39 -0800
committerBryan Newbold <bnewbold@robocracy.org>2020-01-21 17:48:39 -0800
commit9d3abb010249576ddc6c86b4c7c4c5bbb6561ecb (patch)
tree500e3e593ae61fd3c22831c35cfe0a55741759fb
parent2fcc59388a4eb53a7e2370275366272459874e99 (diff)
cleanup some of old TODO list into proposals
-rw-r--r--TODO.md45
-rw-r--r--proposals/20190911_v04_schema_tweaks.md2
-rw-r--r--proposals/2020_elasticsearch_schemas.md157
-rw-r--r--proposals/2020_metadata_cleanups.md109
4 files changed, 269 insertions, 44 deletions
diff --git a/TODO.md b/TODO.md
index 0c766204..9538e7ed 100644
--- a/TODO.md
+++ b/TODO.md
@@ -4,21 +4,9 @@
## Next Up
-- more/better identifier normalization in normalize.py
- => then use this code in importers
-- update existing 1.5 mil longtail OA PDFs with container/ISSN-L
-- use collapsing fields in default release search
- => start using elasticsearch-py
## Bugs
-- identifier and hash duplication
- => couple dozen SHA-1
- => couple thousand DOI
- => 400k PMID (!)
-- did, somehow, end up with web.archive.org/web/None/ URLs (should remove)
-- searching 'N/A' is a bug, because not quoted; auto-quote it?
-- author (contrib) names not getting included in search (unless explicit)
## Next Full Release "Touch"
@@ -27,7 +15,7 @@ Want to minimize edit counts, so will bundle a bunch of changes
- structured contrib names (given, sur)
- reference linking (release-to-release), via crossref DOI refs
-- subtitle as string, not array
+- subtitle as field; remove from extra
- remove crossref alt ids that are just the DOI (?)
## Production Public Launch Blockers
@@ -44,9 +32,9 @@ Want to minimize edit counts, so will bundle a bunch of changes
## Unsorted
+- broader use of external identifier normalizer functions
- "delete entity" and "merge entity" webface flows
- update editor, editgroup, changelog views?
-- ability to "edit edits" (update in-progress edits)
- review bots:
- tests
- not actually processing work entities
@@ -79,12 +67,10 @@ Want to minimize edit counts, so will bundle a bunch of changes
should `release_year` be of date type, instead of int?
files: domain list; mimetype; release count; url count; web/publisher/etc;
size; has_md5/sha256/sha1; in_ia, in_shadow
-- should elastic `release_year` be of date type, instead of int?
- webface: still need to collapse links by domain better, and also vs. www.x/x
- entity edit JSON objects could include `entity_type`
- refactor 'fatcatd' to 'fatcat-api'
- changelog elastic stuff (is there even a fatcat-export for this?)
-- container count "enrich"
- 'hide' flag for exporter (eg, to skip abstracts and refs in some release dumps)
- https://tech.labs.oliverwyman.com/blog/2019/01/14/serialising-rust-tests/
- changelog elastic index (for stats)
@@ -121,20 +107,12 @@ Want to minimize edit counts, so will bundle a bunch of changes
- crossref: many ISBNs not getting copied; use python library to convert?
- remove 'first' from contrib crossref 'seq' (not helpful?)
- should probably check for 'jats:' in abstract before setting mimetype, even from crossref
-- web.archive.org response not SHA1 match? => need /<dt>id_/ thing
- XML etc in metadata
=> (python) tests for these!
https://qa.fatcat.wiki/release/search?q=xmlns
https://qa.fatcat.wiki/release/search?q=%24gt
-- bad/weird titles
- "[Blank page]", "blank page"
- "Temporary Empty DOI 0"
- "ADVERTISEMENT"
- "Full title page with Editorial board (with Elsevier tree)"
- "Advisory Board Editorial Board"
- better/complete reltypes probably good (eg, list of IRs, academic domain)
- include crossref-capitalized DOI in extra
-- manifest: multiple URLs per SHA1
- crossref: relations ("is-preprint-of")
- crossref: two phase: no citations, then matched citations (via DOI table)
- special "alias" DOIs... in crossref metadata?
@@ -148,21 +126,6 @@ new importers:
- more+better terms+policies: https://tosdr.org/index.html
-## Fun Features
-
-- "save paper now"
- => is it in GWB? if not, SPN
- => get hash + url from GWB, verify mimetype acceptable
- => is file in fatcat?
- => what about HBase? GROBID?
- => create edit, redirect user to editgroup submit page
-- python client tool and library in pypi
- => or maybe rust?
-
-## Metadata Harvesting
-
-- datacite ingest seems to have failed... got a non-HTTP-200 status code, but also "got 50 (161950 of 21084)"
-
## Schema / Entity Fields
- file+fileset "first seen" datetime
@@ -171,8 +134,6 @@ new importers:
- `translation_of` field on releases (or similar/general). `retraction_of` to a
specific release? `alias_of`/`duplicate_of`
- 'part-of' relation for releases (release to release, eg for book chapters) and possibly containers
-- `container_type` for containers (journal, conference, book series, etc)
- => in schema, needs vocabulary and implementation
## API Schema / Design
@@ -185,8 +146,6 @@ new importers:
## Other / Backburner
-- file entity full update with all hashes, file size, corrected/expanded wayback links
- => some number of files *did* get inserted to fatcat with short (year) datetimes, from old manifest. also no file size.
- regression test imports for missing orcid display and journal metadata name
- try out beautifulsoup? (https://stackoverflow.com/a/34532382/4682349)
- `doi` field for containers (at least for "journal" type; maybe for "series" as well?)
diff --git a/proposals/20190911_v04_schema_tweaks.md b/proposals/20190911_v04_schema_tweaks.md
index 0e789ad1..916e8816 100644
--- a/proposals/20190911_v04_schema_tweaks.md
+++ b/proposals/20190911_v04_schema_tweaks.md
@@ -36,7 +36,7 @@ API endpoints:
- `GET /editor/<ident>/bots` (?) endpoint to enumerate bots wrangled by a
specific editor
-Elasticsearch schema:
+See `2020_search_improvements` for elasticsearch-only schema updates.
- releases *may* need an "_all" field (or `biblio`?) containing most fields to
make some search experiences work
diff --git a/proposals/2020_elasticsearch_schemas.md b/proposals/2020_elasticsearch_schemas.md
new file mode 100644
index 00000000..d931efd3
--- /dev/null
+++ b/proposals/2020_elasticsearch_schemas.md
@@ -0,0 +1,157 @@
+
+status: planning
+
+This document tracks "easy" elasticsearch schema and behavior changes that
+could be made while being backwards compatible with the current v0.3 schema and
+not requiring any API/database schema changes.
+
+## Release Field Additions
+
+Simple additions:
+
+- volume
+- issue
+- pages
+- `first_page` (parsed from pages) (?)
+- number
+- `in_shadow`
+- OA license slug (?)
+- `doi_prefix`
+- `doi_registrar` (based on extra)
+
+"Array" keyword types for reverse lookups:
+
+- referenced releases idents
+- contrib creator idents
+
+
+## Preservation Summary Field
+
+To make facet/aggregate queries easier, propose summarizing the preservation
+status (from `in_kbart`, `in_ia`, etc) to a `preservation_status` flag which
+is:
+
+- `bright`
+- `dark_only`
+- `shadow_only`
+- `none`
+
+Note that these don't align with OA color or work-level preservation (i.e., no
+"green"); it is a release-level status.
+
+Filters like "papers only", "published only", "not stub", "single container"
+would be overlaid in queries.
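A minimal sketch of how the summary flag could be derived at index time, in Python. The precedence order (bright over dark over shadow) and the mapping of `in_ia` to `bright` and `in_kbart` to `dark_only` are assumptions made for illustration; the proposal names only the flag values and the kind of source fields.

```python
def preservation_status(doc: dict) -> str:
    """Collapse per-source boolean flags into a single release-level status.

    Assumed mapping (not confirmed by the proposal):
      in_ia     -> publicly accessible copy ("bright")
      in_kbart  -> dark archive only ("dark_only")
      in_shadow -> shadow library only ("shadow_only")
    """
    if doc.get("in_ia"):
        return "bright"
    if doc.get("in_kbart"):
        return "dark_only"
    if doc.get("in_shadow"):
        return "shadow_only"
    return "none"
```

With a single keyword field like this, a facet over preservation becomes one terms aggregation instead of a combination of boolean filters.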
+
+
+## OA Color Summary Field
+
+Might not be ready for this yet, but for both releases and containers may be
+able to do a better job of indicating OA status/policy for published works.
+
+Not clear if this should be for "published" only, or whether we should try to
+handle embargo time spans and dates.
+
+
+## Release Merged Default Field
+
+A current issue with searches is that term queries only match on a single
+field, unless alternative fields are explicitly indicated. This breaks obvious
+queries like "principa newton" which include both title terms and author terms,
+or "coffee death bmj" which include a release title and journal title.
+
+A partial solution to this is to index a field with multiple fields "copied"
+into it, and have that be the default for term queries.
+
+Fields to copy in include at least:
+
+- `title`
+- `subtitle`
+- `original_title`
+- `container_name`
+- names of all authors (maybe not acronyms?)
+
+May also want to include volume, issue, year, and any container acronyms or
+aliases. If we did that, users could paste in citations and there is a better
+chance the best match would be the exact cited paper.
+
+This should be a pretty simple change. The biggest downside will be larger (up
+to double?) index size.
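One way to get this behavior is Elasticsearch's `copy_to` mapping parameter. The sketch below builds a mapping fragment as a Python dict; the target field name `biblio` and the `contrib_names` field are assumptions for illustration, not the current schema.

```python
import json

# Fields listed in the proposal. "contrib_names" is a hypothetical name for
# the field holding author names.
COPY_FIELDS = ["title", "subtitle", "original_title", "container_name", "contrib_names"]


def release_mapping_fragment(target: str = "biblio") -> dict:
    """Return a mapping fragment where each source field is copied into `target`."""
    properties = {name: {"type": "text", "copy_to": target} for name in COPY_FIELDS}
    # The merged field itself is an ordinary text field.
    properties[target] = {"type": "text"}
    return {"properties": properties}


if __name__ == "__main__":
    print(json.dumps(release_mapping_fragment(), indent=2))
```

Term queries would then name the merged field as their default, for example via the `default_field` parameter of a `query_string` query.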
+
+
+## Partial Query Parsing
+
+At some point we may want to build a proper query parser (see separate
+proposal), but in the short term there is some low-hanging fruit: simple token
+parsing and re-writing we could do.
+
+- strings like `N/A` which are parse bugs; auto-quote these
+- pasting/searching for entire titles which include a word then colon ("Yummy
+ Food: A Review"). We can detect that "food" is not a valid facet, and quote
+ that single token
+- ability to do an empty search (to get total count) (?)
+
+This would require at least a simple escaped quotes tokenizer.
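A rough sketch of such a tokenizer and re-writer, in Python. The `KNOWN_FIELDS` set is hypothetical, and the rules (quote any token containing `/`, quote any `word:` prefix that is not a known field) are one possible reading of the list above rather than a settled design.

```python
import re

# Hypothetical set of field names accepted as "field:value" filters.
KNOWN_FIELDS = {"title", "year", "doi", "container_name", "release_type"}

# A token is either a double-quoted string (backslash escapes allowed)
# or a run of non-whitespace characters.
TOKEN_RE = re.compile(r'"(?:\\.|[^"\\])*"|\S+')


def quote(text: str) -> str:
    """Wrap text in double quotes, escaping backslashes and inner quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def rewrite_token(token: str) -> str:
    if len(token) >= 2 and token.startswith('"') and token.endswith('"'):
        return token  # already quoted by the user
    field, sep, value = token.partition(":")
    if sep and value and field.lower() in KNOWN_FIELDS:
        # Valid filter: keep the field, quote the value only if it has a slash.
        return f"{field}:{quote(value)}" if "/" in value else token
    if "/" in token or ":" in token:
        return quote(token)  # e.g. N/A, or "Food:" from a pasted title
    return token


def rewrite_query(raw: str) -> str:
    return " ".join(rewrite_token(t) for t in TOKEN_RE.findall(raw))
```

For example, `rewrite_query("Yummy Food: A Review")` quotes only the `Food:` token and leaves the other words untouched.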
+
+
+## Basic Filtering
+
+This would be in the user interface, not schema.
+
+Add simple Google-style filtering in release searches, like:
+
+- time span (last year, last 5, last 20, all)
+- fulltext availability
+- release type; stage; withdrawn
+- language
+- country/region
+
+For containers:
+
+- is OA
+- stub (zero releases)
+
+## Work Grouping
+
+Release searches can be "grouped by" work identifier in the default responses,
+to prevent the common situation where there are multiple releases which are just
+different versions of the same work.
+
+Need to ensure this is performant.
+
+Would need to update query UI/UX to display another line under hits ("also XYZ
+other copies {including retraction or update} {having fulltext if this
+hit does not}").
+
+
+## Container Fields
+
+- `all_issns`
+- `release_count`
+
+The `release_count` would not be indexed (left null) by default, and would be
+"patched" in to entities by a separate script (periodically?).
+
+
+## Container Copied Fields
+
+Like releases, container entities could have a merged biblio field to use as
+default in term queries:
+
+- `name`
+- `original_name`
+- `aliases` (in extra?)
+- `publisher`
+
+Maybe also language and country names?
+
+
+## Background Reading
+
+"Which Academic Search Systems are Suitable for Systematic Reviews or
+Meta-Analyses? Evaluating Retrieval Qualities of Google Scholar, PubMed and 26
+other Resources"
+
+https://musingsaboutlibrarianship.blogspot.com/2019/12/the-rise-of-open-discovery-indexes.html
+
+"Scholarly Search Engine Comparison"
+https://docs.google.com/spreadsheets/d/1ZiCUuKNse8dwHRFAyhFsZsl6kG0Fkgaj5gttdwdVZEM/edit#gid=1016151070
diff --git a/proposals/2020_metadata_cleanups.md b/proposals/2020_metadata_cleanups.md
new file mode 100644
index 00000000..e53c47d3
--- /dev/null
+++ b/proposals/2020_metadata_cleanups.md
@@ -0,0 +1,109 @@
+
+status: planning
+
+This proposal tracks a batch of catalog metadata cleanups planned for 2020.
+
+
+## File Hash Duplication
+
+There are at least a few dozen file entities with duplicate SHA-1.
+
+These should simply be merged via redirect. This is probably the simplest
+cleanup case, as the number of entities is low and the complexity of merging
+metadata is also low.
+
+
+## Release Identifier (DOI, PMID, PMCID, arxiv) Duplication
+
+At least a few thousand DOIs (some from Datacite import due to normalization
+behavior, some from previous Crossref issues), hundreds of thousands of PMIDs,
+and an unknown number of PMCIDs and arxiv ids have duplicate releases. That is,
+multiple releases exist with the same external identifier.
+
+The cleanup is the same as with file hashes: the duplicate releases and works
+should be merged (via redirects).
+
+TODO: It is possible that works should be deleted instead of merged.
+
+
+## PDF File Metadata Completeness
+
+All PDF files should be "complete" over {SHA1, SHA256, MD5, size, mimetype},
+all of which metadata should be confirmed by calculating the values directly
+from the file.
+
+A good fraction of file entities have metadata from direct CDX imports, which
+did not include (uncompressed) size, hashes other than SHA-1, or confirmed
+mimetype. Additionally, the SHA-1 itself is not accurate for the "inner" file
+in a fraction of cases (at least thousands of files, possibly 1% or more) due
+to CDX/WARC behavior with transport compressed bodies (where the recorded SHA-1
+is of the compressed body, not the actual inner file).
+
+
+## File URL Cleanups
+
+The current file URL metadata has a few warts:
+
+- inconsistent or incorrect tagging of URL "rel" type. It is possible we should
+ just strip/skip this tag and always recompute from scratch
+- duplicate URLs (lack of normalization):
+ - `http://example.com/file.pdf`
+ - `http://example.com:80/file.pdf`
+ - `https://example.com/file.pdf`
+ - `http://www.example.com/file.pdf`
+- URLs with many and long query parameters, such as `jsessionid` or AWS token
+ parameters. These are necessary in wayback URLs (for replay), but meaningless
+ and ugly as regular URLs
+- possibly some remaining `https://web.archive.org/web/None/...` URLs, which
+ at best should be replaced with the actual capture timestamp or at least
+ deleted
+- some year-only wayback links (`https://web.archive.org/web/2016/...`), which
+  are basically the same as `None`
+- many wayback links per file
+
+Some of these issues are partially user-interface driven. There is also a
+balance between wanting many URLs (and datetimes for wayback URLs) for
+diversity and as an archival signal, but there being diminishing returns for
+this kind of completeness.
+
+I would propose that one URL per host and the oldest wayback link per host and
+transport (treating http/https as same transport type, but ftp as distinct) is
+a reasonable constraint, but am open to other opinions. I think that all web
+URLs should be normalized for issues like `jsessionid` and `:80` port
+specification.
+
+In the user interface we should limit to a single wayback link, and a single link per domain.
+
+NOTE: "host" means the fully qualified domain hostname; domain means the
+"registered" part of the domain.
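As a sketch of the normalization proposed above, the Python below strips default ports and `jsessionid` parameters, then keeps one URL per host. It deliberately does not merge `http`/`https` or `www.` variants, since the proposal leaves that open; the function names are illustrative, and userinfo in the URL is dropped as a simplification.

```python
import re
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}
# Matches a ";jsessionid=..." path parameter, as emitted by some Java servers.
JSESSIONID_PATH = re.compile(r";jsessionid=[^/?#;]*", re.IGNORECASE)


def normalize_url(url: str) -> str:
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = parts.hostname or ""  # lowercased, without port or userinfo
    port = parts.port
    # Keep the port only when it is not the default for the scheme.
    if port is None or DEFAULT_PORTS.get(scheme) == port:
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = JSESSIONID_PATH.sub("", parts.path)
    # Drop jsessionid query parameters; leave other parameters byte-for-byte.
    query = "&".join(
        p for p in parts.query.split("&")
        if p and not p.lower().startswith("jsessionid=")
    )
    return urlunsplit((scheme, netloc, path, query, parts.fragment))


def one_url_per_host(urls: list) -> list:
    """Keep the first normalized URL seen for each fully qualified hostname."""
    by_host = {}
    for url in urls:
        norm = normalize_url(url)
        by_host.setdefault(urlsplit(norm).hostname, norm)
    return list(by_host.values())
```

Picking the oldest wayback link per host would follow the same pattern, keyed on the host of the captured URL and comparing capture timestamps.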
+
+
+## Container Metadata
+
+At some point, the catalog had many "NULL" publishers.
+
+"Type" coverage should be improved.
+
+"Publisher type" (inferred in various ways in chocula tool) could be included in
+`extra` and end up in search faceting.
+
+Overall OA status should probably be more sophisticated: gold, green, etc.
+
+
+## Stub Hunting
+
+There are a lot of release entities which should probably be marked `stub` or
+in some other way indicated as unimportant or other (see also proposal to add
+new `release_types`). The main priority is to change the type of releases that
+are currently `published` and "paper-like", thus showing up in coverage stats.
+
+A partial list:
+
+- bad/weird titles
+ - "[Blank page]"
+ - "blank page"
+ - "Temporary Empty DOI 0"
+ - "ADVERTISEMENT"
+ - "Full title page with Editorial board (with Elsevier tree)"
+ - "Advisory Board Editorial Board"
+