
## In Progress

## Prod Metadata Checks

x edit and editgroup metadata
x crossref citation not saving 'article-title' or 'unstructured', and 'author' should be 'authors' (list)
x crossref not saving 'language' (looks like iso code already)
- longtail_oa flag getting set on GROBID imports
- grobid reference should be under extra (not nested): issue, volume, authors
- uniqueness of:
    sha1 - via SQL dump
    doi - via SQL dump
    issnl - via JSON dump
    orcid - via JSON dump

notes:
- crossref references look great!
- extra/crossref/alternative-id often includes exact full DOI
    10.1158/1538-7445.AM10-3529
    10.1158/1538-7445.am10-3529
    => but not always? publisher-specific
- contribs[]/extra/seq often has "first" from crossref
    => is this helpful?
- abstracts content is fine, but should probably check for "jats:" when setting mimetype
x BUG: `license_slug` when https://creativecommons.org/licenses/by-nc-sa/4.0
    => https://api.qa.fatcat.wiki/v0/release/55y37c3dtfcw3nw5owugwwhave
    10.26891/jik.v10i2.2016.92-97
- original title works, yay!
    https://api.qa.fatcat.wiki/v0/release/nlmnplhrgbdalcy472hfb2z3im
    10.2504/kds.26.358
- new license: https://www.karger.com/Services/SiteLicenses
- not copying ISBNs: 10.1016/b978-0-08-037302-7.50022-7
    "9780080373027"
    could at least put in alternative-id?
- BUG: subtitle coming through as an array, not string
- `license_slug` does get set
    eg for PLOS ONE http://creativecommons.org/licenses/by/4.0/

## Next Up

- bootstrap_bots script should set -ex and output admin and webface tokens
- regression test imports for missing orcid display and journal metadata name
- several tweaks/fixes to webface (eg, container metadata schema changed)
- container count "enrich"
- changelog elastic stuff (is there even a fatcat-export for this?)
- QA sentry has very little host info; also not URL of request
- start prod crossref harvesting (from ~start of 2019)
- 158 "NULL" publishers in journal metadata
- should elastic release_year be of date type, instead of int?
- QA/prod needs updated credentials
- ansible: ISSN-L download/symlink
- searching 'N/A' is a bug
- formalize release_status:
    => https://wiki.surfnet.nl/display/DRIVERguidelines/DRIVER-VERSION+Mappings
- entity edit JSON objects could include `entity_type`

## Production public launch blockers

- handle 'wip' status entities in web UI
- guide updates for auth
- privacy policy, and link from: create account, create edit
- refactors and correctness in rust/TODO
- update /about page

## Production Tech Sanity

- postgresql replication
- pg_dump/load test
- haproxy somewhere/how
- logging iteration: larger journald buffers? point somewhere?

## Ideas

- ORCID apparently has 37 mil "work activities" (patents, etc), and only 14 mil
  unique DOIs; could import those other "work activities"? do they have
  identifiers?
- write up notes on biblio metadata in general
    => "extensibility" and extra keys
    => proliferation of arrays vs. concrete values
    => various ways to record history/progeny
    => "subtitle", "short-title", "full-title" complexity
    => human names
    => translated metadata: titles/names/abstracts
    => "typing" for metadata (eg, math in titles)
- 'hide' flag for exporter (eg, to skip abstracts and refs in some release dumps)
- https://tech.labs.oliverwyman.com/blog/2019/01/14/serialising-rust-tests/
- use https://github.com/codelucas/newspaper to extract fulltext+metadata from
  HTML crawls
- changelog elastic index (for stats)
- import from arabesque output (eg, specific crawls)
- more logins: orcid, wikimedia
- `fatcat-auth` tool should support more caveats, both when generating new or
  mutating existing tokens
- fast path to skip recursive redirect checks for bulk inserts
- when getting "wip" entities, require a parameter ("allow_wip"), else get a 404
- consider dropping CORE identifier
- maybe better 'success' return message?
    eg, "success: true" flag
- idea: allow users to generate their own editgroup UUIDs, to reduce round
  trips and "hanging" editgroups (created but never edited)
- API: allow deletion of empty, un-accepted editgroups
- refactor API schema for some entity-generic methods (eg, history, edit
  operations) to take entity type as a URL path param. greatly reduce macro
  foolery and method count/complexity, and ease creation of new entities
    => /{entity}/edit/{edit_id}
    => /{entity}/{ident}/redirects
    => /{entity}/{ident}/history
- investigate data quality by looking at, eg, most popular author strings, most
  popular titles, duplicated containers, etc

## Metadata Import

- web.archive.org response not SHA1 match? => need /