## In Progress

- in dev, make JSON API link to localhost:9810
- "as bibtext" webface URL
- example entities
    => work with multiple releases
    => dataset/fileset
    => webcapture
    => dweb URLs
- basic web editing/creation of containers and papers
- commenting and accepting editgroups via web interface
- example editgroup review bot (can be trivial)

## Next Up

- import from arabesque output (eg, specific crawls)
- missing SQL indices: `ENTITY_edit.editgroup_id, ENTITY_edit.ident_id`
- test logins, and add loginpass support for: orcid, wikimedia

## Bugs

- did, somehow, end up with web.archive.org/web/None/ URLs (should remove)
- searching 'N/A' is a bug, because not quoted; auto-quote it?
- author (contrib) names not getting included in search (unless explicit)
- fatcat flask lookup ValueError should return 4xx (and message?)
    => if blank: UnboundLocalError: local variable 'extid' referenced before assignment

## Next Schema Iteration

Changes to SQL (and swagger):

- structured names in contribs (given/sur)
- `release_status` => `release_stage`
- `withdrawn_date` and retraction as a release stage
- subtitle as a string field? what about translation (`original_subtitle`)?

Changes to swagger only:

- edit URLs: editgroup_id in URL, not a query param

## Next Full Release "Touch"

Will update all release entities (or at least all Crossref-derived entities).
Want to minimize edit counts, so will bundle a bunch of changes

- structured contrib names (given, sur)
- reference linking (release-to-release), via crossref DOI refs
- subtitle as string, not array

## Production Public Launch Blockers

- `withdrawn_date`
    => either SQL schema addition, or pull from extra
    => but what if date isn't known?
- update /about page
- login/signup iteration (orcid, etc)
- audit fatcat metadata for CC-0
- handle 'wip' status entities in web UI
- guide updates for auth
- privacy policy, and link from: create account, create edit

## Production Tech Sanity

- postgresql replication
- haproxy somewhere/how
- logging iteration: larger journald buffers? point somewhere?

## Unsorted

- API: ability to expand containers (and files, etc?) in releases-for-work
- API: /releases endpoint (and/or expansion) for releases-for-file (etc)
- cleanup ./notes/ directory
- links say "Download ..." but open in same page, not download
- workers (like entity updater) should use env vars more
- ansible: ISSN-L download/symlink
- page-one.live.cf.public.springer.com seems to serve up bogus one-pagers; should exclude
- QA sentry has very little host info; also not URL of request
- elastic schemas:
    release: drop revision?; container_id; creator_id
    should `release_year` be of date type, instead of int?
    files: domain list; mimetype; release count; url count; web/publisher/etc; size; has_md5/sha256/sha1; in_ia, in_shadow
- should elastic `release_year` be of date type, instead of int?
- webface: still need to collapse links by domain better, and also vs. www.x/x
- entity edit JSON objects could include `entity_type`
- refactor 'fatcatd' to 'fatcat-api'
- changelog elastic stuff (is there even a fatcat-export for this?)
- container count "enrich"
- 'hide' flag for exporter (eg, to skip abstracts and refs in some release dumps)
- https://tech.labs.oliverwyman.com/blog/2019/01/14/serialising-rust-tests/
- changelog elastic index (for stats)
- API: allow deletion of empty, un-accepted editgroups

## Ideas

- `poster` as a `release_type`
- "revert editgroup" mechanism (creates new editgroup)
- can guess some `release_status` of files by looking at wayback date vs. published date
- ORCID apparently has 37 mil "work activities" (patents, etc), and only 14 mil unique DOIs; could import those other "work activities"?
  do they have identifiers?
- use https://github.com/codelucas/newspaper to extract fulltext+metadata from HTML crawls
- `fatcat-auth` tool should support more caveats, both when generating new and when mutating existing tokens
- fast path to skip recursive redirect checks for bulk inserts
- when getting "wip" entities, require a parameter ("allow_wip"), else get a 404
- maybe better 'success' return message? eg, "success: true" flag
- idea: allow users to generate their own editgroup UUIDs, to reduce round trips and "hanging" editgroups (created but never edited)
- refactor API schema for some entity-generic methods (eg, history, edit operations) to take entity type as a URL path param. greatly reduce macro foolery and method count/complexity, and ease creation of new entities
    => /{entity}/edit/{edit_id}
    => /{entity}/{ident}/redirects
    => /{entity}/{ident}/history
- investigate data quality by looking at, eg, most popular author strings, most popular titles, duplicated containers, etc

## Metadata Import

- 158 "NULL" publishers in journal metadata
- crossref: many ISBNs not getting copied; use python library to convert?
- remove 'first' from contrib crossref 'seq' (not helpful?)
- should probably check for 'jats:' in abstract before setting mimetype, even from crossref
- web.archive.org response not SHA1 match?
    => need /