From 10af3b5ab23b8df76b08ef1173f7547db3df4125 Mon Sep 17 00:00:00 2001
From: Bryan Newbold
Date: Wed, 6 Mar 2019 11:15:01 -0800
Subject: TODO updates

---
 TODO.md | 64 ++++++++++++++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 48 insertions(+), 16 deletions(-)

diff --git a/TODO.md b/TODO.md
index ab6b5bb7..40f7dcf7 100644
--- a/TODO.md
+++ b/TODO.md
@@ -1,18 +1,34 @@
 ## In Progress
+- update TODO, roadmap
+
 ## Next Up
-- formalize release_status:
-    => https://wiki.surfnet.nl/display/DRIVERguidelines/DRIVER-VERSION+Mappings
-- page-one.live.cf.public.springer.com seems to serve up bogus one-pagers; should exclude
-- QA sentry has very little host info; also not URL of request
-- should elastic release_year be of date type, instead of int?
-- subtitle as array vs. string
+- import from arabesque output (eg, specific crawls)
+- more logins: orcid, wikimedia
+- missing SQL indices: `ENTITY_edit.editgroup_id, ENTITY_edit.ident_id`
+- environment (QA/dev) in webface ``, header, footer
+    => in dev, make JSON API link to localhost:9810
+- test logins, and add loginpass support for: orcid, wikimedia
+
+## Bugs
+
+- did, somehow, end up with web.archive.org/web/None/ URLs (should remove)
+- searching 'N/A' is a bug, because not quoted; auto-quote it?
+- author (contrib) names not getting included in search (unless explicit)
+- fatcat flask lookup ValueError should return 4xx (and message?)
+    => if blank: UnboundLocalError: local variable 'extid' referenced before assignment
+- subtitles: glom on to full title? search index? cleanup existing imported instances
 ## Production Public Launch Blockers
+- `withdrawn_date`
+    => either SQL schema addition, or pull from extra
+    => but what if date isn't known?
 - update /about page
+- login/signup iteration (orcid, etc)
+- audit fatcat metadata for CC-0
 - handle 'wip' status entities in web UI
 - guide updates for auth
 - privacy policy, and link from: create account, create edit
@@ -23,38 +39,55 @@
 - haproxy somewhere/how
 - logging iteration: larger journald buffers? point somewhere?
-## Ideas
+## Unsorted
+- API: ability to expand containers (and files, etc?) in releases-for-work
+- API: /releases endpoint (and/or expansion) for releases-for-file (etc)
+- cleanup ./notes/ directory
+- links say "Download ..." but open in same page, not download
+- workers (like entity updater) should use env vars more
 - ansible: ISSN-L download/symlink
+- page-one.live.cf.public.springer.com seems to serve up bogus one-pagers; should exclude
+- QA sentry has very little host info; also not URL of request
+- elastic schemas:
+    release: drop revision?; container_id; creator_id
+    files: domain list; mimetype; release count; url count; web/publisher/etc;
+      size; has_md5/sha256/sha1; in_ia, in_shadow
+- should elastic `release_year` be of date type, instead of int?
 - webface: still need to collapse links by domain better, and also vs. www.x/x
 - entity edit JSON objects could include `entity_type`
 - refactor 'fatcatd' to 'fatcat-api'
 - changelog elastic stuff (is there even a fatcat-export for this?)
 - container count "enrich"
+- 'hide' flag for exporter (eg, to skip abstracts and refs in some release dumps)
+- https://tech.labs.oliverwyman.com/blog/2019/01/14/serialising-rust-tests/
+- changelog elastic index (for stats)
+- API: allow deletion of empty, un-accepted editgroups
+
+## Ideas
+
+- `poster` as a `release_type`
+- "revert editgroup" mechanism (creates new editgroup)
+- can guess some `release_status` of files by looking at wayback date vs.
+  published date
 - ORCID apparently has 37 mil "work activities" (patents, etc), and only 14 mil unique DOIs; could import those other "work activities"? do they have identifiers?
-- 'hide' flag for exporter (eg, to skip abstracts and refs in some release dumps)
-- https://tech.labs.oliverwyman.com/blog/2019/01/14/serialising-rust-tests/
 - use https://github.com/codelucas/newspaper to extract fulltext+metadata from HTML crawls
-- changelog elastic index (for stats)
-- import from arabesque output (eg, specific crawls)
-- more logins: orcid, wikimedia
 - `fatcat-auth` tool should support more caveats, both when generating new or mutating existing tokens
 - fast path to skip recursive redirect checks for bulk inserts
 - when getting "wip" entities, require a parameter ("allow_wip"), else get a 404
-- consider dropping CORE identifier
 - maybe better 'success' return message? eg, "success: true" flag
 - idea: allow users to generate their own editgroup UUIDs, to reduce a round trips and "hanging" editgroups (created but never edited)
-- API: allow deletion of empty, un-accepted editgroups
 - refactor API schema for some entity-generic methos (eg, history, edit operations) to take entity type as a URL path param. greatly reduce macro foolery and method count/complexity, and ease creation of new entities
     => /{entity}/edit/{edit_id}
     => /{entity}/{ident}/redirects
     => /{entity}/{ident}/history
-- investigate data quality by looking at, eg, most popular author strings, most popular titles, duplicated containers, etc
+- investigate data quality by looking at, eg, most popular author strings, most
+  popular titles, duplicated containers, etc
 ## Metadata Import
@@ -135,7 +168,6 @@ new importers:
 - file entity full update with all hashes, file size, corrected/expanded wayback links
     => some number of files *did* get inserted to fatcat with short (year) datetimes, from old manifest. also no file size.
-- searching 'N/A' is a bug, because not quoted; auto-quote it?
 - regression test imports for missing orcid display and journal metadata name
 - try out beautifulsoup?
   (https://stackoverflow.com/a/34532382/4682349)
 - `doi` field for containers (at least for "journal" type; maybe for "series" as well?)
--
cgit v1.2.3