update TODO

author: Bryan Newbold <bnewbold@robocracy.org> 2019-02-05 17:01:15 -0800
committer: Bryan Newbold <bnewbold@robocracy.org> 2019-02-05 17:01:15 -0800
commit: ea4102024def3a535790f5e2570d0692f7a9e41d (patch)
tree: 620d405a930c2a57efb1ed4b285f1c73b9f3af6c
parent: 1a7ef0c7cb8e1b84e24cd75b910e62e613fdc726 (diff)
download: fatcat-ea4102024def3a535790f5e2570d0692f7a9e41d.tar.gz fatcat-ea4102024def3a535790f5e2570d0692f7a9e41d.zip
1 files changed, 35 insertions, 83 deletions
diff --git a/TODO.md b/TODO.md
index da3cda6e..ab6b5bb7 100644
--- a/TODO.md
+++ b/TODO.md
@@ -1,102 +1,48 @@
 
 ## In Progress
 
-## Prod Metadata Checks
-
-x edit and editgroup metadata
-x crossref citation not saving 'article-title' or 'unstructured', and 'author'
-  should be 'authors' (list)
-x crossref not saving 'language' (looks like iso code already)
-- longtail_oa flag getting set on GROBID imports
-- grobid reference should be under extra (not nested): issue, volume, authors
-- uniqueness of:
-    sha1 - via SQL dump
-    doi - via SQL dump
-    issnl - via JSON dump
-    orcid - via JSON dump
-
-notes:
-- crossref references look great!
-- extra/crossref/alternative-id often includes exact full DOI
-        10.1158/1538-7445.AM10-3529
-        10.1158/1538-7445.am10-3529
-    => but not always? publisher-specific
-- contribs[]/extra/seq often has "first" from crossref
-    => is this helpful?
-- abstracts content is fine, but should probably check for "jats:" when setting
-  mimetype
-x BUG: `license_slug` when https://creativecommons.org/licenses/by-nc-sa/4.0
-    => https://api.qa.fatcat.wiki/v0/release/55y37c3dtfcw3nw5owugwwhave
-       10.26891/jik.v10i2.2016.92-97
-- original title works, yay!
-    https://api.qa.fatcat.wiki/v0/release/nlmnplhrgbdalcy472hfb2z3im
-    10.2504/kds.26.358
-- new license: https://www.karger.com/Services/SiteLicenses
-- not copying ISBNs: 10.1016/b978-0-08-037302-7.50022-7
-    "9780080373027"
-    could at least put in alternative-id?
-- BUG: subtitle coming through as an array, not string
-- `license_slug` does get set
-    eg for PLOS ONE http://creativecommons.org/licenses/by/4.0/
-
 ## Next Up
 
-- bootstrap_bots script should set -ex and output admin and webface tokens
-- regression test imports for missing orcid display and journal metadata name
-- serveral tweaks/fixes to webface (eg, container metadata schema changed)
-- container count "enrich"
-- changelog elastic stuff (is there even a fatcat-export for this?)
-- QA sentry has very little host info; also not URL of request
-- start prod crossref harvesting (from ~start of 2019)
-- 158 "NULL" publishers in journal metadata
-- should elastic release_year be of date type, instead of int?
-- QA/prod needs updated credentials
-- ansible: ISSN-L download/symlink
-- searching 'N/A' is a bug
 - formalize release_status:
     => https://wiki.surfnet.nl/display/DRIVERguidelines/DRIVER-VERSION+Mappings
-- entity edit JSON objects could include `entity_type`
+- page-one.live.cf.public.springer.com seems to serve up bogus one-pagers; should exclude
+- QA sentry has very little host info; also not URL of request
+- should elastic release_year be of date type, instead of int?
+- subtitle as array vs. string
 
-## Production public launch blockers
+## Production Public Launch Blockers
 
+- update /about page
 - handle 'wip' status entities in web UI
 - guide updates for auth
 - privacy policy, and link from: create account, create edit
-- refactors and correctness in rust/TODO
-- update /about page
 
 ## Production Tech Sanity
 
 - postgresql replication
-- pg_dump/load test
 - haproxy somewhere/how
 - logging iteration: larger journald buffers? point somewhere?
 
 ## Ideas
 
+- ansible: ISSN-L download/symlink
+- webface: still need to collapse links by domain better, and also vs. www.x/x
+- entity edit JSON objects could include `entity_type`
+- refactor 'fatcatd' to 'fatcat-api'
+- changelog elastic stuff (is there even a fatcat-export for this?)
+- container count "enrich"
 - ORCID apparently has 37 mil "work activities" (patents, etc), and only 14 mil
   unique DOIs; could import those other "work activities"? do they have
   identifiers?
-- write up notes on biblio metadata in general
-    => "extensibility" and extra keys
-    => proliferation of arrays vs. concrete values
-    => various ways to record history/progeny
-    => "subtitle", "short-title", "full-title" complexity
-    => human names
-    => translated metadata: titles/names/abstracts
-    => "typing" for metadata (eg, math in titles)
 - 'hide' flag for exporter (eg, to skip abstracts and refs in some release dumps)
 - https://tech.labs.oliverwyman.com/blog/2019/01/14/serialising-rust-tests/
-- use https://github.com/codelucas/newspaper to extract fulltext+metadata from
-  HTML crawls
+- use https://github.com/codelucas/newspaper to extract fulltext+metadata from HTML crawls
 - changelog elastic index (for stats)
 - import from arabesque output (eg, specific crawls)
 - more logins: orcid, wikimedia
-- `fatcat-auth` tool should support more caveats, both when generating new or
-  mutating existing tokens
+- `fatcat-auth` tool should support more caveats, both when generating new or mutating existing tokens
 - fast path to skip recursive redirect checks for bulk inserts
-- when getting "wip" entities, require a parameter ("allow_wip"), else get a
-  404
+- when getting "wip" entities, require a parameter ("allow_wip"), else get a 404
 - consider dropping CORE identifier
 - maybe better 'success' return message? eg, "success: true" flag
 - idea: allow users to generate their own editgroup UUIDs, to reduce a round
@@ -108,11 +54,14 @@ x BUG: `license_slug` when https://creativecommons.org/licenses/by-nc-sa/4.0
     => /{entity}/edit/{edit_id}
     => /{entity}/{ident}/redirects
     => /{entity}/{ident}/history
-- investigate data quality by looking at, eg, most popular author strings, most
-  popular titles, duplicated containers, etc
+- investigate data quality by looking at, eg, most popular author strings, most popular titles, duplicated containers, etc
 
 ## Metadata Import
 
+- 158 "NULL" publishers in journal metadata
+- crossref: many ISBNs not getting copied; use python library to convert?
+- remove 'first' from contrib crossref 'seq' (not helpful?)
+- should probably check for 'jats:' in abstract before setting mimetype, even from crossref
 - web.archive.org response not SHA1 match? => need /<dt>id_/ thing
 - XML etc in metadata
     => (python) tests for these!
@@ -127,11 +76,6 @@ x BUG: `license_slug` when https://creativecommons.org/licenses/by-nc-sa/4.0
 - better/complete reltypes probably good (eg, list of IRs, academic domain)
 - 'expand' in lookups (derp! for single hit lookups)
 - include crossref-capitalized DOI in extra
-- some "Elsevier " stuff as publisher
-    => also title https://fatcat.wiki/release/uyjzaq3xjnd6tcrqy3vcucczsi
-- crossref import: don't store citation unstructured if len() == 0:
-    {"crossref": {"unstructured": ""}}
-- try out beautifulsoup? (https://stackoverflow.com/a/34532382/4682349)
 - manifest: multiple URLs per SHA1
 - crossref: relations ("is-preprint-of")
 - crossref: two phase: no citations, then matched citations (via DOI table)
@@ -169,23 +113,32 @@ new importers:
 ## Schema / Entity Fields
 
 - elastic transform should only include authors, not editors (?)
-- `doi` field for containers (at least for "journal" type; maybe for "series"
-  as well?)
-- `retracted`, `translation`, and perhaps `corrected` as flags on releases,
-  instead of release_status?
+- `retracted`, `translation`, and perhaps `corrected` as flags on releases, instead of release_status?
     => see notes file on retractions, etc
-- 'part-of' relation for releases (release to release, eg for book chapters)
-  and possibly containers
+- 'part-of' relation for releases (release to release, eg for book chapters) and possibly containers
 - `container_type` for containers (journal, conference, book series, etc)
     => in schema, needs vocabulary and implementation
 
+## API Schema / Design
+
+- refactor entity mutation (CUD) endpoints to be like `/editgroup/{editgroup_id}/release/{ident}`
+    => changes editgroup_id from query param to URL param
+- refactor bulk POST to include editgroup plus array of entity objects (instead of just a couple fields as query params)
+
 ## Web Interface
 
 - include that ISO library to do lang/country name decodes
 - container-name when no `container_id`. eg: 10.1016/b978-0-08-037302-7.50022-7
+- fileset/webcapture webface anything
 
 ## Other / Backburner
 
+- file entity full update with all hashes, file size, corrected/expanded wayback links
+    => some number of files *did* get inserted to fatcat with short (year) datetimes, from old manifest. also no file size.
+- searching 'N/A' is a bug, because not quoted; auto-quote it?
+- regression test imports for missing orcid display and journal metadata name
+- try out beautifulsoup? (https://stackoverflow.com/a/34532382/4682349)
+- `doi` field for containers (at least for "journal" type; maybe for "series" as well?)
 - refactor webface views to use shared entity_view.html template
 - shadow library manifest importer
 - book identifiers: OCLC, openlibrary
@@ -194,7 +147,6 @@ new importers:
 - fake DOI (use in examples): 10.5555/12345678
 - refactor elasticsearch inserter to be a class (eg, for command line use)
 - document: elastic query date syntax is like: date:[2018-10-01 TO 2018-12-31]
-- fileset/webcapture webface anything
 - display abstracts better. no hashes or metadata; prefer plain or HTML,
   convert JATS if necessary
 - switch from slog to simple pretty_env_log
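
One of the Metadata Import items added in this commit suggests checking for 'jats:' in an abstract before setting its mimetype. A minimal sketch of such a check follows. The function name and both mimetype strings are assumptions made for illustration and are not taken from the fatcat importers.

```python
def guess_abstract_mimetype(abstract: str) -> str:
    """Guess a mimetype for an abstract string.

    Hypothetical helper: the name and the returned mimetype strings are
    assumptions for this sketch, not values confirmed by the source.
    """
    # JATS-tagged abstracts contain namespaced elements such as <jats:p>.
    if "<jats:" in abstract:
        return "application/xml+jats"
    # Otherwise treat the abstract as plain text.
    return "text/plain"
```

For example, `guess_abstract_mimetype("<jats:p>Some text</jats:p>")` would take the JATS branch, while an abstract with no `<jats:` tag falls through to plain text.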