| author | Bryan Newbold <bnewbold@robocracy.org> | 2019-02-05 17:01:15 -0800 | 
|---|---|---|
| committer | Bryan Newbold <bnewbold@robocracy.org> | 2019-02-05 17:01:15 -0800 | 
| commit | ea4102024def3a535790f5e2570d0692f7a9e41d | |
| tree | 620d405a930c2a57efb1ed4b285f1c73b9f3af6c | |
| parent | 1a7ef0c7cb8e1b84e24cd75b910e62e613fdc726 | |
update TODO
| -rw-r--r-- | TODO.md | 118 | 
1 files changed, 35 insertions, 83 deletions
@@ -1,102 +1,48 @@
 ## In Progress
-## Prod Metadata Checks
-
-x edit and editgroup metadata
-x crossref citation not saving 'article-title' or 'unstructured', and 'author'
-  should be 'authors' (list)
-x crossref not saving 'language' (looks like iso code already)
-- longtail_oa flag getting set on GROBID imports
-- grobid reference should be under extra (not nested): issue, volume, authors
-- uniqueness of:
-    sha1 - via SQL dump
-    doi - via SQL dump
-    issnl - via JSON dump
-    orcid - via JSON dump
-
-notes:
-- crossref references look great!
-- extra/crossref/alternative-id often includes exact full DOI
-        10.1158/1538-7445.AM10-3529
-        10.1158/1538-7445.am10-3529
-    => but not always? publisher-specific
-- contribs[]/extra/seq often has "first" from crossref
-    => is this helpful?
-- abstracts content is fine, but should probably check for "jats:" when setting
-  mimetype
-x BUG: `license_slug` when https://creativecommons.org/licenses/by-nc-sa/4.0
-    => https://api.qa.fatcat.wiki/v0/release/55y37c3dtfcw3nw5owugwwhave
-       10.26891/jik.v10i2.2016.92-97
-- original title works, yay!
-    https://api.qa.fatcat.wiki/v0/release/nlmnplhrgbdalcy472hfb2z3im
-    10.2504/kds.26.358
-- new license: https://www.karger.com/Services/SiteLicenses
-- not copying ISBNs: 10.1016/b978-0-08-037302-7.50022-7
-    "9780080373027"
-    could at least put in alternative-id?
-- BUG: subtitle coming through as an array, not string
-- `license_slug` does get set
-    eg for PLOS ONE http://creativecommons.org/licenses/by/4.0/
-
 ## Next Up
-- bootstrap_bots script should set -ex and output admin and webface tokens
-- regression test imports for missing orcid display and journal metadata name
-- serveral tweaks/fixes to webface (eg, container metadata schema changed)
-- container count "enrich"
-- changelog elastic stuff (is there even a fatcat-export for this?)
-- QA sentry has very little host info; also not URL of request
-- start prod crossref harvesting (from ~start of 2019)
-- 158 "NULL" publishers in journal metadata
-- should elastic release_year be of date type, instead of int?
-- QA/prod needs updated credentials
-- ansible: ISSN-L download/symlink
-- searching 'N/A' is a bug
 - formalize release_status:
     => https://wiki.surfnet.nl/display/DRIVERguidelines/DRIVER-VERSION+Mappings
-- entity edit JSON objects could include `entity_type`
+- page-one.live.cf.public.springer.com seems to serve up bogus one-pagers; should exclude
+- QA sentry has very little host info; also not URL of request
+- should elastic release_year be of date type, instead of int?
+- subtitle as array vs. string
-## Production public launch blockers
+## Production Public Launch Blockers
+- update /about page
 - handle 'wip' status entities in web UI
 - guide updates for auth
 - privacy policy, and link from: create account, create edit
-- refactors and correctness in rust/TODO
-- update /about page
 ## Production Tech Sanity
 - postgresql replication
-- pg_dump/load test
 - haproxy somewhere/how
 - logging iteration: larger journald buffers? point somewhere?
 ## Ideas
+- ansible: ISSN-L download/symlink
+- webface: still need to collapse links by domain better, and also vs. www.x/x
+- entity edit JSON objects could include `entity_type`
+- refactor 'fatcatd' to 'fatcat-api'
+- changelog elastic stuff (is there even a fatcat-export for this?)
+- container count "enrich"
 - ORCID apparently has 37 mil "work activities" (patents, etc), and only 14 mil
   unique DOIs; could import those other "work activities"? do they have
   identifiers?
-- write up notes on biblio metadata in general
-    => "extensibility" and extra keys
-    => proliferation of arrays vs. concrete values
-    => various ways to record history/progeny
-    => "subtitle", "short-title", "full-title" complexity
-    => human names
-    => translated metadata: titles/names/abstracts
-    => "typing" for metadata (eg, math in titles)
 - 'hide' flag for exporter (eg, to skip abstracts and refs in some release dumps)
 - https://tech.labs.oliverwyman.com/blog/2019/01/14/serialising-rust-tests/
-- use https://github.com/codelucas/newspaper to extract fulltext+metadata from
-  HTML crawls
+- use https://github.com/codelucas/newspaper to extract fulltext+metadata from HTML crawls
 - changelog elastic index (for stats)
 - import from arabesque output (eg, specific crawls)
 - more logins: orcid, wikimedia
-- `fatcat-auth` tool should support more caveats, both when generating new or
-  mutating existing tokens
+- `fatcat-auth` tool should support more caveats, both when generating new or mutating existing tokens
 - fast path to skip recursive redirect checks for bulk inserts
-- when getting "wip" entities, require a parameter ("allow_wip"), else get a
-  404
+- when getting "wip" entities, require a parameter ("allow_wip"), else get a 404
 - consider dropping CORE identifier
 - maybe better 'success' return message? eg, "success: true" flag
 - idea: allow users to generate their own editgroup UUIDs, to reduce a round
@@ -108,11 +54,14 @@ x BUG: `license_slug` when https://creativecommons.org/licenses/by-nc-sa/4.0
     => /{entity}/edit/{edit_id}
     => /{entity}/{ident}/redirects
     => /{entity}/{ident}/history
-- investigate data quality by looking at, eg, most popular author strings, most
-  popular titles, duplicated containers, etc
+- investigate data quality by looking at, eg, most popular author strings, most popular titles, duplicated containers, etc
 ## Metadata Import
+- 158 "NULL" publishers in journal metadata
+- crossref: many ISBNs not getting copied; use python library to convert?
+- remove 'first' from contrib crossref 'seq' (not helpful?)
+- should probably check for 'jats:' in abstract before setting mimetype, even from crossref
 - web.archive.org response not SHA1 match? => need /<dt>id_/ thing
 - XML etc in metadata
     => (python) tests for these!
@@ -127,11 +76,6 @@ x BUG: `license_slug` when https://creativecommons.org/licenses/by-nc-sa/4.0
 - better/complete reltypes probably good (eg, list of IRs, academic domain)
 - 'expand' in lookups (derp! for single hit lookups)
 - include crossref-capitalized DOI in extra
-- some "Elsevier " stuff as publisher
-    => also title https://fatcat.wiki/release/uyjzaq3xjnd6tcrqy3vcucczsi
-- crossref import: don't store citation unstructured if len() == 0:
-    {"crossref": {"unstructured": ""}}
-- try out beautifulsoup? (https://stackoverflow.com/a/34532382/4682349)
 - manifest: multiple URLs per SHA1
 - crossref: relations ("is-preprint-of")
 - crossref: two phase: no citations, then matched citations (via DOI table)
@@ -169,23 +113,32 @@ new importers:
 ## Schema / Entity Fields
 - elastic transform should only include authors, not editors (?)
-- `doi` field for containers (at least for "journal" type; maybe for "series"
-  as well?)
-- `retracted`, `translation`, and perhaps `corrected` as flags on releases,
-  instead of release_status?
+- `retracted`, `translation`, and perhaps `corrected` as flags on releases, instead of release_status?
     => see notes file on retractions, etc
-- 'part-of' relation for releases (release to release, eg for book chapters)
-  and possibly containers
+- 'part-of' relation for releases (release to release, eg for book chapters) and possibly containers
 - `container_type` for containers (journal, conference, book series, etc)
     => in schema, needs vocabulary and implementation
+## API Schema / Design
+
+- refactor entity mutation (CUD) endpoints to be like `/editgroup/{editgroup_id}/release/{ident}`
+    => changes editgroup_id from query param to URL param
+- refactor bulk POST to include editgroup plus array of entity objects (instead of just a couple fields as query params)
+
 ## Web Interface
 - include that ISO library to do lang/country name decodes
 - container-name when no `container_id`. eg: 10.1016/b978-0-08-037302-7.50022-7
+- fileset/webcapture webface anything
 ## Other / Backburner
+- file entity full update with all hashes, file size, corrected/expanded wayback links
+    => some number of files *did* get inserted to fatcat with short (year) datetimes, from old manifest. also no file size.
+- searching 'N/A' is a bug, because not quoted; auto-quote it?
+- regression test imports for missing orcid display and journal metadata name
+- try out beautifulsoup? (https://stackoverflow.com/a/34532382/4682349)
+- `doi` field for containers (at least for "journal" type; maybe for "series" as well?)
 - refactor webface views to use shared entity_view.html template
 - shadow library manifest importer
 - book identifiers: OCLC, openlibrary
@@ -194,7 +147,6 @@ new importers:
 - fake DOI (use in examples): 10.5555/12345678
 - refactor elasticsearch inserter to be a class (eg, for command line use)
 - document: elastic query date syntax is like: date:[2018-10-01 TO 2018-12-31]
-- fileset/webcapture webface anything
 - display abstracts better. no hashes or metadata; prefer plain or HTML,
   convert JATS if necessary
 - switch from slog to simple pretty_env_log
