summaryrefslogtreecommitdiffstats
path: root/TODO.md
diff options
context:
space:
mode:
Diffstat (limited to 'TODO.md')
-rw-r--r--TODO.md193
1 files changed, 193 insertions, 0 deletions
diff --git a/TODO.md b/TODO.md
new file mode 100644
index 00000000..a1f44f52
--- /dev/null
+++ b/TODO.md
@@ -0,0 +1,193 @@
+
+## In Progress
+
+## Prod Metadata Checks
+
+- edit and editgroup metadata
+- longtail_oa flag getting set on GROBID imports
+- crossref citation not saving 'article-title' or 'unstructured', and 'author'
+ should be 'authors' (list)
+- crossref not saving 'language' (looks like iso code already)
+- grobid reference should be under extra (not nested): issue, volume, authors
+- uniqueness of:
+ sha1 - via SQL dump
+ doi - via SQL dump
+ issnl - via JSON dump
+ orcid - via JSON dump
+
+## Next Up
+
+- bootstrap_bots script should set -ex and output admin and webface tokens
+- regression test imports for missing orcid display and journal metadata name
+- serveral tweaks/fixes to webface (eg, container metadata schema changed)
+- container count "enrich"
+- changelog elastic stuff (is there even a fatcat-export for this?)
+- QA sentry has very little host info; also not URL of request
+- start prod crossref harvesting (from ~start of 2019)
+- 158 "NULL" publishers in journal metadata
+- should elastic release_year be of date type, instead of int?
+- QA/prod needs updated credentials
+- ansible: ISSN-L download/symlink
+- searching 'N/A' is a bug
+- formalize release_status:
+ => https://wiki.surfnet.nl/display/DRIVERguidelines/DRIVER-VERSION+Mappings
+
+## Production public launch blockers
+
+- handle 'wip' status entities in web UI
+- guide updates for auth
+- webface 4xx and 5xx pages
+- privacy policy, and link from: create account, create edit
+- refactors and correctness in rust/TODO
+- update /about page
+
+## Production Tech Sanity
+
+- postgresql replication
+- pg_dump/load test
+- haproxy somewhere/how
+- logging iteration: larger journald buffers? point somewhere?
+
+## Ideas
+
+- write up notes on biblio metadata in general
+ => "extensibility" and extra keys
+ => proliferation of arrays vs. concrete values
+ => various ways to record history/progeny
+ => "subtitle", "short-title", "full-title" complexity
+ => human names
+ => translated metadata: titles/names/abstracts
+ => "typing" for metadata (eg, math in titles)
+- 'hide' flag for exporter (eg, to skip abstracts and refs in some release dumps)
+- https://tech.labs.oliverwyman.com/blog/2019/01/14/serialising-rust-tests/
+- use https://github.com/codelucas/newspaper to extract fulltext+metadata from
+ HTML crawls
+- changelog elastic index (for stats)
+- import from arabesque output (eg, specific crawls)
+- more logins: orcid, wikimedia
+- `fatcat-auth` tool should support more caveats, both when generating new or
+ mutating existing tokens
+- fast path to skip recursive redirect checks for bulk inserts
+- when getting "wip" entities, require a parameter ("allow_wip"), else get a
+ 404
+- consider dropping CORE identifier
+- maybe better 'success' return message? eg, "success: true" flag
+- idea: allow users to generate their own editgroup UUIDs, to reduce a round
+ trips and "hanging" editgroups (created but never edited)
+- API: allow deletion of empty, un-accepted editgroups
+- refactor API schema for some entity-generic methos (eg, history, edit
+ operations) to take entity type as a URL path param. greatly reduce macro
+ foolery and method count/complexity, and ease creation of new entities
+ => /{entity}/edit/{edit_id}
+ => /{entity}/{ident}/redirects
+ => /{entity}/{ident}/history
+- investigate data quality by looking at, eg, most popular author strings, most
+ popular titles, duplicated containers, etc
+
+## Metadata Import
+
+- web.archive.org response not SHA1 match? => need /<dt>id_/ thing
+- XML etc in metadata
+ => (python) tests for these!
+ https://qa.fatcat.wiki/release/search?q=xmlns
+ https://qa.fatcat.wiki/release/search?q=%24gt
+- bad/weird titles
+ "[Blank page]", "blank page"
+ "Temporary Empty DOI 0"
+ "ADVERTISEMENT"
+ "Full title page with Editorial board (with Elsevier tree)"
+ "Advisory Board Editorial Board"
+- better/complete reltypes probably good (eg, list of IRs, academic domain)
+- 'expand' in lookups (derp! for single hit lookups)
+- include crossref-capitalized DOI in extra
+- some "Elsevier " stuff as publisher
+ => also title https://fatcat.wiki/release/uyjzaq3xjnd6tcrqy3vcucczsi
+- crossref import: don't store citation unstructured if len() == 0:
+ {"crossref": {"unstructured": ""}}
+- try out beautifulsoup? (https://stackoverflow.com/a/34532382/4682349)
+- manifest: multiple URLs per SHA1
+- crossref: relations ("is-preprint-of")
+- crossref: two phase: no citations, then matched citations (via DOI table)
+- special "alias" DOIs... in crossref metadata?
+
+new importers:
+- pubmed (medline) (filtered)
+ => and/or, use pubmed ID lookups on crossref import
+- arxiv.org
+- DOAJ
+- CORE (filtered)
+- semantic scholar (up to 39 million; includes author de-dupe)
+
+## Guide / Book / Style
+
+- release_type, release_status, url.rel schemas (enforced in API)
+- more+better terms+policies: https://tosdr.org/index.html
+
+## Fun Features
+
+- "save paper now"
+ => is it in GWB? if not, SPN
+ => get hash + url from GWB, verify mimetype acceptable
+ => is file in fatcat?
+ => what about HBase? GROBID?
+ => create edit, redirect user to editgroup submit page
+- python client tool and library in pypi
+ => or maybe rust?
+- bibtext (etc) export
+
+## Metadata Harvesting
+
+- datacite ingest seems to have failed... got a non-HTTP-200 status code, but also "got 50 (161950 of 21084)"
+
+## Schema / Entity Fields
+
+- elastic transform should only include authors, not editors (?)
+- `doi` field for containers (at least for "journal" type; maybe for "series"
+ as well?)
+- `retracted`, `translation`, and perhaps `corrected` as flags on releases,
+ instead of release_status?
+ => use extra flags and release_status for now
+- 'part-of' relation for releases (release to release) and possibly containers
+- `container_type` field for containers (journal, conference, book series, etc)
+
+## Other / Backburner
+
+- refactor webface views to use shared entity_view.html template
+- shadow library manifest importer
+- book identifiers: OCLC, openlibrary
+- ref from guide: https://creativecommons.org/2012/08/14/library-catalog-metadata-open-licensing-or-public-domain/
+- test redirect/delete elasticsearch change
+- fake DOI (use in examples): 10.5555/12345678
+- refactor elasticsearch inserter to be a class (eg, for command line use)
+- document: elastic query date syntax is like: date:[2018-10-01 TO 2018-12-31]
+- fileset/webcapture webface anything
+- display abstracts better. no hashes or metadata; prefer plain or HTML,
+ convert JATS if necessary
+- switch from slog to simple pretty_env_log
+- format returned datetimes with only second precision, not millisecond (RFC mode)
+ => burried in model serialization internals
+- refactor openapi schema to use shared response types
+- consider using "HTTP 202: Accepted" for entity-mutating calls
+- basic python hbase/elastic matcher
+ => takes sha1 keys
+ => checks fatcat API + hbase
+ => if not matched yet, tries elastic search
+ => simple ~exact match heuristic
+ => proof-of-concept, no tests
+- add_header Strict-Transport-Security "max-age=3600";
+ => 12 hours? 24?
+- haproxy for rate-limiting
+
+better API docs
+- readme.io has a free open source plan (or at least used to)
+- https://github.com/readmeio/api-explorer
+- https://github.com/lord/slate
+- https://sourcey.com/spectacle/
+- https://github.com/DapperDox/dapperdox
+
+CSL:
+- https://citationstyles.org/
+- https://github.com/citation-style-language/documentation/blob/master/primer.txt
+- https://citeproc-js.readthedocs.io/en/latest/csl-json/markup.html
+- https://github.com/citation-style-language/schema/blob/master/csl-types.rnc
+- perhaps a "create from CSL" endpoint?