aboutsummaryrefslogtreecommitdiffstats
path: root/notes
diff options
context:
space:
mode:
Diffstat (limited to 'notes')
-rw-r--r--notes/UNSORTED.txt40
-rw-r--r--notes/auth.md251
-rw-r--r--notes/auth_thoughts.txt61
-rw-r--r--notes/fatcat_idents.md25
-rw-r--r--notes/golang.txt45
-rw-r--r--notes/ideas/bot_tools.txt (renamed from notes/bot_tools.txt)0
-rw-r--r--notes/ideas/domains.txt (renamed from notes/domains.txt)0
-rw-r--r--notes/ideas/more_api_patterns.txt (renamed from notes/more_api_patterns.txt)0
-rw-r--r--notes/ideas/thoughts.txt (renamed from notes/thoughts.txt)0
-rw-r--r--notes/oauth_statements.md14
10 files changed, 330 insertions, 106 deletions
diff --git a/notes/UNSORTED.txt b/notes/UNSORTED.txt
new file mode 100644
index 00000000..3960f5eb
--- /dev/null
+++ b/notes/UNSORTED.txt
@@ -0,0 +1,40 @@
+
+Not allowed to PUT edits to the same entity in the same editgroup. If you want
+to update an edit, need to delete the old one first.
+
+The state depends only on the current entity state, not any redirect. This
+means that if the target of a redirect is delted, the redirecting entity is
+still "redirect", not "deleted".
+
+Redirects-to-redirects are not allowed; this is enforced when the editgroup is
+accepted, to prevent race conditions.
+
+Redirects to "work-in-progress" (WIP) rows are disallowed at update time (and
+not re-checked at accept time).
+
+"ident table" parameters are ignored for entity updates. This is so clients can
+simply re-use object instantiations.
+
+The "state" parameter of an entity body is used as a flag when deciding whether
+to do non-normal updates (eg, redirect or undelete, as opposed to inserting a
+new revision).
+
+In the API, if you, eg, expand=files on a redirected release, you will get
+files that point to the *target* release entity. If you use the /files endpoint
+(instead of expand), you will get the files pointing to the redirected entity
+(which probably need updating!). Also, if you expand=files on the target
+entity, you *won't* get the files pointing to the redirected release. A
+high-level merge process might make these changes at the same time? Or at least
+tag at edit review time. A sweeper task can look for and auto-correct such
+redirects after some delay period.
+
+=> it would not be too hard to update get_release_files to check for such
+ redirects; could be handled by request flag?
+
+`prev_rev` is naively set to the most-recent previous state. If the curent
+state was deleted or a redirect, it is set to null.
+
+This parameter is not checked/enforced at edit accept time (but could be, and
+maybe introduce `prev_redirect`, for race detection). Or, could have ident
+point to most-recent edit, and have edits point to prev, for firmer control.
+
diff --git a/notes/auth.md b/notes/auth.md
new file mode 100644
index 00000000..ea249cf7
--- /dev/null
+++ b/notes/auth.md
@@ -0,0 +1,251 @@
+
+This file summarizes the current fatcat authentication schema, which is based
+on 3rd party OAuth2/OIDC sign-in and macaroon tokens.
+
+## Overview
+
+The informal high-level requirements for the auth system were:
+
+- public read-only (HTTP GET) API and website require no login or
+ authentication
+- all changes to the catalog happen through the API and are associated with an
+ abstract editor (the entity behind an editor could be human, a bots, an
+ organization, change over time, etc). basic editor metadata (eg, identifier)
+ is public for all time.
+- editors can signup (create account) and login using the web interface
+- bots and scripts access the API directly; their actions are associated with
+ an editor (which could be a bot account)
+- authentication can be managed via the web interface (eg, creating any tokens
+ or bot accounts)
+- there is a mechanism to revoke API access and lock editor accounts (eg, to
+ block spam); this mechanism doesn't need to be a web interface, but shouldn't
+ be raw SQL commands
+- store an absolute minimum of PII (personally identifiable intformation) that
+ can't be "mixed in" with public database dumps, or would make the database a
+ security target. eg, if possible don't store emails or passwords
+- the web interface should, as much as possible, not be "special". Eg, should
+ work through the API and not have secret keys, if possible
+- be as simple an efficient as possible (eg, minimize per-request database
+ hits)
+
+The initial design that came out of these requirements is to use bearer tokens
+(in the form of macaroons) for all API authentication needs, and to have editor
+account creation and authentication offloaded to third parties via OAuth2
+(specifically OpenID Connect (OIDC) when available). By storing only OIDC
+identifiers in a single database table (linked but separate from the editor
+table), PII collection is minimized, and no code needs to be written to handle
+password recovery, email verification, etc. Tokens can be embedded in web
+interface session cookies and "passed through" in API calls that require
+authentication, so the web interface is effectively stateless (in that it does
+not hold any session or user information internally).
+
+Macaroons, like JSON Web Tokens (JWT) contain signed (verifiable) constraints,
+called caveats. Unlike JWT, these caveats can easily be "further constrained"
+by any party. There is additional support for signed third party caveats, but
+we don't use that feature currently. Caveats can be used to set an expiry time
+for each token, which is appropriate for cookies (requiring a fresh login). We
+also use creation timestamps and per-editor "authentication epoches" (publicly
+stored in the editor table, non-sensitive) to revoke API tokens per-editor (or
+globally, if necessary). Basically, only macaroons that were "minted" after the
+current `auth_epoch` for the editor are considered valid. If a token is lost,
+the `auth_epoch` is reset to the current time (after the compromised token was
+minted, or any subsequent tokens possibly created by an attacker), all existing
+tokens are considered invalid, and the editor must log back in (and generate
+new API tokens for any bots/scripts). In the event of a serious security
+compromise (like the secret signing key being compromised, or a bug in macaroon
+generation is found), all `auth_epoch` timestamps are updated at once (and a
+new key is used).
+
+The account login/signup flow for new editors is to visit the web interface and
+select an OAuth provider (from a fixed list) where they have an account. After
+they approve Fatcat as an application on the third party site, they bounce back
+to the web interface. If they had signed up previously they are signed in,
+otherwise a new editor account is automatically created. A username is
+generated based on the OAuth remote account name, but the editor can change
+this immediately. The web interface allows (or will, when implemented) creation
+of bot accounts (linked to a "wrangler" editor account), generation of tokens,
+etc.
+
+In theory, the API tokens, as macaroons, can be "attenuated" by the user with
+additional caveats before being used. Eg, the expiry could be throttled down to
+a minute or two, or constrained to edits of a specific editgroup, or to a
+specific API endpoint. A use-case for this would be pasting a token in a
+single-page app or untrusted script with minimal delgated authority. Not all of
+these caveat checks have been implemented in the server yet though.
+
+As an "escape hatch", there is a rust command (`fatcat-auth`) for debugging,
+creating new keys and tokens, revoking tokens (via `auth_epoch`), etc. There is
+also a web interface mechanism to "login via existing token". These mechanisms
+aren't intended for general use, but are helpful when developing (when login
+via OAuth may not be configured or accessible) and for admins/operators.
+
+## Current Limitations
+
+No mechanism for linking (or unlinking) multiple remote OAuth accounts into a
+single editor account. The database schema supports this, there just aren't API
+endpoints or a web interface.
+
+There is no obvious place to store persistent non-public user information:
+things like preferences, or current editgroup being operated on via the web
+interface. This info can go in session cookies, but is lost when user logs
+out/in or uses another device.
+
+## API Tokens (Macaroons)
+
+Macaroons contain "caveats" which constrain their scope. In the context of
+fatcat, macaroons should always be constrained to a single editor account (by
+`editor_id`) and a valid creation timestamp; this enables revocation.
+
+In general, want to keep caveats, identifier, and other macaroon contents as
+short as possible, because they can bloat up the token size.
+
+Use identifiers (unique names for looking up signing keys) that contain the
+date and (short) domain, like `20190110-qa`.
+
+Caveats:
+
+- general model is that macaroon is omnipotent and passes all verification,
+ unless caveats are added. eg, adding verification checks doesn't constrain
+ auth, only the caveats constrain auth; verification check *allow* additional
+ auth. each caveat only needs to be allowed by one verifiation.
+- can (and should?) add as many caveat checkers/constrants in code as possible
+
+## Web Signup/Login
+
+OpenID Connect (OIDC) is basically a convention for servers and clients to use
+OAuth2 for the specific purpose of just logging in or linking accounts, a la
+"Sign In With ...". OAuth is often used to provider interoperability between
+service (eg, a client app can take actions as the user, when granted
+permissions, on the authenticating platform); OIDC doesn't grant any such
+permissions, just refreshing logins at most.
+
+The web interface (webface) does all the OAuth/OIDC trickery, and extracts a
+simple platform identifier and user identifier if authentication was
+successful. It sends this in a fatcat API request to the `/auth/oidc` endpoint,
+using admin authentication (the web interface stores an internal token "for
+itself" for this one purpose). The API will return both an Editor object and a
+token for that editor in the response. If the user had signed in previously
+using the same provider/service/user pair as before, the Editor object is the
+user's login. If the pair is new, a new account is created automatically and
+returned; the HTTP status code indicates which happened. The editor username is
+automatically generated from the remote username and platform (user can change
+it if they want).
+
+The returned token and editor metadata are stored in session cookies. The flask
+framework has a secure cookie implementation that prevents users from making up
+cookies, but this isn't the real security mechanism; the real mechanism is that
+they can't generate valid macaroons because they are signed. Cookie *theft* is
+an issue, so aggressive cookie protections should be activated in the Flask
+configuration.
+
+The `auth_oidc` enforces uniqueness on accounts in a few ways:
+
+- lowercase UNIQ constaint on usernames (can't register upper- and lower-case
+ variants)
+- UNIQ {`editor_id`, `platform`}: can't login using multiple remote accounts
+ from the same platform
+- UNIQ {`platform`, `remote_host`, `remote_id`}: can't login to multiple local
+ accounts using the same remote account
+- all fields are NOT NULL
+
+### archive.org "XAuth" Login
+
+The internet archive has it's own bespoke internal API for authentication
+between services. Internal (non-public) documentation link:
+
+ https://git.archive.org/ia/petabox/blob/master/www/sf/services/xauthn/README.md
+
+Fatcat implements "passthrough" authentication to this endpoint by accepting
+email/password (in plaintext! red lights and sirens!) and passes them through,
+along with with special staff-level authentication keys, to authenticate and
+fetch user info. Fatcat then pretends this was a regular OAuth/OIDC
+interaction, substituting the archive.org user "itemname" as a persistent
+identifier, and the XAuth endpoint as the service key.
+
+## Role-Based Authentication (RBAC)
+
+Current acknowledge roles:
+
+- public (not authenticated)
+- bot
+- human
+- editor (bot or human)
+- admin
+- superuser
+
+Will probably rename these. Additionally, editor accounts have an `is_active`
+flag (used to lock disabled/deleted/abusive/compromised accounts); no roles
+beyond public are given for inactive accounts.
+
+## Developer Affordances
+
+A few core accounts are created automatically, with fixed `username`,
+`auth_epoch` and `editor_id`, to make testing and administration easier across
+database resets (aka, tokens keep working as long as the signing key stays the
+same).
+
+Tokens and other secrets can be store in environment variables, scripts, or
+`.env` files.
+
+## Future Work and Alternatives
+
+Want to support more OAuth/OIDC endpoints:
+
+- orcid.org: supports OIDC
+- wikipedia/wikimedia: OAuth; https://github.com/valhallasw/flask-mwoauth
+
+Additional macaroon caveats:
+
+- `endpoint` (API method; caveat can include a list)
+- `editgroup`
+- (etc)
+
+Looked at a few other options for managing use accounts:
+
+- portier, the successor to persona, which basically uses email for magic-link
+ login, unless the email provider supports OIDC or similar. There is a central
+ hosted version to use for bootstrap. Appealing/minimal, but feels somewhat
+ neglected.
+- use something like 'dex' as a proxy to multiple OIDC (and other) providers
+- deploy a huge all-in-one platform like keycloak for all auth anything ever.
+ sort of wish Internet Archive, or somebody (Wikimedia?) ran one of these as
+ public infrastructure.
+- having webface generate macaroons itself
+
+Will probably eventually need to support multiple logins per editor account.
+Shouldn't be too hard, but will require additional API endpoints (POST with
+`editor_id` included, DELETE to remove, etc).
+
+On mobile folks might not be signed in to as many accounts, or it might be
+annoying to enter long/secure passwords (eg, to login to github). Could get
+around this with "login via token via QR code" with long/unlimited expiry.
+Might make more sense to support google OIDC as my guess is that many (most?)
+people have a google account logged in on their phone.
+
+## Implementation Notes
+
+To start, using the `loginpass` python library to handle logins, which is built
+on `authlib`. May need to extend or just use `authlib` directly in the future.
+Supports many large commercial providers, including gitlab.com, github.com, and
+google.
+
+There are many other flask/oauth/OIDC libraries out there, but this one worked
+well with multiple popular providers, mostly by being flexible about actual
+OIDC support. For example, Github doesn't support OIDC (only OAuth2), and
+apparently Gitlab's is incomplete/broken.
+
+### Background Reading
+
+Other flask OIDC integrations:
+
+- https://flask-oidc.readthedocs.io/en/latest/
+- https://github.com/zamzterz/Flask-pyoidc
+
+Background reading on macaroons:
+
+- https://github.com/rescrv/libmacaroons
+- http://evancordell.com/2015/09/27/macaroons-101-contextual-confinement.html
+- https://blog.runscope.com/posts/understanding-oauth-2-and-openid-connect
+- https://latacora.micro.blog/2018/06/12/a-childs-garden.html
+- https://github.com/go-macaroon-bakery/macaroon-bakery (for the "bakery" API pattern)
+
diff --git a/notes/auth_thoughts.txt b/notes/auth_thoughts.txt
deleted file mode 100644
index ba19f4c2..00000000
--- a/notes/auth_thoughts.txt
+++ /dev/null
@@ -1,61 +0,0 @@
-
-For users: use openid connect (oauth2) to sign up and login to web app. From
-web app, can create (and disable?) API tokens
-
-For impl: fatcat-web has private key to create tokens. tokens used both in
-cookies and as API keys. tokens are macaroons (?). fatcatd only verifies
-tokens. optionally, some redis or other fast shared store to verify that tokens
-haven't been revoked.
-
-Could use portier with openid connect as an email-based option. Otherwise,
-orcid, github, google.
-
----------
-
-Use macaroons!
-
-editor/user table has a "auth_epoch" timestamp; only macaroons generated
-after this timestamp are valid. revocation is done by incrementing this
-timestamp ("touch").
-
-Rust CLI tool for managing users:
-- create editor
-
-Special users/editor that can create editor accounts via API; eg, one for
-fatcat-web.
-
-Associate one oauth2 id per domain per editor/user.
-
-Users come to fatcat-web and do oauth2 to login or create an account. All
-oauth2 internal to fatcat-web. If successful, fatcat-web does an
-(authenticated) lookup to API for that identifier. If found, requests a
-new macaroon to use as a cookie for auth. All future requests pass this
-cookie through as bearer auth. fatcat-web remains stateless! macaroon
-contains username (for display); no lookup-per page. Need to logout/login for
-this to update?
-
-Later, can do a "add additional account" feature.
-
-Backend:
-- oauth2 account table, foreign key to editor table
- => this is the only private table
-- auth_epoch timestamp column on editor table
-- lock editor by setting auth_epoch to deep future
-
-Deploy process:
-- auto-create root (admin), import-bootstrap (admin,bot), and demo-user
- editors, with fixed editor_id and "early" auth_epoch, as part of SQL. save
- tokens in env files, on laptop and QA instance.
-- on live QA instance, revoke all keys when live (?)
-
-TODO: privacy policy
-
-fatcat API doesn't *require* auth, but if auth is provided, it will check
-macaroon, and validate against editor table's timestamp.
-
-support oauth2 against:
-- orcid
-- git.archive.org
-- github
-? google
-
diff --git a/notes/fatcat_idents.md b/notes/fatcat_idents.md
new file mode 100644
index 00000000..84322604
--- /dev/null
+++ b/notes/fatcat_idents.md
@@ -0,0 +1,25 @@
+
+## Identifiers
+
+Fatcat entity identifiers are 128-bit UUIDs encoded in base32 format. Revision
+ids are also UUIDs, and encoded in normal UUID fashion, to disambiguate from
+edity identifiers.
+
+Python helpers for conversion:
+
+ import base64
+ import uuid
+
+ def fcid2uuid(s):
+ s = s.split('_')[-1].upper().encode('utf-8')
+ assert len(s) == 26
+ raw = base64.b32decode(s + b"======")
+ return str(uuid.UUID(bytes=raw)).lower()
+
+ def uuid2fcid(s):
+ raw = uuid.UUID(s).bytes
+ return base64.b32encode(raw)[:26].lower().decode('utf-8')
+
+ test_uuid = '00000000-0000-0000-3333-000000000001'
+ assert test_uuid == fcid2uuid(uuid2fcid(test_uuid))
+
diff --git a/notes/golang.txt b/notes/golang.txt
deleted file mode 100644
index 404741e8..00000000
--- a/notes/golang.txt
+++ /dev/null
@@ -1,45 +0,0 @@
-
-## Database Schema / ORM / Generation
-
-start simple, with pg (or sqlx if we wanted to be DB-agnostic):
-- pq: basic postgres driver and ORM (similar to sqlalchemy?)
-- sqlx: small extensions to builtin sql; row to struct mapping
-
-debug postgres with gocmdpev
-
-later, if code is too duplicated, look in to sqlboiler (first) or xo (second):
-- https://github.com/xo/xo
-- https://github.com/volatiletech/sqlboiler
-
-later, to do migrations, use goose, or consider alembic (python) for
-auto-generation
-- https://github.com/steinbacher/goose
-- possibly auto-generate with python alembic
-
-for identifiers, consider either built-in postgres UUID, or:
-- https://github.com/rs/xid
-- https://github.com/oklog/ulid
- like a UUID, but base32 and "sortable" (timestamp + random)
-
-## API In General
-
-Hope to use Kong for authentication.
-
-start with oauth2... orcid?
-
-## OpenAPI/Swagger
-
-go-swagger (OpenAPI 2.0):
-- generate initial API server skeleton from a yaml definition
-- export updated yaml from code after changes
-- web UI for documentation
-- templating/references
-- auto-generate client (in golang)
-
-also look at ReDoc as a UI; all in-brower generated from JSON (react)
-
-## Non-API stuff
-
-- logrus structured logging (or zap?)
-- testify tests (and assert?)
-- viper config
diff --git a/notes/bot_tools.txt b/notes/ideas/bot_tools.txt
index cf465bde..cf465bde 100644
--- a/notes/bot_tools.txt
+++ b/notes/ideas/bot_tools.txt
diff --git a/notes/domains.txt b/notes/ideas/domains.txt
index 8556494e..8556494e 100644
--- a/notes/domains.txt
+++ b/notes/ideas/domains.txt
diff --git a/notes/more_api_patterns.txt b/notes/ideas/more_api_patterns.txt
index ca61ac81..ca61ac81 100644
--- a/notes/more_api_patterns.txt
+++ b/notes/ideas/more_api_patterns.txt
diff --git a/notes/thoughts.txt b/notes/ideas/thoughts.txt
index c01c0d37..c01c0d37 100644
--- a/notes/thoughts.txt
+++ b/notes/ideas/thoughts.txt
diff --git a/notes/oauth_statements.md b/notes/oauth_statements.md
new file mode 100644
index 00000000..5f46c9ed
--- /dev/null
+++ b/notes/oauth_statements.md
@@ -0,0 +1,14 @@
+
+Copy text used when signing up for OAuth applications on various platforms.
+
+## Wikimedia
+
+Fatcat (https://fatcat.wiki) is a publicly-editable bibliographic catalog, containing metadata about tens of millions of research articles, conference proceedings, and books. The particular emphasis is on linking different "releases" of the same "work" (eg, preprint and final copies of a journal paper), and matching specific (archived) files to releases.
+Fatcat is a project of the Internet Archive (https://archive.org).
+
+## ORCID
+
+Fatcat is an open, editable database of bibliographic metadata. You can sign-up
+and login using orcid.org; this option is used for identity and authentication
+only. Fatcat does not currently make changes to any data on orcid.org, which
+you can verify from the permissions requested.