Diffstat (limited to 'guide')
 guide/TODO                |   7
 guide/src/SUMMARY.md      |  12
 guide/src/alignments.md   |  16
 guide/src/bulk_exports.md |  50
 guide/src/data_model.md   | 220
 guide/src/entity_types.md |   7
 guide/src/goals.md        |  38
 guide/src/guide.md        |  13
 guide/src/overview.md     |  23
 guide/src/roadmap.md      |  44
 guide/src/style_guide.md  |  26
 guide/src/welcome.md      |  37
 guide/src/workflow.md     |  13
 13 files changed, 345 insertions(+), 161 deletions(-)
diff --git a/guide/TODO b/guide/TODO
index e3f9f527..1c9b7110 100644
--- a/guide/TODO
+++ b/guide/TODO
@@ -1,9 +1,10 @@
-- break up RFC into sub sections
-- better landing page
- scope
+- quick passes: spellcheck, " I ", "would/will"
+
TODO
--
+- roadmap
+- revise 'implementation' page with details (hosting costs, etc)
DONE
- policies
diff --git a/guide/src/SUMMARY.md b/guide/src/SUMMARY.md
index 16f33ff1..9c7587a5 100644
--- a/guide/src/SUMMARY.md
+++ b/guide/src/SUMMARY.md
@@ -1,16 +1,22 @@
# Outline
+[Welcome!](./welcome.md)
+
- [Fatcat Overview](./overview.md)
- [Goals and Related Projects](./goals.md)
- [Data Model](./data_model.md)
- - [Workflow](./workflow.md)
- - [Sources](./sources.md)
- - [Implementation](./implementation.md)
+ - [Editing Workflow](./workflow.md)
+ - [Sources of Metadata](./sources.md)
+ - [Implementation and Infrastructure](./implementation.md)
- [Roadmap](./roadmap.md)
- [Cataloging Style Guide](./style_guide.md)
+ - [Entity Types](./entity_types.md)
+ - [Schema "Alignments"](./alignments.md)
- [Entity Field Reference](./entity_fields.md)
- [Public API](./http_api.md)
- [Bulk Exports](./bulk_exports.md)
- [Cookbook](./cookbook.md)
- [Software Contributions](./sw_contribute.md)
- [Policies](./policies.md)
+
+[About This Guide](./guide.md)
diff --git a/guide/src/alignments.md b/guide/src/alignments.md
new file mode 100644
index 00000000..291dd6e5
--- /dev/null
+++ b/guide/src/alignments.md
@@ -0,0 +1,16 @@
+# Schema "Alignments"
+
+A table (CSV) of "alignments" between fatcat entity types and fields and those
+of other file formats and standards is available under the `./notes/` directory
+of the source repo.
+
+TODO: in particular, highlight alignments with:
+
+- citation style language (CSL)
+- bibtex
+- crossref API schema
+- dublin core (schema.org, OAI-PMH)
+- BIBFRAME
+- resourceSync
+- google scholar
+- pubmed/medline
diff --git a/guide/src/bulk_exports.md b/guide/src/bulk_exports.md
index 0aac4475..21cb8226 100644
--- a/guide/src/bulk_exports.md
+++ b/guide/src/bulk_exports.md
@@ -1,8 +1,9 @@
# Bulk Exports
-There are a few different database dump formats folks might want:
+There are several types of bulk exports and database dumps folks might be
+interested in:
-- raw native database backups, for disaster recovery (would include
+- raw, native-format database backups: for disaster recovery (would include
volatile/unsupported schema details, user API credentials, full history,
in-process edits, comments, etc)
- a sanitized version of the above: roughly per-table dumps of the full state
@@ -21,3 +22,48 @@ There are a few different database dump formats folks might want:
just the Release table in a fully "hydrated" state to start. Unclear if
should be on a work or release basis; will go with release for now. Harder to
do using public interface because of the need for transaction locking.
+
+## Identifier Snapshots
+
+One form of bulk export is a fast, consistent (single database transaction)
+snapshot of all "live" entity identifiers and their current revisions. This
+snapshot can be used by non-blocking background scripts to generate full bulk
+exports that will be consistent.
+
+These exports are generated by the `./extra/sql_dumps/ident_table_snapshot.sh`
+script, run on a primary database machine, and result in a single tarball,
+which gets uploaded to archive.org. The format is TSV (tab-separated). Unlike
+all other dumps and public formats, the fatcat identifiers in these dumps are
+in raw UUID format (not base32-encoded).
+
+A variant of these dumps is to include external identifiers, resulting in files
+that map, eg, (release ID, DOI, PubMed identifiers, Wikidata QID).
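As a rough sketch, raw UUIDs from these snapshots can be converted to and from a base32 identifier form; the exact variant used here (un-padded, lower-cased RFC 4648 over the 16 UUID bytes) is an assumption for illustration, not a documented spec:

```python
import base64
import uuid

def uuid_to_ident(raw: str) -> str:
    """Convert a raw UUID string (as found in ident snapshots) to a
    26-character base32 form. The base32 variant is an assumption."""
    raw_bytes = uuid.UUID(raw).bytes  # 16 raw bytes
    # RFC 4648 base32, padding stripped, lower-cased (assumed convention)
    return base64.b32encode(raw_bytes).decode("ascii").rstrip("=").lower()

def ident_to_uuid(ident: str) -> str:
    """Inverse: recover the raw UUID string from the base32 form."""
    raw_bytes = base64.b32decode(ident.upper() + "======")
    return str(uuid.UUID(bytes=raw_bytes))
```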
+
+## Abstract Table Dumps
+
+The `./extra/sql_dumps/dump_abstracts.sql` file, when run from the primary
+database machine, outputs all raw abstract strings in JSON format,
+one-object-per-line.
+
+Abstracts are immutable and referenced by hash in the database, so the
+consistency of these dumps is not as much of a concern as with other exports.
+See the [Policies](./policies.md) page for more context around abstract exports.
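As a sketch of working with these dumps, abstracts can be indexed and sanity-checked by content hash; the field names `sha1` and `content` here are assumptions for illustration, not the documented dump schema:

```python
import hashlib
import json

def index_abstracts(lines):
    """Index one-object-per-line JSON abstract dumps by content hash.

    Field names ('sha1', 'content') are assumed, not the documented schema.
    """
    by_hash = {}
    for line in lines:
        obj = json.loads(line)
        # recompute the hash to sanity-check the record against its stated digest
        digest = hashlib.sha1(obj["content"].encode("utf-8")).hexdigest()
        assert digest == obj["sha1"]
        by_hash[digest] = obj["content"]
    return by_hash
```

Because abstracts are content-addressed, re-deriving the hash locally is a cheap integrity check on any dump file.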
+
+## "Expanded" Entity Dumps
+
+Using the above identifier snapshots, the `fatcat-export` script outputs
+single-entity-per-line JSON files with the same schema as the HTTP API. The
+most useful version of these for most users are the "expanded" (including
+container and file metadata) release exports.
+
+These exports are compressed and uploaded to archive.org.
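A minimal sketch of consuming such an export, assuming gzip compression and a `files` field on expanded releases (both assumptions about the dump layout, not documented guarantees):

```python
import gzip
import io
import json

def iter_releases(stream):
    """Iterate entities from an 'expanded' release export: assumed to be
    gzip-compressed, one JSON object per line, matching the HTTP API schema."""
    with gzip.open(stream, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

def count_with_files(stream):
    # the 'files' key on expanded releases is an assumption about the schema
    return sum(1 for rel in iter_releases(stream) if rel.get("files"))
```

Streaming line-by-line keeps memory flat even for dumps with tens of millions of entities.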
+
+## Changelog Entity Dumps
+
+A final export type is the changelog dump. These are currently implemented in
+python, and anybody can create them. They contain JSON,
+one-line-per-changelog-entry, with the full list of entity edits and editgroup
+metadata for the given changelog entry. Changelog history is immutable; this
+script works by iterating up the (monotonic) changelog counter until it
+encounters a 404.
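The iteration logic can be sketched as follows; `fetch_entry` stands in for a real HTTP client that returns `None` on a 404:

```python
def dump_changelog(fetch_entry, start=1):
    """Walk the monotonic changelog counter upward until a missing entry
    (HTTP 404) is hit. `fetch_entry` is a stand-in for the real HTTP
    client: it returns a parsed changelog entry, or None on 404."""
    index = start
    entries = []
    while True:
        entry = fetch_entry(index)
        if entry is None:  # 404: we are past the newest changelog entry
            break
        entries.append(entry)
        index += 1
    return entries
```

Because changelog history is immutable and the counter is monotonic, this walk is safe to resume from any previously-dumped index.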
+
diff --git a/guide/src/data_model.md b/guide/src/data_model.md
index b2a02688..f3b9b35a 100644
--- a/guide/src/data_model.md
+++ b/guide/src/data_model.md
@@ -1,12 +1,73 @@
# Data Model
-## Identifiers
-
-A fixed number of first-class "entities" are defined, with common behavior and
-schema layouts. These are all be semantic entities like "work", "release",
-"container", and "creator".
-
-fatcat identifiers are semantically meaningless fixed-length random numbers,
+## Entity Types and Ontology
+
+Loosely following "Functional Requirements for Bibliographic Records" (FRBR),
+but removing the "manifestation" abstraction, and favoring files (digital
+artifacts) over physical items, the primary bibliographic entity types are:
+
+- `work`: representing an abstract unit of creative output. Does not contain
+ any metadata itself; used only to group `release` entities. For example, a
+ journal article could be posted as a pre-print, published on a journal
+ website, translated into multiple languages, and then re-published (with
+ minimal changes) as a book chapter; these would all be variants of the same
+ `work`.
+- `release`: a specific "release" or "publicly published" (in a formal or
+ informal sense) version of a work. Contains traditional bibliographic
+  metadata (title, date of publication, media type, language, etc). Has
+ relationships to other entities:
+ - "variant of" a single `work`
+ - "contributed to by" multiple `creators`
+ - "references to" (cites) multiple `releases`
+ - "published as part of" a single `container`
+- `file`: a single concrete, fixed digital artifact; a manifestation of one or
+ more `releases`. Machine-verifiable metadata includes file hashes, size, and
+ detected file format. Verified URLs link to locations on the open web where
+ this file can be found or has been archived. Has relationships:
+ - "manifestation of" multiple `releases` (though usually a single release)
+- `creator`: persona (pseudonym, group, or specific human name) that
+ contributions to `releases` have been attributed to. Not necessarily
+ one-to-one with a human person.
+- `container` (aka "venue", "serial", "title"): a grouping of releases from a
+ single publisher.
+
+Note that, compared to many similar bibliographic ontologies, the current one
+does not have entities to represent:
+
+- funding sources
+- publishing entities
+- "events at a time and place"
+- physical artifacts, either generically or specific copies
+- sets of files (eg, a dataset or webpage with media)
+
+Each entity type has its own relations and fields (captured in a schema), but
+there are also generic operations and fields common across all entities.
+The process of creating, updating, querying, and inspecting entities is roughly
+the same regardless of type.
+
+## Identifiers and Revisions
+
+A specific version of any entity in the catalog is called a "revision".
+Revisions are generally immutable (do not change and are not editable), and are
+not usually referred to directly by users. Instead, persistent identifiers can
+be created, which "point to" a specific revision at a time. This distinction
+means that entities referred to by an identifier can change over time (as
+metadata is corrected and expanded). Revision objects do not "point" back to
+specific identifiers, so they are not the same as a simple "version number" for
+an identifier.
+
+Identifiers also have the ability to be merged (by redirecting one identifier
+to another) and "deleted" (by pointing the identifier to no revision at all).
+All changes to identifiers are captured as an "edit" object. Edit history can
+be fetched and inspected on a per-identifier basis, and any changes can easily
+be reverted (even merges/redirects and "deletion").
+
+"Staged" or "proposed" changes are captured as edit objects without updating
+the identifiers themselves.
+
+### Fatcat Identifiers
+
+Fatcat identifiers are semantically meaningless fixed-length random numbers,
usually represented in case-insensitive base32 format. Each entity type has its
own identifier namespace.
@@ -28,16 +89,18 @@ database Integer columns:
work_rzga5b9cd7efg
https://fatcat.wiki/work/rzga5b9cd7efg
-The idea would be to only have fatcat identifiers be used to interlink between
-databases, *not* to supplant DOIs, ISBNs, handle, ARKs, and other "registered"
+Fatcat identifiers can be used to interlink between databases, but are
+explicitly *not* intended to supplant DOIs, ISBNs, handles, ARKs, and other
+"registered"
persistent identifiers.
-## Entities and Internal Schema
+### Entity States
+
+### Internal Schema
-Internally, identifiers would be lightweight pointers to "revisions" of an
-entity. Revisions are stored in their complete form, not as a patch or
-difference; if comparing to distributed version control systems, this is the
-git model, not the mercurial model.
+Internally, identifiers are lightweight pointers to "revisions" of an entity.
+Revisions are stored in their complete form, not as a patch or difference; if
+comparing to distributed version control systems (for managing changes to
+source code), this follows the git model, not the mercurial model.
The entity revisions are immutable once accepted; the editing process involves
the creation of new entity revisions and, if the edit is approved, pointing the
@@ -48,122 +111,87 @@ identifier to the new revision. Entities cross-reference between themselves by
Edit objects represent a change to a single entity; edits get batched together
into edit groups (like "commits" and "pull requests" in git parlance).
-SQL tables would probably look something like the (but specific to each entity
-type, with tables like `work_revision` not `entity_revision`):
+SQL tables look something like this (with separate tables for each entity type,
+a la `work_revision` and `work_edit`):
entity_ident
id (uuid)
current_revision (entity_revision foreign key)
redirect_id (optional; points to another entity_ident)
+ is_live (boolean; whether newly created entity has been accepted)
entity_revision
revision_id
- <entity-specific fields>
+        <all entity-type-specific fields>
extra: json blob for schema evolution
entity_edit
timestamp
- editgroup_id
+ editgroup_id (editgroup foreign key)
ident (entity_ident foreign key)
new_revision (entity_revision foreign key)
+ new_redirect (optional; points to entity_ident table)
previous_revision (optional; points to entity_revision)
extra: json blob for progeny metadata
editgroup
- editor_id
+ editor_id (editor table foreign key)
description
extra: json blob for progeny metadata
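The ident-to-revision pointer mechanics above can be sketched in SQLite; the columns and types here are heavily simplified stand-ins (the real schema is per-entity-type and includes the edit/editgroup machinery):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entity_revision (
    revision_id INTEGER PRIMARY KEY,
    title TEXT,               -- stand-in for entity-specific fields
    extra TEXT                -- json blob for schema evolution
);
CREATE TABLE entity_ident (
    id TEXT PRIMARY KEY,      -- uuid in the real schema
    current_revision INTEGER REFERENCES entity_revision(revision_id),
    redirect_id TEXT REFERENCES entity_ident(id),
    is_live INTEGER NOT NULL DEFAULT 0
);
""")

# an edit creates a *new* complete revision and re-points the identifier;
# the old revision row is kept, preserving full history
conn.execute("INSERT INTO entity_revision (revision_id, title) VALUES (1, 'old title')")
conn.execute("INSERT INTO entity_ident VALUES ('ident-a', 1, NULL, 1)")
conn.execute("INSERT INTO entity_revision (revision_id, title) VALUES (2, 'fixed title')")
conn.execute("UPDATE entity_ident SET current_revision = 2 WHERE id = 'ident-a'")

# a merge redirects one identifier to another (no revision of its own needed)
conn.execute("INSERT INTO entity_ident VALUES ('ident-b', NULL, 'ident-a', 1)")
```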
-Additional entity-specific columns would hold actual metadata. Additional
-tables (which would reference both `entity_revision` and `entity_id` foreign
-keys as appropriate) would represent things like authorship relationships
+An individual entity can be in the following "states", from which the given
+actions (transitions) can be made:
+
+- `wip` (not live; not redirect; has rev)
+ - activate (to `active`)
+- `active` (live; not redirect; has rev)
+ - redirect (to `redirect`)
+ - delete (to `deleted`)
+- `redirect` (live; redirect; rev or not)
+ - split (to `active`)
+  - delete (to `deleted`)
+- `deleted` (live; not redirect; no rev)
+ - redirect (to `redirect`)
+ - activate (to `active`)
+
+"WIP, redirect" or "WIP, deleted" are invalid states.
+
+Additional entity-specific columns hold actual metadata. Additional
+tables (which reference both `entity_revision` and `entity_id` foreign
+keys as appropriate) represent things like authorship relationships
(creator/release), citations between works, etc. Every revision of an entity
-would require duplicating all of these associated rows, which could end up
+requires duplicating all of these associated rows, which could end up
being a large source of inefficiency, but is necessary to represent the full
history of an object.
-## Ontology
-
-Loosely following FRBR (Functional Requirements for Bibliographic Records), but
-removing the "manifestation" abstraction, and favoring files (digital
-artifacts) over physical items, the primary entities are:
-
- work
- <a stub, for grouping releases>
-
- release (aka "edition", "variant")
- title
- volume/pages/issue/chapter
- media/formfactor
- publication/peer-review status
- language
- <published> date
- <variant-of> work
- <published-in> container
- <has-contributors> creator
- <citation-to> release
- <has> identifier
-
- file (aka "digital artifact")
- <instantiates> release
- hashes/checksums
- mimetype
- <found-at> URLs
-
- creator (aka "author")
- name
- identifiers
- aliases
-
- container (aka "venue", "serial", "title")
- name
- open-access policy
- peer-review policy
- <has> aliases, acronyms
- <about> subject/category
- <has> identifier
- <published-in> container
- <published-by> publisher
-
-## Controlled Vocabularies
-
-Some special namespace tables and enums would probably be helpful; these could
-live in the database (not requiring a database migration to update), but should
-have more controlled editing workflow... perhaps versioned in the codebase:
+## Controlled Vocabularies
+
+Some individual fields have additional constraints, either in the form of
+pattern validation ("values must be upper case, contain only certain
+characters"), or membership in a fixed set of values. These may include:
-- identifier namespaces (DOI, ISBN, ISSN, ORCID, etc; but not the identifers
- themselves)
- subject categorization
- license and open access status
- work "types" (article vs. book chapter vs. proceeding, etc)
- contributor types (author, translator, illustrator, etc)
- human languages
-- file mimetypes
-
-These could also be enforced by QA bots that review all editgroups.
-
-## Entity States
- identifier namespaces (DOI, ISBN, ISSN, ORCID, etc; but not the identifiers
+ themselves)
- wip (not live; not redirect; has rev)
- activate
- active (live; not redirect; has rev)
- redirect
- delete
- redirect (live; redirect; rev or not)
- split
- delete
- deleted (live; not redirect; no rev)
- redirect
- activate
+Other fixed-set "vocabularies" become too large to easily maintain or express
+in code. These could be added to the backend databases, or be enforced by bots
+(instead of the core system itself). These mostly include externally-registered identifiers or types, such as:
- "wip redirect" or "wip deleted" are invalid states
+- file mimetypes
+- identifiers themselves (DOI, ORCID, etc), by checking for registration
+ against canonical APIs and databases
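As a sketch of the two kinds of constraint, here is a pattern check and a fixed-set check; the specific set below is an illustrative assumption, though the ISSN pattern (four digits, a dash, three digits, and a final digit or check character "X") is the real format:

```python
import re

# pattern-validation example: the standard ISSN format
ISSN_PATTERN = re.compile(r"^\d{4}-\d{3}[\dX]$")

# fixed-set example: an illustrative (not authoritative) contributor-type vocabulary
CONTRIB_TYPES = {"author", "translator", "illustrator", "editor"}

def check_issn(value: str) -> bool:
    """Validate the *shape* of an ISSN; registration against the canonical
    database would be a separate (bot-enforced) check."""
    return bool(ISSN_PATTERN.match(value))
```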
## Global Edit Changelog
-As part of the process of "accepting" an edit group, a row would be written to
-an immutable, append-only log table (which internally could be a SQL table)
-documenting each identifier change. This changelog establishes a monotonically
-increasing version number for the entire corpus, and should make interaction
-with other systems easier (eg, search engines, replicated databases,
-alternative storage backends, notification frameworks, etc.).
+As part of the process of "accepting" an edit group, a row is written to an
+immutable, append-only table (which internally is a SQL table) documenting each
+identifier change. This changelog establishes a monotonically increasing
+version number for the entire corpus, and should make interaction with other
+systems easier (eg, search engines, replicated databases, alternative storage
+backends, notification frameworks, etc.).
diff --git a/guide/src/entity_types.md b/guide/src/entity_types.md
new file mode 100644
index 00000000..1a74f79e
--- /dev/null
+++ b/guide/src/entity_types.md
@@ -0,0 +1,7 @@
+# Entity Types
+
+TODO: entity-type-specific scope and quality guidance
+
+## Work/Release/File Distinctions
+
+TODO: clarify distinctions and relationship between these three entity types
diff --git a/guide/src/goals.md b/guide/src/goals.md
index 80d0f145..048d9cb1 100644
--- a/guide/src/goals.md
+++ b/guide/src/goals.md
@@ -1,18 +1,18 @@
-# Goals and Related Projects
-## Goals and Ecosystem Niche
+## Project Goals and Ecosystem Niche
-For the Internet Archive use case, fatcat has two primary use cases:
+The Internet Archive has two primary use cases for fatcat:
-- Track the "completeness" of our holdings against all known published works.
- In particular, allow us to monitor and prioritize further collection work.
+- Tracking the "completeness" of our holdings against all known published
+ works. In particular, allow us to monitor progress, identify gaps, and
+ prioritize further collection work.
- Be a public-facing catalog and access mechanism for our open access holdings.
In the larger ecosystem, fatcat could also provide:
- A work-level (as opposed to title-level) archival dashboard: what fraction of
- all published works are preserved in archives? KBART, CLOCKSS, Portico, and
- other preservations don't provide granular metadata
+  all published works are preserved in archives? [KBART], [CLOCKSS],
+  [Portico], and other preservation systems don't provide granular metadata
- A collaborative, independent, non-commercial, fully-open, field-agnostic,
"completeness"-oriented catalog of scholarly metadata
- Unified (centralized) foundation for discovery and access across repositories
@@ -25,16 +25,22 @@ In the larger ecosystem, fatcat could also provide:
- On-ramp for non-traditional digital works ("grey literature") into the
scholarly web
+[KBART]: https://thekeepers.org/
+[CLOCKSS]: https://clockss.org
+[Portico]: http://www.portico.org
+
## Scope
+What types of works should be included in the catalog?
+
The goal is to capture the "scholarly web": the graph of written works that
cite other works. Any work that is both cited more than once and cites more
than one other work in the catalog is very likely to be in scope. "Leaf nodes"
and small islands of intra-cited works may or may not be in scope.
-fatcat would not include any fulltext content itself, even for cleanly licensed
-(open access) works, but would have "strong" (verified) links to fulltext
-content, and would include file-level metadata (like hashes and fingerprints)
+Fatcat does not include any fulltext content itself, even for cleanly licensed
+(open access) works, but does have "strong" (verified) links to fulltext
+content, and includes file-level metadata (like hashes and fingerprints)
to help discovery and identify content from any source. File-level URLs with
context ("repository", "author-homepage", "web-archive") should make fatcat
more useful for both humans and machines to quickly access fulltext content of
@@ -54,11 +60,11 @@ open bibliographic database at this time (early 2018), including the
Wikidata is a general purpose semantic database of entities, facts, and
relationships; bibliographic metadata has become a large fraction of all
content in recent years. The focus there seems to be linking knowledge
-(statements) to specific sources unambiguously. Potential advantages fatcat
-would have would be a focus on a specific scope (not a general-purpose database
-of entities) and a goal of completeness (capturing as many works and
-relationships as rapidly as possible). However, it might be better to just
-pitch in to the wikidata efforts.
+(statements) to specific sources unambiguously. Potential advantages fatcat has
+are a focus on a specific scope (not a general-purpose database of entities)
+and a goal of completeness (capturing as many works and relationships as
+rapidly as possible). With so much overlap, the two efforts might merge in the
+future.
The technical design of fatcat is loosely inspired by the git
branch/tag/commit/tree architecture, and specifically inspired by Oliver
@@ -69,7 +75,7 @@ including Web of Science, Google Scholar, Microsoft Academic Graph, aminer,
Scopus, and Dimensions. There are excellent field-limited databases like dblp,
MEDLINE, and Semantic Scholar. There are some large general-purpose databases
that are not directly user-editable, including the OpenCitation corpus, CORE,
-BASE, and CrossRef. I don't know of any large (more than 60 million works),
+BASE, and CrossRef. We do not know of any large (more than 60 million works),
open (bulk-downloadable with permissive or no license), field agnostic,
user-editable corpus of scholarly publication bibliographic metadata.
diff --git a/guide/src/guide.md b/guide/src/guide.md
new file mode 100644
index 00000000..dccdc5b8
--- /dev/null
+++ b/guide/src/guide.md
@@ -0,0 +1,13 @@
+# About This Guide
+
+This guide is generated from markdown text files using the mdBook tool. The
+source is mirrored on Github at <https://github.com/bnewbold/fatcat>.
+
+Contributions and corrections are welcome! If you create a (free) account on
+github you can submit comments and corrections as "Issues", or directly edit
+the source and submit "Pull Requests" with changes.
+
+This guide is licensed under a Creative Commons Attribution (CC-BY) license,
+meaning you are free to redistribute, sell, and extend it without special
+permission, as long as you credit the original authors.
+
diff --git a/guide/src/overview.md b/guide/src/overview.md
index ef631b87..58107429 100644
--- a/guide/src/overview.md
+++ b/guide/src/overview.md
@@ -1,10 +1,17 @@
-# Fatcat Overview
-fatcat is an open bibliographic catalog of written works. The scope of works
-is somewhat flexible, with a focus on published research outputs like journal
-articles, pre-prints, and conference proceedings. Records are collaboratively
-editable, versioned, available in bulk form, and include URL-agnostic
-file-level metadata.
+# High-Level Overview
+
+This section gives an introduction to:
+
+- the goals of the project, and how it relates to the rest of the Open Access
+ and archival ecosystem
+- how catalog data is represented as entities and revisions with full edit
+  history, and how entities are referred to and cross-referenced with
+ identifiers
+- how humans and bots propose changes to the catalog, and how these changes are
+ reviewed
+- the major sources of bulk and continuously updated metadata that form the
+ foundation of the catalog
+- a rough sketch of the software back-end, database, and libraries
+- roadmap for near-future work
-fatcat is currently used internally at the Internet Archive, but interested
-folks are welcome to contribute to design and development.
diff --git a/guide/src/roadmap.md b/guide/src/roadmap.md
index b30a21ab..1a2def31 100644
--- a/guide/src/roadmap.md
+++ b/guide/src/roadmap.md
@@ -1,5 +1,47 @@
# Roadmap
+Major unimplemented features (as of September 2018) include:
+
+- backend "soundness" work to ensure corrupt data model states aren't reachable
+ via the API
+- authentication and account creation
+- rate-limiting and spam/abuse mitigation
+- "automated update" bots to consume metadata feeds (as opposed to one-time
+ bulk imports)
+- actual entity creation, editing, and deleting through the web interface
+- updating the search index in near-real-time following editgroup merges. In
+ particular, the cache invalidation problem is tricky for some relationships
+ (eg, updating all releases if a container is updated)
+
+Once a reasonable degree of schema and API stability is attained, contributions
+would be helpful to implement:
+
+- import (bulk and/or continuous updates) for more metadata sources
+- better handling of work/release distinction in, eg, search results and
+ citation counting
+- de-duplication (via merging) for all entity types
+- matching improvements, eg, for references (citations), contributions
+ (authorship), work grouping, and file/release matching
+- internationalization of the web interface (translation to multiple languages)
+- review of design for accessibility
+- better handling of non-PDF file formats
+
+Longer term projects could include:
+
+- full-text search over release files
+- bi-directional synchronization with other user-editable catalogs, such as
+ Wikidata
+- better representation of multi-file objects such as websites and datasets
+- alternate/enhanced backend to store full edit history without overloading
+ traditional relational database
+
+## Known Issues
+
+Too many right now, but this section will be populated soon.
+
+- changelog index may have gaps due to postgresql sequence and transaction
+ roll-back behavior
+
## Unresolved Questions
How to handle translations of, eg, titles and author names? To be clear, not
@@ -31,7 +73,7 @@ here should mitigate locking. Hopefully few indexes would be needed in the
primary database, as user interfaces could rely on secondary read-only search
engines for more complex queries and views.
-I see a tension between focus and scope creep. If a central database like
+There is a tension between focus and scope creep. If a central database like
fatcat doesn't support enough fields and metadata, then it will not be possible
to completely import other corpuses, and this becomes "yet another" partial
bibliographic database. On the other hand, accepting arbitrary data leads to
diff --git a/guide/src/style_guide.md b/guide/src/style_guide.md
index 1457a544..35d13e97 100644
--- a/guide/src/style_guide.md
+++ b/guide/src/style_guide.md
@@ -13,8 +13,6 @@ the release listed in the work itself
This is not to be confused with *translations* of entire works, which should be
treated as an entirely separate `release`.
-## Work/Release/File Distinctions
-
## External Identifiers
"Fake identifiers", which are actually registered and used in examples and
@@ -51,30 +49,6 @@ to auto-create a release for every registered DOI. In particular,
aren't currently auto-created, but could be stored in "extra" metadata, or on a
case-by-case basis.
-#### ISSN
-
-TODO
-
-#### ORCID
-
-TODO
-
-#### Wikidata QID
-
-TODO
-
-#### CORE Identifier
-
-TODO
-
-#### ISBN-13
-
-TODO
-
-#### PubMed (PMID and PMCID)
-
-TODO
-
## Human Names
Representing names of human beings in databases is a fraught subject. For some
diff --git a/guide/src/welcome.md b/guide/src/welcome.md
new file mode 100644
index 00000000..0bdf36fa
--- /dev/null
+++ b/guide/src/welcome.md
@@ -0,0 +1,37 @@
+# Welcome, Welcome, Welcome!
+
+This guide you are reading contains:
+
+- a **[high-level introduction](./overview.md)** to the fatcat catalog and
+ software
+- a bibliographic **[style guide](./style_guide.md)** for editors, also useful
+ for understanding metadata found in the catalog
+- technical details and guidance for use of the catalog's
+ **[public REST API](./http_api.md)**, for developers building bots, services,
+ or contributing to the server software
+- **[policies and licensing details](./policies.md)** for all contributors and
+ downstream users of the catalog
+
+## What is Fatcat?
+
+Fatcat is an open bibliographic catalog of written works. The scope of works
+is somewhat flexible, with a focus on published research outputs like journal
+articles, pre-prints, and conference proceedings. Records are collaboratively
+editable, versioned, available in bulk form, and include URL-agnostic
+file-level metadata.
+
+Both the fatcat software and the metadata stored in the service are free (in
+both the libre and gratis sense) for others to share, reuse, fork, or extend.
+See [Policies](./policies.md) for licensing details, and
+[Sources](./sources.md) for attribution of the foundational metadata corpuses
+we build on top of.
+
+Fatcat is currently used internally at the [Internet Archive], but interested
+folks are welcome to contribute to its design and development, and we hope to
+ultimately crowd-source corrections and additions to bibliographic metadata,
+and receive direct automated feeds of new content.
+
+You can contact the Archive by email at <info@archive.org>, or the author
+directly at <bnewbold@archive.org>.
+
+[Internet Archive]: https://archive.org
diff --git a/guide/src/workflow.md b/guide/src/workflow.md
index 13370a13..fd53f6a9 100644
--- a/guide/src/workflow.md
+++ b/guide/src/workflow.md
@@ -9,7 +9,7 @@ software.
The normal workflow is to create edits (or updates, merges, deletions) on
individual entities. Individual changes are bundled into an "edit group" of
related edits (eg, correcting authorship info for multiple works related to a
-single author). When ready, the editor would "submit" the edit group for
+single author). When ready, the editor "submits" the edit group for
review. During the review period, human editors vote and bots can perform
automated checks. During this period the editor can make tweaks if necessary.
After some fixed time period (72 hours?) with no changes and no blocking
@@ -28,9 +28,10 @@ Data progeny and source references are captured in the edit metadata, instead
of being encoded in the entity data model itself. In the case of importing
external databases, the expectation is that special-purpose bot accounts
are used, and tag timestamps and external identifiers in the edit metadata.
-Human editors would leave edit messages to clarify their sources.
+Human editors can leave edit messages to clarify their sources.
+
+A [style guide](./style_guide.md) and discussion forum are intended to be
+hosted as separate stand-alone services for editors to propose projects and
+debate process or scope changes. These services should have unified accounts
+and logins (oauth?) for consistent account IDs across all services.
-A style guide (wiki) and discussion forum would be hosted as separate
-stand-alone services for editors to propose projects and debate process or
-scope changes. These services should have unified accounts and logins (oauth?)
-to have consistent account IDs across all mediums.