author: Bryan Newbold <bnewbold@robocracy.org> 2019-02-14 16:19:26 -0800
committer: Bryan Newbold <bnewbold@robocracy.org> 2019-02-14 16:19:26 -0800
commit: 70b4bc18b13f59c9d42c8e44ef872dfd2e1abef3
tree: 1c4706394047bce6a086228e2efe8632d8bc1a23
parent: 56edebe7c2e090c4f25179f03a2d77d78ba59219
more guide tweaks; not a full review/rewrite
-rw-r--r-- guide/src/container_extra.md 78
-rw-r--r-- guide/src/entity_fields.md 69
-rw-r--r-- guide/src/goals.md 48
-rw-r--r-- guide/src/http_api.md 39
-rw-r--r-- guide/src/implementation.md 10
-rw-r--r-- guide/src/policies.md 4
-rw-r--r-- guide/src/roadmap.md 36
-rw-r--r-- guide/src/scope.md 4
-rw-r--r-- guide/src/style_guide.md 8
-rw-r--r-- guide/src/sw_contribute.md 12
-rw-r--r-- guide/src/welcome.md 4
-rw-r--r-- guide/src/workflow.md 18
12 files changed, 129 insertions, 201 deletions
diff --git a/guide/src/container_extra.md b/guide/src/container_extra.md
deleted file mode 100644
index 224b7e8a..00000000
--- a/guide/src/container_extra.md
+++ /dev/null
@@ -1,78 +0,0 @@
-
-'extra' fields:
-
- doaj
- as_of: datetime of most recent check; if not set, not actually in DOAJ
- seal: bool
- work_level: bool (are work-level publications deposited with DOAJ?)
- archiving: array, can include 'library' or 'other'
- road
- as_of: datetime of most recent check; if not set, not actually in ROAD
- pubmed (TODO: delete?)
- as_of: datetime of most recent check; if not set, not actually indexed in pubmed
- norwegian (TODO: drop this?)
- as_of: datetime of most recent check; if not set, not actually indexed in pubmed
- id (integer)
- level (integer; 0-2)
- kbart
- lockss
- year_rle
- volume_rle
- portico
- ...
- clockss
- ...
- sherpa_romeo
- color
- jstor
- year_rle
- volume_rle
- scopus
- id
- TODO: print/electronic distinction?
- wos
- id
- doi
- crossref_doi: DOI of the title in crossref (if exists)
- prefixes: array of strings (DOI prefixes, up to the '/'; any registrar, not just Crossref)
- ia
- sim
- nap_id
- year_rle
- volume_rle
- longtail: boolean
- homepage
- as_of: datetime of last attempt
- url
- status: HTTP/heritrix status of homepage crawl
-
- issnp: string
- issne: string
- coden: string
- abbrev: string
- oclc_id: string (TODO: lookup?)
- lccn_id: string (TODO: lookup?)
- dblb_id: string
- default_license: slug
- original_name: native name (if name is translated)
- platform: hosting platform: OJS, wordpress, scielo, etc
- mimetypes: array of strings (eg, 'application/pdf', 'text/html')
- first_year: year (integer)
- last_year: if publishing has stopped
- primary_language: single ISO code, or 'mixed'
- languages: array of ISO codes
- region: TODO: continent/world-region
- nation: shortcode of nation
- discipline: TODO: highest-level subject; "life science", "humanities", etc
- field: TODO: narrower description of field
- subjects: TODO?
- url: homepage
- is_oa: boolean. If true, can assume all releases under this container are "Open Access"
- TODO: domains, if exclusive?
- TODO: fulltext_regex, if a known pattern?
-
-For KBART, etc:
- We "over-count" on the assumption that "in-progress" status works will soon actually be preserved.
- year and volume spans are run-length-encoded arrays, using integers:
- - if an integer, means that year is preserved
- - if an array of length 2, means everything between the two numbers (inclusive) is preserved
diff --git a/guide/src/entity_fields.md b/guide/src/entity_fields.md
index 7e5375b0..209b6154 100644
--- a/guide/src/entity_fields.md
+++ b/guide/src/entity_fields.md
@@ -84,6 +84,11 @@ Additional fields used in analytics and "curration" tracking:
- `sim` (object): same format as `kbart` preservation above; coverage in microfilm collection
- `longtail` (bool): is this considered a "long-tail" open access venue
+For KBART and other "coverage" fields, we "over-count" on the assumption that
+works with "in-progress" status will soon actually be preserved. Elements of
+these arrays are either an integer (meaning that single year is preserved), or
+an array of length two (meaning everything between the two numbers, inclusive,
+is preserved).
[CODEN]: https://en.wikipedia.org/wiki/CODEN
@@ -258,7 +263,7 @@ Warning: This schema is not yet stable.
always have an implicit order. Zero-indexed. Note that this is distinct
from the `key` field.
- `target_release_id` (fatcat identifier): if known, and the release
- exists, a cross-reference to the fatcat entity
+ exists, a cross-reference to the Fatcat entity
- `extra` (JSON, optional): additional citation format metadata can be
stored here, particularly if the citation schema does not align. Common
fields might be "volume", "authors", "issue", "publisher", "url", and
@@ -316,7 +321,6 @@ This vocabulary is based on the
with a small number of (proposed) extensions:
- `article-magazine`
-- `article-newspaper`
- `article-journal`, including pre-prints and working papers
- `book`
- `chapter` is allowed as they are frequently referenced and read independent
@@ -337,42 +341,45 @@ with a small number of (proposed) extensions:
- `patent`
- `post-weblog` for blog entries
- `report`
-- `review`, for things like book reviews, not the "literature review" form of `article-journal`
+- `review`, for things like book reviews, not the "literature review" form of
+ `article-journal`, nor peer reviews (see `peer_review`)
- `speech` can be used for eg, slides and recorded conference presentations
themselves, as distinct from `paper-conference`
- `thesis`
- `webpage`
- `peer_review` (fatcat extension)
- `software` (fatcat extension)
-- `standard` (fatcat extension)
-- `abstract` (fatcat extension)
+- `standard` (fatcat extension), for technical standards like RFCs
+- `abstract` (fatcat extension), for releases that are only an abstract of a
+ larger work. In particular, translations. Many are granted DOIs.
- `editorial` (custom extension) for columns, "in this issue", and other
- content published along peer-reviewed content in journals.
+ content published along peer-reviewed content in journals. Many are granted DOIs.
- `letter` for "letters to the editor", "authors respond", and
- sub-article-length published content
-- `example` (custom extension) for dummy or example releases that have valid
- (registered) identifiers. Other metadata does not need to match "canonical"
- examples.
+ sub-article-length published content. Many are granted DOIs.
- `stub` (fatcat extension) for releases which have notable external
identifiers, and thus are included "for completeness", but don't seem to
- represent a "full work". An example might be a paper that gets an extra DOI
- by accident; the primary DOI should be a full release, and the accidental DOI
- can be a `stub` release under the same work. `stub` releases shouldn't be
- considered full releases when counting or aggregating (though if technically
- difficult this may not always be implemented). Other things that can be
- categorized as stubs (which seem to often end up mis-categorized as full
- articles in bibliographic databases):
- - commercial advertisements
- - "trap" or "honey pot" works, which are fakes included in databases to
- detect re-publishing without attribution
- - "This page is intentionally blank"
- - "About the author", "About the editors", "About the cover"
- - "Acknowledgments"
- - "Notices"
+ represent a "full work".
+
+An example of a `stub` might be a paper that gets an extra DOI by accident; the
+primary DOI should be a full release, and the accidental DOI can be a `stub`
+release under the same work. `stub` releases shouldn't be considered full
+releases when counting or aggregating (though if technically difficult this may
+not always be implemented). Other things that can be categorized as stubs
+(which seem to often end up mis-categorized as full articles in bibliographic
+databases):
+
+- commercial advertisements
+- "trap" or "honey pot" works, which are fakes included in databases to
+ detect re-publishing without attribution
+- "This page is intentionally blank"
+- "About the author", "About the editors", "About the cover"
+- "Acknowledgments"
+- "Notices"
All other CSL types are also allowed, though they are mostly out of scope:
- `article` (generic; should usually be some other type)
+- `article-newspaper`
- `bill`
- `broadcast`
- `entry-dictionary`
@@ -438,6 +445,20 @@ Can often be interpreted as `published`, but be careful!
- `illustrator`
- `editor`
+All other CSL role types are also allowed, though are mostly out of scope for
+Fatcat:
+
+- `collection-editor`
+- `composer`
+- `container-author`
+- `director`
+- `editorial-director`
+- `editortranslator`
+- `interviewer`
+- `original-author`
+- `recipient`
+- `reviewed-author`
+
If blank, indicates that type of contribution is not known; this can often be
interpreted as authorship.
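
The coverage arrays described in the `entity_fields.md` change above (each element is either a single integer or an inclusive two-element array) could be decoded roughly as follows. This is only a sketch: the function names and the sample values are hypothetical and not part of Fatcat.

```python
from typing import Iterable, List, Set, Union

# One element of a coverage array: a single year, or an inclusive [start, end] pair.
Span = Union[int, List[int]]


def expand_coverage(spans: Iterable[Span]) -> Set[int]:
    """Expand a run-length-encoded coverage array into a set of years.

    Hypothetical helper for illustration; not taken from the Fatcat codebase.
    """
    years: Set[int] = set()
    for span in spans:
        if isinstance(span, int):
            # A bare integer means that single year is preserved.
            years.add(span)
        elif len(span) == 2:
            # A two-element array covers both endpoints and everything between.
            start, end = span
            years.update(range(start, end + 1))
        else:
            raise ValueError(f"unexpected coverage element: {span!r}")
    return years


def is_preserved(spans: Iterable[Span], year: int) -> bool:
    """Return True if `year` falls inside any element of the coverage array."""
    return year in expand_coverage(spans)
```

For example, `expand_coverage([1999, [2001, 2003]])` covers 1999 and 2001 through 2003, but not 2000.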
diff --git a/guide/src/goals.md b/guide/src/goals.md
index e7ef1512..9bb64b62 100644
--- a/guide/src/goals.md
+++ b/guide/src/goals.md
@@ -1,14 +1,14 @@
## Project Goals and Ecosystem Niche
-The Internet Archive has two primary use cases for fatcat:
+The Internet Archive has two primary use cases for Fatcat:
- Tracking the "completeness" of our holdings against all known published
works. In particular, allow us to monitor progress, identify gaps, and
prioritize further collection work.
- Be a public-facing catalog and access mechanism for our open access holdings.
-In the larger ecosystem, fatcat could also provide:
+In the larger ecosystem, Fatcat could also provide:
- A work-level (as opposed to title-level) archival dashboard: what fraction of
all published works are preserved in archives? [KBART](), [CLOCKSS](),
@@ -22,8 +22,8 @@ In the larger ecosystem, fatcat could also provide:
reproducibility (metadata corpus itself is open access, and file-level hashes
control for content drift)
- Foundational infrastructure for distributed digital preservation
-- On-ramp for non-traditional digital works ("grey literature") into the
- scholarly web
+- On-ramp for non-traditional digital works (web-native and "grey literature")
+ into the scholarly web
[KBART]: https://thekeepers.org/
[CLOCKSS]: https://clockss.org
@@ -35,22 +35,22 @@ What types of works should be included in the catalog?
The goal is to capture the "scholarly web": the graph of written works that
cite other works. Any work that is both cited more than once and cites more
-than one other work in the catalog is very likely to be in scope. "Leaf nodes"
-and small islands of intra-cited works may or may not be in scope.
-
-Fatcat does not include any fulltext content itself, even for cleanly licensed
-(open access) works, but does have "strong" (verified) links to fulltext
-content, and includes file-level metadata (like hashes and fingerprints)
-to help discovery and identify content from any source. File-level URLs with
-context ("repository", "author-homepage", "web-archive") should make fatcat
-more useful for both humans and machines to quickly access fulltext content of
-a given mimetype than existing redirect or landing page systems. So another
-factor in deciding scope is whether a work has "digital fixity" and can be
-contained in a single immutable file.
+than one other work in the catalog is likely to be in scope. "Leaf nodes" and
+small islands of intra-cited works may or may not be in scope.
+
+Fatcat does not include any fulltext content itself, even for clearly licensed
+open access works, but does have verified hyperlinks to fulltext content, and
+includes file-level metadata (hashes and fingerprints) to help identify content
+from any source. File-level URLs with context ("repository", "publisher",
+"webarchive") should make Fatcat more useful for both humans and machines to
+quickly access fulltext content of a given mimetype than existing redirect or
+landing page systems. So another factor in deciding scope is whether a work has
+"digital fixity" and can be contained in immutable files or can be captured by
+web archives.
## References and Previous Work
-The closest overall analog of fatcat is [MusicBrainz][mb], a collaboratively
+The closest overall analog of Fatcat is [MusicBrainz][mb], a collaboratively
edited music database. [Open Library][ol] is a very similar existing service,
which exclusively contains book metadata.
@@ -60,23 +60,23 @@ open bibliographic database at this time (early 2018), including the
Wikidata is a general purpose semantic database of entities, facts, and
relationships; bibliographic metadata has become a large fraction of all
content in recent years. The focus there seems to be linking knowledge
-(statements) to specific sources unambiguously. Potential advantages fatcat has
+(statements) to specific sources unambiguously. Potential advantages Fatcat has
are a focus on a specific scope (not a general-purpose database of entities)
and a goal of completeness (capturing as many works and relationships as
rapidly as possible). With so much overlap, the two efforts might merge in the
future.
-The technical design of fatcat is loosely inspired by the git
+The technical design of Fatcat is loosely inspired by the git
branch/tag/commit/tree architecture, and specifically inspired by Oliver
Charles' "New Edit System" [blog posts][nes-blog] from 2012.
-There are a whole bunch of proprietary, for-profit bibliographic databases,
+There are a number of proprietary, for-profit bibliographic databases,
including Web of Science, Google Scholar, Microsoft Academic Graph, aminer,
Scopus, and Dimensions. There are excellent field-limited databases like dblp,
-MEDLINE, and Semantic Scholar. There are some large general-purpose databases
-that are not directly user-editable, including the OpenCitation corpus, CORE,
-BASE, and CrossRef. We do not know of any large (more than 60 million works),
-open (bulk-downloadable with permissive or no license), field agnostic,
+MEDLINE, and Semantic Scholar. Large, general-purpose databases also exist that
+are not directly user-editable, including the OpenCitation corpus, CORE, BASE,
+and CrossRef. We do not know of any large (more than 60 million works), open
+(bulk-downloadable with permissive or no license), field agnostic,
user-editable corpus of scholarly publication bibliographic metadata.
[nes-blog]: https://ocharles.org.uk/blog/posts/2012-07-10-nes-does-it-better-1.html
diff --git a/guide/src/http_api.md b/guide/src/http_api.md
index 5769533d..e1b7f557 100644
--- a/guide/src/http_api.md
+++ b/guide/src/http_api.md
@@ -1,6 +1,6 @@
# REST API
-The fatcat HTTP API is mostly a classic REST CRUD (Create, Read, Update,
+The Fatcat HTTP API is mostly a classic REST "CRUD" (Create, Read, Update,
Delete) API, with a few twists.
A declarative specification of all API endpoints, JSON data models, and
@@ -9,9 +9,8 @@ used to generate both server-side type-safe endpoint routes and client-side
libraries. Auto-generated reference documentation is, for now, available at
<https://api.qa.fatcat.wiki>.
-All API traffic is over HTTPS; there is no insecure HTTP endpoint, even for
-read-only operations. To start, all endpoints accept and return only JSON
-serialized content.
+All API traffic is over HTTPS; there is no HTTP endpoint, even for read-only
+operations. All endpoints accept and return only JSON serialized content.
## Entity Endpoints/Actions
@@ -21,16 +20,13 @@ Actions could, in theory, be directed at any of:
revision
edit
-A design decision to be made is how much to abstract away the distinction
-between these three types (particularly the identifier/revision distinction).
-
Top-level entity actions (resulting in edits):
create (new rev)
- redirect
- split
update (new rev)
delete
+ redirect
+ split (remove redirect)
On existing entity edits (within a group):
@@ -45,17 +41,23 @@ An edit group as a whole can be:
Other per-entity endpoints:
- match (by field/context)
lookup (by external persistent identifier)
+ match (by field/context; unimplemented)
## Editgroups
-All mutating entity operations (create, update, delete) accept an
-`editgroup_id` query parameter. If the parameter isn't set, the editor's
-"currently active" editgroup will be used, or a new editgroup will be created
-from scratch. It's generally preferable to manually create an editgroup and use
-the `id` in edit requests; the allows appropriate metadata to be set. The
-"currently active" editgroup behavior may be removed in the future.
+All mutating entity operations (create, update, delete) accept a required
+`editgroup_id` query parameter. Editgroups (with contextual metadata) should be
+created before starting edits.
+
+Related edits (to multiple entities) should be collected under a single
+editgroup, up to a reasonable size. More than 50 edits per entity type, or more
+than 100 edits total in an editgroup become unwieldy.
+
+After creating and modifying the editgroup, it may be "submitted", which flags
+it for review by bot and human editors. The editgroup may be "accepted"
+(merged), or if changes are necessary the edits can be updated and
+re-submitted.
## Sub-Entity Expansion
@@ -77,9 +79,8 @@ editor may have additional privileges which allow them to, eg, directly accept
editgroups (as opposed to submitting edits for review).
All mutating API calls (POST, PUT, DELETE HTTP verbs) require token-based
-authentication using an HTTP Bearer token. If you can't generate such a token
-from the web interface (because that feature hasn't been implemented), look for
-a public demo token for experimentation, or ask an administrator for a token.
+authentication using an HTTP Bearer token. New tokens can be generated in the
+web interface.
## Autoaccept Flag
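
As a rough illustration of the `http_api.md` text above, a mutating call carries the required `editgroup_id` query parameter, a JSON body, and an HTTP Bearer token. The sketch below only builds the request and does not send it. The host is the QA server named in the guide; the `/v0` prefix, the `/container` path, the payload, and the helper name are assumptions made for illustration, not the documented API.

```python
import json
import urllib.parse
import urllib.request

# Assumption: base URL built from the QA host mentioned in the guide.
# The "/v0" prefix and any entity paths passed in are hypothetical.
API_BASE = "https://api.qa.fatcat.wiki/v0"


def build_mutating_request(path, editgroup_id, token, payload, method="POST"):
    """Build (but do not send) an authenticated, JSON-encoded mutating request.

    Hypothetical helper: the editgroup is expected to have been created
    beforehand, and its id is passed as the `editgroup_id` query parameter.
    """
    query = urllib.parse.urlencode({"editgroup_id": editgroup_id})
    return urllib.request.Request(
        url=f"{API_BASE}{path}?{query}",
        data=json.dumps(payload).encode("utf-8"),
        method=method,
        headers={
            # Token-based authentication using an HTTP Bearer token.
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
```

Keeping request construction separate from sending makes it easy to inspect the URL and headers before any edit reaches the server.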
diff --git a/guide/src/implementation.md b/guide/src/implementation.md
index 33a53c21..8d1830b6 100644
--- a/guide/src/implementation.md
+++ b/guide/src/implementation.md
@@ -15,14 +15,14 @@ A cronjob will create periodic database dumps, both in "full" form (all tables
and all edit history, removing only authentication credentials) and "flattened"
form (with only the most recent version of each entity).
-A goal is to be linked-data/RDF/JSON-LD/semantic-web "compatible", but not
-necessarily "first". It should be possible to export the database in a
+One design goal is to be linked-data/RDF/JSON-LD/semantic-web "compatible", but
+not necessarily "first". It should be possible to export the database in a
relatively clean RDF form, and to fetch data in a variety of formats, but
-internally fatcat will not be backed by a triple-store, and will not be bound
-to a rigid third-party ontology or schema.
+internally Fatcat is not backed by a triple-store, and is not tied to any
+specific third-party ontology or schema.
Microservice daemons should be able to proxy between the primary API and
-standard protocols like ResourceSync and OAI-PMH, and third party bots could
+standard protocols like ResourceSync and OAI-PMH, and third party bots can
ingest or synchronize the database in those formats.
### Fatcat Identifiers
diff --git a/guide/src/policies.md b/guide/src/policies.md
index e61984be..3816f876 100644
--- a/guide/src/policies.md
+++ b/guide/src/policies.md
@@ -69,11 +69,11 @@ and CC-0 (public grant) licensing for declarative interface specifications
## Privacy Policy
*It is important to note that this section is currently aspirational: the
-servers hosting early deployments of fatcat are largely in a default
+servers hosting early deployments of Fatcat are largely in a default
configuration and have not been audited to ensure that these guidelines are
being followed.*
-It is a goal for fatcat to conduct as little surveillance of reader and editor
+It is a goal for Fatcat to conduct as little surveillance of reader and editor
behavior and activities as possible. In practical terms, this means minimizing
the overall amount of logging and collection of identifying information. This
is in contrast to *submitted edit content*, which is captured, preserved, and
diff --git a/guide/src/roadmap.md b/guide/src/roadmap.md
index 745380f9..c4cc6a98 100644
--- a/guide/src/roadmap.md
+++ b/guide/src/roadmap.md
@@ -1,20 +1,11 @@
# Roadmap
-Major unimplemented features (as of September 2018) include:
+Core unimplemented features (as of February 2019) include:
-- backend "soundness" work to ensure corrupt data model states aren't reachable
- via the API
-- authentication and account creation
- rate-limiting and spam/abuse mitigation
-- "automated update" bots to consume metadata feeds (as opposed to one-time
- bulk imports)
- actual entity creation, editing, deleting through the web interface
-- updating the search index in near-real-time following editgroup merges. In
- particular, the cache invalidation problem is tricky for some relationships
- (eg, updating all releases if a container is updated)
-Once a reasonable degree of schema and API stability is attained, contributions
-would be helpful to implement:
+Contributions would be helpful to implement:
- import (bulk and/or continuous updates) for more metadata sources
- better handling of work/release distinction in, eg, search results and
@@ -23,23 +14,19 @@ would be helpful to implement:
- matching improvements, eg, for references (citations), contributions
(authorship), work grouping, and file/release matching
- internationalization of the web interface (translation to multiple languages)
-- review of design for accessibility
-- better handling of non-PDF file formats
+- accessibility review of user interface
Longer term projects could include:
- full-text search over release files
- bi-directional synchronization with other user-editable catalogs, such as
Wikidata
-- better representation of multi-file objects such as websites and datasets
- alternate/enhanced backend to store full edit history without overloading
traditional relational database
## Known Issues
-Too many right now, but this section will be populated soon.
-
-- changelog index may have gaps due to postgresql sequence and transaction
+- changelog index may have gaps due to PostgreSQL sequence and transaction
roll-back behavior
## Unresolved Questions
@@ -48,22 +35,19 @@ How to handle translations of, eg, titles and author names? To be clear, not
translations of works (which are just separate releases), these are more like
aliases or "originally known as".
-Are bi-directional links a schema anti-pattern? Eg, should "work" point to a
-"primary release" (which itself points back to the work)?
-
-Should `identifier` and `citation` be their own entities, referencing other
-entities by UUID instead of by revision? Not sure if this would increase or
-decrease database resource utilization.
+Should external identifiers be made generic? Eg, instead of having `arxiv_id` as
+a column, have a table of arbitrary identifiers, with either an `extid_type` or
+just a prefix like `arxiv:someid`.
Should contributor/author affiliation and contact information be retained? It
could be very useful for disambiguation, but we don't want to build a huge
-database for spammers or "innovative" start-up marketing.
+database for "marketing" and other spam.
Can general-purpose SQL databases like Postgres or MySQL scale well enough to
hold several tables with billions of entity revisions? Right from the start
there are hundreds of millions of works and releases, many of which having
dozens of citations, many authors, and many identifiers, and then we'll have
-potentially dozens of edits for each of these, which multiply out to `1e8 * 2e1
+potentially dozens of edits for each of these. This multiplies out to `1e8 * 2e1
* 2e1 = 4e10`, or 40 billion rows in the citation table. If each row was 32
bytes on average (uncompressed, not including index size), that would be 1.3
TByte on its own, larger than common SSD disks. I do think a transactional SQL
@@ -74,7 +58,7 @@ primary database, as user interfaces could rely on secondary read-only search
engines for more complex queries and views.
There is a tension between focus and scope creep. If a central database like
-fatcat doesn't support enough fields and metadata, then it will not be possible
+Fatcat doesn't support enough fields and metadata, then it will not be possible
to completely import other corpuses, and this becomes "yet another" partial
bibliographic database. On the other hand, accepting arbitrary data leads to
other problems: sparseness increases (we have more "partial" data), potential
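
The scaling estimate in the `roadmap.md` text above (`1e8 * 2e1 * 2e1 = 4e10` rows at 32 bytes each) can be reproduced with a few lines of arithmetic. The variable and function names are hypothetical; the figures are the order-of-magnitude values quoted in the guide, not measurements.

```python
def estimate_citation_table(works, citations_per_work, edits_per_entity, bytes_per_row):
    """Return (row_count, size_in_decimal_terabytes) for the citation table.

    Hypothetical helper mirroring the back-of-the-envelope estimate in the guide.
    """
    rows = works * citations_per_work * edits_per_entity
    total_bytes = rows * bytes_per_row
    return rows, total_bytes / 10**12


# Order-of-magnitude inputs quoted in the roadmap text.
rows, terabytes = estimate_citation_table(
    works=10**8,            # hundreds of millions of works and releases
    citations_per_work=20,  # "dozens of citations"
    edits_per_entity=20,    # "dozens of edits"
    bytes_per_row=32,       # uncompressed, not including index size
)
print(f"{rows:.1e} rows, about {terabytes:.2f} TB")
```

With these inputs the result is 40 billion rows and 1.28 TB, which the guide rounds to 1.3 TByte.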
diff --git a/guide/src/scope.md b/guide/src/scope.md
index d5e74156..9815c44e 100644
--- a/guide/src/scope.md
+++ b/guide/src/scope.md
@@ -53,11 +53,11 @@ pre-prints to final publication is in scope.
I'm much less interested in altmetrics, funding, and grant relationships than
most existing databases in this space.
-fatcat would not include any fulltext content itself, even for cleanly licensed
+Fatcat would not include any fulltext content itself, even for cleanly licensed
(open access) works, but would have "strong" (verified) links to fulltext
content, and would include file-level metadata (like hashes and fingerprints)
to help discovery and identify content from any source. File-level URLs with
-context ("repository", "author-homepage", "web-archive") should make fatcat
+context ("repository", "author-homepage", "web-archive") should make Fatcat
more useful for both humans and machines to quickly access fulltext content of
a given mimetype than existing redirect or landing page systems. So another
factor in deciding scope is whether a work has "digital fixity" and can be
diff --git a/guide/src/style_guide.md b/guide/src/style_guide.md
index 7f819c8d..d670691a 100644
--- a/guide/src/style_guide.md
+++ b/guide/src/style_guide.md
@@ -19,12 +19,12 @@ treated as an entirely separate `release`.
documentation (such as DOI `10.5555/12345678`) are allowed (and the entity
should be tagged as a fake or example). Non-registered "identifier-like
strings", which are semantically valid but not registered, should not exist in
-fatcat metadata in an identifier column. Invalid identifier strings can be
+Fatcat metadata in an identifier column. Invalid identifier strings can be
stored in "extra" metadata. Crossref has [blogged]() about this distinction.
[blogged]: https://www.crossref.org/blog/doi-like-strings-and-fake-dois/
-#### DOI
+#### DOIs
All DOIs stored in an entity column should be registered (aka, should be
resolvable from `doi.org`). Invalid identifiers may be cleaned up or removed by
@@ -38,9 +38,9 @@ formatted strings.
[number of examples]: https://www.crossref.org/blog/dois-unambiguously-and-persistently-identify-published-trustworthy-citable-online-scholarly-literature-right/
-In the fatcat ontology, DOIs and release entities are one-to-one.
+In the Fatcat ontology, DOIs and release entities are one-to-one.
-It is the intention to automatically (via bot) create a fatcat release for
+It is the intention to automatically (via bot) create a Fatcat release for
every Crossref-registered DOI from a whitelist of media types
("journal-article" etc, but not all), and it would be desirable to auto-create
entities for in-scope publications from all registrars. It is not the intention
diff --git a/guide/src/sw_contribute.md b/guide/src/sw_contribute.md
index 999b2149..d408ef4b 100644
--- a/guide/src/sw_contribute.md
+++ b/guide/src/sw_contribute.md
@@ -2,13 +2,13 @@
For now, issues and patches can be filed at <https://github.com/internetarchive/fatcat>.
-To start, the back-end (fatcatd, in rust), web interface (fatcat-web, in
-python), bots, and this guide are all versioned in the same git repository.
+The back-end (`fatcatd`, in Rust), web interface (`fatcat-web`, in Python),
+bots, and this guide are all versioned in the same git repository.
-See the `rust/README` and `rust/HACKING` documents for some common tasks and
-gotchas when working with the rust backend.
+See the `rust/README.md` and `rust/HACKING.md` documents for some common tasks
+and gotchas when working with the Rust backend.
When considering making a non-trivial contribution, it can save review time and
duplicated work to post an issue with your intentions and plan. New code and
-features will need to include unit tests before being merged, though we can
-help with writing them.
+features must include unit tests before being merged, though we can help with
+writing them.
diff --git a/guide/src/welcome.md b/guide/src/welcome.md
index 0bdf36fa..b0d8b1cc 100644
--- a/guide/src/welcome.md
+++ b/guide/src/welcome.md
@@ -2,7 +2,7 @@
This guide you are reading contains:
-- a **[high-level introduction](./overview.md)** to the fatcat catalog and
+- a **[high-level introduction](./overview.md)** to the Fatcat catalog and
software
- a bibliographic **[style guide](./style_guide.md)** for editors, also useful
for understanding metadata found in the catalog
@@ -20,7 +20,7 @@ articles, pre-prints, and conference proceedings. Records are collaboratively
editable, versioned, available in bulk form, and include URL-agnostic
file-level metadata.
-Both the fatcat software and the metadata stored in the service are free (in
+Both the Fatcat software and the metadata stored in the service are free (in
both the libre and gratis sense) for others to share, reuse, fork, or extend.
See [Policies](./policies.md) for licensing details, and
[Sources](./sources.md) for attribution of the foundational metadata corpuses
diff --git a/guide/src/workflow.md b/guide/src/workflow.md
index 94842e54..ff1552cf 100644
--- a/guide/src/workflow.md
+++ b/guide/src/workflow.md
@@ -3,8 +3,8 @@
## Basic Editing Workflow and Bots
Both human editors and bots should have edits go through the same API, with
-humans using either the default web interface, integration, or client
-software.
+humans using either the default web interface, client software, or third-party
+integrations.
The normal workflow is to create edits (or updates, merges, deletions) on
individual entities. Individual changes are bundled into an "edit group" of
@@ -12,13 +12,13 @@ related edits (eg, correcting authorship info for multiple works related to a
single author). When ready, the editor "submits" the edit group for
review. During the review period, human editors vote and bots can perform
automated checks. During this period the editor can make tweaks if necessary.
-After some fixed time period (72 hours?) with no changes and no blocking
-issues, the edit group would be auto-accepted if no merge conflicts have
-be created by other edits to the same entities. This process balances editing
-labor (reviews are easy, but optional) against quality (cool-down period makes
-it easier to detect and prevent spam or out-of-control bots). More
-sophisticated roles and permissions could allow some certain humans and bots to
-push through edits more rapidly (eg, importing new works from a publisher API).
+After some fixed time period (one week?) with no changes and no blocking
+issues, the edit group would be accepted if no merge conflicts have been created
+by other edits to the same entities. This process balances editing labor
+(reviews are easy, but optional) against quality (cool-down period makes it
+easier to detect and prevent spam or out-of-control bots). More sophisticated
+roles and permissions could allow certain humans and bots to push through
+edits more rapidly (eg, importing new works from a publisher API).
Bots need to be tuned to have appropriate edit group sizes (eg, daily batches,
instead of millions of works in a single edit) to make human QA review and