start refactoring guide (per-entity pages)

author: Bryan Newbold <bnewbold@robocracy.org> 2019-05-20 16:57:57 -0700
committer: Bryan Newbold <bnewbold@robocracy.org> 2019-05-20 16:57:57 -0700
commit: cd829eedb5bfc7328ab5266650a625a6c88db6fa (patch)
tree: 90aa164cbd7f4e86aadc25dbd036dab680c30e80 /guide
parent: eb31be2172264091e192bcb4f17ffd571253fffa (diff)
download: fatcat-cd829eedb5bfc7328ab5266650a625a6c88db6fa.tar.gz
fatcat-cd829eedb5bfc7328ab5266650a625a6c88db6fa.zip
9 files changed, 490 insertions, 457 deletions
diff --git a/guide/src/SUMMARY.md b/guide/src/SUMMARY.md
index c25615b8..2a986dcb 100644
--- a/guide/src/SUMMARY.md
+++ b/guide/src/SUMMARY.md
@@ -10,7 +10,14 @@
     - [Implementation and Infrastructure](./implementation.md)
     - [Roadmap](./roadmap.md)
 - [Cataloging Style Guide](./style_guide.md)
-    - [Entity Field Reference](./entity_fields.md)
+    - [All Entities](./entity_fields.md)
+    - [Container](./entity_container.md)
+    - [Creator](./entity_creator.md)
+    - [File](./entity_file.md)
+    - [Fileset](./entity_fileset.md)
+    - [Web Capture](./entity_webcapture.md)
+    - [Release](./entity_release.md)
+    - [Work](./entity_work.md)
 - [Public API](./http_api.md)
     - [Bulk Exports](./bulk_exports.md)
     - [Cookbook](./cookbook.md)
diff --git a/guide/src/entity_container.md b/guide/src/entity_container.md
new file mode 100644
index 00000000..f6568044
--- /dev/null
+++ b/guide/src/entity_container.md
@@ -0,0 +1,84 @@
+
+# Container Entity Reference
+
+## Fields
+
+- `name` (string, required): The title of the publication, as used in
+  international indexing services. Eg, "Journal of Important Results". Not
+  necessarily in the native language, but also not necessarily in English.
+  Alternative titles (and translations) can be stored in "extra" metadata (see
+  below)
+- `container_type` (string): eg, journal vs. conference vs. book series.
+  Controlled vocabulary is TODO.
+- `publisher` (string): The name of the publishing organization. Eg, "Society
+  of Curious Students".
+- `issnl` (string): an external identifier, with registration controlled by the
+  [ISSN organization](http://www.issn.org/). Registration is relatively
+  inexpensive and easy to obtain (depending on world region), so almost all
+  serial publications have one. The ISSN-L ("linking ISSN") is one of either
+  the print ("ISSNp") or electronic ("ISSNe") identifiers for a serial
+  publication; not all publications have both types of ISSN, but many do, which
+  can cause confusion. The ISSN master list is not gratis/public, but the
+  ISSN-L mapping is.
+- `wikidata_qid` (string): external linking identifier to a Wikidata entity.
+
+#### `extra` Fields
+
+- `abbrev` (string): a commonly used abbreviation for the publication, as used
+  in citations, following the [ISO 4]() standard. Eg, "Journal of Polymer
+  Science Part A" -> "J. Polym. Sci. A"
+- `coden` (string): an external identifier, the [CODEN code](). 6 characters,
+  all upper-case.
+- `issnp` (string): Print ISSN
+- `issne` (string): Electronic ISSN
+- `default_license` (string, slug): short name (eg, "CC-BY-SA") for the
+  default/recommended license for works published in this container
+- `original_name` (string): native name (if `name` is translated)
+- `platform` (string): hosting platform: OJS, wordpress, scielo, etc
+- `mimetypes` (array of string): formats that this container publishes all works
+  under (eg, 'application/pdf', 'text/html')
+- `first_year` (integer): first year of publication
+- `last_year` (integer): final year of publication (implies that container is no longer active)
+- `languages` (array of strings): ISO codes; the first entry is considered the
+  "primary" language (if that makes sense)
+- `country` (string): ISO abbreviation (two characters) for the country this
+  container is published in
+- `aliases` (array of strings): significant alternative names or abbreviations
+  for this container (not just capitalization/punctuation)
+- `region` (string, slug): continent/world-region (vocabulary is TODO)
+- `discipline` (string, slug): highest-level subject aread (vocabulary is TODO)
+- `urls` (array of strings): known homepage URLs for this container (first in array is default)
+
+Additional fields used in analytics and "curration" tracking:
+
+- `doaj` (object)
+  - `as_of` (string, ISO datetime): datetime of most recent check; if not set,
+    not actually in DOAJ
+  - `seal` (bool): has DOAJ seal
+  - `work_level` (bool): whether work-level publications are registered with DOAJ
+  - `archive` (array of strings): preservation archives
+- `road` (object)
+  - `as_of` (string, ISO datetime): datetime of most recent check; if not set,
+    not actually in ROAD
+- `kbart` (object)
+  - `lockss`, `clockss`, `portico`, `jstor` etc (object)
+    - `year_spans` (array of arrays of integers (pairs)): year spans (inclusive)
+      for which the given archive has preserved this container
+    - `volume_spans` (array of arrays of integers (pairs)): volume spans (inclusive)
+      for which the given archive has preserved this container
+- `sherpa_romeo` (object):
+    - `color` (string): the SHERPA/RoMEO "color" of the publisher of this container
+- `doi`: TODO: include list of prefixes and which (if any) DOI registrar is used
+- `dblp` (object):
+  - `id` (string)
+- `ia` (object): Internet Archive specific fields
+  - `sim` (object): same format as `kbart` preservation above; coverage in microfilm collection
+  - `longtail` (bool): is this considered a "long-tail" open access venue
+
+For KBART and other "coverage" fields, we "over-count" on the assumption that
+works with "in-progress" status will soon actually be preserved. Elements of
+these arrays are either an integer (means that single year is preserved), or an
+array of length two (meaning everything between the two numbers (inclusive) is
+preserved).
+
+[CODEN]: https://en.wikipedia.org/wiki/CODEN
diff --git a/guide/src/entity_creator.md b/guide/src/entity_creator.md
new file mode 100644
index 00000000..fded9e8d
--- /dev/null
+++ b/guide/src/entity_creator.md
@@ -0,0 +1,13 @@
+
+# Creator Entity Reference
+
+## Fields
+
+- `display_name` (string, required): Full name, as will be displayed in user
+  interfaces. Eg, "Grace Hopper"
+- `given_name` (string): Also known as "first name". Eg, "Grace".
+- `surname` (string): Also known as "last name". Eg, "Hooper".
+- `orcid` (string): external identifier, as registered with ORCID.
+- `wikidata_qid` (string): external linking identifier to a Wikidata entity.
+  
+See also ["Human Names"](./style_guide.md##human-names) sub-section of style guide.
diff --git a/guide/src/entity_fields.md b/guide/src/entity_fields.md
index d2e68f95..dfded89a 100644
--- a/guide/src/entity_fields.md
+++ b/guide/src/entity_fields.md
@@ -1,4 +1,4 @@
-# Entity Field Reference
+# Common Entity Fields
 
 All entities have:
 
@@ -10,458 +10,3 @@ well as accommodating niche or field-specific content. Reasonable care should
 be taken with this extra metadata: don't include large text or binary fields,
 hundreds of fields, duplicate metadata, etc.
 
-## Containers
-
-- `name` (string, required): The title of the publication, as used in
-  international indexing services. Eg, "Journal of Important Results". Not
-  necessarily in the native language, but also not necessarily in English.
-  Alternative titles (and translations) can be stored in "extra" metadata (see
-  below)
-- `container_type` (string): eg, journal vs. conference vs. book series.
-  Controlled vocabulary is TODO.
-- `publisher` (string): The name of the publishing organization. Eg, "Society
-  of Curious Students".
-- `issnl` (string): an external identifier, with registration controlled by the
-  [ISSN organization](http://www.issn.org/). Registration is relatively
-  inexpensive and easy to obtain (depending on world region), so almost all
-  serial publications have one. The ISSN-L ("linking ISSN") is one of either
-  the print ("ISSNp") or electronic ("ISSNe") identifiers for a serial
-  publication; not all publications have both types of ISSN, but many do, which
-  can cause confusion. The ISSN master list is not gratis/public, but the
-  ISSN-L mapping is.
-- `wikidata_qid` (string): external linking identifier to a Wikidata entity.
-
-#### `extra` Fields
-
-- `abbrev` (string): a commonly used abbreviation for the publication, as used
-  in citations, following the [ISO 4]() standard. Eg, "Journal of Polymer
-  Science Part A" -> "J. Polym. Sci. A"
-- `coden` (string): an external identifier, the [CODEN code](). 6 characters,
-  all upper-case.
-- `issnp` (string): Print ISSN
-- `issne` (string): Electronic ISSN
-- `default_license` (string, slug): short name (eg, "CC-BY-SA") for the
-  default/recommended license for works published in this container
-- `original_name` (string): native name (if `name` is translated)
-- `platform` (string): hosting platform: OJS, wordpress, scielo, etc
-- `mimetypes` (array of string): formats that this container publishes all works
-  under (eg, 'application/pdf', 'text/html')
-- `first_year` (integer): first year of publication
-- `last_year` (integer): final year of publication (implies that container is no longer active)
-- `languages` (array of strings): ISO codes; the first entry is considered the
-  "primary" language (if that makes sense)
-- `country` (string): ISO abbreviation (two characters) for the country this
-  container is published in
-- `aliases` (array of strings): significant alternative names or abbreviations
-  for this container (not just capitalization/punctuation)
-- `region` (string, slug): continent/world-region (vocabulary is TODO)
-- `discipline` (string, slug): highest-level subject aread (vocabulary is TODO)
-- `urls` (array of strings): known homepage URLs for this container (first in array is default)
-
-Additional fields used in analytics and "curration" tracking:
-
-- `doaj` (object)
-  - `as_of` (string, ISO datetime): datetime of most recent check; if not set,
-    not actually in DOAJ
-  - `seal` (bool): has DOAJ seal
-  - `work_level` (bool): whether work-level publications are registered with DOAJ
-  - `archive` (array of strings): preservation archives
-- `road` (object)
-  - `as_of` (string, ISO datetime): datetime of most recent check; if not set,
-    not actually in ROAD
-- `kbart` (object)
-  - `lockss`, `clockss`, `portico`, `jstor` etc (object)
-    - `year_spans` (array of arrays of integers (pairs)): year spans (inclusive)
-      for which the given archive has preserved this container
-    - `volume_spans` (array of arrays of integers (pairs)): volume spans (inclusive)
-      for which the given archive has preserved this container
-- `sherpa_romeo` (object):
-    - `color` (string): the SHERPA/RoMEO "color" of the publisher of this container
-- `doi`: TODO: include list of prefixes and which (if any) DOI registrar is used
-- `dblp` (object):
-  - `id` (string)
-- `ia` (object): Internet Archive specific fields
-  - `sim` (object): same format as `kbart` preservation above; coverage in microfilm collection
-  - `longtail` (bool): is this considered a "long-tail" open access venue
-
-For KBART and other "coverage" fields, we "over-count" on the assumption that
-works with "in-progress" status will soon actually be preserved. Elements of
-these arrays are either an integer (means that single year is preserved), or an
-array of length two (meaning everything between the two numbers (inclusive) is
-preserved).
-
-[CODEN]: https://en.wikipedia.org/wiki/CODEN
-
-## Creators
-
-- `display_name` (string, required): Full name, as will be displayed in user
-  interfaces. Eg, "Grace Hopper"
-- `given_name` (string): Also known as "first name". Eg, "Grace".
-- `surname` (string): Also known as "last name". Eg, "Hooper".
-- `orcid` (string): external identifier, as registered with ORCID.
-- `wikidata_qid` (string): external linking identifier to a Wikidata entity.
-  
-See also ["Human Names"](./style_guide.md##human-names) sub-section of style guide.
-
-## Files
-
-- `size` (integer, positive, non-zero): Size of file in bytes. Eg: 1048576.
-- `md5` (string): MD5 hash in lower-case hex. Eg: "d41efcc592d1e40ac13905377399eb9b".
-- `sha1` (string): SHA-1 hash in lower-case hex. Not required, but the most-used of the hashes and should always be included. Eg: "f013d66c7f6817d08b7eb2a93e6d0440c1f3e7f8".
-- `sha256`: SHA-256 hash in lower-case hex. Eg: "a77e4c11a57f1d757fca5754a8f83b5d4ece49a2d28596889127c1a2f3f28832".
-- `urls`: An array of "typed" URLs. Order is not meaningful, and may not be
-  preserved.
-    - `url` (string, required):
-            Eg: "https://example.edu/~frau/prcding.pdf".
-    - `rel` (string, required):
-            Eg: "webarchive".
-- `mimetype` (string): Format of the file. If XML, specific schema can be
-  included after a `+`. Example: "application/pdf"
-- `release_ids` (array of string identifiers): references to `release` entities that this
-  file represents a manifestation of. Note that a single file can contain
-  multiple release references (eg, a PDF containing a full issue with many
-  articles), and that a release will often have multiple files (differing only
-  by watermarks, or different digitizations of the same printed work, or
-  variant MIME/media types of the same published work).
-
-## Filesets
-
-Warning: This schema is not yet stable.
-
-- `manifest` (array of objects): each entry represents a file
-  - `path` (string, required): relative path to file (including filename)
-  - `size` (integer, required): in bytes
-  - `md5` (string): MD5 hash in lower-case hex
-  - `sha1` (string): SHA-1 hash in lower-case hex
-  - `sha256` (string): SHA-256 hash in lower-case hex
-  - `extra` (object): any extra metadata about this specific file
-- `urls`: An array of "typed" URLs. Order is not meaningful, and may not be
-  preserved.
-    - `url` (string, required):
-            Eg: "https://example.edu/~frau/prcding.pdf".
-    - `rel` (string, required):
-            Eg: "webarchive".
-- `release_ids` (array of string identifiers): references to `release` entities
-
-## Webcaptures
-
-Warning: This schema is not yet stable.
-
-- `cdx` (array of objects): each entry represents a distinct web resource
-  (URL). First is considered the primary/entry. Roughly aligns with CDXJ schema.
-  - `surt` (string, required): sortable URL format
-  - `timestamp` (string, datetime, required): ISO format, UTC timezone, with
-    `Z` prefix required, with second (or finer) precision. Eg,
-    "2016-09-19T17:20:24Z". Wayback timestamps (like "20160919172024") should
-    be converted naively.
-  - `url` (string, required): full URL
-  - `mimetype` (string): content type of the resource
-  - `status_code` (integer, signed): HTTP status code
-  - `sha1` (string, required): SHA-1 hash in lower-case hex
-  - `sha256` (string): SHA-256 hash in lower-case hex
-- `archive_urls`: An array of "typed" URLs where this snapshot can be found.
-  Can be wayback/memento instances, or direct links to a WARC file containing
-  all the capture resources.  Often will only be a single archive. Order is not
-  meaningful, and may not be preserved.
-    - `url` (string, required):
-            Eg: "https://example.edu/~frau/prcding.pdf".
-    - `rel` (string, required): Eg: "wayback" or "warc"
-- `original_url` (string): base URL of the resource. May reference a specific
-  CDX entry, or maybe in normalized form.
-- `timestamp` (string, datetime): same format as CDX line timestamp (UTC, etc).
-  Corresponds to the overall capture timestamp. Can be the earliest of CDX
-  timestamps if that makes sense
-- `release_ids` (array of string identifiers): references to `release` entities
-
-## Releases
-
-- `title` (string, required): the display title of the release. May include subtitle.
-- `original_title` (string): the full original language title, if `title` is translated
-- `work_id` (fatcat identifier; required): the (single) work that this release
-  is grouped under. If not specified in a creation (`POST`) action, the API
-  will auto-generate a work.
-- `container_id` (fatcat identifier): a (single) container that this release is
-  part of. When expanded the `container` field contains the full `container`
-  entity.
-- `release_type` (string, controlled set): represents the medium or form-factor
-  of this release; eg, "book" versus "journal article". Not necessarily
-  the same across all releases of a work. See definitions below.
-- `release_status` (string, controlled set): represents the publishing/review
-  lifecycle status of this particular release of the work. See definitions
-  below.
-- `release_date` (string, ISO date format): when this release was first made
-  publicly available. Blank if only year is known.
-- `release_year` (integer): year when this release was first made
-  publicly available; should match `release_date` if both are known.
-- `doi` (string): full DOI number, lower-case. Example: "10.1234/abcde.789".
-  See the "External Identifiers" section of style guide.
-- `wikidata_qid` (string): external identifier for Wikidata entities. These are
-  integers prefixed with "Q", like "Q4321". Each `release` entity can be
-  associated with at most one Wikidata entity (this field is not an array), and
-  Wikidata entities should be associated with at most a single `release`. In
-  the future it may be possible to associate Wikidata entities with `work`
-  entities instead. See the "External Identifiers" section of style guide.
-- `isbn13` (string): external identifier for books. ISBN-9 and other formats
-  should be converted to canonical ISBN-13. See the "External Identifiers"
-  section of style guide.
-- `pmid` (string): external identifier for PubMed database. These are bare
-  integers, but stored in a string format. See the "External Identifiers"
-  section of style guide.
-- `pmcid` (string): external identifier for PubMed Central database. These are
-  integers prefixed with "PMC" (upper case), like "PMC4321". See the "External
-  Identifiers" section of style guide.
-- `core_id` (string): external identifier for the [CORE] open access
-  aggregator. These identifiers are integers, but stored in string format. See
-  the "External Identifiers" section of style guide.
-- `arxiv_id` (string) external identifier to a (version-specific) [arxiv.org]()
-  work
-- `jstor_id` (string) external identifier for works in JSTOR
-- `volume` (string): optionally, stores the specific volume of a serial
-  publication this release was published in.
-        type: string
-- `issue` (string): optionally, stores the specific issue of a serial
-  publication this release was published in.
-- `pages` (string): the pages (within a volume/issue of a publication) that
-  this release can be looked up under. This is a free-form string, and could
-  represent the first page, a range of pages, or even prefix pages (like
-  "xii-xxx").
-- `publisher` (string): name of the publishing entity. This does not need to be
-  populated if the associated `container` entity has the publisher field set,
-  though it is acceptable to duplicate, as the publishing entity of a container
-  may differ over time. Should be set for singleton releases, like books.
-- `language` (string, slug): the primary language used in this particular release of
-  the work. Only a single language can be specified; additional languages can
-  be stored in "extra" metadata (TODO: which field?). This field should be a
-  valid RFC1766/ISO639 language code (two letters). AKA, a controlled
-  vocabulary, not a free-form name of the language.
-- `license_slug` (string, slug): the license of this release. Usually a
-  creative commons short code (eg, `CC-BY`), though a small number of other
-  short names for publisher-specific licenses are included (TODO: list these).
-- `contribs` (array of objects): an array of authorship and other `creator` contributions to this
-  release. Contribution fields include:
-    - `index` (integer, optional): the (zero-indexed) order of this
-      author. Authorship order has significance in many fields. Non-author
-      contributions (illustration, translation, editorship) may or may not be
-      ordered, depending on context, but index numbers should be unique per
-      release (aka, there should not be "first author" and "first translator")
-    - `creator_id` (identifier): if known, a reference to a specific `creator`
-    - `raw_name` (string): the name of the contributor, as attributed in the
-      text of this work. If the `creator_id` is linked, this may be different
-      from the `display_name`; if a creator is not linked, this field is
-      particularly important. Syntax and name order is not specified, but most
-      often will be "display order", not index/alphabetical (in Western
-      tradition, surname followed by given name).
-    - `role` (string, of a set): the type of contribution, from a controlled
-      vocabulary. TODO: vocabulary needs review.
-    - `extra` (string): additional context can go here. For example, author
-      affiliation, "this is the corresponding author", etc.
-- `refs` (array of ident strings): references (aka, citations) to other releases. References
-  can only be linked to a specific target release (not a work), though it may
-  be ambiguous which release of a work is being referenced if the citation is
-  not specific enough. Reference fields include:
-    - `index` (integer, optional): reference lists and bibliographies almost
-      always have an implicit order. Zero-indexed. Note that this is distinct
-      from the `key` field.
-    - `target_release_id` (fatcat identifier): if known, and the release
-      exists, a cross-reference to the Fatcat entity
-    - `extra` (JSON, optional): additional citation format metadata can be
-      stored here, particularly if the citation schema does not align. Common
-      fields might be "volume", "authors", "issue", "publisher", "url", and
-      external identifiers ("doi", "isbn13").
-    - `key` (string): works often reference works with a short slug or index
-      number, which can be captured here. For example, "[BROWN2017]". Keys
-      generally supersede the `index` field, though both can/should be
-      supplied.
-    - `year` (integer): year of publication of the cited release.
-    - `container_title` (string): if applicable, the name of the container of
-      the release being cited, as written in the citation (usually an
-      abbreviation).
-    - `title` (string): the title of the work/release being cited, as written.
-    - `locator` (string): a more specific reference into the work/release being
-      cited, for example the page number(s). For web reference, store the URL
-      in "extra", not here.
-- `abstracts` (array of objects): see below
-  - `sha1` (string, hex, required): reference to the abstract content (string).
-    Example: "3f242a192acc258bdfdb151943419437f440c313"
-  - `content` (string): The abstract raw content itself. Example: `<jats:p>Some
-    abstract thing goes here</jats:p>`
-  - `mimetype` (string): not formally required, but should effectively always get
-    set. `text/plain` if the abstract doesn't have a structured format
-  - `lang` (string, controlled set): the human language this abstract is in. See
-    the `lang` field of release for format and vocabulary.
-
-[arxiv.org]: https://arxiv.org
-
-#### `extra` Fields
-
-- `crossref` (object), for extra crossref-specific metadata
-    - `subject` (array of strings) for subject/category of content
-    - `type` (string) raw/original Crossref type
-    - `alternative-id` (array of strings)
-    - `archive` (array of strings), indicating preservation services deposited
-    - `funder` (object/dictionary)
-- `aliases` (array of strings) for additional titles this release might be
-  known by
-- `container_name` (string) if not matched to a container entity
-- `subtitle` (string)
-- `group-title` (string) for releases within an collection/group
-- `translation_of` (release identifier) if this release is a translation of
-  another (usually under the same work)
-- `withdrawn_date` (string, ISO date format): if this release has been
-  retracted (post-publication) or withdrawn (pre- or post-publication), this is
-  the datetime of that event. Retractions also result in a `retraction` release
-  under the same `work` entity. This is intended to migrate from "extra" to a
-  full release entity field.
-
-#### `release_type` Vocabulary
-
-This vocabulary is based on the 
-[CSL types](http://docs.citationstyles.org/en/stable/specification.html#appendix-iii-types),
-with a small number of (proposed) extensions:
-
-- `article-magazine`
-- `article-journal`, including pre-prints and working papers
-- `book`
-- `chapter` is allowed as they are frequently referenced and read independent
-  of the entire book. The data model does not currently support linking a
-  subset of a release to an entity representing the entire release. The
-  release/work/file distinctions should not be used to group multiple chapters under
-  a single work; a book chapter can be it's own work. A paper which is
-  republished as a chapter (eg, in a collection, or "edited" book) can have
-  both releases under one work. The criteria of whether to "split" a book and
-  have release entities for each chapter is whether the chapter has been
-  cited/reference as such.
-- `dataset`
-- `entry`, which can be used for generic web resources like question/answer
-  site entries.
-- `entry-encyclopedia`
-- `manuscript`
-- `paper-conference`
-- `patent`
-- `post-weblog` for blog entries
-- `report`
-- `review`, for things like book reviews, not the "literature review" form of
-  `article-journal`, nor peer reviews (see `peer_review`)
-- `speech` can be used for eg, slides and recorded conference presentations
-  themselves, as distinct from `paper-conference`
-- `thesis`
-- `webpage`
-- `peer_review` (fatcat extension)
-- `software` (fatcat extension)
-- `standard` (fatcat extension), for technical standards like RFCs
-- `abstract` (fatcat extension), for releases that are only an abstract of a
-  larger work. In particular, translations. Many are granted DOIs.
-- `editorial` (custom extension) for columns, "in this issue", and other
-  content published along peer-reviewed content in journals. Many are granted DOIs.
-- `letter` for "letters to the editor", "authors respond", and
-  sub-article-length published content. Many are granted DOIs.
-- `stub` (fatcat extension) for releases which have notable external
-  identifiers, and thus are included "for completeness", but don't seem to
-  represent a "full work".
-  
-An example of a `stub` might be a paper that gets an extra DOI by accident; the
-primary DOI should be a full release, and the accidental DOI can be a `stub`
-release under the same work. `stub` releases shouldn't be considered full
-releases when counting or aggregating (though if technically difficult this may
-not always be implemented). Other things that can be categorized as stubs
-(which seem to often end up mis-categorized as full articles in bibliographic
-databases):
-
-- commercial advertisements
-- "trap" or "honey pot" works, which are fakes included in databases to
-  detect re-publishing without attribution
-- "This page is intentionally blank"
-- "About the author", "About the editors", "About the cover"
-- "Acknowledgments"
-- "Notices"
-
-All other CSL types are also allowed, though they are mostly out of scope:
-
-- `article` (generic; should usually be some other type)
-- `article-newspaper`
-- `bill`
-- `broadcast`
-- `entry-dictionary`
-- `figure`
-- `graphic`
-- `interview`
-- `legislation`
-- `legal_case`
-- `map`
-- `motion_picture`
-- `musical_score`
-- `pamphlet`
-- `personal_communication`
-- `post`
-- `review-book`
-- `song`
-- `treaty`
-
-For the purpose of statistics, the following release types are considered
-"papers":
-
-- `article-journal`
-- `chapter`
-- `paper-conference`
-- `thesis`
-
-#### `release_status` Vocabulary
-
-These roughly follow the [DRIVER](http://web.archive.org/web/20091109125137/http://www2.lse.ac.uk/library/versions/VERSIONS_Toolkit_v1_final.pdf) publication version guidelines, with the addition of a `retracted` status.
-
-- `draft` is an early version of a work which is not considered for peer
-  review. Sometimes these are posted to websites or repositories for early
-  comments and feedback.
-- `submitted` is the version that was submitted for publication. Also known as
-  "pre-print", "pre-review", "under review". Note that this doesn't imply that
-  the work was every actually submitted, reviewed, or accepted for publication,
-  just that this is the version that "would be". Most versions in pre-print
-  repositories are likely to have this status.
-- `accepted` is a version that has undergone peer review and accepted for
-  published, but has not gone through any publisher copy editing or
-  re-formatting. Also known as "post-print", "author's manuscript",
-  "publisher's proof".
-- `published` is the version that the publisher distributes. May include minor
-  (gramatical, typographical, broken link, aesthetic) corrections. Also known
-  as "version of record", "final publication version", "archival copy".
-- `updated`: post-publication significant updates (considered a separate release
-  in Fatcat). Also known as "correction" (in the context of either a published
-  "correction notice", or the full new version)
-- `retraction` for post-publication retraction notices (should be a release
-  under the same work as the `published` release)
-
-Note that in the case of a retraction, the original publication does not get
-status `retracted`, only the retraction notice does. The original publication
-does get a `widthdrawn_date` metadata field set.
-
-When blank, indicates status isn't known, and wasn't inferred at creation time.
-Can often be interpreted as `published`, but be careful!
-
-#### `contribs.role` Vocabulary
-
-- `author`
-- `translator`
-- `illustrator`
-- `editor`
-
-All other CSL role types are also allowed, though are mostly out of scope for
-Fatcat:
-
-- `collection-editor`
-- `composer`
-- `container-author`
-- `director`
-- `editorial-director`
-- `editortranslator`
-- `interviewer`
-- `original-author`
-- `recipient`
-- `reviewed-author`
-
-If blank, indicates that type of contribution is not known; this can often be
-interpreted as authorship.
-
-## Works
-
-Works have no fields! They just group releases.
-
diff --git a/guide/src/entity_file.md b/guide/src/entity_file.md
new file mode 100644
index 00000000..7719adfd
--- /dev/null
+++ b/guide/src/entity_file.md
@@ -0,0 +1,24 @@
+
+# File Entity Reference
+
+## Fields
+
+- `size` (integer, positive, non-zero): Size of file in bytes. Eg: 1048576.
+- `md5` (string): MD5 hash in lower-case hex. Eg: "d41efcc592d1e40ac13905377399eb9b".
+- `sha1` (string): SHA-1 hash in lower-case hex. Not technically required, but
+  the most-used of the hash fields and should always be included. Eg:
+  "f013d66c7f6817d08b7eb2a93e6d0440c1f3e7f8".
+- `sha256`: SHA-256 hash in lower-case hex. Eg:
+  "a77e4c11a57f1d757fca5754a8f83b5d4ece49a2d28596889127c1a2f3f28832".
+- `urls`: An array of "typed" URLs. Order is not meaningful, and may not be
+  preserved.
+    - `url` (string, required): Eg: "https://example.edu/~frau/prcding.pdf".
+    - `rel` (string, required): Eg: "webarchive".
+- `mimetype` (string): Format of the file. If XML, specific schema can be
+  included after a `+`. Example: "application/pdf"
+- `release_ids` (array of string identifiers): references to `release` entities
+  that this file represents a manifestation of. Note that a single file can
+  contain multiple release references (eg, a PDF containing a full issue with
+  many articles), and that a release will often have multiple files (differing
+  only by watermarks, or different digitizations of the same printed work, or
+  variant MIME/media types of the same published work).
diff --git a/guide/src/entity_fileset.md b/guide/src/entity_fileset.md
new file mode 100644
index 00000000..7e5ac757
--- /dev/null
+++ b/guide/src/entity_fileset.md
@@ -0,0 +1,21 @@
+
+# Fileset Entity Reference
+
+## Fields
+
+Warning: This schema is not yet stable.
+
+- `manifest` (array of objects): each entry represents a file
+  - `path` (string, required): relative path to file (including filename)
+  - `size` (integer, required): in bytes
+  - `md5` (string): MD5 hash in lower-case hex
+  - `sha1` (string): SHA-1 hash in lower-case hex
+  - `sha256` (string): SHA-256 hash in lower-case hex
+  - `extra` (object): any extra metadata about this specific file
+- `urls`: An array of "typed" URLs. Order is not meaningful, and may not be
+  preserved.
+    - `url` (string, required):
+            Eg: "https://example.edu/~frau/prcding.pdf".
+    - `rel` (string, required):
+            Eg: "webarchive".
+- `release_ids` (array of string identifiers): references to `release` entities
diff --git a/guide/src/entity_release.md b/guide/src/entity_release.md
new file mode 100644
index 00000000..709a020c
--- /dev/null
+++ b/guide/src/entity_release.md
@@ -0,0 +1,303 @@
+
+# Release Entity Reference
+
+## Fields
+
+- `title` (string, required): the display title of the release. May include subtitle.
+- `subtitle` (string): intended only to be used primarily with books, not
+  journal articles. Subtitle may also be appended to the `title` instead of
+  populating this field.
+- `original_title` (string): the full original language title, if `title` is translated
+- `work_id` (fatcat identifier; required): the (single) work that this release
+  is grouped under. If not specified in a creation (`POST`) action, the API
+  will auto-generate a work.
+- `container_id` (fatcat identifier): a (single) container that this release is
+  part of. When expanded the `container` field contains the full `container`
+  entity.
+- `release_type` (string, controlled set): represents the medium or form-factor
+  of this release; eg, "book" versus "journal article". Not necessarily
+  the same across all releases of a work. See definitions below.
+- `release_state` (string, controlled set): represents the publishing/review
+  lifecycle status of this particular release of the work. See definitions
+  below.
+- `release_date` (string, ISO date format): when this release was first made
+  publicly available. Blank if only year is known.
+- `release_year` (integer): year when this release was first made
+  publicly available; should match `release_date` if both are known.
+- `ext_ids` (key/value object of string-to-string mappings): external
+  identifiers. At least an empty `ext_ids` object is always required for
+  release entities, so individual identifiers can be accessed directly.
+- `volume` (string): optionally, stores the specific volume of a serial
+  publication this release was published in.
+        type: string
+- `issue` (string): optionally, stores the specific issue of a serial
+  publication this release was published in.
+- `pages` (string): the pages (within a volume/issue of a publication) that
+  this release can be looked up under. This is a free-form string, and could
+  represent the first page, a range of pages, or even prefix pages (like
+  "xii-xxx").
+- `publisher` (string): name of the publishing entity. This does not need to be
+  populated if the associated `container` entity has the publisher field set,
+  though it is acceptable to duplicate, as the publishing entity of a container
+  may differ over time. Should be set for singleton releases, like books.
+- `language` (string, slug): the primary language used in this particular release of
+  the work. Only a single language can be specified; additional languages can
+  be stored in "extra" metadata (TODO: which field?). This field should be a
+  valid RFC1766/ISO639 language code (two letters). AKA, a controlled
+  vocabulary, not a free-form name of the language.
+- `license_slug` (string, slug): the license of this release. Usually a
+  creative commons short code (eg, `CC-BY`), though a small number of other
+  short names for publisher-specific licenses are included (TODO: list these).
+- `contribs` (array of objects): an array of authorship and other `creator` contributions to this
+  release. Contribution fields include:
+    - `index` (integer, optional): the (zero-indexed) order of this
+      author. Authorship order has significance in many fields. Non-author
+      contributions (illustration, translation, editorship) may or may not be
+      ordered, depending on context, but index numbers should be unique per
+      release (aka, there should not be "first author" and "first translator")
+    - `creator_id` (identifier): if known, a reference to a specific `creator`
+    - `raw_name` (string): the name of the contributor, as attributed in the
+      text of this work. If the `creator_id` is linked, this may be different
+      from the `display_name`; if a creator is not linked, this field is
+      particularly important. Syntax and name order is not specified, but most
+      often will be "display order", not index/alphabetical (in Western
+      tradition, surname followed by given name).
+    - `role` (string, of a set): the type of contribution, from a controlled
+      vocabulary. TODO: vocabulary needs review.
+    - `extra` (string): additional context can go here. For example, author
+      affiliation, "this is the corresponding author", etc.
+- `refs` (array of ident strings): references (aka, citations) to other releases. References
+  can only be linked to a specific target release (not a work), though it may
+  be ambiguous which release of a work is being referenced if the citation is
+  not specific enough. Reference fields include:
+    - `index` (integer, optional): reference lists and bibliographies almost
+      always have an implicit order. Zero-indexed. Note that this is distinct
+      from the `key` field.
+    - `target_release_id` (fatcat identifier): if known, and the release
+      exists, a cross-reference to the Fatcat entity
+    - `extra` (JSON, optional): additional citation format metadata can be
+      stored here, particularly if the citation schema does not align. Common
+      fields might be "volume", "authors", "issue", "publisher", "url", and
+      external identifiers ("doi", "isbn13").
+    - `key` (string): works often reference works with a short slug or index
+      number, which can be captured here. For example, "[BROWN2017]". Keys
+      generally supersede the `index` field, though both can/should be
+      supplied.
+    - `year` (integer): year of publication of the cited release.
+    - `container_title` (string): if applicable, the name of the container of
+      the release being cited, as written in the citation (usually an
+      abbreviation).
+    - `title` (string): the title of the work/release being cited, as written.
+    - `locator` (string): a more specific reference into the work/release being
+      cited, for example the page number(s). For web reference, store the URL
+      in "extra", not here.
+- `abstracts` (array of objects): see below
+  - `sha1` (string, hex, required): reference to the abstract content (string).
+    Example: "3f242a192acc258bdfdb151943419437f440c313"
+  - `content` (string): The abstract raw content itself. Example: `<jats:p>Some
+    abstract thing goes here</jats:p>`
+  - `mimetype` (string): not formally required, but should effectively always get
+    set. `text/plain` if the abstract doesn't have a structured format
+  - `lang` (string, controlled set): the human language this abstract is in. See
+    the `lang` field of release for format and vocabulary.
+
+#### External Identifiers (`ext_ids`)
+
+The `ext_ids` object name-spaces external identifiers and makes it easier to
+add new identifiers to the schema in the future.
+
+- `doi` (string): full DOI number, lower-case. Example: "10.1234/abcde.789".
+  See the "External Identifiers" section of style guide for more notes
+  about DOIs specifically.
+- `wikidata_qid` (string): external identifier for Wikidata entities. These are
+  integers prefixed with "Q", like "Q4321". Each `release` entity can be
+  associated with at most one Wikidata entity (this field is not an array), and
+  Wikidata entities should be associated with at most a single `release`. In
+  the future it may be possible to associate Wikidata entities with `work`
+  entities instead.
+- `isbn13` (string): external identifier for books. ISBN-9 and other formats
+  should be converted to canonical ISBN-13.
+- `pmid` (string): external identifier for PubMed database. These are bare
+  integers, but stored in a string format.
+- `pmcid` (string): external identifier for PubMed Central database. These are
+  integers prefixed with "PMC" (upper case), like "PMC4321". Versioned PMCIDs
+  can also be stored (eg, "PMC4321.1"; future clarification of whether versions
+  should *always* be stored will be needed.
+- `core` (string): external identifier for the [CORE] open access
+  aggregator. These identifiers are integers, but stored in string format.
+- `arxiv` (string) external identifier to a (version-specific) [arxiv.org]()
+  work. For releases, must always include the `vN` suffix (eg, `v3`).
+- `jstor` (string) external identifier for works in JSTOR.
+- `ark` (string) ARK identifer
+- `mag` (string) Microsoft Academic Graph identifier
+
+[arxiv.org]: https://arxiv.org
+
+#### `extra` Fields
+
+- `crossref` (object), for extra crossref-specific metadata
+    - `subject` (array of strings) for subject/category of content
+    - `type` (string) raw/original Crossref type
+    - `alternative-id` (array of strings)
+    - `archive` (array of strings), indicating preservation services deposited
+    - `funder` (object/dictionary)
+- `aliases` (array of strings) for additional titles this release might be
+  known by
+- `container_name` (string) if not matched to a container entity
+- `subtitle` (string)
+- `group-title` (string) for releases within an collection/group
+- `translation_of` (release identifier) if this release is a translation of
+  another (usually under the same work)
+- `withdrawn_date` (string, ISO date format): if this release has been
+  retracted (post-publication) or withdrawn (pre- or post-publication), this is
+  the datetime of that event. Retractions also result in a `retraction` release
+  under the same `work` entity. This is intended to migrate from "extra" to a
+  full release entity field.
+
+#### `release_type` Vocabulary
+
+This vocabulary is based on the 
+[CSL types](http://docs.citationstyles.org/en/stable/specification.html#appendix-iii-types),
+with a small number of (proposed) extensions:
+
+- `article-magazine`
+- `article-journal`, including pre-prints and working papers
+- `book`
+- `chapter` is allowed as they are frequently referenced and read independent
+  of the entire book. The data model does not currently support linking a
+  subset of a release to an entity representing the entire release. The
+  release/work/file distinctions should not be used to group multiple chapters under
+  a single work; a book chapter can be it's own work. A paper which is
+  republished as a chapter (eg, in a collection, or "edited" book) can have
+  both releases under one work. The criteria of whether to "split" a book and
+  have release entities for each chapter is whether the chapter has been
+  cited/reference as such.
+- `dataset`
+- `entry`, which can be used for generic web resources like question/answer
+  site entries.
+- `entry-encyclopedia`
+- `manuscript`
+- `paper-conference`
+- `patent`
+- `post-weblog` for blog entries
+- `report`
+- `review`, for things like book reviews, not the "literature review" form of
+  `article-journal`, nor peer reviews (see `peer_review`)
+- `speech` can be used for eg, slides and recorded conference presentations
+  themselves, as distinct from `paper-conference`
+- `thesis`
+- `webpage`
+- `peer_review` (fatcat extension)
+- `software` (fatcat extension)
+- `standard` (fatcat extension), for technical standards like RFCs
+- `abstract` (fatcat extension), for releases that are only an abstract of a
+  larger work. In particular, translations. Many are granted DOIs.
+- `editorial` (custom extension) for columns, "in this issue", and other
+  content published along peer-reviewed content in journals. Many are granted DOIs.
+- `letter` for "letters to the editor", "authors respond", and
+  sub-article-length published content. Many are granted DOIs.
+- `stub` (fatcat extension) for releases which have notable external
+  identifiers, and thus are included "for completeness", but don't seem to
+  represent a "full work".
+  
+An example of a `stub` might be a paper that gets an extra DOI by accident; the
+primary DOI should be a full release, and the accidental DOI can be a `stub`
+release under the same work. `stub` releases shouldn't be considered full
+releases when counting or aggregating (though if technically difficult this may
+not always be implemented). Other things that can be categorized as stubs
+(which seem to often end up mis-categorized as full articles in bibliographic
+databases):
+
+- commercial advertisements
+- "trap" or "honey pot" works, which are fakes included in databases to
+  detect re-publishing without attribution
+- "This page is intentionally blank"
+- "About the author", "About the editors", "About the cover"
+- "Acknowledgments"
+- "Notices"
+
+All other CSL types are also allowed, though they are mostly out of scope:
+
+- `article` (generic; should usually be some other type)
+- `article-newspaper`
+- `bill`
+- `broadcast`
+- `entry-dictionary`
+- `figure`
+- `graphic`
+- `interview`
+- `legislation`
+- `legal_case`
+- `map`
+- `motion_picture`
+- `musical_score`
+- `pamphlet`
+- `personal_communication`
+- `post`
+- `review-book`
+- `song`
+- `treaty`
+
+For the purpose of statistics, the following release types are considered
+"papers":
+
+- `article-journal`
+- `chapter`
+- `paper-conference`
+- `thesis`
+
+#### `release_state` Vocabulary
+
+These roughly follow the [DRIVER](http://web.archive.org/web/20091109125137/http://www2.lse.ac.uk/library/versions/VERSIONS_Toolkit_v1_final.pdf) publication version guidelines, with the addition of a `retracted` status.
+
+- `draft` is an early version of a work which is not considered for peer
+  review. Sometimes these are posted to websites or repositories for early
+  comments and feedback.
+- `submitted` is the version that was submitted for publication. Also known as
+  "pre-print", "pre-review", "under review". Note that this doesn't imply that
+  the work was every actually submitted, reviewed, or accepted for publication,
+  just that this is the version that "would be". Most versions in pre-print
+  repositories are likely to have this status.
+- `accepted` is a version that has undergone peer review and accepted for
+  published, but has not gone through any publisher copy editing or
+  re-formatting. Also known as "post-print", "author's manuscript",
+  "publisher's proof".
+- `published` is the version that the publisher distributes. May include minor
+  (gramatical, typographical, broken link, aesthetic) corrections. Also known
+  as "version of record", "final publication version", "archival copy".
+- `updated`: post-publication significant updates (considered a separate release
+  in Fatcat). Also known as "correction" (in the context of either a published
+  "correction notice", or the full new version)
+- `retraction` for post-publication retraction notices (should be a release
+  under the same work as the `published` release)
+
+Note that in the case of a retraction, the original publication does not get
+state `retracted`, only the retraction notice does. The original publication
+does get a `withdrawn_status` metadata field set.
+
+When blank, indicates status isn't known, and wasn't inferred at creation time.
+Can often be interpreted as `published`, but be careful!
+
+#### `contribs.role` Vocabulary
+
+- `author`
+- `translator`
+- `illustrator`
+- `editor`
+
+All other CSL role types are also allowed, though are mostly out of scope for
+Fatcat:
+
+- `collection-editor`
+- `composer`
+- `container-author`
+- `director`
+- `editorial-director`
+- `editortranslator`
+- `interviewer`
+- `original-author`
+- `recipient`
+- `reviewed-author`
+
+If blank, indicates that type of contribution is not known; this can often be
+interpreted as authorship.
diff --git a/guide/src/entity_webcapture.md b/guide/src/entity_webcapture.md
new file mode 100644
index 00000000..8c5615fb
--- /dev/null
+++ b/guide/src/entity_webcapture.md
@@ -0,0 +1,32 @@
+
+# Web Capture Entity Reference
+
+## Fields
+
+Warning: This schema is not yet stable.
+
+- `cdx` (array of objects): each entry represents a distinct web resource
+  (URL). First is considered the primary/entry. Roughly aligns with CDXJ schema.
+  - `surt` (string, required): sortable URL format
+  - `timestamp` (string, datetime, required): ISO format, UTC timezone, with
+    `Z` prefix required, with second (or finer) precision. Eg,
+    "2016-09-19T17:20:24Z". Wayback timestamps (like "20160919172024") should
+    be converted naively.
+  - `url` (string, required): full URL
+  - `mimetype` (string): content type of the resource
+  - `status_code` (integer, signed): HTTP status code
+  - `sha1` (string, required): SHA-1 hash in lower-case hex
+  - `sha256` (string): SHA-256 hash in lower-case hex
+- `archive_urls`: An array of "typed" URLs where this snapshot can be found.
+  Can be wayback/memento instances, or direct links to a WARC file containing
+  all the capture resources.  Often will only be a single archive. Order is not
+  meaningful, and may not be preserved.
+    - `url` (string, required):
+            Eg: "https://example.edu/~frau/prcding.pdf".
+    - `rel` (string, required): Eg: "wayback" or "warc"
+- `original_url` (string): base URL of the resource. May reference a specific
+  CDX entry, or maybe in normalized form.
+- `timestamp` (string, datetime): same format as CDX line timestamp (UTC, etc).
+  Corresponds to the overall capture timestamp. Can be the earliest of CDX
+  timestamps if that makes sense
+- `release_ids` (array of string identifiers): references to `release` entities
diff --git a/guide/src/entity_work.md b/guide/src/entity_work.md
new file mode 100644
index 00000000..1bb88b06
--- /dev/null
+++ b/guide/src/entity_work.md
@@ -0,0 +1,4 @@
+
+# Work Entity Reference
+
+Works have no fields! They just group releases.
author	Bryan Newbold <bnewbold@robocracy.org>	2019-05-20 16:57:57 -0700
committer	Bryan Newbold <bnewbold@robocracy.org>	2019-05-20 16:57:57 -0700
commit	cd829eedb5bfc7328ab5266650a625a6c88db6fa (patch)
tree	90aa164cbd7f4e86aadc25dbd036dab680c30e80 /guide
parent	eb31be2172264091e192bcb4f17ffd571253fffa (diff)
download	fatcat-cd829eedb5bfc7328ab5266650a625a6c88db6fa.tar.gz fatcat-cd829eedb5bfc7328ab5266650a625a6c88db6fa.zip