author: Bryan Newbold <bnewbold@robocracy.org> 2018-09-20 20:20:43 -0700
committer: Bryan Newbold <bnewbold@robocracy.org> 2018-09-20 20:20:43 -0700
commit: 182413ad4946d715aabf67c396d688fbb5d1c0eb
tree: 7f4c748b527c96d21fdd99a6c9f8a47908f076b7 /guide/src/entity_fields.md
parent: da8911b029f06023d5d8f8aad3cc845583e6d708
progress on guide
Diffstat (limited to 'guide/src/entity_fields.md')
-rw-r--r-- guide/src/entity_fields.md | 302
1 files changed, 302 insertions, 0 deletions
diff --git a/guide/src/entity_fields.md b/guide/src/entity_fields.md
index 1a9e7bd4..0d0b2d6f 100644
--- a/guide/src/entity_fields.md
+++ b/guide/src/entity_fields.md
@@ -1 +1,303 @@
# Entity Field Reference
+
+All entities have:
+
+- `extra`: free-form JSON metadata
+
+The "extra" field is an "escape hatch" for fields that are not in the
+regular schema. It is intended to enable gradual evolution of the schema, as
+well as accommodating niche or field-specific content. That being said,
+reasonable limits should be adhered to.
+
+## Containers
+
+- `name` (string, required): The title of the publication, as used in
+ international indexing services. Eg, "Journal of Important Results". Not
+ necessarily in the native language, but also not necessarily in English.
+ Alternative titles (and translations) can be stored in "extra" metadata
+ (TODO: what field?).
+- `publisher` (string): The name of the publishing organization. Eg, "Society
+ of Curious Students".
+- `issnl` (string): an external identifier, with registration controlled by the
+ [ISSN organization](http://www.issn.org/). Registration is relatively
+ inexpensive and easy to obtain (depending on world region), so almost all
+ serial publications have one. The ISSN-L ("linking ISSN") is one of either
+ the print ("ISSNp") or electronic ("ISSNe") identifiers for a serial
+ publication; not all publications have both types of ISSN, but many do, which
+ can cause confusion. The ISSN master list is not gratis/public, but the
+ ISSN-L mapping is.
+- `wikidata_qid` (string): external linking identifier to a Wikidata entity.
+- `abbrev` (string): a commonly used abbreviation for the publication, as used
+ in citations, following the ISO 4 standard. Eg, "Journal of Polymer
+ Science Part A" -> "J. Polym. Sci. A". Alternative abbreviations can be
+ stored in "extra" metadata. (TODO: what field?)
+- `coden` (string): an external identifier, the [CODEN] code. 6 characters,
+ all upper-case.
+
+[CODEN]: https://en.wikipedia.org/wiki/CODEN
+
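The identifier shapes mentioned for containers can be checked mechanically. The following is a minimal sketch, not part of the fatcat codebase: the function names are hypothetical, the ISSN check uses the standard ISSN check-digit rule (weights 8 down to 2, modulo 11, with "X" standing for 10), and allowing digits in a CODEN is an assumption layered on the "6 characters, all upper-case" rule above.

```python
import re

# NNNN-NNNC, where C is a digit or an upper-case "X"
ISSN_RE = re.compile(r"\d{4}-\d{3}[\dX]")
# Assumption: a CODEN is six upper-case letters or digits
CODEN_RE = re.compile(r"[A-Z0-9]{6}")


def issn_check_char(digits7: str) -> str:
    """Compute the ISSN check character from the first seven digits."""
    total = sum(int(d) * w for d, w in zip(digits7, range(8, 1, -1)))
    check = (11 - total % 11) % 11
    return "X" if check == 10 else str(check)


def is_valid_issn(issn: str) -> bool:
    """True if the string has ISSN shape and a matching check character."""
    if not ISSN_RE.fullmatch(issn):
        return False
    body = issn[:4] + issn[5:8]
    return issn_check_char(body) == issn[8]


def looks_like_coden(coden: str) -> bool:
    """True if the string is six upper-case alphanumeric characters."""
    return bool(CODEN_RE.fullmatch(coden))
```

A check like this only confirms that a value is well-formed; it says nothing about whether the identifier is actually registered.
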
+## Creators
+
+See the ["Human Names"](./style_guide.md#human-names) sub-section of the style
+guide.
+
+- `display_name` (string, required): Eg, "Grace Hopper".
+- `given_name` (string): Eg, "Grace".
+- `surname` (string): Eg, "Hopper".
+- `orcid` (string): external identifier, as registered with ORCID.
+- `wikidata_qid` (string): external linking identifier to a Wikidata entity.
+
+## Files
+
+- `size` (positive, non-zero integer): Eg, 1048576.
+- `sha1` (string): Eg, "f013d66c7f6817d08b7eb2a93e6d0440c1f3e7f8".
+- `md5` (string): Eg, "d41efcc592d1e40ac13905377399eb9b".
+- `sha256` (string): Eg, "a77e4c11a57f1d757fca5754a8f83b5d4ece49a2d28596889127c1a2f3f28832".
+- `urls`: An array of "typed" URLs. Order is not meaningful, and may not be
+ preserved.
+ - `url` (string, required): Eg, "https://example.edu/~frau/prcding.pdf".
+ - `rel` (string, required): Eg, "webarchive".
+- `mimetype` (string): Eg, "application/pdf".
+- `releases` (array of identifiers): references to `release` entities that this
+ file represents a manifestation of. Note that a single file can contain
+ multiple release references (eg, a PDF containing a full issue with many
+ articles), and that a release will often have multiple files (differing only
+ by watermarks, or different digitizations of the same printed work, or
+ variant MIME/media types of the same published work). See also
+ "Work/Release/File Distinctions".
+
+## Releases
+
+- `title` (string, required)
+- `work_id` (string): Eg, "q3nouwy3nnbsvo3h5klxsx4a7y".
+- `container` (`container` entity): Optional; GET-only.
+- `files` (array of `file` entities): Optional; GET-only.
+- `container_id` (string): Eg, "q3nouwy3nnbsvo3h5klxsx4a7y".
+- `release_type` (string): Eg, "book".
+- `release_status` (string): Eg, "pre-print".
+- `release_date` (string, date format)
+- `doi` (string): Eg, "10.1234/abcde.789". See the "External Identifiers"
+ section of style guide.
+- `isbn13` (string): external identifier for books. ISBN-10 and other formats
+ should be converted to canonical ISBN-13. See the "External Identifiers"
+ section of style guide.
+- `core_id` (string): external identifier for the CORE open access
+ aggregator. These identifiers are integers, but stored in string format. See
+ the "External Identifiers" section of style guide.
+- `pmid` (string): external identifier for PubMed database. These are bare
+ integers, but stored in a string format. See the "External Identifiers"
+ section of style guide.
+- `pmcid` (string): external identifier for PubMed Central database. These are
+ integers prefixed with "PMC" (upper case), like "PMC4321". See the "External
+ Identifiers" section of style guide.
+- `wikidata_qid` (string): external identifier for Wikidata entities. These are
+ integers prefixed with "Q", like "Q4321". Each `release` entity can be
+ associated with at most one Wikidata entity (this field is not an array), and
+ Wikidata entities should be associated with at most a single `release`. In
+ the future it may be possible to associate Wikidata entities with `work`
+ entities instead. See the "External Identifiers" section of style guide.
+- `volume` (string): optionally, stores the specific volume of a serial
+ publication this release was published in.
+- `issue` (string): optionally, stores the specific issue of a serial
+ publication this release was published in.
+- `pages` (string): the pages (within a volume/issue of a publication) that
+ this release can be looked up under. This is a free-form string, and could
+ represent the first page, a range of pages, or even prefix pages (like
+ "xii-xxx").
+- `publisher` (string): name of the publishing entity. This does not need to be
+ populated if the associated `container` entity has the publisher field set,
+ though it is acceptable to duplicate, as the publishing entity of a container
+ may differ over time. Should be set for singleton releases, like books.
+- `language` (string): the primary language used in this particular release of
+ the work. Only a single language can be specified; additional languages can
+ be stored in "extra" metadata (TODO: which field?). This field should be a
+ valid RFC1766/ISO639-1 language code ("with extensions"), aka a controlled
+ vocabulary, not a free-form name of the language.
+- `contribs`: an array of authorship and other `creator` contributions to this
+ release. Contribution fields include:
+ - `index` (integer, optional): the (zero-indexed) order of this
+ author. Authorship order has significance in many fields. Non-author
+ contributions (illustration, translation, editorship) may or may not be
+ ordered, depending on context, but index numbers should be unique per
+ release (ie, there should not be both a "first author" and a "first translator").
+ - `creator_id` (identifier): if known, a reference to a specific `creator`
+ - `raw_name` (string): the name of the contributor, as attributed in the
+ text of this work. If the `creator_id` is linked, this may be different
+ from the `display_name`; if a creator is not linked, this field is
+ particularly important. Syntax and name order are not specified, but most
+ often will be "display order", not index/alphabetical order (which, in
+ Western tradition, puts the surname before the given name).
+ - `role` (string, of a set): the type of contribution, from a controlled
+ vocabulary. TODO: vocabulary needs review.
+ - `extra` (string): additional context can go here. For example, author
+ affiliation, "this is the corresponding author", etc.
+- `refs`: an array of references (aka, citations) to other releases. References
+ can only be linked to a specific target release (not a work), though it may
+ be ambiguous which release of a work is being referenced if the citation is
+ not specific enough. Reference fields include:
+ - `index` (integer)
+ - `target_release_id` (string, identifier)
+ - `extra` (object): free-form JSON metadata
+ - `key` (string)
+ - `year` (integer)
+ - `container_title` (string)
+ - `title` (string)
+ - `locator` (string): Eg, "p123".
+
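The release field list states that several external identifiers are integers stored as strings, some with a fixed prefix (`pmid`, `core_id`, `pmcid`, `wikidata_qid`). A shape check for those four fields could look like the sketch below. The names `ID_PATTERNS` and `malformed_ids` are hypothetical, and the patterns encode only what the field descriptions say, not any rule from the style guide.

```python
import re

# Patterns taken from the field descriptions: bare integers, or integers
# with an upper-case "PMC" or "Q" prefix, always stored as strings.
ID_PATTERNS = {
    "pmid": re.compile(r"[0-9]+"),
    "core_id": re.compile(r"[0-9]+"),
    "pmcid": re.compile(r"PMC[0-9]+"),
    "wikidata_qid": re.compile(r"Q[0-9]+"),
}


def malformed_ids(release: dict) -> list:
    """Return the names of identifier fields whose values look wrong."""
    bad = []
    for field, pattern in ID_PATTERNS.items():
        value = release.get(field)
        if value is None:
            continue  # all of these fields are optional
        if not isinstance(value, str) or not pattern.fullmatch(value):
            bad.append(field)
    return bad
```

Passing a release with `"pmid": 12345` (an integer rather than a string) or `"pmcid": "pmc4321"` (lower-case prefix) would report those fields by name.
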
+Controlled vocabulary for `release_type` is derived from the Crossref `type`
+vocabulary:
+
+- `journal-article`
+- `proceedings-article`
+- `monograph`
+- `dissertation`
+- `book` (and `edited-book`, `reference-book`)
+- `book-chapter` (and `book-part`, `book-section`, though much rarer) is
+ allowed as these are frequently referenced and read independent of the entire
+ book. The data model does not currently support linking a subset of a release
+ to an entity representing the entire release. The release/work/file
+ distinctions should not be used to group chapters into a complete work; a book
+ chapter can be its own work. A paper which is republished as a chapter (eg,
+ in a collection, or "edited" book) can have both releases under one work. The
+ criterion for whether to "split" a book and have release entities for each
+ chapter is whether the chapter has been cited/referenced as such.
+- `dataset` (though representation with `file` entities is TBD).
+- `report`
+- `standard`
+- `posted-content` is allowed, but may be re-categorized. For Crossref, this
+ seems to imply a journal article or report which is not published (pre-print).
+- `other` matches Crossref `other` works, which may (and generally should) have
+ a more specific type set.
+- `web-post` (custom extension) for blog posts, essays, and other individual
+ works on websites.
+- `website` (custom extension) for entire web sites and wikis.
+- `presentation` (custom extension) for, eg, slides and recorded conference
+ presentations themselves, as distinct from `proceedings-article`.
+- `editorial` (custom extension) for columns, "in this issue", and other
+ content published alongside peer-reviewed content in journals. Can bleed into
+ "other" or "stub".
+- `book-review` (custom extension)
+- `letter` for "letters to the editor", "authors respond", and
+ sub-article-length published content
+- `example` (custom extension) for dummy or example releases that have valid
+ (registered) identifiers. Other metadata does not need to match "canonical"
+ examples.
+- `stub` (custom extension) for releases which have notable external
+ identifiers, and thus are included "for completeness", but don't seem to
+ represent a "full work". An example might be a paper that gets an extra DOI
+ by accident; the primary DOI should be a full release, and the accidental DOI
+ can be a `stub` release under the same work. `stub` releases shouldn't be
+ considered full releases when counting or aggregating (though if technically
+ difficult this may not always be implemented). Other things that can be
+ categorized as stubs (which seem to often end up miscategorized as full
+ articles in bibliographic databases):
+ - an abstract, which is only an abstract of a larger work
+ - commercial advertisements
+ - "trap" or "honey pot" works, which are fakes included in databases to
+ detect re-publishing without attribution
+ - "This page is intentionally blank"
+ - "About the author", "About the editors", "About the cover"
+ - "Acknowledgements"
+ - "Notices"
+
+Other types from Crossref (such as `component`, `reference-entry`) are valid,
+but are not actively solicited for inclusion, as they are not the current focus
+of the database.
+
+In the future, some types (like `journal`, `proceedings`, and `book-series`)
+will probably be represented as `container` entities. How to represent other
+container-like types (like `report-series` or `book-series`) is TBD.
+
+Controlled vocabulary for `release_status`:
+
+- `published` for any version of the work that was "formally published", or any
+ variant ("proof", "camera ready", "archival", "version of record", or
+ "definitive") that has no meaningful differences from the "published"
+ version. What counts as "meaningful" here still needs to be
+ explored.
+- `corrected` for a version of a work that, after formal publication, has been
+ revised and updated. Could be the "version of record".
+- `pre-print`, for versions of a work which have not been submitted for peer
+ review or formal publication.
+- `post-print`, often a post-peer-review version of a work that does not have
+ publisher-supplied copy-editing, typesetting, etc.
+- `draft`, in the context of book publication or online content (shouldn't be
+ applied to journal articles), for an unpublished but somehow notable version
+ of a work.
+- If blank, indicates status isn't known, and wasn't inferred at creation time.
+ Can often be interpreted as `published`.
+
+Controlled vocabulary for `role` field on `contribs`:
+- `author`
+- `translator`
+- `illustrator`
+- `editor`
+- If blank, indicates that type of contribution is not known; this can often be
+ interpreted as authorship.
+
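Two of the rules for `contribs` lend themselves to an automated check: `role` comes from the controlled vocabulary above (blank meaning unknown), and `index` numbers should be unique per release. The following is a rough sketch under those two rules only; `CONTRIB_ROLES` and `contrib_problems` are hypothetical names, and contributions are assumed to be plain dicts.

```python
# Roles listed in the controlled vocabulary; a blank role means "not known"
CONTRIB_ROLES = {"author", "translator", "illustrator", "editor"}


def contrib_problems(contribs: list) -> list:
    """Return human-readable problems found in a release's contribs array."""
    problems = []
    seen_indexes = set()
    for pos, contrib in enumerate(contribs):
        role = contrib.get("role")
        if role and role not in CONTRIB_ROLES:
            problems.append(f"contrib {pos}: unknown role {role!r}")
        index = contrib.get("index")
        if index is not None:  # `index` is optional
            if index in seen_indexes:
                problems.append(f"contrib {pos}: duplicate index {index}")
            seen_indexes.add(index)
    return problems
```

An author and a translator that both carry `index` 0 would be flagged, which matches the rule that a release should not have both a "first author" and a "first translator".
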
+Current "extra" fields, flags, and content:
+
+- `crossref` (object), for extra Crossref-specific metadata
+- `is_retracted` (boolean flag) if this work has been retracted
+- `translation_of` (release identifier) if this release is a translation of
+ another (usually under the same work)
+- `arxiv_id` (string) external identifier to a (version-specific) [arxiv.org]
+ work
+
+[arxiv.org]: https://arxiv.org
+
+`abstracts`: an array of objects, each with the fields:
+
+- `sha1` (string): Eg, "3f242a192acc258bdfdb151943419437f440c313".
+- `content` (string): Eg, "<jats:p>Some abstract thing goes here</jats:p>".
+- `mimetype` (string): Eg, "application/xml+jats".
+- `lang` (string): Eg, "en".
+
+## Works
+