Diffstat (limited to 'guide/src')
-rw-r--r-- | guide/src/data_model.md | 2
-rw-r--r-- | guide/src/editing_quickstart.md | 2
-rw-r--r-- | guide/src/entity_creator.md | 60
-rw-r--r-- | guide/src/entity_release.md | 34
-rw-r--r-- | guide/src/privacy_policy.md | 2
-rw-r--r-- | guide/src/style_guide.md | 70
6 files changed, 92 insertions, 78 deletions
diff --git a/guide/src/data_model.md b/guide/src/data_model.md
index 8ee3eaa1..bfa1891a 100644
--- a/guide/src/data_model.md
+++ b/guide/src/data_model.md
@@ -70,7 +70,7 @@ be reverted (even merges/redirects and "deletion"). "Work in progress" or
 "proposed" updates are staged as edit objects without updating the identifiers
 themselves.
 
-## Controlled Vocabularies
+## Controlled Vocabularies
 
 Some individual fields have additional constraints, either in the form of
 pattern validation ("values must be upper case, contain only certain
diff --git a/guide/src/editing_quickstart.md b/guide/src/editing_quickstart.md
index df413fcc..6f1673be 100644
--- a/guide/src/editing_quickstart.md
+++ b/guide/src/editing_quickstart.md
@@ -81,7 +81,7 @@ editgroups from the drop-down link in the upper right-hand corner of every page
 (your username, then "Edit History"). The editgroup page shows all the entities
 created, updated, or deleted, and allows you to make tweaks (re-edit) or remove
 changes. If the release/container update you made was bogus (just as
-a learning exercize), you could remove it here. It's a good practice to group
+a learning exercise), you could remove it here. It's a good practice to group
 related edits into the same editgroup, but only up to 50 or so edits at a time
 (more than that becomes difficult hard to review).
 
diff --git a/guide/src/entity_creator.md b/guide/src/entity_creator.md
index fded9e8d..7448fa4d 100644
--- a/guide/src/entity_creator.md
+++ b/guide/src/entity_creator.md
@@ -11,3 +11,63 @@
 - `wikidata_qid` (string): external linking identifier to a Wikidata entity.
 
 See also ["Human Names"](./style_guide.md##human-names) sub-section of style guide.
+
+#### `extra` Fields
+
+All are optional.
+
+- `also-known-as` (list of objects): additional names that this creator may be
+ known under. For example, previous names, aliases, or names in different
+ scripts. Can include any or all of `display_name`, `given_name`, or `surname`
+ as keys.
+
+## Human Names
+
+Representing names of human beings in databases is a fraught subject. For some
+background reading, see:
+
+- [Falsehoods Programmers Believe About Names](https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/) (blog post)
+- [Personal names around the world](https://www.w3.org/International/questions/qa-personal-names) (W3C informational)
+- [Hubert Blaine Wolfeschlegelsteinhausenbergerdorff Sr.](https://en.wikipedia.org/wiki/Hubert_Blaine_Wolfeschlegelsteinhausenbergerdorff_Sr.) (Wikipedia article)
+
+Particular difficult issues in the context of a bibliographic database include:
+
+- the non-universal concept of "family" vs. "given" names and their
+ relationship to first and last names
+- the inclusion of honorary titles and other suffixes and prefixes to a name
+- the distinction between "preferred", "legal", and "bibliographic" names, or
+ other situations where a person may not wish to be known under the name they
+ are commonly referred
+- language and character set issues
+- different conventions for sorting and indexing names
+- the sprawling world of citation styles
+- name changes
+- pseudonyms, anonymous publications, and fake personas (perhaps representing a
+ group, like Bourbaki)
+
+The general guidance for Fatcat is to:
+
+- not be a "source of truth" for representing a persona or human being; ORCID
+ and Wikidata are better suited to this task
+- represent author personas, not necessarily 1-to-1 with human beings
+- balance the concerns of readers with those of the author
+- enable basic interoperability with external databases, file formats, schemas,
+ and style guides
+- when possible, respect the wishes of individual authors
+
+The data model for the `creator` entity has three name fields:
+
+- `surname` and `given_name`: needed for "aligning" with external databases,
+ and to export metadata to many standard formats
+- `display_name`: the "preferred" representation for display of the entire name,
+ in the context of international attribution of authorship of a written work
+
+Names to not necessarily need to expressed in a Latin character set, but also
+does not necessarily need to be in the native language of the creator or the
+language of their notable works
+
+Ideally all three fields are populated for all creators.
+
+It seems likely that this schema and guidance will need review.
+
+
diff --git a/guide/src/entity_release.md b/guide/src/entity_release.md
index 3815a544..842e9d55 100644
--- a/guide/src/entity_release.md
+++ b/guide/src/entity_release.md
@@ -126,8 +126,7 @@ to ensure they are properly formatted, though these checks aren't always
 complete or correct in more obscure cases.
 
 - `doi` (string): full DOI number, lower-case. Example: "10.1234/abcde.789".
- See the "External Identifiers" section of style guide for more notes
- about DOIs specifically.
+ See section below for more about DOIs specifically
 - `wikidata_qid` (string): external identifier for Wikidata entities. These
 are integers prefixed with "Q", like "Q4321". Each `release` entity can be
 associated with at most one Wikidata entity (this field is not an array), and
@@ -180,11 +179,11 @@ complete or correct in more obscure cases.
 - `is_work_alias` (boolean): if true, then this release is an alias or pointer
 to the entire work, or the most recent version of the work. For example, some
 data repositories have separate DOIs for each version of the dataset, then an
- additional DOI that points to the "lastest" version/DOI.
+ additional DOI that points to the "latest" version/DOI.
 
 #### `release_type` Vocabulary
 
-This vocabulary is based on the
+This vocabulary is based on the
 [CSL types](http://docs.citationstyles.org/en/stable/specification.html#appendix-iii-types),
 with a small number of (proposed) extensions:
 
@@ -230,7 +229,7 @@ with a small number of (proposed) extensions:
 represent a "full work".
 - `component` (fatcat extension) for sub-components of a full paper or other work.
 Eg, tables, or individual files as part of a dataset.
-
+
 An example of a `stub` might be a paper that gets an extra DOI by accident; the
 primary DOI should be a full release, and the accidental DOI can be a `stub`
 release under the same work. `stub` releases shouldn't be considered full
@@ -369,3 +368,28 @@ Fatcat:
 
 If blank, indicates that type of contribution is not known; this can often be
 interpreted as authorship.
+
+## More About DOIs
+
+All DOIs stored in an entity column should be registered (aka, should be
+resolvable from `doi.org`). Invalid identifiers may be cleaned up or removed by
+bots.
+
+DOIs should *always* be stored and transferred in lower-case form. Note that
+there are almost no other constraints on DOIs (and handles in general): they
+may have multiple forward slashes, whitespace, of arbitrary length, etc.
+Crossref has a [number of examples][] of such "valid" but frustratingly
+formatted strings.
+
+[number of examples]: https://www.crossref.org/blog/dois-unambiguously-and-persistently-identify-published-trustworthy-citable-online-scholarly-literature-right/
+
+In the Fatcat ontology, DOIs and release entities are one-to-one.
+
+It is the intention to automatically (via bot) create a Fatcat release for
+every Crossref-registered DOI from an allowlist of media types
+("journal-article" etc, but not all), and it would be desirable to auto-create
+entities for in-scope publications from all registrars. It is not the intention
+to auto-create a release for every registered DOI. In particular,
+"sub-component" DOIs (eg, for an individual figure or table from a publication)
+aren't currently auto-created, but could be stored in "extra" metadata, or on a
+case-by-case basis.
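Reviewer note on the "More About DOIs" text added in `entity_release.md`: the lower-case rule can be illustrated with a minimal Python sketch. The function names `normalize_doi` and `looks_like_doi` are hypothetical and not part of Fatcat. The shape check is an assumption made for illustration only; it does not test whether a DOI is registered or resolvable from `doi.org`.

```python
def normalize_doi(raw: str) -> str:
    """Return the lower-case form of a DOI string.

    Only the case is changed. Internal slashes and whitespace are kept,
    because the guide notes that such characters can occur in DOIs.
    """
    return raw.lower()


def looks_like_doi(value: str) -> bool:
    """Loose shape check (an assumption, not a Fatcat rule).

    Accepts strings of the form "10.<something>/<non-empty suffix>".
    """
    prefix, sep, suffix = value.partition("/")
    return prefix.startswith("10.") and sep == "/" and suffix != ""


if __name__ == "__main__":
    doi = normalize_doi("10.1234/ABCDE.789")
    print(doi, looks_like_doi(doi))
```

A check for registration would need a lookup against `doi.org`, which this sketch deliberately leaves out.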
diff --git a/guide/src/privacy_policy.md b/guide/src/privacy_policy.md
index 05136f97..80799117 100644
--- a/guide/src/privacy_policy.md
+++ b/guide/src/privacy_policy.md
@@ -23,7 +23,7 @@ Exceptions will likely be made:
 
 - temporary caching of IP addresses may be necessary to implement rate-limiting
 and debug traffic spikes
-- exception logging, abuse detection, and other exceptional
+- exception logging, abuse detection, and other exceptional
 situations
 
 Some uncertain areas of privacy include:
diff --git a/guide/src/style_guide.md b/guide/src/style_guide.md
index 87d5e74a..de262549 100644
--- a/guide/src/style_guide.md
+++ b/guide/src/style_guide.md
@@ -24,76 +24,6 @@ stored in "extra" metadata. Crossref has [blogged][] about this distinction.
 
 [blogged]: https://www.crossref.org/blog/doi-like-strings-and-fake-dois/
 
-#### DOIs
-
-All DOIs stored in an entity column should be registered (aka, should be
-resolvable from `doi.org`). Invalid identifiers may be cleaned up or removed by
-bots.
-
-DOIs should *always* be stored and transferred in lower-case form. Note that
-there are almost no other constraints on DOIs (and handles in general): they
-may have multiple forward slashes, whitespace, of arbitrary length, etc.
-Crossref has a [number of examples][] of such "valid" but frustratingly
-formatted strings.
-
-[number of examples]: https://www.crossref.org/blog/dois-unambiguously-and-persistently-identify-published-trustworthy-citable-online-scholarly-literature-right/
-
-In the Fatcat ontology, DOIs and release entities are one-to-one.
-
-It is the intention to automatically (via bot) create a Fatcat release for
-every Crossref-registered DOI from a whitelist of media types
-("journal-article" etc, but not all), and it would be desirable to auto-create
-entities for in-scope publications from all registrars. It is not the intention
-to auto-create a release for every registered DOI. In particular,
-"sub-component" DOIs (eg, for an individual figure or table from a publication)
-aren't currently auto-created, but could be stored in "extra" metadata, or on a
-case-by-case basis.
-
-## Human Names
-
-Representing names of human beings in databases is a fraught subject. For some
-background reading, see:
-
-- [Falsehoods Programmers Believe About Names](https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/) (blog post)
-- [Personal names around the world](https://www.w3.org/International/questions/qa-personal-names) (W3C informational)
-- [Hubert Blaine Wolfeschlegelsteinhausenbergerdorff Sr.](https://en.wikipedia.org/wiki/Hubert_Blaine_Wolfeschlegelsteinhausenbergerdorff_Sr.) (Wikipedia article)
-
-Particular difficult issues in the context of a bibliographic database include
-the non-universal concept of "family" vs. "given" names and their relationship
-to first and last names; the inclusion of honorary titles and other suffixes
-and prefixes to a name; the distinction between "preferred", "legal", and
-"bibliographic" names, or other situations where a person may not wish to be
-known under the name they are commonly referred to under; language and character
-set issues; and pseudonyms, anonymous publications, and fake personas (perhaps
-representing a group, like Bourbaki).
-
-The general guidance for Fatcat is to:
-
-- not be a "source of truth" for representing a persona or human being; ORCID
- and Wikidata are better suited to this task
-- represent author personas, not necessarily 1-to-1 with human beings
-- prioritize the concerns of a reader or researcher over that of the author
-- enable basic interoperability with external databases, file formats, schemas,
- and style guides
-- when possible, respect the wishes of individuals
-
-The data model for the `creator` entity has three name fields:
-
-- `surname` and `given_name`: needed for "aligning" with external databases,
- and to export metadata to many standard formats
-- `display_name`: the "preferred" representation for display of the entire name,
- in the context of international attribution of authorship of a written work
-
-Names to not necessarily need to expressed in a Latin character set, but also
-does not necessarily need to be in the native language of the creator or the
-language of their notable works
-
-Ideally all three fields are populated for all creators.
-
-It seems likely that this schema and guidance will need review. "Extra"
-metadata can be used to store aliases and alternative representations, which
-may be useful for disambiguation and automated de-duplication.
-
 ## Editgroups and Meta-Meta-Data
 
 Editors are expected to group their edits in semantically meaningful editgroups