Diffstat (limited to 'guide/src')
| -rw-r--r-- | guide/src/entity_file.md | 40 |
| -rw-r--r-- | guide/src/entity_fileset.md | 4 |
| -rw-r--r-- | guide/src/entity_webcapture.md | 6 |
3 files changed, 49 insertions, 1 deletions
diff --git a/guide/src/entity_file.md b/guide/src/entity_file.md
index 7429c982..84d9eac4 100644
--- a/guide/src/entity_file.md
+++ b/guide/src/entity_file.md
@@ -13,9 +13,13 @@
 - `urls`: An array of "typed" URLs. Order is not meaningful, and may not be
   preserved.
     - `url` (string, required): Eg: "https://example.edu/~frau/prcding.pdf".
-    - `rel` (string, required): Eg: "webarchive".
+    - `rel` (string, required): Eg: "webarchive", see vocabulary below.
 - `mimetype` (string): Format of the file. If XML, specific schema can be
   included after a `+`. Example: "application/pdf"
+- `content_scope` (string): for situations where the file does not simply
+  contain the full representation of a work (eg, fulltext of an article, for an
+  `article-journal` release), describes what that scope of coverage is. Eg,
+  entire `issue`, `corrupt` file. See vocabulary below.
 - `release_ids` (array of string identifiers): references to `release` entities
   that this file represents a manifestation of. Note that a single file can
   contain multiple release references (eg, a PDF containing a full issue with
@@ -35,3 +39,37 @@
   Scholar
 - `dweb`: content hosted on distributed/decentralized web protocols, such as
   `dat://` or `ipfs://` URLs
+
+#### `content_scope` Vocabulary
+
+This same vocabulary is shared between file, fileset, and webcapture entities;
+not all the fields make sense for each entity type.
+
+- if not set, assume that the artifact entity is valid and represents a
+  complete copy of the release
+- `issue`: artifact contains an entire issue of a serial publication (eg, issue
+  of a journal), representing several releases in full
+- `abstract`: contains only an abstract (short description) of the release, not
+  the release itself (unless the `release_type` itself is `abstract`, in which
+  case it is the entire release)
+- `index`: index of a journal, or series of abstracts from a conference
+- `slides`: slide deck (usually in "landscape" orientation)
+- `front-matter`: non-article content from a journal, such as editorial policies
+- `supplement`: usually a file entity which is a supplement or appendix, not
+  the entire work
+- `component`: a sub-component of a release, which may or may not be associated
+  with a `component` release entity. For example, a single figure or table as
+  part of an article
+- `poster`: digital copy of a poster, eg as displayed at conference poster sessions
+- `sample`: a partial sample of the entire work. eg, just the first page of an
+  article. distinct from `truncated`
+- `truncated`: the file has been truncated at a binary level, and may also be
+  corrupt or invalid. distinct from `sample`
+- `corrupt`: broken, mangled, or corrupt file (at the binary level)
+- `stub`: any other out-of-scope artifact situations, where the artifact
+  represents something which would not link to any possible in-scope release in
+  the catalog (except a `stub` release)
+- `landing-page`: for webcapture, the landing page of a work, as opposed to the
+  work itself
+- `spam`: content is spam. articles, webpages, or issues which include
+  incidental advertisements within them are not counted as `spam`
diff --git a/guide/src/entity_fileset.md b/guide/src/entity_fileset.md
index e1ac3e67..6083a09d 100644
--- a/guide/src/entity_fileset.md
+++ b/guide/src/entity_fileset.md
@@ -21,6 +21,10 @@
     - `rel` (string, required):
             Eg: "webarchive".
 - `release_ids` (array of string identifiers): references to `release` entities
+- `content_scope` (string): for situations where the fileset does not simply
+  contain the full representation of a work (eg, all files in dataset, for a
+  `dataset` release), describes what that scope of coverage is. Uses same
+  vocabulary as File entity.
 - `extra` (object with string keys): additional metadata about this group of
   files, including upstream platform-specific metadata and identifiers
diff --git a/guide/src/entity_webcapture.md b/guide/src/entity_webcapture.md
index 8c5615fb..1b3cac55 100644
--- a/guide/src/entity_webcapture.md
+++ b/guide/src/entity_webcapture.md
@@ -29,4 +29,10 @@ Warning: This schema is not yet stable.
 - `timestamp` (string, datetime): same format as CDX line timestamp (UTC, etc).
   Corresponds to the overall capture timestamp. Can be the earliest of CDX
   timestamps if that makes sense
+- `content_scope` (string): for situations where the webcapture does not simply
+  contain the full representation of a work (eg, HTML fulltext, for an
+  `article-journal` release), describes what that scope of coverage is. Eg,
+  `landing-page` it doesn't contain the full content. Landing pages are
+  out-of-scope for fatcat, but if they were accidentally imported, should mark
+  them as such so they aren't re-imported. Uses same vocabulary as File entity.
 - `release_ids` (array of string identifiers): references to `release` entities
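
For readers who want to check values against the `content_scope` vocabulary this patch documents, here is a minimal Python sketch. The helper name `is_known_content_scope` and the constant `CONTENT_SCOPE_VOCABULARY` are hypothetical and chosen for illustration only; they are not part of the fatcat codebase or API. The sketch assumes the terms are matched exactly as spelled in the guide (lower-case, hyphenated).

```python
from typing import Optional

# Terms copied from the `content_scope` vocabulary in the guide text.
# The constant name is an assumption made for this sketch.
CONTENT_SCOPE_VOCABULARY = frozenset({
    "issue",
    "abstract",
    "index",
    "slides",
    "front-matter",
    "supplement",
    "component",
    "poster",
    "sample",
    "truncated",
    "corrupt",
    "stub",
    "landing-page",
    "spam",
})


def is_known_content_scope(value: Optional[str]) -> bool:
    """Return True if `value` is unset or one of the documented terms.

    Hypothetical helper for illustration; not part of fatcat itself.
    """
    # Per the guide, an unset scope means the artifact is assumed to be
    # valid and a complete copy of the release.
    if value is None:
        return True
    # Exact, case-sensitive match against the documented terms (assumption).
    return value in CONTENT_SCOPE_VOCABULARY
```

A caller could use this to reject unknown terms before setting the field on a file, fileset, or webcapture entity. Whether a given term makes sense for a particular entity type (for example, `landing-page` outside of webcaptures) is left out of this sketch, since the guide only notes that not all terms apply to every type.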
