author: Bryan Newbold <bnewbold@robocracy.org> 2021-11-17 16:23:09 -0800
committer: Bryan Newbold <bnewbold@robocracy.org> 2021-11-17 16:23:09 -0800
commit: 1e0bf431fbd1ab00f27a305ff3492de8eac90ba6
tree: 0dbeffe9eef5882eb3ced5b15d1137c569241b90
parent: f64a469b8a8aa9319013d6099ad38e7cde495e18
guide: document content_scope field
-rw-r--r-- guide/src/entity_file.md        40
-rw-r--r-- guide/src/entity_fileset.md      4
-rw-r--r-- guide/src/entity_webcapture.md   6
3 files changed, 49 insertions, 1 deletions
diff --git a/guide/src/entity_file.md b/guide/src/entity_file.md
index 7429c982..84d9eac4 100644
--- a/guide/src/entity_file.md
+++ b/guide/src/entity_file.md
@@ -13,9 +13,13 @@
- `urls`: An array of "typed" URLs. Order is not meaningful, and may not be
preserved.
- `url` (string, required): Eg: "https://example.edu/~frau/prcding.pdf".
- - `rel` (string, required): Eg: "webarchive".
+ - `rel` (string, required): Eg: "webarchive", see vocabulary below.
- `mimetype` (string): Format of the file. If XML, specific schema can be
included after a `+`. Example: "application/pdf"
+- `content_scope` (string): for situations where the file does not simply
+ contain the full representation of a work (eg, fulltext of an article, for an
+ `article-journal` release), describes what that scope of coverage is. Eg,
+ entire `issue`, `corrupt` file. See vocabulary below.
- `release_ids` (array of string identifiers): references to `release` entities
that this file represents a manifestation of. Note that a single file can
contain multiple release references (eg, a PDF containing a full issue with
@@ -35,3 +39,37 @@
Scholar
- `dweb`: content hosted on distributed/decentralized web protocols, such as
`dat://` or `ipfs://` URLs
+
+#### `content_scope` Vocabulary
+
+This same vocabulary is shared between file, fileset, and webcapture entities;
+not all the values make sense for each entity type.
+
+- if not set, assume that the artifact entity is valid and represents a
+ complete copy of the release
+- `issue`: artifact contains an entire issue of a serial publication (eg, issue
+ of a journal), representing several releases in full
+- `abstract`: contains only an abstract (short description) of the release, not
+ the release itself (unless the `release_type` itself is `abstract`, in which
+ case it is the entire release)
+- `index`: index of a journal, or series of abstracts from a conference
+- `slides`: slide deck (usually in "landscape" orientation)
+- `front-matter`: non-article content from a journal, such as editorial policies
+- `supplement`: usually a file entity which is a supplement or appendix, not
+ the entire work
+- `component`: a sub-component of a release, which may or may not be associated
+ with a `component` release entity. For example, a single figure or table as
+ part of an article
+- `poster`: digital copy of a poster, eg as displayed at conference poster sessions
+- `sample`: a partial sample of the entire work, eg, just the first page of an
+ article. Distinct from `truncated`
+- `truncated`: the file has been truncated at a binary level, and may also be
+ corrupt or invalid. Distinct from `sample`
+- `corrupt`: broken, mangled, or corrupt file (at the binary level)
+- `stub`: any other out-of-scope artifact situation, where the artifact
+ represents something which would not link to any possible in-scope release in
+ the catalog (except a `stub` release)
+- `landing-page`: for webcapture, the landing page of a work, as opposed to the
+ work itself
+- `spam`: content is spam. Articles, webpages, or issues which include
+ incidental advertisements within them are not counted as `spam`
diff --git a/guide/src/entity_fileset.md b/guide/src/entity_fileset.md
index e1ac3e67..6083a09d 100644
--- a/guide/src/entity_fileset.md
+++ b/guide/src/entity_fileset.md
@@ -21,6 +21,10 @@
- `rel` (string, required):
Eg: "webarchive".
- `release_ids` (array of string identifiers): references to `release` entities
+- `content_scope` (string): for situations where the fileset does not simply
+ contain the full representation of a work (eg, all files in dataset, for a
+ `dataset` release), describes what that scope of coverage is. Uses same
+ vocabulary as File entity.
- `extra` (object with string keys): additional metadata about this group of
files, including upstream platform-specific metadata and identifiers
diff --git a/guide/src/entity_webcapture.md b/guide/src/entity_webcapture.md
index 8c5615fb..1b3cac55 100644
--- a/guide/src/entity_webcapture.md
+++ b/guide/src/entity_webcapture.md
@@ -29,4 +29,10 @@ Warning: This schema is not yet stable.
- `timestamp` (string, datetime): same format as CDX line timestamp (UTC, etc).
Corresponds to the overall capture timestamp. Can be the earliest of CDX
timestamps if that makes sense
+- `content_scope` (string): for situations where the webcapture does not simply
+ contain the full representation of a work (eg, HTML fulltext, for an
+ `article-journal` release), describes what that scope of coverage is. Eg,
+ `landing-page` if it doesn't contain the full content. Landing pages are
+ out-of-scope for fatcat, but if any were accidentally imported, mark them
+ as such so they aren't re-imported. Uses same vocabulary as File entity.
- `release_ids` (array of string identifiers): references to `release` entities
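
The `content_scope` vocabulary documented in this commit could be checked in client code along the following lines. This is a minimal sketch, not part of the commit or of any fatcat library: the names `CONTENT_SCOPE_VOCABULARY`, `check_content_scope`, and `is_complete_copy` are hypothetical, and the guide text does not say whether the catalog itself rejects values outside this list.

```python
from typing import Optional

# Terms copied from the "content_scope Vocabulary" section added in this commit.
# Treating the list as closed is an assumption made for illustration.
CONTENT_SCOPE_VOCABULARY = frozenset({
    "issue", "abstract", "index", "slides", "front-matter", "supplement",
    "component", "poster", "sample", "truncated", "corrupt", "stub",
    "landing-page", "spam",
})


def check_content_scope(content_scope: Optional[str]) -> bool:
    """Return True if the value is unset or one of the documented terms."""
    return content_scope is None or content_scope in CONTENT_SCOPE_VOCABULARY


def is_complete_copy(content_scope: Optional[str]) -> bool:
    """Per the guide, an unset scope means the artifact is assumed to be
    valid and a complete copy of the release."""
    return content_scope is None
```

A caller might run `check_content_scope` on a file, fileset, or webcapture entity before submitting an edit, and use `is_complete_copy` to decide whether an artifact can stand in for the full release.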