1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
|
# File Entity Reference
## Fields
- `size` (integer, positive, non-zero): Size of file in bytes. Eg: 1048576.
- `md5` (string): MD5 hash in lower-case hex. Eg: "d41efcc592d1e40ac13905377399eb9b".
- `sha1` (string): SHA-1 hash in lower-case hex. Not technically required, but
the most-used of the hash fields and should always be included. Eg:
"f013d66c7f6817d08b7eb2a93e6d0440c1f3e7f8".
- `sha256`: SHA-256 hash in lower-case hex. Eg:
"a77e4c11a57f1d757fca5754a8f83b5d4ece49a2d28596889127c1a2f3f28832".
- `urls`: An array of "typed" URLs. Order is not meaningful, and may not be
preserved.
- `url` (string, required): Eg: "https://example.edu/~frau/prcding.pdf".
- `rel` (string, required): Eg: "webarchive", see vocabulary below.
- `mimetype` (string): Format of the file. If XML, specific schema can be
included after a `+`. Example: "application/pdf"
- `content_scope` (string): for situations where the file does not simply
contain the full representation of a work (eg, fulltext of an article, for an
`article-journal` release), describes what that scope of coverage is. Eg,
entire `issue`, `corrupt` file. See vocabulary below.
- `release_ids` (array of string identifiers): references to `release` entities
that this file represents a manifestation of. Note that a single file can
contain multiple release references (eg, a PDF containing a full issue with
many articles), and that a release will often have multiple files (differing
only by watermarks, or different digitizations of the same printed work, or
variant MIME/media types of the same published work).
#### URL `rel` Vocabulary
- `web`: generic public web sites; for `http/https` URLs, this should be the default
- `webarchive`: full URL to a resource in a long-term web archive
- `repository`: direct URL to a resource stored in a repository (eg, an
institutional or field-specific research data repository)
- `academicsocial`: academic social networks (such as academia.edu or ResearchGate)
- `publisher`: resources hosted on publisher's website
- `aggregator`: fulltext aggregator or search engine, like CORE or Semantic
Scholar
- `dweb`: content hosted on distributed/decentralized web protocols, such as
`dat://` or `ipfs://` URLs
#### `content_scope` Vocabulary
This same vocabulary is shared between file, fileset, and webcapture entities;
not all the fields make sense for each entity type.
- if not set, assume that the artifact entity is valid and represents a
complete copy of the release
- `issue`: artifact contains an entire issue of a serial publication (eg, issue
of a journal), representing several releases in full
- `abstract`: contains only an abstract (short description) of the release, not
the release itself (unless the `release_type` itself is `abstract`, in which
case it is the entire release)
- `index`: index of a journal, or series of abstracts from a conference
- `slides`: slide deck (usually in "landscape" orientation)
- `front-matter`: non-article content from a journal, such as editorial policies
- `supplement`: usually a file entity which is a supplement or appendix, not
the entire work
- `component`: a sub-component of a release, which may or may not be associated
with a `component` release entity. For example, a single figure or table as
part of an article
- `poster`: digital copy of a poster, eg as displayed at conference poster sessions
- `sample`: a partial sample of the entire work. eg, just the first page of an
article. distinct from `truncated`
- `truncated`: the file has been truncated at a binary level, and may also be
corrupt or invalid. distinct from `sample`
- `corrupt`: broken, mangled, or corrupt file (at the binary level)
- `stub`: any other out-of-scope artifact situations, where the artifact
represents something which would not link to any possible in-scope release in
the catalog (except a `stub` release)
- `landing-page`: for webcapture, the landing page of a work, as opposed to the
work itself
- `spam`: content is spam. articles, webpages, or issues which include
incidental advertisements within them are not counted as `spam`
|