summaryrefslogtreecommitdiffstats
path: root/guide/src/entity_webcapture.md
blob: 1b3cac551d8bdd71393f152af5cdd0827d17c70c (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38

# Web Capture Entity Reference

## Fields

Warning: This schema is not yet stable.

- `cdx` (array of objects): each entry represents a distinct web resource
  (URL). First is considered the primary/entry. Roughly aligns with CDXJ schema.
  - `surt` (string, required): sortable URL format
  - `timestamp` (string, datetime, required): ISO format, UTC timezone, with
    `Z` prefix required, with second (or finer) precision. Eg,
    "2016-09-19T17:20:24Z". Wayback timestamps (like "20160919172024") should
    be converted naively.
  - `url` (string, required): full URL
  - `mimetype` (string): content type of the resource
  - `status_code` (integer, signed): HTTP status code
  - `sha1` (string, required): SHA-1 hash in lower-case hex
  - `sha256` (string): SHA-256 hash in lower-case hex
- `archive_urls`: An array of "typed" URLs where this snapshot can be found.
  Can be wayback/memento instances, or direct links to a WARC file containing
  all the capture resources.  Often will only be a single archive. Order is not
  meaningful, and may not be preserved.
    - `url` (string, required):
            Eg: "https://example.edu/~frau/prcding.pdf".
    - `rel` (string, required): Eg: "wayback" or "warc"
- `original_url` (string): base URL of the resource. May reference a specific
  CDX entry, or maybe in normalized form.
- `timestamp` (string, datetime): same format as CDX line timestamp (UTC, etc).
  Corresponds to the overall capture timestamp. Can be the earliest of CDX
  timestamps if that makes sense
- `content_scope` (string): for situations where the webcapture does not simply
  contain the full representation of a work (eg, HTML fulltext, for an
  `article-journal` release), describes what that scope of coverage is. Eg,
  `landing-page` it doesn't contain the full content. Landing pages are
  out-of-scope for fatcat, but if they were accidentally imported, should mark
  them as such so they aren't re-imported. Uses same vocabulary as File entity.
- `release_ids` (array of string identifiers): references to `release` entities