1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
|
# Web Capture Entity Reference
## Fields
Warning: This schema is not yet stable.
- `cdx` (array of objects): each entry represents a distinct web resource
(URL). First is considered the primary/entry. Roughly aligns with CDXJ schema.
- `surt` (string, required): sortable URL format
- `timestamp` (string, datetime, required): ISO format, UTC timezone, with
`Z` prefix required, with second (or finer) precision. Eg,
"2016-09-19T17:20:24Z". Wayback timestamps (like "20160919172024") should
be converted naively.
- `url` (string, required): full URL
- `mimetype` (string): content type of the resource
- `status_code` (integer, signed): HTTP status code
- `sha1` (string, required): SHA-1 hash in lower-case hex
- `sha256` (string): SHA-256 hash in lower-case hex
- `archive_urls`: An array of "typed" URLs where this snapshot can be found.
Can be wayback/memento instances, or direct links to a WARC file containing
all the capture resources. Often will only be a single archive. Order is not
meaningful, and may not be preserved.
- `url` (string, required):
Eg: "https://example.edu/~frau/prcding.pdf".
- `rel` (string, required): Eg: "wayback" or "warc"
- `original_url` (string): base URL of the resource. May reference a specific
CDX entry, or maybe in normalized form.
- `timestamp` (string, datetime): same format as CDX line timestamp (UTC, etc).
Corresponds to the overall capture timestamp. Can be the earliest of CDX
timestamps if that makes sense
- `release_ids` (array of string identifiers): references to `release` entities
|