diff options
Diffstat (limited to 'guide/src/entity_webcapture.md')
-rw-r--r-- | guide/src/entity_webcapture.md | 32 |
1 files changed, 32 insertions, 0 deletions
diff --git a/guide/src/entity_webcapture.md b/guide/src/entity_webcapture.md new file mode 100644 index 00000000..8c5615fb --- /dev/null +++ b/guide/src/entity_webcapture.md @@ -0,0 +1,32 @@ + +# Web Capture Entity Reference + +## Fields + +Warning: This schema is not yet stable. + +- `cdx` (array of objects): each entry represents a distinct web resource + (URL). First is considered the primary/entry. Roughly aligns with CDXJ schema. + - `surt` (string, required): sortable URL format + - `timestamp` (string, datetime, required): ISO format, UTC timezone, with + `Z` prefix required, with second (or finer) precision. Eg, + "2016-09-19T17:20:24Z". Wayback timestamps (like "20160919172024") should + be converted naively. + - `url` (string, required): full URL + - `mimetype` (string): content type of the resource + - `status_code` (integer, signed): HTTP status code + - `sha1` (string, required): SHA-1 hash in lower-case hex + - `sha256` (string): SHA-256 hash in lower-case hex +- `archive_urls`: An array of "typed" URLs where this snapshot can be found. + Can be wayback/memento instances, or direct links to a WARC file containing + all the capture resources. Often will only be a single archive. Order is not + meaningful, and may not be preserved. + - `url` (string, required): + Eg: "https://example.edu/~frau/prcding.pdf". + - `rel` (string, required): Eg: "wayback" or "warc" +- `original_url` (string): base URL of the resource. May reference a specific + CDX entry, or maybe in normalized form. +- `timestamp` (string, datetime): same format as CDX line timestamp (UTC, etc). + Corresponds to the overall capture timestamp. Can be the earliest of CDX + timestamps if that makes sense +- `release_ids` (array of string identifiers): references to `release` entities |