author | Bryan Newbold <bnewbold@robocracy.org> | 2019-05-20 16:57:57 -0700 |
---|---|---|
committer | Bryan Newbold <bnewbold@robocracy.org> | 2019-05-20 16:57:57 -0700 |
commit | cd829eedb5bfc7328ab5266650a625a6c88db6fa (patch) | |
tree | 90aa164cbd7f4e86aadc25dbd036dab680c30e80 /guide/src/entity_webcapture.md | |
parent | eb31be2172264091e192bcb4f17ffd571253fffa (diff) | |
start refactoring guide (per-entity pages)
Diffstat (limited to 'guide/src/entity_webcapture.md')
-rw-r--r-- | guide/src/entity_webcapture.md | 32 |
1 files changed, 32 insertions, 0 deletions
diff --git a/guide/src/entity_webcapture.md b/guide/src/entity_webcapture.md
new file mode 100644
index 00000000..8c5615fb
--- /dev/null
+++ b/guide/src/entity_webcapture.md
@@ -0,0 +1,32 @@
+
+# Web Capture Entity Reference
+
+## Fields
+
+Warning: This schema is not yet stable.
+
+- `cdx` (array of objects): each entry represents a distinct web resource
+  (URL). First is considered the primary/entry. Roughly aligns with CDXJ schema.
+    - `surt` (string, required): sortable URL format
+    - `timestamp` (string, datetime, required): ISO format, UTC timezone, with
+      `Z` suffix required, with second (or finer) precision. E.g.,
+      "2016-09-19T17:20:24Z". Wayback timestamps (like "20160919172024") should
+      be converted naively.
+    - `url` (string, required): full URL
+    - `mimetype` (string): content type of the resource
+    - `status_code` (integer, signed): HTTP status code
+    - `sha1` (string, required): SHA-1 hash in lower-case hex
+    - `sha256` (string): SHA-256 hash in lower-case hex
+- `archive_urls`: An array of "typed" URLs where this snapshot can be found.
+  Can be wayback/memento instances, or direct links to a WARC file containing
+  all the capture resources. Often will only be a single archive. Order is not
+  meaningful, and may not be preserved.
+    - `url` (string, required):
+      E.g.: "https://example.edu/~frau/prcding.pdf".
+    - `rel` (string, required): E.g.: "wayback" or "warc"
+- `original_url` (string): base URL of the resource. May reference a specific
+  CDX entry, or may be in normalized form.
+- `timestamp` (string, datetime): same format as CDX line timestamp (UTC, etc).
+  Corresponds to the overall capture timestamp. Can be the earliest of CDX
+  timestamps if that makes sense.
+- `release_ids` (array of string identifiers): references to `release` entities
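
The field list in this diff asks for Wayback timestamps to be converted "naively" into the ISO form with a trailing `Z`. A minimal Python sketch of that conversion, and of the resulting entity shape, might look like the following. The helper name `wayback_to_iso`, the SURT string, the archive URL, and the empty response body are all illustrative assumptions rather than part of fatcat itself, and the schema is marked as unstable, so treat this as a rough illustration only.

```python
import hashlib
from datetime import datetime, timezone


def wayback_to_iso(ts: str) -> str:
    """Convert a 14-digit Wayback timestamp (assumed UTC) to ISO format with a trailing 'Z'.

    Hypothetical helper: the name and the strict 14-digit check are assumptions.
    """
    if len(ts) != 14 or not (ts.isascii() and ts.isdigit()):
        raise ValueError(f"expected a 14-digit timestamp, got {ts!r}")
    dt = datetime.strptime(ts, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return dt.strftime("%Y-%m-%dT%H:%M:%SZ")


capture_ts = wayback_to_iso("20160919172024")
body = b""  # placeholder for the captured response body

# Illustrative webcapture entity using the field names listed in the diff.
# All values below are made-up placeholders.
webcapture = {
    "cdx": [
        {
            "surt": "edu,example)/~frau/prcding.pdf",  # assumed SURT form
            "timestamp": capture_ts,
            "url": "https://example.edu/~frau/prcding.pdf",
            "mimetype": "application/pdf",
            "status_code": 200,
            "sha1": hashlib.sha1(body).hexdigest(),  # lower-case hex
        }
    ],
    "archive_urls": [
        {"url": "https://archive.example.org/web/", "rel": "wayback"}  # hypothetical
    ],
    "original_url": "https://example.edu/~frau/prcding.pdf",
    "timestamp": capture_ts,  # overall capture time; here the single CDX timestamp
    "release_ids": [],
}
```

With more than one `cdx` entry, the top-level `timestamp` could be taken as the earliest of the entry timestamps; strings in this fixed-width format sort chronologically, so `min()` over them would be enough.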