path: root/guide/src/entity_webcapture.md
author: Bryan Newbold <bnewbold@robocracy.org> 2019-05-20 16:57:57 -0700
committer: Bryan Newbold <bnewbold@robocracy.org> 2019-05-20 16:57:57 -0700
commit: cd829eedb5bfc7328ab5266650a625a6c88db6fa
tree: 90aa164cbd7f4e86aadc25dbd036dab680c30e80 /guide/src/entity_webcapture.md
parent: eb31be2172264091e192bcb4f17ffd571253fffa
start refactoring guide (per-entity pages)
Diffstat (limited to 'guide/src/entity_webcapture.md')
-rw-r--r-- guide/src/entity_webcapture.md 32
1 files changed, 32 insertions, 0 deletions
diff --git a/guide/src/entity_webcapture.md b/guide/src/entity_webcapture.md
new file mode 100644
index 00000000..8c5615fb
--- /dev/null
+++ b/guide/src/entity_webcapture.md
@@ -0,0 +1,32 @@
+
+# Web Capture Entity Reference
+
+## Fields
+
+Warning: This schema is not yet stable.
+
+- `cdx` (array of objects): each entry represents a distinct web resource
+  (URL). The first entry is the primary/entry resource. Roughly aligns with CDXJ.
+    - `surt` (string, required): sortable URL format (SURT)
+    - `timestamp` (string, datetime, required): ISO 8601 format, UTC timezone,
+      with `Z` suffix required, and second (or finer) precision. E.g.,
+      "2016-09-19T17:20:24Z". Wayback timestamps (like "20160919172024") should
+      be converted naively.
+    - `url` (string, required): full URL
+    - `mimetype` (string): content type of the resource
+    - `status_code` (integer, signed): HTTP status code
+    - `sha1` (string, required): SHA-1 hash in lower-case hex
+    - `sha256` (string): SHA-256 hash in lower-case hex
+- `archive_urls` (array of objects): "typed" URLs where this snapshot can be
+  found. These can be wayback/memento instances, or direct links to a WARC file
+  containing all the capture resources. Often there will be only a single
+  archive. Order is not meaningful, and may not be preserved.
+    - `url` (string, required): where the snapshot can be found.
+      E.g., "https://example.edu/~frau/prcding.pdf".
+    - `rel` (string, required): E.g., "wayback" or "warc"
+- `original_url` (string): base URL of the resource. May reference a specific
+  CDX entry, or may be in normalized form.
+- `timestamp` (string, datetime): same format as the CDX entry timestamp (UTC,
+  etc). Corresponds to the overall capture timestamp. Can be the earliest of
+  the CDX timestamps if that makes sense.
+- `release_ids` (array of string identifiers): references to `release` entities
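
The field list above can be made concrete with a small sketch. This is not part of the commit and not fatcat's own code: the helper names `wayback_to_iso` and `missing_required`, the SURT string, the Wayback replay URL, and the hashed placeholder body are all assumptions chosen for illustration. It shows one plausible reading of the "naive" Wayback timestamp conversion and a check of the fields marked as required.

```python
import hashlib
from datetime import datetime, timezone

# Field names marked "required" in the reference above.
CDX_REQUIRED = ("surt", "timestamp", "url", "sha1")
ARCHIVE_URL_REQUIRED = ("url", "rel")


def wayback_to_iso(ts: str) -> str:
    """Convert a 14-digit Wayback timestamp to ISO 8601 UTC with a `Z` suffix.

    Assumption: "naive" conversion means the digits are re-punctuated as UTC
    with no timezone shifting.
    """
    parsed = datetime.strptime(ts, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return parsed.strftime("%Y-%m-%dT%H:%M:%SZ")


def missing_required(entity: dict) -> list:
    """Hypothetical helper: list required sub-fields absent from an entity."""
    missing = []
    for i, row in enumerate(entity.get("cdx", [])):
        missing += [f"cdx[{i}].{key}" for key in CDX_REQUIRED if key not in row]
    for i, row in enumerate(entity.get("archive_urls", [])):
        missing += [
            f"archive_urls[{i}].{key}"
            for key in ARCHIVE_URL_REQUIRED
            if key not in row
        ]
    return missing


# Illustrative entity; every value here is made up for the example.
capture_time = wayback_to_iso("20160919172024")
webcapture = {
    "cdx": [
        {
            "surt": "edu,example)/~frau/prcding.pdf",  # illustrative SURT form
            "timestamp": capture_time,
            "url": "https://example.edu/~frau/prcding.pdf",
            "mimetype": "application/pdf",
            "status_code": 200,
            # Placeholder digest of a dummy body, already lower-case hex.
            "sha1": hashlib.sha1(b"placeholder body").hexdigest(),
        }
    ],
    "archive_urls": [
        {
            # Assumed replay URL shape, not taken from the guide.
            "url": "https://web.archive.org/web/20160919172024/"
                   "https://example.edu/~frau/prcding.pdf",
            "rel": "wayback",
        }
    ],
    "original_url": "https://example.edu/~frau/prcding.pdf",
    "timestamp": capture_time,  # earliest (here, only) CDX timestamp
    "release_ids": [],
}

print(capture_time)
print(missing_required(webcapture))
```

Since the schema is flagged as not yet stable, a check like `missing_required` is best treated as a throwaway aid rather than a validator to depend on.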