From 6c6b4855d7adc8c03dc8fd8c8966da48878b5c4b Mon Sep 17 00:00:00 2001 From: Max Ogden Date: Fri, 11 Aug 2017 18:44:16 -0700 Subject: sleep paper breakout --- papers/buildpapers.sh | 9 + papers/dat-paper.bib | 6 + papers/dat-paper.md | 428 +--------------------------- papers/dat-paper.pdf | Bin 273875 -> 250116 bytes papers/dat-paper.txt | 745 +++++-------------------------------------------- papers/hyperdrive.md | 13 + papers/sleep.md | 424 ++++++++++++++++++++++++++++ papers/sleep.pdf | Bin 0 -> 202643 bytes papers/sleep.txt | 756 ++++++++++++++++++++++++++++++++++++++++++++++++++ 9 files changed, 1288 insertions(+), 1093 deletions(-) create mode 100755 papers/buildpapers.sh create mode 100644 papers/hyperdrive.md create mode 100644 papers/sleep.md create mode 100644 papers/sleep.pdf create mode 100644 papers/sleep.txt (limited to 'papers') diff --git a/papers/buildpapers.sh b/papers/buildpapers.sh new file mode 100755 index 0000000..420e916 --- /dev/null +++ b/papers/buildpapers.sh @@ -0,0 +1,9 @@ +#!/usr/bin/env sh + +pandoc --filter pandoc-citeproc --bibliography=dat-paper.bib --variable classoption=twocolumn --variable papersize=a4paper -s dat-paper.md -t latex -o dat-paper.txt + +pandoc --filter pandoc-citeproc --bibliography=dat-paper.bib --variable classoption=twocolumn --variable papersize=a4paper -s dat-paper.md -o dat-paper.pdf + +pandoc --filter pandoc-citeproc --bibliography=dat-paper.bib --variable classoption=twocolumn --variable papersize=a4paper -s sleep.md -t latex -o sleep.txt + +pandoc --filter pandoc-citeproc --bibliography=dat-paper.bib --variable classoption=twocolumn --variable papersize=a4paper -s sleep.md -o sleep.pdf \ No newline at end of file diff --git a/papers/dat-paper.bib b/papers/dat-paper.bib index e6c7d1d..87bb3fc 100644 --- a/papers/dat-paper.bib +++ b/papers/dat-paper.bib @@ -1,3 +1,9 @@ +@inproceedings{sleep, + title={SLEEP - The Dat Protocol On Disk Format}, + author={Ogden, Maxwell and Buus, Mathias}, + year={2017} +} + @inproceedings{aumasson2013blake2, title={BLAKE2: simpler, smaller, fast as MD5}, author={Aumasson, Jean-Philippe and Neves, Samuel and Wilcox-O’Hearn, Zooko and Winnerlein, Christian}, diff --git a/papers/dat-paper.md b/papers/dat-paper.md index dc156d5..556edae 100644 --- a/papers/dat-paper.md +++ b/papers/dat-paper.md @@ -4,7 +4,6 @@ date: "May 2017" author: "Maxwell Ogden, Karissa McKelvey, Mathias Buus Madsen, Code for Science" --- - # Abstract Dat is a protocol designed for syncing folders of data, even if they are large or changing constantly. Dat uses a cryptographically secure register of changes to prove that the requested data version is distributed. A byte range of any file's version can be efficiently streamed from a Dat repository over a network connection. Consumers can choose to fully or partially replicate the contents of a remote Dat repository, and can also subscribe to live changes. To ensure writer and reader privacy, Dat uses public key cryptography to encrypt network traffic. A group of Dat clients can connect to each other to form a public or private decentralized network to exchange data between each other. A reference implementation is provided in JavaScript. @@ -21,7 +20,7 @@ Distributed file sharing tools can become faster as files become more popular, r Dat is a dataset synchronization protocol that does not assume a dataset is static or that the entire dataset will be downloaded. The main reference implementation is available from npm as `npm install dat -g`. -The protocol is agnostic to the underlying transport e.g. you could implement Dat over carrier pigeon. The key properties of the Dat design are explained in this section. +The protocol is agnostic to the underlying transport e.g. you could implement Dat over carrier pigeon. Data is stored in a format called SLEEP [@sleep], described in it's own paper. The key properties of the Dat design are explained in this section. - 2.1 **Content Integrity** - Data and publisher integrity is verified through use of signed hashes of the content. - 2.2 **Decentralized Mirroring** - Users sharing the same Dat automatically discover each other and exchange data in a swarm. @@ -303,424 +302,7 @@ TODO TODO -## 3. SLEEP Specification - -This section is a technical description of the SLEEP format intended for implementers. SLEEP is the the on-disk format that Dat produces and uses. It is a set of 9 files that hold all of the metadata needed to list the contents of a Dat repository and verify the integrity of the data you receive. SLEEP is designed to work with REST, allowing servers to be plain HTTP file servers serving the static SLEEP files, meaning you can implement a Dat protocol client using HTTP with a static HTTP file server as the backend. - -SLEEP files contain metadata about the data inside a Dat repository, including cryptographic hashes, cryptographic signatures, filenames and file permissions. The SLEEP format is specifically engineered to allow efficient access to subsets of the metadata and/or data in the repository, even on very large repositories, which enables Dat's peer to peer networking to be fast. - -The acronym SLEEP is a slumber related pun on REST and stands for Syncable Lightweight Event Emitting Persistence. The Event Emitting part refers to how SLEEP files are append-only in nature, meaning they grow over time and new updates can be subscribed to as a realtime feed of events through the Dat protocol. - -The SLEEP version described here, used in Dat as of 2017 is SLEEP V2. SLEEP V1 is documented at http://specs.okfnlabs.org/sleep. - -### SLEEP Files - -SLEEP is a set of 9 files that should be stored with the following names. In Dat, the files are stored in a folder called `.dat` in the top level of the repository. - -``` -metadata.key -metadata.signatures -metadata.bitfield -metadata.tree -metadata.data -content.key -content.signatures -content.bitfield -content.tree -``` - -The files prefixed with `content` store metadata about the primary data in a Dat repository, for example the raw binary contents of the files. The files prefixed with `metadata` store metadata about the files in the repository, for example the filenames, file sizes, and file permissions. The `content` and `metadata` files are both Hypercore registers, making SLEEP a set of two Hypercore registers. - -### SLEEP File Headers - -The following structured binary format is used for `signatures`, `bitfield`, and `tree` files. The header contains metadata as well as information needed to decode the rest of the files after the header. SLEEP files are designed to be easy to append new data, easy to read arbitrary byte offsets in the middle, and are relatively flat, simple files that rely on the filesystem for the heavy lifting. - -SLEEP files are laid out like this: - -``` -<32 byte header> - - - - -```` - -- 32 byte header - - 4 bytes - magic byte (value varies depending on which file, used to quickly identify which file type it is) - - 1 byte - version number of the file header protocol, current version is 0 - - 2 byte Uint16BE - entry size, describes how long each entry in the file is - - 1 byte - length prefix for body - - rest of 32 byte header - string describing key algorithm (in dat 'ed25519'). length of this string matches the length in the previous length prefix field. This string must fit within the 32 byte header limitation (24 bytes reserved for string). Unused bytes should be filled with zeroes. - -Possible values in the Dat implementation for the body field are: - -``` -Ed25519 -BLAKE2b -``` - -To calculate the offset of some entry position, first read the header and get the entry size, then do `32 + entrySize * entryIndex`. To calculate how many entries are in a file, you can use the entry size and the filesize on disk and do `(fileSize - 32) / entrySize`. - -As mentioned above, `signatures`, `bitfield` and `tree` are the three SLEEP files. There are two additional files, `key`, and `data`, which do not contain SLEEP file headers and store plain serialized data for easy access. `key` stores the public key that is described by the `signatures` file, and `data` stores the raw chunk data that the `tree` file contains the hashes and metadata for. - -### File Descriptions - -#### key - -The public key used to verify the signatures in the `signatures` file, stored in binary as a single buffer written to disk. To find out what format of key is stored in this file, read the header of `signatures`. In Dat, it's always a ed25519 public key, but other implementations can specify other key types using a string value in that header. - -#### tree - -A SLEEP formatted 32 byte header with data entries representing a serialized Merkle tree based on the data in the data storage layer. All the fixed size nodes written in in-order tree notation. The header algorithm string for `tree` files is `BLAKE2b`. The entry size is 40 bytes. Entries are formatted like this: - -``` -<32 byte header> - <4 byte magic string: 0x05025702> - <1 byte version number: 0> - <2 byte entry size: 40> - <1 byte algorithm name length prefix: 7> - <7 byte algorithm name: BLAKE2b> - <17 zeroes> -<40 byte entries> - <32 byte BLAKE2b hash> - <8 byte Uint64BE children leaf byte length> -``` - -The children leaf byte length is the byte size containing the sum byte length of all leaf nodes in the tree below this node. - -This file uses the in-order notation, meaning even entries are leaf nodes and odd entries are parent nodes (non-leaf). - -To prevent pre-image attacks, all hashes start with a one byte type descriptor: - -``` -0 - LEAF -1 - PARENT -2 - ROOT -``` - -To calculate leaf node entries (the hashes of the data entries) we hash this data: - -``` -BLAKE2b( - <1 byte type> - 0 - <8 bytes Uint64BE> - length of entry data - -) -``` - -Then we take this 32 byte hash and write it to the tree as 40 bytes like this: - -``` -<32 bytes> - BLAKE2b hash -<8 bytes Uint64BE> - length of data -``` - -Note that the Uint64 of length of data is included both in the hashed data and written at the end of the entry. This is to expose more metadata to Dat for advanced use cases such as verifying data length in sparse replication scenarios. - -To calculate parent node entries (the hashes of the leaf nodes) we hash this data: - -``` -BLAKE2b( - <1 byte> - 1 - <8 bytes Uint64BE> - left child length + right child length - <32 bytes> - left child hash - <32 bytes> - right child hash -) -``` - -Then we take this 32 byte hash and write it to the tree as 40 bytes like this: - -``` -<32 bytes> - BLAKE2b hash -<8 bytes Uint64BE> - left child length + right child length -``` - -The reason the tree entries contain data lengths is to allow for sparse mode replication. Encoding lengths (and including lengths in all hashes) means you can verify the Merkle subtrees independent of the rest of the tree, which happens during sparse replication scenarios. - -The tree file corresponds directly to the `data` file. - -#### data - -The `data` file is only included in the SLEEP format for the `metadata.*` prefixed files which contains filesystem metadata and not actual file data. For the `content.*` files, the data is stored externally (in Dat it is stored as normal files on the filesystem and not in a SLEEP file). However you can configure Dat to use a `content.data` file if you want and it will still work. If you want to store the full history of all versions of all files, using the `content.data` file would provide that guarantee, but would have the disadvantage of storing files as chunks merged into one huge file (not as user friendly). - -The `data` file does not contain a SLEEP file header. It just contains a bunch of concatenated data entries. Entries are written in the same order as they appear in the `tree` file. To read a `data` file, first decode the `tree` file and for every leaf in the `tree` file you can calculate a data offset for the data described by that leaf node in the `data` file. - -##### Index Lookup - -For example, if we wanted to seek to a specific entry offset (say entry 42): - -- First, read the header of the `tree` file and get the entry size, then do `32 + entrySize * 42` to get the raw tree index: `32 + (40 * 42)` -- Since we want the leaf entry (even node in the in-order layout), we multiply the entry index by 2: - `32 + (40 * (42 * 2))` -- Read the 40 bytes at that offset in the `tree` file to get the leaf node entry. -- Read the last 8 bytes of the entry to get the length of the data entry -- To calculate the offset of where in the `data` file your entry begins, you need to sum all the lengths of all the earlier entries in the tree. The most efficient way to do this is to sum all the previous parent node (non-leaf) entry lengths. You can also sum all leaf node lengths, but parent nodes contain the sum of their children's lengths so it's more efficient to use parents. During Dat replication, these nodes are fetched as part of the Merkle tree verification so you will already have them locally. This is a log(N) operation where N is the entry index. Entries are also small and therefore easily cacheable. -- Once you get the offset, you use the length you decoded above and read N bytes (where N is the decoded length) at the offset in the `data` file. You can verify the data integrity using the 32 byte hash from the `tree` entry. - -##### Byte Lookup - -The above method illustrates how to resolve a chunk position index to a byte offset. You can also do the reverse operation, resolving a byte offset to a chunk position index. This is used to stream arbitrary random access regions of files in sparse replication scenarios. - -- First, you start by calculating the current Merkle roots -- Each node in the tree (including these root nodes) stores the aggregate file size of all byte sizes of the nodes below it. So the roots cumulatively will describe all possible byte ranges for this repository. -- Find the root that contains the byte range of the offset you are looking for and get the node information for all of that nodes children using the Index Lookup method, and recursively repeat this step until you find the lowest down child node that describes this byte range. -- The chunk described by this child node will contain the byte range you are looking for. You can use the `byteOffset` property in the `Stat` metadata object to seek into the right position in the content for the start of this chunk. - -##### Metadata Overhead - -Using this scheme, if you write 4GB of data using on average 64KB data chunks (note: chunks can be variable length and do not need to be the same size), your tree file will be around 5MB (0.0125% overhead). - -#### signatures - -A SLEEP formatted 32 byte header with data entries being 64 byte signatures. - -``` -<32 byte header> - <4 byte magic string: 0x05025701> - <1 byte version number: 0> - <2 byte entry size: 64> - <1 byte algorithm name length prefix: 7> - <7 byte algorithm name: Ed25519> - <17 zeroes> -<64 byte entries> - <64 byte Ed25519 signature> -``` - -Every time the tree is updated we sign the current roots of the Merkle tree, and append them to the signatures file. The signatures file starts with no entries. Each time a new leaf is appended to the `tree` file (aka whenever data is added to a Dat), we take all root hashes at the current state of the Merkle tree and hash and sign them, then append them as a new entry to the signatures file. - -``` -Ed25519 sign( - BLAKE2b( - <1 byte> - 2 // root type - for (every root node left-to-right) { - <32 byte root hash> - <8 byte Uint64BE root tree index> - <8 byte Uint64BE child byte lengths> - } - ) -) -``` - -The reason we hash all the root nodes is that the BLAKE2b hash above is only calculable if you have all of the pieces of data required to generate all the intermediate hashes. This is the crux of Dat's data integrity guarantees. - -#### bitfield - -A SLEEP formatted 32 byte header followed by a series of 3328 byte long entries. - -``` -<32 byte header> - <4 byte magic string: 0x05025700> - <1 byte version number: 0> - <2 byte entry size: 3328> - <1 byte algorithm name length: 0> - <1 byte algorithm name: 0> - <24 zeroes> -<3328 byte entries> // (2048 + 1024 + 256) -``` - -The bitfield describes which pieces of data you have, and which nodes in the `tree` file have been written. This file exists as an index of the `tree` and `data` to quickly figure out which pieces of data you have or are missing. This file can be regenerated if you delete it, so it is considered a materialized index. - -The `bitfield` file actually contains three bitfields of different sizes. A bitfield (AKA bitmap) is defined as a set of bits where each bit (0 or 1) represents if you have or do not have a piece of data at that bit index. So if there is a dataset of 10 cat pictures, and you have pictures 1, 3, and 5 but are missing the rest, your bitfield would look like `1010100000`. - -Each entry contains three objects: - -- Data Bitfield (1024 bytes) - 1 bit for for each data entry that you have synced (1 for every entry in `data`). -- Tree Bitfield (2048 bytes) - 1 bit for every tree entry (all nodes in `tree`) -- Bitfield Index (256 bytes) - This is an index of the Data Bitfield that makes it efficient to figure out which pieces of data are missing from the Data Bitfield without having to do a linear scan. - -The Data Bitfield is 1Kb somewhat arbitrarily, but the idea is that because most filesystems work in 4Kb chunk sizes, we can fit the Data, Tree and Index in less then 4Kb of data for efficient writes to the filesystem. The Tree and Index sizes are based on the Data size (the Tree has twice the entries as the Data, odd and even nodes vs just even nodes in `tree`, and Index is always 1/4th the size). - -To generate the Index, you take pairs of 2 bytes at a time from the Data Bitfield, check if all bits in the 2 bytes are the same, and generate 4 bits of Index metadata for every 2 bytes of Data (hence how 1024 bytes of Data ends up as 256 bytes of Index). - -First you generate a 2 bit tuple for the 2 bytes of Data: - -``` -if (data is all 1's) then [1,1] -if (data is all 0's) then [0,0] -if (data is not all the same) then [1, 0] -``` - -The Index itself is an in-order binary tree, not a traditional bitfield. To generate the tree, you take the tuples you generate above and then write them into a tree like the following example, where non-leaf nodes are generated using the above scheme by looking at the results of the relative even child tuples for each odd parent tuple: - -``` -// for e.g. 16 bytes (8 tuples) of -// sparsely replicated data -0 - [00 00 00 00] -1 - [10 10 10 10] -2 - [11 11 11 11] -``` - -The tuples at entry `1` above are `[1,0]` because the relative child tuples are not uniform. In the following example, all non-leaf nodes are `[1,1]` because their relative children are all uniform (`[1,1]`) - -``` -// for e.g. 32 bytes (16 tuples) of -// fully replicated data (all 1's) -0 - [11 11 11 11] -1 - [11 11 11 11] -2 - [11 11 11 11] -3 - [11 11 11 11] -4 - [11 11 11 11] -5 - [11 11 11 11] -6 - [11 11 11 11] -``` - -Using this scheme, to represent 32 bytes of data it takes at most 8 bytes of Index. In this example it compresses nicely as its all contiguous ones on disk, similarly for an empty bitfield it would be all zeroes. - -If you write 4GB of data using on average 64KB data chunk size, your bitfield will be at most 32KB. - -#### metadata.data - -This file is used to store content described by the rest of the `metadata.*` hypercore SLEEP files. Whereas the `content.*` SLEEP files describe the data stored in the actual data cloned in the Dat repository filesystem, the `metadata` data feed is stored inside the `.dat` folder along with the rest of the SLEEP files. - -The contents of this file is a series of versions of the Dat filesystem tree. As this is a hypercore data feed, it's just an append only log of binary data entries. The challenge is representing a tree in a one-dimensional way to make it representable as a Hypercore register. For example, imagine three files: - -``` -~/dataset $ ls -figures - graph1.png - graph2.png -results.csv - -1 directory, 3 files -``` - -We want to take this structure and map it to a serialized representation that gets written into an append only log in a way that still allows for efficient random access by file path. - -To do this, we convert the filesystem metadata into entries in a feed like this: - -``` -{ - "path": "/results.csv", - trie: [[]], - sequence: 0 -} -{ - "path": "/figures/graph1.png", - trie: [[0], []], - sequence: 1 -} -{ - "path": "/figures/graph2.png", - trie: [[0], [1]], - sequence: 2 -} -``` - -##### Filename Resolution - -Each sequence represents adding one of the files to the register, so at sequence 0 the filesystem state only has a single file, `results.csv` in it. At sequence 1, there are only 2 files added to the register, and at sequence 3 all files are finally added. The `children` field represents a shorthand way of declaring which other files at every level of the directory hierarchy exist alongside the file being added at that revision. For example at the time of sequence 1, children is `[[0], []]`. The first sub-array, `[0]`, represents the first folder in the `path`, which is the root folder `/`. In this case `[0]` means the root folder at this point in time only has a single file, the file that is the subject of sequence `0`. The second subarray is empty `[]` because there are no other existing files in the second folder in the `path`, `figures`. - -To look up a file by filename, you fetch the latest entry in the log, then use the `children` metadata in that entry to look up the longest common ancestor based on the parent folders of the filename you are querying. You can then recursively repeat this operation until you find the `path` entry you are looking for (or you exhaust all options which means the file does not exist). This is a `O(number of slashes in your path)` operation. - -For example, if you wanted to look up `/results.csv` given the above register, you would start by grabbing the metadata at sequence 2. The longest common ancestor between `/results.csv` and `/figures/graph2` is `/`. You then grab the corresponding entry in the children array for `/`, which in this case is the first entry, `[0]`. You then repeat this with all of the children entries until you find a child that is closer to the entry you are looking for. In this example, the first entry happens to be the match we are looking for. - -You can also perform lookups relative to a point in time by starting from a specific sequence number in the register. For example to get the state of some file relative to an old sequence number, similar to checking out an old version of a repository in Git. - -##### Data Serialization - -The format of the `metadata.data` file is as follows: - -``` -
- - - - -``` - -Each entry in the file is encoded using Protocol Buffers [@varda2008protocol]. - -The first message we write to the file is of a type called Header which uses this schema: - -``` -message Header { - required string type = 1; - optional bytes content = 2; -} -``` - -This is used to declare two pieces of metadata used by Dat. It includes a `type` string with the value `hyperdrive` and `content` binary value that holds the public key of the content register that this metadata register represents. When you share a Dat, the metadata key is the main key that gets used, and the content register key is linked from here in the metadata. - -After the header the file will contain many filesystem `Node` entries: - -``` -message Node { - required string path = 1; - optional Stat value = 2; - optional bytes trie = 3; - repeated Writer writers = 4; - optional uint64 writersSequence = 5; -} - -message Writer { - required bytes publicKey = 1; - optional string permission = 2; -} -``` - -The `Node` object has five fields - - - `path` - the string of the absolute file path of this file. - - `Stat` - a Stat encoded object representing the file metadata - - `trie` - a compressed list of the sequence numbers as described earlier - - `writers` - a list of the writers who are allowed to write to this dat - - `writersSequence` - a reference to the last sequence where the writers array was modified. you can use this to quickly find the value of the writers keys. - -The `trie` value is encoded by starting with the nested array of sequence numbers, e.g. `[[[0, 3]], [[0, 2], [0, 1]]]`. Each entry is a tuple where the first item is the index of the feed in the `writers` array and the second value is the sequence number. Finally you prepend the trie value with a version number varint. - -To write these subarrays we use variable width integers (varints), using a repeating pattern like this, one for each array: - -``` - - - - - - -``` - -This encoding is designed for efficiency as it reduces the filesystem path + feed index metadata down to a series of small integers. - -The `Stat` objects use this encoding: - -``` -message Stat { - required uint32 mode = 1; - optional uint32 uid = 2; - optional uint32 gid = 3; - optional uint64 size = 4; - optional uint64 blocks = 5; - optional uint64 offset = 6; - optional uint64 byteOffset = 7; - optional uint64 mtime = 8; - optional uint64 ctime = 9; -} -``` - -These are the field definitions: - - - `mode` - POSIX file mode bitmask - - `uid` - POSIX user id - - `gid` - POSIX group id - - `size` - file size in bytes - - `blocks` - number of data chunks that make up this file - - `offset` - the data feed entry index for the first chunk in this file - - `byteOffset` - the data feed file byte offset for the first chunk in this file - - `mtime` - POSIX modified_at time - - `mtime` - POSIX created_at time - -## 4. Dat Network Protocol +## 3. Dat Network Protocol The SLEEP format is designed to allow for sparse replication, meaning you can efficiently download only the metadata and data required to resolve a single byte region of a single file, which makes Dat suitable for a wide variety of streaming, real time and large dataset use cases. @@ -915,7 +497,7 @@ message Data { } ``` -# 5. Multi-Writer +# 4. Multi-Writer The design of Dat up to this point assumes you have a single keyholder writing and signing data and appending it to the metadata and content feed. However having the ability for multiple keyholders to be able to write to a single repository allows for many interesting use cases such as forking and collaborative workflows. @@ -929,7 +511,7 @@ In case Bob and Alice write different values for the same file (e.g. Bob creates If a writer updates the value of a forked key with new value they are performing a merge. -# 6. Existing Work +# 5. Existing Work Dat is inspired by a number of features from existing systems. @@ -981,7 +563,7 @@ The UK Government Digital Service have developed the concept of a register which The design of registers was inspired by the infrastructure backing the Certificate Transparency [@laurie2013certificate] project, initiated at Google, which provides a service on top of SSL certificates that enables service providers to write certificates to a distributed public ledger. Any client or service provider can verify if a certificate they received is in the ledger, which protects against so called "rogue certificates". -# 7. Reference Implementation +# 6. Reference Implementation The connection logic is implemented in a module called [discovery-swarm](https://www.npmjs.com/package/discovery-swarm). This builds on discovery-channel and adds connection establishment, management and statistics. It provides statistics such as how many sources are currently connected, how many good and bad behaving sources have been talked to, and it automatically handles connecting and reconnecting to sources. UTP support is implemented in the module [utp-native](https://www.npmjs.com/package/utp-native). diff --git a/papers/dat-paper.pdf b/papers/dat-paper.pdf index 5770be0..5a7e758 100644 Binary files a/papers/dat-paper.pdf and b/papers/dat-paper.pdf differ diff --git a/papers/dat-paper.txt b/papers/dat-paper.txt index ee310f8..f0e71c2 100644 --- a/papers/dat-paper.txt +++ b/papers/dat-paper.txt @@ -113,8 +113,9 @@ reference implementation is available from npm as \texttt{npm\ install\ dat\ -g}. The protocol is agnostic to the underlying transport e.g.~you could -implement Dat over carrier pigeon. The key properties of the Dat design -are explained in this section. +implement Dat over carrier pigeon. Data is stored in a format called +SLEEP (Ogden and Buus 2017), described in it's own paper. The key +properties of the Dat design are explained in this section. \begin{itemize} \tightlist @@ -130,6 +131,9 @@ are explained in this section. \item 2.4 \textbf{Incremental Versioning} - Datasets can be efficiently synced, even in real time, to other peers. +\item + 2.5 \textbf{Random Access} - Huge file hierarchies can be efficiently + traversed remotely. \end{itemize} \subsection{2.1 Content Integrity}\label{content-integrity} @@ -573,7 +577,7 @@ Merkle tree (the content register): 5 - hash(4 + 6) 6 - hash(cat-1) 8 - hash(cat-2) - 9 - hash() + 9 - hash(8 + 10) 10 - hash(cat-3) \end{verbatim} @@ -630,693 +634,95 @@ three \texttt{Data} messages with the data for each block, as well as the hashes needed to verify the content in a way similar to the process described above for the metadata feed. -\subsection{3. SLEEP Specification}\label{sleep-specification} - -This section is a technical description of the SLEEP format intended for -implementers. SLEEP is the the on-disk format that Dat produces and -uses. It is a set of 9 files that hold all of the metadata needed to -list the contents of a Dat repository and verify the integrity of the -data you receive. SLEEP is designed to work with REST, allowing servers -to be plain HTTP file servers serving the static SLEEP files, meaning -you can implement a Dat protocol client using HTTP with a static HTTP -file server as the backend. - -SLEEP files contain metadata about the data inside a Dat repository, -including cryptographic hashes, cryptographic signatures, filenames and -file permissions. The SLEEP format is specifically engineered to allow -efficient access to subsets of the metadata and/or data in the -repository, even on very large repositories, which enables Dat's peer to -peer networking to be fast. - -The acronym SLEEP is a slumber related pun on REST and stands for -Syncable Lightweight Event Emitting Persistence. The Event Emitting part -refers to how SLEEP files are append-only in nature, meaning they grow -over time and new updates can be subscribed to as a realtime feed of -events through the Dat protocol. - -The SLEEP version described here, used in Dat as of 2017 is SLEEP V2. -SLEEP V1 is documented at http://specs.okfnlabs.org/sleep. - -\subsubsection{SLEEP Files}\label{sleep-files} - -SLEEP is a set of 9 files that should be stored with the following -names. In Dat, the files are stored in a folder called \texttt{.dat} in -the top level of the repository. - -\begin{verbatim} -metadata.key -metadata.signatures -metadata.bitfield -metadata.tree -metadata.data -content.key -content.signatures -content.bitfield -content.tree -\end{verbatim} - -The files prefixed with \texttt{content} store metadata about the -primary data in a Dat repository, for example the raw binary contents of -the files. The files prefixed with \texttt{metadata} store metadata -about the files in the repository, for example the filenames, file -sizes, and file permissions. The \texttt{content} and \texttt{metadata} -files are both Hypercore registers, making SLEEP a set of two Hypercore -registers. - -\subsubsection{SLEEP File Headers}\label{sleep-file-headers} +\subsection{2.5 Random Access}\label{random-access} -The following structured binary format is used for \texttt{signatures}, -\texttt{bitfield}, and \texttt{tree} files. The header contains metadata -as well as information needed to decode the rest of the files after the -header. SLEEP files are designed to be easy to append new data, easy to -read arbitrary byte offsets in the middle, and are relatively flat, -simple files that rely on the filesystem for the heavy lifting. - -SLEEP files are laid out like this: - -\begin{verbatim} -<32 byte header> - - - - -\end{verbatim} +Dat pursues the following access capabilities: \begin{itemize} \tightlist \item - 32 byte header -\item - 4 bytes - magic byte (value varies depending on which file, used to - quickly identify which file type it is) -\item - 1 byte - version number of the file header protocol, current version - is 0 -\item - 2 byte Uint16BE - entry size, describes how long each entry in the - file is -\item - 1 byte - length prefix for body -\item - rest of 32 byte header - string describing key algorithm (in dat - `ed25519'). length of this string matches the length in the previous - length prefix field. This string must fit within the 32 byte header - limitation (24 bytes reserved for string). Unused bytes should be - filled with zeroes. -\end{itemize} - -Possible values in the Dat implementation for the body field are: - -\begin{verbatim} -Ed25519 -BLAKE2b -\end{verbatim} - -To calculate the offset of some entry position, first read the header -and get the entry size, then do -\texttt{32\ +\ entrySize\ *\ entryIndex}. To calculate how many entries -are in a file, you can use the entry size and the filesize on disk and -do \texttt{(fileSize\ -\ 32)\ /\ entrySize}. - -As mentioned above, \texttt{signatures}, \texttt{bitfield} and -\texttt{tree} are the three SLEEP files. There are two additional files, -\texttt{key}, and \texttt{data}, which do not contain SLEEP file headers -and store plain serialized data for easy access. \texttt{key} stores the -public key that is described by the \texttt{signatures} file, and -\texttt{data} stores the raw chunk data that the \texttt{tree} file -contains the hashes and metadata for. - -\subsubsection{File Descriptions}\label{file-descriptions} - -\paragraph{key}\label{key} - -The public key used to verify the signatures in the \texttt{signatures} -file, stored in binary as a single buffer written to disk. To find out -what format of key is stored in this file, read the header of -\texttt{signatures}. In Dat, it's always a ed25519 public key, but other -implementations can specify other key types using a string value in that -header. - -\paragraph{tree}\label{tree} - -A SLEEP formatted 32 byte header with data entries representing a -serialized Merkle tree based on the data in the data storage layer. All -the fixed size nodes written in in-order tree notation. The header -algorithm string for \texttt{tree} files is \texttt{BLAKE2b}. The entry -size is 40 bytes. Entries are formatted like this: - -\begin{verbatim} -<32 byte header> - <4 byte magic string: 0x05025702> - <1 byte version number: 0> - <2 byte entry size: 40> - <1 byte algorithm name length prefix: 7> - <7 byte algorithm name: BLAKE2b> - <17 zeroes> -<40 byte entries> - <32 byte BLAKE2b hash> - <8 byte Uint64BE children leaf byte length> -\end{verbatim} - -The children leaf byte length is the byte size containing the sum byte -length of all leaf nodes in the tree below this node. - -This file uses the in-order notation, meaning even entries are leaf -nodes and odd entries are parent nodes (non-leaf). - -To prevent pre-image attacks, all hashes start with a one byte type -descriptor: - -\begin{verbatim} -0 - LEAF -1 - PARENT -2 - ROOT -\end{verbatim} - -To calculate leaf node entries (the hashes of the data entries) we hash -this data: - -\begin{verbatim} -BLAKE2b( - <1 byte type> - 0 - <8 bytes Uint64BE> - length of entry data - -) -\end{verbatim} - -Then we take this 32 byte hash and write it to the tree as 40 bytes like -this: - -\begin{verbatim} -<32 bytes> - BLAKE2b hash -<8 bytes Uint64BE> - length of data -\end{verbatim} - -Note that the Uint64 of length of data is included both in the hashed -data and written at the end of the entry. This is to expose more -metadata to Dat for advanced use cases such as verifying data length in -sparse replication scenarios. - -To calculate parent node entries (the hashes of the leaf nodes) we hash -this data: - -\begin{verbatim} -BLAKE2b( - <1 byte> - 1 - <8 bytes Uint64BE> - left child length + right child length - <32 bytes> - left child hash - <32 bytes> - right child hash -) -\end{verbatim} - -Then we take this 32 byte hash and write it to the tree as 40 bytes like -this: - -\begin{verbatim} -<32 bytes> - BLAKE2b hash -<8 bytes Uint64BE> - left child length + right child length -\end{verbatim} - -The reason the tree entries contain data lengths is to allow for sparse -mode replication. Encoding lengths (and including lengths in all hashes) -means you can verify the Merkle subtrees independent of the rest of the -tree, which happens during sparse replication scenarios. - -The tree file corresponds directly to the \texttt{data} file. - -\paragraph{data}\label{data} - -The \texttt{data} file is only included in the SLEEP format for the -\texttt{metadata.*} prefixed files which contains filesystem metadata -and not actual file data. For the \texttt{content.*} files, the data is -stored externally (in Dat it is stored as normal files on the filesystem -and not in a SLEEP file). However you can configure Dat to use a -\texttt{content.data} file if you want and it will still work. If you -want to store the full history of all versions of all files, using the -\texttt{content.data} file would provide that guarantee, but would have -the disadvantage of storing files as chunks merged into one huge file -(not as user friendly). - -The \texttt{data} file does not contain a SLEEP file header. It just -contains a bunch of concatenated data entries. Entries are written in -the same order as they appear in the \texttt{tree} file. To read a -\texttt{data} file, first decode the \texttt{tree} file and for every -leaf in the \texttt{tree} file you can calculate a data offset for the -data described by that leaf node in the \texttt{data} file. - -\subparagraph{Index Lookup}\label{index-lookup} - -For example, if we wanted to seek to a specific entry offset (say entry -42): - -\begin{itemize} -\tightlist + Support large file hierachies (millions of files in a single + repository). \item - First, read the header of the \texttt{tree} file and get the entry - size, then do \texttt{32\ +\ entrySize\ *\ 42} to get the raw tree - index: \texttt{32\ +\ (40\ *\ 42)} + Support efficient traversal of the hierarchy (listing files in + arbitrary folders efficiently). \item - Since we want the leaf entry (even node in the in-order layout), we - multiply the entry index by 2: \texttt{32\ +\ (40\ *\ (42\ *\ 2))} + Store all changes to all files (metadata and/or content). \item - Read the 40 bytes at that offset in the \texttt{tree} file to get the - leaf node entry. + List all changes made to any single file. \item - Read the last 8 bytes of the entry to get the length of the data entry + View the state of all files relative to any point in time. \item - To calculate the offset of where in the \texttt{data} file your entry - begins, you need to sum all the lengths of all the earlier entries in - the tree. The most efficient way to do this is to sum all the previous - parent node (non-leaf) entry lengths. You can also sum all leaf node - lengths, but parent nodes contain the sum of their children's lengths - so it's more efficient to use parents. During Dat replication, these - nodes are fetched as part of the Merkle tree verification so you will - already have them locally. This is a log(N) operation where N is the - entry index. Entries are also small and therefore easily cacheable. + Subscribe live to all changes (any file). \item - Once you get the offset, you use the length you decoded above and read - N bytes (where N is the decoded length) at the offset in the - \texttt{data} file. You can verify the data integrity using the 32 - byte hash from the \texttt{tree} entry. -\end{itemize} - -\subparagraph{Byte Lookup}\label{byte-lookup} - -The above method illustrates how to resolve a chunk position index to a -byte offset. You can also do the reverse operation, resolving a byte -offset to a chunk position index. This is used to stream arbitrary -random access regions of files in sparse replication scenarios. - -\begin{itemize} -\tightlist + Subscribe live to changes to files under a specific path. \item - First, you start by calculating the current Merkle roots + Efficiently access any byte range of any version of any file. \item - Each node in the tree (including these root nodes) stores the - aggregate file size of all byte sizes of the nodes below it. So the - roots cumulatively will describe all possible byte ranges for this - repository. + Allow all of the above to happen remotely, only syncing the minimum + metadata necessary to perform any action. \item - Find the root that contains the byte range of the offset you are - looking for and get the node information for all of that nodes - children using the Index Lookup method, and recursively repeat this - step until you find the lowest down child node that describes this - byte range. + Allow efficient comparison of remote and local repository state to + request missing pieces during synchronization. \item - The chunk described by this child node will contain the byte range you - are looking for. You can use the \texttt{byteOffset} property in the - \texttt{Stat} metadata object to seek into the right position in the - content for the start of this chunk. + Allow entire remote archive to be synchronized, or just some subset of + files and/or versions. \end{itemize} -\subparagraph{Metadata Overhead}\label{metadata-overhead} +The way Dat accomplishes these is through a combination of storing all +changes in Hypercore feeds, but also using strategic metadata indexing +strategies that support certain queries efficiently to be performed by +traversing the Hypercore feeds. The protocol itself is specified in +Section 3 (SLEEP), but a scenario based summary follows here. -Using this scheme, if you write 4GB of data using on average 64KB data -chunks (note: chunks can be variable length and do not need to be the -same size), your tree file will be around 5MB (0.0125\% overhead). +\subsubsection{Scenario: Reading a file from a specific byte +offset}\label{scenario-reading-a-file-from-a-specific-byte-offset} -\paragraph{signatures}\label{signatures} +Alice has a dataset in Dat, Bob wants to access a 100MB CSV called +\texttt{cat\_dna.csv} stored in the remote repository, but only wants to +access the 10MB range of the CSV spanning from 30MB - 40MB. -A SLEEP formatted 32 byte header with data entries being 64 byte -signatures. - -\begin{verbatim} -<32 byte header> - <4 byte magic string: 0x05025701> - <1 byte version number: 0> - <2 byte entry size: 64> - <1 byte algorithm name length prefix: 7> - <7 byte algorithm name: Ed25519> - <17 zeroes> -<64 byte entries> - <64 byte Ed25519 signature> -\end{verbatim} +Bob has never communicated with Alice before, and is starting fresh with +no knowledge of this Dat repository other than that he knows he wants +\texttt{cat\_dna.csv} at a specific offset. -Every time the tree is updated we sign the current roots of the Merkle -tree, and append them to the signatures file. The signatures file starts -with no entries. Each time a new leaf is appended to the \texttt{tree} -file (aka whenever data is added to a Dat), we take all root hashes at -the current state of the Merkle tree and hash and sign them, then append -them as a new entry to the signatures file. +First, Bob asks Alice through the Dat protocol for the metadata he needs +to resolve \texttt{cat\_dna.csv} to the correct metadata feed entry that +represents the file he wants. Note: In this scenario we assume Bob wants +the latest version of \texttt{cat\_dna.csv}. It is also possible to do +this for a specific older version. -\begin{verbatim} -Ed25519 sign( - BLAKE2b( - <1 byte> - 2 // root type - for (every root node left-to-right) { - <32 byte root hash> - <8 byte Uint64BE root tree index> - <8 byte Uint64BE child byte lengths> - } - ) -) -\end{verbatim} +Bob first sends a \texttt{Request} message for the latest entry in the +metadata feed. Alice responds. Bob looks at the \texttt{trie} value, and +using the lookup algorithm described below sends another +\texttt{Request} message for the metadata node that is closer to the +filename he is looking for. This repeats until Alice sends Bob the +matching metadata entry. This is the un-optimized resolution that uses +\texttt{log(n)} round trips, though there are ways to optimize this by +having Alice send additional sequence numbers to Bob that help him +traverse in less round trips. -The reason we hash all the root nodes is that the BLAKE2b hash above is -only calculable if you have all of the pieces of data required to -generate all the intermediate hashes. This is the crux of Dat's data -integrity guarantees. +In the metadata record Bob recieved for \texttt{cat\_dna.csv} there is +the byte offset to the beginning of the file in the data feed. Bob adds +his +30MB offset to this value and starts requesting pieces of data +starting at that byte offset using the SLEEP protocol as described +below. -\paragraph{bitfield}\label{bitfield} +This method tries to allow any byte range of any file to be accessed +without the need to synchronize the full metadata for all files up +front. -A SLEEP formatted 32 byte header followed by a series of 3328 byte long -entries. +\subsubsection{Scenario: Syncing live changes to files at a specific +path}\label{scenario-syncing-live-changes-to-files-at-a-specific-path} -\begin{verbatim} -<32 byte header> - <4 byte magic string: 0x05025700> - <1 byte version number: 0> - <2 byte entry size: 3328> - <1 byte algorithm name length: 0> - <1 byte algorithm name: 0> - <24 zeroes> -<3328 byte entries> // (2048 + 1024 + 256) -\end{verbatim} +TODO -The bitfield describes which pieces of data you have, and which nodes in -the \texttt{tree} file have been written. This file exists as an index -of the \texttt{tree} and \texttt{data} to quickly figure out which -pieces of data you have or are missing. This file can be regenerated if -you delete it, so it is considered a materialized index. +\subsubsection{Scenario: Syncing an entire +archive}\label{scenario-syncing-an-entire-archive} -The \texttt{bitfield} file actually contains three bitfields of -different sizes. A bitfield (AKA bitmap) is defined as a set of bits -where each bit (0 or 1) represents if you have or do not have a piece of -data at that bit index. So if there is a dataset of 10 cat pictures, and -you have pictures 1, 3, and 5 but are missing the rest, your bitfield -would look like \texttt{1010100000}. +TODO -Each entry contains three objects: - -\begin{itemize} -\tightlist -\item - Data Bitfield (1024 bytes) - 1 bit for for each data entry that you - have synced (1 for every entry in \texttt{data}). -\item - Tree Bitfield (2048 bytes) - 1 bit for every tree entry (all nodes in - \texttt{tree}) -\item - Bitfield Index (256 bytes) - This is an index of the Data Bitfield - that makes it efficient to figure out which pieces of data are missing - from the Data Bitfield without having to do a linear scan. -\end{itemize} - -The Data Bitfield is 1Kb somewhat arbitrarily, but the idea is that -because most filesystems work in 4Kb chunk sizes, we can fit the Data, -Tree and Index in less then 4Kb of data for efficient writes to the -filesystem. The Tree and Index sizes are based on the Data size (the -Tree has twice the entries as the Data, odd and even nodes vs just even -nodes in \texttt{tree}, and Index is always 1/4th the size). - -To generate the Index, you take pairs of 2 bytes at a time from the Data -Bitfield, check if all bits in the 2 bytes are the same, and generate 4 -bits of Index metadata~for every 2 bytes of Data (hence how 1024 bytes -of Data ends up as 256 bytes of Index). - -First you generate a 2 bit tuple for the 2 bytes of Data: - -\begin{verbatim} -if (data is all 1's) then [1,1] -if (data is all 0's) then [0,0] -if (data is not all the same) then [1, 0] -\end{verbatim} - -The Index itself is an in-order binary tree, not a traditional bitfield. -To generate the tree, you take the tuples you generate above and then -write them into a tree like the following example, where non-leaf nodes -are generated using the above scheme by looking at the results of the -relative even child tuples for each odd parent tuple: - -\begin{verbatim} -// for e.g. 16 bytes (8 tuples) of -// sparsely replicated data -0 - [00 00 00 00] -1 - [10 10 10 10] -2 - [11 11 11 11] -\end{verbatim} - -The tuples at entry \texttt{1} above are \texttt{{[}1,0{]}} because the -relative child tuples are not uniform. In the following example, all -non-leaf nodes are \texttt{{[}1,1{]}} because their relative children -are all uniform (\texttt{{[}1,1{]}}) - -\begin{verbatim} -// for e.g. 32 bytes (16 tuples) of -// fully replicated data (all 1's) -0 - [11 11 11 11] -1 - [11 11 11 11] -2 - [11 11 11 11] -3 - [11 11 11 11] -4 - [11 11 11 11] -5 - [11 11 11 11] -6 - [11 11 11 11] -\end{verbatim} - -Using this scheme, to represent 32 bytes of data it takes at most 8 -bytes of Index. In this example it compresses nicely as its all -contiguous ones on disk, similarly for an empty bitfield it would be all -zeroes. - -If you write 4GB of data using on average 64KB data chunk size, your -bitfield will be at most 32KB. - -\paragraph{metadata.data}\label{metadata.data} - -This file is used to store content described by the rest of the -\texttt{metadata.*} hypercore SLEEP files. Whereas the -\texttt{content.*} SLEEP files describe the data stored in the actual -data cloned in the Dat repository filesystem, the \texttt{metadata} data -feed is stored inside the \texttt{.dat} folder along with the rest of -the SLEEP files. - -The contents of this file is a series of versions of the Dat filesystem -tree. As this is a hypercore data feed, it's just an append only log of -binary data entries. The challenge is representing a tree in a -one-dimensional way to make it representable as a Hypercore register. -For example, imagine three files: - -\begin{verbatim} -~/dataset $ ls -figures - graph1.png - graph2.png -results.csv - -1 directory, 3 files -\end{verbatim} - -We want to take this structure and map it to a serialized representation -that gets written into an append only log in a way that still allows for -efficient random access by file path. - -To do this, we convert the filesystem metadata into entries in a feed -like this: - -\begin{verbatim} -{ - "path": "/results.csv", - children: [[]], - sequence: 0 -} -{ - "path": "/figures/graph1.png", - children: [[0], []], - sequence: 1 -} -{ - "path": "/figures/graph2.png", - children: [[0], [1]], - sequence: 2 -} -\end{verbatim} - -\subparagraph{Filename Resolution}\label{filename-resolution} - -Each sequence represents adding one of the files to the register, so at -sequence 0 the filesystem state only has a single file, -\texttt{results.csv} in it. At sequence 1, there are only 2 files added -to the register, and at sequence 3 all files are finally added. The -\texttt{children} field represents a shorthand way of declaring which -other files at every level of the directory hierarchy exist alongside -the file being added at that revision. For example at the time of -sequence 1, children is \texttt{{[}{[}0{]},\ {[}{]}{]}}. The first -sub-array, \texttt{{[}0{]}}, represents the first folder in the -\texttt{path}, which is the root folder \texttt{/}. In this case -\texttt{{[}0{]}} means the root folder at this point in time only has a -single file, the file that is the subject of sequence \texttt{0}. The -second subarray is empty \texttt{{[}{]}} because there are no other -existing files in the second folder in the \texttt{path}, -\texttt{figures}. - -To look up a file by filename, you fetch the latest entry in the log, -then use the \texttt{children} metadata in that entry to look up the -longest common ancestor based on the parent folders of the filename you -are querying. You can then recursively repeat this operation until you -find the \texttt{path} entry you are looking for (or you exhaust all -options which means the file does not exist). This is a -\texttt{O(number\ of\ slashes\ in\ your\ path)} operation. - -For example, if you wanted to look up \texttt{/results.csv} given the -above register, you would start by grabbing the metadata at sequence 2. -The longest common ancestor between \texttt{/results.csv} and -\texttt{/figures/graph2} is \texttt{/}. You then grab the corresponding -entry in the children array for \texttt{/}, which in this case is the -first entry, \texttt{{[}0{]}}. You then repeat this with all of the -children entries until you find a child that is closer to the entry you -are looking for. In this example, the first entry happens to be the -match we are looking for. - -You can also perform lookups relative to a point in time by starting -from a specific sequence number in the register. For example to get the -state of some file relative to an old sequence number, similar to -checking out an old version of a repository in Git. - -\subparagraph{Data Serialization}\label{data-serialization} - -The format of the \texttt{metadata.data} file is as follows: - -\begin{verbatim} -
- - - - -\end{verbatim} - -Each entry in the file is encoded using Protocol Buffers (Varda 2008). - -The first message we write to the file is of a type called Header which -uses this schema: - -\begin{verbatim} -message Header { - required string type = 1; - optional bytes content = 2; -} -\end{verbatim} - -This is used to declare two pieces of metadata used by Dat. It includes -a \texttt{type} string with the value \texttt{hyperdrive} and -\texttt{content} binary value that holds the public key of the content -register that this metadata register represents. When you share a Dat, -the metadata key is the main key that gets used, and the content -register key is linked from here in the metadata. - -After the header the file will contain many filesystem \texttt{Node} -entries: - -\begin{verbatim} -message Node { - required string path = 1; - optional Stat value = 2; - optional bytes children = 3; - repeated Writer writers = 4; - optional uint64 writersSequence = 5; -} - -message Writer { - required bytes publicKey = 1; - optional string permission = 2; -} -\end{verbatim} - -The \texttt{Node} object has five fields - -\begin{itemize} -\tightlist -\item - \texttt{path} - the string of the absolute file path of this file. -\item - \texttt{Stat} - a Stat encoded object representing the file metadata -\item - \texttt{children} - a compressed list of the sequence numbers as - described earlier -\item - \texttt{writers} - a list of the writers who are allowed to write to - this dat -\item - \texttt{writersSequence} - a reference to the last sequence where the - writers array was modified. you can use this to quickly find the value - of the writers keys. -\end{itemize} - -The \texttt{children} value is encoded by starting with the nested array -of sequence numbers, e.g. -\texttt{{[}{[}{[}0,\ 3{]}{]},\ {[}{[}0,\ 2{]},\ {[}0,\ 1{]}{]}{]}}. Each -entry is a tuple where the first item is the index of the feed in the -\texttt{writers} array and the second value is the sequence number. -Finally you prepend the children value with a version number varint. - -To write these subarrays we use variable width integers (varints), using -a repeating pattern like this, one for each array: - -\begin{verbatim} - - - - - - -\end{verbatim} - -This encoding is designed for efficiency as it reduces the filesystem -path + feed index metadata down to a series of small integers. - -The \texttt{Stat} objects use this encoding: - -\begin{verbatim} -message Stat { - required uint32 mode = 1; - optional uint32 uid = 2; - optional uint32 gid = 3; - optional uint64 size = 4; - optional uint64 blocks = 5; - optional uint64 offset = 6; - optional uint64 byteOffset = 7; - optional uint64 mtime = 8; - optional uint64 ctime = 9; -} -\end{verbatim} - -These are the field definitions: - -\begin{itemize} -\tightlist -\item - \texttt{mode} - POSIX file mode bitmask -\item - \texttt{uid} - POSIX user id -\item - \texttt{gid} - POSIX group id -\item - \texttt{size} - file size in bytes -\item - \texttt{blocks} - number of data chunks that make up this file -\item - \texttt{offset} - the data feed entry index for the first chunk in - this file -\item - \texttt{byteOffset} - the data feed file byte offset for the first - chunk in this file -\item - \texttt{mtime} - POSIX modified\_at time -\item - \texttt{mtime} - POSIX created\_at time -\end{itemize} - -\subsection{4. Dat Network Protocol}\label{dat-network-protocol} +\subsection{3. Dat Network Protocol}\label{dat-network-protocol} The SLEEP format is designed to allow for sparse replication, meaning you can efficiently download only the metadata and data required to @@ -1565,7 +971,7 @@ message Cancel { } \end{verbatim} -\subsubsection{Data}\label{data-1} +\subsubsection{Data}\label{data} Type 9. Sends a single chunk of data to the other peer. You can send it in response to a Request or unsolicited on its own as a friendly gift. @@ -1609,7 +1015,7 @@ message Data { } \end{verbatim} -\section{5. Multi-Writer}\label{multi-writer} +\section{4. Multi-Writer}\label{multi-writer} The design of Dat up to this point assumes you have a single keyholder writing and signing data and appending it to the metadata and content @@ -1647,7 +1053,7 @@ the latest value, or whatever other strategy they prefer. If a writer updates the value of a forked key with new value they are performing a merge. -\section{6. Existing Work}\label{existing-work} +\section{5. Existing Work}\label{existing-work} Dat is inspired by a number of features from existing systems. @@ -1802,7 +1208,7 @@ public ledger. Any client or service provider can verify if a certificate they received is in the ledger, which protects against so called ``rogue certificates''. -\section{7. Reference Implementation}\label{reference-implementation} +\section{6. Reference Implementation}\label{reference-implementation} The connection logic is implemented in a module called \href{https://www.npmjs.com/package/discovery-swarm}{discovery-swarm}. @@ -1864,14 +1270,13 @@ Mykletun, Einar, Maithili Narasimha, and Gene Tsudik. 2003. ``Providing Authentication and Integrity in Outsourced Databases Using Merkle Hash Trees.'' \emph{UCI-SCONCE Technical Report}. +\hypertarget{ref-sleep}{} +Ogden, Maxwell, and Mathias Buus. 2017. ``SLEEP - the Dat Protocol on +Disk Format.'' In. + \hypertarget{ref-rossi2010ledbat}{} Rossi, Dario, Claudio Testa, Silvio Valenti, and Luca Muscariello. 2010. ``LEDBAT: The New Bittorrent Congestion Control Protocol.'' In \emph{ICCCN}, 1--6. -\hypertarget{ref-varda2008protocol}{} -Varda, Kenton. 2008. ``Protocol Buffers: Google's Data Interchange -Format.'' \emph{Google Open Source Blog, Available at Least as Early as -Jul}. - \end{document} diff --git a/papers/hyperdrive.md b/papers/hyperdrive.md new file mode 100644 index 0000000..a0d70b3 --- /dev/null +++ b/papers/hyperdrive.md @@ -0,0 +1,13 @@ +--- +title: "Hyperdrive - A distributed web filesystem with incremental sync" +date: "August 2017" +author: "Mathias Buus Madsen, Maxwell Ogden, Code for Science" +--- + +# Abstract + +# Acknowledgements + +This work was made possible through grants from the John S. and James L. Knight and Alfred P. Sloan Foundations. + +# References diff --git a/papers/sleep.md b/papers/sleep.md new file mode 100644 index 0000000..63dbe48 --- /dev/null +++ b/papers/sleep.md @@ -0,0 +1,424 @@ +--- +title: "SLEEP - Syncable Ledger of Exact Events Protocol" +date: "August 2017" +author: "Mathias Buus Madsen, Maxwell Ogden, Code for Science" +--- + +## SLEEP + +This document is a technical description of the SLEEP format intended for implementers. SLEEP is the the on-disk format that Dat produces and uses. It is a set of 9 files that hold all of the metadata needed to list the contents of a Dat repository and verify the integrity of the data you receive. SLEEP is designed to work with REST, allowing servers to be plain HTTP file servers serving the static SLEEP files, meaning you can implement a Dat protocol client using HTTP with a static HTTP file server as the backend. + +SLEEP files contain metadata about the data inside a Dat repository, including cryptographic hashes, cryptographic signatures, filenames and file permissions. The SLEEP format is specifically designed to allow efficient access to subsets of the metadata and/or data in the repository, even on very large repositories, which enables Dat's peer to peer networking to be fast. + +The acronym SLEEP is a slumber related pun on REST and stands for Syncable Ledger of Exact Events Protocol. The Syncable part refers to how SLEEP files are append-only in nature, meaning they grow over time and new updates can be subscribed to as a realtime feed of events through the Dat protocol. + +The SLEEP version described here, used in Dat as of 2017 is SLEEP V2. SLEEP V1 is documented at http://specs.okfnlabs.org/sleep. + +### SLEEP Files + +SLEEP is a set of 9 files that should be stored with the following names. In Dat, the files are stored in a folder called `.dat` in the top level of the repository. + +``` +metadata.key +metadata.signatures +metadata.bitfield +metadata.tree +metadata.data +content.key +content.signatures +content.bitfield +content.tree +``` + +The files prefixed with `content` store metadata about the primary data in a Dat repository, for example the raw binary contents of the files. The files prefixed with `metadata` store metadata about the files in the repository, for example the filenames, file sizes, and file permissions. The `content` and `metadata` files are both Hypercore registers, making SLEEP a set of two Hypercore registers. + +### SLEEP File Headers + +The following structured binary format is used for `signatures`, `bitfield`, and `tree` files. The header contains metadata as well as information needed to decode the rest of the files after the header. SLEEP files are designed to be easy to append new data, easy to read arbitrary byte offsets in the middle, and are relatively flat, simple files that rely on the filesystem for the heavy lifting. + +SLEEP files are laid out like this: + +``` +<32 byte header> + + + + +```` + +- 32 byte header + - 4 bytes - magic byte (value varies depending on which file, used to quickly identify which file type it is) + - 1 byte - version number of the file header protocol, current version is 0 + - 2 byte Uint16BE - entry size, describes how long each entry in the file is + - 1 byte - length prefix for body + - rest of 32 byte header - string describing key algorithm (in dat 'ed25519'). length of this string matches the length in the previous length prefix field. This string must fit within the 32 byte header limitation (24 bytes reserved for string). Unused bytes should be filled with zeroes. + +Possible values in the Dat implementation for the body field are: + +``` +Ed25519 +BLAKE2b +``` + +To calculate the offset of some entry position, first read the header and get the entry size, then do `32 + entrySize * entryIndex`. To calculate how many entries are in a file, you can use the entry size and the filesize on disk and do `(fileSize - 32) / entrySize`. + +As mentioned above, `signatures`, `bitfield` and `tree` are the three SLEEP files. There are two additional files, `key`, and `data`, which do not contain SLEEP file headers and store plain serialized data for easy access. `key` stores the public key that is described by the `signatures` file, and `data` stores the raw chunk data that the `tree` file contains the hashes and metadata for. + +### File Descriptions + +#### key + +The public key used to verify the signatures in the `signatures` file, stored in binary as a single buffer written to disk. To find out what format of key is stored in this file, read the header of `signatures`. In Dat, it's always a ed25519 public key, but other implementations can specify other key types using a string value in that header. + +#### tree + +A SLEEP formatted 32 byte header with data entries representing a serialized Merkle tree based on the data in the data storage layer. All the fixed size nodes written in in-order tree notation. The header algorithm string for `tree` files is `BLAKE2b`. The entry size is 40 bytes. Entries are formatted like this: + +``` +<32 byte header> + <4 byte magic string: 0x05025702> + <1 byte version number: 0> + <2 byte entry size: 40> + <1 byte algorithm name length prefix: 7> + <7 byte algorithm name: BLAKE2b> + <17 zeroes> +<40 byte entries> + <32 byte BLAKE2b hash> + <8 byte Uint64BE children leaf byte length> +``` + +The children leaf byte length is the byte size containing the sum byte length of all leaf nodes in the tree below this node. + +This file uses the in-order notation, meaning even entries are leaf nodes and odd entries are parent nodes (non-leaf). + +To prevent pre-image attacks, all hashes start with a one byte type descriptor: + +``` +0 - LEAF +1 - PARENT +2 - ROOT +``` + +To calculate leaf node entries (the hashes of the data entries) we hash this data: + +``` +BLAKE2b( + <1 byte type> + 0 + <8 bytes Uint64BE> + length of entry data + +) +``` + +Then we take this 32 byte hash and write it to the tree as 40 bytes like this: + +``` +<32 bytes> + BLAKE2b hash +<8 bytes Uint64BE> + length of data +``` + +Note that the Uint64 of length of data is included both in the hashed data and written at the end of the entry. This is to expose more metadata to Dat for advanced use cases such as verifying data length in sparse replication scenarios. + +To calculate parent node entries (the hashes of the leaf nodes) we hash this data: + +``` +BLAKE2b( + <1 byte> + 1 + <8 bytes Uint64BE> + left child length + right child length + <32 bytes> + left child hash + <32 bytes> + right child hash +) +``` + +Then we take this 32 byte hash and write it to the tree as 40 bytes like this: + +``` +<32 bytes> + BLAKE2b hash +<8 bytes Uint64BE> + left child length + right child length +``` + +The reason the tree entries contain data lengths is to allow for sparse mode replication. Encoding lengths (and including lengths in all hashes) means you can verify the Merkle subtrees independent of the rest of the tree, which happens during sparse replication scenarios. + +The tree file corresponds directly to the `data` file. + +#### data + +The `data` file is only included in the SLEEP format for the `metadata.*` prefixed files which contains filesystem metadata and not actual file data. For the `content.*` files, the data is stored externally (in Dat it is stored as normal files on the filesystem and not in a SLEEP file). However you can configure Dat to use a `content.data` file if you want and it will still work. If you want to store the full history of all versions of all files, using the `content.data` file would provide that guarantee, but would have the disadvantage of storing files as chunks merged into one huge file (not as user friendly). + +The `data` file does not contain a SLEEP file header. It just contains a bunch of concatenated data entries. Entries are written in the same order as they appear in the `tree` file. To read a `data` file, first decode the `tree` file and for every leaf in the `tree` file you can calculate a data offset for the data described by that leaf node in the `data` file. + +##### Index Lookup + +For example, if we wanted to seek to a specific entry offset (say entry 42): + +- First, read the header of the `tree` file and get the entry size, then do `32 + entrySize * 42` to get the raw tree index: `32 + (40 * 42)` +- Since we want the leaf entry (even node in the in-order layout), we multiply the entry index by 2: + `32 + (40 * (42 * 2))` +- Read the 40 bytes at that offset in the `tree` file to get the leaf node entry. +- Read the last 8 bytes of the entry to get the length of the data entry +- To calculate the offset of where in the `data` file your entry begins, you need to sum all the lengths of all the earlier entries in the tree. The most efficient way to do this is to sum all the previous parent node (non-leaf) entry lengths. You can also sum all leaf node lengths, but parent nodes contain the sum of their children's lengths so it's more efficient to use parents. During Dat replication, these nodes are fetched as part of the Merkle tree verification so you will already have them locally. This is a log(N) operation where N is the entry index. Entries are also small and therefore easily cacheable. +- Once you get the offset, you use the length you decoded above and read N bytes (where N is the decoded length) at the offset in the `data` file. You can verify the data integrity using the 32 byte hash from the `tree` entry. + +##### Byte Lookup + +The above method illustrates how to resolve a chunk position index to a byte offset. You can also do the reverse operation, resolving a byte offset to a chunk position index. This is used to stream arbitrary random access regions of files in sparse replication scenarios. + +- First, you start by calculating the current Merkle roots +- Each node in the tree (including these root nodes) stores the aggregate file size of all byte sizes of the nodes below it. So the roots cumulatively will describe all possible byte ranges for this repository. +- Find the root that contains the byte range of the offset you are looking for and get the node information for all of that nodes children using the Index Lookup method, and recursively repeat this step until you find the lowest down child node that describes this byte range. +- The chunk described by this child node will contain the byte range you are looking for. You can use the `byteOffset` property in the `Stat` metadata object to seek into the right position in the content for the start of this chunk. + +##### Metadata Overhead + +Using this scheme, if you write 4GB of data using on average 64KB data chunks (note: chunks can be variable length and do not need to be the same size), your tree file will be around 5MB (0.0125% overhead). + +#### signatures + +A SLEEP formatted 32 byte header with data entries being 64 byte signatures. + +``` +<32 byte header> + <4 byte magic string: 0x05025701> + <1 byte version number: 0> + <2 byte entry size: 64> + <1 byte algorithm name length prefix: 7> + <7 byte algorithm name: Ed25519> + <17 zeroes> +<64 byte entries> + <64 byte Ed25519 signature> +``` + +Every time the tree is updated we sign the current roots of the Merkle tree, and append them to the signatures file. The signatures file starts with no entries. Each time a new leaf is appended to the `tree` file (aka whenever data is added to a Dat), we take all root hashes at the current state of the Merkle tree and hash and sign them, then append them as a new entry to the signatures file. + +``` +Ed25519 sign( + BLAKE2b( + <1 byte> + 2 // root type + for (every root node left-to-right) { + <32 byte root hash> + <8 byte Uint64BE root tree index> + <8 byte Uint64BE child byte lengths> + } + ) +) +``` + +The reason we hash all the root nodes is that the BLAKE2b hash above is only calculable if you have all of the pieces of data required to generate all the intermediate hashes. This is the crux of Dat's data integrity guarantees. + +#### bitfield + +A SLEEP formatted 32 byte header followed by a series of 3328 byte long entries. + +``` +<32 byte header> + <4 byte magic string: 0x05025700> + <1 byte version number: 0> + <2 byte entry size: 3328> + <1 byte algorithm name length: 0> + <1 byte algorithm name: 0> + <24 zeroes> +<3328 byte entries> // (2048 + 1024 + 256) +``` + +The bitfield describes which pieces of data you have, and which nodes in the `tree` file have been written. This file exists as an index of the `tree` and `data` to quickly figure out which pieces of data you have or are missing. This file can be regenerated if you delete it, so it is considered a materialized index. + +The `bitfield` file actually contains three bitfields of different sizes. A bitfield (AKA bitmap) is defined as a set of bits where each bit (0 or 1) represents if you have or do not have a piece of data at that bit index. So if there is a dataset of 10 cat pictures, and you have pictures 1, 3, and 5 but are missing the rest, your bitfield would look like `1010100000`. + +Each entry contains three objects: + +- Data Bitfield (1024 bytes) - 1 bit for for each data entry that you have synced (1 for every entry in `data`). +- Tree Bitfield (2048 bytes) - 1 bit for every tree entry (all nodes in `tree`) +- Bitfield Index (256 bytes) - This is an index of the Data Bitfield that makes it efficient to figure out which pieces of data are missing from the Data Bitfield without having to do a linear scan. + +The Data Bitfield is 1Kb somewhat arbitrarily, but the idea is that because most filesystems work in 4Kb chunk sizes, we can fit the Data, Tree and Index in less then 4Kb of data for efficient writes to the filesystem. The Tree and Index sizes are based on the Data size (the Tree has twice the entries as the Data, odd and even nodes vs just even nodes in `tree`, and Index is always 1/4th the size). + +To generate the Index, you take pairs of 2 bytes at a time from the Data Bitfield, check if all bits in the 2 bytes are the same, and generate 4 bits of Index metadata for every 2 bytes of Data (hence how 1024 bytes of Data ends up as 256 bytes of Index). + +First you generate a 2 bit tuple for the 2 bytes of Data: + +``` +if (data is all 1's) then [1,1] +if (data is all 0's) then [0,0] +if (data is not all the same) then [1, 0] +``` + +The Index itself is an in-order binary tree, not a traditional bitfield. To generate the tree, you take the tuples you generate above and then write them into a tree like the following example, where non-leaf nodes are generated using the above scheme by looking at the results of the relative even child tuples for each odd parent tuple: + +``` +// for e.g. 16 bytes (8 tuples) of +// sparsely replicated data +0 - [00 00 00 00] +1 - [10 10 10 10] +2 - [11 11 11 11] +``` + +The tuples at entry `1` above are `[1,0]` because the relative child tuples are not uniform. In the following example, all non-leaf nodes are `[1,1]` because their relative children are all uniform (`[1,1]`) + +``` +// for e.g. 32 bytes (16 tuples) of +// fully replicated data (all 1's) +0 - [11 11 11 11] +1 - [11 11 11 11] +2 - [11 11 11 11] +3 - [11 11 11 11] +4 - [11 11 11 11] +5 - [11 11 11 11] +6 - [11 11 11 11] +``` + +Using this scheme, to represent 32 bytes of data it takes at most 8 bytes of Index. In this example it compresses nicely as its all contiguous ones on disk, similarly for an empty bitfield it would be all zeroes. + +If you write 4GB of data using on average 64KB data chunk size, your bitfield will be at most 32KB. + +#### metadata.data + +This file is used to store content described by the rest of the `metadata.*` hypercore SLEEP files. Whereas the `content.*` SLEEP files describe the data stored in the actual data cloned in the Dat repository filesystem, the `metadata` data feed is stored inside the `.dat` folder along with the rest of the SLEEP files. + +The contents of this file is a series of versions of the Dat filesystem tree. As this is a hypercore data feed, it's just an append only log of binary data entries. The challenge is representing a tree in a one-dimensional way to make it representable as a Hypercore register. For example, imagine three files: + +``` +~/dataset $ ls +figures + graph1.png + graph2.png +results.csv + +1 directory, 3 files +``` + +We want to take this structure and map it to a serialized representation that gets written into an append only log in a way that still allows for efficient random access by file path. + +To do this, we convert the filesystem metadata into entries in a feed like this: + +``` +{ + "path": "/results.csv", + trie: [[]], + sequence: 0 +} +{ + "path": "/figures/graph1.png", + trie: [[0], []], + sequence: 1 +} +{ + "path": "/figures/graph2.png", + trie: [[0], [1]], + sequence: 2 +} +``` + +##### Filename Resolution + +Each sequence represents adding one of the files to the register, so at sequence 0 the filesystem state only has a single file, `results.csv` in it. At sequence 1, there are only 2 files added to the register, and at sequence 3 all files are finally added. The `children` field represents a shorthand way of declaring which other files at every level of the directory hierarchy exist alongside the file being added at that revision. For example at the time of sequence 1, children is `[[0], []]`. The first sub-array, `[0]`, represents the first folder in the `path`, which is the root folder `/`. In this case `[0]` means the root folder at this point in time only has a single file, the file that is the subject of sequence `0`. The second subarray is empty `[]` because there are no other existing files in the second folder in the `path`, `figures`. + +To look up a file by filename, you fetch the latest entry in the log, then use the `children` metadata in that entry to look up the longest common ancestor based on the parent folders of the filename you are querying. You can then recursively repeat this operation until you find the `path` entry you are looking for (or you exhaust all options which means the file does not exist). This is a `O(number of slashes in your path)` operation. + +For example, if you wanted to look up `/results.csv` given the above register, you would start by grabbing the metadata at sequence 2. The longest common ancestor between `/results.csv` and `/figures/graph2` is `/`. You then grab the corresponding entry in the children array for `/`, which in this case is the first entry, `[0]`. You then repeat this with all of the children entries until you find a child that is closer to the entry you are looking for. In this example, the first entry happens to be the match we are looking for. + +You can also perform lookups relative to a point in time by starting from a specific sequence number in the register. For example to get the state of some file relative to an old sequence number, similar to checking out an old version of a repository in Git. + +##### Data Serialization + +The format of the `metadata.data` file is as follows: + +``` +
+ + + + +``` + +Each entry in the file is encoded using Protocol Buffers [@varda2008protocol]. + +The first message we write to the file is of a type called Header which uses this schema: + +``` +message Header { + required string type = 1; + optional bytes content = 2; +} +``` + +This is used to declare two pieces of metadata used by Dat. It includes a `type` string with the value `hyperdrive` and `content` binary value that holds the public key of the content register that this metadata register represents. When you share a Dat, the metadata key is the main key that gets used, and the content register key is linked from here in the metadata. + +After the header the file will contain many filesystem `Node` entries: + +``` +message Node { + required string path = 1; + optional Stat value = 2; + optional bytes trie = 3; + repeated Writer writers = 4; + optional uint64 writersSequence = 5; +} + +message Writer { + required bytes publicKey = 1; + optional string permission = 2; +} +``` + +The `Node` object has five fields + + - `path` - the string of the absolute file path of this file. + - `Stat` - a Stat encoded object representing the file metadata + - `trie` - a compressed list of the sequence numbers as described earlier + - `writers` - a list of the writers who are allowed to write to this dat + - `writersSequence` - a reference to the last sequence where the writers array was modified. you can use this to quickly find the value of the writers keys. + +The `trie` value is encoded by starting with the nested array of sequence numbers, e.g. `[[[0, 3]], [[0, 2], [0, 1]]]`. Each entry is a tuple where the first item is the index of the feed in the `writers` array and the second value is the sequence number. Finally you prepend the trie value with a version number varint. + +To write these subarrays we use variable width integers (varints), using a repeating pattern like this, one for each array: + +``` + + + + + + +``` + +This encoding is designed for efficiency as it reduces the filesystem path + feed index metadata down to a series of small integers. + +The `Stat` objects use this encoding: + +``` +message Stat { + required uint32 mode = 1; + optional uint32 uid = 2; + optional uint32 gid = 3; + optional uint64 size = 4; + optional uint64 blocks = 5; + optional uint64 offset = 6; + optional uint64 byteOffset = 7; + optional uint64 mtime = 8; + optional uint64 ctime = 9; +} +``` + +These are the field definitions: + + - `mode` - POSIX file mode bitmask + - `uid` - POSIX user id + - `gid` - POSIX group id + - `size` - file size in bytes + - `blocks` - number of data chunks that make up this file + - `offset` - the data feed entry index for the first chunk in this file + - `byteOffset` - the data feed file byte offset for the first chunk in this file + - `mtime` - POSIX modified_at time + - `mtime` - POSIX created_at time + +## References \ No newline at end of file diff --git a/papers/sleep.pdf b/papers/sleep.pdf new file mode 100644 index 0000000..a4281e8 Binary files /dev/null and b/papers/sleep.pdf differ diff --git a/papers/sleep.txt b/papers/sleep.txt new file mode 100644 index 0000000..241347f --- /dev/null +++ b/papers/sleep.txt @@ -0,0 +1,756 @@ +\documentclass[a4paperpaper,twocolumn]{article} +\usepackage{lmodern} +\usepackage{amssymb,amsmath} +\usepackage{ifxetex,ifluatex} +\usepackage{fixltx2e} % provides \textsubscript +\ifnum 0\ifxetex 1\fi\ifluatex 1\fi=0 % if pdftex + \usepackage[T1]{fontenc} + \usepackage[utf8]{inputenc} +\else % if luatex or xelatex + \ifxetex + \usepackage{mathspec} + \else + \usepackage{fontspec} + \fi + \defaultfontfeatures{Ligatures=TeX,Scale=MatchLowercase} +\fi +% use upquote if available, for straight quotes in verbatim environments +\IfFileExists{upquote.sty}{\usepackage{upquote}}{} +% use microtype if available +\IfFileExists{microtype.sty}{% +\usepackage{microtype} +\UseMicrotypeSet[protrusion]{basicmath} % disable protrusion for tt fonts +}{} +\usepackage[unicode=true]{hyperref} +\hypersetup{ + pdftitle={SLEEP - Syncable Ledger of Exact Events Protocol}, + pdfauthor={Mathias Buus Madsen, Maxwell Ogden, Code for Science}, + pdfborder={0 0 0}, + breaklinks=true} +\urlstyle{same} % don't use monospace font for urls +\IfFileExists{parskip.sty}{% +\usepackage{parskip} +}{% else +\setlength{\parindent}{0pt} +\setlength{\parskip}{6pt plus 2pt minus 1pt} +} +\setlength{\emergencystretch}{3em} % prevent overfull lines +\providecommand{\tightlist}{% + \setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}} +\setcounter{secnumdepth}{0} +% Redefines (sub)paragraphs to behave more like sections +\ifx\paragraph\undefined\else +\let\oldparagraph\paragraph +\renewcommand{\paragraph}[1]{\oldparagraph{#1}\mbox{}} +\fi +\ifx\subparagraph\undefined\else +\let\oldsubparagraph\subparagraph +\renewcommand{\subparagraph}[1]{\oldsubparagraph{#1}\mbox{}} +\fi + +% set default figure placement to htbp +\makeatletter +\def\fps@figure{htbp} +\makeatother + + +\title{SLEEP - Syncable Ledger of Exact Events Protocol} +\author{Mathias Buus Madsen, Maxwell Ogden, Code for Science} +\date{August 2017} + +\begin{document} +\maketitle + +\subsection{SLEEP}\label{sleep} + +This document is a technical description of the SLEEP format intended +for implementers. SLEEP is the the on-disk format that Dat produces and +uses. It is a set of 9 files that hold all of the metadata needed to +list the contents of a Dat repository and verify the integrity of the +data you receive. SLEEP is designed to work with REST, allowing servers +to be plain HTTP file servers serving the static SLEEP files, meaning +you can implement a Dat protocol client using HTTP with a static HTTP +file server as the backend. + +SLEEP files contain metadata about the data inside a Dat repository, +including cryptographic hashes, cryptographic signatures, filenames and +file permissions. The SLEEP format is specifically designed to allow +efficient access to subsets of the metadata and/or data in the +repository, even on very large repositories, which enables Dat's peer to +peer networking to be fast. + +The acronym SLEEP is a slumber related pun on REST and stands for +Syncable Ledger of Exact Events Protocol. The Syncable part refers to +how SLEEP files are append-only in nature, meaning they grow over time +and new updates can be subscribed to as a realtime feed of events +through the Dat protocol. + +The SLEEP version described here, used in Dat as of 2017 is SLEEP V2. +SLEEP V1 is documented at http://specs.okfnlabs.org/sleep. + +\subsubsection{SLEEP Files}\label{sleep-files} + +SLEEP is a set of 9 files that should be stored with the following +names. In Dat, the files are stored in a folder called \texttt{.dat} in +the top level of the repository. + +\begin{verbatim} +metadata.key +metadata.signatures +metadata.bitfield +metadata.tree +metadata.data +content.key +content.signatures +content.bitfield +content.tree +\end{verbatim} + +The files prefixed with \texttt{content} store metadata about the +primary data in a Dat repository, for example the raw binary contents of +the files. The files prefixed with \texttt{metadata} store metadata +about the files in the repository, for example the filenames, file +sizes, and file permissions. The \texttt{content} and \texttt{metadata} +files are both Hypercore registers, making SLEEP a set of two Hypercore +registers. + +\subsubsection{SLEEP File Headers}\label{sleep-file-headers} + +The following structured binary format is used for \texttt{signatures}, +\texttt{bitfield}, and \texttt{tree} files. The header contains metadata +as well as information needed to decode the rest of the files after the +header. SLEEP files are designed to be easy to append new data, easy to +read arbitrary byte offsets in the middle, and are relatively flat, +simple files that rely on the filesystem for the heavy lifting. + +SLEEP files are laid out like this: + +\begin{verbatim} +<32 byte header> + + + + +\end{verbatim} + +\begin{itemize} +\tightlist +\item + 32 byte header +\item + 4 bytes - magic byte (value varies depending on which file, used to + quickly identify which file type it is) +\item + 1 byte - version number of the file header protocol, current version + is 0 +\item + 2 byte Uint16BE - entry size, describes how long each entry in the + file is +\item + 1 byte - length prefix for body +\item + rest of 32 byte header - string describing key algorithm (in dat + `ed25519'). length of this string matches the length in the previous + length prefix field. This string must fit within the 32 byte header + limitation (24 bytes reserved for string). Unused bytes should be + filled with zeroes. +\end{itemize} + +Possible values in the Dat implementation for the body field are: + +\begin{verbatim} +Ed25519 +BLAKE2b +\end{verbatim} + +To calculate the offset of some entry position, first read the header +and get the entry size, then do +\texttt{32\ +\ entrySize\ *\ entryIndex}. To calculate how many entries +are in a file, you can use the entry size and the filesize on disk and +do \texttt{(fileSize\ -\ 32)\ /\ entrySize}. + +As mentioned above, \texttt{signatures}, \texttt{bitfield} and +\texttt{tree} are the three SLEEP files. There are two additional files, +\texttt{key}, and \texttt{data}, which do not contain SLEEP file headers +and store plain serialized data for easy access. \texttt{key} stores the +public key that is described by the \texttt{signatures} file, and +\texttt{data} stores the raw chunk data that the \texttt{tree} file +contains the hashes and metadata for. + +\subsubsection{File Descriptions}\label{file-descriptions} + +\paragraph{key}\label{key} + +The public key used to verify the signatures in the \texttt{signatures} +file, stored in binary as a single buffer written to disk. To find out +what format of key is stored in this file, read the header of +\texttt{signatures}. In Dat, it's always a ed25519 public key, but other +implementations can specify other key types using a string value in that +header. + +\paragraph{tree}\label{tree} + +A SLEEP formatted 32 byte header with data entries representing a +serialized Merkle tree based on the data in the data storage layer. All +the fixed size nodes written in in-order tree notation. The header +algorithm string for \texttt{tree} files is \texttt{BLAKE2b}. The entry +size is 40 bytes. Entries are formatted like this: + +\begin{verbatim} +<32 byte header> + <4 byte magic string: 0x05025702> + <1 byte version number: 0> + <2 byte entry size: 40> + <1 byte algorithm name length prefix: 7> + <7 byte algorithm name: BLAKE2b> + <17 zeroes> +<40 byte entries> + <32 byte BLAKE2b hash> + <8 byte Uint64BE children leaf byte length> +\end{verbatim} + +The children leaf byte length is the byte size containing the sum byte +length of all leaf nodes in the tree below this node. + +This file uses the in-order notation, meaning even entries are leaf +nodes and odd entries are parent nodes (non-leaf). + +To prevent pre-image attacks, all hashes start with a one byte type +descriptor: + +\begin{verbatim} +0 - LEAF +1 - PARENT +2 - ROOT +\end{verbatim} + +To calculate leaf node entries (the hashes of the data entries) we hash +this data: + +\begin{verbatim} +BLAKE2b( + <1 byte type> + 0 + <8 bytes Uint64BE> + length of entry data + +) +\end{verbatim} + +Then we take this 32 byte hash and write it to the tree as 40 bytes like +this: + +\begin{verbatim} +<32 bytes> + BLAKE2b hash +<8 bytes Uint64BE> + length of data +\end{verbatim} + +Note that the Uint64 of length of data is included both in the hashed +data and written at the end of the entry. This is to expose more +metadata to Dat for advanced use cases such as verifying data length in +sparse replication scenarios. + +To calculate parent node entries (the hashes of the leaf nodes) we hash +this data: + +\begin{verbatim} +BLAKE2b( + <1 byte> + 1 + <8 bytes Uint64BE> + left child length + right child length + <32 bytes> + left child hash + <32 bytes> + right child hash +) +\end{verbatim} + +Then we take this 32 byte hash and write it to the tree as 40 bytes like +this: + +\begin{verbatim} +<32 bytes> + BLAKE2b hash +<8 bytes Uint64BE> + left child length + right child length +\end{verbatim} + +The reason the tree entries contain data lengths is to allow for sparse +mode replication. Encoding lengths (and including lengths in all hashes) +means you can verify the Merkle subtrees independent of the rest of the +tree, which happens during sparse replication scenarios. + +The tree file corresponds directly to the \texttt{data} file. + +\paragraph{data}\label{data} + +The \texttt{data} file is only included in the SLEEP format for the +\texttt{metadata.*} prefixed files which contains filesystem metadata +and not actual file data. For the \texttt{content.*} files, the data is +stored externally (in Dat it is stored as normal files on the filesystem +and not in a SLEEP file). However you can configure Dat to use a +\texttt{content.data} file if you want and it will still work. If you +want to store the full history of all versions of all files, using the +\texttt{content.data} file would provide that guarantee, but would have +the disadvantage of storing files as chunks merged into one huge file +(not as user friendly). + +The \texttt{data} file does not contain a SLEEP file header. It just +contains a bunch of concatenated data entries. Entries are written in +the same order as they appear in the \texttt{tree} file. To read a +\texttt{data} file, first decode the \texttt{tree} file and for every +leaf in the \texttt{tree} file you can calculate a data offset for the +data described by that leaf node in the \texttt{data} file. + +\subparagraph{Index Lookup}\label{index-lookup} + +For example, if we wanted to seek to a specific entry offset (say entry +42): + +\begin{itemize} +\tightlist +\item + First, read the header of the \texttt{tree} file and get the entry + size, then do \texttt{32\ +\ entrySize\ *\ 42} to get the raw tree + index: \texttt{32\ +\ (40\ *\ 42)} +\item + Since we want the leaf entry (even node in the in-order layout), we + multiply the entry index by 2: \texttt{32\ +\ (40\ *\ (42\ *\ 2))} +\item + Read the 40 bytes at that offset in the \texttt{tree} file to get the + leaf node entry. +\item + Read the last 8 bytes of the entry to get the length of the data entry +\item + To calculate the offset of where in the \texttt{data} file your entry + begins, you need to sum all the lengths of all the earlier entries in + the tree. The most efficient way to do this is to sum all the previous + parent node (non-leaf) entry lengths. You can also sum all leaf node + lengths, but parent nodes contain the sum of their children's lengths + so it's more efficient to use parents. During Dat replication, these + nodes are fetched as part of the Merkle tree verification so you will + already have them locally. This is a log(N) operation where N is the + entry index. Entries are also small and therefore easily cacheable. +\item + Once you get the offset, you use the length you decoded above and read + N bytes (where N is the decoded length) at the offset in the + \texttt{data} file. You can verify the data integrity using the 32 + byte hash from the \texttt{tree} entry. +\end{itemize} + +\subparagraph{Byte Lookup}\label{byte-lookup} + +The above method illustrates how to resolve a chunk position index to a +byte offset. You can also do the reverse operation, resolving a byte +offset to a chunk position index. This is used to stream arbitrary +random access regions of files in sparse replication scenarios. + +\begin{itemize} +\tightlist +\item + First, you start by calculating the current Merkle roots +\item + Each node in the tree (including these root nodes) stores the + aggregate file size of all byte sizes of the nodes below it. So the + roots cumulatively will describe all possible byte ranges for this + repository. +\item + Find the root that contains the byte range of the offset you are + looking for and get the node information for all of that nodes + children using the Index Lookup method, and recursively repeat this + step until you find the lowest down child node that describes this + byte range. +\item + The chunk described by this child node will contain the byte range you + are looking for. You can use the \texttt{byteOffset} property in the + \texttt{Stat} metadata object to seek into the right position in the + content for the start of this chunk. +\end{itemize} + +\subparagraph{Metadata Overhead}\label{metadata-overhead} + +Using this scheme, if you write 4GB of data using on average 64KB data +chunks (note: chunks can be variable length and do not need to be the +same size), your tree file will be around 5MB (0.0125\% overhead). + +\paragraph{signatures}\label{signatures} + +A SLEEP formatted 32 byte header with data entries being 64 byte +signatures. + +\begin{verbatim} +<32 byte header> + <4 byte magic string: 0x05025701> + <1 byte version number: 0> + <2 byte entry size: 64> + <1 byte algorithm name length prefix: 7> + <7 byte algorithm name: Ed25519> + <17 zeroes> +<64 byte entries> + <64 byte Ed25519 signature> +\end{verbatim} + +Every time the tree is updated we sign the current roots of the Merkle +tree, and append them to the signatures file. The signatures file starts +with no entries. Each time a new leaf is appended to the \texttt{tree} +file (aka whenever data is added to a Dat), we take all root hashes at +the current state of the Merkle tree and hash and sign them, then append +them as a new entry to the signatures file. + +\begin{verbatim} +Ed25519 sign( + BLAKE2b( + <1 byte> + 2 // root type + for (every root node left-to-right) { + <32 byte root hash> + <8 byte Uint64BE root tree index> + <8 byte Uint64BE child byte lengths> + } + ) +) +\end{verbatim} + +The reason we hash all the root nodes is that the BLAKE2b hash above is +only calculable if you have all of the pieces of data required to +generate all the intermediate hashes. This is the crux of Dat's data +integrity guarantees. + +\paragraph{bitfield}\label{bitfield} + +A SLEEP formatted 32 byte header followed by a series of 3328 byte long +entries. + +\begin{verbatim} +<32 byte header> + <4 byte magic string: 0x05025700> + <1 byte version number: 0> + <2 byte entry size: 3328> + <1 byte algorithm name length: 0> + <1 byte algorithm name: 0> + <24 zeroes> +<3328 byte entries> // (2048 + 1024 + 256) +\end{verbatim} + +The bitfield describes which pieces of data you have, and which nodes in +the \texttt{tree} file have been written. This file exists as an index +of the \texttt{tree} and \texttt{data} to quickly figure out which +pieces of data you have or are missing. This file can be regenerated if +you delete it, so it is considered a materialized index. + +The \texttt{bitfield} file actually contains three bitfields of +different sizes. A bitfield (AKA bitmap) is defined as a set of bits +where each bit (0 or 1) represents if you have or do not have a piece of +data at that bit index. So if there is a dataset of 10 cat pictures, and +you have pictures 1, 3, and 5 but are missing the rest, your bitfield +would look like \texttt{1010100000}. + +Each entry contains three objects: + +\begin{itemize} +\tightlist +\item + Data Bitfield (1024 bytes) - 1 bit for for each data entry that you + have synced (1 for every entry in \texttt{data}). +\item + Tree Bitfield (2048 bytes) - 1 bit for every tree entry (all nodes in + \texttt{tree}) +\item + Bitfield Index (256 bytes) - This is an index of the Data Bitfield + that makes it efficient to figure out which pieces of data are missing + from the Data Bitfield without having to do a linear scan. +\end{itemize} + +The Data Bitfield is 1Kb somewhat arbitrarily, but the idea is that +because most filesystems work in 4Kb chunk sizes, we can fit the Data, +Tree and Index in less then 4Kb of data for efficient writes to the +filesystem. The Tree and Index sizes are based on the Data size (the +Tree has twice the entries as the Data, odd and even nodes vs just even +nodes in \texttt{tree}, and Index is always 1/4th the size). + +To generate the Index, you take pairs of 2 bytes at a time from the Data +Bitfield, check if all bits in the 2 bytes are the same, and generate 4 +bits of Index metadata~for every 2 bytes of Data (hence how 1024 bytes +of Data ends up as 256 bytes of Index). + +First you generate a 2 bit tuple for the 2 bytes of Data: + +\begin{verbatim} +if (data is all 1's) then [1,1] +if (data is all 0's) then [0,0] +if (data is not all the same) then [1, 0] +\end{verbatim} + +The Index itself is an in-order binary tree, not a traditional bitfield. +To generate the tree, you take the tuples you generate above and then +write them into a tree like the following example, where non-leaf nodes +are generated using the above scheme by looking at the results of the +relative even child tuples for each odd parent tuple: + +\begin{verbatim} +// for e.g. 16 bytes (8 tuples) of +// sparsely replicated data +0 - [00 00 00 00] +1 - [10 10 10 10] +2 - [11 11 11 11] +\end{verbatim} + +The tuples at entry \texttt{1} above are \texttt{{[}1,0{]}} because the +relative child tuples are not uniform. In the following example, all +non-leaf nodes are \texttt{{[}1,1{]}} because their relative children +are all uniform (\texttt{{[}1,1{]}}) + +\begin{verbatim} +// for e.g. 32 bytes (16 tuples) of +// fully replicated data (all 1's) +0 - [11 11 11 11] +1 - [11 11 11 11] +2 - [11 11 11 11] +3 - [11 11 11 11] +4 - [11 11 11 11] +5 - [11 11 11 11] +6 - [11 11 11 11] +\end{verbatim} + +Using this scheme, to represent 32 bytes of data it takes at most 8 +bytes of Index. In this example it compresses nicely as its all +contiguous ones on disk, similarly for an empty bitfield it would be all +zeroes. + +If you write 4GB of data using on average 64KB data chunk size, your +bitfield will be at most 32KB. + +\paragraph{metadata.data}\label{metadata.data} + +This file is used to store content described by the rest of the +\texttt{metadata.*} hypercore SLEEP files. Whereas the +\texttt{content.*} SLEEP files describe the data stored in the actual +data cloned in the Dat repository filesystem, the \texttt{metadata} data +feed is stored inside the \texttt{.dat} folder along with the rest of +the SLEEP files. + +The contents of this file is a series of versions of the Dat filesystem +tree. As this is a hypercore data feed, it's just an append only log of +binary data entries. The challenge is representing a tree in a +one-dimensional way to make it representable as a Hypercore register. +For example, imagine three files: + +\begin{verbatim} +~/dataset $ ls +figures + graph1.png + graph2.png +results.csv + +1 directory, 3 files +\end{verbatim} + +We want to take this structure and map it to a serialized representation +that gets written into an append only log in a way that still allows for +efficient random access by file path. + +To do this, we convert the filesystem metadata into entries in a feed +like this: + +\begin{verbatim} +{ + "path": "/results.csv", + trie: [[]], + sequence: 0 +} +{ + "path": "/figures/graph1.png", + trie: [[0], []], + sequence: 1 +} +{ + "path": "/figures/graph2.png", + trie: [[0], [1]], + sequence: 2 +} +\end{verbatim} + +\subparagraph{Filename Resolution}\label{filename-resolution} + +Each sequence represents adding one of the files to the register, so at +sequence 0 the filesystem state only has a single file, +\texttt{results.csv} in it. At sequence 1, there are only 2 files added +to the register, and at sequence 3 all files are finally added. The +\texttt{children} field represents a shorthand way of declaring which +other files at every level of the directory hierarchy exist alongside +the file being added at that revision. For example at the time of +sequence 1, children is \texttt{{[}{[}0{]},\ {[}{]}{]}}. The first +sub-array, \texttt{{[}0{]}}, represents the first folder in the +\texttt{path}, which is the root folder \texttt{/}. In this case +\texttt{{[}0{]}} means the root folder at this point in time only has a +single file, the file that is the subject of sequence \texttt{0}. The +second subarray is empty \texttt{{[}{]}} because there are no other +existing files in the second folder in the \texttt{path}, +\texttt{figures}. + +To look up a file by filename, you fetch the latest entry in the log, +then use the \texttt{children} metadata in that entry to look up the +longest common ancestor based on the parent folders of the filename you +are querying. You can then recursively repeat this operation until you +find the \texttt{path} entry you are looking for (or you exhaust all +options which means the file does not exist). This is a +\texttt{O(number\ of\ slashes\ in\ your\ path)} operation. + +For example, if you wanted to look up \texttt{/results.csv} given the +above register, you would start by grabbing the metadata at sequence 2. +The longest common ancestor between \texttt{/results.csv} and +\texttt{/figures/graph2} is \texttt{/}. You then grab the corresponding +entry in the children array for \texttt{/}, which in this case is the +first entry, \texttt{{[}0{]}}. You then repeat this with all of the +children entries until you find a child that is closer to the entry you +are looking for. In this example, the first entry happens to be the +match we are looking for. + +You can also perform lookups relative to a point in time by starting +from a specific sequence number in the register. For example to get the +state of some file relative to an old sequence number, similar to +checking out an old version of a repository in Git. + +\subparagraph{Data Serialization}\label{data-serialization} + +The format of the \texttt{metadata.data} file is as follows: + +\begin{verbatim} +
+ + + + +\end{verbatim} + +Each entry in the file is encoded using Protocol Buffers (Varda 2008). + +The first message we write to the file is of a type called Header which +uses this schema: + +\begin{verbatim} +message Header { + required string type = 1; + optional bytes content = 2; +} +\end{verbatim} + +This is used to declare two pieces of metadata used by Dat. It includes +a \texttt{type} string with the value \texttt{hyperdrive} and +\texttt{content} binary value that holds the public key of the content +register that this metadata register represents. When you share a Dat, +the metadata key is the main key that gets used, and the content +register key is linked from here in the metadata. + +After the header the file will contain many filesystem \texttt{Node} +entries: + +\begin{verbatim} +message Node { + required string path = 1; + optional Stat value = 2; + optional bytes trie = 3; + repeated Writer writers = 4; + optional uint64 writersSequence = 5; +} + +message Writer { + required bytes publicKey = 1; + optional string permission = 2; +} +\end{verbatim} + +The \texttt{Node} object has five fields + +\begin{itemize} +\tightlist +\item + \texttt{path} - the string of the absolute file path of this file. +\item + \texttt{Stat} - a Stat encoded object representing the file metadata +\item + \texttt{trie} - a compressed list of the sequence numbers as described + earlier +\item + \texttt{writers} - a list of the writers who are allowed to write to + this dat +\item + \texttt{writersSequence} - a reference to the last sequence where the + writers array was modified. you can use this to quickly find the value + of the writers keys. +\end{itemize} + +The \texttt{trie} value is encoded by starting with the nested array of +sequence numbers, e.g. +\texttt{{[}{[}{[}0,\ 3{]}{]},\ {[}{[}0,\ 2{]},\ {[}0,\ 1{]}{]}{]}}. Each +entry is a tuple where the first item is the index of the feed in the +\texttt{writers} array and the second value is the sequence number. +Finally you prepend the trie value with a version number varint. + +To write these subarrays we use variable width integers (varints), using +a repeating pattern like this, one for each array: + +\begin{verbatim} + + + + + + +\end{verbatim} + +This encoding is designed for efficiency as it reduces the filesystem +path + feed index metadata down to a series of small integers. + +The \texttt{Stat} objects use this encoding: + +\begin{verbatim} +message Stat { + required uint32 mode = 1; + optional uint32 uid = 2; + optional uint32 gid = 3; + optional uint64 size = 4; + optional uint64 blocks = 5; + optional uint64 offset = 6; + optional uint64 byteOffset = 7; + optional uint64 mtime = 8; + optional uint64 ctime = 9; +} +\end{verbatim} + +These are the field definitions: + +\begin{itemize} +\tightlist +\item + \texttt{mode} - POSIX file mode bitmask +\item + \texttt{uid} - POSIX user id +\item + \texttt{gid} - POSIX group id +\item + \texttt{size} - file size in bytes +\item + \texttt{blocks} - number of data chunks that make up this file +\item + \texttt{offset} - the data feed entry index for the first chunk in + this file +\item + \texttt{byteOffset} - the data feed file byte offset for the first + chunk in this file +\item + \texttt{mtime} - POSIX modified\_at time +\item + \texttt{mtime} - POSIX created\_at time +\end{itemize} + +\hypertarget{refs}{} +\hypertarget{ref-varda2008protocol}{} +Varda, Kenton. 2008. ``Protocol Buffers: Google's Data Interchange +Format.'' \emph{Google Open Source Blog, Available at Least as Early as +Jul}. + +\end{document} -- cgit v1.2.3