-rw-r--r-- | papers/dat-paper.latex (renamed from papers/dat-paper.txt) | 84
-rw-r--r-- | papers/dat-paper.pdf | bin 250116 -> 248958 bytes
-rw-r--r-- | papers/sleep.latex (renamed from papers/sleep.txt) | 35
-rw-r--r-- | papers/sleep.pdf | bin 202643 -> 207531 bytes
4 files changed, 30 insertions, 89 deletions
diff --git a/papers/dat-paper.txt b/papers/dat-paper.latex
index f0e71c2..5d90b07 100644
--- a/papers/dat-paper.txt
+++ b/papers/dat-paper.latex
@@ -48,15 +48,9 @@
 \renewcommand{\subparagraph}[1]{\oldsubparagraph{#1}\mbox{}}
 \fi

-% set default figure placement to htbp
-\makeatletter
-\def\fps@figure{htbp}
-\makeatother
-
-
 \title{Dat - Distributed Dataset Synchronization And Versioning}
 \author{Maxwell Ogden, Karissa McKelvey, Mathias Buus Madsen, Code for Science}
-\date{May 2017}
+\date{May 2017 (last updated: Jan 2018)}

 \begin{document}
 \maketitle
@@ -103,7 +97,7 @@ backup sources can be automatically discovered. However these file
 sharing tools today are not supported by Web browsers, do not have good
 privacy guarantees, and do not provide a mechanism for updating files
 without redistributing a new dataset which could mean entirely
-redownloading data you already have.
+re-downloading data you already have.

 \section{2. Dat}\label{dat}

@@ -114,7 +108,7 @@ reference implementation is available from npm as

 The protocol is agnostic to the underlying transport e.g.~you could
 implement Dat over carrier pigeon. Data is stored in a format called
-SLEEP (Ogden and Buus 2017), described in it's own paper. The key
+SLEEP (Ogden and Buus 2017), described in its own paper. The key
 properties of the Dat design are explained in this section.

 \begin{itemize}
@@ -319,7 +313,7 @@ sources to try and contact. Dat uses either TCP, HTTP or
 \href{https://en.wikipedia.org/wiki/Micro_Transport_Protocol}{UTP}
 (Rossi et al. 2010). UTP uses LEDBAT which is designed to not take up
 all available bandwidth on a network (e.g.~so that other people sharing
-wifi can still use the Internet), and is still based on UDP so works
+WiFi can still use the Internet), and is still based on UDP so works
 with NAT traversal techniques like UDP hole punching. HTTP is supported
 for compatibility with static file servers and web browser clients.
 Note that these are the protocols we support in the reference Dat
@@ -495,16 +489,16 @@ For example a register with two data entries would look something like
 this (pseudocode):

 \begin{verbatim}
-0. hash(value0)
+0. hash(chunk0)
 1. hash(hash(chunk0) + hash(chunk1))
-2. hash(value1)
+2. hash(chunk1)
 \end{verbatim}

 It is possible for the in-order Merkle tree to have multiple roots at
 once. A root is defined as a parent node with a full set of child node
 slots filled below it.

-For example, this tree hash 2 roots (1 and 4)
+For example, this tree has 2 roots (1 and 4)

 \begin{verbatim}
 0
@@ -514,7 +508,7 @@ For example, this tree hash 2 roots (1 and 4)
 4
 \end{verbatim}

-This tree hash one root (3):
+This tree has one root (3):

 \begin{verbatim}
 0
@@ -560,7 +554,7 @@ list like this:
 bat-1
 bat-2
 bat-3
-cat-1
+cat-1
 cat-2
 cat-3
 \end{verbatim}
@@ -702,7 +696,7 @@ matching metadata entry. This is the un-optimized resolution that uses
 having Alice send additional sequence numbers to Bob that help him
 traverse in less round trips.

-In the metadata record Bob recieved for \texttt{cat\_dna.csv} there is
+In the metadata record Bob received for \texttt{cat\_dna.csv} there is
 the byte offset to the beginning of the file in the data feed. Bob adds
 his +30MB offset to this value and starts requesting pieces of data
 starting at that byte offset using the SLEEP protocol as described
@@ -712,16 +706,6 @@ This method tries to allow any byte range of any file to be accessed
 without the need to synchronize the full metadata for all files up
 front.

-\subsubsection{Scenario: Syncing live changes to files at a specific
-path}\label{scenario-syncing-live-changes-to-files-at-a-specific-path}
-
-TODO
-
-\subsubsection{Scenario: Syncing an entire
-archive}\label{scenario-syncing-an-entire-archive}
-
-TODO
-
 \subsection{3. Dat Network Protocol}\label{dat-network-protocol}

 The SLEEP format is designed to allow for sparse replication, meaning
@@ -768,8 +752,8 @@ Type 0. Should be the first message sent on a channel.
   \texttt{discoveryKey} - A BLAKE2b keyed hash of the string `hypercore'
   using the public key of the metadata register as the key.
 \item
-  \texttt{nonce} - 32 bytes of random binary data, used in our
-  encryption scheme
+  \texttt{nonce} - 24 bytes (192 bits) of random binary data, used in
+  our encryption scheme
 \end{itemize}

 \begin{verbatim}
@@ -1006,7 +990,7 @@ message Data {
   optional bytes value = 2;
   repeated Node nodes = 3;
   optional bytes signature = 4;
-
+
   message Node {
     required uint64 index = 1;
     required bytes hash = 2;
@@ -1015,45 +999,7 @@ message Data {
 }
 \end{verbatim}

-\section{4. Multi-Writer}\label{multi-writer}
-
-The design of Dat up to this point assumes you have a single keyholder
-writing and signing data and appending it to the metadata and content
-feed. However having the ability for multiple keyholders to be able to
-write to a single repository allows for many interesting use cases such
-as forking and collaborative workflows.
-
-In order to do this, we use one \texttt{metadata.data} feed for each
-writer. Each writer kets their own keypair. Each writer is responsible
-for storing their private key. To add a new writer to your feed, you
-include their key in a metadata feed entry.
-
-For example, if Alice wants to add Bob to have write access to a Dat
-repository, Alice would take Bob's public key and writes it to the
-`local' metadata feed (the feed that Alice owns, e.g.~the original
-feed). Now anyone else who replicates from Alice will find Bob's key in
-the history. If in the future Bob distributes a version of the Dat that
-he added new data to, everyone who has a copy of the Dat from Alice will
-have a copy of Bob's key that they can use to verify that Bob's writes
-are valid.
-
-On disk, each users feed is stored in a separate hyperdrive. The
-original hyperdrive (owned by Alice) is called the `local' hyperdrive.
-Bob's hyperdrive would be stored separately in the SLEEP folder
-addressed by Bob's public key.
-
-In case Bob and Alice write different values for the same file (e.g.~Bob
-creates a ``fork''), when they sync up with each other replication will
-still work, but for the forked value the Dat client will return an array
-of values for that key instead of just one value. The values are linked
-to the writer that wrote them, so in the case of receiving multiple
-values, clients can choose to choose the value from Alice, or Bob, or
-the latest value, or whatever other strategy they prefer.
-
-If a writer updates the value of a forked key with new value they are
-performing a merge.
-
-\section{5. Existing Work}\label{existing-work}
+\section{4. Existing Work}\label{existing-work}

 Dat is inspired by a number of features from existing systems.

@@ -1208,7 +1154,7 @@ public ledger. Any client or service provider can verify if a
 certificate they received is in the ledger, which protects against so
 called ``rogue certificates''.

-\section{6. Reference Implementation}\label{reference-implementation}
+\section{5. Reference Implementation}\label{reference-implementation}

 The connection logic is implemented in a module called
 \href{https://www.npmjs.com/package/discovery-swarm}{discovery-swarm}.
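Reviewer note on the Merkle tree hunks in `papers/dat-paper.latex`: the corrected pseudocode (leaf hashes at indices 0 and 2, their parent at index 1) and the two root examples (roots 1 and 4 for three entries, root 3 for four entries) can be checked with a small Python sketch. This is only an illustration under stated assumptions: `node_hash` uses plain BLAKE2b over concatenated bytes, as the pseudocode suggests, and does not claim to match the reference implementation's exact hashing. The names `node_hash`, `two_entry_register`, and `full_roots` are hypothetical.

```python
import hashlib


def node_hash(data: bytes) -> bytes:
    # Assumption: plain 32-byte BLAKE2b digest; the real protocol may add
    # type prefixes or lengths that this sketch leaves out.
    return hashlib.blake2b(data, digest_size=32).digest()


def two_entry_register(chunk0: bytes, chunk1: bytes) -> list:
    """Return the in-order node list [0, 1, 2] from the pseudocode."""
    leaf0 = node_hash(chunk0)          # 0. hash(chunk0)
    leaf1 = node_hash(chunk1)          # 2. hash(chunk1)
    parent = node_hash(leaf0 + leaf1)  # 1. hash(hash(chunk0) + hash(chunk1))
    return [leaf0, parent, leaf1]


def full_roots(leaf_count: int) -> list:
    """In-order indices of the roots for a register with leaf_count entries.

    Leaves sit at even indices (0, 2, 4, ...). Each root covers a
    power-of-two run of leaves, taken largest first.
    """
    roots = []
    start = 0  # number of leaves already covered by earlier roots
    while leaf_count > 0:
        size = 1 << (leaf_count.bit_length() - 1)  # largest power of two that fits
        roots.append(2 * start + size - 1)
        start += size
        leaf_count -= size
    return roots
```

With these definitions, `full_roots(3)` and `full_roots(4)` reproduce the two examples in the corrected text.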
diff --git a/papers/dat-paper.pdf b/papers/dat-paper.pdf
Binary files differ
index 5a7e758..500427e 100644
--- a/papers/dat-paper.pdf
+++ b/papers/dat-paper.pdf
diff --git a/papers/sleep.txt b/papers/sleep.latex
index 241347f..cd695d0 100644
--- a/papers/sleep.txt
+++ b/papers/sleep.latex
@@ -48,12 +48,6 @@
 \renewcommand{\subparagraph}[1]{\oldsubparagraph{#1}\mbox{}}
 \fi

-% set default figure placement to htbp
-\makeatletter
-\def\fps@figure{htbp}
-\makeatother
-
-
 \title{SLEEP - Syncable Ledger of Exact Events Protocol}
 \author{Mathias Buus Madsen, Maxwell Ogden, Code for Science}
 \date{August 2017}
@@ -138,8 +132,8 @@ SLEEP files are laid out like this:
 \item
   32 byte header
 \item
-  4 bytes - magic byte (value varies depending on which file, used to
-  quickly identify which file type it is)
+  4 bytes Uint32BE (``Big-Endian'') - magic byte (value varies depending
+  on which file, used to quickly identify which file type it is)
 \item
   1 byte - version number of the file header protocol, current version
   is 0
@@ -149,11 +143,10 @@ SLEEP files are laid out like this:
 \item
   1 byte - length prefix for body
 \item
-  rest of 32 byte header - string describing key algorithm (in dat
-  `ed25519'). length of this string matches the length in the previous
-  length prefix field. This string must fit within the 32 byte header
-  limitation (24 bytes reserved for string). Unused bytes should be
-  filled with zeroes.
+  rest of 32 byte header - string describing key or hash algorithm.
+  length of this string matches the length in the previous length prefix
+  field. This string must fit within the 32 byte header limitation (24
+  bytes reserved for string). Unused bytes should be filled with zeroes.
 \end{itemize}

 Possible values in the Dat implementation for the body field are:
@@ -365,9 +358,9 @@ random access regions of files in sparse replication scenarios.
   byte range.
 \item
   The chunk described by this child node will contain the byte range you
-  are looking for. You can use the \texttt{byteOffset} property in the
-  \texttt{Stat} metadata object to seek into the right position in the
-  content for the start of this chunk.
+  are looking for. You can use the \texttt{byteOffset} field in the
+  \texttt{Stat} metadata object to seek to the correct position in the
+  content file for the start of this chunk.
 \end{itemize}

 \subparagraph{Metadata Overhead}\label{metadata-overhead}
@@ -515,10 +508,9 @@ are all uniform (\texttt{{[}1,1{]}})
 6 - [11 11 11 11]
 \end{verbatim}

-Using this scheme, to represent 32 bytes of data it takes at most 8
-bytes of Index. In this example it compresses nicely as its all
-contiguous ones on disk, similarly for an empty bitfield it would be all
-zeroes.
+Using this scheme, it takes at most 8 bytes of Index to represent 32
+bytes of data. In this example the Index can compresses well because it
+consists of all one bits. Similarly, an empty bitfield is all zero bits.

 If you write 4GB of data using on average 64KB data chunk size, your
 bitfield will be at most 32KB.
@@ -747,6 +739,9 @@ These are the field definitions:
   \texttt{mtime} - POSIX created\_at time
 \end{itemize}

+\subsection*{References}\label{references}
+\addcontentsline{toc}{subsection}{References}
+
 \hypertarget{refs}{}
 \hypertarget{ref-varda2008protocol}{}
 Varda, Kenton. 2008. ``Protocol Buffers: Google's Data Interchange
diff --git a/papers/sleep.pdf b/papers/sleep.pdf
Binary files differ
index a4281e8..1c59c91 100644
--- a/papers/sleep.pdf
+++ b/papers/sleep.pdf
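Reviewer note on the SLEEP header hunks in `papers/sleep.latex`: the revised wording (4-byte big-endian magic value, 1-byte version, 1-byte length prefix, and a zero-padded algorithm name in the 24 reserved bytes) can be exercised with a short Python sketch. Two things here are assumptions, not taken from the diff. First, the two header bytes between the version and the length prefix are not visible in these hunks; the sketch treats them as a big-endian 16-bit entry size. Second, the magic value `0x01020304` in the usage line is a placeholder, not a real SLEEP file type. `pack_header` and `parse_header` are hypothetical helper names.

```python
import struct

HEADER_SIZE = 32  # total header length in bytes
NAME_AREA = 24    # bytes reserved for the algorithm name string
FIXED = ">IBHB"   # magic (Uint32BE), version, entry size (assumed Uint16BE), name length


def pack_header(magic: int, version: int, entry_size: int, name: str) -> bytes:
    """Build a 32-byte header; unused name bytes are filled with zeroes."""
    raw = name.encode("ascii")
    if len(raw) > NAME_AREA:
        raise ValueError("algorithm name must fit in 24 bytes")
    fixed = struct.pack(FIXED, magic, version, entry_size, len(raw))
    return fixed + raw.ljust(NAME_AREA, b"\x00")


def parse_header(buf: bytes) -> dict:
    """Split the first 32 bytes of a file into the header fields."""
    if len(buf) < HEADER_SIZE:
        raise ValueError("header is 32 bytes long")
    magic, version, entry_size, name_len = struct.unpack(FIXED, buf[:8])
    if name_len > NAME_AREA:
        raise ValueError("length prefix exceeds the 24 reserved bytes")
    name = buf[8:8 + name_len].decode("ascii")
    return {"magic": magic, "version": version,
            "entry_size": entry_size, "name": name}
```

For example, `parse_header(pack_header(0x01020304, 0, 40, "ed25519"))` round-trips the four fields, and the packed header is always exactly 32 bytes.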