From 487549daa72c8eb433f2f161def811df44347ba2 Mon Sep 17 00:00:00 2001
From: Bryan Newbold
Date: Wed, 10 Jan 2018 20:18:22 -0800
Subject: papers: re-build PDFs

---
 papers/dat-paper.latex | 1228 ++++++++++++++++++++++++++++++++++++++++++++++
 papers/dat-paper.pdf   |  Bin 250116 -> 248958 bytes
 papers/dat-paper.txt   | 1282 ------------------------------------------------
 papers/sleep.latex     |  751 ++++++++++++++++++++++++++++
 papers/sleep.pdf       |  Bin 202643 -> 207531 bytes
 papers/sleep.txt       |  756 ----------------------------
 6 files changed, 1979 insertions(+), 2038 deletions(-)
 create mode 100644 papers/dat-paper.latex
 delete mode 100644 papers/dat-paper.txt
 create mode 100644 papers/sleep.latex
 delete mode 100644 papers/sleep.txt

diff --git a/papers/dat-paper.latex b/papers/dat-paper.latex
new file mode 100644
index 0000000..5d90b07
--- /dev/null
+++ b/papers/dat-paper.latex
@@ -0,0 +1,1228 @@
+\documentclass[a4paper,twocolumn]{article}
+\usepackage{lmodern}
+\usepackage{amssymb,amsmath}
+\usepackage{ifxetex,ifluatex}
+\usepackage{fixltx2e} % provides \textsubscript
+\ifnum 0\ifxetex 1\fi\ifluatex 1\fi=0 % if pdftex
+  \usepackage[T1]{fontenc}
+  \usepackage[utf8]{inputenc}
+\else % if luatex or xelatex
+  \ifxetex
+    \usepackage{mathspec}
+  \else
+    \usepackage{fontspec}
+  \fi
+  \defaultfontfeatures{Ligatures=TeX,Scale=MatchLowercase}
+\fi
+% use upquote if available, for straight quotes in verbatim environments
+\IfFileExists{upquote.sty}{\usepackage{upquote}}{}
+% use microtype if available
+\IfFileExists{microtype.sty}{%
+\usepackage{microtype}
+\UseMicrotypeSet[protrusion]{basicmath} % disable protrusion for tt fonts
+}{}
+\usepackage[unicode=true]{hyperref}
+\hypersetup{
+  pdftitle={Dat - Distributed Dataset Synchronization And Versioning},
+  pdfauthor={Maxwell Ogden, Karissa McKelvey, Mathias Buus Madsen, Code for Science},
+  pdfborder={0 0 0},
+  breaklinks=true}
+\urlstyle{same} % don't use monospace font for urls
+\IfFileExists{parskip.sty}{%
+\usepackage{parskip}
+}{% else
+\setlength{\parindent}{0pt}
+\setlength{\parskip}{6pt plus 2pt minus 1pt}
+}
+\setlength{\emergencystretch}{3em} % prevent overfull lines
+\providecommand{\tightlist}{%
+  \setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}}
+\setcounter{secnumdepth}{0}
+% Redefines (sub)paragraphs to behave more like sections
+\ifx\paragraph\undefined\else
+\let\oldparagraph\paragraph
+\renewcommand{\paragraph}[1]{\oldparagraph{#1}\mbox{}}
+\fi
+\ifx\subparagraph\undefined\else
+\let\oldsubparagraph\subparagraph
+\renewcommand{\subparagraph}[1]{\oldsubparagraph{#1}\mbox{}}
+\fi
+
+\title{Dat - Distributed Dataset Synchronization And Versioning}
+\author{Maxwell Ogden, Karissa McKelvey, Mathias Buus Madsen, Code for Science}
+\date{May 2017 (last updated: Jan 2018)}
+
+\begin{document}
+\maketitle
+
+\section{Abstract}\label{abstract}
+
+Dat is a protocol designed for syncing folders of data, even if they are
+large or changing constantly. Dat uses a cryptographically secure
+register of changes to prove that the requested data version is
+distributed. A byte range of any file's version can be efficiently
+streamed from a Dat repository over a network connection. Consumers can
+choose to fully or partially replicate the contents of a remote Dat
+repository, and can also subscribe to live changes. To ensure writer and
+reader privacy, Dat uses public key cryptography to encrypt network
+traffic. A group of Dat clients can connect to each other to form a
+public or private decentralized network to exchange data with each
+other. 
A reference implementation is provided in JavaScript.
+
+\section{1. Background}\label{background}
+
+Many datasets are shared online today using HTTP and FTP, which lack
+built-in support for version control or content addressing of data. This
+results in link rot and content drift as files are moved, updated or
+deleted, leading to an alarming rate of disappearing data references in
+areas such as
+\href{http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0115253}{published
+scientific literature}.
+
+Cloud storage services like S3 ensure availability of data, but they
+have a centralized hub-and-spoke networking model and are therefore
+limited by their bandwidth, meaning popular files can become very
+expensive to share. Services like Dropbox and Google Drive provide
+version control and synchronization on top of cloud storage services,
+which fixes many issues with broken links, but these services rely on
+proprietary code and require users to store their data on centralized
+cloud infrastructure, which has implications for cost, transfer speeds,
+vendor lock-in and user privacy.
+
+Distributed file sharing tools can become faster as files become more
+popular, removing the bandwidth bottleneck and making file distribution
+cheaper. They also use link resolution and discovery systems which can
+prevent broken links, meaning that if the original source goes offline,
+other backup sources can be discovered automatically. However, these
+file sharing tools are not supported by Web browsers today, do not have
+good privacy guarantees, and do not provide a mechanism for updating
+files without redistributing a new dataset, which could mean entirely
+re-downloading data you already have.
+
+\section{2. Dat}\label{dat}
+
+Dat is a dataset synchronization protocol that does not assume a dataset
+is static or that the entire dataset will be downloaded. The main
+reference implementation is available from npm as
+\texttt{npm\ install\ dat\ -g}.
+
+The protocol is agnostic to the underlying transport; e.g.~you could
+implement Dat over carrier pigeon. Data is stored in a format called
+SLEEP (Ogden and Buus 2017), described in its own paper. The key
+properties of the Dat design are explained in this section.
+
+\begin{itemize}
+\tightlist
+\item
+  2.1 \textbf{Content Integrity} - Data and publisher integrity is
+  verified through use of signed hashes of the content.
+\item
+  2.2 \textbf{Decentralized Mirroring} - Users sharing the same Dat
+  automatically discover each other and exchange data in a swarm.
+\item
+  2.3 \textbf{Network Privacy} - Dat provides certain privacy guarantees
+  including end-to-end encryption.
+\item
+  2.4 \textbf{Incremental Versioning} - Datasets can be efficiently
+  synced, even in real time, to other peers.
+\item
+  2.5 \textbf{Random Access} - Huge file hierarchies can be efficiently
+  traversed remotely.
+\end{itemize}
+
+\subsection{2.1 Content Integrity}\label{content-integrity}
+
+Content integrity means being able to verify that the data you received
+is the exact same version of the data that you expected. This is
+important in a distributed system as this mechanism will catch incorrect
+data sent by bad peers. It also has implications for reproducibility as
+it lets you refer to a specific version of a dataset.
+
+Link rot, when links online stop resolving, and content drift, when data
+changes but the link to the data remains the same, are two common issues
+in data analysis. 
For example, one day a file called data.zip might +change, but a typical HTTP link to the file does not include a hash of +the content, or provide a way to get updated metadata, so clients that +only have the HTTP link have no way to check if the file changed without +downloading the entire file again. Referring to a file by the hash of +its content is called content addressability, and lets users not only +verify that the data they receive is the version of the data they want, +but also lets people cite specific versions of the data by referring to +a specific hash. + +Dat uses BLAKE2b (Aumasson et al. 2013) cryptographically secure hashes +to address content. Hashes are arranged in a Merkle tree (Mykletun, +Narasimha, and Tsudik 2003), a tree where each non-leaf node is the hash +of all child nodes. Leaf nodes contain pieces of the dataset. Due to the +properties of secure cryptographic hashes the top hash can only be +produced if all data below it matches exactly. If two trees have +matching top hashes then you know that all other nodes in the tree must +match as well, and you can conclude that your dataset is synchronized. +Trees are chosen as the primary data structure in Dat as they have a +number of properties that allow for efficient access to subsets of the +metadata, which allows Dat to work efficiently over a network +connection. + +\subsubsection{Dat Links}\label{dat-links} + +Dat links are Ed25519 (Bernstein et al. 2012) public keys which have a +length of 32 bytes (64 characters when Hex encoded). You can represent +your Dat link in the following ways and Dat clients will be able to +understand them: + +\begin{itemize} +\tightlist +\item + The standalone public key: +\end{itemize} + +\texttt{8e1c7189b1b2dbb5c4ec2693787884771201da9...} + +\begin{itemize} +\tightlist +\item + Using the dat:// protocol: +\end{itemize} + +\texttt{dat://8e1c7189b1b2dbb5c4ec2693787884771...} + +\begin{itemize} +\tightlist +\item + As part of an HTTP URL: +\end{itemize} + +\texttt{https://datproject.org/8e1c7189b1b2dbb5...} + +All messages in the Dat protocol are encrypted and signed using the +public key during transport. This means that unless you know the public +key (e.g.~unless the Dat link was shared with you) then you will not be +able to discover or communicate with any member of the swarm for that +Dat. Anyone with the public key can verify that messages (such as +entries in a Dat Stream) were created by a holder of the private key. + +Every Dat repository has a corresponding private key which is kept in +your home folder and never shared. Dat never exposes either the public +or private key over the network. During the discovery phase the BLAKE2b +hash of the public key is used as the discovery key. This means that the +original key is impossible to discover (unless it was shared publicly +through a separate channel) since only the hash of the key is exposed +publicly. + +Dat does not provide an authentication mechanism at this time. Instead +it provides a capability system. Anyone with the Dat link is currently +considered able to discover and access data. Do not share your Dat links +publicly if you do not want them to be accessed. + +\subsubsection{Hypercore and Hyperdrive}\label{hypercore-and-hyperdrive} + +The Dat storage, content integrity, and networking protocols are +implemented in a module called +\href{https://npmjs.org/hypercore}{Hypercore}. Hypercore is agnostic to +the format of the input data, it operates on any stream of binary data. 
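+
+To make this layering concrete, here is a minimal sketch of appending
+to and reading back from a Hypercore feed in JavaScript. It assumes the
+\texttt{hypercore} npm module's 2017-era callback API; exact signatures
+vary across versions, so treat this as illustrative rather than
+definitive:
+
+\begin{verbatim}
+// Sketch only: assumes the v6-era hypercore API.
+var hypercore = require('hypercore')
+
+// Creates (or opens) a feed whose SLEEP files live in ./my-feed.
+var feed = hypercore('./my-feed')
+
+feed.append('hello distributed world', function (err) {
+  if (err) throw err
+  // Chunks are addressed by index; this reads chunk 0 back out.
+  feed.get(0, function (err, data) {
+    if (err) throw err
+    console.log(data.toString())
+  })
+})
+\end{verbatim}
+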
+For the Dat use case of synchronizing datasets we use a file system +module on top of Hypercore called +\href{https://npmjs.org/hyperdrive}{Hyperdrive}. + +Dat has a layered abstraction so that users can use Hypercore directly +to have full control over how they model their data. Hyperdrive works +well when your data can be represented as files on a filesystem, which +is the main use case with Dat. + +\subsubsection{Hypercore Registers}\label{hypercore-registers} + +Hypercore Registers are the core mechanism used in Dat. They are binary +append-only streams whose contents are cryptographically hashed and +signed and therefore can be verified by anyone with access to the public +key of the writer. They are an implementation of the concept known as a +register, a digital ledger you can trust. + +Dat uses two registers, \texttt{content} and \texttt{metadata}. The +\texttt{content} register contains the files in your repository and +\texttt{metadata} contains the metadata about the files including name, +size, last modified time, etc. Dat replicates them both when +synchronizing with another peer. + +When files are added to Dat, each file gets split up into some number of +chunks, and the chunks are then arranged into a Merkle tree, which is +used later for version control and replication processes. + +\subsection{2.2 Decentralized Mirroring}\label{decentralized-mirroring} + +Dat is a peer to peer protocol designed to exchange pieces of a dataset +amongst a swarm of peers. As soon as a peer acquires their first piece +of data in the dataset they can choose to become a partial mirror for +the dataset. If someone else contacts them and needs the piece they +have, they can choose to share it. This can happen simultaneously while +the peer is still downloading the pieces they want from others. + +\subsubsection{Source Discovery}\label{source-discovery} + +An important aspect of mirroring is source discovery, the techniques +that peers use to find each other. Source discovery means finding the IP +and port of data sources online that have a copy of that data you are +looking for. You can then connect to them and begin exchanging data. By +using source discovery techniques Dat is able to create a network where +data can be discovered even if the original data source disappears. + +Source discovery can happen over many kinds of networks, as long as you +can model the following actions: + +\begin{itemize} +\tightlist +\item + \texttt{join(key,\ {[}port{]})} - Begin performing regular lookups on + an interval for \texttt{key}. Specify \texttt{port} if you want to + announce that you share \texttt{key} as well. +\item + \texttt{leave(key,\ {[}port{]})} - Stop looking for \texttt{key}. + Specify \texttt{port} to stop announcing that you share \texttt{key} + as well. +\item + \texttt{foundpeer(key,\ ip,\ port)} - Called when a peer is found by a + lookup. +\end{itemize} + +In the Dat implementation we implement the above actions on top of three +types of discovery networks: + +\begin{itemize} +\tightlist +\item + DNS name servers - An Internet standard mechanism for resolving keys + to addresses +\item + Multicast DNS - Useful for discovering peers on local networks +\item + Kademlia Mainline Distributed Hash Table - Less central points of + failure, increases probability of Dat working even if DNS servers are + unreachable +\end{itemize} + +Additional discovery networks can be implemented as needed. 
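+
+As an illustration of the three actions above, the discovery layer can
+be driven through a small EventEmitter-style JavaScript interface. The
+sketch below is modeled on the discovery-channel module; the event and
+parameter names are illustrative assumptions, not a definitive API:
+
+\begin{verbatim}
+// Sketch only: a discovery-channel-like API (assumption).
+var discovery = require('discovery-channel')()
+
+// In Dat the lookup key is the BLAKE2b hash of the public key.
+var discoveryKey = Buffer.from('a1b2c3d4', 'hex') // normally 32 bytes
+
+// join(key, [port]): look up peers for the key on an interval,
+// and announce that we share it on port 3282.
+discovery.join(discoveryKey, 3282)
+
+// foundpeer(key, ip, port) surfaces as a 'peer' event here.
+discovery.on('peer', function (key, peer) {
+  console.log('found peer', peer.host, peer.port)
+})
+
+// leave(key, [port]): stop lookups and announcements.
+discovery.leave(discoveryKey, 3282)
+\end{verbatim}
+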
We chose the +above three as a starting point to have a complementary mix of +strategies to increase the probability of source discovery. Additionally +you can specify a Dat via HTTPS link, which runs the Dat protocol in +``single-source'' mode, meaning the above discovery networks are not +used, and instead only that one HTTPS server is used as the only peer. + +\subsubsection{Peer Connections}\label{peer-connections} + +After the discovery phase, Dat should have a list of potential data +sources to try and contact. Dat uses either TCP, HTTP or +\href{https://en.wikipedia.org/wiki/Micro_Transport_Protocol}{UTP} +(Rossi et al. 2010). UTP uses LEDBAT which is designed to not take up +all available bandwidth on a network (e.g.~so that other people sharing +WiFi can still use the Internet), and is still based on UDP so works +with NAT traversal techniques like UDP hole punching. HTTP is supported +for compatibility with static file servers and web browser clients. Note +that these are the protocols we support in the reference Dat +implementation, but the Dat protocol itself is transport agnostic. + +If an HTTP source is specified Dat will prefer that one over other +sources. Otherwise when Dat gets the IP and port for a potential TCP or +UTP source it tries to connect using both protocols. If one connects +first, Dat aborts the other one. If none connect, Dat will try again +until it decides that source is offline or unavailable and then stops +trying to connect to them. Sources Dat is able to connect to go into a +list of known good sources, so that if/when the Internet connection goes +down Dat can use that list to reconnect to known good sources again +quickly. + +If Dat gets a lot of potential sources it picks a handful at random to +try and connect to and keeps the rest around as additional sources to +use later in case it decides it needs more sources. + +Once a duplex binary connection to a remote source is open Dat then +layers on the Hypercore protocol, a message-based replication protocol +that allows two peers to communicate over a stateless channel to request +and exchange data. You open separate replication channels with many +peers at once which allows clients to parallelize data requests across +the entire pool of peers they have established connections with. + +\subsection{2.3 Network Privacy}\label{network-privacy} + +On the Web today, with SSL, there is a guarantee that the traffic +between your computer and the server is private. As long as you trust +the server to not leak your logs, attackers who intercept your network +traffic will not be able to read the HTTP traffic exchanged between you +and the server. This is a fairly straightforward model as clients only +have to trust a single server for some domain. + +There is an inherent tradeoff in peer to peer systems of source +discovery vs.~user privacy. The more sources you contact and ask for +some data, the more sources you trust to keep what you asked for +private. Our goal is to have Dat be configurable in respect to this +tradeoff to allow application developers to meet their own privacy +guidelines. + +It is up to client programs to make design decisions around which +discovery networks they trust. 
For example, if a Dat client decides to
+use the BitTorrent DHT to discover peers, and is searching for a
+publicly shared Dat key (e.g.~a key cited publicly in a published
+scientific paper) with known contents, then because of the privacy
+design of the BitTorrent DHT it becomes public knowledge what key that
+client is searching for.
+
+A client could choose to only use discovery networks with certain
+privacy guarantees. For example, a client could only connect to an
+approved list of sources that it trusts, similar to SSL. As long as
+it trusts each source, the encryption built into the Dat network
+protocol will prevent the Dat key it is looking for from being
+leaked.
+
+\subsection{2.4 Incremental Versioning}\label{incremental-versioning}
+
+Given a stream of binary data, Dat splits the stream into chunks, hashes
+each chunk, and arranges the hashes in a specific type of Merkle tree
+that allows for certain replication properties.
+
+Dat is also able to fully or partially synchronize streams in a
+distributed setting even if the stream is being appended to. This is
+accomplished by using the messaging protocol to traverse the Merkle tree
+of remote sources and fetch a strategic set of nodes. Due to the
+low-level, message-oriented design of the replication protocol,
+different node traversal strategies can be implemented.
+
+There are two types of versioning performed automatically by Dat.
+Metadata is stored in a folder called \texttt{.dat} in the root folder
+of a repository, and data is stored as normal files in the root folder.
+
+\subsubsection{Metadata Versioning}\label{metadata-versioning}
+
+Dat tries as much as possible to act as a one-to-one mirror of the state
+of a folder and all its contents. When importing files, Dat uses a
+sorted, depth-first recursion to list all the files in the tree. For
+each file it finds, it grabs the filesystem metadata (filename, Stat
+object, etc.) and checks whether an entry for this filename with this
+exact metadata is already represented in the Dat repository metadata.
+If this metadata exactly matches the newest version of the file
+metadata stored in Dat, then the file is skipped (no change).
+
+If the metadata differs from the current existing one (or there are no
+entries for this filename at all in the history), then this new metadata
+entry will be appended as the new `latest' version for this file in the
+append-only SLEEP metadata content register (described below).
+
+\subsubsection{Content Versioning}\label{content-versioning}
+
+In addition to storing a historical record of filesystem metadata, the
+contents of the files themselves can also be stored in a
+version-controlled manner. The default storage system used in Dat stores
+the files as files. This has the advantage of being very straightforward
+for users to understand, but the downside of not storing old versions of
+content by default.
+
+In contrast to other version control systems like Git, Dat by default
+only stores the current set of checked out files on disk in the
+repository folder, not old versions. It does store all previous metadata
+for old versions in \texttt{.dat}. Git for example stores all previous
+content versions and all previous metadata versions in the \texttt{.git}
+folder. Because Dat is designed for larger datasets, if it stored all
+previous file versions in \texttt{.dat}, then the \texttt{.dat} folder
+could easily fill up the user's hard drive inadvertently. 
Therefore Dat +has multiple storage modes based on usage. + +Hypercore registers include an optional \texttt{data} file that stores +all chunks of data. In Dat, only the \texttt{metadata.data} file is +used, but the \texttt{content.data} file is not used. The default +behavior is to store the current files only as normal files. If you want +to run an `archival' node that keeps all previous versions, you can +configure Dat to use the \texttt{content.data} file instead. For +example, on a shared server with lots of storage you probably want to +store all versions. However on a workstation machine that is only +accessing a subset of one version, the default mode of storing all +metadata plus the current set of downloaded files is acceptable, because +you know the server has the full history. + +\subsubsection{Merkle Trees}\label{merkle-trees} + +Registers in Dat use a specific method of encoding a Merkle tree where +hashes are positioned by a scheme called binary in-order interval +numbering or just ``bin'' numbering. This is just a specific, +deterministic way of laying out the nodes in a tree. For example a tree +with 7 nodes will always be arranged like this: + +\begin{verbatim} +0 + 1 +2 + 3 +4 + 5 +6 +\end{verbatim} + +In Dat, the hashes of the chunks of files are always even numbers, at +the wide end of the tree. So the above tree had four original values +that become the even numbers: + +\begin{verbatim} +chunk0 -> 0 +chunk1 -> 2 +chunk2 -> 4 +chunk3 -> 6 +\end{verbatim} + +In the resulting Merkle tree, the even and odd nodes store different +information: + +\begin{itemize} +\tightlist +\item + Evens - List of data hashes {[}chunk0, chunk1, chunk2, \ldots{}{]} +\item + Odds - List of Merkle hashes (hashes of child even nodes) {[}hash0, + hash1, hash2, \ldots{}{]} +\end{itemize} + +These two lists get interleaved into a single register such that the +indexes (position) in the register are the same as the bin numbers from +the Merkle tree. + +All odd hashes are derived by hashing the two child nodes, e.g.~given +hash0 is \texttt{hash(chunk0)} and hash2 is \texttt{hash(chunk1)}, hash1 +is \texttt{hash(hash0\ +\ hash2)}. + +For example a register with two data entries would look something like +this (pseudocode): + +\begin{verbatim} +0. hash(chunk0) +1. hash(hash(chunk0) + hash(chunk1)) +2. hash(chunk1) +\end{verbatim} + +It is possible for the in-order Merkle tree to have multiple roots at +once. A root is defined as a parent node with a full set of child node +slots filled below it. + +For example, this tree has 2 roots (1 and 4) + +\begin{verbatim} +0 + 1 +2 + +4 +\end{verbatim} + +This tree has one root (3): + +\begin{verbatim} +0 + 1 +2 + 3 +4 + 5 +6 +\end{verbatim} + +This one has one root (1): + +\begin{verbatim} +0 + 1 +2 +\end{verbatim} + +\subsubsection{Replication Example}\label{replication-example} + +This section describes in high level the replication flow of a Dat. Note +that the low level details are available by reading the SLEEP section +below. For the sake of illustrating how this works in practice in a +networked replication scenario, consider a folder with two files: + +\begin{verbatim} +bat.jpg +cat.jpg +\end{verbatim} + +To send these files to another machine using Dat, you would first add +them to a Dat repository by splitting them into chunks and constructing +SLEEP files representing the chunks and filesystem metadata. + +Let's assume \texttt{bat.jpg} and \texttt{cat.jpg} both produce three +chunks, each around 64KB. 
Dat stores these in a representation called SLEEP,
+but here we will show a pseudo-representation for the purposes of
+illustrating the replication process. The six chunks get sorted into a
+list like this:
+
+\begin{verbatim}
+bat-1
+bat-2
+bat-3
+cat-1
+cat-2
+cat-3
+\end{verbatim}
+
+These chunks then each get hashed, and the hashes get arranged into a
+Merkle tree (the content register):
+
+\begin{verbatim}
+0  - hash(bat-1)
+  1  - hash(0 + 2)
+2  - hash(bat-2)
+    3  - hash(1 + 5)
+4  - hash(bat-3)
+  5  - hash(4 + 6)
+6  - hash(cat-1)
+8  - hash(cat-2)
+  9  - hash(8 + 10)
+10 - hash(cat-3)
+\end{verbatim}
+
+Next we calculate the root hashes of our tree, in this case 3 and 9. We
+then hash them together, and cryptographically sign the hash. This
+signed hash can now be used to verify all nodes in the tree, and the
+signature proves it was produced by us, the holder of the private key
+for this Dat.
+
+This tree is for the hashes of the contents of the photos. There is also
+a second Merkle tree that Dat generates that represents the list of
+files and their metadata and looks something like this (the metadata
+register):
+
+\begin{verbatim}
+0 - hash({contentRegister: '9e29d624...'})
+  1 - hash(0 + 2)
+2 - hash({"bat.jpg", first: 0, length: 3})
+4 - hash({"cat.jpg", first: 3, length: 3})
+\end{verbatim}
+
+The first entry in this feed is a special metadata entry that tells Dat
+the address of the second feed (the content register). Note that node 3
+is not included yet, because 3 is the hash of \texttt{1\ +\ 5}, but 5
+does not exist yet, so it will be written at a later update.
+
+Now we're ready to send our metadata to the other peer. The first
+message is a \texttt{Register} message with the key that was shared for
+this Dat. Let's call ourselves Alice and the other peer Bob. Alice sends
+Bob a \texttt{Want} message that declares they want all nodes in the
+file list (the metadata register). Bob replies with a single
+\texttt{Have} message that indicates he has 3 chunks of data. Alice
+sends three \texttt{Request} messages, one for each leaf node
+(\texttt{0,\ 2,\ 4}). Bob sends back three \texttt{Data} messages. The
+first \texttt{Data} message contains the content register key, the hash
+of the sibling, in this case node \texttt{2}, the hash of the uncle root
+\texttt{4}, as well as a signature for the root hashes (in this case
+\texttt{1,\ 4}). Alice verifies the integrity of this first
+\texttt{Data} message by hashing the received entry for the content
+register to produce the hash for node \texttt{0}. They then hash the
+hash \texttt{0} with the included hash \texttt{2} to reproduce hash
+\texttt{1}, and hash their \texttt{1} with the received value for
+\texttt{4}; the received signature over these roots verifies that this
+is the same data Bob signed. When the next \texttt{Data} message is
+received, a similar process is performed to verify the content.
+
+Now Alice has the full list of files in the Dat, but decides they only
+want to download \texttt{cat.jpg}. Alice knows they want blocks 3
+through 5 from the content register. First Alice sends another
+\texttt{Register} message with the content key to open a new replication
+channel over the connection. Then Alice sends three \texttt{Request}
+messages, one for each of blocks \texttt{3,\ 4,\ 5}. Bob sends back
+three \texttt{Data} messages with the data for each block, as well as
+the hashes needed to verify the content in a way similar to the process
+described above for the metadata feed. 
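+
+The bin numbering used throughout this example can be computed with a
+few bitwise helpers. The sketch below is our own illustrative code for
+the arithmetic (the reference implementation delegates this to a
+flat-tree module; that mapping is an assumption about its internals):
+
+\begin{verbatim}
+// Sketch only: in-order "bin" numbering arithmetic.
+function depth (i) { // leaves (even bins) have depth 0
+  var d = 0
+  while (i & 1) { i >>= 1; d++ }
+  return d
+}
+function index (depth, offset) { // bin at (depth, offset)
+  return (offset << (depth + 1)) | ((1 << depth) - 1)
+}
+function parent (i) {
+  var d = depth(i)
+  return index(d + 1, (i >> (d + 1)) >> 1)
+}
+function sibling (i) {
+  var d = depth(i)
+  return index(d, (i >> (d + 1)) ^ 1)
+}
+// parent(0) === 1, parent(2) === 1, parent(1) === 3,
+// sibling(8) === 10, parent(8) === 9 -- matching the trees above.
+\end{verbatim}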
+
+\subsection{2.5 Random Access}\label{random-access}
+
+Dat pursues the following access capabilities:
+
+\begin{itemize}
+\tightlist
+\item
+  Support large file hierarchies (millions of files in a single
+  repository).
+\item
+  Support efficient traversal of the hierarchy (listing files in
+  arbitrary folders efficiently).
+\item
+  Store all changes to all files (metadata and/or content).
+\item
+  List all changes made to any single file.
+\item
+  View the state of all files relative to any point in time.
+\item
+  Subscribe live to all changes (any file).
+\item
+  Subscribe live to changes to files under a specific path.
+\item
+  Efficiently access any byte range of any version of any file.
+\item
+  Allow all of the above to happen remotely, only syncing the minimum
+  metadata necessary to perform any action.
+\item
+  Allow efficient comparison of remote and local repository state to
+  request missing pieces during synchronization.
+\item
+  Allow the entire remote archive to be synchronized, or just some
+  subset of files and/or versions.
+\end{itemize}
+
+Dat accomplishes these by storing all changes in Hypercore feeds and by
+maintaining strategic metadata indexes that allow such queries to be
+answered efficiently by traversing the feeds. The protocol itself is
+specified in Section 3 (SLEEP), but a scenario-based summary follows
+here.
+
+\subsubsection{Scenario: Reading a file from a specific byte
+offset}\label{scenario-reading-a-file-from-a-specific-byte-offset}
+
+Alice has a dataset in Dat; Bob wants to access a 100MB CSV called
+\texttt{cat\_dna.csv} stored in the remote repository, but only wants to
+access the 10MB range of the CSV spanning from 30MB to 40MB.
+
+Bob has never communicated with Alice before, and is starting fresh with
+no knowledge of this Dat repository other than that he knows he wants
+\texttt{cat\_dna.csv} at a specific offset.
+
+First, Bob asks Alice through the Dat protocol for the metadata he needs
+to resolve \texttt{cat\_dna.csv} to the correct metadata feed entry that
+represents the file he wants. Note: In this scenario we assume Bob wants
+the latest version of \texttt{cat\_dna.csv}. It is also possible to do
+this for a specific older version.
+
+Bob first sends a \texttt{Request} message for the latest entry in the
+metadata feed. Alice responds. Bob looks at the \texttt{trie} value, and
+using the lookup algorithm described below sends another
+\texttt{Request} message for the metadata node that is closer to the
+filename he is looking for. This repeats until Alice sends Bob the
+matching metadata entry. This is the un-optimized resolution that uses
+\texttt{log(n)} round trips, though there are ways to optimize this by
+having Alice send additional sequence numbers to Bob that help him
+traverse in fewer round trips.
+
+In the metadata record Bob received for \texttt{cat\_dna.csv} there is
+the byte offset to the beginning of the file in the data feed. Bob adds
+his 30MB offset to this value and starts requesting pieces of data
+starting at that byte offset using the SLEEP protocol as described
+below.
+
+This method allows any byte range of any file to be accessed without
+the need to synchronize the full metadata for all files up front.
+
+\section{3. 
Dat Network Protocol}\label{dat-network-protocol}
+
+The SLEEP format is designed to allow for sparse replication, meaning
+you can efficiently download only the metadata and data required to
+resolve a single byte region of a single file, which makes Dat suitable
+for a wide variety of streaming, real-time and large dataset use cases.
+
+To take advantage of this, Dat includes a network protocol. It is
+message-based and stateless, making it possible to implement on a
+variety of network transport protocols including UDP and TCP. Both
+metadata and content registers in SLEEP share the exact same replication
+protocol.
+
+Individual messages are encoded using Protocol Buffers; there are ten
+message types, defined by the schemas below.
+
+\subsubsection{Wire Protocol}\label{wire-protocol}
+
+Over the wire, messages are packed in the following lightweight
+container format:
+
+\begin{verbatim}
+<varint length of rest of message>
+<varint header>
+<message>
+\end{verbatim}
+
+The \texttt{header} value is a single varint that has two pieces of
+information: the integer \texttt{type} that declares a 4-bit message
+type (used below), and a channel identifier, \texttt{0} for metadata and
+\texttt{1} for content.
+
+To generate this varint, you bitshift the 4-bit type integer onto the
+end of the channel identifier, e.g.
+\texttt{channel\ \textless{}\textless{}\ 4\ \textbar{}\ \textless{}4-bit-type\textgreater{}}.
+
+\subsubsection{Feed}\label{feed}
+
+Type 0. Should be the first message sent on a channel.
+
+\begin{itemize}
+\tightlist
+\item
+  \texttt{discoveryKey} - A BLAKE2b keyed hash of the string `hypercore'
+  using the public key of the metadata register as the key.
+\item
+  \texttt{nonce} - 24 bytes (192 bits) of random binary data, used in
+  our encryption scheme
+\end{itemize}
+
+\begin{verbatim}
+message Feed {
+  required bytes discoveryKey = 1;
+  optional bytes nonce = 2;
+}
+\end{verbatim}
+
+\subsubsection{Handshake}\label{handshake}
+
+Type 1. Overall connection handshake. Should be sent just after the feed
+message on the first channel only (metadata).
+
+\begin{itemize}
+\tightlist
+\item
+  \texttt{id} - 32 bytes of random data used as an identifier for this
+  peer on the network, useful for checking if you are connected to
+  yourself or another peer more than once
+\item
+  \texttt{live} - Whether or not you want to operate in live
+  (continuous) replication mode or end after the initial sync
+\item
+  \texttt{userData} - User-specific metadata encoded as a byte sequence
+\item
+  \texttt{extensions} - List of extensions that are supported on this
+  Feed
+\end{itemize}
+
+\begin{verbatim}
+message Handshake {
+  optional bytes id = 1;
+  optional bool live = 2;
+  optional bytes userData = 3;
+  repeated string extensions = 4;
+}
+\end{verbatim}
+
+\subsubsection{Info}\label{info}
+
+Type 2. Message indicating state changes. Used to indicate whether you
+are uploading and/or downloading.
+
+Initial state for uploading/downloading is true. If both ends are not
+downloading and not live it is safe to consider the stream ended.
+
+\begin{verbatim}
+message Info {
+  optional bool uploading = 1;
+  optional bool downloading = 2;
+}
+\end{verbatim}
+
+\subsubsection{Have}\label{have}
+
+Type 3. How you tell the other peer what chunks of data you have or
+don't have. You should only send Have messages to peers who have
+expressed interest in this region with Want messages. 
+ +\begin{itemize} +\tightlist +\item + \texttt{start} - If you only specify \texttt{start}, it means you are + telling the other side you only have 1 chunk at the position at the + value in \texttt{start}. +\item + \texttt{length} - If you specify length, you can describe a range of + values that you have all of, starting from \texttt{start}. +\item + \texttt{bitfield} - If you would like to send a range of sparse data + about haves/don't haves via bitfield, relative to \texttt{start}. +\end{itemize} + +\begin{verbatim} +message Have { + required uint64 start = 1; + optional uint64 length = 2 [default = 1]; + optional bytes bitfield = 3; +} +\end{verbatim} + +When sending bitfields you must run length encode them. The encoded +bitfield is a series of compressed and uncompressed bit sequences. All +sequences start with a header that is a varint. + +If the last bit is set in the varint (it is an odd number) then a header +represents a compressed bit sequence. + +\begin{verbatim} +compressed-sequence = varint( + byte-length-of-sequence + << 2 | bit << 1 | 1 +) +\end{verbatim} + +If the last bit is \emph{not} set then a header represents a +non-compressed sequence. + +\begin{verbatim} +uncompressed-sequence = varint( + byte-length-of-bitfield << 1 | 0 +) + (bitfield) +\end{verbatim} + +\subsubsection{Unhave}\label{unhave} + +Type 4. How you communicate that you deleted or removed a chunk you used +to have. + +\begin{verbatim} +message Unhave { + required uint64 start = 1; + optional uint64 length = 2 [default = 1]; +} +\end{verbatim} + +\subsubsection{Want}\label{want} + +Type 5. How you ask the other peer to subscribe you to Have messages for +a region of chunks. The \texttt{length} value defaults to Infinity or +feed.length (if not live). + +\begin{verbatim} +message Want { + required uint64 start = 1; + optional uint64 length = 2; +} +\end{verbatim} + +\subsubsection{Unwant}\label{unwant} + +Type 6. How you ask to unsubscribe from Have messages for a region of +chunks from the other peer. You should only Unwant previously Wanted +regions, but if you do Unwant something that hasn't been Wanted it won't +have any effect. The \texttt{length} value defaults to Infinity or +feed.length (if not live). + +\begin{verbatim} +message Unwant { + required uint64 start = 1; + optional uint64 length = 2; +} +\end{verbatim} + +\subsubsection{Request}\label{request} + +Type 7. Request a single chunk of data. + +\begin{itemize} +\tightlist +\item + \texttt{index} - The chunk index for the chunk you want. You should + only ask for indexes that you have received the Have messages for. +\item + \texttt{bytes} - You can also optimistically specify a byte offset, + and in the case the remote is able to resolve the chunk for this byte + offset depending on their Merkle tree state, they will ignore the + \texttt{index} and send the chunk that resolves for this byte offset + instead. But if they cannot resolve the byte request, \texttt{index} + will be used. +\item + \texttt{hash} - If you only want the hash of the chunk and not the + chunk data itself. +\item + \texttt{nodes} - A 64 bit long bitfield representing which parent + nodes you have. +\end{itemize} + +The \texttt{nodes} bitfield is an optional optimization to reduce the +amount of duplicate nodes exchanged during the replication lifecycle. It +indicates which parents you have or don't have. You have a maximum of 64 +parents you can specify. 
Because \texttt{uint64} in Protocol Buffers is +implemented as a varint, over the wire this does not take up 64 bits in +most cases. The first bit is reserved to signify whether or not you need +a signature in response. The rest of the bits represent whether or not +you have (\texttt{1}) or don't have (\texttt{0}) the information at this +node already. The ordering is determined by walking parent, sibling up +the tree all the way to the root. + +\begin{verbatim} +message Request { + required uint64 index = 1; + optional uint64 bytes = 2; + optional bool hash = 3; + optional uint64 nodes = 4; +} +\end{verbatim} + +\subsubsection{Cancel}\label{cancel} + +Type 8. Cancel a previous Request message that you haven't received yet. + +\begin{verbatim} +message Cancel { + required uint64 index = 1; + optional uint64 bytes = 2; + optional bool hash = 3; +} +\end{verbatim} + +\subsubsection{Data}\label{data} + +Type 9. Sends a single chunk of data to the other peer. You can send it +in response to a Request or unsolicited on its own as a friendly gift. +The data includes all of the Merkle tree parent nodes needed to verify +the hash chain all the way up to the Merkle roots for this chunk. +Because you can produce the direct parents by hashing the chunk, only +the roots and `uncle' hashes are included (the siblings to all of the +parent nodes). + +\begin{itemize} +\tightlist +\item + \texttt{index} - The chunk position for this chunk. +\item + \texttt{value} - The chunk binary data. Empty if you are sending only + the hash. +\item + \texttt{Node.index} - The index for this chunk in in-order notation +\item + \texttt{Node.hash} - The hash of this chunk +\item + \texttt{Node.size}- The aggregate chunk size for all children below + this node (The sum of all chunk sizes of all children) +\item + \texttt{signature} - If you are sending a root node, all root nodes + must have the signature included. +\end{itemize} + +\begin{verbatim} +message Data { + required uint64 index = 1; + optional bytes value = 2; + repeated Node nodes = 3; + optional bytes signature = 4; + + message Node { + required uint64 index = 1; + required bytes hash = 2; + required uint64 size = 3; + } +} +\end{verbatim} + +\section{4. Existing Work}\label{existing-work} + +Dat is inspired by a number of features from existing systems. + +\subsection{Git}\label{git} + +Git popularized the idea of a directed acyclic graph (DAG) combined with +a Merkle tree, a way to represent changes to data where each change is +addressed by the secure hash of the change plus all ancestor hashes in a +graph. This provides a way to trust data integrity, as the only way a +specific hash could be derived by another peer is if they have the same +data and change history required to reproduce that hash. This is +important for reproducibility as it lets you trust that a specific git +commit hash refers to a specific source code state. + +Decentralized version control tools for source code like Git provide a +protocol for efficiently downloading changes to a set of files, but are +optimized for text files and have issues with large files. Solutions +like Git-LFS solve this by using HTTP to download large files, rather +than the Git protocol. GitHub offers Git-LFS hosting but charges +repository owners for bandwidth on popular files. 
Building a distributed +distribution layer for files in a Git repository is difficult due to +design of Git Packfiles which are delta compressed repository states +that do not easily support random access to byte ranges in previous file +versions. + +\subsection{BitTorrent}\label{bittorrent} + +BitTorrent implements a swarm based file sharing protocol for static +datasets. Data is split into fixed sized chunks, hashed, and then that +hash is used to discover peers that have the same data. An advantage of +using BitTorrent for dataset transfers is that download bandwidth can be +fully saturated. Since the file is split into pieces, and peers can +efficiently discover which pieces each of the peers they are connected +to have, it means one peer can download non-overlapping regions of the +dataset from many peers at the same time in parallel, maximizing network +throughput. + +Fixed sized chunking has drawbacks for data that changes. BitTorrent +assumes all metadata will be transferred up front which makes it +impractical for streaming or updating content. Most BitTorrent clients +divide data into 1024 pieces meaning large datasets could have a very +large chunk size which impacts random access performance (e.g.~for +streaming video). + +Another drawback of BitTorrent is due to the way clients advertise and +discover other peers in absence of any protocol level privacy or trust. +From a user privacy standpoint, BitTorrent leaks what users are +accessing or attempting to access, and does not provide the same +browsing privacy functions as systems like SSL. + +\subsection{Kademlia Distributed Hash +Table}\label{kademlia-distributed-hash-table} + +Kademlia (Maymounkov and Mazieres 2002) is a distributed hash table, a +distributed key/value store that can serve a similar purpose to DNS +servers but has no hard coded server addresses. All clients in Kademlia +are also servers. As long as you know at least one address of another +peer in the network, you can ask them for the key you are trying to find +and they will either have it or give you some other people to talk to +that are more likely to have it. + +If you don't have an initial peer to talk to you, most clients use a +bootstrap server that randomly gives you a peer in the network to start +with. If the bootstrap server goes down, the network still functions as +long as other methods can be used to bootstrap new peers (such as +sending them peer addresses through side channels like how .torrent +files include tracker addresses to try in case Kademlia finds no peers). + +Kademlia is distinct from previous DHT designs due to its simplicity. It +uses a very simple XOR operation between two keys as its ``distance'' +metric to decide which peers are closer to the data being searched for. +On paper it seems like it wouldn't work as it doesn't take into account +things like ping speed or bandwidth. Instead its design is very simple +on purpose to minimize the amount of control/gossip messages and to +minimize the amount of complexity required to implement it. In practice +Kademlia has been extremely successful and is widely deployed as the +``Mainline DHT'' for BitTorrent, with support in all popular BitTorrent +clients today. + +Due to the simplicity in the original Kademlia design a number of +attacks such as DDOS and/or sybil have been demonstrated. There are +protocol extensions (BEPs) which in certain cases mitigate the effects +of these attacks, such as BEP 44 which includes a DDOS mitigation +technique. 
Nonetheless anyone using Kademlia should be aware of the +limitations. + +\subsection{Peer to Peer Streaming Peer Protocol +(PPSPP)}\label{peer-to-peer-streaming-peer-protocol-ppspp} + +PPSPP +(\href{https://datatracker.ietf.org/doc/rfc7574/?include_text=1}{IETF +RFC 7574}, (Bakker, Petrocco, and Grishchenko 2015)) is a protocol for +live streaming content over a peer to peer network. In it they define a +specific type of Merkle Tree that allows for subsets of the hashes to be +requested by a peer in order to reduce the time-till-playback for end +users. BitTorrent for example transfers all hashes up front, which is +not suitable for live streaming. + +Their Merkle trees are ordered using a scheme they call ``bin +numbering'', which is a method for deterministically arranging an +append-only log of leaf nodes into an in-order layout tree where +non-leaf nodes are derived hashes. If you want to verify a specific +node, you only need to request its sibling's hash and all its uncle +hashes. PPSPP is very concerned with reducing round trip time and +time-till-playback by allowing for many kinds of optimizations, such as +to pack as many hashes into datagrams as possible when exchanging tree +information with peers. + +Although PPSPP was designed with streaming video in mind, the ability to +request a subset of metadata from a large and/or streaming dataset is +very desirable for many other types of datasets. + +\subsection{WebTorrent}\label{webtorrent} + +With WebRTC, browsers can now make peer to peer connections directly to +other browsers. BitTorrent uses UDP sockets which aren't available to +browser JavaScript, so can't be used as-is on the Web. + +WebTorrent implements the BitTorrent protocol in JavaScript using WebRTC +as the transport. This includes the BitTorrent block exchange protocol +as well as the tracker protocol implemented in a way that can enable +hybrid nodes, talking simultaneously to both BitTorrent and WebTorrent +swarms (if a client is capable of making both UDP sockets as well as +WebRTC sockets, such as Node.js). Trackers are exposed to web clients +over HTTP or WebSockets. + +\subsection{InterPlanetary File +System}\label{interplanetary-file-system} + +IPFS is a family of application and network protocols that have peer to +peer file sharing and data permanence baked in. IPFS abstracts network +protocols and naming systems to provide an alternative application +delivery platform to today's Web. For example, instead of using HTTP and +DNS directly, in IPFS you would use LibP2P streams and IPNS in order to +gain access to the features of the IPFS platform. + +\subsection{Certificate Transparency/Secure +Registers}\label{certificate-transparencysecure-registers} + +The UK Government Digital Service have developed the concept of a +register which they define as a digital public ledger you can trust. In +the UK government registers are beginning to be piloted as a way to +expose essential open data sets in a way where consumers can verify the +data has not been tampered with, and allows the data publishers to +update their data sets over time. + +The design of registers was inspired by the infrastructure backing the +Certificate Transparency (Laurie, Langley, and Kasper 2013) project, +initiated at Google, which provides a service on top of SSL certificates +that enables service providers to write certificates to a distributed +public ledger. 
Any client or service provider can verify if a
+certificate they received is in the ledger, which protects against
+so-called ``rogue certificates''.
+
+\section{5. Reference Implementation}\label{reference-implementation}
+
+The connection logic is implemented in a module called
+\href{https://www.npmjs.com/package/discovery-swarm}{discovery-swarm}.
+This builds on discovery-channel and adds connection establishment,
+management and statistics. It provides statistics such as how many
+sources are currently connected and how many well-behaved and
+badly-behaved sources have been contacted, and it automatically handles
+connecting and reconnecting to sources. UTP support is implemented in
+the module \href{https://www.npmjs.com/package/utp-native}{utp-native}.
+
+Our implementation of source discovery is called
+\href{https://npmjs.org/discovery-channel}{discovery-channel}. We also
+run a \href{https://www.npmjs.com/package/dns-discovery}{custom DNS
+server} that Dat clients use (in addition to specifying their own if
+they need to), as well as a
+\href{https://github.com/bittorrent/bootstrap-dht}{DHT bootstrap}
+server. These discovery servers are the only centralized infrastructure
+we need for Dat to work over the Internet, but they are redundant and
+interchangeable, never see the actual data being shared, and anyone can
+run their own; Dat will still work even if all of them are unavailable.
+If this happens, discovery will just be manual (e.g.~manually sharing
+IP/ports).
+
+\section{Acknowledgements}\label{acknowledgements}
+
+This work was made possible through grants from the John S. and James L.
+Knight and Alfred P. Sloan Foundations.
+
+\section*{References}\label{references}
+\addcontentsline{toc}{section}{References}
+
+\hypertarget{refs}{}
+\hypertarget{ref-aumasson2013blake2}{}
+Aumasson, Jean-Philippe, Samuel Neves, Zooko Wilcox-O'Hearn, and
+Christian Winnerlein. 2013. ``BLAKE2: Simpler, Smaller, Fast as MD5.''
+In \emph{International Conference on Applied Cryptography and Network
+Security}, 119--35. Springer.
+
+\hypertarget{ref-bakker2015peer}{}
+Bakker, A, R Petrocco, and V Grishchenko. 2015. ``Peer-to-Peer Streaming
+Peer Protocol (PPSPP).''
+
+\hypertarget{ref-bernstein2012high}{}
+Bernstein, Daniel J, Niels Duif, Tanja Lange, Peter Schwabe, and Bo-Yin
+Yang. 2012. ``High-Speed High-Security Signatures.'' \emph{Journal of
+Cryptographic Engineering}. Springer, 1--13.
+
+\hypertarget{ref-laurie2013certificate}{}
+Laurie, Ben, Adam Langley, and Emilia Kasper. 2013. ``Certificate
+Transparency.''
+
+\hypertarget{ref-maymounkov2002kademlia}{}
+Maymounkov, Petar, and David Mazieres. 2002. ``Kademlia: A Peer-to-Peer
+Information System Based on the XOR Metric.'' In \emph{International
+Workshop on Peer-to-Peer Systems}, 53--65. Springer.
+
+\hypertarget{ref-mykletun2003providing}{}
+Mykletun, Einar, Maithili Narasimha, and Gene Tsudik. 2003. ``Providing
+Authentication and Integrity in Outsourced Databases Using Merkle Hash
+Trees.'' \emph{UCI-SCONCE Technical Report}.
+
+\hypertarget{ref-sleep}{}
+Ogden, Maxwell, and Mathias Buus. 2017. ``SLEEP - the Dat Protocol on
+Disk Format.''
+
+\hypertarget{ref-rossi2010ledbat}{}
+Rossi, Dario, Claudio Testa, Silvio Valenti, and Luca Muscariello. 2010.
+``LEDBAT: The New BitTorrent Congestion Control Protocol.'' In
+\emph{ICCCN}, 1--6. 
+ +\end{document} diff --git a/papers/dat-paper.pdf b/papers/dat-paper.pdf index 5a7e758..500427e 100644 Binary files a/papers/dat-paper.pdf and b/papers/dat-paper.pdf differ diff --git a/papers/dat-paper.txt b/papers/dat-paper.txt deleted file mode 100644 index f0e71c2..0000000 --- a/papers/dat-paper.txt +++ /dev/null @@ -1,1282 +0,0 @@ -\documentclass[a4paperpaper,twocolumn]{article} -\usepackage{lmodern} -\usepackage{amssymb,amsmath} -\usepackage{ifxetex,ifluatex} -\usepackage{fixltx2e} % provides \textsubscript -\ifnum 0\ifxetex 1\fi\ifluatex 1\fi=0 % if pdftex - \usepackage[T1]{fontenc} - \usepackage[utf8]{inputenc} -\else % if luatex or xelatex - \ifxetex - \usepackage{mathspec} - \else - \usepackage{fontspec} - \fi - \defaultfontfeatures{Ligatures=TeX,Scale=MatchLowercase} -\fi -% use upquote if available, for straight quotes in verbatim environments -\IfFileExists{upquote.sty}{\usepackage{upquote}}{} -% use microtype if available -\IfFileExists{microtype.sty}{% -\usepackage{microtype} -\UseMicrotypeSet[protrusion]{basicmath} % disable protrusion for tt fonts -}{} -\usepackage[unicode=true]{hyperref} -\hypersetup{ - pdftitle={Dat - Distributed Dataset Synchronization And Versioning}, - pdfauthor={Maxwell Ogden, Karissa McKelvey, Mathias Buus Madsen, Code for Science}, - pdfborder={0 0 0}, - breaklinks=true} -\urlstyle{same} % don't use monospace font for urls -\IfFileExists{parskip.sty}{% -\usepackage{parskip} -}{% else -\setlength{\parindent}{0pt} -\setlength{\parskip}{6pt plus 2pt minus 1pt} -} -\setlength{\emergencystretch}{3em} % prevent overfull lines -\providecommand{\tightlist}{% - \setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}} -\setcounter{secnumdepth}{0} -% Redefines (sub)paragraphs to behave more like sections -\ifx\paragraph\undefined\else -\let\oldparagraph\paragraph -\renewcommand{\paragraph}[1]{\oldparagraph{#1}\mbox{}} -\fi -\ifx\subparagraph\undefined\else -\let\oldsubparagraph\subparagraph -\renewcommand{\subparagraph}[1]{\oldsubparagraph{#1}\mbox{}} -\fi - -% set default figure placement to htbp -\makeatletter -\def\fps@figure{htbp} -\makeatother - - -\title{Dat - Distributed Dataset Synchronization And Versioning} -\author{Maxwell Ogden, Karissa McKelvey, Mathias Buus Madsen, Code for Science} -\date{May 2017} - -\begin{document} -\maketitle - -\section{Abstract}\label{abstract} - -Dat is a protocol designed for syncing folders of data, even if they are -large or changing constantly. Dat uses a cryptographically secure -register of changes to prove that the requested data version is -distributed. A byte range of any file's version can be efficiently -streamed from a Dat repository over a network connection. Consumers can -choose to fully or partially replicate the contents of a remote Dat -repository, and can also subscribe to live changes. To ensure writer and -reader privacy, Dat uses public key cryptography to encrypt network -traffic. A group of Dat clients can connect to each other to form a -public or private decentralized network to exchange data between each -other. A reference implementation is provided in JavaScript. - -\section{1. Background}\label{background} - -Many datasets are shared online today using HTTP and FTP, which lack -built in support for version control or content addressing of data. 
This -results in link rot and content drift as files are moved, updated or -deleted, leading to an alarming rate of disappearing data references in -areas such as -\href{http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0115253}{published -scientific literature}. - -Cloud storage services like S3 ensure availability of data, but they -have a centralized hub-and-spoke networking model and are therefore -limited by their bandwidth, meaning popular files can become very -expensive to share. Services like Dropbox and Google Drive provide -version control and synchronization on top of cloud storage services -which fixes many issues with broken links but rely on proprietary code -and services requiring users to store their data on centralized cloud -infrastructure which has implications on cost, transfer speeds, vendor -lock-in and user privacy. - -Distributed file sharing tools can become faster as files become more -popular, removing the bandwidth bottleneck and making file distribution -cheaper. They also use link resolution and discovery systems which can -prevent broken links meaning if the original source goes offline other -backup sources can be automatically discovered. However these file -sharing tools today are not supported by Web browsers, do not have good -privacy guarantees, and do not provide a mechanism for updating files -without redistributing a new dataset which could mean entirely -redownloading data you already have. - -\section{2. Dat}\label{dat} - -Dat is a dataset synchronization protocol that does not assume a dataset -is static or that the entire dataset will be downloaded. The main -reference implementation is available from npm as -\texttt{npm\ install\ dat\ -g}. - -The protocol is agnostic to the underlying transport e.g.~you could -implement Dat over carrier pigeon. Data is stored in a format called -SLEEP (Ogden and Buus 2017), described in it's own paper. The key -properties of the Dat design are explained in this section. - -\begin{itemize} -\tightlist -\item - 2.1 \textbf{Content Integrity} - Data and publisher integrity is - verified through use of signed hashes of the content. -\item - 2.2 \textbf{Decentralized Mirroring} - Users sharing the same Dat - automatically discover each other and exchange data in a swarm. -\item - 2.3 \textbf{Network Privacy} - Dat provides certain privacy guarantees - including end-to-end encryption. -\item - 2.4 \textbf{Incremental Versioning} - Datasets can be efficiently - synced, even in real time, to other peers. -\item - 2.5 \textbf{Random Access} - Huge file hierarchies can be efficiently - traversed remotely. -\end{itemize} - -\subsection{2.1 Content Integrity}\label{content-integrity} - -Content integrity means being able to verify the data you received is -the exact same version of the data that you expected. This is important -in a distributed system as this mechanism will catch incorrect data sent -by bad peers. It also has implications for reproducibility as it lets -you refer to a specific version of a dataset. - -Link rot, when links online stop resolving, and content drift, when data -changes but the link to the data remains the same, are two common issues -in data analysis. For example, one day a file called data.zip might -change, but a typical HTTP link to the file does not include a hash of -the content, or provide a way to get updated metadata, so clients that -only have the HTTP link have no way to check if the file changed without -downloading the entire file again. 
Referring to a file by the hash of -its content is called content addressability, and lets users not only -verify that the data they receive is the version of the data they want, -but also lets people cite specific versions of the data by referring to -a specific hash. - -Dat uses BLAKE2b (Aumasson et al. 2013) cryptographically secure hashes -to address content. Hashes are arranged in a Merkle tree (Mykletun, -Narasimha, and Tsudik 2003), a tree where each non-leaf node is the hash -of all child nodes. Leaf nodes contain pieces of the dataset. Due to the -properties of secure cryptographic hashes the top hash can only be -produced if all data below it matches exactly. If two trees have -matching top hashes then you know that all other nodes in the tree must -match as well, and you can conclude that your dataset is synchronized. -Trees are chosen as the primary data structure in Dat as they have a -number of properties that allow for efficient access to subsets of the -metadata, which allows Dat to work efficiently over a network -connection. - -\subsubsection{Dat Links}\label{dat-links} - -Dat links are Ed25519 (Bernstein et al. 2012) public keys which have a -length of 32 bytes (64 characters when Hex encoded). You can represent -your Dat link in the following ways and Dat clients will be able to -understand them: - -\begin{itemize} -\tightlist -\item - The standalone public key: -\end{itemize} - -\texttt{8e1c7189b1b2dbb5c4ec2693787884771201da9...} - -\begin{itemize} -\tightlist -\item - Using the dat:// protocol: -\end{itemize} - -\texttt{dat://8e1c7189b1b2dbb5c4ec2693787884771...} - -\begin{itemize} -\tightlist -\item - As part of an HTTP URL: -\end{itemize} - -\texttt{https://datproject.org/8e1c7189b1b2dbb5...} - -All messages in the Dat protocol are encrypted and signed using the -public key during transport. This means that unless you know the public -key (e.g.~unless the Dat link was shared with you) then you will not be -able to discover or communicate with any member of the swarm for that -Dat. Anyone with the public key can verify that messages (such as -entries in a Dat Stream) were created by a holder of the private key. - -Every Dat repository has a corresponding private key which is kept in -your home folder and never shared. Dat never exposes either the public -or private key over the network. During the discovery phase the BLAKE2b -hash of the public key is used as the discovery key. This means that the -original key is impossible to discover (unless it was shared publicly -through a separate channel) since only the hash of the key is exposed -publicly. - -Dat does not provide an authentication mechanism at this time. Instead -it provides a capability system. Anyone with the Dat link is currently -considered able to discover and access data. Do not share your Dat links -publicly if you do not want them to be accessed. - -\subsubsection{Hypercore and Hyperdrive}\label{hypercore-and-hyperdrive} - -The Dat storage, content integrity, and networking protocols are -implemented in a module called -\href{https://npmjs.org/hypercore}{Hypercore}. Hypercore is agnostic to -the format of the input data, it operates on any stream of binary data. -For the Dat use case of synchronizing datasets we use a file system -module on top of Hypercore called -\href{https://npmjs.org/hyperdrive}{Hyperdrive}. - -Dat has a layered abstraction so that users can use Hypercore directly -to have full control over how they model their data. 
Hyperdrive works -well when your data can be represented as files on a filesystem, which -is the main use case with Dat. - -\subsubsection{Hypercore Registers}\label{hypercore-registers} - -Hypercore Registers are the core mechanism used in Dat. They are binary -append-only streams whose contents are cryptographically hashed and -signed and therefore can be verified by anyone with access to the public -key of the writer. They are an implementation of the concept known as a -register, a digital ledger you can trust. - -Dat uses two registers, \texttt{content} and \texttt{metadata}. The -\texttt{content} register contains the files in your repository and -\texttt{metadata} contains the metadata about the files including name, -size, last modified time, etc. Dat replicates them both when -synchronizing with another peer. - -When files are added to Dat, each file gets split up into some number of -chunks, and the chunks are then arranged into a Merkle tree, which is -used later for version control and replication processes. - -\subsection{2.2 Decentralized Mirroring}\label{decentralized-mirroring} - -Dat is a peer to peer protocol designed to exchange pieces of a dataset -amongst a swarm of peers. As soon as a peer acquires their first piece -of data in the dataset they can choose to become a partial mirror for -the dataset. If someone else contacts them and needs the piece they -have, they can choose to share it. This can happen simultaneously while -the peer is still downloading the pieces they want from others. - -\subsubsection{Source Discovery}\label{source-discovery} - -An important aspect of mirroring is source discovery, the techniques -that peers use to find each other. Source discovery means finding the IP -and port of data sources online that have a copy of that data you are -looking for. You can then connect to them and begin exchanging data. By -using source discovery techniques Dat is able to create a network where -data can be discovered even if the original data source disappears. - -Source discovery can happen over many kinds of networks, as long as you -can model the following actions: - -\begin{itemize} -\tightlist -\item - \texttt{join(key,\ {[}port{]})} - Begin performing regular lookups on - an interval for \texttt{key}. Specify \texttt{port} if you want to - announce that you share \texttt{key} as well. -\item - \texttt{leave(key,\ {[}port{]})} - Stop looking for \texttt{key}. - Specify \texttt{port} to stop announcing that you share \texttt{key} - as well. -\item - \texttt{foundpeer(key,\ ip,\ port)} - Called when a peer is found by a - lookup. -\end{itemize} - -In the Dat implementation we implement the above actions on top of three -types of discovery networks: - -\begin{itemize} -\tightlist -\item - DNS name servers - An Internet standard mechanism for resolving keys - to addresses -\item - Multicast DNS - Useful for discovering peers on local networks -\item - Kademlia Mainline Distributed Hash Table - Less central points of - failure, increases probability of Dat working even if DNS servers are - unreachable -\end{itemize} - -Additional discovery networks can be implemented as needed. We chose the -above three as a starting point to have a complementary mix of -strategies to increase the probability of source discovery. Additionally -you can specify a Dat via HTTPS link, which runs the Dat protocol in -``single-source'' mode, meaning the above discovery networks are not -used, and instead only that one HTTPS server is used as the only peer. 
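-
-Concretely, the three discovery actions above can be modeled as a small
-interface. The following JavaScript sketch (ours, not part of any
-specification) shows the shape a backend implements, whether over DNS,
-Multicast DNS or the DHT:
-
-\begin{verbatim}
-// Shape of a discovery backend (names follow the actions above;
-// the bodies are ours, not a real API)
-function join (key, port) {
-  // look up `key` on an interval; if `port` is given, also announce
-  // that we can serve `key` on that port
-}
-function leave (key, port) {
-  // stop lookups for `key`; with `port`, stop announcing it too
-}
-function foundpeer (key, ip, port) {
-  // called for each peer found by a lookup; Dat then tries to open
-  // a TCP and/or UTP connection to ip:port
-}
-\end{verbatim}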
- -\subsubsection{Peer Connections}\label{peer-connections} - -After the discovery phase, Dat should have a list of potential data -sources to try and contact. Dat uses either TCP, HTTP or -\href{https://en.wikipedia.org/wiki/Micro_Transport_Protocol}{UTP} -(Rossi et al. 2010). UTP uses LEDBAT which is designed to not take up -all available bandwidth on a network (e.g.~so that other people sharing -wifi can still use the Internet), and is still based on UDP so works -with NAT traversal techniques like UDP hole punching. HTTP is supported -for compatibility with static file servers and web browser clients. Note -that these are the protocols we support in the reference Dat -implementation, but the Dat protocol itself is transport agnostic. - -If an HTTP source is specified Dat will prefer that one over other -sources. Otherwise when Dat gets the IP and port for a potential TCP or -UTP source it tries to connect using both protocols. If one connects -first, Dat aborts the other one. If none connect, Dat will try again -until it decides that source is offline or unavailable and then stops -trying to connect to them. Sources Dat is able to connect to go into a -list of known good sources, so that if/when the Internet connection goes -down Dat can use that list to reconnect to known good sources again -quickly. - -If Dat gets a lot of potential sources it picks a handful at random to -try and connect to and keeps the rest around as additional sources to -use later in case it decides it needs more sources. - -Once a duplex binary connection to a remote source is open Dat then -layers on the Hypercore protocol, a message-based replication protocol -that allows two peers to communicate over a stateless channel to request -and exchange data. You open separate replication channels with many -peers at once which allows clients to parallelize data requests across -the entire pool of peers they have established connections with. - -\subsection{2.3 Network Privacy}\label{network-privacy} - -On the Web today, with SSL, there is a guarantee that the traffic -between your computer and the server is private. As long as you trust -the server to not leak your logs, attackers who intercept your network -traffic will not be able to read the HTTP traffic exchanged between you -and the server. This is a fairly straightforward model as clients only -have to trust a single server for some domain. - -There is an inherent tradeoff in peer to peer systems of source -discovery vs.~user privacy. The more sources you contact and ask for -some data, the more sources you trust to keep what you asked for -private. Our goal is to have Dat be configurable in respect to this -tradeoff to allow application developers to meet their own privacy -guidelines. - -It is up to client programs to make design decisions around which -discovery networks they trust. For example if a Dat client decides to -use the BitTorrent DHT to discover peers, and they are searching for a -publicly shared Dat key (e.g.~a key cited publicly in a published -scientific paper) with known contents, then because of the privacy -design of the BitTorrent DHT it becomes public knowledge what key that -client is searching for. - -A client could choose to only use discovery networks with certain -privacy guarantees. For example a client could only connect to an -approved list of sources that they trust, similar to SSL. 
As long as
-they trust each source, the encryption built into the Dat network
-protocol will prevent the Dat key they are looking for from being
-leaked.
-
-\subsection{2.4 Incremental Versioning}\label{incremental-versioning}
-
-Given a stream of binary data, Dat splits the stream into chunks, hashes
-each chunk, and arranges the hashes in a specific type of Merkle tree
-that allows for certain replication properties.
-
-Dat is also able to fully or partially synchronize streams in a
-distributed setting even if the stream is being appended to. This is
-accomplished by using the messaging protocol to traverse the Merkle tree
-of remote sources and fetch a strategic set of nodes. Due to the
-low-level, message-oriented design of the replication protocol,
-different node traversal strategies can be implemented.
-
-There are two types of versioning performed automatically by Dat.
-Metadata is stored in a folder called \texttt{.dat} in the root folder
-of a repository, and data is stored as normal files in the root folder.
-
-\subsubsection{Metadata Versioning}\label{metadata-versioning}
-
-Dat tries as much as possible to act as a one-to-one mirror of the state
-of a folder and all its contents. When importing files, Dat uses a
-sorted, depth-first recursion to list all the files in the tree. For
-each file it finds, it grabs the filesystem metadata (filename, Stat
-object, etc.) and checks whether an entry for this filename with this
-exact metadata is already represented in the Dat repository metadata. If
-the metadata matches exactly the newest version of the file metadata
-stored in Dat, then this file will be skipped (no change).
-
-If the metadata differs from the newest stored version (or there are no
-entries for this filename at all in the history), then this new metadata
-entry will be appended as the new `latest' version for this file in the
-append-only SLEEP metadata content register (described below).
-
-\subsubsection{Content Versioning}\label{content-versioning}
-
-In addition to storing a historical record of filesystem metadata, the
-contents of the files themselves can also be stored in a
-version-controlled manner. The default storage system used in Dat stores
-the files as files. This has the advantage of being very straightforward
-for users to understand, but the downside of not storing old versions of
-content by default.
-
-In contrast to other version control systems like Git, Dat by default
-only stores the current set of checked out files on disk in the
-repository folder, not old versions. It does store all previous metadata
-for old versions in \texttt{.dat}. Git for example stores all previous
-content versions and all previous metadata versions in the \texttt{.git}
-folder. Because Dat is designed for larger datasets, if it stored all
-previous file versions in \texttt{.dat}, then the \texttt{.dat} folder
-could easily fill up the user's hard drive inadvertently. Therefore Dat
-has multiple storage modes based on usage.
-
-Hypercore registers include an optional \texttt{data} file that stores
-all chunks of data. In Dat, only the \texttt{metadata.data} file is
-used; the \texttt{content.data} file is not. The default behavior is to
-store the current files only as normal files. If you want to run an
-`archival' node that keeps all previous versions, you can configure Dat
-to use the \texttt{content.data} file instead. For example, on a shared
-server with lots of storage you probably want to store all versions.
-However on a workstation machine that is only accessing a subset of one
-version, the default mode of storing all metadata plus the current set
-of downloaded files is acceptable, because you know the server has the
-full history.
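-
-The metadata import check described under Metadata Versioning above can
-be sketched in a few lines of JavaScript (ours; the helper functions
-are hypothetical stand-ins, not names from the implementation):
-
-\begin{verbatim}
-// Walk the folder depth-first in sorted order and append a new
-// metadata entry for any file whose metadata changed.
-var fs = require('fs')
-
-for (var file of sortedDepthFirstList(root)) {
-  var stat = fs.statSync(file)        // filesystem metadata
-  var latest = latestEntryFor(file)   // newest entry in metadata feed
-  if (!latest || !sameMetadata(latest, stat)) {
-    appendMetadataEntry(file, stat)   // becomes the new 'latest'
-  }
-}
-\end{verbatim}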
-
-\subsubsection{Merkle Trees}\label{merkle-trees}
-
-Registers in Dat use a specific method of encoding a Merkle tree where
-hashes are positioned by a scheme called binary in-order interval
-numbering or just ``bin'' numbering. This is just a specific,
-deterministic way of laying out the nodes in a tree. For example a tree
-with 7 nodes will always be arranged like this:
-
-\begin{verbatim}
-0
-  1
-2
-    3
-4
-  5
-6
-\end{verbatim}
-
-In Dat, the hashes of the chunks of files are always even numbers, at
-the wide end of the tree. So the above tree has four original values
-that become the even numbers:
-
-\begin{verbatim}
-chunk0 -> 0
-chunk1 -> 2
-chunk2 -> 4
-chunk3 -> 6
-\end{verbatim}
-
-In the resulting Merkle tree, the even and odd nodes store different
-information:
-
-\begin{itemize}
-\tightlist
-\item
-  Evens - List of data hashes {[}chunk0, chunk1, chunk2, \ldots{}{]}
-\item
-  Odds - List of Merkle hashes (hashes of child even nodes) {[}hash0,
-  hash1, hash2, \ldots{}{]}
-\end{itemize}
-
-These two lists get interleaved into a single register such that the
-indexes (position) in the register are the same as the bin numbers from
-the Merkle tree.
-
-All odd hashes are derived by hashing the two child nodes, e.g.~given
-hash0 is \texttt{hash(chunk0)} and hash2 is \texttt{hash(chunk1)}, hash1
-is \texttt{hash(hash0\ +\ hash2)}.
-
-For example a register with two data entries would look something like
-this (pseudocode):
-
-\begin{verbatim}
-0. hash(chunk0)
-1. hash(hash(chunk0) + hash(chunk1))
-2. hash(chunk1)
-\end{verbatim}
-
-It is possible for the in-order Merkle tree to have multiple roots at
-once. A root is defined as a parent node with a full set of child node
-slots filled below it.
-
-For example, this tree has two roots (1 and 4):
-
-\begin{verbatim}
-0
-  1
-2
-
-4
-\end{verbatim}
-
-This tree has one root (3):
-
-\begin{verbatim}
-0
-  1
-2
-    3
-4
-  5
-6
-\end{verbatim}
-
-This one has one root (1):
-
-\begin{verbatim}
-0
-  1
-2
-\end{verbatim}
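-
-Since the layout is deterministic, a few lines of integer arithmetic
-are enough to navigate it. The following JavaScript sketch (ours, not
-part of the specification) computes the parent and children of a node
-in the bin numbering scheme:
-
-\begin{verbatim}
-// Sketch of bin numbering arithmetic (function names are ours).
-// depth(i) counts trailing 1 bits, e.g. depth(3) === 2.
-function depth (i) {
-  var d = 0
-  while (i & 1) { i >>= 1; d++ }
-  return d
-}
-// offset(i) is the node's position among nodes of the same depth
-function offset (i) { return i >> (depth(i) + 1) }
-// index(d, o) is the inverse: the bin number for depth d, offset o
-function index (d, o) { return (1 << d) - 1 + (o << (d + 1)) }
-
-function parent (i) { return index(depth(i) + 1, offset(i) >> 1) }
-function children (i) {
-  var d = depth(i)
-  if (d === 0) return null // leaves (even numbers) have no children
-  return [index(d - 1, offset(i) * 2), index(d - 1, offset(i) * 2 + 1)]
-}
-// parent(0) === 1, parent(2) === 1, children(3) -> [1, 5]
-\end{verbatim}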
-
-\subsubsection{Replication Example}\label{replication-example}
-
-This section describes the replication flow of a Dat at a high level.
-Note that the low-level details are available by reading the SLEEP
-section below. For the sake of illustrating how this works in practice
-in a networked replication scenario, consider a folder with two files:
-
-\begin{verbatim}
-bat.jpg
-cat.jpg
-\end{verbatim}
-
-To send these files to another machine using Dat, you would first add
-them to a Dat repository by splitting them into chunks and constructing
-SLEEP files representing the chunks and filesystem metadata.
-
-Let's assume \texttt{bat.jpg} and \texttt{cat.jpg} both produce three
-chunks, each around 64KB. Dat stores the data in a representation called
-SLEEP, but here we will show a pseudo-representation for the purposes of
-illustrating the replication process. The six chunks get sorted into a
-list like this:
-
-\begin{verbatim}
-bat-1
-bat-2
-bat-3
-cat-1
-cat-2
-cat-3
-\end{verbatim}
-
-These chunks then each get hashed, and the hashes get arranged into a
-Merkle tree (the content register):
-
-\begin{verbatim}
-0 - hash(bat-1)
-    1 - hash(0 + 2)
-2 - hash(bat-2)
-        3 - hash(1 + 5)
-4 - hash(bat-3)
-    5 - hash(4 + 6)
-6 - hash(cat-1)
-8 - hash(cat-2)
-    9 - hash(8 + 10)
-10 - hash(cat-3)
-\end{verbatim}
-
-Next we calculate the root hashes of our tree, in this case 3 and 9. We
-then hash them together, and cryptographically sign the hash. This
-signed hash now can be used to verify all nodes in the tree, and the
-signature proves it was produced by us, the holder of the private key
-for this Dat.
-
-This tree is for the hashes of the contents of the photos. There is also
-a second Merkle tree that Dat generates that represents the list of
-files and their metadata and looks something like this (the metadata
-register):
-
-\begin{verbatim}
-0 - hash({contentRegister: '9e29d624...'})
-    1 - hash(0 + 2)
-2 - hash({"bat.jpg", first: 0, length: 3})
-4 - hash({"cat.jpg", first: 3, length: 3})
-\end{verbatim}
-
-The first entry in this feed is a special metadata entry that tells Dat
-the address of the second feed (the content register). Note that node 3
-is not included yet, because 3 is the hash of \texttt{1\ +\ 5}, but 5
-does not exist yet, so will be written at a later update.
-
-Now we're ready to send our metadata to the other peer. The first
-message is a \texttt{Feed} message with the key that was shared for
-this Dat. Let's call the downloading peer Alice and ourselves Bob. Alice
-sends Bob a \texttt{Want} message that declares they want all nodes in
-the file list (the metadata register). Bob replies with a single
-\texttt{Have} message that indicates he has 3 nodes of data. Alice sends
-three \texttt{Request} messages, one for each leaf node
-(\texttt{0,\ 2,\ 4}). Bob sends back three \texttt{Data} messages. The
-first \texttt{Data} message contains the content register key, the hash
-of the sibling, in this case node \texttt{2}, the hash of the uncle root
-\texttt{4}, as well as a signature for the root hashes (in this case
-\texttt{1,\ 4}). Alice verifies the integrity of this first
-\texttt{Data} message by hashing the metadata received for the content
-register entry to produce the hash for node \texttt{0}. They then hash
-hash \texttt{0} together with the included hash \texttt{2} to reproduce
-hash \texttt{1}, and hash their \texttt{1} with the received value for
-\texttt{4}; the included signature lets them verify that these root
-hashes are authentic. When the next \texttt{Data} message is received, a
-similar process is performed to verify the content.
-
-Now Alice has the full list of files in the Dat, but decides they only
-want to download \texttt{cat.jpg}. Alice knows they want chunks 3
-through 5 from the content register. First Alice sends another
-\texttt{Feed} message with the content key to open a new replication
-channel over the connection. Then Alice sends three \texttt{Request}
-messages, one for each of chunks \texttt{3,\ 4,\ 5}. Bob sends back
-three \texttt{Data} messages with the data for each chunk, as well as
-the hashes needed to verify the content in a way similar to the process
-described above for the metadata feed.
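-
-The verification step Alice performs on that first \texttt{Data} message
-can be sketched in JavaScript (ours; \texttt{hash()} and
-\texttt{verify()} are stand-ins for BLAKE2b and Ed25519, operating on
-hex strings for brevity):
-
-\begin{verbatim}
-// msg carries the entry for node 0, the hash of sibling node 2,
-// the hash of uncle root 4, and a signature over the roots (1, 4).
-function verifyFirstData (msg, publicKey) {
-  var hash0 = hash(msg.value)          // recompute leaf node 0
-  var hash1 = hash(hash0 + msg.hash2)  // recompute parent node 1
-  // the roots of the metadata tree at this point are nodes 1 and 4
-  var digest = hash(hash1 + msg.hash4)
-  return verify(digest, msg.signature, publicKey)
-}
-\end{verbatim}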
-
-\subsection{2.5 Random Access}\label{random-access}
-
-Dat pursues the following access capabilities:
-
-\begin{itemize}
-\tightlist
-\item
-  Support large file hierarchies (millions of files in a single
-  repository).
-\item
-  Support efficient traversal of the hierarchy (listing files in
-  arbitrary folders efficiently).
-\item
-  Store all changes to all files (metadata and/or content).
-\item
-  List all changes made to any single file.
-\item
-  View the state of all files relative to any point in time.
-\item
-  Subscribe live to all changes (any file).
-\item
-  Subscribe live to changes to files under a specific path.
-\item
-  Efficiently access any byte range of any version of any file.
-\item
-  Allow all of the above to happen remotely, only syncing the minimum
-  metadata necessary to perform any action.
-\item
-  Allow efficient comparison of remote and local repository state to
-  request missing pieces during synchronization.
-\item
-  Allow entire remote archive to be synchronized, or just some subset of
-  files and/or versions.
-\end{itemize}
-
-Dat accomplishes these by storing all changes in Hypercore feeds,
-combined with metadata indexing strategies that allow certain queries to
-be answered efficiently by traversing the feeds. The protocol itself is
-specified in Section 3 (SLEEP), but a scenario-based summary follows
-here.
-
-\subsubsection{Scenario: Reading a file from a specific byte
-offset}\label{scenario-reading-a-file-from-a-specific-byte-offset}
-
-Alice has a dataset in Dat; Bob wants to access a 100MB CSV called
-\texttt{cat\_dna.csv} stored in the remote repository, but only wants to
-access the 10MB range of the CSV spanning from 30MB to 40MB.
-
-Bob has never communicated with Alice before, and is starting fresh with
-no knowledge of this Dat repository other than that he knows he wants
-\texttt{cat\_dna.csv} at a specific offset.
-
-First, Bob asks Alice through the Dat protocol for the metadata he needs
-to resolve \texttt{cat\_dna.csv} to the correct metadata feed entry that
-represents the file he wants. Note: In this scenario we assume Bob wants
-the latest version of \texttt{cat\_dna.csv}. It is also possible to do
-this for a specific older version.
-
-Bob first sends a \texttt{Request} message for the latest entry in the
-metadata feed. Alice responds. Bob looks at the \texttt{trie} value, and
-using the lookup algorithm described below sends another
-\texttt{Request} message for the metadata node that is closer to the
-filename he is looking for. This repeats until Alice sends Bob the
-matching metadata entry. This is the un-optimized resolution that uses
-\texttt{log(n)} round trips, though there are ways to optimize this by
-having Alice send additional sequence numbers to Bob that help him
-traverse in fewer round trips.
-
-In the metadata record Bob received for \texttt{cat\_dna.csv} there is
-the byte offset to the beginning of the file in the data feed. Bob adds
-his +30MB offset to this value and starts requesting pieces of data
-starting at that byte offset using the SLEEP protocol as described
-below.
-
-This method tries to allow any byte range of any file to be accessed
-without the need to synchronize the full metadata for all files up
-front.
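-
-The offset arithmetic in this scenario is simple. A JavaScript sketch
-(ours; the \texttt{entry} shape is hypothetical) of what Bob computes:
-
-\begin{verbatim}
-// entry is the metadata record Bob resolved for cat_dna.csv;
-// entry.byteOffset is where the file begins in the content feed.
-var start = entry.byteOffset + 30 * 1024 * 1024  // +30MB into the file
-var end = start + 10 * 1024 * 1024               // a 10MB window
-// Dat then resolves start..end to chunk indexes using the Merkle
-// tree (the SLEEP byte lookup described below) and requests only
-// those chunks.
-\end{verbatim}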
-
-\subsubsection{Scenario: Syncing live changes to files at a specific
-path}\label{scenario-syncing-live-changes-to-files-at-a-specific-path}
-
-TODO
-
-\subsubsection{Scenario: Syncing an entire
-archive}\label{scenario-syncing-an-entire-archive}
-
-TODO
-
-\section{3. Dat Network Protocol}\label{dat-network-protocol}
-
-The SLEEP format is designed to allow for sparse replication, meaning
-you can efficiently download only the metadata and data required to
-resolve a single byte region of a single file, which makes Dat suitable
-for a wide variety of streaming, real-time and large dataset use cases.
-
-To take advantage of this, Dat includes a network protocol. It is
-message-based and stateless, making it possible to implement on a
-variety of network transport protocols including UDP and TCP. Both
-metadata and content registers in SLEEP share the exact same replication
-protocol.
-
-Individual messages are encoded using Protocol Buffers. There are ten
-message types, each described below along with its schema.
-
-\subsubsection{Wire Protocol}\label{wire-protocol}
-
-Over the wire messages are packed in the following lightweight container
-format:
-
-\begin{verbatim}
-<varint - length of rest of message>
-<varint - header>
-<message>
-\end{verbatim}
-
-The \texttt{header} value is a single varint that has two pieces of
-information: the integer \texttt{type} that declares a 4-bit message
-type (used below), and a channel identifier, \texttt{0} for metadata and
-\texttt{1} for content.
-
-To generate this varint, you bitshift the 4-bit type integer onto the
-end of the channel identifier, e.g.
-\texttt{channel\ \textless{}\textless{}\ 4\ \textbar{}\ \textless{}4-bit-type\textgreater{}}.
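-
-A short JavaScript sketch (ours, not normative) of encoding and
-decoding this header value:
-
-\begin{verbatim}
-// channel << 4 | type, before varint framing (names are ours)
-function encodeHeader (channel, type) {
-  return channel << 4 | type  // type must fit in 4 bits
-}
-function decodeHeader (header) {
-  return { channel: header >> 4, type: header & 15 }
-}
-// e.g. a Data message (type 9) on the content channel (1):
-// encodeHeader(1, 9) === 25
-\end{verbatim}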
-
-\subsubsection{Feed}\label{feed}
-
-Type 0. Should be the first message sent on a channel.
-
-\begin{itemize}
-\tightlist
-\item
-  \texttt{discoveryKey} - A BLAKE2b keyed hash of the string `hypercore'
-  using the public key of the metadata register as the key.
-\item
-  \texttt{nonce} - 32 bytes of random binary data, used in our
-  encryption scheme
-\end{itemize}
-
-\begin{verbatim}
-message Feed {
-  required bytes discoveryKey = 1;
-  optional bytes nonce = 2;
-}
-\end{verbatim}
-
-\subsubsection{Handshake}\label{handshake}
-
-Type 1. Overall connection handshake. Should be sent just after the feed
-message on the first channel only (metadata).
-
-\begin{itemize}
-\tightlist
-\item
-  \texttt{id} - 32 bytes of random data used as an identifier for this
-  peer on the network, useful for checking if you are connected to
-  yourself or another peer more than once
-\item
-  \texttt{live} - Whether or not you want to operate in live
-  (continuous) replication mode or end after the initial sync
-\item
-  \texttt{userData} - User-specific metadata encoded as a byte sequence
-\item
-  \texttt{extensions} - List of extensions that are supported on this
-  Feed
-\end{itemize}
-
-\begin{verbatim}
-message Handshake {
-  optional bytes id = 1;
-  optional bool live = 2;
-  optional bytes userData = 3;
-  repeated string extensions = 4;
-}
-\end{verbatim}
-
-\subsubsection{Info}\label{info}
-
-Type 2. Message indicating state changes. Used to indicate whether you
-are uploading and/or downloading.
-
-Initial state for uploading/downloading is true. If both ends are not
-downloading and not live it is safe to consider the stream ended.
-
-\begin{verbatim}
-message Info {
-  optional bool uploading = 1;
-  optional bool downloading = 2;
-}
-\end{verbatim}
-
-\subsubsection{Have}\label{have}
-
-Type 3. How you tell the other peer what chunks of data you have or
-don't have. You should only send Have messages to peers who have
-expressed interest in this region with Want messages.
-
-\begin{itemize}
-\tightlist
-\item
-  \texttt{start} - If you only specify \texttt{start}, it means you are
-  telling the other side you have a single chunk at the position given
-  by \texttt{start}.
-\item
-  \texttt{length} - If you specify length, you can describe a range of
-  values that you have all of, starting from \texttt{start}.
-\item
-  \texttt{bitfield} - Use this if you would like to send sparse
-  have/don't-have information as a bitfield, relative to \texttt{start}.
-\end{itemize}
-
-\begin{verbatim}
-message Have {
-  required uint64 start = 1;
-  optional uint64 length = 2 [default = 1];
-  optional bytes bitfield = 3;
-}
-\end{verbatim}
-
-When sending bitfields you must run-length encode them. The encoded
-bitfield is a series of compressed and uncompressed bit sequences. All
-sequences start with a header that is a varint.
-
-If the last bit is set in the varint (it is an odd number) then a header
-represents a compressed bit sequence.
-
-\begin{verbatim}
-compressed-sequence = varint(
-  byte-length-of-sequence
-  << 2 | bit << 1 | 1
-)
-\end{verbatim}
-
-If the last bit is \emph{not} set then a header represents a
-non-compressed sequence.
-
-\begin{verbatim}
-uncompressed-sequence = varint(
-  byte-length-of-bitfield << 1 | 0
-) + (bitfield)
-\end{verbatim}
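-
-A JavaScript sketch (ours) of producing these two sequence kinds;
-\texttt{varint()} and \texttt{concat()} are stand-ins for a Protocol
-Buffers style varint encoder and byte concatenation:
-
-\begin{verbatim}
-// a run of `byteLength` bytes that are all `bit`: odd header
-function compressedSequence (byteLength, bit) {
-  return varint(byteLength << 2 | bit << 1 | 1)
-}
-// literal bitfield bytes: even header followed by the bytes
-function uncompressedSequence (bitfield) {
-  return concat([varint(bitfield.length << 1 | 0), bitfield])
-}
-\end{verbatim}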
-
-\subsubsection{Unhave}\label{unhave}
-
-Type 4. How you communicate that you deleted or removed a chunk you used
-to have.
-
-\begin{verbatim}
-message Unhave {
-  required uint64 start = 1;
-  optional uint64 length = 2 [default = 1];
-}
-\end{verbatim}
-
-\subsubsection{Want}\label{want}
-
-Type 5. How you ask the other peer to subscribe you to Have messages for
-a region of chunks. The \texttt{length} value defaults to Infinity or
-feed.length (if not live).
-
-\begin{verbatim}
-message Want {
-  required uint64 start = 1;
-  optional uint64 length = 2;
-}
-\end{verbatim}
-
-\subsubsection{Unwant}\label{unwant}
-
-Type 6. How you ask to unsubscribe from Have messages for a region of
-chunks from the other peer. You should only Unwant previously Wanted
-regions, but if you do Unwant something that hasn't been Wanted it won't
-have any effect. The \texttt{length} value defaults to Infinity or
-feed.length (if not live).
-
-\begin{verbatim}
-message Unwant {
-  required uint64 start = 1;
-  optional uint64 length = 2;
-}
-\end{verbatim}
-
-\subsubsection{Request}\label{request}
-
-Type 7. Request a single chunk of data.
-
-\begin{itemize}
-\tightlist
-\item
-  \texttt{index} - The chunk index for the chunk you want. You should
-  only ask for indexes that you have received Have messages for.
-\item
-  \texttt{bytes} - You can also optimistically specify a byte offset. If
-  the remote is able to resolve the chunk for this byte offset given
-  their current Merkle tree state, they will ignore the \texttt{index}
-  and send the chunk that resolves for this byte offset instead. But if
-  they cannot resolve the byte request, \texttt{index} will be used.
-\item
-  \texttt{hash} - If you only want the hash of the chunk and not the
-  chunk data itself.
-\item
-  \texttt{nodes} - A 64 bit long bitfield representing which parent
-  nodes you have.
-\end{itemize}
-
-The \texttt{nodes} bitfield is an optional optimization to reduce the
-amount of duplicate nodes exchanged during the replication lifecycle. It
-indicates which parents you have or don't have. You have a maximum of 64
-parents you can specify. Because \texttt{uint64} in Protocol Buffers is
-implemented as a varint, over the wire this does not take up 64 bits in
-most cases. The first bit is reserved to signify whether or not you need
-a signature in response. The rest of the bits represent whether or not
-you have (\texttt{1}) or don't have (\texttt{0}) the information at this
-node already. The ordering is determined by walking from parent to
-sibling up the tree all the way to the root.
-
-\begin{verbatim}
-message Request {
-  required uint64 index = 1;
-  optional uint64 bytes = 2;
-  optional bool hash = 3;
-  optional uint64 nodes = 4;
-}
-\end{verbatim}
-
-\subsubsection{Cancel}\label{cancel}
-
-Type 8. Cancel a previous Request message that you haven't received yet.
-
-\begin{verbatim}
-message Cancel {
-  required uint64 index = 1;
-  optional uint64 bytes = 2;
-  optional bool hash = 3;
-}
-\end{verbatim}
-
-\subsubsection{Data}\label{data}
-
-Type 9. Sends a single chunk of data to the other peer. You can send it
-in response to a Request or unsolicited on its own as a friendly gift.
-The data includes all of the Merkle tree parent nodes needed to verify
-the hash chain all the way up to the Merkle roots for this chunk.
-Because you can produce the direct parents by hashing the chunk, only
-the roots and `uncle' hashes are included (the siblings to all of the
-parent nodes).
-
-\begin{itemize}
-\tightlist
-\item
-  \texttt{index} - The chunk position for this chunk.
-\item
-  \texttt{value} - The chunk binary data. Empty if you are sending only
-  the hash.
-\item
-  \texttt{Node.index} - The index of this node in in-order notation
-\item
-  \texttt{Node.hash} - The hash of this node
-\item
-  \texttt{Node.size} - The aggregate chunk size for all children below
-  this node (the sum of all chunk sizes of all children)
-\item
-  \texttt{signature} - If you are sending a root node, all root nodes
-  must have the signature included.
-\end{itemize}
-
-\begin{verbatim}
-message Data {
-  required uint64 index = 1;
-  optional bytes value = 2;
-  repeated Node nodes = 3;
-  optional bytes signature = 4;
-
-  message Node {
-    required uint64 index = 1;
-    required bytes hash = 2;
-    required uint64 size = 3;
-  }
-}
-\end{verbatim}
-
-\section{4. Multi-Writer}\label{multi-writer}
-
-The design of Dat up to this point assumes you have a single keyholder
-writing and signing data and appending it to the metadata and content
-feed. However, the ability for multiple keyholders to write to a single
-repository allows for many interesting use cases such as forking and
-collaborative workflows.
-
-In order to do this, we use one \texttt{metadata.data} feed for each
-writer. Each writer gets their own keypair. Each writer is responsible
-for storing their private key. To add a new writer to your feed, you
-include their key in a metadata feed entry.
-
-For example, if Alice wants to give Bob write access to a Dat
-repository, Alice would take Bob's public key and write it to the
-`local' metadata feed (the feed that Alice owns, e.g.~the original
-feed). Now anyone else who replicates from Alice will find Bob's key in
-the history. If in the future Bob distributes a version of the Dat that
-he added new data to, everyone who has a copy of the Dat from Alice will
-have a copy of Bob's key that they can use to verify that Bob's writes
-are valid.
-
-On disk, each user's feed is stored in a separate hyperdrive. The
-original hyperdrive (owned by Alice) is called the `local' hyperdrive.
-Bob's hyperdrive would be stored separately in the SLEEP folder
-addressed by Bob's public key.
-
-In case Bob and Alice write different values for the same file (e.g.~Bob
-creates a ``fork''), when they sync up with each other replication will
-still work, but for the forked value the Dat client will return an array
-of values for that key instead of just one value. The values are linked
-to the writer that wrote them, so in the case of receiving multiple
-values, clients can choose the value from Alice, or Bob, or the latest
-value, or whatever other strategy they prefer.
-
-If a writer updates the value of a forked key with a new value they are
-performing a merge.
-
-\section{5. Existing Work}\label{existing-work}
-
-Dat is inspired by a number of features from existing systems.
-
-\subsection{Git}\label{git}
-
-Git popularized the idea of a directed acyclic graph (DAG) combined with
-a Merkle tree, a way to represent changes to data where each change is
-addressed by the secure hash of the change plus all ancestor hashes in a
-graph. This provides a way to trust data integrity, as the only way a
-specific hash could be derived by another peer is if they have the same
-data and change history required to reproduce that hash. This is
-important for reproducibility as it lets you trust that a specific git
-commit hash refers to a specific source code state.
-
-Decentralized version control tools for source code like Git provide a
-protocol for efficiently downloading changes to a set of files, but are
-optimized for text files and have issues with large files. Solutions
-like Git-LFS solve this by using HTTP to download large files, rather
-than the Git protocol. GitHub offers Git-LFS hosting but charges
-repository owners for bandwidth on popular files. Building a distributed
-distribution layer for files in a Git repository is difficult due to the
-design of Git Packfiles, which are delta-compressed repository states
-that do not easily support random access to byte ranges in previous file
-versions.
-
-\subsection{BitTorrent}\label{bittorrent}
-
-BitTorrent implements a swarm-based file sharing protocol for static
-datasets. Data is split into fixed-size chunks, hashed, and then that
-hash is used to discover peers that have the same data. An advantage of
-using BitTorrent for dataset transfers is that download bandwidth can be
-fully saturated. Since the file is split into pieces, and peers can
-efficiently discover which pieces each of the peers they are connected
-to have, it means one peer can download non-overlapping regions of the
-dataset from many peers at the same time in parallel, maximizing network
-throughput.
-
-Fixed-size chunking has drawbacks for data that changes. BitTorrent
-assumes all metadata will be transferred up front, which makes it
-impractical for streaming or updating content. Most BitTorrent clients
-divide data into 1024 pieces, meaning large datasets could have a very
-large chunk size, which impacts random access performance (e.g.~for
-streaming video).
-
-Another drawback of BitTorrent is due to the way clients advertise and
-discover other peers in the absence of any protocol-level privacy or
-trust. From a user privacy standpoint, BitTorrent leaks what users are
-accessing or attempting to access, and does not provide the same
-browsing privacy functions as systems like SSL.
-
-\subsection{Kademlia Distributed Hash
-Table}\label{kademlia-distributed-hash-table}
-
-Kademlia (Maymounkov and Mazieres 2002) is a distributed hash table, a
-distributed key/value store that can serve a similar purpose to DNS
-servers but has no hard-coded server addresses. All clients in Kademlia
-are also servers. As long as you know at least one address of another
-peer in the network, you can ask them for the key you are trying to find
-and they will either have it or give you some other people to talk to
-that are more likely to have it.
-
-If you don't have an initial peer to talk to, most clients use a
-bootstrap server that randomly gives you a peer in the network to start
-with. If the bootstrap server goes down, the network still functions as
-long as other methods can be used to bootstrap new peers (such as
-sending them peer addresses through side channels, like how .torrent
-files include tracker addresses to try in case Kademlia finds no peers).
-
-Kademlia is distinct from previous DHT designs due to its simplicity. It
-uses a very simple XOR operation between two keys as its ``distance''
-metric to decide which peers are closer to the data being searched for.
-On paper it seems like it wouldn't work, as it doesn't take into account
-things like ping speed or bandwidth. Instead its design is very simple
-on purpose, to minimize the amount of control/gossip messages and to
-minimize the amount of complexity required to implement it. In practice
-Kademlia has been extremely successful and is widely deployed as the
-``Mainline DHT'' for BitTorrent, with support in all popular BitTorrent
-clients today.
-
-Due to the simplicity of the original Kademlia design, a number of
-attacks such as DDoS and Sybil attacks have been demonstrated. There are
-protocol extensions (BEPs) which in certain cases mitigate the effects
-of these attacks, such as BEP 44, which includes a DDoS mitigation
-technique. Nonetheless anyone using Kademlia should be aware of the
-limitations.
-
-\subsection{Peer to Peer Streaming Peer Protocol
-(PPSPP)}\label{peer-to-peer-streaming-peer-protocol-ppspp}
-
-PPSPP
-(\href{https://datatracker.ietf.org/doc/rfc7574/?include_text=1}{IETF
-RFC 7574}, (Bakker, Petrocco, and Grishchenko 2015)) is a protocol for
-live streaming content over a peer to peer network. It defines a
-specific type of Merkle tree that allows for subsets of the hashes to be
-requested by a peer in order to reduce the time-till-playback for end
-users. BitTorrent, for example, transfers all hashes up front, which is
-not suitable for live streaming.
-
-Their Merkle trees are ordered using a scheme they call ``bin
-numbering'', which is a method for deterministically arranging an
-append-only log of leaf nodes into an in-order layout tree where
-non-leaf nodes are derived hashes. If you want to verify a specific
-node, you only need to request its sibling's hash and all its uncle
-hashes. PPSPP is very concerned with reducing round trip time and
-time-till-playback by allowing for many kinds of optimizations, such as
-packing as many hashes into datagrams as possible when exchanging tree
-information with peers.
-
-Although PPSPP was designed with streaming video in mind, the ability to
-request a subset of metadata from a large and/or streaming dataset is
-very desirable for many other types of datasets.
-
-\subsection{WebTorrent}\label{webtorrent}
-
-With WebRTC, browsers can now make peer to peer connections directly to
-other browsers.
BitTorrent uses UDP sockets, which aren't available to
-browser JavaScript, so it can't be used as-is on the Web.
-
-WebTorrent implements the BitTorrent protocol in JavaScript using WebRTC
-as the transport. This includes the BitTorrent block exchange protocol
-as well as the tracker protocol, implemented in a way that can enable
-hybrid nodes, talking simultaneously to both BitTorrent and WebTorrent
-swarms (if a client is capable of making both UDP sockets as well as
-WebRTC sockets, such as Node.js). Trackers are exposed to web clients
-over HTTP or WebSockets.
-
-\subsection{InterPlanetary File
-System}\label{interplanetary-file-system}
-
-IPFS is a family of application and network protocols that have peer to
-peer file sharing and data permanence baked in. IPFS abstracts network
-protocols and naming systems to provide an alternative application
-delivery platform to today's Web. For example, instead of using HTTP and
-DNS directly, in IPFS you would use LibP2P streams and IPNS in order to
-gain access to the features of the IPFS platform.
-
-\subsection{Certificate Transparency/Secure
-Registers}\label{certificate-transparencysecure-registers}
-
-The UK Government Digital Service have developed the concept of a
-register, which they define as a digital public ledger you can trust. In
-the UK, government registers are beginning to be piloted as a way to
-expose essential open data sets in a way where consumers can verify the
-data has not been tampered with, while allowing the data publishers to
-update their data sets over time.
-
-The design of registers was inspired by the infrastructure backing the
-Certificate Transparency (Laurie, Langley, and Kasper 2013) project,
-initiated at Google, which provides a service on top of SSL certificates
-that enables service providers to write certificates to a distributed
-public ledger. Any client or service provider can verify whether a
-certificate they received is in the ledger, which protects against
-so-called ``rogue certificates''.
-
-\section{6. Reference Implementation}\label{reference-implementation}
-
-The connection logic is implemented in a module called
-\href{https://www.npmjs.com/package/discovery-swarm}{discovery-swarm}.
-This builds on discovery-channel and adds connection establishment,
-management and statistics. It provides statistics such as how many
-sources are currently connected and how many well or badly behaving
-sources have been talked to, and it automatically handles connecting and
-reconnecting to sources. UTP support is implemented in the module
-\href{https://www.npmjs.com/package/utp-native}{utp-native}.
-
-Our implementation of source discovery is called
-\href{https://npmjs.org/discovery-channel}{discovery-channel}. We also
-run a \href{https://www.npmjs.com/package/dns-discovery}{custom DNS
-server} that Dat clients use (in addition to specifying their own if
-they need to), as well as a
-\href{https://github.com/bittorrent/bootstrap-dht}{DHT bootstrap}
-server. These discovery servers are the only centralized infrastructure
-we need for Dat to work over the Internet, but they are redundant and
-interchangeable, they never see the actual data being shared, and anyone
-can run their own; Dat will still work even if they are all unavailable.
-If this happens, discovery will just be manual (e.g.~manually sharing
-IP/ports).
-
-\section{Acknowledgements}\label{acknowledgements}
-
-This work was made possible through grants from the John S. and James L.
-Knight and Alfred P. Sloan Foundations.
-
-\section*{References}\label{references}
-\addcontentsline{toc}{section}{References}
-
-\hypertarget{refs}{}
-\hypertarget{ref-aumasson2013blake2}{}
-Aumasson, Jean-Philippe, Samuel Neves, Zooko Wilcox-O'Hearn, and
-Christian Winnerlein. 2013. ``BLAKE2: Simpler, Smaller, Fast as MD5.''
-In \emph{International Conference on Applied Cryptography and Network
-Security}, 119--35. Springer.
-
-\hypertarget{ref-bakker2015peer}{}
-Bakker, A, R Petrocco, and V Grishchenko. 2015. ``Peer-to-Peer Streaming
-Peer Protocol (PPSPP).''
-
-\hypertarget{ref-bernstein2012high}{}
-Bernstein, Daniel J, Niels Duif, Tanja Lange, Peter Schwabe, and Bo-Yin
-Yang. 2012. ``High-Speed High-Security Signatures.'' \emph{Journal of
-Cryptographic Engineering}. Springer, 1--13.
-
-\hypertarget{ref-laurie2013certificate}{}
-Laurie, Ben, Adam Langley, and Emilia Kasper. 2013. ``Certificate
-Transparency.''
-
-\hypertarget{ref-maymounkov2002kademlia}{}
-Maymounkov, Petar, and David Mazieres. 2002. ``Kademlia: A Peer-to-Peer
-Information System Based on the XOR Metric.'' In \emph{International
-Workshop on Peer-to-Peer Systems}, 53--65. Springer.
-
-\hypertarget{ref-mykletun2003providing}{}
-Mykletun, Einar, Maithili Narasimha, and Gene Tsudik. 2003. ``Providing
-Authentication and Integrity in Outsourced Databases Using Merkle Hash
-Trees.'' \emph{UCI-SCONCE Technical Report}.
-
-\hypertarget{ref-sleep}{}
-Ogden, Maxwell, and Mathias Buus. 2017. ``SLEEP - the Dat Protocol on
-Disk Format.''
-
-\hypertarget{ref-rossi2010ledbat}{}
-Rossi, Dario, Claudio Testa, Silvio Valenti, and Luca Muscariello. 2010.
-``LEDBAT: The New BitTorrent Congestion Control Protocol.'' In
-\emph{ICCCN}, 1--6.
-
-\end{document}
diff --git a/papers/sleep.latex b/papers/sleep.latex
new file mode 100644
index 0000000..cd695d0
--- /dev/null
+++ b/papers/sleep.latex
@@ -0,0 +1,751 @@
+\documentclass[a4paper,twocolumn]{article}
+\usepackage{lmodern}
+\usepackage{amssymb,amsmath}
+\usepackage{ifxetex,ifluatex}
+\usepackage{fixltx2e} % provides \textsubscript
+\ifnum 0\ifxetex 1\fi\ifluatex 1\fi=0 % if pdftex
+  \usepackage[T1]{fontenc}
+  \usepackage[utf8]{inputenc}
+\else % if luatex or xelatex
+  \ifxetex
+    \usepackage{mathspec}
+  \else
+    \usepackage{fontspec}
+  \fi
+  \defaultfontfeatures{Ligatures=TeX,Scale=MatchLowercase}
+\fi
+% use upquote if available, for straight quotes in verbatim environments
+\IfFileExists{upquote.sty}{\usepackage{upquote}}{}
+% use microtype if available
+\IfFileExists{microtype.sty}{%
+\usepackage{microtype}
+\UseMicrotypeSet[protrusion]{basicmath} % disable protrusion for tt fonts
+}{}
+\usepackage[unicode=true]{hyperref}
+\hypersetup{
+ pdftitle={SLEEP - Syncable Ledger of Exact Events Protocol},
+ pdfauthor={Mathias Buus Madsen, Maxwell Ogden, Code for Science},
+ pdfborder={0 0 0},
+ breaklinks=true}
+\urlstyle{same} % don't use monospace font for urls
+\IfFileExists{parskip.sty}{%
+\usepackage{parskip}
+}{% else
+\setlength{\parindent}{0pt}
+\setlength{\parskip}{6pt plus 2pt minus 1pt}
+}
+\setlength{\emergencystretch}{3em} % prevent overfull lines
+\providecommand{\tightlist}{%
+  \setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}}
+\setcounter{secnumdepth}{0}
+% Redefines (sub)paragraphs to behave more like sections
+\ifx\paragraph\undefined\else
+\let\oldparagraph\paragraph
+\renewcommand{\paragraph}[1]{\oldparagraph{#1}\mbox{}}
+\fi
+\ifx\subparagraph\undefined\else
+\let\oldsubparagraph\subparagraph
+\renewcommand{\subparagraph}[1]{\oldsubparagraph{#1}\mbox{}}
+\fi
+
+\title{SLEEP - Syncable Ledger of Exact
Events Protocol}
+\author{Mathias Buus Madsen, Maxwell Ogden, Code for Science}
+\date{August 2017}
+
+\begin{document}
+\maketitle
+
+\subsection{SLEEP}\label{sleep}
+
+This document is a technical description of the SLEEP format intended
+for implementers. SLEEP is the on-disk format that Dat produces and
+uses. It is a set of 9 files that hold all of the metadata needed to
+list the contents of a Dat repository and verify the integrity of the
+data you receive. SLEEP is designed to work with REST, allowing servers
+to be plain HTTP file servers serving the static SLEEP files, meaning
+you can implement a Dat protocol client using HTTP with a static HTTP
+file server as the backend.
+
+SLEEP files contain metadata about the data inside a Dat repository,
+including cryptographic hashes, cryptographic signatures, filenames and
+file permissions. The SLEEP format is specifically designed to allow
+efficient access to subsets of the metadata and/or data in the
+repository, even on very large repositories, which enables Dat's peer to
+peer networking to be fast.
+
+The acronym SLEEP is a slumber-related pun on REST and stands for
+Syncable Ledger of Exact Events Protocol. The Syncable part refers to
+how SLEEP files are append-only in nature, meaning they grow over time
+and new updates can be subscribed to as a realtime feed of events
+through the Dat protocol.
+
+The SLEEP version described here, used in Dat as of 2017, is SLEEP V2.
+SLEEP V1 is documented at http://specs.okfnlabs.org/sleep.
+
+\subsubsection{SLEEP Files}\label{sleep-files}
+
+SLEEP is a set of 9 files that should be stored with the following
+names. In Dat, the files are stored in a folder called \texttt{.dat} in
+the top level of the repository.
+
+\begin{verbatim}
+metadata.key
+metadata.signatures
+metadata.bitfield
+metadata.tree
+metadata.data
+content.key
+content.signatures
+content.bitfield
+content.tree
+\end{verbatim}
+
+The files prefixed with \texttt{content} store metadata about the
+primary data in a Dat repository, for example the raw binary contents of
+the files. The files prefixed with \texttt{metadata} store metadata
+about the files in the repository, for example the filenames, file
+sizes, and file permissions. The \texttt{content} and \texttt{metadata}
+files are both Hypercore registers, making SLEEP a set of two Hypercore
+registers.
+
+\subsubsection{SLEEP File Headers}\label{sleep-file-headers}
+
+The following structured binary format is used for \texttt{signatures},
+\texttt{bitfield}, and \texttt{tree} files. The header contains metadata
+as well as information needed to decode the rest of the file after the
+header. SLEEP files are designed to be easy to append new data to, easy
+to read at arbitrary byte offsets in the middle, and are relatively
+flat, simple files that rely on the filesystem for the heavy lifting.
+
+SLEEP files are laid out like this:
+
+\begin{verbatim}
+<32 byte header>
+<fixed-size entry 1>
+<fixed-size entry 2>
+<fixed-size entry ...>
+<fixed-size entry n>
+\end{verbatim}
+
+\begin{itemize}
+\tightlist
+\item
+  32 byte header
+\item
+  4 bytes Uint32BE (``Big-Endian'') - magic number (value varies
+  depending on which file, used to quickly identify which file type it
+  is)
+\item
+  1 byte - version number of the file header protocol, current version
+  is 0
+\item
+  2 byte Uint16BE - entry size, describes how long each entry in the
+  file is
+\item
+  1 byte - length prefix for body
+\item
+  rest of 32 byte header - string describing key or hash algorithm. The
+  length of this string matches the length in the previous length prefix
+  field. This string must fit within the 32 byte header limitation (24
+  bytes reserved for string). Unused bytes should be filled with zeroes.
+\end{itemize}
+
+Possible values in the Dat implementation for the body field are:
+
+\begin{verbatim}
+Ed25519
+BLAKE2b
+\end{verbatim}
+
+To calculate the offset of some entry position, first read the header
+and get the entry size, then do
+\texttt{32\ +\ entrySize\ *\ entryIndex}. To calculate how many entries
+are in a file, you can use the entry size and the filesize on disk and
+do \texttt{(fileSize\ -\ 32)\ /\ entrySize}.
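+
+A JavaScript sketch (ours) of this arithmetic:
+
+\begin{verbatim}
+// SLEEP file offset arithmetic (function names are ours)
+var HEADER_SIZE = 32
+
+function entryOffset (entrySize, entryIndex) {
+  return HEADER_SIZE + entrySize * entryIndex
+}
+function entryCount (entrySize, fileSize) {
+  return (fileSize - HEADER_SIZE) / entrySize
+}
+// e.g. in a tree file (40 byte entries), entry 3 starts at
+// entryOffset(40, 3) === 152
+\end{verbatim}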
+
+As mentioned above, \texttt{signatures}, \texttt{bitfield} and
+\texttt{tree} are the three SLEEP files. There are two additional files,
+\texttt{key} and \texttt{data}, which do not contain SLEEP file headers
+and store plain serialized data for easy access. \texttt{key} stores the
+public key that is described by the \texttt{signatures} file, and
+\texttt{data} stores the raw chunk data that the \texttt{tree} file
+contains the hashes and metadata for.
+
+\subsubsection{File Descriptions}\label{file-descriptions}
+
+\paragraph{key}\label{key}
+
+The public key used to verify the signatures in the \texttt{signatures}
+file, stored in binary as a single buffer written to disk. To find out
+what format of key is stored in this file, read the header of
+\texttt{signatures}. In Dat, it's always an Ed25519 public key, but
+other implementations can specify other key types using a string value
+in that header.
+
+\paragraph{tree}\label{tree}
+
+A SLEEP formatted 32 byte header with data entries representing a
+serialized Merkle tree based on the data in the data storage layer. All
+the fixed-size nodes are written in in-order tree notation. The header
+algorithm string for \texttt{tree} files is \texttt{BLAKE2b}. The entry
+size is 40 bytes. Entries are formatted like this:
+
+\begin{verbatim}
+<32 byte header>
+  <4 byte magic string: 0x05025702>
+  <1 byte version number: 0>
+  <2 byte entry size: 40>
+  <1 byte algorithm name length prefix: 7>
+  <7 byte algorithm name: BLAKE2b>
+  <17 zeroes>
+<40 byte entries>
+  <32 byte BLAKE2b hash>
+  <8 byte Uint64BE children leaf byte length>
+\end{verbatim}
+
+The children leaf byte length is the sum of the byte lengths of all leaf
+nodes in the tree below this node.
+
+This file uses the in-order notation, meaning even entries are leaf
+nodes and odd entries are parent nodes (non-leaf).
+
+To prevent pre-image attacks, all hashes start with a one byte type
+descriptor:
+
+\begin{verbatim}
+0 - LEAF
+1 - PARENT
+2 - ROOT
+\end{verbatim}
+
+To calculate leaf node entries (the hashes of the data entries) we hash
+this data:
+
+\begin{verbatim}
+BLAKE2b(
+  <1 byte type>
+    0
+  <8 bytes Uint64BE>
+    length of entry data
+  <entry data>
+)
+\end{verbatim}
+
+Then we take this 32 byte hash and write it to the tree as 40 bytes like
+this:
+
+\begin{verbatim}
+<32 bytes>
+  BLAKE2b hash
+<8 bytes Uint64BE>
+  length of data
+\end{verbatim}
+
+Note that the Uint64BE length of the data is included both in the hashed
+data and written at the end of the entry. This is to expose more
+metadata to Dat for advanced use cases such as verifying data length in
+sparse replication scenarios.
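+
+A JavaScript sketch (ours) of computing a 40 byte leaf entry;
+\texttt{blake2b()} and \texttt{uint64be()} are stand-ins for a real
+BLAKE2b implementation and an 8 byte big-endian unsigned integer
+encoder:
+
+\begin{verbatim}
+function leafEntry (chunk) {
+  var hash = blake2b(Buffer.concat([
+    Buffer.from([0]),        // 1 byte type descriptor: 0 = LEAF
+    uint64be(chunk.length),  // 8 byte Uint64BE length of entry data
+    chunk                    // the entry data itself
+  ]))
+  // the length is written again in the clear after the hash
+  return Buffer.concat([hash, uint64be(chunk.length)])
+}
+\end{verbatim}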
+ +To calculate parent node entries (the hashes of the leaf nodes) we hash +this data: + +\begin{verbatim} +BLAKE2b( + <1 byte> + 1 + <8 bytes Uint64BE> + left child length + right child length + <32 bytes> + left child hash + <32 bytes> + right child hash +) +\end{verbatim} + +Then we take this 32 byte hash and write it to the tree as 40 bytes like +this: + +\begin{verbatim} +<32 bytes> + BLAKE2b hash +<8 bytes Uint64BE> + left child length + right child length +\end{verbatim} + +The reason the tree entries contain data lengths is to allow for sparse +mode replication. Encoding lengths (and including lengths in all hashes) +means you can verify the Merkle subtrees independent of the rest of the +tree, which happens during sparse replication scenarios. + +The tree file corresponds directly to the \texttt{data} file. + +\paragraph{data}\label{data} + +The \texttt{data} file is only included in the SLEEP format for the +\texttt{metadata.*} prefixed files which contains filesystem metadata +and not actual file data. For the \texttt{content.*} files, the data is +stored externally (in Dat it is stored as normal files on the filesystem +and not in a SLEEP file). However you can configure Dat to use a +\texttt{content.data} file if you want and it will still work. If you +want to store the full history of all versions of all files, using the +\texttt{content.data} file would provide that guarantee, but would have +the disadvantage of storing files as chunks merged into one huge file +(not as user friendly). + +The \texttt{data} file does not contain a SLEEP file header. It just +contains a bunch of concatenated data entries. Entries are written in +the same order as they appear in the \texttt{tree} file. To read a +\texttt{data} file, first decode the \texttt{tree} file and for every +leaf in the \texttt{tree} file you can calculate a data offset for the +data described by that leaf node in the \texttt{data} file. + +\subparagraph{Index Lookup}\label{index-lookup} + +For example, if we wanted to seek to a specific entry offset (say entry +42): + +\begin{itemize} +\tightlist +\item + First, read the header of the \texttt{tree} file and get the entry + size, then do \texttt{32\ +\ entrySize\ *\ 42} to get the raw tree + index: \texttt{32\ +\ (40\ *\ 42)} +\item + Since we want the leaf entry (even node in the in-order layout), we + multiply the entry index by 2: \texttt{32\ +\ (40\ *\ (42\ *\ 2))} +\item + Read the 40 bytes at that offset in the \texttt{tree} file to get the + leaf node entry. +\item + Read the last 8 bytes of the entry to get the length of the data entry +\item + To calculate the offset of where in the \texttt{data} file your entry + begins, you need to sum all the lengths of all the earlier entries in + the tree. The most efficient way to do this is to sum all the previous + parent node (non-leaf) entry lengths. You can also sum all leaf node + lengths, but parent nodes contain the sum of their children's lengths + so it's more efficient to use parents. During Dat replication, these + nodes are fetched as part of the Merkle tree verification so you will + already have them locally. This is a log(N) operation where N is the + entry index. Entries are also small and therefore easily cacheable. +\item + Once you get the offset, you use the length you decoded above and read + N bytes (where N is the decoded length) at the offset in the + \texttt{data} file. You can verify the data integrity using the 32 + byte hash from the \texttt{tree} entry. 
+
+\subparagraph{Byte Lookup}\label{byte-lookup}
+
+The above method illustrates how to resolve a chunk position index to a
+byte offset. You can also do the reverse operation, resolving a byte
+offset to a chunk position index. This is used to stream arbitrary
+random access regions of files in sparse replication scenarios.
+
+\begin{itemize}
+\tightlist
+\item
+  First, calculate the current Merkle roots.
+\item
+  Each node in the tree (including these root nodes) stores the
+  aggregate byte length of all the nodes below it, so together the
+  roots describe every byte range in the repository.
+\item
+  Find the root that contains the byte range of the offset you are
+  looking for, get the node information for all of that node's children
+  using the Index Lookup method, and recursively repeat this step until
+  you reach the lowest child node that contains this byte range.
+\item
+  The chunk described by this child node will contain the byte range
+  you are looking for. You can use the \texttt{byteOffset} field in the
+  \texttt{Stat} metadata object to seek to the correct position in the
+  content file for the start of this chunk.
+\end{itemize}
+
+\subparagraph{Metadata Overhead}\label{metadata-overhead}
+
+Using this scheme, if you write 4GB of data using on average 64KB data
+chunks (note: chunks can be variable length and do not need to be the
+same size), your tree file will be around 5MB (about 0.125\% overhead).
+
+\paragraph{signatures}\label{signatures}
+
+A SLEEP formatted 32 byte header with data entries being 64 byte
+signatures.
+
+\begin{verbatim}
+<32 byte header>
+  <4 byte magic string: 0x05025701>
+  <1 byte version number: 0>
+  <2 byte entry size: 64>
+  <1 byte algorithm name length prefix: 7>
+  <7 byte algorithm name: Ed25519>
+  <17 zeroes>
+<64 byte entries>
+  <64 byte Ed25519 signature>
+\end{verbatim}
+
+Every time the tree is updated we sign the current roots of the Merkle
+tree and append the signature to the signatures file. The signatures
+file starts with no entries. Each time a new leaf is appended to the
+\texttt{tree} file (i.e.~whenever data is added to a Dat), we take all
+root hashes at the current state of the Merkle tree, hash and sign
+them, and append the signature as a new entry to the signatures file.
+
+\begin{verbatim}
+Ed25519 sign(
+  BLAKE2b(
+    <1 byte>
+      2 // root type
+    for (every root node left-to-right) {
+      <32 byte root hash>
+      <8 byte Uint64BE root tree index>
+      <8 byte Uint64BE child byte lengths>
+    }
+  )
+)
+\end{verbatim}
+
+The reason we hash all the root nodes is that the BLAKE2b hash above is
+only calculable if you have all of the pieces of data required to
+generate all the intermediate hashes. This is the crux of Dat's data
+integrity guarantees.
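+
+To make the signing step concrete, here is a sketch of how an
+implementation might produce one signature entry. It is written in
+Python with the PyNaCl library; the \texttt{roots} argument is assumed
+to be a list of \texttt{(hash,\ tree\_index,\ byte\_length)} tuples
+gathered from the current Merkle roots, which is our own convention,
+not part of the format:
+
+\begin{verbatim}
+import struct
+import hashlib
+from nacl.signing import SigningKey  # PyNaCl
+
+def sign_roots(signing_key, roots):
+    # Hash the current roots (left to right) and sign the digest.
+    h = hashlib.blake2b(digest_size=32)
+    h.update(b"\x02")                             # type descriptor: 2 = ROOT
+    for root_hash, tree_index, byte_length in roots:
+        h.update(root_hash)                       # 32 byte root hash
+        h.update(struct.pack(">Q", tree_index))   # Uint64BE tree index
+        h.update(struct.pack(">Q", byte_length))  # Uint64BE byte lengths
+    # Detached 64 byte Ed25519 signature: one signatures file entry.
+    return signing_key.sign(h.digest()).signature
+\end{verbatim}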
+
+\paragraph{bitfield}\label{bitfield}
+
+A SLEEP formatted 32 byte header followed by a series of 3328 byte long
+entries.
+
+\begin{verbatim}
+<32 byte header>
+  <4 byte magic string: 0x05025700>
+  <1 byte version number: 0>
+  <2 byte entry size: 3328>
+  <1 byte algorithm name length: 0>
+  <24 zeroes>
+<3328 byte entries> // (2048 + 1024 + 256)
+\end{verbatim}
+
+The bitfield describes which pieces of data you have, and which nodes
+in the \texttt{tree} file have been written. This file exists as an
+index of the \texttt{tree} and \texttt{data} files to quickly figure
+out which pieces of data you have or are missing. This file can be
+regenerated if you delete it, so it is considered a materialized index.
+
+The \texttt{bitfield} file actually contains three bitfields of
+different sizes. A bitfield (AKA bitmap) is defined as a set of bits
+where each bit (0 or 1) represents if you have or do not have a piece
+of data at that bit index. So if there is a dataset of 10 cat pictures,
+and you have pictures 1, 3, and 5 but are missing the rest, your
+bitfield would look like \texttt{1010100000}.
+
+Each entry contains three objects:
+
+\begin{itemize}
+\tightlist
+\item
+  Data Bitfield (1024 bytes) - 1 bit for each data entry that you have
+  synced (1 for every entry in \texttt{data}).
+\item
+  Tree Bitfield (2048 bytes) - 1 bit for every tree entry (all nodes in
+  \texttt{tree}).
+\item
+  Bitfield Index (256 bytes) - an index of the Data Bitfield that makes
+  it efficient to figure out which pieces of data are missing from the
+  Data Bitfield without having to do a linear scan.
+\end{itemize}
+
+The Data Bitfield size of 1KB is somewhat arbitrary, but the idea is
+that because most filesystems work in 4KB chunk sizes, we can fit the
+Data, Tree and Index in less than 4KB of data for efficient writes to
+the filesystem. The Tree and Index sizes are based on the Data size
+(the Tree has twice as many entries as the Data, since it covers both
+odd and even nodes in \texttt{tree} rather than just the even nodes,
+and the Index is always 1/4th the size of the Data).
+
+To generate the Index, you take 2 bytes at a time from the Data
+Bitfield, check if all bits in those 2 bytes are the same, and generate
+4 bits of Index metadata for every 2 bytes of Data (hence how 1024
+bytes of Data ends up as 256 bytes of Index).
+
+First, you generate a 2 bit tuple for each 2 bytes of Data:
+
+\begin{verbatim}
+if (data is all 1's) then [1,1]
+if (data is all 0's) then [0,0]
+if (data is not all the same) then [1,0]
+\end{verbatim}
+
+The Index itself is an in-order binary tree, not a traditional
+bitfield. To generate the tree, you take the tuples you generated above
+and write them into a tree like the following example, where each odd
+(non-leaf) tuple is generated by applying the same scheme to its
+relative even child tuples:
+
+\begin{verbatim}
+// for e.g. 16 bytes (8 tuples) of
+// sparsely replicated data
+0 - [00 00 00 00]
+1 - [10 10 10 10]
+2 - [11 11 11 11]
+\end{verbatim}
+
+The tuples at entry \texttt{1} above are \texttt{{[}1,0{]}} because the
+relative child tuples (entries \texttt{0} and \texttt{2}) are not
+uniform. In the following example, all non-leaf nodes are
+\texttt{{[}1,1{]}} because their relative children are all uniform
+(\texttt{{[}1,1{]}}).
+
+\begin{verbatim}
+// for e.g. 32 bytes (16 tuples) of
+// fully replicated data (all 1's)
+0 - [11 11 11 11]
+1 - [11 11 11 11]
+2 - [11 11 11 11]
+3 - [11 11 11 11]
+4 - [11 11 11 11]
+5 - [11 11 11 11]
+6 - [11 11 11 11]
+\end{verbatim}
+
+Using this scheme, it takes at most 8 bytes of Index to represent 32
+bytes of data. In this example the Index compresses well because it
+consists of all one bits; similarly, an empty bitfield is all zero
+bits.
+
+If you write 4GB of data using on average 64KB data chunk size, your
+bitfield will be at most 32KB.
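+
+The leaf level of the Index generation rule above can be sketched in a
+few lines of Python (illustrative only; the interleaving of parent
+tuples into the in-order tree is omitted, and the names are ours):
+
+\begin{verbatim}
+def index_tuple(pair):
+    # Map 2 bytes of Data Bitfield to a 2 bit tuple, as an int.
+    if pair == b"\xff\xff":
+        return 0b11  # all ones
+    if pair == b"\x00\x00":
+        return 0b00  # all zeroes
+    return 0b10      # mixed
+
+def index_leaves(data_bitfield):
+    # Pack the 2 bit tuples for every 2 bytes of Data into bytes.
+    # 1024 bytes of Data -> 512 tuples -> 128 bytes of leaf tuples;
+    # interleaved with parent tuples this becomes the 256 byte Index.
+    out = bytearray()
+    for i in range(0, len(data_bitfield), 8):  # 4 tuples per byte
+        byte = 0
+        for j in range(4):
+            pair = data_bitfield[i + 2 * j : i + 2 * j + 2]
+            byte = (byte << 2) | index_tuple(pair)
+        out.append(byte)
+    return bytes(out)
+\end{verbatim}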
+
+\paragraph{metadata.data}\label{metadata.data}
+
+This file is used to store content described by the rest of the
+\texttt{metadata.*} hypercore SLEEP files. Whereas the
+\texttt{content.*} SLEEP files describe the actual file data cloned
+into the Dat repository filesystem, the \texttt{metadata} data feed is
+stored inside the \texttt{.dat} folder along with the rest of the SLEEP
+files.
+
+The contents of this file are a series of versions of the Dat
+filesystem tree. As this is a hypercore data feed, it's just an
+append-only log of binary data entries. The challenge is representing a
+tree in one dimension so that it can be written to a Hypercore
+register. For example, imagine three files:
+
+\begin{verbatim}
+~/dataset $ ls
+figures
+  graph1.png
+  graph2.png
+results.csv
+
+1 directory, 3 files
+\end{verbatim}
+
+We want to take this structure and map it to a serialized
+representation that gets written into an append-only log in a way that
+still allows for efficient random access by file path.
+
+To do this, we convert the filesystem metadata into entries in a feed
+like this:
+
+\begin{verbatim}
+{
+  "path": "/results.csv",
+  "trie": [[]],
+  "sequence": 0
+}
+{
+  "path": "/figures/graph1.png",
+  "trie": [[0], []],
+  "sequence": 1
+}
+{
+  "path": "/figures/graph2.png",
+  "trie": [[0], [1]],
+  "sequence": 2
+}
+\end{verbatim}
+
+\subparagraph{Filename Resolution}\label{filename-resolution}
+
+Each sequence represents adding one of the files to the register, so at
+sequence 0 the filesystem state only has a single file,
+\texttt{results.csv}, in it. At sequence 1, only 2 files have been
+added to the register, and at sequence 2 all three files are present.
+The \texttt{trie} field is a shorthand way of declaring which other
+files at every level of the directory hierarchy exist alongside the
+file being added at that revision. For example at the time of sequence
+1, \texttt{trie} is \texttt{{[}{[}0{]},\ {[}{]}{]}}. The first
+sub-array, \texttt{{[}0{]}}, represents the first folder in the
+\texttt{path}, which is the root folder \texttt{/}. In this case
+\texttt{{[}0{]}} means the root folder at this point in time only has
+one other file, the file that is the subject of sequence \texttt{0}.
+The second sub-array is empty \texttt{{[}{]}} because there are no
+other existing files in the second folder in the \texttt{path},
+\texttt{figures}.
+
+To look up a file by filename, you fetch the latest entry in the log,
+then use the \texttt{trie} metadata in that entry to look up the
+longest common ancestor based on the parent folders of the filename you
+are querying. You can then recursively repeat this operation until you
+find the \texttt{path} entry you are looking for (or you exhaust all
+options, which means the file does not exist). This is an
+\texttt{O(number\ of\ slashes\ in\ your\ path)} operation.
+
+For example, if you wanted to look up \texttt{/results.csv} given the
+above register, you would start by grabbing the metadata at sequence 2.
+The longest common ancestor between \texttt{/results.csv} and
+\texttt{/figures/graph2.png} is \texttt{/}. You then grab the
+corresponding entry in the \texttt{trie} array for \texttt{/}, which in
+this case is the first entry, \texttt{{[}0{]}}. You then repeat this
+with all of the \texttt{trie} entries until you find a child that is
+closer to the entry you are looking for. In this example, the first
+entry happens to be the match we are looking for.
+
+You can also perform lookups relative to a point in time by starting
+from a specific sequence number in the register, for example to get the
+state of some file at an old sequence number, similar to checking out
+an old version of a repository in Git.
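+
+To illustrate the lookup walk, here is a deliberately naive sketch in
+Python. The \texttt{entries} list stands in for the decoded register,
+and the helper names are ours; a real implementation works against the
+encoded trie and picks among candidate children more carefully:
+
+\begin{verbatim}
+def common_depth(a, b):
+    # Number of leading path components shared by a and b.
+    pa, pb = a.strip("/").split("/"), b.strip("/").split("/")
+    d = 0
+    while d < min(len(pa), len(pb)) and pa[d] == pb[d]:
+        d += 1
+    return d
+
+def lookup(entries, path):
+    node, seen = entries[-1], set()  # start from the latest entry
+    while node["path"] != path:
+        level = common_depth(node["path"], path)
+        refs = node["trie"][level] if level < len(node["trie"]) else []
+        if not refs or refs[0] in seen:
+            return None              # options exhausted: no such file
+        seen.add(refs[0])
+        node = entries[refs[0]]      # follow the first candidate
+    return node
+
+entries = [
+    {"path": "/results.csv",        "trie": [[]],       "sequence": 0},
+    {"path": "/figures/graph1.png", "trie": [[0], []],  "sequence": 1},
+    {"path": "/figures/graph2.png", "trie": [[0], [1]], "sequence": 2},
+]
+assert lookup(entries, "/results.csv")["sequence"] == 0
+\end{verbatim}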
+
+\subparagraph{Data Serialization}\label{data-serialization}
+
+The format of the \texttt{metadata.data} file is as follows:
+
+\begin{verbatim}
+<Header>
+<Node 1>
+<Node 2>
+<Node ...>
+<Node N>
+\end{verbatim}
+
+Each entry in the file is encoded using Protocol Buffers (Varda 2008).
+
+The first message we write to the file is of a type called Header which
+uses this schema:
+
+\begin{verbatim}
+message Header {
+  required string type = 1;
+  optional bytes content = 2;
+}
+\end{verbatim}
+
+This is used to declare two pieces of metadata used by Dat. It includes
+a \texttt{type} string with the value \texttt{hyperdrive} and a
+\texttt{content} binary value that holds the public key of the content
+register that this metadata register represents. When you share a Dat,
+the metadata key is the main key that gets used, and the content
+register key is linked from here in the metadata.
+
+After the header the file will contain many filesystem \texttt{Node}
+entries:
+
+\begin{verbatim}
+message Node {
+  required string path = 1;
+  optional Stat value = 2;
+  optional bytes trie = 3;
+  repeated Writer writers = 4;
+  optional uint64 writersSequence = 5;
+}
+
+message Writer {
+  required bytes publicKey = 1;
+  optional string permission = 2;
+}
+\end{verbatim}
+
+The \texttt{Node} object has five fields:
+
+\begin{itemize}
+\tightlist
+\item
+  \texttt{path} - the string of the absolute file path of this file.
+\item
+  \texttt{value} - a \texttt{Stat} encoded object representing the file
+  metadata.
+\item
+  \texttt{trie} - a compressed list of the sequence numbers as
+  described earlier.
+\item
+  \texttt{writers} - a list of the writers who are allowed to write to
+  this Dat.
+\item
+  \texttt{writersSequence} - a reference to the last sequence where the
+  writers array was modified. You can use this to quickly find the
+  values of the writer keys.
+\end{itemize}
+
+The \texttt{trie} value is encoded by starting with the nested array of
+sequence numbers, e.g.
+\texttt{{[}{[}{[}0,\ 3{]}{]},\ {[}{[}0,\ 2{]},\ {[}0,\ 1{]}{]}{]}}. Each
+entry is a tuple where the first item is the index of the feed in the
+\texttt{writers} array and the second value is the sequence number.
+Finally, you prepend the trie value with a version number varint.
+
+To write these subarrays we use variable width integers (varints),
+using a repeating pattern like this, one for each array:
+
+\begin{verbatim}
+<varint of number of entries in this array>
+<varint of feed index>      // entry 1
+<varint of sequence number>
+<varint of feed index>      // entry 2
+<varint of sequence number>
+...
+\end{verbatim}
+
+This encoding is designed for efficiency as it reduces the filesystem
+path + feed index metadata down to a series of small integers.
+
+The \texttt{Stat} objects use this encoding:
+
+\begin{verbatim}
+message Stat {
+  required uint32 mode = 1;
+  optional uint32 uid = 2;
+  optional uint32 gid = 3;
+  optional uint64 size = 4;
+  optional uint64 blocks = 5;
+  optional uint64 offset = 6;
+  optional uint64 byteOffset = 7;
+  optional uint64 mtime = 8;
+  optional uint64 ctime = 9;
+}
+\end{verbatim}
+
+These are the field definitions:
+
+\begin{itemize}
+\tightlist
+\item
+  \texttt{mode} - POSIX file mode bitmask
+\item
+  \texttt{uid} - POSIX user id
+\item
+  \texttt{gid} - POSIX group id
+\item
+  \texttt{size} - file size in bytes
+\item
+  \texttt{blocks} - number of data chunks that make up this file
+\item
+  \texttt{offset} - the data feed entry index for the first chunk in
+  this file
+\item
+  \texttt{byteOffset} - the data feed file byte offset for the first
+  chunk in this file
+\item
+  \texttt{mtime} - POSIX modified\_at time
+\item
+  \texttt{ctime} - POSIX created\_at time
+\end{itemize}
+
+\subsection*{References}\label{references}
+\addcontentsline{toc}{subsection}{References}
+
+\hypertarget{refs}{}
+\hypertarget{ref-varda2008protocol}{}
+Varda, Kenton. 2008.
``Protocol Buffers: Google's Data Interchange +Format.'' \emph{Google Open Source Blog, Available at Least as Early as +Jul}. + +\end{document} diff --git a/papers/sleep.pdf b/papers/sleep.pdf index a4281e8..1c59c91 100644 Binary files a/papers/sleep.pdf and b/papers/sleep.pdf differ diff --git a/papers/sleep.txt b/papers/sleep.txt deleted file mode 100644 index 241347f..0000000 --- a/papers/sleep.txt +++ /dev/null @@ -1,756 +0,0 @@ -\documentclass[a4paperpaper,twocolumn]{article} -\usepackage{lmodern} -\usepackage{amssymb,amsmath} -\usepackage{ifxetex,ifluatex} -\usepackage{fixltx2e} % provides \textsubscript -\ifnum 0\ifxetex 1\fi\ifluatex 1\fi=0 % if pdftex - \usepackage[T1]{fontenc} - \usepackage[utf8]{inputenc} -\else % if luatex or xelatex - \ifxetex - \usepackage{mathspec} - \else - \usepackage{fontspec} - \fi - \defaultfontfeatures{Ligatures=TeX,Scale=MatchLowercase} -\fi -% use upquote if available, for straight quotes in verbatim environments -\IfFileExists{upquote.sty}{\usepackage{upquote}}{} -% use microtype if available -\IfFileExists{microtype.sty}{% -\usepackage{microtype} -\UseMicrotypeSet[protrusion]{basicmath} % disable protrusion for tt fonts -}{} -\usepackage[unicode=true]{hyperref} -\hypersetup{ - pdftitle={SLEEP - Syncable Ledger of Exact Events Protocol}, - pdfauthor={Mathias Buus Madsen, Maxwell Ogden, Code for Science}, - pdfborder={0 0 0}, - breaklinks=true} -\urlstyle{same} % don't use monospace font for urls -\IfFileExists{parskip.sty}{% -\usepackage{parskip} -}{% else -\setlength{\parindent}{0pt} -\setlength{\parskip}{6pt plus 2pt minus 1pt} -} -\setlength{\emergencystretch}{3em} % prevent overfull lines -\providecommand{\tightlist}{% - \setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}} -\setcounter{secnumdepth}{0} -% Redefines (sub)paragraphs to behave more like sections -\ifx\paragraph\undefined\else -\let\oldparagraph\paragraph -\renewcommand{\paragraph}[1]{\oldparagraph{#1}\mbox{}} -\fi -\ifx\subparagraph\undefined\else -\let\oldsubparagraph\subparagraph -\renewcommand{\subparagraph}[1]{\oldsubparagraph{#1}\mbox{}} -\fi - -% set default figure placement to htbp -\makeatletter -\def\fps@figure{htbp} -\makeatother - - -\title{SLEEP - Syncable Ledger of Exact Events Protocol} -\author{Mathias Buus Madsen, Maxwell Ogden, Code for Science} -\date{August 2017} - -\begin{document} -\maketitle - -\subsection{SLEEP}\label{sleep} - -This document is a technical description of the SLEEP format intended -for implementers. SLEEP is the the on-disk format that Dat produces and -uses. It is a set of 9 files that hold all of the metadata needed to -list the contents of a Dat repository and verify the integrity of the -data you receive. SLEEP is designed to work with REST, allowing servers -to be plain HTTP file servers serving the static SLEEP files, meaning -you can implement a Dat protocol client using HTTP with a static HTTP -file server as the backend. - -SLEEP files contain metadata about the data inside a Dat repository, -including cryptographic hashes, cryptographic signatures, filenames and -file permissions. The SLEEP format is specifically designed to allow -efficient access to subsets of the metadata and/or data in the -repository, even on very large repositories, which enables Dat's peer to -peer networking to be fast. - -The acronym SLEEP is a slumber related pun on REST and stands for -Syncable Ledger of Exact Events Protocol. 
The Syncable part refers to -how SLEEP files are append-only in nature, meaning they grow over time -and new updates can be subscribed to as a realtime feed of events -through the Dat protocol. - -The SLEEP version described here, used in Dat as of 2017 is SLEEP V2. -SLEEP V1 is documented at http://specs.okfnlabs.org/sleep. - -\subsubsection{SLEEP Files}\label{sleep-files} - -SLEEP is a set of 9 files that should be stored with the following -names. In Dat, the files are stored in a folder called \texttt{.dat} in -the top level of the repository. - -\begin{verbatim} -metadata.key -metadata.signatures -metadata.bitfield -metadata.tree -metadata.data -content.key -content.signatures -content.bitfield -content.tree -\end{verbatim} - -The files prefixed with \texttt{content} store metadata about the -primary data in a Dat repository, for example the raw binary contents of -the files. The files prefixed with \texttt{metadata} store metadata -about the files in the repository, for example the filenames, file -sizes, and file permissions. The \texttt{content} and \texttt{metadata} -files are both Hypercore registers, making SLEEP a set of two Hypercore -registers. - -\subsubsection{SLEEP File Headers}\label{sleep-file-headers} - -The following structured binary format is used for \texttt{signatures}, -\texttt{bitfield}, and \texttt{tree} files. The header contains metadata -as well as information needed to decode the rest of the files after the -header. SLEEP files are designed to be easy to append new data, easy to -read arbitrary byte offsets in the middle, and are relatively flat, -simple files that rely on the filesystem for the heavy lifting. - -SLEEP files are laid out like this: - -\begin{verbatim} -<32 byte header> - - - - -\end{verbatim} - -\begin{itemize} -\tightlist -\item - 32 byte header -\item - 4 bytes - magic byte (value varies depending on which file, used to - quickly identify which file type it is) -\item - 1 byte - version number of the file header protocol, current version - is 0 -\item - 2 byte Uint16BE - entry size, describes how long each entry in the - file is -\item - 1 byte - length prefix for body -\item - rest of 32 byte header - string describing key algorithm (in dat - `ed25519'). length of this string matches the length in the previous - length prefix field. This string must fit within the 32 byte header - limitation (24 bytes reserved for string). Unused bytes should be - filled with zeroes. -\end{itemize} - -Possible values in the Dat implementation for the body field are: - -\begin{verbatim} -Ed25519 -BLAKE2b -\end{verbatim} - -To calculate the offset of some entry position, first read the header -and get the entry size, then do -\texttt{32\ +\ entrySize\ *\ entryIndex}. To calculate how many entries -are in a file, you can use the entry size and the filesize on disk and -do \texttt{(fileSize\ -\ 32)\ /\ entrySize}. - -As mentioned above, \texttt{signatures}, \texttt{bitfield} and -\texttt{tree} are the three SLEEP files. There are two additional files, -\texttt{key}, and \texttt{data}, which do not contain SLEEP file headers -and store plain serialized data for easy access. \texttt{key} stores the -public key that is described by the \texttt{signatures} file, and -\texttt{data} stores the raw chunk data that the \texttt{tree} file -contains the hashes and metadata for. 
- -\subsubsection{File Descriptions}\label{file-descriptions} - -\paragraph{key}\label{key} - -The public key used to verify the signatures in the \texttt{signatures} -file, stored in binary as a single buffer written to disk. To find out -what format of key is stored in this file, read the header of -\texttt{signatures}. In Dat, it's always a ed25519 public key, but other -implementations can specify other key types using a string value in that -header. - -\paragraph{tree}\label{tree} - -A SLEEP formatted 32 byte header with data entries representing a -serialized Merkle tree based on the data in the data storage layer. All -the fixed size nodes written in in-order tree notation. The header -algorithm string for \texttt{tree} files is \texttt{BLAKE2b}. The entry -size is 40 bytes. Entries are formatted like this: - -\begin{verbatim} -<32 byte header> - <4 byte magic string: 0x05025702> - <1 byte version number: 0> - <2 byte entry size: 40> - <1 byte algorithm name length prefix: 7> - <7 byte algorithm name: BLAKE2b> - <17 zeroes> -<40 byte entries> - <32 byte BLAKE2b hash> - <8 byte Uint64BE children leaf byte length> -\end{verbatim} - -The children leaf byte length is the byte size containing the sum byte -length of all leaf nodes in the tree below this node. - -This file uses the in-order notation, meaning even entries are leaf -nodes and odd entries are parent nodes (non-leaf). - -To prevent pre-image attacks, all hashes start with a one byte type -descriptor: - -\begin{verbatim} -0 - LEAF -1 - PARENT -2 - ROOT -\end{verbatim} - -To calculate leaf node entries (the hashes of the data entries) we hash -this data: - -\begin{verbatim} -BLAKE2b( - <1 byte type> - 0 - <8 bytes Uint64BE> - length of entry data - -) -\end{verbatim} - -Then we take this 32 byte hash and write it to the tree as 40 bytes like -this: - -\begin{verbatim} -<32 bytes> - BLAKE2b hash -<8 bytes Uint64BE> - length of data -\end{verbatim} - -Note that the Uint64 of length of data is included both in the hashed -data and written at the end of the entry. This is to expose more -metadata to Dat for advanced use cases such as verifying data length in -sparse replication scenarios. - -To calculate parent node entries (the hashes of the leaf nodes) we hash -this data: - -\begin{verbatim} -BLAKE2b( - <1 byte> - 1 - <8 bytes Uint64BE> - left child length + right child length - <32 bytes> - left child hash - <32 bytes> - right child hash -) -\end{verbatim} - -Then we take this 32 byte hash and write it to the tree as 40 bytes like -this: - -\begin{verbatim} -<32 bytes> - BLAKE2b hash -<8 bytes Uint64BE> - left child length + right child length -\end{verbatim} - -The reason the tree entries contain data lengths is to allow for sparse -mode replication. Encoding lengths (and including lengths in all hashes) -means you can verify the Merkle subtrees independent of the rest of the -tree, which happens during sparse replication scenarios. - -The tree file corresponds directly to the \texttt{data} file. - -\paragraph{data}\label{data} - -The \texttt{data} file is only included in the SLEEP format for the -\texttt{metadata.*} prefixed files which contains filesystem metadata -and not actual file data. For the \texttt{content.*} files, the data is -stored externally (in Dat it is stored as normal files on the filesystem -and not in a SLEEP file). However you can configure Dat to use a -\texttt{content.data} file if you want and it will still work. 
If you -want to store the full history of all versions of all files, using the -\texttt{content.data} file would provide that guarantee, but would have -the disadvantage of storing files as chunks merged into one huge file -(not as user friendly). - -The \texttt{data} file does not contain a SLEEP file header. It just -contains a bunch of concatenated data entries. Entries are written in -the same order as they appear in the \texttt{tree} file. To read a -\texttt{data} file, first decode the \texttt{tree} file and for every -leaf in the \texttt{tree} file you can calculate a data offset for the -data described by that leaf node in the \texttt{data} file. - -\subparagraph{Index Lookup}\label{index-lookup} - -For example, if we wanted to seek to a specific entry offset (say entry -42): - -\begin{itemize} -\tightlist -\item - First, read the header of the \texttt{tree} file and get the entry - size, then do \texttt{32\ +\ entrySize\ *\ 42} to get the raw tree - index: \texttt{32\ +\ (40\ *\ 42)} -\item - Since we want the leaf entry (even node in the in-order layout), we - multiply the entry index by 2: \texttt{32\ +\ (40\ *\ (42\ *\ 2))} -\item - Read the 40 bytes at that offset in the \texttt{tree} file to get the - leaf node entry. -\item - Read the last 8 bytes of the entry to get the length of the data entry -\item - To calculate the offset of where in the \texttt{data} file your entry - begins, you need to sum all the lengths of all the earlier entries in - the tree. The most efficient way to do this is to sum all the previous - parent node (non-leaf) entry lengths. You can also sum all leaf node - lengths, but parent nodes contain the sum of their children's lengths - so it's more efficient to use parents. During Dat replication, these - nodes are fetched as part of the Merkle tree verification so you will - already have them locally. This is a log(N) operation where N is the - entry index. Entries are also small and therefore easily cacheable. -\item - Once you get the offset, you use the length you decoded above and read - N bytes (where N is the decoded length) at the offset in the - \texttt{data} file. You can verify the data integrity using the 32 - byte hash from the \texttt{tree} entry. -\end{itemize} - -\subparagraph{Byte Lookup}\label{byte-lookup} - -The above method illustrates how to resolve a chunk position index to a -byte offset. You can also do the reverse operation, resolving a byte -offset to a chunk position index. This is used to stream arbitrary -random access regions of files in sparse replication scenarios. - -\begin{itemize} -\tightlist -\item - First, you start by calculating the current Merkle roots -\item - Each node in the tree (including these root nodes) stores the - aggregate file size of all byte sizes of the nodes below it. So the - roots cumulatively will describe all possible byte ranges for this - repository. -\item - Find the root that contains the byte range of the offset you are - looking for and get the node information for all of that nodes - children using the Index Lookup method, and recursively repeat this - step until you find the lowest down child node that describes this - byte range. -\item - The chunk described by this child node will contain the byte range you - are looking for. You can use the \texttt{byteOffset} property in the - \texttt{Stat} metadata object to seek into the right position in the - content for the start of this chunk. 
-\end{itemize} - -\subparagraph{Metadata Overhead}\label{metadata-overhead} - -Using this scheme, if you write 4GB of data using on average 64KB data -chunks (note: chunks can be variable length and do not need to be the -same size), your tree file will be around 5MB (0.0125\% overhead). - -\paragraph{signatures}\label{signatures} - -A SLEEP formatted 32 byte header with data entries being 64 byte -signatures. - -\begin{verbatim} -<32 byte header> - <4 byte magic string: 0x05025701> - <1 byte version number: 0> - <2 byte entry size: 64> - <1 byte algorithm name length prefix: 7> - <7 byte algorithm name: Ed25519> - <17 zeroes> -<64 byte entries> - <64 byte Ed25519 signature> -\end{verbatim} - -Every time the tree is updated we sign the current roots of the Merkle -tree, and append them to the signatures file. The signatures file starts -with no entries. Each time a new leaf is appended to the \texttt{tree} -file (aka whenever data is added to a Dat), we take all root hashes at -the current state of the Merkle tree and hash and sign them, then append -them as a new entry to the signatures file. - -\begin{verbatim} -Ed25519 sign( - BLAKE2b( - <1 byte> - 2 // root type - for (every root node left-to-right) { - <32 byte root hash> - <8 byte Uint64BE root tree index> - <8 byte Uint64BE child byte lengths> - } - ) -) -\end{verbatim} - -The reason we hash all the root nodes is that the BLAKE2b hash above is -only calculable if you have all of the pieces of data required to -generate all the intermediate hashes. This is the crux of Dat's data -integrity guarantees. - -\paragraph{bitfield}\label{bitfield} - -A SLEEP formatted 32 byte header followed by a series of 3328 byte long -entries. - -\begin{verbatim} -<32 byte header> - <4 byte magic string: 0x05025700> - <1 byte version number: 0> - <2 byte entry size: 3328> - <1 byte algorithm name length: 0> - <1 byte algorithm name: 0> - <24 zeroes> -<3328 byte entries> // (2048 + 1024 + 256) -\end{verbatim} - -The bitfield describes which pieces of data you have, and which nodes in -the \texttt{tree} file have been written. This file exists as an index -of the \texttt{tree} and \texttt{data} to quickly figure out which -pieces of data you have or are missing. This file can be regenerated if -you delete it, so it is considered a materialized index. - -The \texttt{bitfield} file actually contains three bitfields of -different sizes. A bitfield (AKA bitmap) is defined as a set of bits -where each bit (0 or 1) represents if you have or do not have a piece of -data at that bit index. So if there is a dataset of 10 cat pictures, and -you have pictures 1, 3, and 5 but are missing the rest, your bitfield -would look like \texttt{1010100000}. - -Each entry contains three objects: - -\begin{itemize} -\tightlist -\item - Data Bitfield (1024 bytes) - 1 bit for for each data entry that you - have synced (1 for every entry in \texttt{data}). -\item - Tree Bitfield (2048 bytes) - 1 bit for every tree entry (all nodes in - \texttt{tree}) -\item - Bitfield Index (256 bytes) - This is an index of the Data Bitfield - that makes it efficient to figure out which pieces of data are missing - from the Data Bitfield without having to do a linear scan. -\end{itemize} - -The Data Bitfield is 1Kb somewhat arbitrarily, but the idea is that -because most filesystems work in 4Kb chunk sizes, we can fit the Data, -Tree and Index in less then 4Kb of data for efficient writes to the -filesystem. 
The Tree and Index sizes are based on the Data size (the -Tree has twice the entries as the Data, odd and even nodes vs just even -nodes in \texttt{tree}, and Index is always 1/4th the size). - -To generate the Index, you take pairs of 2 bytes at a time from the Data -Bitfield, check if all bits in the 2 bytes are the same, and generate 4 -bits of Index metadata~for every 2 bytes of Data (hence how 1024 bytes -of Data ends up as 256 bytes of Index). - -First you generate a 2 bit tuple for the 2 bytes of Data: - -\begin{verbatim} -if (data is all 1's) then [1,1] -if (data is all 0's) then [0,0] -if (data is not all the same) then [1, 0] -\end{verbatim} - -The Index itself is an in-order binary tree, not a traditional bitfield. -To generate the tree, you take the tuples you generate above and then -write them into a tree like the following example, where non-leaf nodes -are generated using the above scheme by looking at the results of the -relative even child tuples for each odd parent tuple: - -\begin{verbatim} -// for e.g. 16 bytes (8 tuples) of -// sparsely replicated data -0 - [00 00 00 00] -1 - [10 10 10 10] -2 - [11 11 11 11] -\end{verbatim} - -The tuples at entry \texttt{1} above are \texttt{{[}1,0{]}} because the -relative child tuples are not uniform. In the following example, all -non-leaf nodes are \texttt{{[}1,1{]}} because their relative children -are all uniform (\texttt{{[}1,1{]}}) - -\begin{verbatim} -// for e.g. 32 bytes (16 tuples) of -// fully replicated data (all 1's) -0 - [11 11 11 11] -1 - [11 11 11 11] -2 - [11 11 11 11] -3 - [11 11 11 11] -4 - [11 11 11 11] -5 - [11 11 11 11] -6 - [11 11 11 11] -\end{verbatim} - -Using this scheme, to represent 32 bytes of data it takes at most 8 -bytes of Index. In this example it compresses nicely as its all -contiguous ones on disk, similarly for an empty bitfield it would be all -zeroes. - -If you write 4GB of data using on average 64KB data chunk size, your -bitfield will be at most 32KB. - -\paragraph{metadata.data}\label{metadata.data} - -This file is used to store content described by the rest of the -\texttt{metadata.*} hypercore SLEEP files. Whereas the -\texttt{content.*} SLEEP files describe the data stored in the actual -data cloned in the Dat repository filesystem, the \texttt{metadata} data -feed is stored inside the \texttt{.dat} folder along with the rest of -the SLEEP files. - -The contents of this file is a series of versions of the Dat filesystem -tree. As this is a hypercore data feed, it's just an append only log of -binary data entries. The challenge is representing a tree in a -one-dimensional way to make it representable as a Hypercore register. -For example, imagine three files: - -\begin{verbatim} -~/dataset $ ls -figures - graph1.png - graph2.png -results.csv - -1 directory, 3 files -\end{verbatim} - -We want to take this structure and map it to a serialized representation -that gets written into an append only log in a way that still allows for -efficient random access by file path. 
- -To do this, we convert the filesystem metadata into entries in a feed -like this: - -\begin{verbatim} -{ - "path": "/results.csv", - trie: [[]], - sequence: 0 -} -{ - "path": "/figures/graph1.png", - trie: [[0], []], - sequence: 1 -} -{ - "path": "/figures/graph2.png", - trie: [[0], [1]], - sequence: 2 -} -\end{verbatim} - -\subparagraph{Filename Resolution}\label{filename-resolution} - -Each sequence represents adding one of the files to the register, so at -sequence 0 the filesystem state only has a single file, -\texttt{results.csv} in it. At sequence 1, there are only 2 files added -to the register, and at sequence 3 all files are finally added. The -\texttt{children} field represents a shorthand way of declaring which -other files at every level of the directory hierarchy exist alongside -the file being added at that revision. For example at the time of -sequence 1, children is \texttt{{[}{[}0{]},\ {[}{]}{]}}. The first -sub-array, \texttt{{[}0{]}}, represents the first folder in the -\texttt{path}, which is the root folder \texttt{/}. In this case -\texttt{{[}0{]}} means the root folder at this point in time only has a -single file, the file that is the subject of sequence \texttt{0}. The -second subarray is empty \texttt{{[}{]}} because there are no other -existing files in the second folder in the \texttt{path}, -\texttt{figures}. - -To look up a file by filename, you fetch the latest entry in the log, -then use the \texttt{children} metadata in that entry to look up the -longest common ancestor based on the parent folders of the filename you -are querying. You can then recursively repeat this operation until you -find the \texttt{path} entry you are looking for (or you exhaust all -options which means the file does not exist). This is a -\texttt{O(number\ of\ slashes\ in\ your\ path)} operation. - -For example, if you wanted to look up \texttt{/results.csv} given the -above register, you would start by grabbing the metadata at sequence 2. -The longest common ancestor between \texttt{/results.csv} and -\texttt{/figures/graph2} is \texttt{/}. You then grab the corresponding -entry in the children array for \texttt{/}, which in this case is the -first entry, \texttt{{[}0{]}}. You then repeat this with all of the -children entries until you find a child that is closer to the entry you -are looking for. In this example, the first entry happens to be the -match we are looking for. - -You can also perform lookups relative to a point in time by starting -from a specific sequence number in the register. For example to get the -state of some file relative to an old sequence number, similar to -checking out an old version of a repository in Git. - -\subparagraph{Data Serialization}\label{data-serialization} - -The format of the \texttt{metadata.data} file is as follows: - -\begin{verbatim} -
- - - - -\end{verbatim} - -Each entry in the file is encoded using Protocol Buffers (Varda 2008). - -The first message we write to the file is of a type called Header which -uses this schema: - -\begin{verbatim} -message Header { - required string type = 1; - optional bytes content = 2; -} -\end{verbatim} - -This is used to declare two pieces of metadata used by Dat. It includes -a \texttt{type} string with the value \texttt{hyperdrive} and -\texttt{content} binary value that holds the public key of the content -register that this metadata register represents. When you share a Dat, -the metadata key is the main key that gets used, and the content -register key is linked from here in the metadata. - -After the header the file will contain many filesystem \texttt{Node} -entries: - -\begin{verbatim} -message Node { - required string path = 1; - optional Stat value = 2; - optional bytes trie = 3; - repeated Writer writers = 4; - optional uint64 writersSequence = 5; -} - -message Writer { - required bytes publicKey = 1; - optional string permission = 2; -} -\end{verbatim} - -The \texttt{Node} object has five fields - -\begin{itemize} -\tightlist -\item - \texttt{path} - the string of the absolute file path of this file. -\item - \texttt{Stat} - a Stat encoded object representing the file metadata -\item - \texttt{trie} - a compressed list of the sequence numbers as described - earlier -\item - \texttt{writers} - a list of the writers who are allowed to write to - this dat -\item - \texttt{writersSequence} - a reference to the last sequence where the - writers array was modified. you can use this to quickly find the value - of the writers keys. -\end{itemize} - -The \texttt{trie} value is encoded by starting with the nested array of -sequence numbers, e.g. -\texttt{{[}{[}{[}0,\ 3{]}{]},\ {[}{[}0,\ 2{]},\ {[}0,\ 1{]}{]}{]}}. Each -entry is a tuple where the first item is the index of the feed in the -\texttt{writers} array and the second value is the sequence number. -Finally you prepend the trie value with a version number varint. - -To write these subarrays we use variable width integers (varints), using -a repeating pattern like this, one for each array: - -\begin{verbatim} - - - - - - -\end{verbatim} - -This encoding is designed for efficiency as it reduces the filesystem -path + feed index metadata down to a series of small integers. - -The \texttt{Stat} objects use this encoding: - -\begin{verbatim} -message Stat { - required uint32 mode = 1; - optional uint32 uid = 2; - optional uint32 gid = 3; - optional uint64 size = 4; - optional uint64 blocks = 5; - optional uint64 offset = 6; - optional uint64 byteOffset = 7; - optional uint64 mtime = 8; - optional uint64 ctime = 9; -} -\end{verbatim} - -These are the field definitions: - -\begin{itemize} -\tightlist -\item - \texttt{mode} - POSIX file mode bitmask -\item - \texttt{uid} - POSIX user id -\item - \texttt{gid} - POSIX group id -\item - \texttt{size} - file size in bytes -\item - \texttt{blocks} - number of data chunks that make up this file -\item - \texttt{offset} - the data feed entry index for the first chunk in - this file -\item - \texttt{byteOffset} - the data feed file byte offset for the first - chunk in this file -\item - \texttt{mtime} - POSIX modified\_at time -\item - \texttt{mtime} - POSIX created\_at time -\end{itemize} - -\hypertarget{refs}{} -\hypertarget{ref-varda2008protocol}{} -Varda, Kenton. 2008. 
``Protocol Buffers: Google's Data Interchange -Format.'' \emph{Google Open Source Blog, Available at Least as Early as -Jul}. - -\end{document} -- cgit v1.2.3