Diffstat (limited to 'papers/dat-paper.latex')
-rw-r--r-- | papers/dat-paper.latex | 1228 |
1 file changed, 1228 insertions, 0 deletions
diff --git a/papers/dat-paper.latex b/papers/dat-paper.latex
new file mode 100644
index 0000000..5d90b07
--- /dev/null
+++ b/papers/dat-paper.latex
@@ -0,0 +1,1228 @@
+\documentclass[a4paper,twocolumn]{article}
+\usepackage{lmodern}
+\usepackage{amssymb,amsmath}
+\usepackage{ifxetex,ifluatex}
+\usepackage{fixltx2e} % provides \textsubscript
+\ifnum 0\ifxetex 1\fi\ifluatex 1\fi=0 % if pdftex
+  \usepackage[T1]{fontenc}
+  \usepackage[utf8]{inputenc}
+\else % if luatex or xelatex
+  \ifxetex
+    \usepackage{mathspec}
+  \else
+    \usepackage{fontspec}
+  \fi
+  \defaultfontfeatures{Ligatures=TeX,Scale=MatchLowercase}
+\fi
+% use upquote if available, for straight quotes in verbatim environments
+\IfFileExists{upquote.sty}{\usepackage{upquote}}{}
+% use microtype if available
+\IfFileExists{microtype.sty}{%
+\usepackage{microtype}
+\UseMicrotypeSet[protrusion]{basicmath} % disable protrusion for tt fonts
+}{}
+\usepackage[unicode=true]{hyperref}
+\hypersetup{
+  pdftitle={Dat - Distributed Dataset Synchronization And Versioning},
+  pdfauthor={Maxwell Ogden, Karissa McKelvey, Mathias Buus Madsen, Code for Science},
+  pdfborder={0 0 0},
+  breaklinks=true}
+\urlstyle{same} % don't use monospace font for urls
+\IfFileExists{parskip.sty}{%
+\usepackage{parskip}
+}{% else
+\setlength{\parindent}{0pt}
+\setlength{\parskip}{6pt plus 2pt minus 1pt}
+}
+\setlength{\emergencystretch}{3em} % prevent overfull lines
+\providecommand{\tightlist}{%
+  \setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}}
+\setcounter{secnumdepth}{0}
+% Redefines (sub)paragraphs to behave more like sections
+\ifx\paragraph\undefined\else
+\let\oldparagraph\paragraph
+\renewcommand{\paragraph}[1]{\oldparagraph{#1}\mbox{}}
+\fi
+\ifx\subparagraph\undefined\else
+\let\oldsubparagraph\subparagraph
+\renewcommand{\subparagraph}[1]{\oldsubparagraph{#1}\mbox{}}
+\fi
+
+\title{Dat - Distributed Dataset Synchronization And Versioning}
+\author{Maxwell Ogden, Karissa McKelvey, Mathias Buus Madsen, Code for Science}
+\date{May 2017 (last updated: Jan 2018)}
+
+\begin{document}
+\maketitle
+
+\section{Abstract}\label{abstract}
+
+Dat is a protocol designed for syncing folders of data, even if they are
+large or changing constantly. Dat uses a cryptographically secure
+register of changes to prove that the requested data version is
+distributed. A byte range of any file's version can be efficiently
+streamed from a Dat repository over a network connection. Consumers can
+choose to fully or partially replicate the contents of a remote Dat
+repository, and can also subscribe to live changes. To ensure writer and
+reader privacy, Dat uses public key cryptography to encrypt network
+traffic. A group of Dat clients can connect to each other to form a
+public or private decentralized network to exchange data. A reference
+implementation is provided in JavaScript.
+
+\section{1. Background}\label{background}
+
+Many datasets are shared online today using HTTP and FTP, which lack
+built-in support for version control or content addressing of data. This
+results in link rot and content drift as files are moved, updated or
+deleted, leading to an alarming rate of disappearing data references in
+areas such as
+\href{http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0115253}{published
+scientific literature}.
+
+Cloud storage services like S3 ensure availability of data, but they
+have a centralized hub-and-spoke networking model and are therefore
+limited by their bandwidth, meaning popular files can become very
+expensive to share. Services like Dropbox and Google Drive provide
+version control and synchronization on top of cloud storage services,
+which fixes many issues with broken links, but they rely on proprietary
+code and services that require users to store their data on centralized
+cloud infrastructure. This has implications for cost, transfer speeds,
+vendor lock-in and user privacy.
+
+Distributed file sharing tools can become faster as files become more
+popular, removing the bandwidth bottleneck and making file distribution
+cheaper. They also use link resolution and discovery systems which can
+prevent broken links, meaning that if the original source goes offline,
+other backup sources can be discovered automatically. However, these
+file sharing tools today are not supported by Web browsers, do not have
+good privacy guarantees, and do not provide a mechanism for updating
+files without redistributing a new dataset, which could mean entirely
+re-downloading data you already have.
+
+\section{2. Dat}\label{dat}
+
+Dat is a dataset synchronization protocol that does not assume a dataset
+is static or that the entire dataset will be downloaded. The main
+reference implementation is available from npm as
+\texttt{npm\ install\ dat\ -g}.
+
+The protocol is agnostic to the underlying transport, e.g.~you could
+implement Dat over carrier pigeon. Data is stored in a format called
+SLEEP (Ogden and Buus 2017), described in its own paper. The key
+properties of the Dat design are explained in this section.
+
+\begin{itemize}
+\tightlist
+\item
+  2.1 \textbf{Content Integrity} - Data and publisher integrity is
+  verified through use of signed hashes of the content.
+\item
+  2.2 \textbf{Decentralized Mirroring} - Users sharing the same Dat
+  automatically discover each other and exchange data in a swarm.
+\item
+  2.3 \textbf{Network Privacy} - Dat provides certain privacy guarantees
+  including end-to-end encryption.
+\item
+  2.4 \textbf{Incremental Versioning} - Datasets can be efficiently
+  synced, even in real time, to other peers.
+\item
+  2.5 \textbf{Random Access} - Huge file hierarchies can be efficiently
+  traversed remotely.
+\end{itemize}
+
+\subsection{2.1 Content Integrity}\label{content-integrity}
+
+Content integrity means being able to verify that the data you received
+is the exact same version of the data that you expected. This is
+important in a distributed system as this mechanism will catch incorrect
+data sent by bad peers. It also has implications for reproducibility as
+it lets you refer to a specific version of a dataset.
+
+Link rot, when links online stop resolving, and content drift, when data
+changes but the link to the data remains the same, are two common issues
+in data analysis. For example, one day a file called data.zip might
+change, but a typical HTTP link to the file does not include a hash of
+the content, or provide a way to get updated metadata, so clients that
+only have the HTTP link have no way to check if the file changed without
+downloading the entire file again. Referring to a file by the hash of
+its content is called content addressability, and lets users not only
+verify that the data they receive is the version of the data they want,
+but also lets people cite specific versions of the data by referring to
+a specific hash.
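+
+To make content addressability concrete, the sketch below names a file
+by the BLAKE2b hash of its bytes. It illustrates the idea only: Dat
+itself addresses chunks within a signed Merkle tree (described below)
+rather than hashing whole files, and the \texttt{blake2b512} algorithm
+name assumes a Node.js build whose OpenSSL supports BLAKE2.
+
+\begin{verbatim}
+// Illustration: content addressing means naming
+// data by a hash of the data itself.
+var crypto = require('crypto')
+var fs = require('fs')
+
+var bytes = fs.readFileSync('data.zip')
+var address = crypto.createHash('blake2b512')
+  .update(bytes)
+  .digest('hex')
+
+// Anyone re-downloading data.zip can recompute the
+// hash; if it differs, the content has drifted.
+console.log(address)
+\end{verbatim}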
+
+Dat uses BLAKE2b (Aumasson et al. 2013) cryptographically secure hashes
+to address content. Hashes are arranged in a Merkle tree (Mykletun,
+Narasimha, and Tsudik 2003), a tree where each non-leaf node is the hash
+of all child nodes. Leaf nodes contain pieces of the dataset. Due to the
+properties of secure cryptographic hashes, the top hash can only be
+produced if all data below it matches exactly. If two trees have
+matching top hashes then you know that all other nodes in the tree must
+match as well, and you can conclude that your dataset is synchronized.
+Trees are chosen as the primary data structure in Dat as they have a
+number of properties that allow for efficient access to subsets of the
+metadata, which allows Dat to work efficiently over a network
+connection.
+
+\subsubsection{Dat Links}\label{dat-links}
+
+Dat links are Ed25519 (Bernstein et al. 2012) public keys which have a
+length of 32 bytes (64 characters when hex encoded). You can represent
+your Dat link in the following ways and Dat clients will be able to
+understand them:
+
+\begin{itemize}
+\tightlist
+\item
+  The standalone public key:
+\end{itemize}
+
+\texttt{8e1c7189b1b2dbb5c4ec2693787884771201da9...}
+
+\begin{itemize}
+\tightlist
+\item
+  Using the dat:// protocol:
+\end{itemize}
+
+\texttt{dat://8e1c7189b1b2dbb5c4ec2693787884771...}
+
+\begin{itemize}
+\tightlist
+\item
+  As part of an HTTP URL:
+\end{itemize}
+
+\texttt{https://datproject.org/8e1c7189b1b2dbb5...}
+
+All messages in the Dat protocol are encrypted and signed using the
+public key during transport. This means that unless you know the public
+key (e.g.~unless the Dat link was shared with you), you will not be
+able to discover or communicate with any member of the swarm for that
+Dat. Anyone with the public key can verify that messages (such as
+entries in a Dat Stream) were created by a holder of the private key.
+
+Every Dat repository has a corresponding private key which is kept in
+your home folder and never shared. Dat never exposes either the public
+or private key over the network. During the discovery phase the BLAKE2b
+hash of the public key is used as the discovery key. This means that the
+original key is impossible to discover (unless it was shared publicly
+through a separate channel) since only the hash of the key is exposed
+publicly.
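+
+A minimal sketch of this derivation, matching the keyed-hash
+construction given for the \texttt{discoveryKey} field of the
+\texttt{Feed} message in Section 3 (the \texttt{sodium-universal}
+module is what the reference implementation uses; treat the exact call
+as illustrative):
+
+\begin{verbatim}
+var sodium = require('sodium-universal')
+
+// BLAKE2b keyed hash: hash the string 'hypercore'
+// using the 32-byte public key as the key, so peers
+// can rendezvous without revealing the read key.
+function discoveryKey (publicKey) {
+  var out = Buffer.alloc(32)
+  sodium.crypto_generichash(
+    out, Buffer.from('hypercore'), publicKey)
+  return out
+}
+\end{verbatim}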
+
+Dat does not provide an authentication mechanism at this time. Instead
+it provides a capability system. Anyone with the Dat link is currently
+considered able to discover and access data. Do not share your Dat links
+publicly if you do not want them to be accessed.
+
+\subsubsection{Hypercore and Hyperdrive}\label{hypercore-and-hyperdrive}
+
+The Dat storage, content integrity, and networking protocols are
+implemented in a module called
+\href{https://npmjs.org/hypercore}{Hypercore}. Hypercore is agnostic to
+the format of the input data; it operates on any stream of binary data.
+For the Dat use case of synchronizing datasets we use a file system
+module on top of Hypercore called
+\href{https://npmjs.org/hyperdrive}{Hyperdrive}.
+
+Dat has a layered abstraction so that users can use Hypercore directly
+to have full control over how they model their data. Hyperdrive works
+well when your data can be represented as files on a filesystem, which
+is the main use case with Dat.
+
+\subsubsection{Hypercore Registers}\label{hypercore-registers}
+
+Hypercore Registers are the core mechanism used in Dat. They are binary
+append-only streams whose contents are cryptographically hashed and
+signed and therefore can be verified by anyone with access to the public
+key of the writer. They are an implementation of the concept known as a
+register, a digital ledger you can trust.
+
+Dat uses two registers, \texttt{content} and \texttt{metadata}. The
+\texttt{content} register contains the files in your repository and
+\texttt{metadata} contains the metadata about the files including name,
+size, last modified time, etc. Dat replicates them both when
+synchronizing with another peer.
+
+When files are added to Dat, each file gets split up into some number of
+chunks, and the chunks are then arranged into a Merkle tree, which is
+used later for version control and replication processes.
+
+\subsection{2.2 Decentralized Mirroring}\label{decentralized-mirroring}
+
+Dat is a peer to peer protocol designed to exchange pieces of a dataset
+amongst a swarm of peers. As soon as a peer acquires their first piece
+of data in the dataset they can choose to become a partial mirror for
+the dataset. If someone else contacts them and needs the piece they
+have, they can choose to share it. This can happen simultaneously while
+the peer is still downloading the pieces they want from others.
+
+\subsubsection{Source Discovery}\label{source-discovery}
+
+An important aspect of mirroring is source discovery, the techniques
+that peers use to find each other. Source discovery means finding the IP
+and port of data sources online that have a copy of the data you are
+looking for. You can then connect to them and begin exchanging data. By
+using source discovery techniques Dat is able to create a network where
+data can be discovered even if the original data source disappears.
+
+Source discovery can happen over many kinds of networks, as long as you
+can model the following actions:
+
+\begin{itemize}
+\tightlist
+\item
+  \texttt{join(key,\ {[}port{]})} - Begin performing regular lookups on
+  an interval for \texttt{key}. Specify \texttt{port} if you want to
+  announce that you share \texttt{key} as well.
+\item
+  \texttt{leave(key,\ {[}port{]})} - Stop looking for \texttt{key}.
+  Specify \texttt{port} to stop announcing that you share \texttt{key}
+  as well.
+\item
+  \texttt{foundpeer(key,\ ip,\ port)} - Called when a peer is found by a
+  lookup.
+\end{itemize}
+
+In the Dat implementation we implement the above actions on top of three
+types of discovery networks:
+
+\begin{itemize}
+\tightlist
+\item
+  DNS name servers - An Internet standard mechanism for resolving keys
+  to addresses
+\item
+  Multicast DNS - Useful for discovering peers on local networks
+\item
+  Kademlia Mainline Distributed Hash Table - Less central points of
+  failure, increases probability of Dat working even if DNS servers are
+  unreachable
+\end{itemize}
+
+Additional discovery networks can be implemented as needed. We chose the
+above three as a starting point to have a complementary mix of
+strategies to increase the probability of source discovery. Additionally
+you can specify a Dat via HTTPS link, which runs the Dat protocol in
+``single-source'' mode, meaning the above discovery networks are not
+used, and instead only that one HTTPS server is used as the only peer.
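+
+As a concrete sketch of this interface, here is roughly how the
+\texttt{discovery-channel} module from the reference implementation
+(see Section 5) is driven; the event and argument names follow its
+documentation at the time of writing and are illustrative rather than
+normative:
+
+\begin{verbatim}
+var discovery = require('discovery-channel')()
+
+// The discovery key from Section 2.1 stands in
+// for key here.
+var key = Buffer.alloc(32) // placeholder
+
+// join(key, [port]): look up key on an interval
+// and announce that we share it on port 3282.
+discovery.join(key, 3282)
+
+// foundpeer(key, ip, port) surfaces as a 'peer'
+// event in this module.
+discovery.on('peer', function (id, peer, type) {
+  console.log('peer', peer.host, peer.port, type)
+})
+
+// leave(key, [port]): stop lookups and announcing.
+// discovery.leave(key, 3282)
+\end{verbatim}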
+
+\subsubsection{Peer Connections}\label{peer-connections}
+
+After the discovery phase, Dat should have a list of potential data
+sources to try and contact. Dat uses either TCP, HTTP or
+\href{https://en.wikipedia.org/wiki/Micro_Transport_Protocol}{UTP}
+(Rossi et al. 2010). UTP uses LEDBAT, which is designed to not take up
+all available bandwidth on a network (e.g.~so that other people sharing
+WiFi can still use the Internet), and is still based on UDP so it works
+with NAT traversal techniques like UDP hole punching. HTTP is supported
+for compatibility with static file servers and web browser clients. Note
+that these are the protocols we support in the reference Dat
+implementation, but the Dat protocol itself is transport agnostic.
+
+If an HTTP source is specified Dat will prefer that one over other
+sources. Otherwise when Dat gets the IP and port for a potential TCP or
+UTP source it tries to connect using both protocols. If one connects
+first, Dat aborts the other one. If neither connects, Dat will try again
+until it decides that the source is offline or unavailable and then
+stops trying to connect to it. Sources Dat is able to connect to go into
+a list of known good sources, so that if/when the Internet connection
+goes down Dat can use that list to reconnect to known good sources again
+quickly.
+
+If Dat gets a lot of potential sources it picks a handful at random to
+try and connect to and keeps the rest around as additional sources to
+use later in case it decides it needs more sources.
+
+Once a duplex binary connection to a remote source is open Dat then
+layers on the Hypercore protocol, a message-based replication protocol
+that allows two peers to communicate over a stateless channel to request
+and exchange data. You open separate replication channels with many
+peers at once, which allows clients to parallelize data requests across
+the entire pool of peers they have established connections with.
+
+\subsection{2.3 Network Privacy}\label{network-privacy}
+
+On the Web today, with SSL, there is a guarantee that the traffic
+between your computer and the server is private. As long as you trust
+the server to not leak your logs, attackers who intercept your network
+traffic will not be able to read the HTTP traffic exchanged between you
+and the server. This is a fairly straightforward model as clients only
+have to trust a single server for some domain.
+
+There is an inherent tradeoff in peer to peer systems between source
+discovery and user privacy. The more sources you contact and ask for
+some data, the more sources you trust to keep what you asked for
+private. Our goal is to have Dat be configurable with respect to this
+tradeoff, to allow application developers to meet their own privacy
+guidelines.
+
+It is up to client programs to make design decisions around which
+discovery networks they trust. For example if a Dat client decides to
+use the BitTorrent DHT to discover peers, and they are searching for a
+publicly shared Dat key (e.g.~a key cited publicly in a published
+scientific paper) with known contents, then because of the privacy
+design of the BitTorrent DHT it becomes public knowledge what key that
+client is searching for.
+
+A client could choose to only use discovery networks with certain
+privacy guarantees. For example a client could only connect to an
+approved list of sources that they trust, similar to SSL. As long as
+they trust each source, the encryption built into the Dat network
+protocol will prevent the Dat key they are looking for from being
+leaked.
+
+\subsection{2.4 Incremental Versioning}\label{incremental-versioning}
+
+Given a stream of binary data, Dat splits the stream into chunks, hashes
+each chunk, and arranges the hashes in a specific type of Merkle tree
+that allows for certain replication properties.
+
+Dat is also able to fully or partially synchronize streams in a
+distributed setting even if the stream is being appended to. This is
+accomplished by using the messaging protocol to traverse the Merkle tree
+of remote sources and fetch a strategic set of nodes. Due to the
+low-level, message-oriented design of the replication protocol,
+different node traversal strategies can be implemented.
+
+There are two types of versioning performed automatically by Dat.
+Metadata is stored in a folder called \texttt{.dat} in the root folder
+of a repository, and data is stored as normal files in the root folder.
+
+\subsubsection{Metadata Versioning}\label{metadata-versioning}
+
+Dat tries as much as possible to act as a one-to-one mirror of the state
+of a folder and all its contents. When importing files, Dat uses a
+sorted, depth-first recursion to list all the files in the tree. For
+each file it finds, it grabs the filesystem metadata (filename, Stat
+object, etc.) and checks if there is already an entry for this filename
+with this exact metadata already represented in the Dat repository
+metadata. If the file with this metadata matches exactly the newest
+version of the file metadata stored in Dat, then this file will be
+skipped (no change).
+
+If the metadata differs from the current existing one (or there are no
+entries for this filename at all in the history), then this new metadata
+entry will be appended as the new `latest' version for this file in the
+append-only SLEEP metadata content register (described below).
+
+\subsubsection{Content Versioning}\label{content-versioning}
+
+In addition to storing a historical record of filesystem metadata, the
+contents of the files themselves can also be stored in a version
+controlled manner. The default storage system used in Dat stores the
+files as files. This has the advantage of being very straightforward
+for users to understand, but the downside of not storing old versions of
+content by default.
+
+In contrast to other version control systems like Git, Dat by default
+only stores the current set of checked out files on disk in the
+repository folder, not old versions. It does store all previous metadata
+for old versions in \texttt{.dat}. Git for example stores all previous
+content versions and all previous metadata versions in the \texttt{.git}
+folder. Because Dat is designed for larger datasets, if it stored all
+previous file versions in \texttt{.dat}, then the \texttt{.dat} folder
+could easily fill up the user's hard drive inadvertently. Therefore Dat
+has multiple storage modes based on usage.
+
+Hypercore registers include an optional \texttt{data} file that stores
+all chunks of data. In Dat, only the \texttt{metadata.data} file is
+used; the \texttt{content.data} file is not. The default behavior is to
+store the current files only as normal files. If you want to run an
+`archival' node that keeps all previous versions, you can configure Dat
+to use the \texttt{content.data} file instead. For example, on a shared
+server with lots of storage you probably want to store all versions.
+However on a workstation machine that is only accessing a subset of one
+version, the default mode of storing all metadata plus the current set
+of downloaded files is acceptable, because you know the server has the
+full history.
+
+\subsubsection{Merkle Trees}\label{merkle-trees}
+
+Registers in Dat use a specific method of encoding a Merkle tree where
+hashes are positioned by a scheme called binary in-order interval
+numbering or just ``bin'' numbering. This is just a specific,
+deterministic way of laying out the nodes in a tree. For example a tree
+with 7 nodes will always be arranged like this:
+
+\begin{verbatim}
+0
+  1
+2
+    3
+4
+  5
+6
+\end{verbatim}
+
+In Dat, the hashes of the chunks of files are always even numbers, at
+the wide end of the tree. So the above tree has four original values
+that become the even numbers:
+
+\begin{verbatim}
+chunk0 -> 0
+chunk1 -> 2
+chunk2 -> 4
+chunk3 -> 6
+\end{verbatim}
+
+In the resulting Merkle tree, the even and odd nodes store different
+information:
+
+\begin{itemize}
+\tightlist
+\item
+  Evens - List of data hashes {[}chunk0, chunk1, chunk2, \ldots{}{]}
+\item
+  Odds - List of Merkle hashes (hashes of child even nodes) {[}hash0,
+  hash1, hash2, \ldots{}{]}
+\end{itemize}
+
+These two lists get interleaved into a single register such that the
+indexes (position) in the register are the same as the bin numbers from
+the Merkle tree.
+
+All odd hashes are derived by hashing the two child nodes, e.g.~given
+hash0 is \texttt{hash(chunk0)} and hash2 is \texttt{hash(chunk1)}, hash1
+is \texttt{hash(hash0\ +\ hash2)}.
+
+For example a register with two data entries would look something like
+this (pseudocode):
+
+\begin{verbatim}
+0. hash(chunk0)
+1. hash(hash(chunk0) + hash(chunk1))
+2. hash(chunk1)
+\end{verbatim}
+
+It is possible for the in-order Merkle tree to have multiple roots at
+once. A root is defined as a parent node with a full set of child node
+slots filled below it.
+
+For example, this tree has 2 roots (1 and 4):
+
+\begin{verbatim}
+0
+  1
+2
+
+4
+\end{verbatim}
+
+This tree has one root (3):
+
+\begin{verbatim}
+0
+  1
+2
+    3
+4
+  5
+6
+\end{verbatim}
+
+This one has one root (1):
+
+\begin{verbatim}
+0
+  1
+2
+\end{verbatim}
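+
+Navigating this layout is pure arithmetic. The sketch below is a
+simplified version of what the reference implementation's
+\texttt{flat-tree} module provides, computing a node's depth, parent
+and children directly from its bin number:
+
+\begin{verbatim}
+// Depth = number of trailing 1 bits in the index;
+// leaves are even numbers and sit at depth 0.
+function depth (i) {
+  var d = 0
+  while (i & 1) { i >>= 1; d++ }
+  return d
+}
+
+// The node at depth d with left-to-right offset o.
+function index (d, o) {
+  return (2 * o + 1) * Math.pow(2, d) - 1
+}
+
+function offset (i) { return i >> (depth(i) + 1) }
+
+function parent (i) {
+  return index(depth(i) + 1, offset(i) >> 1)
+}
+
+// Only defined for odd (non-leaf) nodes.
+function children (i) {
+  var d = depth(i)
+  var o = offset(i)
+  return [index(d - 1, 2 * o),
+          index(d - 1, 2 * o + 1)]
+}
+
+parent(0)   // => 1
+parent(1)   // => 3
+children(3) // => [1, 5]
+\end{verbatim}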
+
+\subsubsection{Replication Example}\label{replication-example}
+
+This section describes in high level the replication flow of a Dat. Note
+that the low level details are available by reading the SLEEP section
+below. For the sake of illustrating how this works in practice in a
+networked replication scenario, consider a folder with two files:
+
+\begin{verbatim}
+bat.jpg
+cat.jpg
+\end{verbatim}
+
+To send these files to another machine using Dat, you would first add
+them to a Dat repository by splitting them into chunks and constructing
+SLEEP files representing the chunks and filesystem metadata.
+
+Let's assume \texttt{bat.jpg} and \texttt{cat.jpg} both produce three
+chunks, each around 64KB. Dat stores these in a representation called
+SLEEP, but here we will show a pseudo-representation for the purposes of
+illustrating the replication process. The six chunks get sorted into a
+list like this:
+
+\begin{verbatim}
+bat-1
+bat-2
+bat-3
+cat-1
+cat-2
+cat-3
+\end{verbatim}
+
+These chunks then each get hashed, and the hashes get arranged into a
+Merkle tree (the content register):
+
+\begin{verbatim}
+0  - hash(bat-1)
+  1  - hash(0 + 2)
+2  - hash(bat-2)
+    3  - hash(1 + 5)
+4  - hash(bat-3)
+  5  - hash(4 + 6)
+6  - hash(cat-1)
+8  - hash(cat-2)
+  9  - hash(8 + 10)
+10 - hash(cat-3)
+\end{verbatim}
+
+Next we calculate the root hashes of our tree, in this case 3 and 9. We
+then hash them together, and cryptographically sign the hash. This
+signed hash can now be used to verify all nodes in the tree, and the
+signature proves it was produced by us, the holder of the private key
+for this Dat.
+
+This tree is for the hashes of the contents of the photos. There is also
+a second Merkle tree that Dat generates that represents the list of
+files and their metadata and looks something like this (the metadata
+register):
+
+\begin{verbatim}
+0 - hash({contentRegister: '9e29d624...'})
+  1 - hash(0 + 2)
+2 - hash({"bat.jpg", first: 0, length: 3})
+4 - hash({"cat.jpg", first: 3, length: 3})
+\end{verbatim}
+
+The first entry in this feed is a special metadata entry that tells Dat
+the address of the second feed (the content register). Note that node 3
+is not included yet, because 3 is the hash of \texttt{1\ +\ 5}, but 5
+does not exist yet, so it will be written at a later update.
+
+Now we're ready to send our metadata to the other peer. The first
+message is a \texttt{Register} message (called \texttt{Feed} in the wire
+protocol in Section 3) with the key that was shared for this Dat. Let's
+call ourselves Alice and the other peer Bob. Alice sends Bob a
+\texttt{Want} message that declares they want all nodes in the file list
+(the metadata register). Bob replies with a single \texttt{Have} message
+that indicates he has 3 nodes of data. Alice sends three
+\texttt{Request} messages, one for each leaf node
+(\texttt{0,\ 2,\ 4}). Bob sends back three \texttt{Data} messages. The
+first \texttt{Data} message contains the content register key, the hash
+of the sibling, in this case node \texttt{2}, the hash of the uncle root
+\texttt{4}, as well as a signature for the root hashes (in this case
+\texttt{1,\ 4}). Alice verifies the integrity of this first
+\texttt{Data} message by hashing the metadata received for the content
+register entry to produce the hash for node \texttt{0}. They then hash
+hash \texttt{0} with hash \texttt{2} to reproduce hash \texttt{1}, hash
+their \texttt{1} with the received value for \texttt{4}, and check the
+result against the received signature to verify it is the same data.
+When the next \texttt{Data} message is received, a similar process is
+performed to verify the content.
+
+Now Alice has the full list of files in the Dat, but decides they only
+want to download \texttt{cat.jpg}. Alice knows they want chunks 3
+through 5 from the content register. First Alice sends another
+\texttt{Register} message with the content key to open a new replication
+channel over the connection. Then Alice sends three \texttt{Request}
+messages, one for each of chunks \texttt{3,\ 4,\ 5}. Bob sends back
+three \texttt{Data} messages with the data for each chunk, as well as
+the hashes needed to verify the content in a way similar to the process
+described above for the metadata feed.
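+
+The sketch below retraces the register construction from this example:
+hash the six chunks into the even slots, derive the odd parent hashes,
+and sign the combined roots. It uses Node's built-in
+\texttt{blake2b512} and Ed25519 support purely for illustration; the
+reference implementation hashes with BLAKE2b-256 via libsodium and also
+mixes node sizes into each hash, which is omitted here.
+
+\begin{verbatim}
+var crypto = require('crypto')
+
+function hash (buf) {
+  return crypto.createHash('blake2b512')
+    .update(buf).digest()
+}
+
+var chunks = ['bat-1', 'bat-2', 'bat-3',
+              'cat-1', 'cat-2', 'cat-3']
+var tree = {}
+
+// Even slots: hashes of the chunks themselves.
+chunks.forEach(function (c, i) {
+  tree[i * 2] = hash(Buffer.from(c))
+})
+
+// Odd slots: parents, hashing left + right child.
+tree[1] = hash(Buffer.concat([tree[0], tree[2]]))
+tree[5] = hash(Buffer.concat([tree[4], tree[6]]))
+tree[3] = hash(Buffer.concat([tree[1], tree[5]]))
+tree[9] = hash(Buffer.concat([tree[8], tree[10]]))
+
+// Hash the roots (3 and 9) together and sign them.
+var keys = crypto.generateKeyPairSync('ed25519')
+var signature = crypto.sign(null,
+  hash(Buffer.concat([tree[3], tree[9]])),
+  keys.privateKey)
+\end{verbatim}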
+
+\subsection{2.5 Random Access}\label{random-access}
+
+Dat pursues the following access capabilities:
+
+\begin{itemize}
+\tightlist
+\item
+  Support large file hierarchies (millions of files in a single
+  repository).
+\item
+  Support efficient traversal of the hierarchy (listing files in
+  arbitrary folders efficiently).
+\item
+  Store all changes to all files (metadata and/or content).
+\item
+  List all changes made to any single file.
+\item
+  View the state of all files relative to any point in time.
+\item
+  Subscribe live to all changes (any file).
+\item
+  Subscribe live to changes to files under a specific path.
+\item
+  Efficiently access any byte range of any version of any file.
+\item
+  Allow all of the above to happen remotely, only syncing the minimum
+  metadata necessary to perform any action.
+\item
+  Allow efficient comparison of remote and local repository state to
+  request missing pieces during synchronization.
+\item
+  Allow entire remote archive to be synchronized, or just some subset of
+  files and/or versions.
+\end{itemize}
+
+Dat accomplishes these by storing all changes in Hypercore feeds, and by
+using strategic metadata indexing strategies that allow certain queries
+to be performed efficiently by traversing the feeds. The protocol itself
+is specified in Section 3 (SLEEP), but a scenario based summary follows
+here.
+
+\subsubsection{Scenario: Reading a file from a specific byte
+offset}\label{scenario-reading-a-file-from-a-specific-byte-offset}
+
+Alice has a dataset in Dat. Bob wants to access a 100MB CSV called
+\texttt{cat\_dna.csv} stored in the remote repository, but only wants to
+access the 10MB range of the CSV spanning from 30MB to 40MB.
+
+Bob has never communicated with Alice before, and is starting fresh with
+no knowledge of this Dat repository other than that he knows he wants
+\texttt{cat\_dna.csv} at a specific offset.
+
+First, Bob asks Alice through the Dat protocol for the metadata he needs
+to resolve \texttt{cat\_dna.csv} to the correct metadata feed entry that
+represents the file he wants. Note: In this scenario we assume Bob wants
+the latest version of \texttt{cat\_dna.csv}. It is also possible to do
+this for a specific older version.
+
+Bob first sends a \texttt{Request} message for the latest entry in the
+metadata feed. Alice responds. Bob looks at the \texttt{trie} value, and
+using the lookup algorithm described below sends another
+\texttt{Request} message for the metadata node that is closer to the
+filename he is looking for. This repeats until Alice sends Bob the
+matching metadata entry. This is the unoptimized resolution, which uses
+\texttt{log(n)} round trips, though there are ways to optimize it by
+having Alice send additional sequence numbers to Bob that help him
+traverse in fewer round trips.
+
+In the metadata record Bob received for \texttt{cat\_dna.csv} there is
+the byte offset to the beginning of the file in the data feed. Bob adds
+his +30MB offset to this value and starts requesting pieces of data
+starting at that byte offset using the SLEEP protocol as described
+below.
+
+This method allows any byte range of any file to be accessed without the
+need to synchronize the full metadata for all files up front.
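+
+In the reference implementation this scenario corresponds roughly to a
+ranged read stream over a sparse Hyperdrive archive. A minimal sketch,
+assuming Hyperdrive's documented \texttt{createReadStream} options and
+with \texttt{aliceKey} standing in for the Dat link's public key:
+
+\begin{verbatim}
+var hyperdrive = require('hyperdrive')
+
+// Sparse mode: download nothing until asked.
+var archive = hyperdrive('./cache', aliceKey,
+  { sparse: true })
+// (Networking omitted: pipe archive.replicate()
+// to a connected peer.)
+
+archive.on('ready', function () {
+  // Bob reads only bytes 30MB-40MB of the CSV.
+  var stream = archive.createReadStream(
+    'cat_dna.csv', {
+      start: 30 * 1024 * 1024,
+      end: 40 * 1024 * 1024
+    })
+  stream.on('data', function (chunk) {
+    // consume the range
+  })
+})
+\end{verbatim}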
+
+\section{3. Dat Network Protocol}\label{dat-network-protocol}
+
+The SLEEP format is designed to allow for sparse replication, meaning
+you can efficiently download only the metadata and data required to
+resolve a single byte region of a single file, which makes Dat suitable
+for a wide variety of streaming, real time and large dataset use cases.
+
+To take advantage of this, Dat includes a network protocol. It is
+message-based and stateless, making it possible to implement on a
+variety of network transport protocols including UDP and TCP. Both
+metadata and content registers in SLEEP share the exact same replication
+protocol.
+
+Individual messages are encoded using Protocol Buffers. There are ten
+message types, each described below with its schema.
+
+\subsubsection{Wire Protocol}\label{wire-protocol}
+
+Over the wire, messages are packed in the following lightweight
+container format:
+
+\begin{verbatim}
+<varint - length of rest of message>
+  <varint - header>
+  <message>
+\end{verbatim}
+
+The \texttt{header} value is a single varint that has two pieces of
+information: the integer \texttt{type} that declares a 4-bit message
+type (used below), and a channel identifier, \texttt{0} for metadata and
+\texttt{1} for content.
+
+To generate this varint, you bitshift the 4-bit type integer onto the
+end of the channel identifier, e.g.
+\texttt{channel\ \textless{}\textless{}\ 4\ \textbar{}\ \textless{}4-bit-type\textgreater{}}.
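+
+As a sketch, framing a \texttt{Have} message (type 3, defined below) on
+the content channel (channel 1) looks like this; \texttt{encodeHave}
+stands in for a hypothetical Protocol Buffers encoder:
+
+\begin{verbatim}
+// Minimal varint encoder: least-significant 7 bits
+// first, high bit set on all but the final byte.
+function varint (n) {
+  var bytes = []
+  while (n > 127) {
+    bytes.push((n & 127) | 128)
+    n >>>= 7
+  }
+  bytes.push(n)
+  return Buffer.from(bytes)
+}
+
+// header = channel << 4 | type
+var header = 1 << 4 | 3 // 19: Have, content channel
+
+var body = encodeHave({ start: 0, length: 3 })
+var rest = Buffer.concat([varint(header), body])
+var frame = Buffer.concat([varint(rest.length), rest])
+\end{verbatim}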
+
+\subsubsection{Feed}\label{feed}
+
+Type 0. Should be the first message sent on a channel.
+
+\begin{itemize}
+\tightlist
+\item
+  \texttt{discoveryKey} - A BLAKE2b keyed hash of the string `hypercore'
+  using the public key of the metadata register as the key.
+\item
+  \texttt{nonce} - 24 bytes (192 bits) of random binary data, used in
+  our encryption scheme
+\end{itemize}
+
+\begin{verbatim}
+message Feed {
+  required bytes discoveryKey = 1;
+  optional bytes nonce = 2;
+}
+\end{verbatim}
+
+\subsubsection{Handshake}\label{handshake}
+
+Type 1. Overall connection handshake. Should be sent just after the feed
+message on the first channel only (metadata).
+
+\begin{itemize}
+\tightlist
+\item
+  \texttt{id} - 32 bytes of random data used as an identifier for this
+  peer on the network, useful for checking if you are connected to
+  yourself or another peer more than once
+\item
+  \texttt{live} - Whether or not you want to operate in live
+  (continuous) replication mode or end after the initial sync
+\item
+  \texttt{userData} - User-specific metadata encoded as a byte sequence
+\item
+  \texttt{extensions} - List of extensions that are supported on this
+  Feed
+\end{itemize}
+
+\begin{verbatim}
+message Handshake {
+  optional bytes id = 1;
+  optional bool live = 2;
+  optional bytes userData = 3;
+  repeated string extensions = 4;
+}
+\end{verbatim}
+
+\subsubsection{Info}\label{info}
+
+Type 2. Message indicating state changes. Used to indicate whether you
+are uploading and/or downloading.
+
+Initial state for uploading/downloading is true. If both ends are not
+downloading and not live it is safe to consider the stream ended.
+
+\begin{verbatim}
+message Info {
+  optional bool uploading = 1;
+  optional bool downloading = 2;
+}
+\end{verbatim}
+
+\subsubsection{Have}\label{have}
+
+Type 3. How you tell the other peer what chunks of data you have or
+don't have. You should only send Have messages to peers who have
+expressed interest in this region with Want messages.
+
+\begin{itemize}
+\tightlist
+\item
+  \texttt{start} - If you only specify \texttt{start}, it means you are
+  telling the other side you only have 1 chunk at the position at the
+  value in \texttt{start}.
+\item
+  \texttt{length} - If you specify length, you can describe a range of
+  values that you have all of, starting from \texttt{start}.
+\item
+  \texttt{bitfield} - Use this if you would like to send a range of
+  sparse data about haves/don't haves via bitfield, relative to
+  \texttt{start}.
+\end{itemize}
+
+\begin{verbatim}
+message Have {
+  required uint64 start = 1;
+  optional uint64 length = 2 [default = 1];
+  optional bytes bitfield = 3;
+}
+\end{verbatim}
+
+When sending bitfields you must run-length encode them. The encoded
+bitfield is a series of compressed and uncompressed bit sequences. All
+sequences start with a header that is a varint.
+
+If the last bit is set in the varint (it is an odd number) then a header
+represents a compressed bit sequence.
+
+\begin{verbatim}
+compressed-sequence = varint(
+  byte-length-of-sequence
+  << 2 | bit << 1 | 1
+)
+\end{verbatim}
+
+If the last bit is \emph{not} set then a header represents a
+non-compressed sequence.
+
+\begin{verbatim}
+uncompressed-sequence = varint(
+  byte-length-of-bitfield << 1 | 0
+) + (bitfield)
+\end{verbatim}
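+
+For example, a peer holding the first 1000 chunks contiguously can
+describe them as a single compressed sequence. A sketch of encoding one
+run of identical bits, reusing the \texttt{varint} helper from the wire
+protocol sketch above:
+
+\begin{verbatim}
+// One run of identical bits (all 1s or all 0s).
+function compressedSequence (byteLength, bit) {
+  return varint(byteLength << 2 | bit << 1 | 1)
+}
+
+// 1000 chunks -> 125 bytes of bitfield, all ones:
+// 125 << 2 | 1 << 1 | 1 = 503, a two-byte varint.
+var run = compressedSequence(1000 / 8, 1)
+\end{verbatim}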
+
+\subsubsection{Unhave}\label{unhave}
+
+Type 4. How you communicate that you deleted or removed a chunk you used
+to have.
+
+\begin{verbatim}
+message Unhave {
+  required uint64 start = 1;
+  optional uint64 length = 2 [default = 1];
+}
+\end{verbatim}
+
+\subsubsection{Want}\label{want}
+
+Type 5. How you ask the other peer to subscribe you to Have messages for
+a region of chunks. The \texttt{length} value defaults to Infinity or
+feed.length (if not live).
+
+\begin{verbatim}
+message Want {
+  required uint64 start = 1;
+  optional uint64 length = 2;
+}
+\end{verbatim}
+
+\subsubsection{Unwant}\label{unwant}
+
+Type 6. How you ask to unsubscribe from Have messages for a region of
+chunks from the other peer. You should only Unwant previously Wanted
+regions, but if you do Unwant something that hasn't been Wanted it won't
+have any effect. The \texttt{length} value defaults to Infinity or
+feed.length (if not live).
+
+\begin{verbatim}
+message Unwant {
+  required uint64 start = 1;
+  optional uint64 length = 2;
+}
+\end{verbatim}
+
+\subsubsection{Request}\label{request}
+
+Type 7. Request a single chunk of data.
+
+\begin{itemize}
+\tightlist
+\item
+  \texttt{index} - The chunk index for the chunk you want. You should
+  only ask for indexes that you have received Have messages for.
+\item
+  \texttt{bytes} - You can also optimistically specify a byte offset,
+  and in case the remote is able to resolve the chunk for this byte
+  offset depending on their Merkle tree state, they will ignore the
+  \texttt{index} and send the chunk that resolves for this byte offset
+  instead. But if they cannot resolve the byte request, \texttt{index}
+  will be used.
+\item
+  \texttt{hash} - If you only want the hash of the chunk and not the
+  chunk data itself.
+\item
+  \texttt{nodes} - A 64 bit long bitfield representing which parent
+  nodes you have.
+\end{itemize}
+
+The \texttt{nodes} bitfield is an optional optimization to reduce the
+amount of duplicate nodes exchanged during the replication lifecycle. It
+indicates which parents you have or don't have. You have a maximum of 64
+parents you can specify. Because \texttt{uint64} in Protocol Buffers is
+implemented as a varint, over the wire this does not take up 64 bits in
+most cases. The first bit is reserved to signify whether or not you need
+a signature in response. The rest of the bits represent whether or not
+you have (\texttt{1}) or don't have (\texttt{0}) the information at this
+node already. The ordering is determined by walking parent, sibling, up
+the tree all the way to the root.
+
+\begin{verbatim}
+message Request {
+  required uint64 index = 1;
+  optional uint64 bytes = 2;
+  optional bool hash = 3;
+  optional uint64 nodes = 4;
+}
+\end{verbatim}
+
+\subsubsection{Cancel}\label{cancel}
+
+Type 8. Cancel a previous Request message whose data you haven't
+received yet.
+
+\begin{verbatim}
+message Cancel {
+  required uint64 index = 1;
+  optional uint64 bytes = 2;
+  optional bool hash = 3;
+}
+\end{verbatim}
+
+\subsubsection{Data}\label{data}
+
+Type 9. Sends a single chunk of data to the other peer. You can send it
+in response to a Request or unsolicited on its own as a friendly gift.
+The data includes all of the Merkle tree parent nodes needed to verify
+the hash chain all the way up to the Merkle roots for this chunk.
+Because you can produce the direct parents by hashing the chunk, only
+the roots and `uncle' hashes are included (the siblings to all of the
+parent nodes).
+
+\begin{itemize}
+\tightlist
+\item
+  \texttt{index} - The chunk position for this chunk.
+\item
+  \texttt{value} - The chunk binary data. Empty if you are sending only
+  the hash.
+\item
+  \texttt{Node.index} - The index for this chunk in in-order notation
+\item
+  \texttt{Node.hash} - The hash of this chunk
+\item
+  \texttt{Node.size} - The aggregate chunk size for all children below
+  this node (the sum of all chunk sizes of all children)
+\item
+  \texttt{signature} - If you are sending a root node, all root nodes
+  must have the signature included.
+\end{itemize}
+
+\begin{verbatim}
+message Data {
+  required uint64 index = 1;
+  optional bytes value = 2;
+  repeated Node nodes = 3;
+  optional bytes signature = 4;
+
+  message Node {
+    required uint64 index = 1;
+    required bytes hash = 2;
+    required uint64 size = 3;
+  }
+}
+\end{verbatim}
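+
+To make the verification side of a \texttt{Data} message concrete, the
+sketch below recomputes a root from a chunk plus its uncle hashes, then
+checks the signature, mirroring the walk described in the replication
+example of Section 2.4. The \texttt{hash} and \texttt{parent} helpers
+are the ones sketched earlier; real implementations also fold node
+sizes into the hashes and handle multiple roots, omitted here.
+
+\begin{verbatim}
+function verify (msg, publicKey) {
+  var i = msg.index * 2 // chunk -> bin number
+  var h = hash(msg.value)
+  msg.nodes.forEach(function (uncle) {
+    // Order children left-to-right before
+    // hashing the parent.
+    h = uncle.index < i
+      ? hash(Buffer.concat([uncle.hash, h]))
+      : hash(Buffer.concat([h, uncle.hash]))
+    i = parent(i)
+  })
+  // h should now be the signed (combined) root.
+  return crypto.verify(null, h, publicKey,
+    msg.signature)
+}
+\end{verbatim}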
+
+\section{4. Existing Work}\label{existing-work}
+
+Dat is inspired by a number of features from existing systems.
+
+\subsection{Git}\label{git}
+
+Git popularized the idea of a directed acyclic graph (DAG) combined with
+a Merkle tree, a way to represent changes to data where each change is
+addressed by the secure hash of the change plus all ancestor hashes in a
+graph. This provides a way to trust data integrity, as the only way a
+specific hash could be derived by another peer is if they have the same
+data and change history required to reproduce that hash. This is
+important for reproducibility as it lets you trust that a specific git
+commit hash refers to a specific source code state.
+
+Decentralized version control tools for source code like Git provide a
+protocol for efficiently downloading changes to a set of files, but are
+optimized for text files and have issues with large files. Solutions
+like Git-LFS solve this by using HTTP to download large files, rather
+than the Git protocol. GitHub offers Git-LFS hosting but charges
+repository owners for bandwidth on popular files. Building a distributed
+distribution layer for files in a Git repository is difficult due to the
+design of Git Packfiles, which are delta compressed repository states
+that do not easily support random access to byte ranges in previous file
+versions.
+
+\subsection{BitTorrent}\label{bittorrent}
+
+BitTorrent implements a swarm based file sharing protocol for static
+datasets. Data is split into fixed sized chunks, hashed, and then that
+hash is used to discover peers that have the same data. An advantage of
+using BitTorrent for dataset transfers is that download bandwidth can be
+fully saturated. Since the file is split into pieces, and peers can
+efficiently discover which pieces each of the peers they are connected
+to have, it means one peer can download non-overlapping regions of the
+dataset from many peers at the same time in parallel, maximizing network
+throughput.
+
+Fixed sized chunking has drawbacks for data that changes. BitTorrent
+assumes all metadata will be transferred up front, which makes it
+impractical for streaming or updating content. Most BitTorrent clients
+divide data into 1024 pieces, meaning large datasets could have a very
+large chunk size, which impacts random access performance (e.g.~for
+streaming video).
+
+Another drawback of BitTorrent is due to the way clients advertise and
+discover other peers in the absence of any protocol level privacy or
+trust. From a user privacy standpoint, BitTorrent leaks what users are
+accessing or attempting to access, and does not provide the same
+browsing privacy functions as systems like SSL.
+
+\subsection{Kademlia Distributed Hash
+Table}\label{kademlia-distributed-hash-table}
+
+Kademlia (Maymounkov and Mazieres 2002) is a distributed hash table, a
+distributed key/value store that can serve a similar purpose to DNS
+servers but has no hard coded server addresses. All clients in Kademlia
+are also servers. As long as you know at least one address of another
+peer in the network, you can ask them for the key you are trying to find
+and they will either have it or give you some other people to talk to
+that are more likely to have it.
+
+If you don't have an initial peer to talk to, most clients use a
+bootstrap server that randomly gives you a peer in the network to start
+with. If the bootstrap server goes down, the network still functions as
+long as other methods can be used to bootstrap new peers (such as
+sending them peer addresses through side channels, like how .torrent
+files include tracker addresses to try in case Kademlia finds no peers).
+
+Kademlia is distinct from previous DHT designs due to its simplicity. It
+uses a very simple XOR operation between two keys as its ``distance''
+metric to decide which peers are closer to the data being searched for.
+On paper it seems like it wouldn't work, as it doesn't take into account
+things like ping speed or bandwidth. Instead its design is deliberately
+very simple, to minimize the amount of control/gossip messages and to
+minimize the amount of complexity required to implement it. In practice
+Kademlia has been extremely successful and is widely deployed as the
+``Mainline DHT'' for BitTorrent, with support in all popular BitTorrent
+clients today.
+
+Due to the simplicity of the original Kademlia design, a number of
+attacks such as DDoS and/or Sybil have been demonstrated. There are
+protocol extensions (BEPs) which in certain cases mitigate the effects
+of these attacks, such as BEP 44, which includes a DDoS mitigation
+technique.
+Nonetheless anyone using Kademlia should be aware of the limitations.
+
+\subsection{Peer to Peer Streaming Peer Protocol
+(PPSPP)}\label{peer-to-peer-streaming-peer-protocol-ppspp}
+
+PPSPP
+(\href{https://datatracker.ietf.org/doc/rfc7574/?include_text=1}{IETF
+RFC 7574}, (Bakker, Petrocco, and Grishchenko 2015)) is a protocol for
+live streaming content over a peer to peer network. It defines a
+specific type of Merkle tree that allows for subsets of the hashes to be
+requested by a peer in order to reduce the time-till-playback for end
+users. BitTorrent, for example, transfers all hashes up front, which is
+not suitable for live streaming.
+
+Their Merkle trees are ordered using a scheme they call ``bin
+numbering'', which is a method for deterministically arranging an
+append-only log of leaf nodes into an in-order layout tree where
+non-leaf nodes are derived hashes. If you want to verify a specific
+node, you only need to request its sibling's hash and all its uncle
+hashes. PPSPP is very concerned with reducing round trip time and
+time-till-playback by allowing for many kinds of optimizations, such as
+packing as many hashes into datagrams as possible when exchanging tree
+information with peers.
+
+Although PPSPP was designed with streaming video in mind, the ability to
+request a subset of metadata from a large and/or streaming dataset is
+very desirable for many other types of datasets.
+
+\subsection{WebTorrent}\label{webtorrent}
+
+With WebRTC, browsers can now make peer to peer connections directly to
+other browsers. BitTorrent uses UDP sockets, which aren't available to
+browser JavaScript, so it can't be used as-is on the Web.
+
+WebTorrent implements the BitTorrent protocol in JavaScript using WebRTC
+as the transport. This includes the BitTorrent block exchange protocol
+as well as the tracker protocol, implemented in a way that can enable
+hybrid nodes which talk simultaneously to both BitTorrent and WebTorrent
+swarms (if a client is capable of making both UDP and WebRTC sockets,
+such as Node.js). Trackers are exposed to web clients over HTTP or
+WebSockets.
+
+\subsection{InterPlanetary File
+System}\label{interplanetary-file-system}
+
+IPFS is a family of application and network protocols that have peer to
+peer file sharing and data permanence baked in. IPFS abstracts network
+protocols and naming systems to provide an alternative application
+delivery platform to today's Web. For example, instead of using HTTP and
+DNS directly, in IPFS you would use LibP2P streams and IPNS in order to
+gain access to the features of the IPFS platform.
+
+\subsection{Certificate Transparency/Secure
+Registers}\label{certificate-transparencysecure-registers}
+
+The UK Government Digital Service have developed the concept of a
+register, which they define as a digital public ledger you can trust. In
+the UK, government registers are beginning to be piloted as a way to
+expose essential open data sets in a way where consumers can verify the
+data has not been tampered with, and allows the data publishers to
+update their data sets over time.
+
+The design of registers was inspired by the infrastructure backing the
+Certificate Transparency (Laurie, Langley, and Kasper 2013) project,
+initiated at Google, which provides a service on top of SSL certificates
+that enables service providers to write certificates to a distributed
+public ledger.
+Any client or service provider can verify if a
+certificate they received is in the ledger, which protects against so
+called ``rogue certificates''.
+
+\section{5. Reference Implementation}\label{reference-implementation}
+
+The connection logic is implemented in a module called
+\href{https://www.npmjs.com/package/discovery-swarm}{discovery-swarm}.
+This builds on discovery-channel and adds connection establishment,
+management and statistics. It provides statistics such as how many
+sources are currently connected and how many good and bad behaving
+sources have been talked to, and it automatically handles connecting and
+reconnecting to sources. UTP support is implemented in the module
+\href{https://www.npmjs.com/package/utp-native}{utp-native}.
+
+Our implementation of source discovery is called
+\href{https://npmjs.org/discovery-channel}{discovery-channel}. We also
+run a \href{https://www.npmjs.com/package/dns-discovery}{custom DNS
+server} that Dat clients use (in addition to specifying their own if
+they need to), as well as a
+\href{https://github.com/bittorrent/bootstrap-dht}{DHT bootstrap}
+server. These discovery servers are the only centralized infrastructure
+we need for Dat to work over the Internet, but they are redundant and
+interchangeable, they never see the actual data being shared, anyone can
+run their own, and Dat will still work even if they are all unavailable.
+If this happens, discovery will just be manual (e.g.~manually sharing
+IP/ports).
+
+\section{Acknowledgements}\label{acknowledgements}
+
+This work was made possible through grants from the John S. and James L.
+Knight and Alfred P. Sloan Foundations.
+
+\section*{References}\label{references}
+\addcontentsline{toc}{section}{References}
+
+\hypertarget{refs}{}
+\hypertarget{ref-aumasson2013blake2}{}
+Aumasson, Jean-Philippe, Samuel Neves, Zooko Wilcox-O'Hearn, and
+Christian Winnerlein. 2013. ``BLAKE2: Simpler, Smaller, Fast as MD5.''
+In \emph{International Conference on Applied Cryptography and Network
+Security}, 119--35. Springer.
+
+\hypertarget{ref-bakker2015peer}{}
+Bakker, A, R Petrocco, and V Grishchenko. 2015. ``Peer-to-Peer Streaming
+Peer Protocol (PPSPP).''
+
+\hypertarget{ref-bernstein2012high}{}
+Bernstein, Daniel J, Niels Duif, Tanja Lange, Peter Schwabe, and Bo-Yin
+Yang. 2012. ``High-Speed High-Security Signatures.'' \emph{Journal of
+Cryptographic Engineering}. Springer, 1--13.
+
+\hypertarget{ref-laurie2013certificate}{}
+Laurie, Ben, Adam Langley, and Emilia Kasper. 2013. ``Certificate
+Transparency.''
+
+\hypertarget{ref-maymounkov2002kademlia}{}
+Maymounkov, Petar, and David Mazieres. 2002. ``Kademlia: A Peer-to-Peer
+Information System Based on the XOR Metric.'' In \emph{International
+Workshop on Peer-to-Peer Systems}, 53--65. Springer.
+
+\hypertarget{ref-mykletun2003providing}{}
+Mykletun, Einar, Maithili Narasimha, and Gene Tsudik. 2003. ``Providing
+Authentication and Integrity in Outsourced Databases Using Merkle Hash
+Trees.'' \emph{UCI-SCONCE Technical Report}.
+
+\hypertarget{ref-sleep}{}
+Ogden, Maxwell, and Mathias Buus. 2017. ``SLEEP - the Dat Protocol on
+Disk Format.''
+
+\hypertarget{ref-rossi2010ledbat}{}
+Rossi, Dario, Claudio Testa, Silvio Valenti, and Luca Muscariello. 2010.
+``LEDBAT: The New BitTorrent Congestion Control Protocol.'' In
+\emph{ICCCN}, 1--6.
+
+\end{document}