\documentclass[a4paper,twocolumn]{article}
\usepackage{lmodern}
\usepackage{amssymb,amsmath}
\usepackage{ifxetex,ifluatex}
\usepackage{fixltx2e} % provides \textsubscript
\ifnum 0\ifxetex 1\fi\ifluatex 1\fi=0 % if pdftex
  \usepackage[T1]{fontenc}
  \usepackage[utf8]{inputenc}
\else % if luatex or xelatex
  \ifxetex
    \usepackage{mathspec}
  \else
    \usepackage{fontspec}
  \fi
  \defaultfontfeatures{Ligatures=TeX,Scale=MatchLowercase}
\fi
% use upquote if available, for straight quotes in verbatim environments
\IfFileExists{upquote.sty}{\usepackage{upquote}}{}
% use microtype if available
\IfFileExists{microtype.sty}{%
\usepackage{microtype}
\UseMicrotypeSet[protrusion]{basicmath} % disable protrusion for tt fonts
}{}
\usepackage[unicode=true]{hyperref}
\hypersetup{
  pdftitle={Dat - Distributed Dataset Synchronization And Versioning},
  pdfauthor={Maxwell Ogden, Karissa McKelvey, Mathias Buus Madsen, Code for Science},
  pdfborder={0 0 0},
  breaklinks=true}
\urlstyle{same} % don't use monospace font for urls
\IfFileExists{parskip.sty}{%
\usepackage{parskip}
}{% else
\setlength{\parindent}{0pt}
\setlength{\parskip}{6pt plus 2pt minus 1pt}
}
\setlength{\emergencystretch}{3em} % prevent overfull lines
\providecommand{\tightlist}{%
  \setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}}
\setcounter{secnumdepth}{0}
% Redefine (sub)paragraphs to behave more like sections
\ifx\paragraph\undefined\else
\let\oldparagraph\paragraph
\renewcommand{\paragraph}[1]{\oldparagraph{#1}\mbox{}}
\fi
\ifx\subparagraph\undefined\else
\let\oldsubparagraph\subparagraph
\renewcommand{\subparagraph}[1]{\oldsubparagraph{#1}\mbox{}}
\fi

% set default figure placement to htbp
\makeatletter
\def\fps@figure{htbp}
\makeatother


\title{Dat - Distributed Dataset Synchronization And
Versioning}
\author{Maxwell Ogden, Karissa McKelvey, Mathias Buus Madsen, Code for Science}
\date{May 2017}

\begin{document}
\maketitle

\section{Abstract}\label{abstract}

Dat is a protocol designed for syncing folders of data, even if they are large or changing constantly. Dat uses a cryptographically secure register of changes to prove that the requested data version is distributed. A byte range of any file's version can be efficiently streamed from a Dat repository over a network connection. Consumers can choose to fully or partially replicate the contents of a remote Dat repository, and can also subscribe to live changes. To ensure writer and reader privacy, Dat uses public key cryptography to encrypt network traffic. A group of Dat clients can connect to each other to form a public or private decentralized network to exchange data. A reference implementation is provided in JavaScript.

\section{1. Background}\label{background}

Many datasets are shared online today using HTTP and FTP, which lack built-in support for version control or content addressing of data. This results in link rot and content drift as files are moved, updated or deleted, leading to an alarming rate of disappearing data references in areas such as \href{http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0115253}{published scientific literature}.

Cloud storage services like S3 ensure availability of data, but they have a centralized hub-and-spoke networking model and are therefore limited by their bandwidth, meaning popular files can become very expensive to share. Services like Dropbox and Google Drive provide version control and synchronization on top of cloud storage, which fixes many issues with broken links, but they rely on proprietary code and services that require users to store their data on centralized cloud infrastructure, with implications for cost, transfer speeds, vendor lock-in and user privacy.
Distributed file sharing tools can become faster as files become more popular, removing the bandwidth bottleneck and making file distribution cheaper. They also use link resolution and discovery systems which can prevent broken links, meaning that if the original source goes offline other backup sources can be discovered automatically. However, these file sharing tools today are not supported by Web browsers, do not have good privacy guarantees, and do not provide a mechanism for updating files without redistributing a new dataset, which could mean entirely redownloading data you already have.

\section{2. Dat}\label{dat}

Dat is a dataset synchronization protocol that does not assume a dataset is static or that the entire dataset will be downloaded. The main reference implementation is available from npm as \texttt{npm\ install\ dat\ -g}.

The protocol is agnostic to the underlying transport, e.g.~you could implement Dat over carrier pigeon. Data is stored in a format called SLEEP (Ogden and Buus 2017), described in its own paper. The key properties of the Dat design are explained in this section.

\begin{itemize}
\tightlist
\item
  2.1 \textbf{Content Integrity} - Data and publisher integrity is verified through use of signed hashes of the content.
\item
  2.2 \textbf{Decentralized Mirroring} - Users sharing the same Dat automatically discover each other and exchange data in a swarm.
\item
  2.3 \textbf{Network Privacy} - Dat provides certain privacy guarantees including end-to-end encryption.
\item
  2.4 \textbf{Incremental Versioning} - Datasets can be efficiently synced, even in real time, to other peers.
\item
  2.5 \textbf{Random Access} - Huge file hierarchies can be efficiently traversed remotely.
\end{itemize}

\subsection{2.1 Content Integrity}\label{content-integrity}

Content integrity means being able to verify that the data you received is the exact same version of the data that you expected.
This is important in a distributed system because this mechanism will catch incorrect data sent by bad peers. It also has implications for reproducibility, as it lets you refer to a specific version of a dataset.

Link rot, when links online stop resolving, and content drift, when data changes but the link to the data remains the same, are two common issues in data analysis. For example, one day a file called data.zip might change, but a typical HTTP link to the file does not include a hash of the content or provide a way to get updated metadata, so clients that only have the HTTP link have no way to check whether the file changed without downloading the entire file again. Referring to a file by the hash of its content is called content addressability, and lets users not only verify that the data they receive is the version of the data they want, but also cite specific versions of the data by referring to a specific hash.

Dat uses BLAKE2b (Aumasson et al. 2013) cryptographically secure hashes to address content. Hashes are arranged in a Merkle tree (Mykletun, Narasimha, and Tsudik 2003), a tree where each non-leaf node is the hash of its child nodes. Leaf nodes contain pieces of the dataset. Due to the properties of secure cryptographic hashes, the top hash can only be produced if all data below it matches exactly. If two trees have matching top hashes then you know that all other nodes in the tree must match as well, and you can conclude that your dataset is synchronized. Trees are chosen as the primary data structure in Dat because they have a number of properties that allow for efficient access to subsets of the metadata, which allows Dat to work efficiently over a network connection.

\subsubsection{Dat Links}\label{dat-links}

Dat links are Ed25519 (Bernstein et al. 2012) public keys which have a length of 32 bytes (64 characters when Hex encoded).
You can represent your Dat link in the following ways and Dat clients will be able to understand them:

\begin{itemize}
\tightlist
\item
  The standalone public key:
\end{itemize}

\texttt{8e1c7189b1b2dbb5c4ec2693787884771201da9...}

\begin{itemize}
\tightlist
\item
  Using the dat:// protocol:
\end{itemize}

\texttt{dat://8e1c7189b1b2dbb5c4ec2693787884771...}

\begin{itemize}
\tightlist
\item
  As part of an HTTP URL:
\end{itemize}

\texttt{https://datproject.org/8e1c7189b1b2dbb5...}

All messages in the Dat protocol are encrypted and signed using the public key during transport. This means that unless you know the public key (e.g.~unless the Dat link was shared with you) you will not be able to discover or communicate with any member of the swarm for that Dat. Anyone with the public key can verify that messages (such as entries in a Dat Stream) were created by a holder of the private key.

Every Dat repository has a corresponding private key which is kept in your home folder and never shared. Dat never exposes either the public or private key over the network. During the discovery phase the BLAKE2b hash of the public key is used as the discovery key. This means that the original key is impossible to discover (unless it was shared publicly through a separate channel) since only the hash of the key is exposed publicly.

Dat does not provide an authentication mechanism at this time. Instead it provides a capability system. Anyone with the Dat link is currently considered able to discover and access data. Do not share your Dat links publicly if you do not want them to be accessed.

\subsubsection{Hypercore and Hyperdrive}\label{hypercore-and-hyperdrive}

The Dat storage, content integrity, and networking protocols are implemented in a module called \href{https://npmjs.org/hypercore}{Hypercore}. Hypercore is agnostic to the format of the input data; it operates on any stream of binary data.
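The discovery-key derivation described above can be sketched in a few lines. This is a minimal illustration using Python's hashlib, not the reference implementation; it follows the Feed message description in the protocol section below (a keyed BLAKE2b hash of the string `hypercore', keyed with the register's public key), and the example key is a placeholder:

```python
import hashlib

def discovery_key(public_key: bytes) -> bytes:
    # Keyed BLAKE2b hash of the protocol name, keyed with the public key.
    # Peers can rendezvous on this value without revealing the key itself.
    return hashlib.blake2b(b"hypercore", digest_size=32, key=public_key).digest()

# A hypothetical 32-byte Ed25519 public key, for illustration only.
public_key = bytes(range(32))
dk = discovery_key(public_key)
```

Only the discovery key is ever announced on a discovery network, so an observer who sees the announcement cannot recover the original read key from it.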
For the Dat use case of synchronizing datasets we use a file system module on top of Hypercore called \href{https://npmjs.org/hyperdrive}{Hyperdrive}.

Dat has a layered abstraction so that users can use Hypercore directly to have full control over how they model their data. Hyperdrive works well when your data can be represented as files on a filesystem, which is the main use case with Dat.

\subsubsection{Hypercore Registers}\label{hypercore-registers}

Hypercore Registers are the core mechanism used in Dat. They are binary append-only streams whose contents are cryptographically hashed and signed, and can therefore be verified by anyone with access to the public key of the writer. They are an implementation of the concept known as a register, a digital ledger you can trust.

Dat uses two registers, \texttt{content} and \texttt{metadata}. The \texttt{content} register contains the files in your repository and \texttt{metadata} contains the metadata about the files, including name, size, last modified time, etc. Dat replicates them both when synchronizing with another peer.

When files are added to Dat, each file gets split up into some number of chunks, and the chunks are then arranged into a Merkle tree, which is used later for version control and replication.

\subsection{2.2 Decentralized Mirroring}\label{decentralized-mirroring}

Dat is a peer to peer protocol designed to exchange pieces of a dataset amongst a swarm of peers. As soon as a peer acquires their first piece of data in the dataset they can choose to become a partial mirror for the dataset. If someone else contacts them and needs the piece they have, they can choose to share it. This can happen simultaneously while the peer is still downloading the pieces they want from others.

\subsubsection{Source Discovery}\label{source-discovery}

An important aspect of mirroring is source discovery, the techniques that peers use to find each other.
Source discovery means finding the IP and port of data sources online that have a copy of the data you are looking for. You can then connect to them and begin exchanging data. By using source discovery techniques Dat is able to create a network where data can be discovered even if the original data source disappears.

Source discovery can happen over many kinds of networks, as long as you can model the following actions:

\begin{itemize}
\tightlist
\item
  \texttt{join(key,\ {[}port{]})} - Begin performing regular lookups on an interval for \texttt{key}. Specify \texttt{port} if you want to announce that you share \texttt{key} as well.
\item
  \texttt{leave(key,\ {[}port{]})} - Stop looking for \texttt{key}. Specify \texttt{port} to stop announcing that you share \texttt{key} as well.
\item
  \texttt{foundpeer(key,\ ip,\ port)} - Called when a peer is found by a lookup.
\end{itemize}

The Dat implementation provides the above actions on top of three types of discovery networks:

\begin{itemize}
\tightlist
\item
  DNS name servers - An Internet standard mechanism for resolving keys to addresses.
\item
  Multicast DNS - Useful for discovering peers on local networks.
\item
  Kademlia Mainline Distributed Hash Table - Has fewer central points of failure, increasing the probability that Dat works even if DNS servers are unreachable.
\end{itemize}

Additional discovery networks can be implemented as needed. We chose the above three as a starting point to have a complementary mix of strategies that increases the probability of source discovery. Additionally you can specify a Dat via an HTTPS link, which runs the Dat protocol in ``single-source'' mode: the discovery networks above are not used, and that one HTTPS server acts as the only peer.

\subsubsection{Peer Connections}\label{peer-connections}

After the discovery phase, Dat should have a list of potential data sources to try and contact.
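The join/leave/foundpeer contract described above can be modeled as a small interface. The following is a toy in-memory stand-in for a discovery network, with illustrative names only (a real backend such as DNS, multicast DNS, or the Mainline DHT would perform the lookups remotely and on an interval):

```python
from typing import Callable, Dict, Optional, Set, Tuple

PeerCallback = Callable[[bytes, str, int], None]

class ToyDiscovery:
    """In-memory stand-in for a discovery network (illustration only)."""

    def __init__(self) -> None:
        # discovery key -> set of (ip, port) announcers
        self._announced: Dict[bytes, Set[Tuple[str, int]]] = {}

    def join(self, key: bytes, foundpeer: PeerCallback,
             port: Optional[int] = None, ip: str = "127.0.0.1") -> None:
        # Announce ourselves for `key` if a port was given.
        peers = self._announced.setdefault(key, set())
        if port is not None:
            peers.add((ip, port))
        # A real network would repeat this lookup on an interval; here we
        # just report every currently known source once.
        for peer_ip, peer_port in sorted(peers):
            foundpeer(key, peer_ip, peer_port)

    def leave(self, key: bytes, port: Optional[int] = None,
              ip: str = "127.0.0.1") -> None:
        # Stop announcing that we share `key`.
        if port is not None:
            self._announced.get(key, set()).discard((ip, port))

found = []
net = ToyDiscovery()
net.join(b"discovery-key", lambda k, ip, p: found.append((ip, p)), port=3282)
```

Because the contract is this small, several discovery backends can be run side by side and their `foundpeer` results merged into one candidate pool.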
Dat uses either TCP, HTTP or \href{https://en.wikipedia.org/wiki/Micro_Transport_Protocol}{UTP} (Rossi et al. 2010). UTP uses LEDBAT, which is designed to not take up all available bandwidth on a network (e.g.~so that other people sharing the wifi can still use the Internet), and is still based on UDP so it works with NAT traversal techniques like UDP hole punching. HTTP is supported for compatibility with static file servers and web browser clients. Note that these are the protocols supported by the reference Dat implementation; the Dat protocol itself is transport agnostic.

If an HTTP source is specified, Dat will prefer it over other sources. Otherwise, when Dat gets the IP and port for a potential TCP or UTP source it tries to connect using both protocols. If one connects first, Dat aborts the other. If neither connects, Dat will retry until it decides the source is offline or unavailable and then stops trying to connect. Sources Dat is able to connect to go into a list of known good sources, so that if the Internet connection goes down Dat can use that list to reconnect to known good sources quickly.

If Dat gets a lot of potential sources it picks a handful at random to try to connect to, and keeps the rest around as additional sources to use later in case it decides it needs more.

Once a duplex binary connection to a remote source is open, Dat layers on the Hypercore protocol, a message-based replication protocol that allows two peers to communicate over a stateless channel to request and exchange data. You can open separate replication channels with many peers at once, which allows clients to parallelize data requests across the entire pool of peers they have established connections with.

\subsection{2.3 Network Privacy}\label{network-privacy}

On the Web today, with SSL, there is a guarantee that the traffic between your computer and the server is private.
As long as you trust the server not to leak your logs, attackers who intercept your network traffic will not be able to read the HTTP traffic exchanged between you and the server. This is a fairly straightforward model, as clients only have to trust a single server for some domain.

There is an inherent tradeoff in peer to peer systems between source discovery and user privacy. The more sources you contact and ask for some data, the more sources you trust to keep what you asked for private. Our goal is to have Dat be configurable with respect to this tradeoff, to allow application developers to meet their own privacy guidelines.

It is up to client programs to make design decisions around which discovery networks they trust. For example, if a Dat client decides to use the BitTorrent DHT to discover peers, and is searching for a publicly shared Dat key (e.g.~a key cited publicly in a published scientific paper) with known contents, then because of the privacy design of the BitTorrent DHT it becomes public knowledge what key that client is searching for.

A client could choose to only use discovery networks with certain privacy guarantees. For example, a client could only connect to an approved list of sources that they trust, similar to SSL. As long as they trust each source, the encryption built into the Dat network protocol will prevent the Dat key they are looking for from being leaked.

\subsection{2.4 Incremental Versioning}\label{incremental-versioning}

Given a stream of binary data, Dat splits the stream into chunks, hashes each chunk, and arranges the hashes in a specific type of Merkle tree that allows for certain replication properties.

Dat is also able to fully or partially synchronize streams in a distributed setting even while the stream is being appended to. This is accomplished by using the messaging protocol to traverse the Merkle tree of remote sources and fetch a strategic set of nodes.
Due to the low-level, message-oriented design of the replication protocol, different node traversal strategies can be implemented.

There are two types of versioning performed automatically by Dat. Metadata is stored in a folder called \texttt{.dat} in the root folder of a repository, and data is stored as normal files in the root folder.

\subsubsection{Metadata Versioning}\label{metadata-versioning}

Dat tries as much as possible to act as a one-to-one mirror of the state of a folder and all its contents. When importing files, Dat uses a sorted, depth-first recursion to list all the files in the tree. For each file it finds, it grabs the filesystem metadata (filename, Stat object, etc.) and checks whether an entry for this filename with this exact metadata is already represented in the Dat repository metadata. If it matches the newest version of the file's metadata stored in Dat exactly, the file is skipped (no change).

If the metadata differs from the current existing entry (or there are no entries for this filename at all in the history), then this new metadata entry will be appended as the new `latest' version for this file in the append-only SLEEP metadata register (described below).

\subsubsection{Content Versioning}\label{content-versioning}

In addition to storing a historical record of filesystem metadata, the contents of the files themselves can also be stored in a version controlled manner. The default storage system used in Dat stores the files as files. This has the advantage of being very straightforward for users to understand, but the downside of not storing old versions of content by default.

In contrast to other version control systems like Git, Dat by default only stores the current set of checked out files on disk in the repository folder, not old versions. It does store all previous metadata for old versions in \texttt{.dat}.
Git, for example, stores all previous content versions and all previous metadata versions in the \texttt{.git} folder. Because Dat is designed for larger datasets, if it stored all previous file versions in \texttt{.dat}, then the \texttt{.dat} folder could easily fill up the user's hard drive inadvertently. Therefore Dat has multiple storage modes based on usage.

Hypercore registers include an optional \texttt{data} file that stores all chunks of data. In Dat, only the \texttt{metadata.data} file is used; the \texttt{content.data} file is not. The default behavior is to store the current files only, as normal files. If you want to run an `archival' node that keeps all previous versions, you can configure Dat to use the \texttt{content.data} file instead. For example, on a shared server with lots of storage you probably want to store all versions. However, on a workstation that is only accessing a subset of one version, the default mode of storing all metadata plus the current set of downloaded files is acceptable, because you know the server has the full history.

\subsubsection{Merkle Trees}\label{merkle-trees}

Registers in Dat use a specific method of encoding a Merkle tree where hashes are positioned by a scheme called binary in-order interval numbering, or just ``bin'' numbering. This is simply a specific, deterministic way of laying out the nodes in a tree. For example, a tree with 7 nodes will always be arranged like this:

\begin{verbatim}
0
  1
2
    3
4
  5
6
\end{verbatim}

In Dat, the hashes of the chunks of files are always even numbers, at the wide end of the tree.
So the above tree has four original values that become the even numbers:

\begin{verbatim}
chunk0 -> 0
chunk1 -> 2
chunk2 -> 4
chunk3 -> 6
\end{verbatim}

In the resulting Merkle tree, the even and odd nodes store different information:

\begin{itemize}
\tightlist
\item
  Evens - List of data hashes {[}chunk0, chunk1, chunk2, \ldots{}{]}
\item
  Odds - List of Merkle hashes (hashes of child nodes) {[}hash0, hash1, hash2, \ldots{}{]}
\end{itemize}

These two lists get interleaved into a single register such that the indexes (positions) in the register are the same as the bin numbers from the Merkle tree.

All odd hashes are derived by hashing the two child nodes, e.g.~given hash0 is \texttt{hash(chunk0)} and hash2 is \texttt{hash(chunk1)}, hash1 is \texttt{hash(hash0\ +\ hash2)}.

For example, a register with two data entries would look something like this (pseudocode):

\begin{verbatim}
0. hash(chunk0)
1. hash(hash(chunk0) + hash(chunk1))
2. hash(chunk1)
\end{verbatim}

It is possible for the in-order Merkle tree to have multiple roots at once. A root is defined as a parent node with a full set of child node slots filled below it.

For example, this tree has two roots (1 and 4):

\begin{verbatim}
0
  1
2

4
\end{verbatim}

This tree has one root (3):

\begin{verbatim}
0
  1
2
    3
4
  5
6
\end{verbatim}

This one has one root (1):

\begin{verbatim}
0
  1
2
\end{verbatim}

\subsubsection{Replication Example}\label{replication-example}

This section describes at a high level the replication flow of a Dat. Note that the low level details are available by reading the SLEEP section below.
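The bin-numbering layout and the odd-node hash rule above can be sketched as follows. This is an illustration only: it assumes a power-of-two chunk count and plain 32-byte BLAKE2b hashes, whereas the reference implementation's hashing scheme also covers lengths and type prefixes:

```python
import hashlib

def h(data: bytes) -> bytes:
    # Plain BLAKE2b-256, standing in for the real hashing scheme.
    return hashlib.blake2b(data, digest_size=32).digest()

def build_register(chunks):
    """Lay out chunk hashes at even bins and parent hashes at odd bins,
    following the in-order ("bin") numbering described above."""
    tree = {2 * i: h(chunk) for i, chunk in enumerate(chunks)}
    span = 2  # distance between sibling bins at the current level
    while span < 2 * len(chunks):
        # A parent at odd bin p covers children at p - span/2 and p + span/2.
        for p in range(span - 1, 2 * len(chunks) - 1, 2 * span):
            left, right = p - span // 2, p + span // 2
            if left in tree and right in tree:
                tree[p] = h(tree[left] + tree[right])
        span *= 2
    return tree

tree = build_register([b"chunk0", b"chunk1", b"chunk2", b"chunk3"])
# Bins 0, 2, 4, 6 hold the chunk hashes; bins 1 and 5 are their parents;
# bin 3 is the single root covering all four chunks.
```

Because the bin number alone determines which children a node covers, two peers can exchange subsets of the tree and agree on where every hash belongs without any extra coordination.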
For the sake of illustrating how this works in practice in a networked replication scenario, consider a folder with two files:

\begin{verbatim}
bat.jpg
cat.jpg
\end{verbatim}

To send these files to another machine using Dat, you would first add them to a Dat repository by splitting them into chunks and constructing SLEEP files representing the chunks and filesystem metadata.

Let's assume \texttt{bat.jpg} and \texttt{cat.jpg} both produce three chunks, each around 64KB. Dat stores these in a representation called SLEEP, but here we will show a pseudo-representation for the purposes of illustrating the replication process. The six chunks get sorted into a list like this:

\begin{verbatim}
bat-1
bat-2
bat-3
cat-1
cat-2
cat-3
\end{verbatim}

These chunks then each get hashed, and the hashes get arranged into a Merkle tree (the content register):

\begin{verbatim}
0  - hash(bat-1)
    1 - hash(0 + 2)
2  - hash(bat-2)
        3 - hash(1 + 5)
4  - hash(bat-3)
    5 - hash(4 + 6)
6  - hash(cat-1)
8  - hash(cat-2)
    9 - hash(8 + 10)
10 - hash(cat-3)
\end{verbatim}

Next we calculate the root hashes of our tree, in this case 3 and 9. We then hash them together, and cryptographically sign the hash. This signed hash can now be used to verify all nodes in the tree, and the signature proves it was produced by us, the holder of the private key for this Dat.

This tree is for the hashes of the contents of the photos. Dat also generates a second Merkle tree that represents the list of files and their metadata, which looks something like this (the metadata register):

\begin{verbatim}
0 - hash({contentRegister: '9e29d624...'})
    1 - hash(0 + 2)
2 - hash({"bat.jpg", first: 0, length: 3})
4 - hash({"cat.jpg", first: 3, length: 3})
\end{verbatim}

The first entry in this feed is a special metadata entry that tells Dat the address of the second feed (the content register).
Note that node 3 is not included yet, because 3 is the hash of \texttt{1\ +\ 5}, but 5 does not exist yet, so it will be written at a later update.

Now we're ready to send our metadata to the other peer. The first message is a \texttt{Register} message with the key that was shared for this Dat. Let's call ourselves Alice and the other peer Bob. Alice sends Bob a \texttt{Want} message that declares she wants all nodes in the file list (the metadata register). Bob replies with a single \texttt{Have} message that indicates he has 2 nodes of data. Alice sends three \texttt{Request} messages, one for each leaf node (\texttt{0,\ 2,\ 4}). Bob sends back three \texttt{Data} messages. The first \texttt{Data} message contains the content register key, the hash of the sibling, in this case node \texttt{2}, the hash of the uncle root \texttt{4}, as well as a signature for the root hashes (in this case \texttt{1,\ 4}). Alice verifies the integrity of this first \texttt{Data} message by hashing the metadata received for the content register to produce the hash for node \texttt{0}. She then hashes hash \texttt{0} with the included hash \texttt{2} to reproduce hash \texttt{1}, hashes her hash \texttt{1} with the received value for \texttt{4}, and uses the received signature to verify the result matches the signed roots. When the next \texttt{Data} message is received, a similar process is performed to verify the content.

Now Alice has the full list of files in the Dat, but decides she only wants to download \texttt{cat.jpg}. Alice knows she wants chunks 3 through 5 from the content register (the file's metadata entry says \texttt{first:\ 3,\ length:\ 3}). First Alice sends another \texttt{Register} message with the content key to open a new replication channel over the connection. Then Alice sends three \texttt{Request} messages, one for each of chunks \texttt{3,\ 4,\ 5}.
Bob sends back three \texttt{Data} messages with the data for each chunk, as well as the hashes needed to verify the content, in a way similar to the process described above for the metadata feed.

\subsection{2.5 Random Access}\label{random-access}

Dat pursues the following access capabilities:

\begin{itemize}
\tightlist
\item
  Support large file hierarchies (millions of files in a single repository).
\item
  Support efficient traversal of the hierarchy (listing files in arbitrary folders efficiently).
\item
  Store all changes to all files (metadata and/or content).
\item
  List all changes made to any single file.
\item
  View the state of all files relative to any point in time.
\item
  Subscribe live to all changes (any file).
\item
  Subscribe live to changes to files under a specific path.
\item
  Efficiently access any byte range of any version of any file.
\item
  Allow all of the above to happen remotely, only syncing the minimum metadata necessary to perform any action.
\item
  Allow efficient comparison of remote and local repository state to request missing pieces during synchronization.
\item
  Allow the entire remote archive to be synchronized, or just some subset of files and/or versions.
\end{itemize}

Dat accomplishes these by storing all changes in Hypercore feeds, and by using strategic metadata indexing strategies that allow certain queries to be performed efficiently by traversing the feeds. The protocol itself is specified in Section 3 (SLEEP), but a scenario-based summary follows here.

\subsubsection{Scenario: Reading a file from a specific byte offset}\label{scenario-reading-a-file-from-a-specific-byte-offset}

Alice has a dataset in Dat. Bob wants to access a 100MB CSV called \texttt{cat\_dna.csv} stored in the remote repository, but only the 10MB range of the CSV spanning from 30MB to 40MB.
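The chunk arithmetic behind this scenario can be sketched as follows. This is a toy illustration that assumes fixed 64KB chunks and a metadata entry recording the file's first chunk index (like the first/length fields in the replication example above); the starting chunk number is hypothetical:

```python
CHUNK_SIZE = 64 * 1024  # assumed fixed chunk size, for illustration only

def chunks_for_byte_range(first_chunk: int, offset: int, length: int) -> range:
    """Return the content-feed chunk indices covering `length` bytes
    starting `offset` bytes into a file whose data begins at `first_chunk`."""
    start = first_chunk + offset // CHUNK_SIZE
    end = first_chunk + (offset + length - 1) // CHUNK_SIZE
    return range(start, end + 1)

# Bob wants the 10MB spanning 30MB-40MB of cat_dna.csv; suppose its
# metadata entry says the file's data starts at chunk 1000 (hypothetical).
needed = chunks_for_byte_range(1000, 30 * 2**20, 10 * 2**20)
```

Only the chunks in `needed` (plus the tree hashes required to verify them) have to be requested, which is what makes partial reads of large files cheap.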
Bob has never communicated with Alice before, and is starting fresh with no knowledge of this Dat repository other than that he knows he wants \texttt{cat\_dna.csv} at a specific offset.

First, Bob asks Alice through the Dat protocol for the metadata he needs to resolve \texttt{cat\_dna.csv} to the correct metadata feed entry that represents the file he wants. Note: in this scenario we assume Bob wants the latest version of \texttt{cat\_dna.csv}. It is also possible to do this for a specific older version.

Bob first sends a \texttt{Request} message for the latest entry in the metadata feed. Alice responds. Bob looks at the \texttt{trie} value and, using the lookup algorithm described below, sends another \texttt{Request} message for the metadata node that is closer to the filename he is looking for. This repeats until Alice sends Bob the matching metadata entry. This is the un-optimized resolution that uses \texttt{log(n)} round trips, though there are ways to optimize it by having Alice send additional sequence numbers to Bob that help him traverse in fewer round trips.

In the metadata record Bob received for \texttt{cat\_dna.csv} there is the byte offset to the beginning of the file in the data feed. Bob adds his 30MB offset to this value and starts requesting pieces of data starting at that byte offset, using the SLEEP protocol as described below.

This method allows any byte range of any file to be accessed without the need to synchronize the full metadata for all files up front.

\subsubsection{Scenario: Syncing live changes to files at a specific path}\label{scenario-syncing-live-changes-to-files-at-a-specific-path}

TODO

\subsubsection{Scenario: Syncing an entire archive}\label{scenario-syncing-an-entire-archive}

TODO

\section{3.
Dat Network Protocol}\label{dat-network-protocol}

The SLEEP format is designed to allow for sparse replication, meaning you can efficiently download only the metadata and data required to resolve a single byte region of a single file, which makes Dat suitable for a wide variety of streaming, real time and large dataset use cases.

To take advantage of this, Dat includes a network protocol. It is message-based and stateless, making it possible to implement on a variety of network transport protocols including UDP and TCP. Both the metadata and content registers in SLEEP share the exact same replication protocol.

Individual messages are encoded using Protocol Buffers. There are ten message types, using the following schema:

\subsubsection{Wire Protocol}\label{wire-protocol}

Over the wire, messages are packed in the following lightweight container format:

\begin{verbatim}
<varint - length of rest of message>
  <varint - header>
  <message>
\end{verbatim}

The \texttt{header} value is a single varint that carries two pieces of information: the integer \texttt{type} that declares a 4-bit message type (used below), and a channel identifier, \texttt{0} for metadata and \texttt{1} for content.

To generate this varint, you bitshift the 4-bit type integer onto the end of the channel identifier, e.g. \texttt{channel\ \textless{}\textless{}\ 4\ \textbar{}\ \textless{}4-bit-type\textgreater{}}.

\subsubsection{Feed}\label{feed}

Type 0. Should be the first message sent on a channel.

\begin{itemize}
\tightlist
\item
  \texttt{discoveryKey} - A BLAKE2b keyed hash of the string `hypercore' using the public key of the metadata register as the key.
\item
  \texttt{nonce} - 32 bytes of random binary data, used in our encryption scheme.
\end{itemize}

\begin{verbatim}
message Feed {
  required bytes discoveryKey = 1;
  optional bytes nonce = 2;
}
\end{verbatim}

\subsubsection{Handshake}\label{handshake}

Type 1.
Overall connection handshake. Should be sent just after the feed message on the first channel only (metadata).

\begin{itemize}
\tightlist
\item
  \texttt{id} - 32 bytes of random data used as an identifier for this peer on the network, useful for checking if you are connected to yourself or to another peer more than once
\item
  \texttt{live} - Whether you want to operate in live (continuous) replication mode or end after the initial sync
\item
  \texttt{userData} - User-specific metadata encoded as a byte sequence
\item
  \texttt{extensions} - List of extensions that are supported on this Feed
\end{itemize}

\begin{verbatim}
message Handshake {
  optional bytes id = 1;
  optional bool live = 2;
  optional bytes userData = 3;
  repeated string extensions = 4;
}
\end{verbatim}

\subsubsection{Info}\label{info}

Type 2. Message indicating state changes. Used to indicate whether you are uploading and/or downloading.

The initial state for both uploading and downloading is true. If both ends are not downloading and not live, it is safe to consider the stream ended.

\begin{verbatim}
message Info {
  optional bool uploading = 1;
  optional bool downloading = 2;
}
\end{verbatim}

\subsubsection{Have}\label{have}

Type 3. How you tell the other peer which chunks of data you have or don't have. You should only send Have messages to peers who have expressed interest in this region with Want messages.

\begin{itemize}
\tightlist
\item
  \texttt{start} - If you only specify \texttt{start}, it means you are telling the other side you only have 1 chunk at the position at the value in \texttt{start}.
\item
  \texttt{length} - If you specify length, you can describe a range of values that you have all of, starting from \texttt{start}.
\item
  \texttt{bitfield} - If you would like to send a range of sparse data about haves/don't haves via bitfield, relative to \texttt{start}. 
\end{itemize}

\begin{verbatim}
message Have {
  required uint64 start = 1;
  optional uint64 length = 2 [default = 1];
  optional bytes bitfield = 3;
}
\end{verbatim}

When sending bitfields you must run-length encode them. The encoded bitfield is a series of compressed and uncompressed bit sequences. All sequences start with a header that is a varint.

If the last bit of the varint is set (it is an odd number), the header represents a compressed bit sequence:

\begin{verbatim}
compressed-sequence = varint(
  byte-length-of-sequence
  << 2 | bit << 1 | 1
)
\end{verbatim}

If the last bit is \emph{not} set, the header represents an uncompressed sequence:

\begin{verbatim}
uncompressed-sequence = varint(
  byte-length-of-bitfield << 1 | 0
) + (bitfield)
\end{verbatim}

\subsubsection{Unhave}\label{unhave}

Type 4. How you communicate that you deleted or removed a chunk you used to have.

\begin{verbatim}
message Unhave {
  required uint64 start = 1;
  optional uint64 length = 2 [default = 1];
}
\end{verbatim}

\subsubsection{Want}\label{want}

Type 5. How you ask the other peer to subscribe you to Have messages for a region of chunks. The \texttt{length} value defaults to Infinity or feed.length (if not live).

\begin{verbatim}
message Want {
  required uint64 start = 1;
  optional uint64 length = 2;
}
\end{verbatim}

\subsubsection{Unwant}\label{unwant}

Type 6. How you ask to unsubscribe from Have messages for a region of chunks from the other peer. You should only Unwant previously Wanted regions, but if you do Unwant something that hasn't been Wanted it won't have any effect. The \texttt{length} value defaults to Infinity or feed.length (if not live).

\begin{verbatim}
message Unwant {
  required uint64 start = 1;
  optional uint64 length = 2;
}
\end{verbatim}

\subsubsection{Request}\label{request}

Type 7. Request a single chunk of data. 
\begin{itemize}
\tightlist
\item
  \texttt{index} - The chunk index for the chunk you want. You should only ask for indexes that you have received Have messages for.
\item
  \texttt{bytes} - You can also optimistically specify a byte offset; if the remote is able to resolve the chunk for this byte offset based on their Merkle tree state, they will ignore \texttt{index} and send the chunk containing this byte offset instead. If they cannot resolve the byte request, \texttt{index} will be used.
\item
  \texttt{hash} - If you only want the hash of the chunk and not the chunk data itself.
\item
  \texttt{nodes} - A 64-bit bitfield representing which parent nodes you have.
\end{itemize}

The \texttt{nodes} bitfield is an optional optimization to reduce the number of duplicate nodes exchanged during the replication lifecycle. It indicates which parents you have or don't have; you can specify a maximum of 64 parents. Because \texttt{uint64} in Protocol Buffers is implemented as a varint, over the wire this does not take up 64 bits in most cases. The first bit is reserved to signify whether or not you need a signature in response. The remaining bits represent whether you have (\texttt{1}) or don't have (\texttt{0}) the information at that node already. The ordering is determined by walking parent, then sibling, up the tree all the way to the root.

\begin{verbatim}
message Request {
  required uint64 index = 1;
  optional uint64 bytes = 2;
  optional bool hash = 3;
  optional uint64 nodes = 4;
}
\end{verbatim}

\subsubsection{Cancel}\label{cancel}

Type 8. Cancel a previous Request message that you haven't received yet.

\begin{verbatim}
message Cancel {
  required uint64 index = 1;
  optional uint64 bytes = 2;
  optional bool hash = 3;
}
\end{verbatim}

\subsubsection{Data}\label{data}

Type 9. Sends a single chunk of data to the other peer. 
You can send it in response to a Request or unsolicited on its own as a friendly gift. The data includes all of the Merkle tree parent nodes needed to verify the hash chain all the way up to the Merkle roots for this chunk. Because you can produce the direct parents by hashing the chunk, only the roots and `uncle' hashes are included (the siblings of all of the parent nodes).

\begin{itemize}
\tightlist
\item
  \texttt{index} - The chunk position for this chunk.
\item
  \texttt{value} - The chunk binary data. Empty if you are sending only the hash.
\item
  \texttt{Node.index} - The index for this chunk in in-order notation
\item
  \texttt{Node.hash} - The hash of this chunk
\item
  \texttt{Node.size} - The aggregate chunk size for all children below this node (the sum of all chunk sizes of all children)
\item
  \texttt{signature} - If you are sending a root node, all root nodes must have the signature included.
\end{itemize}

\begin{verbatim}
message Data {
  required uint64 index = 1;
  optional bytes value = 2;
  repeated Node nodes = 3;
  optional bytes signature = 4;

  message Node {
    required uint64 index = 1;
    required bytes hash = 2;
    required uint64 size = 3;
  }
}
\end{verbatim}

\section{4. Multi-Writer}\label{multi-writer}

The design of Dat up to this point assumes you have a single keyholder writing and signing data and appending it to the metadata and content feed. However, allowing multiple keyholders to write to a single repository enables many interesting use cases, such as forking and collaborative workflows.

In order to do this, we use one \texttt{metadata.data} feed for each writer. Each writer gets their own keypair, and each writer is responsible for storing their private key. To add a new writer to your feed, you include their key in a metadata feed entry. 
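This key-distribution step can be sketched as follows. The entry shape and function names here are hypothetical illustrations (the metadata feed is modeled as a plain append-only array), not the actual hyperdrive API:

```javascript
// Hypothetical sketch: granting write access by appending the new writer's
// public key to the owner's metadata feed.
function addWriter(metadataFeed, publicKey) {
  // The grant is an ordinary feed entry, so every replica learns the new
  // writer's key through normal replication of the feed history.
  metadataFeed.push({ type: 'add-writer', key: publicKey });
}

function authorizedWriters(metadataFeed, ownerKey) {
  // The feed owner is always authorized; additional writers are those
  // whose keys appear in add-writer entries anywhere in the history.
  const keys = new Set([ownerKey]);
  for (const entry of metadataFeed) {
    if (entry.type === 'add-writer') keys.add(entry.key);
  }
  return keys;
}

// The owner grants a second writer access by appending their public key.
const metadataFeed = [];
addWriter(metadataFeed, 'second-writer-public-key');
const writers = authorizedWriters(metadataFeed, 'owner-public-key');
```

Because the grant lives in the feed's replicated history, any peer holding a copy of the feed can recompute the authorized writer set without extra coordination.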
For example, if Alice wants to give Bob write access to a Dat repository, Alice takes Bob's public key and writes it to the `local' metadata feed (the feed that Alice owns, e.g.~the original feed). Now anyone else who replicates from Alice will find Bob's key in the history. If in the future Bob distributes a version of the Dat that he added new data to, everyone who has a copy of the Dat from Alice will have a copy of Bob's key that they can use to verify that Bob's writes are valid.

On disk, each user's feed is stored in a separate hyperdrive. The original hyperdrive (owned by Alice) is called the `local' hyperdrive. Bob's hyperdrive would be stored separately in the SLEEP folder, addressed by Bob's public key.

In case Bob and Alice write different values for the same file (e.g.~Bob creates a ``fork''), replication will still work when they sync up with each other, but for the forked value the Dat client will return an array of values for that key instead of just one value. The values are linked to the writer that wrote them, so when receiving multiple values, clients can choose the value from Alice, or from Bob, or the latest value, or whatever other strategy they prefer.

If a writer updates the value of a forked key with a new value, they are performing a merge.

\section{5. Existing Work}\label{existing-work}

Dat is inspired by a number of features from existing systems.

\subsection{Git}\label{git}

Git popularized the idea of a directed acyclic graph (DAG) combined with a Merkle tree, a way to represent changes to data where each change is addressed by the secure hash of the change plus all ancestor hashes in a graph. This provides a way to trust data integrity, as the only way a specific hash could be derived by another peer is if they have the same data and change history required to reproduce that hash. 
This is important for reproducibility, as it lets you trust that a specific git commit hash refers to a specific source code state.

Decentralized version control tools for source code like Git provide a protocol for efficiently downloading changes to a set of files, but are optimized for text files and have issues with large files. Solutions like Git-LFS solve this by using HTTP to download large files, rather than the Git protocol. GitHub offers Git-LFS hosting but charges repository owners for bandwidth on popular files. Building a distributed distribution layer for files in a Git repository is difficult due to the design of Git packfiles, which are delta-compressed repository states that do not easily support random access to byte ranges in previous file versions.

\subsection{BitTorrent}\label{bittorrent}

BitTorrent implements a swarm-based file sharing protocol for static datasets. Data is split into fixed-size chunks, hashed, and then that hash is used to discover peers that have the same data. An advantage of using BitTorrent for dataset transfers is that download bandwidth can be fully saturated. Since the file is split into pieces, and peers can efficiently discover which pieces each of the peers they are connected to have, one peer can download non-overlapping regions of the dataset from many peers at the same time in parallel, maximizing network throughput.

Fixed-size chunking has drawbacks for data that changes. BitTorrent assumes all metadata will be transferred up front, which makes it impractical for streaming or updating content. Most BitTorrent clients divide data into 1024 pieces, meaning large datasets could have a very large chunk size, which impacts random access performance (e.g.~for streaming video).

Another drawback of BitTorrent is due to the way clients advertise and discover other peers in the absence of any protocol-level privacy or trust. 
From a user privacy standpoint, BitTorrent leaks what users are accessing or attempting to access, and does not provide the same browsing privacy functions as systems like SSL.

\subsection{Kademlia Distributed Hash
Table}\label{kademlia-distributed-hash-table}

Kademlia (Maymounkov and Mazieres 2002) is a distributed hash table, a distributed key/value store that can serve a similar purpose to DNS servers but has no hard-coded server addresses. All clients in Kademlia are also servers. As long as you know at least one address of another peer in the network, you can ask them for the key you are trying to find; they will either have it or give you some other peers to talk to that are more likely to have it.

If you don't have an initial peer to talk to, most clients use a bootstrap server that randomly gives you a peer in the network to start with. If the bootstrap server goes down, the network still functions as long as other methods can be used to bootstrap new peers (such as sending them peer addresses through side channels, similar to how .torrent files include tracker addresses to try in case Kademlia finds no peers).

Kademlia is distinct from previous DHT designs due to its simplicity. It uses a very simple XOR operation between two keys as its ``distance'' metric to decide which peers are closer to the data being searched for. On paper it seems like it wouldn't work, as it doesn't take into account things like ping speed or bandwidth. Instead, its design is deliberately simple, to minimize the amount of control/gossip messages and the complexity required to implement it. In practice Kademlia has been extremely successful and is widely deployed as the ``Mainline DHT'' for BitTorrent, with support in all popular BitTorrent clients today.

Due to the simplicity of the original Kademlia design, a number of attacks such as DDoS and Sybil attacks have been demonstrated. 
There are protocol extensions (BEPs) which in certain cases mitigate the effects of these attacks, such as BEP 44, which includes a DDoS mitigation technique. Nonetheless, anyone using Kademlia should be aware of the limitations.

\subsection{Peer to Peer Streaming Peer Protocol
(PPSPP)}\label{peer-to-peer-streaming-peer-protocol-ppspp}

PPSPP (\href{https://datatracker.ietf.org/doc/rfc7574/?include_text=1}{IETF RFC 7574}, (Bakker, Petrocco, and Grishchenko 2015)) is a protocol for live streaming content over a peer-to-peer network. It defines a specific type of Merkle tree that allows subsets of the hashes to be requested by a peer, in order to reduce the time-till-playback for end users. BitTorrent, for example, transfers all hashes up front, which is not suitable for live streaming.

Their Merkle trees are ordered using a scheme they call ``bin numbering'', a method for deterministically arranging an append-only log of leaf nodes into an in-order tree layout where non-leaf nodes are derived hashes. If you want to verify a specific node, you only need to request its sibling's hash and all its uncle hashes. PPSPP is very concerned with reducing round trip time and time-till-playback by allowing for many kinds of optimizations, such as packing as many hashes into datagrams as possible when exchanging tree information with peers.

Although PPSPP was designed with streaming video in mind, the ability to request a subset of metadata from a large and/or streaming dataset is very desirable for many other types of datasets.

\subsection{WebTorrent}\label{webtorrent}

With WebRTC, browsers can now make peer-to-peer connections directly to other browsers. BitTorrent uses UDP sockets, which aren't available to browser JavaScript, so it can't be used as-is on the Web.

WebTorrent implements the BitTorrent protocol in JavaScript using WebRTC as the transport. 
This includes the BitTorrent block exchange protocol as well as the tracker protocol, implemented in a way that can enable hybrid nodes talking simultaneously to both BitTorrent and WebTorrent swarms (if a client is capable of making both UDP sockets and WebRTC sockets, such as Node.js). Trackers are exposed to web clients over HTTP or WebSockets.

\subsection{InterPlanetary File
System}\label{interplanetary-file-system}

IPFS is a family of application and network protocols that have peer-to-peer file sharing and data permanence baked in. IPFS abstracts network protocols and naming systems to provide an alternative application delivery platform to today's Web. For example, instead of using HTTP and DNS directly, in IPFS you would use LibP2P streams and IPNS in order to gain access to the features of the IPFS platform.

\subsection{Certificate Transparency/Secure
Registers}\label{certificate-transparencysecure-registers}

The UK Government Digital Service has developed the concept of a register, which they define as a digital public ledger you can trust. In the UK, government registers are beginning to be piloted as a way to expose essential open data sets such that consumers can verify the data has not been tampered with, while allowing the data publishers to update their data sets over time.

The design of registers was inspired by the infrastructure backing the Certificate Transparency (Laurie, Langley, and Kasper 2013) project, initiated at Google, which provides a service on top of SSL certificates that enables service providers to write certificates to a distributed public ledger. Any client or service provider can verify whether a certificate they received is in the ledger, which protects against so-called ``rogue certificates''.

\section{6. 
Reference Implementation}\label{reference-implementation}

The connection logic is implemented in a module called \href{https://www.npmjs.com/package/discovery-swarm}{discovery-swarm}. This builds on discovery-channel and adds connection establishment, management, and statistics. It provides statistics such as how many sources are currently connected and how many well- and badly-behaving sources have been talked to, and it automatically handles connecting and reconnecting to sources. UTP support is implemented in the module \href{https://www.npmjs.com/package/utp-native}{utp-native}.

Our implementation of source discovery is called \href{https://npmjs.org/discovery-channel}{discovery-channel}. We also run a \href{https://www.npmjs.com/package/dns-discovery}{custom DNS server} that Dat clients use (in addition to specifying their own if they need to), as well as a \href{https://github.com/bittorrent/bootstrap-dht}{DHT bootstrap} server. These discovery servers are the only centralized infrastructure we need for Dat to work over the Internet, but they are redundant and interchangeable, they never see the actual data being shared, and anyone can run their own; Dat will still work even if they are all unavailable. If this happens, discovery will just be manual (e.g.~manually sharing IP/ports).

\section{Acknowledgements}\label{acknowledgements}

This work was made possible through grants from the John S. and James L. Knight and Alfred P. Sloan Foundations.

\section*{References}\label{references}
\addcontentsline{toc}{section}{References}

\hypertarget{refs}{}
\hypertarget{ref-aumasson2013blake2}{}
Aumasson, Jean-Philippe, Samuel Neves, Zooko Wilcox-O'Hearn, and Christian Winnerlein. 2013. ``BLAKE2: Simpler, Smaller, Fast as MD5.'' In \emph{International Conference on Applied Cryptography and Network Security}, 119--35. Springer.

\hypertarget{ref-bakker2015peer}{}
Bakker, A, R Petrocco, and V Grishchenko. 2015. 
``Peer-to-Peer Streaming Peer Protocol (PPSPP).''

\hypertarget{ref-bernstein2012high}{}
Bernstein, Daniel J, Niels Duif, Tanja Lange, Peter Schwabe, and Bo-Yin Yang. 2012. ``High-Speed High-Security Signatures.'' \emph{Journal of Cryptographic Engineering}. Springer, 1--13.

\hypertarget{ref-laurie2013certificate}{}
Laurie, Ben, Adam Langley, and Emilia Kasper. 2013. ``Certificate Transparency.''

\hypertarget{ref-maymounkov2002kademlia}{}
Maymounkov, Petar, and David Mazieres. 2002. ``Kademlia: A Peer-to-Peer Information System Based on the XOR Metric.'' In \emph{International Workshop on Peer-to-Peer Systems}, 53--65. Springer.

\hypertarget{ref-mykletun2003providing}{}
Mykletun, Einar, Maithili Narasimha, and Gene Tsudik. 2003. ``Providing Authentication and Integrity in Outsourced Databases Using Merkle Hash Trees.'' \emph{UCI-SCONCE Technical Report}.

\hypertarget{ref-sleep}{}
Ogden, Maxwell, and Mathias Buus. 2017. ``SLEEP - the Dat Protocol on Disk Format.''

\hypertarget{ref-rossi2010ledbat}{}
Rossi, Dario, Claudio Testa, Silvio Valenti, and Luca Muscariello. 2010. ``LEDBAT: The New BitTorrent Congestion Control Protocol.'' In \emph{ICCCN}, 1--6.

\end{document}