 papers/dat-paper.txt | 1765 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 1765 insertions(+), 0 deletions(-)
diff --git a/papers/dat-paper.txt b/papers/dat-paper.txt
new file mode 100644
index 0000000..2b30120
--- /dev/null
+++ b/papers/dat-paper.txt
@@ -0,0 +1,1765 @@
+\documentclass[a4paper,twocolumn]{article}
+\usepackage{lmodern}
+\usepackage{amssymb,amsmath}
+\usepackage{ifxetex,ifluatex}
+\usepackage{fixltx2e} % provides \textsubscript
+\ifnum 0\ifxetex 1\fi\ifluatex 1\fi=0 % if pdftex
+ \usepackage[T1]{fontenc}
+ \usepackage[utf8]{inputenc}
+\else % if luatex or xelatex
+ \ifxetex
+ \usepackage{mathspec}
+ \else
+ \usepackage{fontspec}
+ \fi
+ \defaultfontfeatures{Ligatures=TeX,Scale=MatchLowercase}
+\fi
+% use upquote if available, for straight quotes in verbatim environments
+\IfFileExists{upquote.sty}{\usepackage{upquote}}{}
+% use microtype if available
+\IfFileExists{microtype.sty}{%
+\usepackage{microtype}
+\UseMicrotypeSet[protrusion]{basicmath} % disable protrusion for tt fonts
+}{}
+\usepackage{hyperref}
+\hypersetup{unicode=true,
+ pdfborder={0 0 0},
+ breaklinks=true}
+\urlstyle{same} % don't use monospace font for urls
+\IfFileExists{parskip.sty}{%
+\usepackage{parskip}
+}{% else
+\setlength{\parindent}{0pt}
+\setlength{\parskip}{6pt plus 2pt minus 1pt}
+}
+\setlength{\emergencystretch}{3em} % prevent overfull lines
+\providecommand{\tightlist}{%
+ \setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}}
+\setcounter{secnumdepth}{0}
+% Redefines (sub)paragraphs to behave more like sections
+\ifx\paragraph\undefined\else
+\let\oldparagraph\paragraph
+\renewcommand{\paragraph}[1]{\oldparagraph{#1}\mbox{}}
+\fi
+\ifx\subparagraph\undefined\else
+\let\oldsubparagraph\subparagraph
+\renewcommand{\subparagraph}[1]{\oldsubparagraph{#1}\mbox{}}
+\fi
+
+\title{Dat - Distributed Dataset Synchronization And Versioning}
+\author{Maxwell Ogden, Karissa McKelvey, Mathias Buus Madsen}
+\date{April 2017, Code for Science}
+
+\begin{document}
+\maketitle
+
+\section{Abstract}\label{abstract}
+
+Dat is a protocol designed for syncing folders of data, even if they are
+large or changing constantly. A cryptographically secure register of
+changes to the data over time is used to ensure dataset versions are
+distributed as exact copies of exact versions. Any byte range of any
+version of any file can be efficiently streamed from a Dat repository
+over a network connection. Consumers can choose to fully or partially
+replicate the contents of a remote Dat repository, and can also
+subscribe to live changes. Dat uses built-in public key cryptography to
+encrypt network traffic, allowing it to make certain privacy and
+security guarantees. A group of Dat clients can connect to each other to
+form a public or private decentralized network to exchange data amongst
+each other in a swarm. A reference implementation is provided in
+JavaScript.
+
+\section{1. Background}\label{background}
+
+Many datasets are shared online today using HTTP and FTP, which lack
+built-in support for version control or content addressing of data. This
+results in link rot and content drift as files are moved, updated or
+deleted, leading to an alarming rate of disappearing data references in
+areas such as
+\href{http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0115253}{published
+scientific literature}.
+
+Cloud storage services like S3 ensure availability of data, but they
+have a centralized hub-and-spoke networking model and are therefore
+limited by their bandwidth, meaning popular files can become very
+expensive to share. Services like Dropbox and Google Drive provide
+version control and synchronization on top of cloud storage services,
+which fixes many issues with broken links, but they rely on proprietary
+code and services that require users to store their data on centralized
+cloud infrastructure, with implications for cost, transfer speeds,
+vendor lock-in and user privacy.
+
+Distributed file sharing tools can become faster as files become more
+popular, removing the bandwidth bottleneck and making file distribution
+cheaper. They also use link resolution and discovery systems which can
+prevent broken links, meaning that if the original source goes offline
+other backup sources can be automatically discovered. However these file
+sharing tools today are not supported by Web browsers, do not have good
+privacy guarantees, and do not provide a mechanism for updating files
+without redistributing a new dataset, which could mean entirely
+re-downloading data you already have.
+
+\section{2. Dat}\label{dat}
+
+Dat is a dataset synchronization protocol that does not assume a dataset
+is static or that the entire dataset will be downloaded. The main
+reference implementation is available from npm as
+\texttt{npm\ install\ dat\ -g}.
+
+The protocol is agnostic to the underlying transport e.g.~you could
+implement Dat over carrier pigeon. The key properties of the Dat design
+are explained in this section.
+
+\begin{itemize}
+\tightlist
+\item
+ 2.1 \textbf{Content Integrity} - Data and publisher integrity is
+ verified through use of signed hashes of the content.
+\item
+ 2.2 \textbf{Decentralized Mirroring} - Users sharing the same Dat
+  automatically discover each other and exchange data in a swarm.
+\item
+ 2.3 \textbf{Network Privacy} - Dat provides certain privacy guarantees
+ including end-to-end encryption.
+\item
+ 2.4 \textbf{Incremental Versioning} - Datasets can be efficiently
+ synced, even in real time, to other peers.
+\end{itemize}
+
+\subsection{2.1 Content Integrity}\label{content-integrity}
+
+Content integrity means being able to verify the data you received is
+the exact same version of the data that you expected. This is important
+in a distributed system as this mechanism will catch incorrect data sent
+by bad peers. It also has implications for reproducibility as it lets
+you refer to a specific version of a dataset.
+
+Link rot, when links online stop resolving, and content drift, when data
+changes but the link to the data remains the same, are two common issues
+in data analysis. For example, one day a file called data.zip might
+change, but a typical HTTP link to the file does not include a hash of
+the content, or provide a way to get updated metadata, so clients that
+only have the HTTP link have no way to check if the file changed without
+downloading the entire file again. Referring to a file by the hash of
+its content is called content addressability, and lets users not only
+verify that the data they receive is the version of the data they want,
+but also lets people cite specific versions of the data by referring to
+a specific hash.
+
+Dat uses BLAKE2b cryptographically secure hashes to address content.
+Hashes are arranged in a Merkle tree, a tree where each non-leaf node is
+the hash of all child nodes. Leaf nodes contain pieces of the dataset.
+Due to the properties of secure cryptographic hashes the top hash can
+only be produced if all data below it matches exactly. If two trees have
+matching top hashes then you know that all other nodes in the tree must
+match as well, and you can conclude that your dataset is synchronized.
+Trees are chosen as the primary data structure in Dat as they have a
+number of properties that allow for efficient access to subsets of the
+metadata, which allows Dat to work efficiently over a network
+connection.
+
+\subsubsection{Dat Links}\label{dat-links}
+
+Dat links are Ed25519 public keys which have a length of 32 bytes (64
+characters when Hex encoded). You can represent your Dat link in the
+following ways and Dat clients will be able to understand them:
+
+\begin{itemize}
+\tightlist
+\item
+ The standalone public key:
+\end{itemize}
+
+\texttt{8e1c7189b1b2dbb5c4ec2693787884771201da9...}
+
+\begin{itemize}
+\tightlist
+\item
+ Using the dat:// protocol:
+\end{itemize}
+
+\texttt{dat://8e1c7189b1b2dbb5c4ec2693787884771...}
+
+\begin{itemize}
+\tightlist
+\item
+ As part of an HTTP URL:
+\end{itemize}
+
+\texttt{https://datproject.org/8e1c7189b1b2dbb5...}
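+
+Where it helps, a client can normalize any of these forms back to the
+raw key. A minimal Python sketch (the helper name and the regular
+expression are ours for illustration, not part of the Dat
+specification):
+
+\begin{verbatim}
+import re
+
+def parse_dat_link(link):
+    """Extract the 64 character hex public key from any link form."""
+    match = re.search(r'[0-9a-fA-F]{64}', link)
+    if match is None:
+        raise ValueError('no Dat key found in: ' + link)
+    return match.group(0).lower()
+
+# All three forms above resolve to the same 64 character key, e.g.
+# parse_dat_link('dat://<64 hex characters>')
+# parse_dat_link('https://datproject.org/<64 hex characters>')
+\end{verbatim}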
+
+All messages in the Dat protocol are encrypted and signed using the
+public key during transport. This means that unless you know the public
+key (e.g.~unless the Dat link was shared with you) then you will not be
+able to discover or communicate with any member of the swarm for that
+Dat. Anyone with the public key can verify that messages (such as
+entries in a Dat Stream) were created by a holder of the private key.
+
+Every Dat repository has a corresponding private key that is kept in
+your home folder and never shared. Dat never exposes either the public or
+private key over the network. During the discovery phase the BLAKE2b
+hash of the public key is used as the discovery key. This means that the
+original key is impossible to discover (unless it was shared publicly
+through a separate channel) since only the hash of the key is exposed
+publicly.
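+
+A minimal Python sketch of this discovery key derivation, assuming the
+keyed BLAKE2b construction described for the \texttt{Register} message
+later in this paper (the string `hypercore' hashed with the public key
+as the key):
+
+\begin{verbatim}
+import hashlib
+
+def discovery_key(public_key):
+    """32 byte discovery key derived from the 32 byte public key.
+
+    Only this hash is announced on discovery networks, so observers
+    cannot recover the public key itself.
+    """
+    return hashlib.blake2b(b'hypercore', digest_size=32,
+                           key=public_key).digest()
+\end{verbatim}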
+
+Dat does not provide an authentication mechanism at this time. Instead
+it provides a capability system. Anyone with the Dat link is currently
+considered able to discover and access data. Do not share your Dat links
+publicly if you do not want them to be accessed.
+
+\subsubsection{Hypercore and Hyperdrive}\label{hypercore-and-hyperdrive}
+
+The Dat storage, content integrity, and networking protocols are
+implemented in a module called
+\href{https://npmjs.org/hypercore}{Hypercore}. Hypercore is agnostic to
+the format of the input data, it operates on any stream of binary data.
+For the Dat use case of synchronizing datasets we use a file system
+module on top of Hypercore called
+\href{https://npmjs.org/hyperdrive}{Hyperdrive}.
+
+Dat has a layered abstraction so that users can use Hypercore directly
+to have full control over how they model their data. Hyperdrive works
+well when your data can be represented as files on a filesystem, which
+is the main use case with Dat.
+
+\subsubsection{Hypercore Registers}\label{hypercore-registers}
+
+Hypercore Registers are the core mechanism used in Dat. They are binary
+append-only streams whose contents are cryptographically hashed and
+signed and therefore can be verified by anyone with access to the public
+key of the writer. They are an implementation of the concept known as a
+register, a digital ledger you can trust.
+
+Dat uses two registers, \texttt{content} and \texttt{metadata}. The
+\texttt{content} register contains the files in your repository and
+\texttt{metadata} contains the metadata about the files including name,
+size, last modified time, etc. Dat replicates them both when
+synchronizing with another peer.
+
+When files are added to Dat, each file gets split up into some number of
+chunks, and the chunks are then arranged into a Merkle tree, which is
+used later for version control and replication processes.
+
+\subsection{2.2 Decentralized Mirroring}\label{decentralized-mirroring}
+
+Dat is a peer to peer protocol designed to exchange pieces of a dataset
+amongst a swarm of peers. As soon as a peer acquires their first piece
+of data in the dataset they can choose to become a partial mirror for
+the dataset. If someone else contacts them and needs the piece they
+have, they can choose to share it. This can happen simultaneously while
+the peer is still downloading the pieces they want from others.
+
+\subsubsection{Source Discovery}\label{source-discovery}
+
+An important aspect of mirroring is source discovery, the techniques
+that peers use to find each other. Source discovery means finding the IP
+and port of data sources online that have a copy of that data you are
+looking for. You can then connect to them and begin exchanging data. By
+using source discovery techniques Dat is able to create a network where
+data can be discovered even if the original data source disappears.
+
+Source discovery can happen over many kinds of networks, as long as you
+can model the following actions (a minimal sketch of this interface
+follows the list):
+
+\begin{itemize}
+\tightlist
+\item
+ \texttt{join(key,\ {[}port{]})} - Begin performing regular lookups on
+ an interval for \texttt{key}. Specify \texttt{port} if you want to
+ announce that you share \texttt{key} as well.
+\item
+ \texttt{leave(key,\ {[}port{]})} - Stop looking for \texttt{key}.
+ Specify \texttt{port} to stop announcing that you share \texttt{key}
+ as well.
+\item
+ \texttt{foundpeer(key,\ ip,\ port)} - Called when a peer is found by a
+ lookup
+\end{itemize}
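+
+The sketch below shows one possible shape for such a backend in Python;
+the class and callback names are ours, not part of the Dat
+specification:
+
+\begin{verbatim}
+from abc import ABC, abstractmethod
+
+class DiscoveryNetwork(ABC):
+    """Abstract shape of a source discovery backend (DNS, mDNS, DHT)."""
+
+    def __init__(self, foundpeer):
+        # foundpeer(key, ip, port) is called when a lookup finds a peer.
+        self.foundpeer = foundpeer
+
+    @abstractmethod
+    def join(self, key, port=None):
+        """Begin regular lookups for key; announce port if sharing it."""
+
+    @abstractmethod
+    def leave(self, key, port=None):
+        """Stop looking for key and stop announcing port, if announced."""
+\end{verbatim}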
+
+In the Dat implementation we implement the above actions on top of three
+types of discovery networks:
+
+\begin{itemize}
+\tightlist
+\item
+ DNS name servers - An Internet standard mechanism for resolving keys
+ to addresses
+\item
+ Multicast DNS - Useful for discovering peers on local networks
+\item
+ Kademlia Mainline Distributed Hash Table - Less central points of
+ failure, increases probability of Dat working even if DNS servers are
+ unreachable
+\end{itemize}
+
+Additional discovery networks can be implemented as needed. We chose the
+above three as a starting point to have a complementary mix of
+strategies to increase the probability of source discovery. Additionally
+you can specify a Dat via HTTPS link, which runs the Dat protocol in
+``single-source'' mode, meaning the above discovery networks are not
+used, and instead only that one HTTPS server is used as the only peer.
+
+\subsubsection{Peer Connections}\label{peer-connections}
+
+After the discovery phase, Dat should have a list of potential data
+sources to try and contact. Dat uses either TCP,
+\href{https://en.wikipedia.org/wiki/Micro_Transport_Protocol}{UTP}, or
+HTTP. UTP is designed to not take up all available bandwidth on a
+network (e.g.~so that other people sharing wifi can still use the
+Internet), and is still based on UDP so works with NAT traversal
+techniques like UDP hole punching. HTTP is supported for compatibility
+with static file servers and web browser clients. Note that these are
+the protocols we support in the reference Dat implementation, but the
+Dat protocol itself is transport agnostic.
+
+If an HTTP source is specified Dat will prefer that one over other
+sources. Otherwise when Dat gets the IP and port for a potential TCP or
+UTP source it tries to connect using both protocols. If one connects
+first, Dat aborts the other one. If none connect, Dat will try again
+until it decides that source is offline or unavailable and then stops
+trying to connect to them. Sources Dat is able to connect to go into a
+list of known good sources, so that if the Internet connection goes down
+Dat can use that list to reconnect to known good sources again quickly.
+
+If Dat gets a lot of potential sources it picks a handful at random to
+try and connect to and keeps the rest around as additional sources to
+use later in case it decides it needs more sources.
+
+Once a duplex binary connection to a remote source is open Dat then
+layers on the Hypercore protocol, a message based replication protocol
+that allows two peers to communicate over a stateless channel to request
+and exchange data. You open separate replication channels with many
+peers at once which allows clients to parallelize data requests across
+the entire pool of peers they have established connections with.
+
+\subsection{2.3 Network Privacy}\label{network-privacy}
+
+On the Web today, with SSL, there is a guarantee that the traffic
+between your computer and the server is private. As long as you trust
+the server to not leak your logs, attackers who intercept your network
+traffic will not be able to read the HTTP traffic exchanged between you
+and the server. This is a fairly straightforward model as clients only
+have to trust a single server for some domain.
+
+There is an inherent tradeoff in peer to peer systems of source
+discovery vs.~user privacy. The more sources you contact and ask for
+some data, the more sources you trust to keep what you asked for
+private. Our goal is to have Dat be configurable in respect to this
+tradeoff to allow application developers to meet their own privacy
+guidelines.
+
+It is up to client programs to make design decisions around which
+discovery networks they trust. For example if a Dat client decides to
+use the BitTorrent DHT to discover peers, and they are searching for a
+publicly shared Dat key (e.g.~a key cited publicly in a published
+scientific paper) with known contents, then because of the privacy
+design of the BitTorrent DHT it becomes public knowledge what key that
+client is searching for.
+
+A client could choose to only use discovery networks with certain
+privacy guarantees. For example a client could only connect to an
+approved list of sources that they trust, similar to SSL. As long as
+they trust each source, the encryption built into the Dat network
+protocol will prevent the Dat key they are looking for from being
+leaked.
+
+\subsection{2.4 Incremental Versioning}\label{incremental-versioning}
+
+Given a stream of binary data, Dat splits the stream into chunks, hashes
+each chunk, and arranges the hashes in a specific type of Merkle tree
+that allows for certain replication properties.
+
+Dat is also able to fully or partially synchronize streams in a
+distributed setting even if the stream is being appended to. This is
+accomplished by using the messaging protocol to traverse the Merkle tree
+of remote sources and fetch a strategic set of nodes. Due to the low
+level message oriented design of the replication protocol different node
+traversal strategies can be implemented.
+
+There are two types of versioning performed automatically by Dat.
+Metadata is stored in a folder called \texttt{.dat} in the root folder
+of a repository, and data is stored as normal files in the root folder.
+
+\subsubsection{Metadata Versioning}\label{metadata-versioning}
+
+Dat tries as much as possible to act as a one-to-one mirror of the state
+of a folder and all its contents. When importing files, Dat uses a
+sorted depth-first recursion to list all the files in the tree. For each
+file it finds, it grabs the filesystem metadata (filename, Stat object,
+etc) and checks if there is already an entry for this filename with this
+exact metadata already represented in the Dat repository metadata. If
+the file with this metadata matches exactly the newest version of the
+file metadata stored in Dat, then this file will be skipped (no change).
+
+If the metadata differs from the current existing one (or there are no
+entries for this filename at all in the history), then this new metadata
+entry will be appended as the new `latest' version for this file in the
+append-only SLEEP metadata content register (described below).
+
+\subsubsection{Content Versioning}\label{content-versioning}
+
+In addition to storing a historical record of filesystem metadata, the
+content of the files themselves are also capable of being stored in a
+version controlled manner. The default storage system used in Dat stores
+the files as files. This has the advantage of being very straightforward
+for users to understand, but the downside of not storing old versions of
+content by default.
+
+In contrast to other version control systems like Git, Dat by default
+only stores the current set of checked out files on disk in the
+repository folder, not old versions. It does store all previous metadata
+for old versions in \texttt{.dat}. Git for example stores all previous
+content versions and all previous metadata versions in the \texttt{.git}
+folder. Because Dat is designed for larger datasets, if it stored all
+previous file versions in \texttt{.dat}, then the \texttt{.dat} folder
+could easily fill up the user's hard drive inadvertently. Therefore Dat
+has multiple storage modes based on usage.
+
+Hypercore registers include an optional \texttt{data} file that stores
+all chunks of data. In Dat, only the \texttt{metadata.data} file is
+used, while the \texttt{content.data} file is not. The default
+behavior is to store the current files only as normal files. If you want
+to run an `archival' node that keeps all previous versions, you can
+configure Dat to use the \texttt{content.data} file instead. For
+example, on a shared server with lots of storage you probably want to
+store all versions. However on a workstation machine that is only
+accessing a subset of one version, the default mode of storing all
+metadata plus the current set of downloaded files is acceptable, because
+you know the server has the full history.
+
+\subsubsection{Merkle Trees}\label{merkle-trees}
+
+Registers in Dat use a specific method of encoding a Merkle tree where
+hashes are positioned by a scheme called binary in-order interval
+numbering or just ``bin'' numbering. This is just a specific,
+deterministic way of laying out the nodes in a tree. For example a tree
+with 7 nodes will always be arranged like this:
+
+\begin{verbatim}
+0
+  1
+2
+      3
+4
+  5
+6
+\end{verbatim}
+
+In Dat, the hashes of the chunks of files are always even numbers, at
+the wide end of the tree. So the above tree has four original values
+that become the even numbers:
+
+\begin{verbatim}
+chunk0 -> 0
+chunk1 -> 2
+chunk2 -> 4
+chunk3 -> 6
+\end{verbatim}
+
+In the resulting Merkle tree, the even and odd nodes store different
+information:
+
+\begin{itemize}
+\tightlist
+\item
+ Evens - List of data hashes {[}chunk0, chunk1, chunk2, \ldots{}{]}
+\item
+ Odds - List of Merkle hashes (hashes of child even nodes) {[}hash0,
+ hash1, hash2, \ldots{}{]}
+\end{itemize}
+
+These two lists get interleaved into a single register such that the
+indexes (position) in the register are the same as the bin numbers from
+the Merkle tree.
+
+All odd hashes are derived by hashing the two child nodes, e.g.~given
+hash0 is \texttt{hash(chunk0)} and hash2 is \texttt{hash(chunk1)}, hash1
+is \texttt{hash(hash0\ +\ hash2)}.
+
+For example a register with two data entries would look something like
+this (pseudocode):
+
+\begin{verbatim}
+0. hash(chunk0)
+1. hash(hash(chunk0) + hash(chunk1))
+2. hash(chunk1)
+\end{verbatim}
+
+It is possible for the in-order Merkle tree to have multiple roots at
+once. A root is defined as a parent node with a full set of child node
+slots filled below it.
+
+For example, this tree has 2 roots (1 and 4):
+
+\begin{verbatim}
+0
+ 1
+2
+
+4
+\end{verbatim}
+
+This tree has one root (3):
+
+\begin{verbatim}
+0
+  1
+2
+      3
+4
+  5
+6
+\end{verbatim}
+
+This one has one root (1):
+
+\begin{verbatim}
+0
+ 1
+2
+\end{verbatim}
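+
+The layout above can be computed with a little arithmetic. The following
+Python sketch (the helper names are ours, not part of the specification)
+reproduces the examples in this section, including the roots:
+
+\begin{verbatim}
+def bin_index(depth, offset):
+    """Tree index of the node at depth (0 = leaf) and offset."""
+    return (2 * offset + 1) * (1 << depth) - 1
+
+def depth(index):
+    """Depth of a node: the number of trailing 1 bits in its index."""
+    d = 0
+    while index & 1:
+        index >>= 1
+        d += 1
+    return d
+
+def parent(index):
+    d = depth(index)
+    return bin_index(d + 1, (index >> (d + 1)) >> 1)
+
+def children(index):
+    d = depth(index)
+    if d == 0:
+        return None  # leaves have no children
+    offset = index >> (d + 1)
+    return bin_index(d - 1, 2 * offset), bin_index(d - 1, 2 * offset + 1)
+
+def full_roots(num_chunks):
+    """Roots of the tree holding num_chunks leaves, left to right."""
+    roots, leaf_offset = [], 0
+    while num_chunks > 0:
+        size = 1 << (num_chunks.bit_length() - 1)  # largest full subtree
+        roots.append(2 * leaf_offset + size - 1)
+        leaf_offset += size
+        num_chunks -= size
+    return roots
+
+assert [bin_index(0, i) for i in range(4)] == [0, 2, 4, 6]
+assert parent(0) == 1 and children(3) == (1, 5)
+assert full_roots(3) == [1, 4]  # the two-root example above
+assert full_roots(4) == [3]     # the single-root example above
+assert full_roots(2) == [1]     # the final example above
+\end{verbatim}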
+
+\subsubsection{Replication Example}\label{replication-example}
+
+This section describes at a high level the replication flow of a Dat. Note
+that the low level details are available by reading the SLEEP section
+below. For the sake of illustrating how this works in practice in a
+networked replication scenario, consider a folder with two files:
+
+\begin{verbatim}
+bat.jpg
+cat.jpg
+\end{verbatim}
+
+To send these files to another machine using Dat, you would first add
+them to a Dat repository by splitting them into chunks and constructing
+SLEEP files representing the chunks and filesystem metadata.
+
+Let's assume \texttt{cat.jpg} produces three chunks, and
+\texttt{bat.jpg} also produces three chunks, each around 64KB. Dat
+stores this in a representation called SLEEP, but here we will show a
+pseudo-representation for the purposes of illustrating the replication
+process. The six chunks get sorted into a list like this:
+
+\begin{verbatim}
+bat-1
+bat-2
+bat-3
+cat-1
+cat-2
+cat-3
+\end{verbatim}
+
+These chunks then each get hashed, and the hashes get arranged into a
+Merkle tree (the content register):
+
+\begin{verbatim}
+0 - hash(bat-1)
+  1 - hash(0 + 2)
+2 - hash(bat-2)
+      3 - hash(1 + 5)
+4 - hash(bat-3)
+  5 - hash(4 + 6)
+6 - hash(cat-1)
+8 - hash(cat-2)
+  9 - hash(8 + 10)
+10 - hash(cat-3)
+\end{verbatim}
+
+Next we calculate the root hashes of our tree, in this case 3 and 9. We
+then hash them together, and cryptographically sign the hash. This
+signed hash now can be used to verify all nodes in the tree, and the
+signature proves it was produced by us, the holder of the private key
+for this Dat.
+
+This tree is for the hashes of the contents of the photos. There is also
+a second Merkle tree that Dat generates that represents the list of
+files and their metadata and looks something like this (the metadata
+register):
+
+\begin{verbatim}
+0 - hash({contentRegister: '9e29d624...'})
+ 1 - hash(0 + 2)
+2 - hash({"bat.png", first: 0, last: 3})
+4 - hash({"cat.png", first: 3, last: 3})
+\end{verbatim}
+
+The first entry in this feed is a special metadata entry that tells Dat
+the address of the second feed (the content register). Note that node 3
+is not included yet, because 3 is the hash of \texttt{1\ +\ 5}, but 5
+does not exist yet, so will be written at a later update.
+
+Now we're ready to send our metadata to the other peer. The first
+message is a \texttt{Register} message with the key that was shared for
+this Dat. Let's call ourselves Alice and the other peer Bob. Alice sends
+Bob a \texttt{Want} message that declares they want all nodes in the
+file list (the metadata register). Bob replies with a single
+\texttt{Have} message that indicates he has 3 nodes of data. Alice sends
+three \texttt{Request} messages, one for each leaf node
+(\texttt{0,\ 2,\ 4}). Bob sends back three \texttt{Data} messages. The
+first \texttt{Data} message contains the content register key, the hash
+of the sibling, in this case node \texttt{2}, the hash of the uncle root
+\texttt{4}, as well as a signature for the root hashes (in this case
+\texttt{1,\ 4}). Alice verifies the integrity of this first
+\texttt{Data} message by hashing the metadata received for the content
+register metadata to produce the hash for node \texttt{0}. They then
+hash that hash with the hash for node \texttt{2} that was included to
+reproduce hash \texttt{1}, and hash their \texttt{1} with the value for
+\texttt{4} they received, using the signature that was included to
+verify it is the same data. When the next \texttt{Data} message is
+received, a similar process is performed to verify the content.
+
+Now Alice has the full list of files in the Dat, but decides they only
+want to download \texttt{cat.jpg}. Alice knows they want blocks 3
+through 5 from the content register. First Alice sends another
+\texttt{Register} message with the content key to open a new replication
+channel over the connection. Then Alice sends three \texttt{Request}
+messages, one for each of blocks \texttt{3,\ 4,\ 5}. Bob sends back
+three \texttt{Data} messages with the data for each block, as well as
+the hashes needed to verify the content in a way similar to the process
+described above for the metadata feed.
+
+\section{3. SLEEP Specification}\label{sleep-specification}
+
+This section is a technical description of the SLEEP format intended for
+implementers. SLEEP is the on-disk format that Dat produces and
+uses. It is a set of 9 files that hold all of the metadata needed to
+list the contents of a Dat repository and verify the integrity of the
+data you receive. SLEEP is designed to work with REST, allowing servers
+to be plain HTTP file servers serving the static SLEEP files, meaning
+you can implement a Dat protocol client using HTTP with a static HTTP
+file server as the backend.
+
+SLEEP files contain metadata about the data inside a Dat repository,
+including cryptographic hashes, cryptographic signatures, filenames and
+file permissions. The SLEEP format is specifically engineered to allow
+efficient access to subsets of the metadata and/or data in the
+repository, even on very large repositories, which enables Dat's peer to
+peer networking to be fast.
+
+The acronym SLEEP is a slumber related pun on REST and stands for
+Syncable Lightweight Event Emitting Persistence. The Event Emitting part
+refers to how SLEEP files are append-only in nature, meaning they grow
+over time and new updates can be subscribed to as a realtime feed of
+events through the Dat protocol.
+
+The SLEEP version described here, used in Dat as of 2017, is SLEEP V2.
+SLEEP V1 is documented at http://specs.okfnlabs.org/sleep.
+
+\subsubsection{SLEEP Files}\label{sleep-files}
+
+SLEEP is a set of 9 files that should be stored with the following
+names. In Dat, the files are stored in a folder called \texttt{.dat} in
+the top level of the repository.
+
+\begin{verbatim}
+metadata.key
+metadata.signatures
+metadata.bitfield
+metadata.tree
+metadata.data
+content.key
+content.signatures
+content.bitfield
+content.tree
+\end{verbatim}
+
+The files prefixed with \texttt{content} store metadata about the
+primary data in a Dat repository, for example the raw binary contents of
+the files. The files prefixed with \texttt{metadata} store metadata
+about the files in the repository, for example the filenames, file
+sizes, and file permissions. The \texttt{content} and \texttt{metadata}
+files are both Hypercore registers, making SLEEP a set of two Hypercore
+registers.
+
+\subsubsection{SLEEP File Headers}\label{sleep-file-headers}
+
+The following structured binary format is used for \texttt{signatures},
+\texttt{bitfield}, and \texttt{tree} files. The header contains metadata
+as well as information needed to decode the rest of the files after the
+header. SLEEP files are designed to be easy to append new data to at the
+end, easy to read arbitrary byte offsets in the middle, and are
+relatively flat, simple files that rely on the filesystem for the heavy
+lifting.
+
+SLEEP files are laid out like this:
+
+\begin{verbatim}
+<32 byte header>
+<fixed-size entry 1>
+<fixed-size entry 2>
+<fixed-size entry ...>
+<fixed-size entry n>
+\end{verbatim}
+
+\begin{itemize}
+\tightlist
+\item
+ 32 byte header
+\item
+ 4 bytes - magic byte (value varies depending on which file, used to
+ quickly identify which file type it is)
+\item
+ 1 byte - version number of the file header protocol, current version
+ is 0
+\item
+ 2 byte Uint16BE - entry size, describes how long each entry in the
+ file is
+\item
+ 1 byte - length prefix for body
+\item
+  rest of 32 byte header - string describing the key or hash algorithm
+  (in Dat either `Ed25519' or `BLAKE2b' depending on the file). The
+  length of this string matches the length in the previous length prefix
+  field. This string must fit within the 32 byte header limitation (24
+  bytes reserved for string). Unused bytes should be filled with zeroes.
+\end{itemize}
+
+Possible values in the Dat implementation for the body field are:
+
+\begin{verbatim}
+Ed25519
+BLAKE2b
+\end{verbatim}
+
+To calculate the offset of some entry position, first read the header
+and get the entry size, then do
+\texttt{32\ +\ entrySize\ *\ entryIndex}. To calculate how many entries
+are in a file, you can use the entry size and the filesize on disk and
+do \texttt{(fileSize\ -\ 32)\ /\ entrySize}.
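+
+A minimal Python sketch of reading a SLEEP header and locating entries,
+following the layout above (function names are ours for illustration):
+
+\begin{verbatim}
+import struct
+
+def read_sleep_header(buf):
+    """Parse the 32 byte SLEEP header: magic, version, entry size,
+    algorithm name length prefix, algorithm name string."""
+    magic, version, entry_size, name_len = struct.unpack('>4sBHB', buf[:8])
+    algorithm = buf[8:8 + name_len].decode('ascii')
+    return magic, version, entry_size, algorithm
+
+def entry_offset(entry_size, entry_index):
+    """Byte offset of an entry: skip the header, then fixed size entries."""
+    return 32 + entry_size * entry_index
+
+def entry_count(file_size, entry_size):
+    """Number of entries in a SLEEP file of file_size bytes."""
+    return (file_size - 32) // entry_size
+\end{verbatim}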
+
+As mentioned above, \texttt{signatures}, \texttt{bitfield} and
+\texttt{tree} are the three SLEEP files. There are two additional files,
+\texttt{key}, and \texttt{data}, which do not contain SLEEP file headers
+and store plain serialized data for easy access. \texttt{key} stores the
+public key that is described by the \texttt{signatures} file, and
+\texttt{data} stores the raw chunk data that the \texttt{tree} file
+contains the hashes and metadata for.
+
+\subsubsection{File Descriptions}\label{file-descriptions}
+
+\paragraph{key}\label{key}
+
+The public key used to verify the signatures in the \texttt{signatures}
+file. Stored in binary as a single buffer written to disk. To find out
+what format of key is stored in this file, read the header of
+\texttt{signatures}. In Dat, it's always an Ed25519 public key, but other
+implementations can specify other key types using a string value in that
+header.
+
+\paragraph{tree}\label{tree}
+
+A SLEEP formatted 32 byte header with data entries representing a
+serialized Merkle tree based on the data in the data storage layer. All
+the fixed size nodes are written in in-order tree notation. The header
+algorithm string for \texttt{tree} files is \texttt{BLAKE2b}. The entry
+size is 40 bytes. Entries are formatted like this:
+
+\begin{verbatim}
+<32 byte header>
+ <4 byte magic string: 0x05025702>
+ <1 byte version number: 0>
+ <2 byte entry size: 40>
+ <1 byte algorithm name length prefix: 7>
+ <7 byte algorithm name: BLAKE2b>
+ <17 zeroes>
+<40 byte entries>
+ <32 byte BLAKE2b hash>
+ <8 byte Uint64BE children leaf byte length>
+\end{verbatim}
+
+The children leaf byte length is the sum of the byte lengths of all
+leaf nodes in the tree below this node.
+
+This file uses the in-order notation, meaning even entries are leaf
+nodes and odd entries are parent nodes (non-leaf).
+
+To prevent pre-image attacks, all hashes start with a one byte type
+descriptor:
+
+\begin{verbatim}
+0 - LEAF
+1 - PARENT
+2 - ROOT
+\end{verbatim}
+
+To calculate leaf node entries (the hashes of the data entries) we hash
+this data:
+
+\begin{verbatim}
+BLAKE2b(
+ <1 byte type>
+ 0
+ <8 bytes Uint64BE>
+ length of entry data
+ <entry data>
+)
+\end{verbatim}
+
+Then we take this 32 byte hash and write it to the tree as 40 bytes like
+this:
+
+\begin{verbatim}
+<32 bytes>
+ BLAKE2b hash
+<8 bytes Uint64BE>
+ length of data
+\end{verbatim}
+
+Note that the Uint64 of length of data is included both in the hashed
+data and written at the end of the entry. This is to expose more
+metadata to Dat for advanced use cases such as verifying data length in
+sparse replication scenarios.
+
+To calculate parent node entries (the hashes of the leaf nodes) we hash
+this data:
+
+\begin{verbatim}
+BLAKE2b(
+ <1 byte>
+ 1
+ <8 bytes Uint64BE>
+ left child length + right child length
+ <32 bytes>
+ left child hash
+ <32 bytes>
+ right child hash
+)
+\end{verbatim}
+
+Then we take this 32 byte hash and write it to the tree as 40 bytes like
+this:
+
+\begin{verbatim}
+<32 bytes>
+ BLAKE2b hash
+<8 bytes Uint64BE>
+ left child length + right child length
+\end{verbatim}
+
+The reason the tree entries contain data lengths is to allow for sparse
+mode replication. Encoding lengths (and including lengths in all hashes)
+means you can verify Merkle subtrees independently of the rest of the
+tree, which happens during sparse replication scenarios.
+
+The tree file corresponds directly to the \texttt{data} file.
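+
+A minimal Python sketch of the leaf and parent hash construction above
+(the document specifies 32 byte BLAKE2b hashes; helper names are ours):
+
+\begin{verbatim}
+import hashlib
+import struct
+
+LEAF, PARENT = b'\x00', b'\x01'
+
+def leaf_hash(entry_data):
+    """BLAKE2b(<type 0> <Uint64BE length> <entry data>)."""
+    h = hashlib.blake2b(digest_size=32)
+    h.update(LEAF)
+    h.update(struct.pack('>Q', len(entry_data)))
+    h.update(entry_data)
+    return h.digest()
+
+def parent_hash(left_hash, left_len, right_hash, right_len):
+    """BLAKE2b(<type 1> <Uint64BE combined length> <left> <right>)."""
+    h = hashlib.blake2b(digest_size=32)
+    h.update(PARENT)
+    h.update(struct.pack('>Q', left_len + right_len))
+    h.update(left_hash)
+    h.update(right_hash)
+    return h.digest()
+
+def tree_entry(node_hash, byte_length):
+    """The 40 byte entry written to the tree file: hash, then length."""
+    return node_hash + struct.pack('>Q', byte_length)
+\end{verbatim}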
+
+\paragraph{data}\label{data}
+
+The \texttt{data} file is only included in the SLEEP format for the
+\texttt{metadata.*} prefixed files which contains filesystem metadata
+and not actual file data. For the \texttt{content.*} files, the data is
+stored externally (in Dat it is stored as normal files on the filesystem
+and not in a SLEEP file). However you can configure Dat to use a
+\texttt{content.data} file if you want and it will still work. If you
+want to store the full history of all versions of all files, using the
+\texttt{content.data} file would provide that guarantee, but would have
+the disadvantage of storing files as chunks merged into one huge file
+(not as user friendly).
+
+The \texttt{data} file does not contain a SLEEP file header. It just
+contains a bunch of concatenated data entries. Entries are written in
+the same order as they appear in the \texttt{tree} file. To read a
+\texttt{data} file, first decode the \texttt{tree} file and for every
+leaf in the \texttt{tree} file you can calculate a data offset for the
+data described by that leaf node in the \texttt{data} file.
+
+\subparagraph{Index Lookup}\label{index-lookup}
+
+For example, if we wanted to seek to a specific entry offset (say entry
+42):
+
+\begin{itemize}
+\tightlist
+\item
+ First, read the header of the \texttt{tree} file and get the entry
+ size, then do \texttt{32\ +\ entrySize\ *\ 42} to get the raw tree
+ index: \texttt{32\ +\ (40\ *\ 42)}
+\item
+ Since we want the leaf entry (even node in the in-order layout), we
+ multiply the entry index by 2: \texttt{32\ +\ (40\ *\ (42\ *\ 2))}
+\item
+ Read the 40 bytes at that offset in the \texttt{tree} file to get the
+ leaf node entry.
+\item
+ Read the last 8 bytes of the entry to get the length of the data entry
+\item
+ To calculate the offset of where in the \texttt{data} file your entry
+ begins, you need to sum all the lengths of all the earlier entries in
+ the tree. The most efficient way to do this is to sum all the previous
+ parent node (non-leaf) entry lengths. You can also sum all leaf node
+  lengths, but parent nodes contain the sum of their children's lengths
+ so it's more efficient to use parents. During Dat replication, these
+ nodes are fetched as part of the Merkle tree verification so you will
+ already have them locally. This is a log(N) operation where N is the
+ entry index. Entries are also small and therefore easily cacheable.
+\item
+ Once you get the offset, you use the length you decoded above and read
+ N bytes (where N is the decoded length) at the offset in the
+ \texttt{data} file. You can verify the data integrity using the 32
+ byte hash from the \texttt{tree} entry.
+\end{itemize}
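+
+A minimal Python sketch of resolving a chunk index to its byte offset
+and length in the \texttt{data} file. It follows the log(N) idea above
+by summing the aggregate lengths stored on the roots of the full
+subtrees covering all earlier chunks (\texttt{full\_roots} is the same
+helper as in the bin numbering sketch; the file handling is
+illustrative):
+
+\begin{verbatim}
+import struct
+
+def full_roots(num_chunks):
+    roots, leaf_offset = [], 0
+    while num_chunks > 0:
+        size = 1 << (num_chunks.bit_length() - 1)
+        roots.append(2 * leaf_offset + size - 1)
+        leaf_offset += size
+        num_chunks -= size
+    return roots
+
+def read_tree_entry(tree, node_index):
+    """One 40 byte entry: 32 byte hash + Uint64BE byte length."""
+    tree.seek(32 + 40 * node_index)
+    entry = tree.read(40)
+    return entry[:32], struct.unpack('>Q', entry[32:])[0]
+
+def data_offset(tree, chunk_index):
+    """Byte offset and length of chunk_index in the data file."""
+    offset = 0
+    for root in full_roots(chunk_index):
+        _, length = read_tree_entry(tree, root)
+        offset += length  # all chunks under this subtree come earlier
+    _, chunk_length = read_tree_entry(tree, 2 * chunk_index)
+    return offset, chunk_length
+
+# with open('.dat/content.tree', 'rb') as tree:
+#     offset, length = data_offset(tree, 42)
+\end{verbatim}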
+
+\subparagraph{Byte Lookup}\label{byte-lookup}
+
+The above method illustrates how to resolve a chunk position index to a
+byte offset. You can also do the reverse operation, resolving a byte
+offset to a chunk position index. This is used to stream arbitrary
+random access regions of files in sparse replication scenarios.
+
+\begin{itemize}
+\tightlist
+\item
+ First, you start by calculating the current Merkle roots
+\item
+  Each node in the tree (including these root nodes) stores the
+  aggregate byte size of all the nodes below it. So the roots
+  cumulatively describe all possible byte ranges for this repository.
+\item
+  Find the root that contains the byte range of the offset you are
+  looking for and get the node information for all of that node's
+  children using the Index Lookup method, and recursively repeat this
+  step until you find the lowest child node that describes this byte
+  range.
+\item
+ The chunk described by this child node will contain the byte range you
+ are looking for. You can use the \texttt{byteOffset} property in the
+ \texttt{Stat} metadata object to seek into the right position in the
+ content for the start of this chunk.
+\end{itemize}
+
+\subparagraph{Metadata Overhead}\label{metadata-overhead}
+
+Using this scheme, if you write 4GB of data using on average 64KB data
+chunks (note: chunks can be variable length and do not need to be the
+same size), your tree file will be around 5MB (roughly 0.12\% overhead).
+
+\paragraph{signatures}\label{signatures}
+
+A SLEEP formatted 32 byte header with data entries being 64 byte
+signatures.
+
+\begin{verbatim}
+<32 byte header>
+ <4 byte magic string: 0x05025701>
+ <1 byte version number: 0>
+ <2 byte entry size: 64>
+ <1 byte algorithm name length prefix: 7>
+ <7 byte algorithm name: Ed25519>
+ <17 zeroes>
+<64 byte entries>
+ <64 byte Ed25519 signature>
+\end{verbatim}
+
+Every time the tree is updated we sign the current roots of the Merkle
+tree, and append them to the signatures file. The signatures file starts
+with no entries. Each time a new leaf is appended to the \texttt{tree}
+file (aka whenever data is added to a Dat), we take all root hashes at
+the current state of the Merkle tree and hash and sign them, then append
+them as a new entry to the signatures file.
+
+\begin{verbatim}
+Ed25519 sign(
+ BLAKE2b(
+ <1 byte>
+ 2 // root type
+ for (every root node left-to-right) {
+ <32 byte root hash>
+ <8 byte Uint64BE root tree index>
+ <8 byte Uint64BE child byte lengths>
+ }
+ )
+)
+\end{verbatim}
+
+The reason we hash all the root nodes is that the BLAKE2b hash above
+can only be calculated if you have all of the pieces of data required to
+generate all the intermediate hashes. This is the crux of Dat's data
+integrity guarantees.
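+
+A minimal Python sketch of the digest that gets signed. The Ed25519
+signing itself would use an external library, which we leave out here;
+\texttt{roots} is assumed to be a list of (hash, tree index, byte
+length) tuples read from the \texttt{tree} file:
+
+\begin{verbatim}
+import hashlib
+import struct
+
+def root_signature_digest(roots):
+    """BLAKE2b digest over all current root nodes, left to right."""
+    h = hashlib.blake2b(digest_size=32)
+    h.update(b'\x02')  # root type byte
+    for root_hash, tree_index, byte_length in roots:
+        h.update(root_hash)                       # 32 byte root hash
+        h.update(struct.pack('>Q', tree_index))   # Uint64BE tree index
+        h.update(struct.pack('>Q', byte_length))  # Uint64BE child lengths
+    return h.digest()
+
+# The 64 byte entry appended to the signatures file is then
+# ed25519_sign(private_key, root_signature_digest(roots)).
+\end{verbatim}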
+
+\paragraph{bitfield}\label{bitfield}
+
+A SLEEP formatted 32 byte header followed by a series of 3328 byte long
+entries.
+
+\begin{verbatim}
+<32 byte header>
+ <4 byte magic string: 0x05025700>
+ <1 byte version number: 0>
+ <2 byte entry size: 3328>
+ <1 byte algorithm name length: 0>
+ <1 byte algorithm name: 0>
+ <24 zeroes>
+<3328 byte entries> // (2048 + 1024 + 256)
+\end{verbatim}
+
+The bitfield describes which pieces of data you have, and which nodes in
+the \texttt{tree} file have been written. This file exists as an index
+of the \texttt{tree} and \texttt{data} to quickly figure out which
+pieces of data you have or are missing. This file can be regenerated if
+you delete it, so it is considered a materialized index.
+
+The \texttt{bitfield} file actually contains three bitfields of
+different sizes. A bitfield (AKA bitmap) is defined as a set of bits
+where each bit (0 or 1) represents if you have or do not have a piece of
+data at that bit index. So if there is a dataset of 10 cat pictures, and
+you have pictures 1, 3, and 5 but are missing the rest, your bitfield
+would look like \texttt{1010100000}.
+
+Each entry contains three objects:
+
+\begin{itemize}
+\tightlist
+\item
+  Data Bitfield (1024 bytes) - 1 bit for each data entry that you
+ have synced (1 for every entry in \texttt{data}).
+\item
+ Tree Bitfield (2048 bytes) - 1 bit for every tree entry (all nodes in
+ \texttt{tree})
+\item
+ Bitfield Index (256 bytes) - This is an index of the Data Bitfield
+ that makes it efficient to figure out which pieces of data are missing
+ from the Data Bitfield without having to do a linear scan.
+\end{itemize}
+
+The Data Bitfield is 1KB somewhat arbitrarily, but the idea is that
+because most filesystems work in 4KB chunk sizes, we can fit the Data,
+Tree and Index in less than 4KB of data for efficient writes to the
+filesystem. The Tree and Index sizes are based on the Data size (the
+Tree has twice as many entries as the Data, odd and even nodes vs just
+even nodes in \texttt{tree}, and the Index is always 1/4th the size).
+
+To generate the Index, you take 2 bytes at a time from the Data
+Bitfield, check if all bits in those 2 bytes are the same, and generate
+4 bits of Index metadata for every 2 bytes of Data (hence how 1024 bytes
+of Data ends up as 256 bytes of Index).
+
+First you generate a 2 bit tuple for the 2 bytes of Data:
+
+\begin{verbatim}
+if (data is all 1's) then [1,1]
+if (data is all 0's) then [0,0]
+if (data is not all the same) then [1, 0]
+\end{verbatim}
+
+The Index itself is an in-order binary tree, not a traditional bitfield.
+To generate the tree, you take the tuples you generate above and then
+write them into a tree like the following example, where non-leaf nodes
+are generated using the above scheme by looking at the results of the
+relative even child tuples for each odd parent tuple:
+
+\begin{verbatim}
+// for e.g. 16 bytes (8 tuples) of
+// sparsely replicated data
+0 - [00 00 00 00]
+1 - [10 10 10 10]
+2 - [11 11 11 11]
+\end{verbatim}
+
+The tuples at entry \texttt{1} above are \texttt{{[}1,0{]}} because the
+relative child tuples are not uniform. In the following example, all
+non-leaf nodes are \texttt{{[}1,1{]}} because their relative children
+are all uniform (\texttt{{[}1,1{]}}).
+
+\begin{verbatim}
+// for e.g. 32 bytes (16 tuples) of
+// fully replicated data (all 1's)
+0 - [11 11 11 11]
+1 - [11 11 11 11]
+2 - [11 11 11 11]
+3 - [11 11 11 11]
+4 - [11 11 11 11]
+5 - [11 11 11 11]
+6 - [11 11 11 11]
+\end{verbatim}
+
+Using this scheme, to represent 32 bytes of data it takes at most 8
+bytes of Index. In this example it compresses nicely as it's all
+contiguous ones on disk; similarly, an empty bitfield would be all
+zeroes.
+
+If you write 4GB of data using on average 64KB data chunk size, your
+bitfield will be at most 32KB.
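+
+A minimal Python sketch of the tuple generation described above (the
+function names are ours):
+
+\begin{verbatim}
+def index_tuple(two_bytes):
+    """2 bit tuple summarizing 2 bytes (16 bits) of the Data Bitfield."""
+    if two_bytes == b'\xff\xff':
+        return (1, 1)   # all bits set: region fully present
+    if two_bytes == b'\x00\x00':
+        return (0, 0)   # no bits set: region fully missing
+    return (1, 0)       # mixed: sparsely replicated
+
+def leaf_tuples(data_bitfield):
+    """Leaf tuples of the in-order Index tree, two data bytes at a time."""
+    return [index_tuple(data_bitfield[i:i + 2])
+            for i in range(0, len(data_bitfield), 2)]
+
+def parent_tuple(left, right):
+    """Non-leaf Index tuples apply the same scheme to their children."""
+    if left == right == (1, 1):
+        return (1, 1)
+    if left == right == (0, 0):
+        return (0, 0)
+    return (1, 0)
+\end{verbatim}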
+
+\paragraph{metadata.data}\label{metadata.data}
+
+This file is used to store content described by the rest of the
+\texttt{metadata.*} hypercore SLEEP files. Whereas the
+\texttt{content.*} SLEEP files describe data stored as actual files
+cloned in the Dat repository filesystem, the \texttt{metadata} data
+feed is stored inside the \texttt{.dat} folder along with the rest of
+the SLEEP files.
+
+The contents of this file is a series of versions of the Dat filesystem
+tree. As this is a hypercore data feed, it's just an append-only log of
+binary data entries. The challenge is representing a tree in a
+one-dimensional way to make it representable as a Hypercore register. For
+example, imagine three files:
+
+\begin{verbatim}
+~/dataset $ ls
+figures
+ graph1.png
+ graph2.png
+results.csv
+
+1 directory, 3 files
+\end{verbatim}
+
+We want to take this structure and map it to a serialized representation
+that gets written into an append only log in a way that still allows for
+efficient random access by file path.
+
+To do this, we convert the filesystem metadata into entries in a feed
+like this:
+
+\begin{verbatim}
+{
+ "path": "/results.csv",
+ children: [[]],
+ sequence: 0
+}
+{
+ "path": "/figures/graph1.png",
+ children: [[0], []],
+ sequence: 1
+}
+{
+  "path": "/figures/graph2.png",
+ children: [[0], [1]],
+ sequence: 2
+}
+\end{verbatim}
+
+\subparagraph{Filename Resolution}\label{filename-resolution}
+
+Each sequence represents adding one of the files to the register, so at
+sequence 0 the filesystem state only has a single file,
+\texttt{results.csv} in it. At sequence 1, there are only 2 files added
+to the register, and at sequence 2 all files are finally added. The
+\texttt{children} field represents a shorthand way of declaring which
+other files at every level of the directory hierarchy exist alongside
+the file being added at that revision. For example at the time of
+sequence 1, children is \texttt{{[}{[}0{]},\ {[}{]}{]}}. The first
+sub-array, \texttt{{[}0{]}}, represents the first folder in the
+\texttt{path}, which is the root folder \texttt{/}. In this case
+\texttt{{[}0{]}} means the root folder at this point in time only has a
+single file, the file that is the subject of sequence \texttt{0}. The
+second subarray is empty \texttt{{[}{]}} because there are no other
+existing files in the second folder in the \texttt{path},
+\texttt{figures}.
+
+To look up a file by filename, you fetch the latest entry in the log,
+then use the \texttt{children} metadata in that entry to look up the
+longest common ancestor based on the parent folders of the filename you
+are querying. You can then recursively repeat this operation until you
+find the \texttt{path} entry you are looking for (or you exhaust all
+options which means the file does not exist). This is a
+\texttt{O(number\ of\ slashes\ in\ your\ path)} operation.
+
+For example, if you wanted to look up \texttt{/results.csv} given the
+above register, you would start by grabbing the metadata at sequence 2.
+The longest common ancestor between \texttt{/results.csv} and
+\texttt{/figures/graph2.png} is \texttt{/}. You then grab the corresponding
+entry in the children array for \texttt{/}, which in this case is the
+first entry, \texttt{{[}0{]}}. You then repeat this with all of the
+children entries until you find a child that is closer to the entry you
+are looking for. In this example, the first entry happens to be the
+match we are looking for.
+
+You can also perform lookups relative to a point in time by starting
+from a specific sequence number in the register, for example to get the
+state of some file relative to an old sequence number, similar to
+checking out an old version of a repository in Git.
+
+\subparagraph{Data Serialization}\label{data-serialization}
+
+The format of the \texttt{metadata.data} file is as follows:
+
+\begin{verbatim}
+<Header>
+<Node 1>
+<Node 2>
+<Node ...>
+<Node N>
+\end{verbatim}
+
+Each entry in the file is encoded using Protocol Buffers.
+
+The first message we write to the file is of a type called Header which
+uses this schema:
+
+\begin{verbatim}
+message Header {
+ required string type = 1;
+ optional bytes content = 2;
+}
+\end{verbatim}
+
+This is used to declare two pieces of metadata used by Dat. It includes
+a \texttt{type} string with the value \texttt{hyperdrive} and a
+\texttt{content} binary value that holds the public key of the content
+register that this metadata register represents. When you share a Dat,
+the metadata key is the main key that gets used, and the content
+register key is linked from here in the metadata.
+
+After the header the file will contain many filesystem \texttt{Node}
+entries:
+
+\begin{verbatim}
+message Node {
+ required string path = 1;
+ optional Stat value = 2;
+ optional bytes children = 3;
+}
+\end{verbatim}
+
+The \texttt{Node} object has three fields:
+
+\begin{itemize}
+\tightlist
+\item
+ \texttt{path} - the string of the absolute file path of this file.
+\item
+ \texttt{Stat} - a Stat encoded object representing the file metadata
+\item
+ \texttt{children} - a compressed list of the sequence numbers as
+ described earlier
+\end{itemize}
+
+The \texttt{children} value is encoded by starting with the nested array
+of sequence numbers, e.g. \texttt{{[}{[}3{]},\ {[}2,\ 1{]}{]}}. You then
+sort the individual arrays, in this case resulting in
+\texttt{{[}{[}3{]},\ {[}1,\ 2{]}{]}}. You then delta compress each
+subarray by storing the difference between each integer. In this case it
+would be \texttt{{[}{[}3{]},\ {[}1,\ 1{]}{]}} because \texttt{3} is 3
+more than 0, \texttt{1} is 1 more than 0, and \texttt{2} is 1 more
+than \texttt{1}.
+
+When we write these delta compressed subarrays we write them using
+variable width integers (varints), using a repeating pattern like this,
+one for each array:
+
+\begin{verbatim}
+<varint of first subarray element length>
+<varint of the first delta in this array>
+<varint of the next delta ...>
+<varint of the last delta>
+\end{verbatim}
+
+This encoding is designed for efficiency as it reduces the filesystem
+path metadata down to a series of small integers.
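+
+A minimal Python sketch of this children encoding. The varint is the
+same 7-bits-per-byte encoding Protocol Buffers uses; the exact framing
+of each subarray is our reading of the pattern above:
+
+\begin{verbatim}
+def encode_varint(n):
+    """Protocol Buffers style varint: 7 bits per byte, high bit set
+    on every byte except the last."""
+    out = bytearray()
+    while True:
+        byte = n & 0x7f
+        n >>= 7
+        if n:
+            out.append(byte | 0x80)
+        else:
+            out.append(byte)
+            return bytes(out)
+
+def encode_children(children):
+    """Encode e.g. [[3], [2, 1]] as sorted, delta compressed varints."""
+    out = bytearray()
+    for subarray in children:
+        subarray = sorted(subarray)
+        out += encode_varint(len(subarray))       # number of entries
+        previous = 0
+        for seq in subarray:
+            out += encode_varint(seq - previous)  # delta from previous
+            previous = seq
+    return bytes(out)
+\end{verbatim}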
+
+The \texttt{Stat} objects use this encoding:
+
+\begin{verbatim}
+message Stat {
+ required uint32 mode = 1;
+ optional uint32 uid = 2;
+ optional uint32 gid = 3;
+ optional uint64 size = 4;
+ optional uint64 blocks = 5;
+ optional uint64 offset = 6;
+ optional uint64 byteOffset = 7;
+ optional uint64 mtime = 8;
+ optional uint64 ctime = 9;
+}
+\end{verbatim}
+
+These are the field definitions:
+
+\begin{itemize}
+\tightlist
+\item
+ \texttt{mode} - posix file mode bitmask
+\item
+ \texttt{uid} - posix user id
+\item
+ \texttt{gid} - posix group id
+\item
+ \texttt{size} - file size in bytes
+\item
+ \texttt{blocks} - number of data chunks that make up this file
+\item
+ \texttt{offset} - the data feed entry index for the first chunk in
+ this file
+\item
+ \texttt{byteOffset} - the data feed file byte offset for the first
+ chunk in this file
+\item
+ \texttt{mtime} - posix modified\_at time
+\item
+  \texttt{ctime} - posix created\_at time
+\end{itemize}
+
+\section{4. Dat Network Protocol}\label{dat-network-protocol}
+
+The SLEEP format is designed to allow for sparse replication, meaning
+you can efficiently download only the metadata and data required to
+resolve a single byte region of a single file, which makes Dat suitable
+for a wide variety of streaming, real time and large dataset use cases.
+
+To take advantage of this, Dat includes a network protocol. It is
+message based and stateless, making it possible to implement on a
+variety of network transport protocols including UDP and TCP. Both
+metadata and content registers in SLEEP share the exact same replication
+protocol.
+
+Individual messages are encoded using Protocol Buffers. There are ten
+message types, each described with its schema below.
+
+\subsubsection{Wire Protocol}\label{wire-protocol}
+
+Over the wire messages are packed in the following lightweight container
+format
+
+\begin{verbatim}
+<varint - length of rest of message>
+ <varint - header>
+ <message>
+\end{verbatim}
+
+The \texttt{header} value is a single varint that has two pieces of
+information, the integer \texttt{type} that declares a 4-bit message
+type (used below), and a channel identifier, \texttt{0} for metadata and
+\texttt{1} for content.
+
+To generate this varint, you bitshift the 4-bit type integer onto the
+end of the channel identifier, e.g.
+\texttt{channel\ \textless{}\textless{}\ 4\ \textbar{}\ \textless{}4-bit-type\textgreater{}}.
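+
+A minimal Python sketch of this framing (the varint helper is the same
+Protocol Buffers style encoding as before; the message body is assumed
+to be an already encoded protobuf):
+
+\begin{verbatim}
+def encode_varint(n):
+    out = bytearray()
+    while True:
+        byte = n & 0x7f
+        n >>= 7
+        if n:
+            out.append(byte | 0x80)
+        else:
+            out.append(byte)
+            return bytes(out)
+
+def frame_message(channel, msg_type, body):
+    """<varint total length><varint header><message>,
+    where header = channel << 4 | 4-bit type."""
+    header = encode_varint(channel << 4 | msg_type)
+    return encode_varint(len(header) + len(body)) + header + body
+
+# e.g. a Want message (type 5) on the metadata channel (channel 0):
+# frame_message(0, 5, encoded_want)
+\end{verbatim}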
+
+\subsubsection{Register}\label{register}
+
+Type 0, should be the first message sent on a channel.
+
+\begin{itemize}
+\tightlist
+\item
+ \texttt{discoveryKey} - A BLAKE2b keyed hash of the string `hypercore'
+ using the public key of the metadata register as the key.
+\item
+ \texttt{nonce} - 32 bytes of random binary data, used in our
+ encryption scheme
+\end{itemize}
+
+\begin{verbatim}
+message Register {
+ required bytes discoveryKey = 1;
+ optional bytes nonce = 2;
+}
+\end{verbatim}
+
+\subsubsection{Handshake}\label{handshake}
+
+Type 1. Overall connection handshake. Should be sent just after the
+register message on the first channel only (metadata).
+
+\begin{itemize}
+\tightlist
+\item
+  \texttt{id} - 32 byte random data used as an identifier for this peer
+ on the network, useful for checking if you are connected to yourself
+ or another peer more than once
+\item
+ \texttt{live} - Whether or not you want to operate in live
+ (continuous) replication mode or end after the initial sync
+\end{itemize}
+
+\begin{verbatim}
+message Handshake {
+ optional bytes id = 1;
+ optional bool live = 2;
+}
+\end{verbatim}
+
+\subsubsection{Status}\label{status}
+
+Type 2. Message indicating state changes. Used to indicate whether you
+are uploading and/or downloading.
+
+Initial state for uploading/downloading is true. If both ends are not
+downloading and not live it is safe to consider the stream ended.
+
+\begin{verbatim}
+message Status {
+ optional bool uploading = 1;
+ optional bool downloading = 2;
+}
+\end{verbatim}
+
+\subsubsection{Have}\label{have}
+
+Type 3. How you tell the other peer what chunks of data you have or
+don't have. You should only send Have messages to peers who have
+expressed interest in this region with Want messages.
+
+\begin{itemize}
+\tightlist
+\item
+ \texttt{start} - If you only specify \texttt{start}, it means you are
+  telling the other side you only have 1 chunk, at the position given
+  by the value in \texttt{start}.
+\item
+ \texttt{length} - If you specify length, you can describe a range of
+ values that you have all of, starting from \texttt{start}.
+\item
+ \texttt{bitfield} - If you would like to send a range of sparse data
+ about haves/don't haves via bitfield, relative to \texttt{start}.
+\end{itemize}
+
+\begin{verbatim}
+message Have {
+ required uint64 start = 1;
+ optional uint64 length = 2 [default = 1];
+ optional bytes bitfield = 3;
+}
+\end{verbatim}
+
+When sending bitfields you must run length encode them. The encoded
+bitfield is a series of compressed and uncompressed bit sequences. All
+sequences start with a header that is a varint.
+
+If the last bit is set in the varint (it is an odd number) then a header
+represents a compressed bit sequence.
+
+\begin{verbatim}
+compressed-sequence = varint(
+ byte-length-of-sequence
+ << 2 | bit << 1 | 1
+)
+\end{verbatim}
+
+If the last bit is \emph{not} set then a header represents a
+non-compressed sequence.
+
+\begin{verbatim}
+uncompressed-sequence = varint(
+ byte-length-of-bitfield << 1 | 0
+) + (bitfield)
+\end{verbatim}
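+
+A minimal Python sketch of both sequence headers (varint helper as
+before; deciding which runs to compress is left to the caller):
+
+\begin{verbatim}
+def encode_varint(n):
+    out = bytearray()
+    while True:
+        byte = n & 0x7f
+        n >>= 7
+        if n:
+            out.append(byte | 0x80)
+        else:
+            out.append(byte)
+            return bytes(out)
+
+def compressed_sequence(byte_length, bit):
+    """A run of byte_length bytes whose bits are all `bit` (0 or 1)."""
+    return encode_varint(byte_length << 2 | bit << 1 | 1)
+
+def uncompressed_sequence(bitfield):
+    """A literal stretch of bitfield bytes, preceded by its length."""
+    return encode_varint(len(bitfield) << 1) + bitfield
+\end{verbatim}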
+
+\subsubsection{Unhave}\label{unhave}
+
+Type 4. How you communicate that you deleted or removed a chunk you used
+to have.
+
+\begin{verbatim}
+message Unhave {
+ required uint64 start = 1;
+ optional uint64 length = 2 [default = 1];
+}
+\end{verbatim}
+
+\subsubsection{Want}\label{want}
+
+Type 5. How you ask the other peer to subscribe you to Have messages for
+a region of chunks. The \texttt{length} value defaults to Infinity or
+feed.length (if not live).
+
+\begin{verbatim}
+message Want {
+ required uint64 start = 1;
+ optional uint64 length = 2;
+}
+\end{verbatim}
+
+\subsubsection{Unwant}\label{unwant}
+
+Type 6. How you ask to unsubscribe from Have messages for a region of
+chunks from the other peer. You should only Unwant previously Wanted
+regions, but if you do Unwant something that hasn't been Wanted it won't
+have any effect. The \texttt{length} value defaults to Infinity or
+feed.length (if not live).
+
+\begin{verbatim}
+message Unwant {
+ required uint64 start = 1;
+ optional uint64 length = 2;
+}
+\end{verbatim}
+
+\subsubsection{Request}\label{request}
+
+Type 7. Request a single chunk of data.
+
+\begin{itemize}
+\tightlist
+\item
+ \texttt{index} - The chunk index for the chunk you want. You should
+ only ask for indexes that you have received the Have messages for.
+\item
+  \texttt{bytes} - You can also optimistically specify a byte offset,
+  and if the remote is able to resolve the chunk for this byte offset
+  given their Merkle tree state, they will ignore the \texttt{index} and
+  send the chunk that contains this byte offset instead. If they cannot
+  resolve the byte request, \texttt{index} will be used.
+\item
+ \texttt{hash} - If you only want the hash of the chunk and not the
+ chunk data itself.
+\item
+ \texttt{nodes} - A 64 bit long bitfield representing which parent
+ nodes you have.
+\end{itemize}
+
+The \texttt{nodes} bitfield is an optional optimization to reduce the
+number of duplicate nodes exchanged during the replication lifecycle.
+It indicates which parents you have or don't have; you can specify at
+most 64 parents. Because \texttt{uint64} in Protocol Buffers is
+implemented as a varint, over the wire this does not take up 64 bits in
+most cases. The first bit is reserved to signify whether or not you
+need a signature in response. Each remaining bit represents whether you
+have (\texttt{1}) or don't have (\texttt{0}) the information at that
+node already. The ordering is determined by walking parent, then
+sibling, up the tree all the way to the root.
+
+\begin{verbatim}
+message Request {
+ required uint64 index = 1;
+ optional uint64 bytes = 2;
+ optional bool hash = 3;
+ optional uint64 nodes = 4;
+}
+\end{verbatim}
+
+\subsubsection{Cancel}\label{cancel}
+
+Type 8. Cancel a previous Request message whose response you have not
+yet received.
+
+\begin{verbatim}
+message Cancel {
+ required uint64 index = 1;
+ optional uint64 bytes = 2;
+ optional bool hash = 3;
+}
+\end{verbatim}
+
+\subsubsection{Data}\label{data-1}
+
+Type 9. Sends a single chunk of data to the other peer. You can send
+it in response to a Request or unsolicited on its own as a friendly
+gift. The data includes all of the Merkle tree parent nodes needed to
+verify the hash chain all the way up to the Merkle roots for this
+chunk. Because you can produce the direct parents by hashing the chunk,
+only the roots and `uncle' hashes are included (the siblings of all of
+the parent nodes).
+
+\begin{itemize}
+\tightlist
+\item
+  \texttt{index} - The position of this chunk.
+\item
+  \texttt{value} - The chunk's binary data. Empty if you are sending
+  only the hash.
+\item
+  \texttt{Node.index} - The index of this chunk in in-order notation.
+\item
+  \texttt{Node.hash} - The hash of this chunk.
+\item
+  \texttt{Node.size} - The aggregate chunk size for all children below
+  this node (the sum of the chunk sizes of all of its children).
+\item
+  \texttt{signature} - If you are sending a root node, the signature
+  must be included; all root nodes must have a signature.
+\end{itemize}
+
+\begin{verbatim}
+message Data {
+ required uint64 index = 1;
+ optional bytes value = 2;
+ repeated Node nodes = 3;
+ optional bytes signature = 4;
+
+ message Node {
+ required uint64 index = 1;
+ required bytes hash = 2;
+ required uint64 size = 3;
+ }
+}
+\end{verbatim}
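+
+To illustrate how the included nodes are used, here is a sketch of the
+verification walk in JavaScript. The \texttt{hashLeaf} and
+\texttt{hashParent} helpers are placeholders for Dat's actual
+BLAKE2b-based tree hashing, \texttt{uncles} is assumed to map in-order
+node indexes to the hashes received in \texttt{Data.nodes}, and
+\texttt{trustedRoot} is a root that has already been verified against
+the feed's signature; the flat-tree module supplies the in-order index
+arithmetic:
+
+\begin{verbatim}
+const crypto = require('crypto')
+const flat = require('flat-tree')
+
+// Placeholder hashing -- Dat itself uses BLAKE2b
+// with specific framing for leaves and parents.
+function hashLeaf (value) {
+  return crypto.createHash('sha256')
+    .update(value).digest()
+}
+function hashParent (left, right) {
+  return crypto.createHash('sha256')
+    .update(left).update(right).digest()
+}
+
+function verifyChunk (chunkIndex, value,
+                      uncles, trustedRoot) {
+  // chunks sit at even in-order indexes
+  let index = 2 * chunkIndex
+  let hash = hashLeaf(value)
+  while (index !== trustedRoot.index) {
+    const sib = flat.sibling(index)
+    const sibHash = uncles[sib] // from Data.nodes
+    if (!sibHash) return false  // missing proof
+    hash = sib < index
+      ? hashParent(sibHash, hash)
+      : hashParent(hash, sibHash)
+    index = flat.parent(index)
+  }
+  return hash.equals(trustedRoot.hash)
+}
+\end{verbatim}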
+
+\section{5. Existing Work}\label{existing-work}
+
+Dat is inspired by a number of features from existing systems.
+
+\subsection{Git}\label{git}
+
+Git popularized the idea of a directed acyclic graph (DAG) combined with
+a Merkle tree, a way to represent changes to data where each change is
+addressed by the secure hash of the change plus all ancestor hashes in a
+graph. This provides a way to trust data integrity, as the only way a
+specific hash could be derived by another peer is if they have the same
+data and change history required to reproduce that hash. This is
+important for reproducibility as it lets you trust that a specific git
+commit hash refers to a specific source code state.
+
+Decentralized version control tools for source code like Git provide a
+protocol for efficiently downloading changes to a set of files, but are
+optimized for text files and have issues with large files. Solutions
+like Git-LFS solve this by using HTTP to download large files, rather
+than the Git protocol. GitHub offers Git-LFS hosting but charges
+repository owners for bandwidth on popular files. Building a
+distributed distribution layer for files in a Git repository is
+difficult due to the design of Git Packfiles, which are delta
+compressed repository states that do not easily support random access
+to byte ranges in previous file versions.
+
+\subsection{BitTorrent}\label{bittorrent}
+
+BitTorrent implements a swarm based file sharing protocol for static
+datasets. Data is split into fixed-size chunks, hashed, and then that
+hash is used to discover peers that have the same data. An advantage of
+using BitTorrent for dataset transfers is that download bandwidth can be
+fully saturated. Since the file is split into pieces, and peers can
+efficiently discover which pieces each of the peers they are connected
+to have, it means one peer can download non-overlapping regions of the
+dataset from many peers at the same time in parallel, maximizing network
+throughput.
+
+Fixed-size chunking has drawbacks for data that changes (see LBFS
+above). BitTorrent assumes all metadata will be transferred up front,
+which makes it impractical for streaming or updating content. Most
+BitTorrent clients divide data into 1024 pieces, meaning large datasets
+can end up with a very large chunk size, which hurts random access
+performance (e.g.~for streaming video).
+
+Another drawback of BitTorrent is the way clients advertise and
+discover other peers in the absence of any protocol-level privacy or
+trust. From a user privacy standpoint, BitTorrent leaks what users are
+accessing or attempting to access, and does not provide the same
+browsing privacy functions as systems like SSL.
+
+\subsection{Kademlia Distributed Hash
+Table}\label{kademlia-distributed-hash-table}
+
+Kademlia is a distributed hash table, a distributed key/value store
+that can serve a similar purpose to DNS servers but has no hard-coded
+server addresses. All clients in Kademlia are also servers. As long as
+you know at least one address of another peer in the network, you can
+ask them for the key you are trying to find, and they will either have
+it or point you to other peers that are more likely to have it.
+
+If you don't have an initial peer to talk to, most clients use a
+bootstrap server that randomly gives you a peer in the network to start
+with. If the bootstrap server goes down, the network still functions as
+long as other methods can be used to bootstrap new peers, such as
+sending them peer addresses through side channels (the way .torrent
+files include tracker addresses to try in case Kademlia finds no
+peers).
+
+Kademlia is distinct from previous DHT designs due to its simplicity.
+It uses a simple XOR operation between two keys as its ``distance''
+metric to decide which peers are closer to the data being searched for.
+On paper this seems like it should not work, as it does not take into
+account things like ping speed or bandwidth. Instead, the design is
+deliberately simple to minimize the amount of control/gossip messages
+and the complexity required to implement it. In practice Kademlia has
+been extremely successful and is widely deployed as the ``Mainline
+DHT'' for BitTorrent, with support in all popular BitTorrent clients
+today.
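+
+As a toy illustration of the XOR metric (not taken from any particular
+Kademlia implementation), choosing the closer of two peer IDs to a
+target key can be done byte by byte:
+
+\begin{verbatim}
+// Return whichever of two equal-length peer IDs
+// (Buffers) is XOR-closer to the target key.
+function closerPeer (target, a, b) {
+  for (let i = 0; i < target.length; i++) {
+    const da = a[i] ^ target[i]
+    const db = b[i] ^ target[i]
+    if (da !== db) return da < db ? a : b
+  }
+  return a // equal distance
+}
+\end{verbatim}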
+
+Due to the simplicity of the original Kademlia design, a number of
+attacks such as DDoS and Sybil attacks have been demonstrated. There
+are protocol extensions (BEPs) which in certain cases mitigate the
+effects of these attacks, such as BEP 44, which includes a DDoS
+mitigation technique. Nonetheless, anyone using Kademlia should be
+aware of these limitations.
+
+\subsection{Peer to Peer Streaming Peer Protocol
+(PPSPP)}\label{peer-to-peer-streaming-peer-protocol-ppspp}
+
+PPSPP
+(\href{https://datatracker.ietf.org/doc/rfc7574/?include_text=1}{IETF
+RFC 7574}) is a protocol for live streaming content over a peer to peer
+network. In it they define a specific type of Merkle Tree that allows
+for subsets of the hashes to be requested by a peer in order to reduce
+the time-till-playback for end users. BitTorrent for example transfers
+all hashes up front, which is not suitable for live streaming.
+
+Their Merkle trees are ordered using a scheme they call ``bin
+numbering'', which is a method for deterministically arranging an
+append-only log of leaf nodes into an in-order tree layout where
+non-leaf nodes are derived hashes. If you want to verify a specific
+node, you only need to request its sibling's hash and all its uncle
+hashes. PPSPP is very concerned with reducing round trip time and
+time-till-playback by allowing for many kinds of optimizations, such as
+packing as many hashes as possible into each datagram when exchanging
+tree information with peers.
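+
+As a rough sketch of this kind of in-order numbering (an illustration,
+not PPSPP's normative definition), chunks occupy the even indexes,
+derived hashes the odd ones, and the sibling and parent of a node
+follow from simple bit arithmetic on its index:
+
+\begin{verbatim}
+//  0   2   4   6   <- leaves (chunks)
+//    1       5     <- derived parent hashes
+//        3         <- parent of 1 and 5
+function depth (i) {
+  let d = 0
+  while (i & 1) { i >>= 1; d++ }
+  return d
+}
+function sibling (i) {
+  return i ^ (1 << (depth(i) + 1))
+}
+function parent (i) {
+  const d = depth(i)
+  return (i & ~(1 << (d + 1))) | (1 << d)
+}
+// parent(0) === 1, parent(2) === 1,
+// sibling(1) === 5, parent(1) === 3
+\end{verbatim}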
+
+Although PPSPP was designed with streaming video in mind, the ability to
+request a subset of metadata from a large and/or streaming dataset is
+very desirable for many other types of datasets.
+
+\subsection{WebTorrent}\label{webtorrent}
+
+With WebRTC, browsers can now make peer to peer connections directly
+to other browsers. BitTorrent uses UDP sockets, which are not available
+to browser JavaScript, so it cannot be used as-is on the Web.
+
+WebTorrent implements the BitTorrent protocol in JavaScript using
+WebRTC as the transport. This includes the BitTorrent block exchange
+protocol as well as the tracker protocol, implemented in a way that
+enables hybrid nodes which talk simultaneously to both BitTorrent and
+WebTorrent swarms (when a client, such as Node.js, is capable of
+opening both UDP and WebRTC sockets). Trackers are exposed to web
+clients over HTTP or WebSockets.
+
+\subsection{InterPlanetary File
+System}\label{interplanetary-file-system}
+
+IPFS is a family of application and network protocols that have peer
+to peer file sharing and data permanence baked in. IPFS abstracts
+network protocols and naming systems to provide an alternative
+application delivery platform to today's Web. For example, instead of
+using HTTP and DNS directly, in IPFS you would use LibP2P streams and
+IPNS in order to gain access to the features of the IPFS platform.
+
+\subsection{Certificate Transparency/Secure
+Registers}\label{certificate-transparencysecure-registers}
+
+The UK Government Digital Service has developed the concept of a
+register, which they define as a digital public ledger you can trust.
+In the UK, government registers are beginning to be piloted as a way to
+expose essential open datasets so that consumers can verify the data
+has not been tampered with, while still allowing the data publishers to
+update their datasets over time.
+
+The design of registers was inspired by the infrastructure backing the
+Certificate Transparency project, initiated at Google, which provides a
+service on top of SSL certificates that enables service providers to
+write certificates to a distributed public ledger. Any client or
+service provider can verify whether a certificate they received is in
+the ledger, which protects against so-called ``rogue certificates''.
+
+\section{6. Reference Implementation}\label{reference-implementation}
+
+The connection logic is implemented in a module called
+\href{https://www.npmjs.com/package/discovery-swarm}{discovery-swarm}.
+This builds on discovery-channel and adds connection establishment,
+management and statistics. It provides statistics such as how many
+sources are currently connected and how many well- and badly-behaved
+sources have been talked to, and it automatically handles connecting
+and reconnecting to sources. UTP support is implemented in the module
+\href{https://www.npmjs.com/package/utp-native}{utp-native}.
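+
+A minimal usage sketch based on discovery-swarm's documented API
+follows; the port and key below are placeholders, not Dat defaults:
+
+\begin{verbatim}
+const discoverySwarm =
+  require('discovery-swarm')
+
+const swarm = discoverySwarm()
+swarm.listen(12345)        // placeholder port
+swarm.join('example-key')  // announce + lookup
+swarm.on('connection',
+  function (connection, info) {
+    console.log('peer', info.host, info.port)
+    // a Dat client would now speak the wire
+    // protocol over `connection`
+  })
+\end{verbatim}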
+
+Our implementation of source discovery is called
+\href{https://npmjs.org/discovery-channel}{discovery-channel}. We also
+run a \href{https://www.npmjs.com/package/dns-discovery}{custom DNS
+server} that Dat clients use (in addition to specifying their own if
+they need to), as well as a
+\href{https://github.com/bittorrent/bootstrap-dht}{DHT bootstrap}
+server. These discovery servers are the only centralized
+infrastructure we need for Dat to work over the Internet, but they are
+redundant and interchangeable, they never see the actual data being
+shared, and anyone can run their own. Dat will still work even if all
+of them are unavailable; in that case discovery simply becomes manual
+(e.g.~manually sharing IP addresses and ports).
+
+\section{Acknowledgements}\label{acknowledgements}
+
+This work was made possible through grants from the John S. and James L.
+Knight and Alfred P. Sloan Foundations.
+
+\end{document}