aboutsummaryrefslogtreecommitdiffstats
path: root/papers
diff options
context:
space:
mode:
Diffstat (limited to 'papers')
-rw-r--r--papers/dat-paper.bib60
-rw-r--r--papers/dat-paper.md23
-rw-r--r--papers/dat-paper.pdfbin256904 -> 270758 bytes
-rw-r--r--papers/dat-paper.txt132
4 files changed, 170 insertions, 45 deletions
diff --git a/papers/dat-paper.bib b/papers/dat-paper.bib
new file mode 100644
index 0000000..e6c7d1d
--- /dev/null
+++ b/papers/dat-paper.bib
@@ -0,0 +1,60 @@
+@inproceedings{aumasson2013blake2,
+ title={BLAKE2: simpler, smaller, fast as MD5},
+ author={Aumasson, Jean-Philippe and Neves, Samuel and Wilcox-O’Hearn, Zooko and Winnerlein, Christian},
+ booktitle={International Conference on Applied Cryptography and Network Security},
+ pages={119--135},
+ year={2013},
+ organization={Springer}
+}
+
+@article{bernstein2012high,
+ title={High-speed high-security signatures},
+ author={Bernstein, Daniel J and Duif, Niels and Lange, Tanja and Schwabe, Peter and Yang, Bo-Yin},
+ journal={Journal of Cryptographic Engineering},
+ pages={1--13},
+ year={2012},
+ publisher={Springer}
+}
+
+@article{mykletun2003providing,
+ title={Providing authentication and integrity in outsourced databases using Merkle hash trees},
+ author={Mykletun, Einar and Narasimha, Maithili and Tsudik, Gene},
+ journal={UCI-SCONCE Technical Report},
+ year={2003}
+}
+
+@inproceedings{rossi2010ledbat,
+ title={LEDBAT: The New BitTorrent Congestion Control Protocol.},
+ author={Rossi, Dario and Testa, Claudio and Valenti, Silvio and Muscariello, Luca},
+ booktitle={ICCCN},
+ pages={1--6},
+ year={2010}
+}
+
+@article{varda2008protocol,
+ title={Protocol buffers: Google’s data interchange format},
+ author={Varda, Kenton},
+ journal={Google Open Source Blog, Available at least as early as Jul},
+ year={2008}
+}
+
+@inproceedings{maymounkov2002kademlia,
+ title={Kademlia: A peer-to-peer information system based on the xor metric},
+ author={Maymounkov, Petar and Mazieres, David},
+ booktitle={International Workshop on Peer-to-Peer Systems},
+ pages={53--65},
+ year={2002},
+ organization={Springer}
+}
+
+@techreport{bakker2015peer,
+ title={Peer-to-peer streaming peer protocol (ppspp)},
+ author={Bakker, A and Petrocco, R and Grishchenko, V},
+ year={2015}
+}
+
+@techreport{laurie2013certificate,
+ title={Certificate transparency},
+ author={Laurie, Ben and Langley, Adam and Kasper, Emilia},
+ year={2013}
+} \ No newline at end of file
diff --git a/papers/dat-paper.md b/papers/dat-paper.md
index 157aef0..9f2952c 100644
--- a/papers/dat-paper.md
+++ b/papers/dat-paper.md
@@ -1,3 +1,10 @@
+---
+title: "Dat - Distributed Dataset Synchronization And Versioning"
+date: "May 2017"
+author: "Maxwell Ogden, Karissa McKelvey, Mathias Buus Madsen, Code for Science"
+---
+
+
# Abstract
Dat is a protocol designed for syncing folders of data, even if they are large or changing constantly. Dat uses a cryptographically secure register of changes to prove that the requested data version is distributed. A byte range of any file's version can be efficiently streamed from a Dat repository over a network connection. Consumers can choose to fully or partially replicate the contents of a remote Dat repository, and can also subscribe to live changes. To ensure writer and reader privacy, Dat uses public key cryptography to encrypt network traffic. A group of Dat clients can connect to each other to form a public or private decentralized network to exchange data between each other. A reference implementation is provided in JavaScript.
@@ -27,11 +34,11 @@ Content integrity means being able to verify the data you received is the exact
Link rot, when links online stop resolving, and content drift, when data changes but the link to the data remains the same, are two common issues in data analysis. For example, one day a file called data.zip might change, but a typical HTTP link to the file does not include a hash of the content, or provide a way to get updated metadata, so clients that only have the HTTP link have no way to check if the file changed without downloading the entire file again. Referring to a file by the hash of its content is called content addressability, and lets users not only verify that the data they receive is the version of the data they want, but also lets people cite specific versions of the data by referring to a specific hash.
-Dat uses BLAKE2b cryptographically secure hashes to address content. Hashes are arranged in a Merkle tree, a tree where each non-leaf node is the hash of all child nodes. Leaf nodes contain pieces of the dataset. Due to the properties of secure cryptographic hashes the top hash can only be produced if all data below it matches exactly. If two trees have matching top hashes then you know that all other nodes in the tree must match as well, and you can conclude that your dataset is synchronized. Trees are chosen as the primary data structure in Dat as they have a number of properties that allow for efficient access to subsets of the metadata, which allows Dat to work efficiently over a network connection.
+Dat uses BLAKE2b [@aumasson2013blake2] cryptographically secure hashes to address content. Hashes are arranged in a Merkle tree [@mykletun2003providing], a tree where each non-leaf node is the hash of all child nodes. Leaf nodes contain pieces of the dataset. Due to the properties of secure cryptographic hashes the top hash can only be produced if all data below it matches exactly. If two trees have matching top hashes then you know that all other nodes in the tree must match as well, and you can conclude that your dataset is synchronized. Trees are chosen as the primary data structure in Dat as they have a number of properties that allow for efficient access to subsets of the metadata, which allows Dat to work efficiently over a network connection.
### Dat Links
-Dat links are Ed25519 public keys which have a length of 32 bytes (64 characters when Hex encoded). You can represent your Dat link in the following ways and Dat clients will be able to understand them:
+Dat links are Ed25519 [@bernstein2012high] public keys which have a length of 32 bytes (64 characters when Hex encoded). You can represent your Dat link in the following ways and Dat clients will be able to understand them:
- The standalone public key:
@@ -89,7 +96,7 @@ Additional discovery networks can be implemented as needed. We chose the above t
### Peer Connections
-After the discovery phase, Dat should have a list of potential data sources to try and contact. Dat uses either TCP, [UTP](https://en.wikipedia.org/wiki/Micro_Transport_Protocol), or HTTP. UTP is designed to not take up all available bandwidth on a network (e.g. so that other people sharing wifi can still use the Internet), and is still based on UDP so works with NAT traversal techniques like UDP hole punching. HTTP is support for compatibility with static file servers and web browser clients. Note that these are the protocols we support in the reference Dat implementation, but the Dat protocol itself is transport agnostic.
+After the discovery phase, Dat should have a list of potential data sources to try and contact. Dat uses either TCP, HTTP or [UTP](https://en.wikipedia.org/wiki/Micro_Transport_Protocol) [@rossi2010ledbat]. UTP uses LEDBAT which is designed to not take up all available bandwidth on a network (e.g. so that other people sharing wifi can still use the Internet), and is still based on UDP so works with NAT traversal techniques like UDP hole punching. HTTP is support for compatibility with static file servers and web browser clients. Note that these are the protocols we support in the reference Dat implementation, but the Dat protocol itself is transport agnostic.
If an HTTP source is specified Dat will prefer that one over other sources. Otherwise when Dat gets the IP and port for a potential TCP or UTP source it tries to connect using both protocols. If one connects first, Dat aborts the other one. If none connect, Dat will try again until it decides that source is offline or unavailable and then stops trying to connect to them. Sources Dat is able to connect to go into a list of known good sources, so that the Internet connection goes down Dat can use that list to reconnect to known good sources again quickly.
@@ -591,7 +598,7 @@ The format of the `metadata.data` file is as follows:
<Node N>
```
-Each entry in the file is encoded using Protocol Buffers.
+Each entry in the file is encoded using Protocol Buffers [@varda2008protocol].
The first message we write to the file is of a type called Header which uses this schema:
@@ -872,7 +879,7 @@ Another drawback of BitTorrent is due to the way clients advertise and discover
## Kademlia Distributed Hash Table
-Kademlia is a distributed hash table, a distributed key/value store that can serve a similar purpose to DNS servers but has no hard coded server addresses. All clients in Kademlia are also servers. As long as you know at least one address of another peer in the network, you can ask them for the key you are trying to find and they will either have it or give you some other people to talk to that are more likely to have it.
+Kademlia [@maymounkov2002kademlia] is a distributed hash table, a distributed key/value store that can serve a similar purpose to DNS servers but has no hard coded server addresses. All clients in Kademlia are also servers. As long as you know at least one address of another peer in the network, you can ask them for the key you are trying to find and they will either have it or give you some other people to talk to that are more likely to have it.
If you don't have an initial peer to talk to you, most clients use a bootstrap server that randomly gives you a peer in the network to start with. If the bootstrap server goes down, the network still functions as long as other methods can be used to bootstrap new peers (such as sending them peer addresses through side channels like how .torrent files include tracker addresses to try in case Kademlia finds no peers).
@@ -882,7 +889,7 @@ Due to the simplicity in the original Kademlia design a number of attacks such a
## Peer to Peer Streaming Peer Protocol (PPSPP)
-PPSPP ([IETF RFC 7574](https://datatracker.ietf.org/doc/rfc7574/?include_text=1)) is a protocol for live streaming content over a peer to peer network. In it they define a specific type of Merkle Tree that allows for subsets of the hashes to be requested by a peer in order to reduce the time-till-playback for end users. BitTorrent for example transfers all hashes up front, which is not suitable for live streaming.
+PPSPP ([IETF RFC 7574](https://datatracker.ietf.org/doc/rfc7574/?include_text=1), [@bakker2015peer]) is a protocol for live streaming content over a peer to peer network. In it they define a specific type of Merkle Tree that allows for subsets of the hashes to be requested by a peer in order to reduce the time-till-playback for end users. BitTorrent for example transfers all hashes up front, which is not suitable for live streaming.
Their Merkle trees are ordered using a scheme they call "bin numbering", which is a method for deterministically arranging an append-only log of leaf nodes into an in-order layout tree where non-leaf nodes are derived hashes. If you want to verify a specific node, you only need to request its sibling's hash and all its uncle hashes. PPSPP is very concerned with reducing round trip time and time-till-playback by allowing for many kinds of optimizations, such as to pack as many hashes into datagrams as possible when exchanging tree information with peers.
@@ -902,7 +909,7 @@ IPFS is a family of application and network protocols that have peer to peer fil
The UK Government Digital Service have developed the concept of a register which they define as a digital public ledger you can trust. In the UK government registers are beginning to be piloted as a way to expose essential open data sets in a way where consumers can verify the data has not been tampered with, and allows the data publishers to update their data sets over time.
-The design of registers was inspired by the infrastructure backing the Certificate Transparency project, initated at Google, which provides a service on top of SSL certificates that enables service providers to write certificates to a distributed public ledger. Anyone client or service provider can verify if a certificate they received is in the ledger, which protects against so called "rogue certificates".
+The design of registers was inspired by the infrastructure backing the Certificate Transparency [@laurie2013certificate] project, initated at Google, which provides a service on top of SSL certificates that enables service providers to write certificates to a distributed public ledger. Anyone client or service provider can verify if a certificate they received is in the ledger, which protects against so called "rogue certificates".
# 6. Reference Implementation
@@ -913,3 +920,5 @@ Our implementation of source discovery is called [discovery-channel](https://npm
# Acknowledgements
This work was made possible through grants from the John S. and James L. Knight and Alfred P. Sloan Foundations.
+
+# References
diff --git a/papers/dat-paper.pdf b/papers/dat-paper.pdf
index 673d52c..62b9058 100644
--- a/papers/dat-paper.pdf
+++ b/papers/dat-paper.pdf
Binary files differ
diff --git a/papers/dat-paper.txt b/papers/dat-paper.txt
index 2b30120..9ece98f 100644
--- a/papers/dat-paper.txt
+++ b/papers/dat-paper.txt
@@ -21,8 +21,10 @@
\usepackage{microtype}
\UseMicrotypeSet[protrusion]{basicmath} % disable protrusion for tt fonts
}{}
-\usepackage{hyperref}
-\hypersetup{unicode=true,
+\usepackage[unicode=true]{hyperref}
+\hypersetup{
+ pdftitle={Dat - Distributed Dataset Synchronization And Versioning},
+ pdfauthor={Maxwell Ogden, Karissa McKelvey, Mathias Buus Madsen, Code for Science},
pdfborder={0 0 0},
breaklinks=true}
\urlstyle{same} % don't use monospace font for urls
@@ -46,9 +48,15 @@
\renewcommand{\subparagraph}[1]{\oldsubparagraph{#1}\mbox{}}
\fi
+% set default figure placement to htbp
+\makeatletter
+\def\fps@figure{htbp}
+\makeatother
+
+
\title{Dat - Distributed Dataset Synchronization And Versioning}
-\author{Maxwell Ogden, Karissa McKelvey, Mathias Buus Madsen}
-\date{April 2017, Code for Science}
+\author{Maxwell Ogden, Karissa McKelvey, Mathias Buus Madsen, Code for Science}
+\date{May 2017}
\begin{document}
\maketitle
@@ -146,11 +154,12 @@ verify that the data they receive is the version of the data they want,
but also lets people cite specific versions of the data by referring to
a specific hash.
-Dat uses BLAKE2b cryptographically secure hashes to address content.
-Hashes are arranged in a Merkle tree, a tree where each non-leaf node is
-the hash of all child nodes. Leaf nodes contain pieces of the dataset.
-Due to the properties of secure cryptographic hashes the top hash can
-only be produced if all data below it matches exactly. If two trees have
+Dat uses BLAKE2b (Aumasson et al. 2013) cryptographically secure hashes
+to address content. Hashes are arranged in a Merkle tree (Mykletun,
+Narasimha, and Tsudik 2003), a tree where each non-leaf node is the hash
+of all child nodes. Leaf nodes contain pieces of the dataset. Due to the
+properties of secure cryptographic hashes the top hash can only be
+produced if all data below it matches exactly. If two trees have
matching top hashes then you know that all other nodes in the tree must
match as well, and you can conclude that your dataset is synchronized.
Trees are chosen as the primary data structure in Dat as they have a
@@ -160,9 +169,10 @@ connection.
\subsubsection{Dat Links}\label{dat-links}
-Dat links are Ed25519 public keys which have a length of 32 bytes (64
-characters when Hex encoded). You can represent your Dat link in the
-following ways and Dat clients will be able to understand them:
+Dat links are Ed25519 (Bernstein et al. 2012) public keys which have a
+length of 32 bytes (64 characters when Hex encoded). You can represent
+your Dat link in the following ways and Dat clients will be able to
+understand them:
\begin{itemize}
\tightlist
@@ -303,15 +313,15 @@ used, and instead only that one HTTPS server is used as the only peer.
\subsubsection{Peer Connections}\label{peer-connections}
After the discovery phase, Dat should have a list of potential data
-sources to try and contact. Dat uses either TCP,
-\href{https://en.wikipedia.org/wiki/Micro_Transport_Protocol}{UTP}, or
-HTTP. UTP is designed to not take up all available bandwidth on a
-network (e.g.~so that other people sharing wifi can still use the
-Internet), and is still based on UDP so works with NAT traversal
-techniques like UDP hole punching. HTTP is support for compatibility
-with static file servers and web browser clients. Note that these are
-the protocols we support in the reference Dat implementation, but the
-Dat protocol itself is transport agnostic.
+sources to try and contact. Dat uses either TCP, HTTP or
+\href{https://en.wikipedia.org/wiki/Micro_Transport_Protocol}{UTP}
+(Rossi et al. 2010). UTP uses LEDBAT which is designed to not take up
+all available bandwidth on a network (e.g.~so that other people sharing
+wifi can still use the Internet), and is still based on UDP so works
+with NAT traversal techniques like UDP hole punching. HTTP is support
+for compatibility with static file servers and web browser clients. Note
+that these are the protocols we support in the reference Dat
+implementation, but the Dat protocol itself is transport agnostic.
If an HTTP source is specified Dat will prefer that one over other
sources. Otherwise when Dat gets the IP and port for a potential TCP or
@@ -1187,7 +1197,7 @@ The format of the \texttt{metadata.data} file is as follows:
<Node N>
\end{verbatim}
-Each entry in the file is encoded using Protocol Buffers.
+Each entry in the file is encoded using Protocol Buffers (Varda 2008).
The first message we write to the file is of a type called Header which
uses this schema:
@@ -1635,12 +1645,13 @@ browsing privacy functions as systems like SSL.
\subsection{Kademlia Distributed Hash
Table}\label{kademlia-distributed-hash-table}
-Kademlia is a distributed hash table, a distributed key/value store that
-can serve a similar purpose to DNS servers but has no hard coded server
-addresses. All clients in Kademlia are also servers. As long as you know
-at least one address of another peer in the network, you can ask them
-for the key you are trying to find and they will either have it or give
-you some other people to talk to that are more likely to have it.
+Kademlia (Maymounkov and Mazieres 2002) is a distributed hash table, a
+distributed key/value store that can serve a similar purpose to DNS
+servers but has no hard coded server addresses. All clients in Kademlia
+are also servers. As long as you know at least one address of another
+peer in the network, you can ask them for the key you are trying to find
+and they will either have it or give you some other people to talk to
+that are more likely to have it.
If you don't have an initial peer to talk to you, most clients use a
bootstrap server that randomly gives you a peer in the network to start
@@ -1672,11 +1683,12 @@ limitations.
PPSPP
(\href{https://datatracker.ietf.org/doc/rfc7574/?include_text=1}{IETF
-RFC 7574}) is a protocol for live streaming content over a peer to peer
-network. In it they define a specific type of Merkle Tree that allows
-for subsets of the hashes to be requested by a peer in order to reduce
-the time-till-playback for end users. BitTorrent for example transfers
-all hashes up front, which is not suitable for live streaming.
+RFC 7574}, (Bakker, Petrocco, and Grishchenko 2015)) is a protocol for
+live streaming content over a peer to peer network. In it they define a
+specific type of Merkle Tree that allows for subsets of the hashes to be
+requested by a peer in order to reduce the time-till-playback for end
+users. BitTorrent for example transfers all hashes up front, which is
+not suitable for live streaming.
Their Merkle trees are ordered using a scheme they call ``bin
numbering'', which is a method for deterministically arranging an
@@ -1727,11 +1739,12 @@ data has not been tampered with, and allows the data publishers to
update their data sets over time.
The design of registers was inspired by the infrastructure backing the
-Certificate Transparency project, initated at Google, which provides a
-service on top of SSL certificates that enables service providers to
-write certificates to a distributed public ledger. Anyone client or
-service provider can verify if a certificate they received is in the
-ledger, which protects against so called ``rogue certificates''.
+Certificate Transparency (Laurie, Langley, and Kasper 2013) project,
+initated at Google, which provides a service on top of SSL certificates
+that enables service providers to write certificates to a distributed
+public ledger. Anyone client or service provider can verify if a
+certificate they received is in the ledger, which protects against so
+called ``rogue certificates''.
\section{6. Reference Implementation}\label{reference-implementation}
@@ -1762,4 +1775,47 @@ IP/ports).
This work was made possible through grants from the John S. and James L.
Knight and Alfred P. Sloan Foundations.
+\section*{References}\label{references}
+\addcontentsline{toc}{section}{References}
+
+\hypertarget{refs}{}
+\hypertarget{ref-aumasson2013blake2}{}
+Aumasson, Jean-Philippe, Samuel Neves, Zooko Wilcox-O'Hearn, and
+Christian Winnerlein. 2013. ``BLAKE2: Simpler, Smaller, Fast as Md5.''
+In \emph{International Conference on Applied Cryptography and Network
+Security}, 119--35. Springer.
+
+\hypertarget{ref-bakker2015peer}{}
+Bakker, A, R Petrocco, and V Grishchenko. 2015. ``Peer-to-Peer Streaming
+Peer Protocol (Ppspp).''
+
+\hypertarget{ref-bernstein2012high}{}
+Bernstein, Daniel J, Niels Duif, Tanja Lange, Peter Schwabe, and Bo-Yin
+Yang. 2012. ``High-Speed High-Security Signatures.'' \emph{Journal of
+Cryptographic Engineering}. Springer, 1--13.
+
+\hypertarget{ref-laurie2013certificate}{}
+Laurie, Ben, Adam Langley, and Emilia Kasper. 2013. ``Certificate
+Transparency.''
+
+\hypertarget{ref-maymounkov2002kademlia}{}
+Maymounkov, Petar, and David Mazieres. 2002. ``Kademlia: A Peer-to-Peer
+Information System Based on the Xor Metric.'' In \emph{International
+Workshop on Peer-to-Peer Systems}, 53--65. Springer.
+
+\hypertarget{ref-mykletun2003providing}{}
+Mykletun, Einar, Maithili Narasimha, and Gene Tsudik. 2003. ``Providing
+Authentication and Integrity in Outsourced Databases Using Merkle Hash
+Trees.'' \emph{UCI-SCONCE Technical Report}.
+
+\hypertarget{ref-rossi2010ledbat}{}
+Rossi, Dario, Claudio Testa, Silvio Valenti, and Luca Muscariello. 2010.
+``LEDBAT: The New Bittorrent Congestion Control Protocol.'' In
+\emph{ICCCN}, 1--6.
+
+\hypertarget{ref-varda2008protocol}{}
+Varda, Kenton. 2008. ``Protocol Buffers: Google's Data Interchange
+Format.'' \emph{Google Open Source Blog, Available at Least as Early as
+Jul}.
+
\end{document}