author bnewbold <bnewbold@robocracy.org> 2018-01-12 16:45:40 -0500
committer Joe Hand <joe@joeahand.com> 2018-01-12 13:45:40 -0800
commit 9f0ce82c58741a7c01176b2d6bb2049b8cf52e05
tree 092507aa819301dfac57c3bfcf6d1f223b5dcd89
parent 5b37b1b8bd0615c1d487abfc4b1543dfdacbfd36
Paper Cleanups (#102) (HEAD, master, bnewbold-core-modules)
* yet more grammar tweaks

* drop stub hyperdrive paper

  This file had no content and was confusing.

* papers: clarify that build output is latex

* dat paper: remove stubs and multi-writer

  Consensus is to document multi-writer via a DEP (RFC-like) process
  first, and not propose new changes in the whitepaper first.

* dat-paper: note that paper has been updated

  This could probably be clarified better (minor revisions and bug-fixes
  vs. substantial updates), and maybe we want to pull out the original
  "1.0" paper for historical reference. Let's wait until DEP process
  settles down first.

* papers: re-build PDFs
-rwxr-xr-x  papers/buildpapers.sh  6
-rw-r--r--  papers/dat-paper.latex (renamed from papers/dat-paper.txt)  84
-rw-r--r--  papers/dat-paper.md  28
-rw-r--r--  papers/dat-paper.pdf  bin 250116 -> 248958 bytes
-rw-r--r--  papers/hyperdrive.md  13
-rw-r--r--  papers/sleep.latex (renamed from papers/sleep.txt)  35
-rw-r--r--  papers/sleep.md  4
-rw-r--r--  papers/sleep.pdf  bin 202643 -> 207531 bytes
8 files changed, 38 insertions, 132 deletions
diff --git a/papers/buildpapers.sh b/papers/buildpapers.sh
index 420e916..089ed89 100755
--- a/papers/buildpapers.sh
+++ b/papers/buildpapers.sh
@@ -1,9 +1,9 @@
#!/usr/bin/env sh
-pandoc --filter pandoc-citeproc --bibliography=dat-paper.bib --variable classoption=twocolumn --variable papersize=a4paper -s dat-paper.md -t latex -o dat-paper.txt
+pandoc --filter pandoc-citeproc --bibliography=dat-paper.bib --variable classoption=twocolumn --variable papersize=a4paper -s dat-paper.md -t latex -o dat-paper.latex
pandoc --filter pandoc-citeproc --bibliography=dat-paper.bib --variable classoption=twocolumn --variable papersize=a4paper -s dat-paper.md -o dat-paper.pdf
-pandoc --filter pandoc-citeproc --bibliography=dat-paper.bib --variable classoption=twocolumn --variable papersize=a4paper -s sleep.md -t latex -o sleep.txt
+pandoc --filter pandoc-citeproc --bibliography=dat-paper.bib --variable classoption=twocolumn --variable papersize=a4paper -s sleep.md -t latex -o sleep.latex
-pandoc --filter pandoc-citeproc --bibliography=dat-paper.bib --variable classoption=twocolumn --variable papersize=a4paper -s sleep.md -o sleep.pdf
\ No newline at end of file
+pandoc --filter pandoc-citeproc --bibliography=dat-paper.bib --variable classoption=twocolumn --variable papersize=a4paper -s sleep.md -o sleep.pdf
diff --git a/papers/dat-paper.txt b/papers/dat-paper.latex
index f0e71c2..5d90b07 100644
--- a/papers/dat-paper.txt
+++ b/papers/dat-paper.latex
@@ -48,15 +48,9 @@
\renewcommand{\subparagraph}[1]{\oldsubparagraph{#1}\mbox{}}
\fi
-% set default figure placement to htbp
-\makeatletter
-\def\fps@figure{htbp}
-\makeatother
-
-
\title{Dat - Distributed Dataset Synchronization And Versioning}
\author{Maxwell Ogden, Karissa McKelvey, Mathias Buus Madsen, Code for Science}
-\date{May 2017}
+\date{May 2017 (last updated: Jan 2018)}
\begin{document}
\maketitle
@@ -103,7 +97,7 @@ backup sources can be automatically discovered. However these file
sharing tools today are not supported by Web browsers, do not have good
privacy guarantees, and do not provide a mechanism for updating files
without redistributing a new dataset which could mean entirely
-redownloading data you already have.
+re-downloading data you already have.
\section{2. Dat}\label{dat}
@@ -114,7 +108,7 @@ reference implementation is available from npm as
The protocol is agnostic to the underlying transport e.g.~you could
implement Dat over carrier pigeon. Data is stored in a format called
-SLEEP (Ogden and Buus 2017), described in it's own paper. The key
+SLEEP (Ogden and Buus 2017), described in its own paper. The key
properties of the Dat design are explained in this section.
\begin{itemize}
@@ -319,7 +313,7 @@ sources to try and contact. Dat uses either TCP, HTTP or
\href{https://en.wikipedia.org/wiki/Micro_Transport_Protocol}{UTP}
(Rossi et al. 2010). UTP uses LEDBAT which is designed to not take up
all available bandwidth on a network (e.g.~so that other people sharing
-wifi can still use the Internet), and is still based on UDP so works
+WiFi can still use the Internet), and is still based on UDP so works
with NAT traversal techniques like UDP hole punching. HTTP is supported
for compatibility with static file servers and web browser clients. Note
that these are the protocols we support in the reference Dat
@@ -495,16 +489,16 @@ For example a register with two data entries would look something like
this (pseudocode):
\begin{verbatim}
-0. hash(value0)
+0. hash(chunk0)
1. hash(hash(chunk0) + hash(chunk1))
-2. hash(value1)
+2. hash(chunk1)
\end{verbatim}
It is possible for the in-order Merkle tree to have multiple roots at
once. A root is defined as a parent node with a full set of child node
slots filled below it.
-For example, this tree hash 2 roots (1 and 4)
+For example, this tree has 2 roots (1 and 4)
\begin{verbatim}
0
@@ -514,7 +508,7 @@ For example, this tree hash 2 roots (1 and 4)
4
\end{verbatim}
-This tree hash one root (3):
+This tree has one root (3):
\begin{verbatim}
0
@@ -560,7 +554,7 @@ list like this:
bat-1
bat-2
bat-3
-cat-1
+cat-1
cat-2
cat-3
\end{verbatim}
@@ -702,7 +696,7 @@ matching metadata entry. This is the un-optimized resolution that uses
having Alice send additional sequence numbers to Bob that help him
traverse in less round trips.
-In the metadata record Bob recieved for \texttt{cat\_dna.csv} there is
+In the metadata record Bob received for \texttt{cat\_dna.csv} there is
the byte offset to the beginning of the file in the data feed. Bob adds
his +30MB offset to this value and starts requesting pieces of data
starting at that byte offset using the SLEEP protocol as described
@@ -712,16 +706,6 @@ This method tries to allow any byte range of any file to be accessed
without the need to synchronize the full metadata for all files up
front.
-\subsubsection{Scenario: Syncing live changes to files at a specific
-path}\label{scenario-syncing-live-changes-to-files-at-a-specific-path}
-
-TODO
-
-\subsubsection{Scenario: Syncing an entire
-archive}\label{scenario-syncing-an-entire-archive}
-
-TODO
-
\subsection{3. Dat Network Protocol}\label{dat-network-protocol}
The SLEEP format is designed to allow for sparse replication, meaning
@@ -768,8 +752,8 @@ Type 0. Should be the first message sent on a channel.
\texttt{discoveryKey} - A BLAKE2b keyed hash of the string `hypercore'
using the public key of the metadata register as the key.
\item
- \texttt{nonce} - 32 bytes of random binary data, used in our
- encryption scheme
+ \texttt{nonce} - 24 bytes (192 bits) of random binary data, used in
+ our encryption scheme
\end{itemize}
\begin{verbatim}
@@ -1006,7 +990,7 @@ message Data {
optional bytes value = 2;
repeated Node nodes = 3;
optional bytes signature = 4;
-
+
message Node {
required uint64 index = 1;
required bytes hash = 2;
@@ -1015,45 +999,7 @@ message Data {
}
\end{verbatim}
-\section{4. Multi-Writer}\label{multi-writer}
-
-The design of Dat up to this point assumes you have a single keyholder
-writing and signing data and appending it to the metadata and content
-feed. However having the ability for multiple keyholders to be able to
-write to a single repository allows for many interesting use cases such
-as forking and collaborative workflows.
-
-In order to do this, we use one \texttt{metadata.data} feed for each
-writer. Each writer kets their own keypair. Each writer is responsible
-for storing their private key. To add a new writer to your feed, you
-include their key in a metadata feed entry.
-
-For example, if Alice wants to add Bob to have write access to a Dat
-repository, Alice would take Bob's public key and writes it to the
-`local' metadata feed (the feed that Alice owns, e.g.~the original
-feed). Now anyone else who replicates from Alice will find Bob's key in
-the history. If in the future Bob distributes a version of the Dat that
-he added new data to, everyone who has a copy of the Dat from Alice will
-have a copy of Bob's key that they can use to verify that Bob's writes
-are valid.
-
-On disk, each users feed is stored in a separate hyperdrive. The
-original hyperdrive (owned by Alice) is called the `local' hyperdrive.
-Bob's hyperdrive would be stored separately in the SLEEP folder
-addressed by Bob's public key.
-
-In case Bob and Alice write different values for the same file (e.g.~Bob
-creates a ``fork''), when they sync up with each other replication will
-still work, but for the forked value the Dat client will return an array
-of values for that key instead of just one value. The values are linked
-to the writer that wrote them, so in the case of receiving multiple
-values, clients can choose to choose the value from Alice, or Bob, or
-the latest value, or whatever other strategy they prefer.
-
-If a writer updates the value of a forked key with new value they are
-performing a merge.
-
-\section{5. Existing Work}\label{existing-work}
+\section{4. Existing Work}\label{existing-work}
Dat is inspired by a number of features from existing systems.
@@ -1208,7 +1154,7 @@ public ledger. Any client or service provider can verify if a
certificate they received is in the ledger, which protects against so
called ``rogue certificates''.
-\section{6. Reference Implementation}\label{reference-implementation}
+\section{5. Reference Implementation}\label{reference-implementation}
The connection logic is implemented in a module called
\href{https://www.npmjs.com/package/discovery-swarm}{discovery-swarm}.
diff --git a/papers/dat-paper.md b/papers/dat-paper.md
index b62c9b3..257cf10 100644
--- a/papers/dat-paper.md
+++ b/papers/dat-paper.md
@@ -1,6 +1,6 @@
---
title: "Dat - Distributed Dataset Synchronization And Versioning"
-date: "May 2017"
+date: "May 2017 (last updated: Jan 2018)"
author: "Maxwell Ogden, Karissa McKelvey, Mathias Buus Madsen, Code for Science"
---
@@ -294,14 +294,6 @@ In the metadata record Bob received for `cat_dna.csv` there is the byte offset t
This method tries to allow any byte range of any file to be accessed without the need to synchronize the full metadata for all files up front.
-### Scenario: Syncing live changes to files at a specific path
-
-TODO
-
-### Scenario: Syncing an entire archive
-
-TODO
-
## 3. Dat Network Protocol
The SLEEP format is designed to allow for sparse replication, meaning you can efficiently download only the metadata and data required to resolve a single byte region of a single file, which makes Dat suitable for a wide variety of streaming, real time and large dataset use cases.
@@ -497,21 +489,7 @@ message Data {
}
```
-# 4. Multi-Writer
-
-The design of Dat up to this point assumes you have a single keyholder writing and signing data and appending it to the metadata and content feed. However having the ability for multiple keyholders to be able to write to a single repository allows for many interesting use cases such as forking and collaborative workflows.
-
-In order to do this, we use one `metadata.data` feed for each writer. Each writer gets their own keypair. Each writer is responsible for storing their private key. To add a new writer to your feed, you include their key in a metadata feed entry.
-
-For example, if Alice wants to add Bob to have write access to a Dat repository, Alice would take Bob's public key and write it to the 'local' metadata feed (the feed that Alice owns, e.g. the original feed). Now anyone else who replicates from Alice will find Bob's key in the history. If in the future Bob distributes a version of the Dat that he added new data to, everyone who has a copy of the Dat from Alice will have a copy of Bob's key that they can use to verify that Bob's writes are valid.
-
-On disk, each users feed is stored in a separate hyperdrive. The original hyperdrive (owned by Alice) is called the 'local' hyperdrive. Bob's hyperdrive would be stored separately in the SLEEP folder addressed by Bob's public key.
-
-In case Bob and Alice write different values for the same file (e.g. Bob creates a "fork"), when they sync up with each other replication will still work, but for the forked value the Dat client will return an array of values for that key instead of just one value. The values are linked to the writer that wrote them, so in the case of receiving multiple values, clients can choose to choose the value from Alice, or Bob, or the latest value, or whatever other strategy they prefer.
-
-If a writer updates the value of a forked key with new value they are performing a merge.
-
-# 5. Existing Work
+# 4. Existing Work
Dat is inspired by a number of features from existing systems.
@@ -563,7 +541,7 @@ The UK Government Digital Service have developed the concept of a register which
The design of registers was inspired by the infrastructure backing the Certificate Transparency [@laurie2013certificate] project, initiated at Google, which provides a service on top of SSL certificates that enables service providers to write certificates to a distributed public ledger. Any client or service provider can verify if a certificate they received is in the ledger, which protects against so called "rogue certificates".
-# 6. Reference Implementation
+# 5. Reference Implementation
The connection logic is implemented in a module called [discovery-swarm](https://www.npmjs.com/package/discovery-swarm). This builds on discovery-channel and adds connection establishment, management and statistics. It provides statistics such as how many sources are currently connected, how many good and bad behaving sources have been talked to, and it automatically handles connecting and reconnecting to sources. UTP support is implemented in the module [utp-native](https://www.npmjs.com/package/utp-native).
diff --git a/papers/dat-paper.pdf b/papers/dat-paper.pdf
index 5a7e758..500427e 100644
--- a/papers/dat-paper.pdf
+++ b/papers/dat-paper.pdf
Binary files differ
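
Reviewer's sketch, not part of the commit: the two sentences corrected in the dat-paper hunks ("this tree has 2 roots (1 and 4)" and "This tree has one root (3)") can be checked with a few lines of Python. The sketch assumes the flat in-order numbering implied by the register example in the same hunk, where chunk hashes sit at even indices (0, 2, ...) and their parent sits between them (1). It further assumes that the first tree holds three chunks and the second holds four, since the tree drawings are truncated in the diff context. The name `full_roots` is chosen for illustration only.

```python
def full_roots(num_chunks: int) -> list[int]:
    """Return the root indices of an in-order Merkle tree with num_chunks leaves.

    Leaves are numbered 0, 2, 4, ... and each parent sits between its children.
    A root is a node whose subtree is completely filled, so the leaves are
    split greedily into the largest possible power-of-two groups.
    """
    roots = []
    first_leaf = 0          # position (in leaves) where the next subtree starts
    remaining = num_chunks  # leaves not yet covered by a root
    while remaining > 0:
        # Largest power of two that fits in the remaining leaves.
        size = 1 << (remaining.bit_length() - 1)
        # A full subtree of `size` leaves starting at leaf `first_leaf`
        # spans indices 2*first_leaf .. 2*first_leaf + 2*size - 2;
        # its root is the middle of that span.
        roots.append(2 * first_leaf + size - 1)
        first_leaf += size
        remaining -= size
    return roots


if __name__ == "__main__":
    print(full_roots(3))  # three chunks: the two-root case from the paper
    print(full_roots(4))  # four chunks: the single-root case from the paper
```

Under those assumptions the three-chunk register yields roots 1 and 4, and the four-chunk register yields the single root 3, which matches the wording the commit settles on.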
diff --git a/papers/hyperdrive.md b/papers/hyperdrive.md
deleted file mode 100644
index a0d70b3..0000000
--- a/papers/hyperdrive.md
+++ /dev/null
@@ -1,13 +0,0 @@
----
-title: "Hyperdrive - A distributed web filesystem with incremental sync"
-date: "August 2017"
-author: "Mathias Buus Madsen, Maxwell Ogden, Code for Science"
----
-
-# Abstract
-
-# Acknowledgements
-
-This work was made possible through grants from the John S. and James L. Knight and Alfred P. Sloan Foundations.
-
-# References
diff --git a/papers/sleep.txt b/papers/sleep.latex
index 241347f..cd695d0 100644
--- a/papers/sleep.txt
+++ b/papers/sleep.latex
@@ -48,12 +48,6 @@
\renewcommand{\subparagraph}[1]{\oldsubparagraph{#1}\mbox{}}
\fi
-% set default figure placement to htbp
-\makeatletter
-\def\fps@figure{htbp}
-\makeatother
-
-
\title{SLEEP - Syncable Ledger of Exact Events Protocol}
\author{Mathias Buus Madsen, Maxwell Ogden, Code for Science}
\date{August 2017}
@@ -138,8 +132,8 @@ SLEEP files are laid out like this:
\item
32 byte header
\item
- 4 bytes - magic byte (value varies depending on which file, used to
- quickly identify which file type it is)
+ 4 bytes Uint32BE (``Big-Endian'') - magic byte (value varies depending
+ on which file, used to quickly identify which file type it is)
\item
1 byte - version number of the file header protocol, current version
is 0
@@ -149,11 +143,10 @@ SLEEP files are laid out like this:
\item
1 byte - length prefix for body
\item
- rest of 32 byte header - string describing key algorithm (in dat
- `ed25519'). length of this string matches the length in the previous
- length prefix field. This string must fit within the 32 byte header
- limitation (24 bytes reserved for string). Unused bytes should be
- filled with zeroes.
+ rest of 32 byte header - string describing key or hash algorithm.
+ length of this string matches the length in the previous length prefix
+ field. This string must fit within the 32 byte header limitation (24
+ bytes reserved for string). Unused bytes should be filled with zeroes.
\end{itemize}
Possible values in the Dat implementation for the body field are:
@@ -365,9 +358,9 @@ random access regions of files in sparse replication scenarios.
byte range.
\item
The chunk described by this child node will contain the byte range you
- are looking for. You can use the \texttt{byteOffset} property in the
- \texttt{Stat} metadata object to seek into the right position in the
- content for the start of this chunk.
+ are looking for. You can use the \texttt{byteOffset} field in the
+ \texttt{Stat} metadata object to seek to the correct position in the
+ content file for the start of this chunk.
\end{itemize}
\subparagraph{Metadata Overhead}\label{metadata-overhead}
@@ -515,10 +508,9 @@ are all uniform (\texttt{{[}1,1{]}})
6 - [11 11 11 11]
\end{verbatim}
-Using this scheme, to represent 32 bytes of data it takes at most 8
-bytes of Index. In this example it compresses nicely as its all
-contiguous ones on disk, similarly for an empty bitfield it would be all
-zeroes.
+Using this scheme, it takes at most 8 bytes of Index to represent 32
+bytes of data. In this example the Index can compresses well because it
+consists of all one bits. Similarly, an empty bitfield is all zero bits.
If you write 4GB of data using on average 64KB data chunk size, your
bitfield will be at most 32KB.
@@ -747,6 +739,9 @@ These are the field definitions:
\texttt{mtime} - POSIX created\_at time
\end{itemize}
+\subsection*{References}\label{references}
+\addcontentsline{toc}{subsection}{References}
+
\hypertarget{refs}{}
\hypertarget{ref-varda2008protocol}{}
Varda, Kenton. 2008. ``Protocol Buffers: Google's Data Interchange
diff --git a/papers/sleep.md b/papers/sleep.md
index d7349f4..5797360 100644
--- a/papers/sleep.md
+++ b/papers/sleep.md
@@ -175,7 +175,7 @@ The above method illustrates how to resolve a chunk position index to a byte off
- First, you start by calculating the current Merkle roots
- Each node in the tree (including these root nodes) stores the aggregate file size of all byte sizes of the nodes below it. So the roots cumulatively will describe all possible byte ranges for this repository.
- Find the root that contains the byte range of the offset you are looking for and get the node information for all of that nodes children using the Index Lookup method, and recursively repeat this step until you find the lowest down child node that describes this byte range.
-- The chunk described by this child node will contain the byte range you are looking for. You can use the `byteOffset` property in the `Stat` metadata object to seek into the right position in the content for the start of this chunk.
+- The chunk described by this child node will contain the byte range you are looking for. You can use the `byteOffset` field in the `Stat` metadata object to seek to the correct position in the content file for the start of this chunk.
##### Metadata Overhead
@@ -276,7 +276,7 @@ The tuples at entry `1` above are `[1,0]` because the relative child tuples are
6 - [11 11 11 11]
```
-Using this scheme, to represent 32 bytes of data it takes at most 8 bytes of Index. In this example it compresses nicely as its all contiguous ones on disk, similarly for an empty bitfield it would be all zeroes.
+Using this scheme, it takes at most 8 bytes of Index to represent 32 bytes of data. In this example the Index can compresses well because it consists of all one bits. Similarly, an empty bitfield is all zero bits.
If you write 4GB of data using on average 64KB data chunk size, your bitfield will be at most 32KB.
diff --git a/papers/sleep.pdf b/papers/sleep.pdf
index a4281e8..1c59c91 100644
--- a/papers/sleep.pdf
+++ b/papers/sleep.pdf
Binary files differ
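
Reviewer's sketch, not part of the commit: the sleep.latex hunks above tighten the description of the 32 byte SLEEP file header (4 byte Uint32BE magic value, 1 byte version, 1 byte length prefix, then an algorithm name padded with zeroes into the remaining 24 bytes). The Python below packs and parses such a header as a sanity check on the byte counts. Two things are assumptions: the two bytes between the version and the length prefix fall outside the hunks shown here, and the sketch treats them as a Uint16BE entry-size field; the magic value `0x01020304` and the function names are placeholders, not values from the SLEEP specification.

```python
import struct

HEADER_SIZE = 32   # total header length stated in the paper
MAX_NAME_LEN = 24  # bytes reserved for the algorithm name string

# ">" = big-endian, no padding: I (4 byte magic), B (version),
# H (2 byte entry size, ASSUMED, not shown in the diff), B (name length).
FIXED_FORMAT = ">IBHB"
FIXED_SIZE = struct.calcsize(FIXED_FORMAT)  # 8 bytes


def pack_sleep_header(magic: int, entry_size: int, algorithm: str,
                      version: int = 0) -> bytes:
    """Build a 32 byte header; unused trailing bytes are filled with zeroes."""
    name = algorithm.encode("ascii")
    if len(name) > MAX_NAME_LEN:
        raise ValueError("algorithm name does not fit in the 24 reserved bytes")
    fixed = struct.pack(FIXED_FORMAT, magic, version, entry_size, len(name))
    return (fixed + name).ljust(HEADER_SIZE, b"\x00")


def unpack_sleep_header(header: bytes) -> tuple[int, int, int, str]:
    """Parse a header into (magic, version, entry_size, algorithm name)."""
    if len(header) != HEADER_SIZE:
        raise ValueError("header must be exactly 32 bytes")
    magic, version, entry_size, name_len = struct.unpack(
        FIXED_FORMAT, header[:FIXED_SIZE])
    if name_len > MAX_NAME_LEN:
        raise ValueError("length prefix exceeds the 24 reserved bytes")
    name = header[FIXED_SIZE:FIXED_SIZE + name_len].decode("ascii")
    return magic, version, entry_size, name


if __name__ == "__main__":
    # Placeholder magic value and entry size, for illustration only.
    header = pack_sleep_header(0x01020304, 40, "BLAKE2b")
    print(len(header), unpack_sleep_header(header))
```

The fixed fields add up to 8 bytes, which is consistent with the paper's statement that 24 of the 32 header bytes are reserved for the algorithm string.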