field | value | date |
---|---|---|
author | bnewbold <bnewbold@robocracy.org> | 2018-01-12 16:45:40 -0500 |
committer | Joe Hand <joe@joeahand.com> | 2018-01-12 13:45:40 -0800 |
commit | 9f0ce82c58741a7c01176b2d6bb2049b8cf52e05 | |
tree | 092507aa819301dfac57c3bfcf6d1f223b5dcd89 /papers | |
parent | 5b37b1b8bd0615c1d487abfc4b1543dfdacbfd36 | |
Paper Cleanups (#102)
* yet more grammar tweaks
* drop stub hyperdrive paper
This file had no content and was confusing.
* papers: clarify that build output is latex
* dat paper: remove stubs and multi-writer
Consensus is to document multi-writer via a DEP (RFC-like) process
first, and not propose new changes in the whitepaper first.
* dat-paper: note that paper has been updated
This could probably be clarified better (minor revisions and bug-fixes
vs. substantial updates), and maybe we want to pull out the original
"1.0" paper for historical reference. Let's wait until DEP process
settles down first.
* papers: re-build PDFs
Diffstat (limited to 'papers')
-rwxr-xr-x | papers/buildpapers.sh | 6 |
-rw-r--r-- | papers/dat-paper.latex (renamed from papers/dat-paper.txt) | 84 |
-rw-r--r-- | papers/dat-paper.md | 28 |
-rw-r--r-- | papers/dat-paper.pdf | bin | 250116 -> 248958 bytes |
-rw-r--r-- | papers/hyperdrive.md | 13 |
-rw-r--r-- | papers/sleep.latex (renamed from papers/sleep.txt) | 35 |
-rw-r--r-- | papers/sleep.md | 4 |
-rw-r--r-- | papers/sleep.pdf | bin | 202643 -> 207531 bytes |
8 files changed, 38 insertions, 132 deletions
diff --git a/papers/buildpapers.sh b/papers/buildpapers.sh
index 420e916..089ed89 100755
--- a/papers/buildpapers.sh
+++ b/papers/buildpapers.sh
@@ -1,9 +1,9 @@
 #!/usr/bin/env sh
-pandoc --filter pandoc-citeproc --bibliography=dat-paper.bib --variable classoption=twocolumn --variable papersize=a4paper -s dat-paper.md -t latex -o dat-paper.txt
+pandoc --filter pandoc-citeproc --bibliography=dat-paper.bib --variable classoption=twocolumn --variable papersize=a4paper -s dat-paper.md -t latex -o dat-paper.latex
 pandoc --filter pandoc-citeproc --bibliography=dat-paper.bib --variable classoption=twocolumn --variable papersize=a4paper -s dat-paper.md -o dat-paper.pdf
-pandoc --filter pandoc-citeproc --bibliography=dat-paper.bib --variable classoption=twocolumn --variable papersize=a4paper -s sleep.md -t latex -o sleep.txt
+pandoc --filter pandoc-citeproc --bibliography=dat-paper.bib --variable classoption=twocolumn --variable papersize=a4paper -s sleep.md -t latex -o sleep.latex
-pandoc --filter pandoc-citeproc --bibliography=dat-paper.bib --variable classoption=twocolumn --variable papersize=a4paper -s sleep.md -o sleep.pdf
\ No newline at end of file
+pandoc --filter pandoc-citeproc --bibliography=dat-paper.bib --variable classoption=twocolumn --variable papersize=a4paper -s sleep.md -o sleep.pdf
diff --git a/papers/dat-paper.txt b/papers/dat-paper.latex
index f0e71c2..5d90b07 100644
--- a/papers/dat-paper.txt
+++ b/papers/dat-paper.latex
@@ -48,15 +48,9 @@
 \renewcommand{\subparagraph}[1]{\oldsubparagraph{#1}\mbox{}}
 \fi
-% set default figure placement to htbp
-\makeatletter
-\def\fps@figure{htbp}
-\makeatother
-
-
 \title{Dat - Distributed Dataset Synchronization And Versioning}
 \author{Maxwell Ogden, Karissa McKelvey, Mathias Buus Madsen, Code for Science}
-\date{May 2017}
+\date{May 2017 (last updated: Jan 2018)}
 \begin{document}
 \maketitle
@@ -103,7 +97,7 @@ backup sources can be automatically discovered. However these file sharing tools today are not supported by Web browsers, do not have good privacy guarantees, and do not provide a mechanism for updating files without redistributing a new dataset which could mean entirely
-redownloading data you already have.
+re-downloading data you already have.
 \section{2. Dat}\label{dat}
@@ -114,7 +108,7 @@ reference implementation is available from npm as The protocol is agnostic to the underlying transport e.g.~you could implement Dat over carrier pigeon. Data is stored in a format called
-SLEEP (Ogden and Buus 2017), described in it's own paper. The key
+SLEEP (Ogden and Buus 2017), described in its own paper. The key
 properties of the Dat design are explained in this section.
 \begin{itemize}
@@ -319,7 +313,7 @@ sources to try and contact. Dat uses either TCP, HTTP or \href{https://en.wikipedia.org/wiki/Micro_Transport_Protocol}{UTP} (Rossi et al. 2010).
 UTP uses LEDBAT which is designed to not take up all available bandwidth on a network (e.g.~so that other people sharing
-wifi can still use the Internet), and is still based on UDP so works
+WiFi can still use the Internet), and is still based on UDP so works
 with NAT traversal techniques like UDP hole punching. HTTP is supported for compatibility with static file servers and web browser clients. Note that these are the protocols we support in the reference Dat
@@ -495,16 +489,16 @@ For example a register with two data entries would look something like this (pseudocode):
 \begin{verbatim}
-0. hash(value0)
+0. hash(chunk0)
 1. hash(hash(chunk0) + hash(chunk1))
-2. hash(value1)
+2. hash(chunk1)
 \end{verbatim}
 It is possible for the in-order Merkle tree to have multiple roots at once. A root is defined as a parent node with a full set of child node slots filled below it.
-For example, this tree hash 2 roots (1 and 4)
+For example, this tree has 2 roots (1 and 4)
 \begin{verbatim}
 0
@@ -514,7 +508,7 @@ For example, this tree hash 2 roots (1 and 4)
 4
 \end{verbatim}
-This tree hash one root (3):
+This tree has one root (3):
 \begin{verbatim}
 0
@@ -560,7 +554,7 @@ list like this:
 bat-1
 bat-2
 bat-3
-cat-1
+cat-1
 cat-2
 cat-3
 \end{verbatim}
@@ -702,7 +696,7 @@ matching metadata entry. This is the un-optimized resolution that uses having Alice send additional sequence numbers to Bob that help him traverse in less round trips.
-In the metadata record Bob recieved for \texttt{cat\_dna.csv} there is
+In the metadata record Bob received for \texttt{cat\_dna.csv} there is
 the byte offset to the beginning of the file in the data feed. Bob adds his +30MB offset to this value and starts requesting pieces of data starting at that byte offset using the SLEEP protocol as described
@@ -712,16 +706,6 @@ This method tries to allow any byte range of any file to be accessed without the need to synchronize the full metadata for all files up front.
-\subsubsection{Scenario: Syncing live changes to files at a specific
-path}\label{scenario-syncing-live-changes-to-files-at-a-specific-path}
-
-TODO
-
-\subsubsection{Scenario: Syncing an entire
-archive}\label{scenario-syncing-an-entire-archive}
-
-TODO
-
 \subsection{3. Dat Network Protocol}\label{dat-network-protocol}
 The SLEEP format is designed to allow for sparse replication, meaning
@@ -768,8 +752,8 @@ Type 0. Should be the first message sent on a channel.
 \texttt{discoveryKey} - A BLAKE2b keyed hash of the string `hypercore' using the public key of the metadata register as the key.
 \item
- \texttt{nonce} - 32 bytes of random binary data, used in our
- encryption scheme
+ \texttt{nonce} - 24 bytes (192 bits) of random binary data, used in
+ our encryption scheme
 \end{itemize}
 \begin{verbatim}
@@ -1006,7 +990,7 @@ message Data {
 optional bytes value = 2;
 repeated Node nodes = 3;
 optional bytes signature = 4;
-
+
 message Node {
 required uint64 index = 1;
 required bytes hash = 2;
@@ -1015,45 +999,7 @@ message Data {
 }
 \end{verbatim}
-\section{4. Multi-Writer}\label{multi-writer}
-
-The design of Dat up to this point assumes you have a single keyholder
-writing and signing data and appending it to the metadata and content
-feed. However having the ability for multiple keyholders to be able to
-write to a single repository allows for many interesting use cases such
-as forking and collaborative workflows.
-
-In order to do this, we use one \texttt{metadata.data} feed for each
-writer. Each writer kets their own keypair. Each writer is responsible
-for storing their private key. To add a new writer to your feed, you
-include their key in a metadata feed entry.
-
-For example, if Alice wants to add Bob to have write access to a Dat
-repository, Alice would take Bob's public key and writes it to the
-`local' metadata feed (the feed that Alice owns, e.g.~the original
-feed). Now anyone else who replicates from Alice will find Bob's key in
-the history. If in the future Bob distributes a version of the Dat that
-he added new data to, everyone who has a copy of the Dat from Alice will
-have a copy of Bob's key that they can use to verify that Bob's writes
-are valid.
-
-On disk, each users feed is stored in a separate hyperdrive. The
-original hyperdrive (owned by Alice) is called the `local' hyperdrive.
-Bob's hyperdrive would be stored separately in the SLEEP folder
-addressed by Bob's public key.
-
-In case Bob and Alice write different values for the same file (e.g.~Bob
-creates a ``fork''), when they sync up with each other replication will
-still work, but for the forked value the Dat client will return an array
-of values for that key instead of just one value. The values are linked
-to the writer that wrote them, so in the case of receiving multiple
-values, clients can choose to choose the value from Alice, or Bob, or
-the latest value, or whatever other strategy they prefer.
-
-If a writer updates the value of a forked key with new value they are
-performing a merge.
-
-\section{5. Existing Work}\label{existing-work}
+\section{4. Existing Work}\label{existing-work}
 Dat is inspired by a number of features from existing systems.
@@ -1208,7 +1154,7 @@ public ledger. Any client or service provider can verify if a certificate they received is in the ledger, which protects against so called ``rogue certificates''.
-\section{6. Reference Implementation}\label{reference-implementation}
+\section{5. Reference Implementation}\label{reference-implementation}
 The connection logic is implemented in a module called \href{https://www.npmjs.com/package/discovery-swarm}{discovery-swarm}.
diff --git a/papers/dat-paper.md b/papers/dat-paper.md
index b62c9b3..257cf10 100644
--- a/papers/dat-paper.md
+++ b/papers/dat-paper.md
@@ -1,6 +1,6 @@
 ---
 title: "Dat - Distributed Dataset Synchronization And Versioning"
-date: "May 2017"
+date: "May 2017 (last updated: Jan 2018)"
 author: "Maxwell Ogden, Karissa McKelvey, Mathias Buus Madsen, Code for Science"
 ---
@@ -294,14 +294,6 @@ In the metadata record Bob received for `cat_dna.csv` there is the byte offset t
 This method tries to allow any byte range of any file to be accessed without the need to synchronize the full metadata for all files up front.
-### Scenario: Syncing live changes to files at a specific path
-
-TODO
-
-### Scenario: Syncing an entire archive
-
-TODO
-
 ## 3. Dat Network Protocol
 The SLEEP format is designed to allow for sparse replication, meaning you can efficiently download only the metadata and data required to resolve a single byte region of a single file, which makes Dat suitable for a wide variety of streaming, real time and large dataset use cases.
@@ -497,21 +489,7 @@ message Data {
 }
 ```
-# 4. Multi-Writer
-
-The design of Dat up to this point assumes you have a single keyholder writing and signing data and appending it to the metadata and content feed. However having the ability for multiple keyholders to be able to write to a single repository allows for many interesting use cases such as forking and collaborative workflows.
-
-In order to do this, we use one `metadata.data` feed for each writer. Each writer gets their own keypair. Each writer is responsible for storing their private key. To add a new writer to your feed, you include their key in a metadata feed entry.
-
-For example, if Alice wants to add Bob to have write access to a Dat repository, Alice would take Bob's public key and write it to the 'local' metadata feed (the feed that Alice owns, e.g. the original feed). Now anyone else who replicates from Alice will find Bob's key in the history. If in the future Bob distributes a version of the Dat that he added new data to, everyone who has a copy of the Dat from Alice will have a copy of Bob's key that they can use to verify that Bob's writes are valid.
-
-On disk, each users feed is stored in a separate hyperdrive. The original hyperdrive (owned by Alice) is called the 'local' hyperdrive. Bob's hyperdrive would be stored separately in the SLEEP folder addressed by Bob's public key.
-
-In case Bob and Alice write different values for the same file (e.g. Bob creates a "fork"), when they sync up with each other replication will still work, but for the forked value the Dat client will return an array of values for that key instead of just one value. The values are linked to the writer that wrote them, so in the case of receiving multiple values, clients can choose to choose the value from Alice, or Bob, or the latest value, or whatever other strategy they prefer.
-
-If a writer updates the value of a forked key with new value they are performing a merge.
-
-# 5. Existing Work
+# 4. Existing Work
 Dat is inspired by a number of features from existing systems.
@@ -563,7 +541,7 @@ The UK Government Digital Service have developed the concept of a register which
 The design of registers was inspired by the infrastructure backing the Certificate Transparency [@laurie2013certificate] project, initiated at Google, which provides a service on top of SSL certificates that enables service providers to write certificates to a distributed public ledger. Any client or service provider can verify if a certificate they received is in the ledger, which protects against so called "rogue certificates".
-# 6. Reference Implementation
+# 5. Reference Implementation
 The connection logic is implemented in a module called [discovery-swarm](https://www.npmjs.com/package/discovery-swarm). This builds on discovery-channel and adds connection establishment, management and statistics.
 It provides statistics such as how many sources are currently connected, how many good and bad behaving sources have been talked to, and it automatically handles connecting and reconnecting to sources. UTP support is implemented in the module [utp-native](https://www.npmjs.com/package/utp-native).
diff --git a/papers/dat-paper.pdf b/papers/dat-paper.pdf
Binary files differ
index 5a7e758..500427e 100644
--- a/papers/dat-paper.pdf
+++ b/papers/dat-paper.pdf
diff --git a/papers/hyperdrive.md b/papers/hyperdrive.md
deleted file mode 100644
index a0d70b3..0000000
--- a/papers/hyperdrive.md
+++ /dev/null
@@ -1,13 +0,0 @@
----
-title: "Hyperdrive - A distributed web filesystem with incremental sync"
-date: "August 2017"
-author: "Mathias Buus Madsen, Maxwell Ogden, Code for Science"
----
-
-# Abstract
-
-# Acknowledgements
-
-This work was made possible through grants from the John S. and James L. Knight and Alfred P. Sloan Foundations.
-
-# References
diff --git a/papers/sleep.txt b/papers/sleep.latex
index 241347f..cd695d0 100644
--- a/papers/sleep.txt
+++ b/papers/sleep.latex
@@ -48,12 +48,6 @@
 \renewcommand{\subparagraph}[1]{\oldsubparagraph{#1}\mbox{}}
 \fi
-% set default figure placement to htbp
-\makeatletter
-\def\fps@figure{htbp}
-\makeatother
-
-
 \title{SLEEP - Syncable Ledger of Exact Events Protocol}
 \author{Mathias Buus Madsen, Maxwell Ogden, Code for Science}
 \date{August 2017}
@@ -138,8 +132,8 @@ SLEEP files are laid out like this:
 \item
 32 byte header
 \item
- 4 bytes - magic byte (value varies depending on which file, used to
- quickly identify which file type it is)
+ 4 bytes Uint32BE (``Big-Endian'') - magic byte (value varies depending
+ on which file, used to quickly identify which file type it is)
 \item
 1 byte - version number of the file header protocol, current version is 0
@@ -149,11 +143,10 @@ SLEEP files are laid out like this:
 \item
 1 byte - length prefix for body
 \item
- rest of 32 byte header - string describing key algorithm (in dat
- `ed25519'). length of this string matches the length in the previous
- length prefix field. This string must fit within the 32 byte header
- limitation (24 bytes reserved for string). Unused bytes should be
- filled with zeroes.
+ rest of 32 byte header - string describing key or hash algorithm.
+ length of this string matches the length in the previous length prefix
+ field. This string must fit within the 32 byte header limitation (24
+ bytes reserved for string). Unused bytes should be filled with zeroes.
 \end{itemize}
 Possible values in the Dat implementation for the body field are:
@@ -365,9 +358,9 @@ random access regions of files in sparse replication scenarios.
 byte range.
 \item
 The chunk described by this child node will contain the byte range you
- are looking for. You can use the \texttt{byteOffset} property in the
- \texttt{Stat} metadata object to seek into the right position in the
- content for the start of this chunk.
+ are looking for. You can use the \texttt{byteOffset} field in the
+ \texttt{Stat} metadata object to seek to the correct position in the
+ content file for the start of this chunk.
 \end{itemize}
 \subparagraph{Metadata Overhead}\label{metadata-overhead}
@@ -515,10 +508,9 @@ are all uniform (\texttt{{[}1,1{]}})
 6 - [11 11 11 11]
 \end{verbatim}
-Using this scheme, to represent 32 bytes of data it takes at most 8
-bytes of Index. In this example it compresses nicely as its all
-contiguous ones on disk, similarly for an empty bitfield it would be all
-zeroes.
+Using this scheme, it takes at most 8 bytes of Index to represent 32
+bytes of data. In this example the Index can compresses well because it
+consists of all one bits. Similarly, an empty bitfield is all zero bits.
 If you write 4GB of data using on average 64KB data chunk size, your bitfield will be at most 32KB.
@@ -747,6 +739,9 @@ These are the field definitions:
 \texttt{mtime} - POSIX created\_at time
 \end{itemize}
+\subsection*{References}\label{references}
+\addcontentsline{toc}{subsection}{References}
+
 \hypertarget{refs}{}
 \hypertarget{ref-varda2008protocol}{}
 Varda, Kenton. 2008. ``Protocol Buffers: Google's Data Interchange
diff --git a/papers/sleep.md b/papers/sleep.md
index d7349f4..5797360 100644
--- a/papers/sleep.md
+++ b/papers/sleep.md
@@ -175,7 +175,7 @@ The above method illustrates how to resolve a chunk position index to a byte off
 - First, you start by calculating the current Merkle roots
 - Each node in the tree (including these root nodes) stores the aggregate file size of all byte sizes of the nodes below it. So the roots cumulatively will describe all possible byte ranges for this repository.
 - Find the root that contains the byte range of the offset you are looking for and get the node information for all of that nodes children using the Index Lookup method, and recursively repeat this step until you find the lowest down child node that describes this byte range.
-- The chunk described by this child node will contain the byte range you are looking for. You can use the `byteOffset` property in the `Stat` metadata object to seek into the right position in the content for the start of this chunk.
+- The chunk described by this child node will contain the byte range you are looking for. You can use the `byteOffset` field in the `Stat` metadata object to seek to the correct position in the content file for the start of this chunk.
 ##### Metadata Overhead
@@ -276,7 +276,7 @@ The tuples at entry `1` above are `[1,0]` because the relative child tuples are
 6 - [11 11 11 11]
 ```
-Using this scheme, to represent 32 bytes of data it takes at most 8 bytes of Index. In this example it compresses nicely as its all contiguous ones on disk, similarly for an empty bitfield it would be all zeroes.
+Using this scheme, it takes at most 8 bytes of Index to represent 32 bytes of data. In this example the Index can compresses well because it consists of all one bits. Similarly, an empty bitfield is all zero bits.
 If you write 4GB of data using on average 64KB data chunk size, your bitfield will be at most 32KB.
diff --git a/papers/sleep.pdf b/papers/sleep.pdf
Binary files differ
index a4281e8..1c59c91 100644
--- a/papers/sleep.pdf
+++ b/papers/sleep.pdf