Diffstat (limited to 'papers/dat-paper.txt')
-rw-r--r--  papers/dat-paper.txt  |  48
1 file changed, 24 insertions, 24 deletions
diff --git a/papers/dat-paper.txt b/papers/dat-paper.txt
index 9ff3f99..cf8cd85 100644
--- a/papers/dat-paper.txt
+++ b/papers/dat-paper.txt
@@ -87,7 +87,7 @@ scientific literature}.
 
 Cloud storage services like S3 ensure availability of data, but they
 have a centralized hub-and-spoke networking model and are therefore
-limited by their bandwidth, meaning popular files can be come very
+limited by their bandwidth, meaning popular files can become very
 expensive to share. Services like Dropbox and Google Drive provide
 version control and synchronization on top of cloud storage services
@@ -203,7 +203,7 @@ able to discover or communicate with any member of the swarm for that
 Dat. Anyone with the public key can verify that messages (such as
 entries in a Dat Stream) were created by a holder of the private key.
 
-Every Dat repository has corresponding a private key that kept in your
+Every Dat repository has a corresponding private key that is kept in your
 home folder and never shared. Dat never exposes either the public or
 private key over the network. During the discovery phase the BLAKE2b
 hash of the public key is used as the discovery key. This means that the
@@ -327,7 +327,7 @@ UTP source it tries to connect using both protocols. If one connects
 first, Dat aborts the other one. If none connect, Dat will try again
 until it decides that source is offline or unavailable and then stops
 trying to connect to them. Sources Dat is able to connect to go into a
-list of known good sources, so that the Internet connection goes down
+list of known good sources, so that if the Internet connection goes down
 Dat can use that list to reconnect to known good sources again quickly.
 
 If Dat gets a lot of potential sources it picks a handful at random to
@@ -392,7 +392,7 @@ of a repository, and data is stored as normal files in the root folder.
 \subsubsection{Metadata Versioning}\label{metadata-versioning}
 
 Dat tries as much as possible to act as a one-to-one mirror of the state
-of a folder and all it's contents. When importing files, Dat uses a
+of a folder and all its contents. When importing files, Dat uses a
 sorted depth-first recursion to list all the files in the tree. For each
 file it finds, it grabs the filesystem metadata (filename, Stat object,
 etc) and checks if there is already an entry for this filename with this
@@ -421,7 +421,7 @@ for old versions in \texttt{.dat}. Git for example stores all previous
 content versions and all previous metadata versions in the \texttt{.git}
 folder. Because Dat is designed for larger datasets, if it stored all
 previous file versions in \texttt{.dat}, then the \texttt{.dat} folder
-could easily fill up the users hard drive inadverntently. Therefore Dat
+could easily fill up the user's hard drive inadvertently. Therefore Dat
 has multiple storage modes based on usage.
 
 Hypercore registers include an optional \texttt{data} file that stores
@@ -441,7 +441,7 @@ you know the server has the full history.
 Registers in Dat use a specific method of encoding a Merkle tree where
 hashes are positioned by a scheme called binary in-order interval
 numbering or just ``bin'' numbering. This is just a specific,
-deterministic way of laying out the nodes in a tree. For example a tree
+deterministic way of laying out the nodes in a tree. For example, a tree
 with 7 nodes will always be arranged like this:
 
 \begin{verbatim}
@@ -498,7 +498,7 @@ It is possible for the in-order Merkle tree to have multiple roots at once.
 A root is defined as a parent node with a full set of child node slots
 filled below it.
 
-For example, this tree hash 2 roots (1 and 4)
+For example, this tree has 2 roots (1 and 4)
 
 \begin{verbatim}
 0
@@ -508,7 +508,7 @@ For example, this tree hash 2 roots (1 and 4)
 4
 \end{verbatim}
 
-This tree hash one root (3):
+This tree has one root (3):
 
 \begin{verbatim}
 0
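To make the ``bin'' numbering in the two hunks above concrete, here is a
minimal TypeScript sketch. It assumes only what the quoted text states
(leaves at even indexes, each parent at the odd index between its
children); the function names are illustrative, not Dat's API.

\begin{verbatim}
// Index of the node at a given depth and offset within that depth:
// depth 0 yields 0, 2, 4, ...; depth 1 yields 1, 5, 9, ...; and so on.
function binIndex(depth: number, offset: number): number {
  return offset * 2 ** (depth + 1) + 2 ** depth - 1;
}

// Depth of a node = the number of trailing 1-bits in its index.
function depth(index: number): number {
  let d = 0;
  while (index & 1) { index >>= 1; d++; }
  return d;
}

// Roots of a tree with n leaves: one full subtree per set bit of n,
// packed largest-first from the left.
function roots(n: number): number[] {
  const out: number[] = [];
  let leaf = 0; // leaf position where the next full subtree starts
  for (let p = 30; p >= 0; p--) {
    if (n & (1 << p)) {
      out.push(binIndex(p, leaf >> p));
      leaf += 1 << p;
    }
  }
  return out;
}

// Reproduce the paper's examples:
for (let i = 0; i < 7; i++) console.log('  '.repeat(depth(i)) + i);
console.log(roots(3)); // [ 1, 4 ] -- the tree with 2 roots above
console.log(roots(4)); // [ 3 ]    -- the tree with one root (3)
\end{verbatim}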
@@ -554,7 +554,7 @@ process. The seven chunks get sorted into a list like this:
 bat-1
 bat-2
 bat-3
-cat-1 
+cat-1
 cat-2
 cat-3
 \end{verbatim}
@@ -583,7 +583,7 @@ for this Dat. This tree is for the hashes of the contents of the photos.
 
 There is also a second Merkle tree that Dat generates that represents
 the list of
-files and their metadata and looks something like this (the metadata
+files and their metadata, and looks something like this (the metadata
 register):
 
 \begin{verbatim}
@@ -984,7 +984,7 @@ Ed25519 sign(
 \end{verbatim}
 
 The reason we hash all the root nodes is that the BLAKE2b hash above is
-only calculateable if you have all of the pieces of data required to
+only calculable if you have all of the pieces of data required to
 generate all the intermediate hashes. This is the crux of Dat's data
 integrity guarantees.
 
@@ -1022,7 +1022,7 @@ Each entry contains three objects:
 \begin{itemize}
 \tightlist
 \item
-  Data Bitfield (1024 bytes) - 1 bit for for each data entry that you
+  Data Bitfield (1024 bytes) - 1 bit for each data entry that you
   have synced (1 for every entry in \texttt{data}).
 \item
   Tree Bitfield (2048 bytes) - 1 bit for every tree entry (all nodes in
@@ -1040,8 +1040,8 @@ filesystem. The Tree and Index sizes are based on the Data size (the
 Tree has twice the entries as the Data, odd and even nodes vs just even
 nodes in \texttt{tree}, and Index is always 1/4th the size).
 
-To generate the Index, you pairs of 2 bytes at a time from the Data
-Bitfield, check if all bites in the 2 bytes are the same, and generate 4
+To generate the Index, you pair 2 bytes at a time from the Data
+Bitfield, check if all bits in the 2 bytes are the same, and generate 4
 bits of Index metadata~for every 2 bytes of Data (hence how 1024 bytes
 of Data ends up as 256 bytes of Index).
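The Index construction in the last hunk is mechanical enough to sketch.
The loop below implements the 2-bytes-in, 4-bits-out compression exactly
as described; the specific nibble values chosen for ``all ones'', ``all
zeros'' and ``mixed'' are illustrative placeholders, not SLEEP's exact
on-disk encoding.

\begin{verbatim}
// Compress a Data bitfield into an Index: 4 bits per 2-byte pair.
// Assumes data.length is a multiple of 4.
function buildIndex(data: Uint8Array): Uint8Array {
  const index = new Uint8Array(data.length / 4); // 1024 -> 256 bytes
  for (let i = 0; i < data.length; i += 2) {
    const a = data[i], b = data[i + 1];
    let nibble: number;
    if (a === 0xff && b === 0xff) nibble = 0b1111;      // every bit set
    else if (a === 0x00 && b === 0x00) nibble = 0b0000; // no bit set
    else nibble = 0b1010;                               // mixed
    const j = i / 2; // which nibble of the Index we are writing
    index[j >> 1] |= j % 2 === 0 ? nibble << 4 : nibble;
  }
  return index;
}

// 1024 bytes of fully-synced Data -> 256 bytes of Index, as stated.
console.log(buildIndex(new Uint8Array(1024).fill(0xff)).length); // 256
\end{verbatim}

The payoff is that a reader can skip whole runs of the Data Bitfield
(entirely synced or entirely missing) without scanning it bit by bit.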
@@ -1103,7 +1103,7 @@ the SLEEP files.
 
 The contents of this file is a series of versions of the Dat filesystem
 tree. As this is a hypercore data feed, it's just an append only log of
-binary data entries. The challenge is representing a tree in an one
+binary data entries. The challenge is representing a tree in a one
 dimensional way to make it representable as a Hypercore register.
 
 For example, imagine three files:
@@ -1368,7 +1368,7 @@ register message on the first channel only (metadata).
 \begin{itemize}
 \tightlist
 \item
-  \texttt{id} - 32 byte random data used as a identifier for this peer
+  \texttt{id} - 32 byte random data used as an identifier for this peer
   on the network, useful for checking if you are connected to yourself
   or another peer more than once
 \item
@@ -1548,7 +1548,7 @@ message Cancel {
 \subsubsection{Data}\label{data-1}
 
 Type 9. Sends a single chunk of data to the other peer. You can send it
-in response to a Request or unsolicited on it's own as a friendly gift.
+in response to a Request or unsolicited on its own as a friendly gift.
 The data includes all of the Merkle tree parent nodes needed to verify
 the hash chain all the way up to the Merkle roots for this chunk.
 Because you can produce the direct parents by hashing the chunk, only
@@ -1580,7 +1580,7 @@ message Data {
   optional bytes value = 2;
   repeated Node nodes = 3;
   optional bytes signature = 4;
-  
+
   message Node {
     required uint64 index = 1;
     required bytes hash = 2;
@@ -1611,7 +1611,7 @@ like Git-LFS solve this by using HTTP to download large files, rather
 than the Git protocol. GitHub offers Git-LFS hosting but charges
 repository owners for bandwidth on popular files. Building a distributed
 distribution layer for files in a Git repository is difficult due to
-design of Git Packfiles which are delta compressed repository states
+design of Git Packfiles, which are delta compressed repository states
 that do not easily support random access to byte ranges in previous
 file versions.
 
@@ -1704,7 +1704,7 @@ very desirable for many other types of datasets.
 
 \subsection{WebTorrent}\label{webtorrent}
 
-With WebRTC browsers can now make peer to peer connections directly to
+With WebRTC, browsers can now make peer to peer connections directly to
 other browsers. BitTorrent uses UDP sockets which aren't available to
 browser JavaScript, so can't be used as-is on the Web.
 
@@ -1722,7 +1722,7 @@ System}\label{interplanetary-file-system}
 IPFS is a family of application and network protocols that have peer to
 peer file sharing and data permanence baked in. IPFS abstracts network
 protocols and naming systems to provide an alternative application
-delivery platform to todays Web. For example, instead of using HTTP and
+delivery platform to today's Web. For example, instead of using HTTP and
 DNS directly, in IPFS you would use LibP2P streams and IPNS in order to
 gain access to the features of the IPFS platform.
 
@@ -1731,7 +1731,7 @@ Registers}\label{certificate-transparencysecure-registers}
 
 The UK Government Digital Service have developed the concept of a
 register which they define as a digital public ledger you can trust. In
-the UK government registers are beginning to be piloted as a way to
+the UK, government registers are beginning to be piloted as a way to
 expose essential open data sets in a way where consumers can verify the
 data has not been tampered with, and allows the data publishers to
 update their data sets over time.
@@ -1740,7 +1740,7 @@ The design of registers was inspired by the infrastructure backing the
 Certificate Transparency (Laurie, Langley, and Kasper 2013) project,
 initated at Google, which provides a service on top of SSL certificates
 that enables service providers to write certificates to a distributed
-public ledger. Anyone client or service provider can verify if a
+public ledger. Any client or service provider can verify if a
 certificate they received is in the ledger, which protects against so
 called ``rogue certificates''.
@@ -1763,7 +1763,7 @@ they need to), as well as a
 \href{https://github.com/bittorrent/bootstrap-dht}{DHT bootstrap}
 server. These discovery servers are the only centralized infrastructure
 we need for Dat to work over the Internet, but they are redundant,
-interchangeable, never see the actual data being shared, anyone can run
+interchangeable, never see the actual data being shared, and anyone can run
 their own and Dat will still work even if they all are unavailable. If
 this happens discovery will just be manual (e.g.~manually sharing
 IP/ports).
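Returning to the Data message hunks (@@ -1548 and @@ -1580) above: the
reason a single chunk plus its \texttt{nodes} can be verified in
isolation falls out of the Merkle layout. A receiver rebuilds the hash
chain bottom-up and compares it to a root it already trusts (and, per
the signing hunk earlier in this diff, checks the Ed25519 signature over
the root hashes). A minimal sketch with deliberately simplified hashing
follows; Dat's real scheme uses BLAKE2b-256 with type and length
framing, and the \texttt{siblings}/\texttt{leftSide} shape here is a
stand-in for the indexed \texttt{Node} entries in the real message.

\begin{verbatim}
import { createHash } from 'crypto';

// Illustrative hash; Node's 'blake2b512' stands in for BLAKE2b-256.
const hash = (...parts: Buffer[]): Buffer =>
  parts
    .reduce((h, p) => h.update(p), createHash('blake2b512'))
    .digest();

// Recompute a root from a chunk plus the sibling hashes sent in the
// Data message. leftSide[i] says whether our running hash is the left
// child at step i (derivable from node indexes in the real protocol).
function rootFrom(
  chunk: Buffer,
  siblings: Buffer[],
  leftSide: boolean[]
): Buffer {
  let h = hash(chunk); // leaf hash: computable from the chunk itself
  siblings.forEach((s, i) => {
    h = leftSide[i] ? hash(h, s) : hash(s, h);
  });
  return h;
}

// A peer that trusts `root` accepts an unsolicited chunk only if
// rootFrom(chunk, msg.siblings, msg.leftSide).equals(root) holds.
\end{verbatim}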