diff options
author | Max Ogden <max@maxogden.com> | 2016-05-08 11:35:56 +0200 |
---|---|---|
committer | Max Ogden <max@maxogden.com> | 2016-05-08 11:35:56 +0200 |
commit | f8f12c6de61270d587089256f62b3114cf29aa1f (patch) | |
tree | 703ee0297a686eb8ddfb2f973f9f8ddb6ea692a2 | |
parent | b9341322b043a445c9eeea194895b79fc75443ba (diff) | |
download | dat-docs-f8f12c6de61270d587089256f62b3114cf29aa1f.tar.gz dat-docs-f8f12c6de61270d587089256f62b3114cf29aa1f.zip |
inspiration section
-rw-r--r-- | paper.md | 52 |
1 files changed, 46 insertions, 6 deletions
@@ -21,15 +21,55 @@ Distributed file sharing tools like BitTorrent become faster as files become mor Decentralized version control tools for source code like Git provide a protocol for efficiently downloading changes to a set of files, but are optimized for text files and have issues with large files. Solutions like Git-LFS solve this by using HTTP to download large files, rather than the Git protocol. GitHub offers Git-LFS hosting but charges repository owners for bandwidth on popular files. Building a peer to peer distribution layer for files in a Git repository is difficult due to design of Git Packfiles which are delta compressed repository states that do not support random access to byte ranges in previous file versions. -One case study is science. Increasingly scientific datasets are being provided online using one of the above approaches, and cited in published literature. Broken links and systems that do not provide version checking or content addressability of data directly limit the reproducibility of scientific analyses based on shared datasets. Services that charge a premium for bandwidth cause monetary and data transfer strain on the users sharing the data, who are often on fast public university networks with effectively unlimited bandwidth. Version control tool designed for text files do not keep up with the demands of large data analysis in science today. +Science is an example of an important community that would benefit from better approaches in this area. Increasingly scientific datasets are being provided online using one of the above approaches, and cited in published literature. Broken links and systems that do not provide version checking or content addressability of data directly limit the reproducibility of scientific analyses based on shared datasets. Services that charge a premium for bandwidth cause monetary and data transfer strain on the users sharing the data, who are often on fast public university networks with effectively unlimited bandwidth. Version control tool designed for text files do not keep up with the demands of large data analysis in science today. ## 2. INSPIRATION -- git -- lbfs -- bittorrent -- webtorrent -- ipfs +Dat is inspired by a number of features from existing systems. + +### Git + +Git popularized the idea of a Merkle DAG, a way to represent changes to data where each change is addressed by the secure hash of the change plus all previous hashes. This provides a way to trust data integrity, as the only way a specific hash could be derived by another peer is if they have the same data and change history required to reproduce that hash. This is important for reproducibility as it lets you trust that a specific git commit hash refers to a specific source code state. + +### LBFS + +LBFS is a networked file system that avoids transferring redundant data by deduplicating common regions of files and only transferring unique regions once. The deduplication algorithm they use is called Rabin fingerprinting and works by hashing the contents of the file using a sliding window and looking for content defined chunk boundaries that probabilistically appear at the desired byte offsets (e.g. every 1kb). + +Content defined chunking has the benefit of being shift resistant, meaning if you insert a byte into the middle of a file only the first chunk boundary to the right of the insert will change, but all other boundaries will remain the same. With a fixed size chunking strategy, such as the one used by rsync, all chunk boundaries to the right of the insert will be shifted by one byte, meaning half of the chunks of the file would need to be retransmitted. + +### BitTorrent + +BitTorrent implements a swarm based file sharing protocol for static datasets. Data is split into fixed sized chunks, hashed, and then that hash is used to discover peers that have the same data. An advantage of using BitTorrent for dataset transfers is that download speeds can be saturated. Since the file is split into pieces, and peers can efficiently discover which pieces each of the peers they are connected to have, it means one peer can download non-overlapping regions of the dataset from many peers at the same time in parallel, maximizing network throughput. + +Fixed sized chunking has drawbacks for data that changes (see LBFS above). Additionally, BitTorrent always divides data into 1024 pieces, meaning large datasets could have a very large chunk size which impacts random access performance (e.g. for streaming video over BitTorrent). + +### Kademlia Distributed Hash Table + +Kademlia is a distributed hash table, in other words a distributed key/value store that can serve a similar purpose to DNS servers but has no hard coded server addresses. All clients in Kademlia are also servers. As long as you know at least one address of another peer in the network, you can ask them for the key you are trying to find and they will either have it or give you some other people to talk to that are more likely to have it. + +If you don't have an initial peer to talk to you have to use something like a bootstrap server that just randomly gives you a peer in the network to start with. If the bootstrap server goes down, the network still functions, and other methods can be used to bootstrap new peers (such as sending them peer addresses through side channels like how .torrent files include tracker addresses to try in case Kademlia finds no peers). + +Kademlia is distinct from previous DHT designs such as Chord due to its simplicity. It uses a very simple XOR operation between two keys as it's distance metric to decide which peers are closer to the data being searched for. On paper it seems like it wouldn't work, as it doesn't take into account things like ping speed or bandwidth. Instead it's design is very simple on purpose, to minimize the amount of control/gossip messages, and to minimize the amount of complexity required to implement it. In practice Kademlia has been extremely successful and is widely deployed as the "Mainline DHT" for BitTorrent, with support in all popular BitTorrent clients today. + +### Peer to Peer Streaming Peer Protocol (PPSPP) + +PPSPP ([IETF RFC 7574](https://datatracker.ietf.org/doc/rfc7574/?include_text=1)) is a protocol for live streaming content over a peer to peer network. In it they define a specific type of Merkle Tree that allows for subsets of the hashes to be requested by a peer in order to reduce the time-till-playback for end users. BitTorrent for example transfers all hashes up front, which is not suitable for live streaming. + +Their Merkle trees are ordered using a scheme they call "bin numbering", which is a method for deterministically arranging an append-only log of leaf nodes into an in-order layout tree where non-leaf nodes are derived hashes. If you want to verify a specific node, you only need to request its sibling's hash and all its uncle hashes. PPSPP is very concerned with reducing round trip time and time-till-playback by allowing for many kinds of optimizations to pack as many hashes into datagrams as possible when exchanging tree information with peers. + +The ability to request a subset of metadata from a large and/or streaming dataset is very desirable for the Dat use case. + +### WebTorrent + +With WebRTC, browsers can now make peer to peer connections directly to other browsers. BitTorrent uses UDP sockets which aren't available to browser JavaScript, so can't be used as-is on the Web. + +WebTorrent implements the BitTorrent protocol in JavaScript using WebRTC as the transport. This includes the BitTorrent block exchange protocol as well as the tracker protocol implemented in a way that can enable hybrid nodes, talking simultaneously to both BitTorrent and WebTorrent swarms (if a peer is capable of making both UDP sockets as well as WebRTC sockets). Trackers are exposed to web clients over HTTP or WebSockets. In a normal web browser you can only use WebRTC to exchange data with peers. + +### InterPlanetary File System + +IPFS also builds on many of the concepts from this section and presents a new platform similar in scope to the Web that has content integrity, peer to peer file sharing, version history and data permanence baked in as a sort of upgrade to the current Web. Whereas Dat is one application of these ideas that is specifically focused on sharing datasets but is agnostic to what platform it is built on, IPFS goes lower level and abstracts network sockets and naming systems so that any application built on the Web can alternatively be built on IPFS to inherit it's properties, as long as their hyperlinks can be expressed as content addressed addresses to the IPFS global Merkle DAG. + +The research behind IPFS has coalesced many of these ideas into a more accessible format. ## 3. DESIGN |