Chapter 11: Large Scale Systems and Overlay Routing Flashcards
1
Q
Large‐scale storage applications: Web indexing (Google), Web archives; motivation and goals
A
Web indexing:
- Goal: Index the entire Web
- Estimate: Google has a 250,000‐node cluster!
- Worldwide & massively distributed
- Organized as datacenters of clusters (racks upon racks)
Web archives:
- Goal: Make and archive a daily checkpoint of the Web
- Estimates
- Web is about 57 Tbyte, compressed HTML+img
- New data per day: 580 Gbyte
- ~1000 Tbyte per year with 5 replicas (just for new data)
- Design
- 10,000 nodes: 100 Gbyte disk each (today: maybe ~4 TB each)
2
Q
Client server limitations
A
- Scalability is expensive
- Presents a single point of failure
- Requires administration
- Unused resources at the network edge
- P2P systems try to address these limitations and leverage (otherwise) unused resources
3
Q
P2P computing
A
- P2P computing is the sharing of computer resources and services by direct exchange between systems.
- These resources and services include the exchange of data, processing cycles, cache storage, and disk storage for files.
- P2P computing takes advantage of existing computing power, computer storage and networking connectivity, allowing users to leverage their collective power to the ‘benefit’ of all.
4
Q
What is a P2P system?
A
- A distributed system architecture
- No centralized control
- Nodes are symmetric in function
- Large number of unreliable nodes
- Enabled by technology improvements
5
Q
P2P architecture
A
- All nodes are both clients and servers
- Provide and consume
- Any node can initiate a connection
- No centralized data source
- “The ultimate form of democracy on the Internet”
- “The ultimate threat to copyright protection on the Internet”
- In practice, hybrid models are popular
- Combination of client‐server & peer‐to‐peer
- E.g., Skype (in its early days; its current architecture is unknown), Spotify
6
Q
P2P benefits
A
- Efficient use of resources
- Unused bandwidth, storage, processing power at the edge of the network
- Scalability
- Consumers of resources also donate resources
- Aggregate resources grow naturally with utilization
- Organic scaling
- Infrastructure‐less scaling
- Caveat: It is not a one‐size‐fits‐all solution
- Large companies are not switching to p2p
- Reliability (in aggregate)
- Replicas
- Redundancy
- Geographic distribution
- No single point of failure
- Ease of administration
- Nodes self‐organize
- No need to deploy servers to satisfy demand
- Built‐in fault‐tolerance, replication, and load balancing
7
Q
Popular P2P systems (first generation)
A
- Unstructured p2p systems: Napster, Gnutella, FastTrack, Freenet, eDonkey, BitTorrent
- Large‐scale sharing of files
- User A makes files (music, video, etc.) on their computer available to others
- User B connects to the network, searches for files and downloads files directly from User A
- Issues of copyright infringement
8
Q
Napster: June 1999 – July 2001
A
- A way to share (music) files with others (maybe the first)
- Users upload their list of files to Napster server
- Users send queries to Napster server for files of interest
- Keyword search (artist, song, album, bit rate, etc.)
- Napster server replies with the IP addresses of users with matching files
- Querying users connect directly to file providing user for download
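The centralized lookup can be pictured as a tiny index service. Below is a minimal Python sketch of that flow (all names are hypothetical, not the real Napster protocol): peers upload their file lists, queries hit the central index, and the actual download then happens directly between peers.

class NapsterIndex:
    """Toy central index: maps shared filenames to the peers offering them."""

    def __init__(self):
        self.index = {}  # filename -> set of "ip:port" strings

    def upload_file_list(self, peer_addr, filenames):
        # A peer registers the files it is willing to share.
        for name in filenames:
            self.index.setdefault(name, set()).add(peer_addr)

    def search(self, keyword):
        # Keyword match against filenames; returns matching peers' addresses.
        return {addr
                for name, peers in self.index.items()
                if keyword.lower() in name.lower()
                for addr in peers}

# Usage: user B queries the index, then downloads directly from user A.
server = NapsterIndex()
server.upload_file_list("10.0.0.1:6699", ["artist - song.mp3"])
print(server.search("song"))  # -> {'10.0.0.1:6699'}

Note that the server holds only metadata; the file bytes never pass through it.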
9
Q
Gnutella: 2000 – today
A
- Share any type of files (not just music)
- Decentralized search, unlike Napster
- Ask neighbors for files of interest
- Neighbors ask their neighbors, and so on
- TTL field quenches messages after a number of hops
- Users with matching files reply to you
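A minimal sketch of the TTL‐limited flooding described above (Python; the structure is hypothetical, and real Gnutella deduplicates by message ID rather than by a shared `seen` set):

from dataclasses import dataclass, field

@dataclass
class Node:
    id: int
    files: list = field(default_factory=list)      # filenames shared by this node
    neighbors: list = field(default_factory=list)  # directly connected nodes

def flood_query(node, keyword, ttl, seen=None):
    """Ask `node`, then its neighbors recursively, for matching files."""
    if seen is None:
        seen = set()
    if ttl == 0 or node.id in seen:
        return []                     # TTL quenches the query after some hops
    seen.add(node.id)
    hits = [f for f in node.files if keyword in f]
    for neighbor in node.neighbors:
        hits += flood_query(neighbor, keyword, ttl - 1, seen)
    return hits

# Usage: a, b = Node(1), Node(2, files=["song.mp3"])
# a.neighbors = [b]; b.neighbors = [a]
# flood_query(a, "song", ttl=3)  -> ['song.mp3']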
10
Q
Freenet (freenetproject.org), since 2000
A
- Founder's stated goal: “Providing freedom of speech with strong anonymity protection.”
- Protects anonymity of participants
- Platform for censorship‐resistant communication
- Decentralized, highly survivable, distributed cache (blogs, pages, files, etc.)
- Fully peer‐to‐peer, no dedicated clients or servers
- Only enables access to previously inserted information (it is not a Web proxy)
- Every node contributes a configurable amount of storage
- Not possible for a node to rate another node (except on insert/retrieve capacity)
11
Q
Freenet Anonymity requirement & implications
A
- Anonymity for information upload & download
- The uploading source need not remain on the network after upload
- Files are broken into encrypted blocks that are redundantly stored across the network
- For download, blocks are found and reassembled
- Node requesting a datum does not connect directly to node that has datum
- The datum is routed across intermediaries, none of which knows the request originator or the datum's location
- Higher bandwidth use required, slower transfers
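To make the block idea concrete, here is an illustrative Python sketch (assumptions: a 32 KB block size and a caller‐supplied `encrypt` function; this is not Freenet's actual CHK scheme):

import hashlib

BLOCK_SIZE = 32 * 1024  # assumed block size, for illustration only

def split_into_blocks(data: bytes, encrypt) -> dict:
    """Return {block_key: ciphertext}; `encrypt` is any byte-level cipher."""
    blocks = {}
    for i in range(0, len(data), BLOCK_SIZE):
        ciphertext = encrypt(data[i:i + BLOCK_SIZE])
        key = hashlib.sha256(ciphertext).hexdigest()  # block addressed by hash
        blocks[key] = ciphertext  # each block gets replicated across nodes
    return blocks

# Download reverses this: fetch blocks by key via intermediaries, decrypt,
# and reassemble, which is why transfers cost more bandwidth and are slower.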
12
Q
Freenet Key disadvantage of storage model
A
- No one node is responsible for any block of data
- If data is not retrieved for some time, old data may be dropped when newly arriving data exceeds the available space
- Therefore, Freenet tends to ‘forget’ data that is not retrieved regularly (see the sketch below)
- There is no way to delete data (unless it is “forgotten”)
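The ‘forgetting’ behavior can be modeled as a bounded LRU cache per node, sketched below in Python (illustrative only; Freenet's actual eviction policy is more involved):

from collections import OrderedDict

class DataStore:
    """Bounded per-node store: least-recently-requested blocks get evicted."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()

    def retrieve(self, key):
        if key in self.blocks:
            self.blocks.move_to_end(key)  # a retrieval keeps the block alive
        return self.blocks.get(key)

    def insert(self, key, block):
        self.blocks[key] = block
        self.blocks.move_to_end(key)
        while len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # drop least recently used block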
13
Q
Comparison of file sharing networks
A
- Napster (centralized)
- Bottleneck (scalability, failure, denial of service)
- Correct search results (centralized search)
- Gnutella (distributed)
- No central bottleneck, but large cost due to flooding query
- No guarantee on search results
- Freenet (distributed)
- Anonymity
- Less efficient data transfer
- No guarantee on search result
14
Q
Structured peer‐to‐peer systems
A
- Second generation peer‐to‐peer overlay networks
- Self‐organizing, load balanced, fault‐tolerant
- Guarantees on the number of hops to answer a query
- Based on a (distributed) hash table interface
- Put(Key, Data)
- Get(Key)
- Systems: Chord, CAN, Pastry, etc.
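The interface above can be sketched as follows (Python; the single ‘hash key to node’ rule is a toy stand‐in for Chord/CAN/Pastry routing, which instead locates the responsible node in a bounded number of hops, e.g. O(log N) for Chord):

import hashlib

class ToyDHT:
    """put/get over a fixed node set; hashing picks the responsible node."""

    def __init__(self, num_nodes):
        self.nodes = [{} for _ in range(num_nodes)]  # each dict = one node's storage

    def _node_for(self, key):
        h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
        return self.nodes[h % len(self.nodes)]

    def put(self, key, data):
        self._node_for(key)[key] = data

    def get(self, key):
        return self._node_for(key).get(key)

# Usage:
dht = ToyDHT(num_nodes=8)
dht.put("cats.txt", b"meow")
print(dht.get("cats.txt"))  # -> b'meow'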
15
Q
Distributed hash tables (DHT)
A
- Distributed version of a hash table data structure
- Store and retrieve (key, value)‐pairs
- Key is, e.g., a filename, a hash of the name, or a hash of the content (since the name could change)
- Value is file content
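A small sketch of the key choice: deriving the key from the content rather than the mutable filename keeps the (key, value) binding stable across renames (Python, illustrative):

import hashlib

def content_key(file_bytes: bytes) -> str:
    # Renaming the file does not change this key, unlike a name-based key.
    return hashlib.sha1(file_bytes).hexdigest()

key = content_key(b"...file content...")
# dht.put(key, file_bytes); later, any node can dht.get(key)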