Chapter 11: Large Scale Systems and Overlay Routing Flashcards
1
Q
Large‐scale storage applications: Web indexing (Google), Web archives; motivation and goals
A
Web indexing:
- Goal: Index the entire Web
- Estimate: Google has a 250,000‐node cluster!
- Worldwide & massively distributed
- Organized as datacenters of clusters (racks upon racks)
Web archives:
- Goal: Make and archive a daily checkpoint of the Web
- Estimates
- Web is about 57 Tbyte, compressed HTML+img
- New data per day: 580 Gbyte
- ~1000 Tbyte per year with 5 replicas (just for new data)
- Design
- 10,000 nodes: 100 Gbyte disk each (today: maybe ~4 TB each)
2
Q
Client server limitations
A
- Scalability is expensive
- Presents a single point of failure
- Requires administration
- Unused resources at the network edge
- P2P systems try to address these limitations and leverage (otherwise) unused resources
3
Q
P2P computing
A
- P2P computing is the sharing of computer resources and services by direct exchange between systems.
- These resources and services include the exchange of data, processing cycles, cache storage, and disk storage for files.
- P2P computing takes advantage of existing computing power, computer storage and networking connectivity, allowing users to leverage their collective power to the ‘benefit’ of all.
4
Q
What is a P2P system?
A
- A distributed system architecture
- No centralized control
- Nodes are symmetric in function
- Large number of unreliable nodes
- Enabled by technology improvements
5
Q
P2P architecture
A
- All nodes are both clients and servers
- Provide and consume
- Any node can initiate a connection
- No centralized data source
- “The ultimate form of democracy on the Internet”
- “The ultimate threat to copyright protection on the Internet”
- In practice, hybrid models are popular
- Combination of client‐server & peer‐to‐peer
- E.g., Skype (in its early days; its current architecture is unknown), Spotify
6
Q
P2P benefits
A
- Efficient use of resources
- Unused bandwidth, storage, processing power at the edge of the network
- Scalability
- Consumers of resources also donate resources
- Aggregate resources grow naturally with utilization
- Organic scaling
- Infrastructure‐less scaling
- Caveat: It is not a one‐size‐fits‐all solution
- Large companies are not switching to p2p
- Reliability (in aggregate)
- Replicas
- Redundancy
- Geographic distribution
- No single point of failure
- Ease of administration
- Nodes self‐organize
- No need to deploy servers to satisfy demand
- Built‐in fault‐tolerance, replication, and load balancing
7
Q
Popular P2P systems (first generation)
A
- Unstructured p2p systems: Napster, Gnutella, FastTrack, Freenet, eDonkey, BitTorrent
- Large‐scale sharing of files
- User A makes files (music, video, etc.) on their computer available to others
- User B connects to the network, searches for files and downloads files directly from User A
- Issues of copyright infringement
8
Q
Napster: June 1999 – July 2001
A
- A way to share (music) files with others (maybe the first)
- Users upload their list of files to Napster server
- Users send queries to Napster server for files of interest
- Keyword search (artist, song, album, bit rate, etc.)
- Napster server replies with the IP addresses of users with matching files
- Querying users connect directly to file providing user for download
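The centralized lookup can be pictured as a tiny index service. Below is a minimal Python sketch of that flow (all names are hypothetical, not the real Napster protocol): peers upload their file lists, queries hit the central index, and the actual download then happens directly between peers.

class NapsterIndex:
    """Toy central index: maps shared filenames to the peers offering them."""

    def __init__(self):
        self.index = {}  # filename -> set of "ip:port" strings

    def upload_file_list(self, peer_addr, filenames):
        # A peer registers the files it is willing to share.
        for name in filenames:
            self.index.setdefault(name, set()).add(peer_addr)

    def search(self, keyword):
        # Keyword match against filenames; returns matching peers' addresses.
        return {addr
                for name, peers in self.index.items()
                if keyword.lower() in name.lower()
                for addr in peers}

# Usage: user B queries the index, then downloads directly from user A.
server = NapsterIndex()
server.upload_file_list("10.0.0.1:6699", ["artist - song.mp3"])
print(server.search("song"))  # -> {'10.0.0.1:6699'}

Note that the server holds only metadata; the file bytes never pass through it.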
9
Q
Gnutella: 2000 – today
A
- Share any type of files (not just music)
- Decentralized search, unlike Napster
- Ask neighbors for files of interest
- Neighbors ask their neighbors, and so on
- TTL field quenches messages after a number of hops
- Users with matching files reply to you
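A minimal sketch of the TTL‐limited flooding described above (Python; the structure is hypothetical, and real Gnutella deduplicates by message ID rather than by a shared `seen` set):

from dataclasses import dataclass, field

@dataclass
class Node:
    id: int
    files: list = field(default_factory=list)      # filenames shared by this node
    neighbors: list = field(default_factory=list)  # directly connected nodes

def flood_query(node, keyword, ttl, seen=None):
    """Ask `node`, then its neighbors recursively, for matching files."""
    if seen is None:
        seen = set()
    if ttl == 0 or node.id in seen:
        return []                     # TTL quenches the query after some hops
    seen.add(node.id)
    hits = [f for f in node.files if keyword in f]
    for neighbor in node.neighbors:
        hits += flood_query(neighbor, keyword, ttl - 1, seen)
    return hits

# Usage: a, b = Node(1), Node(2, files=["song.mp3"])
# a.neighbors = [b]; b.neighbors = [a]
# flood_query(a, "song", ttl=3)  -> ['song.mp3']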
10
Q
Freenet (freenetproject.org), since 2000
A
- Founder's stated goal: “Providing freedom of speech with strong anonymity protection.”
- Protects anonymity of participants
- Platform for censorship‐resistant communication
- Decentralized, highly survivable, distributed cache (blogs, pages, files, etc.)
- Fully peer‐to‐peer, no dedicated clients or servers
- Only enables access to previously inserted information (it is not a Web proxy)
- Every node contributes a configurable amount of storage
- Not possible for a node to rate another node (except on insert/retrieve capacity)
11
Q
Freenet Anonymity requirement & implications
A
- Anonymity for information upload & download
- The uploading source need not remain on the network after upload
- Files are broken into encrypted blocks that are redundantly stored across the network
- For download, blocks are found and reassembled
- Node requesting a datum does not connect directly to node that has datum
- The datum is routed across intermediaries, none of which knows the request originator or the datum's location
- Higher bandwidth use required, slower transfers
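To make the block idea concrete, here is an illustrative Python sketch (assumptions: a 32 KB block size and a caller‐supplied `encrypt` function; this is not Freenet's actual CHK scheme):

import hashlib

BLOCK_SIZE = 32 * 1024  # assumed block size, for illustration only

def split_into_blocks(data: bytes, encrypt) -> dict:
    """Return {block_key: ciphertext}; `encrypt` is any byte-level cipher."""
    blocks = {}
    for i in range(0, len(data), BLOCK_SIZE):
        ciphertext = encrypt(data[i:i + BLOCK_SIZE])
        key = hashlib.sha256(ciphertext).hexdigest()  # block addressed by hash
        blocks[key] = ciphertext  # each block gets replicated across nodes
    return blocks

# Download reverses this: fetch blocks by key via intermediaries, decrypt,
# and reassemble, which is why transfers cost more bandwidth and are slower.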
12
Q
Freenet Key disadvantage of storage model
A
- No one node is responsible for any block of data
- If data is not retrieved for some time, old data may be dropped when newly arriving data exceeds the available space
- Therefore, Freenet tends to ‘forget’ data that is not retrieved regularly (see the sketch below)
- There is no way to delete data (unless it is “forgotten”)
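The ‘forgetting’ behavior can be modeled as a bounded LRU cache per node, sketched below in Python (illustrative only; Freenet's actual eviction policy is more involved):

from collections import OrderedDict

class DataStore:
    """Bounded per-node store: least-recently-requested blocks get evicted."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()

    def retrieve(self, key):
        if key in self.blocks:
            self.blocks.move_to_end(key)  # a retrieval keeps the block alive
        return self.blocks.get(key)

    def insert(self, key, block):
        self.blocks[key] = block
        self.blocks.move_to_end(key)
        while len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # drop least recently used block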
13
Q
Comparison of file sharing networks
A
- Napster (centralized)
- Bottleneck (scalability, failure, denial of service)
- Correct search results (centralized search)
- Gnutella (distributed)
- No central bottleneck, but large cost due to flooding query
- No guarantee on search results
- Freenet (distributed)
- Anonymity
- Less efficient data transfer
- No guarantee on search result
14
Q
Structured peer‐to‐peer systems
A
- Second generation peer‐to‐peer overlay networks
- Self‐organizing, load balanced, fault‐tolerant
- Guarantees on the number of hops to answer a query
- Based on a (distributed) hash table interface
- Put(Key, Data)
- Get(Key)
- Systems: Chord, CAN, Pastry, etc.
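The interface above can be sketched as follows (Python; the single ‘hash key to node’ rule is a toy stand‐in for Chord/CAN/Pastry routing, which instead locates the responsible node in a bounded number of hops, e.g. O(log N) for Chord):

import hashlib

class ToyDHT:
    """put/get over a fixed node set; hashing picks the responsible node."""

    def __init__(self, num_nodes):
        self.nodes = [{} for _ in range(num_nodes)]  # each dict = one node's storage

    def _node_for(self, key):
        h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
        return self.nodes[h % len(self.nodes)]

    def put(self, key, data):
        self._node_for(key)[key] = data

    def get(self, key):
        return self._node_for(key).get(key)

# Usage:
dht = ToyDHT(num_nodes=8)
dht.put("cats.txt", b"meow")
print(dht.get("cats.txt"))  # -> b'meow'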
15
Q
Distributed hash tables (DHT)
A
- Distributed version of a hash table data structure
- Store and retrieve (key, value)‐pairs
- Key is, e.g., a filename, a hash of the name, or a hash of the content (since the name could change)
- Value is file content
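A small sketch of the key choice: deriving the key from the content rather than the mutable filename keeps the (key, value) binding stable across renames (Python, illustrative):

import hashlib

def content_key(file_bytes: bytes) -> str:
    # Renaming the file does not change this key, unlike a name-based key.
    return hashlib.sha1(file_bytes).hexdigest()

key = content_key(b"...file content...")
# dht.put(key, file_bytes); later, any node can dht.get(key)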