Lesson 8 - Content Distribution Flashcards

Question

Where can caching occur?

Answer 1

- Browser (locally on your machine)
- In the network: sometimes your local ISP may have a web cache
- Content distribution networks are a special type of web cache that can be used to improve performance

Answer 2

- Caches periodically expire content based on the Expires header.
- Caches can also check with the origin server (“cache checks”) to see whether the original content has been modified.
- If the content hasn’t been modified, the origin server responds to a cache check request with a “304” (Not Modified) response.
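
The expiry and cache-check behavior can be sketched as a small simulation. This is a minimal sketch with no real network traffic: `CachedEntry`, `origin_check`, `serve`, and the 60-second `ttl` are hypothetical names and values chosen for illustration, and timestamps are plain numbers.

```python
from dataclasses import dataclass


@dataclass
class CachedEntry:
    body: str
    expires_at: float     # absolute time derived from the Expires header
    last_modified: float  # modification time the origin reported for this copy


def origin_check(entry_last_modified, origin_last_modified, origin_body):
    """Hypothetical origin: answer a cache check with 304 or a fresh 200."""
    if origin_last_modified <= entry_last_modified:
        return 304, None  # Not Modified: the cache may keep its copy
    return 200, origin_body


def serve(entry, now, origin_last_modified, origin_body, ttl=60.0):
    """Return (body, source), where source says how the request was satisfied."""
    if now < entry.expires_at:
        return entry.body, "cache (fresh)"
    # The entry has expired, so ask the origin whether the content changed.
    status, body = origin_check(entry.last_modified, origin_last_modified, origin_body)
    if status == 304:
        entry.expires_at = now + ttl
        return entry.body, "cache (revalidated, 304)"
    entry.body = body
    entry.last_modified = origin_last_modified
    entry.expires_at = now + ttl
    return entry.body, "origin (200)"
```

Only the third case transfers the full content again; a 304 response lets the cache keep serving its stored copy.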

Answer 3

Not Modified

Answer 4

- Browser configuration: You can explicitly configure your browser to point to a local cache, so that all HTTP requests are first directed through the local cache before being forwarded to the origin.
- The origin server (or the service hosting the content) might direct your browser to a cache.
  * This can be done with a special reply to a DNS request. For example, a DNS request for google.com returns a number of IP addresses. When you ping those IP addresses, you see that they’re only 1 ms away, which indicates that the server is not far away and is very likely on the local network, probably even the GT campus network.

Answer 5

- Reduced transit cost for the local ISP
- Improved performance for local clients

Answer 6

Content Distribution Network
- Overlay network of web caches that’s designed to deliver content to a client from the optimal location
- In many cases, optimal means geographically closest, but sometimes the optimal cache is not the geographically closest one, and we’ll see examples of when that’s the case.
- Geographically disparate groups of servers, where each group can serve all the content on the CDN

Answer 7

As close geographically to users as possible

Answer 8

- Content providers (Google)
- Networks (Level 3, LimeLight, AT&T) and ISPs
- Independent operators such as Akamai
Note: Non-network CDNs such as Akamai and Google can place servers in other ASes or ISPs.

Answer 9

Over 30,000
- As of about 2 years ago, the Akamai edge platform reported about 85,000 unique caching servers in nearly 1,000 unique networks around the world in 72 countries

Answer 10

The underlying goal is to replicate content on many servers so that the content is replicated close to the clients. This leaves many open questions:
* How to replicate the content
* Where it should be replicated
* How clients should find the replicated content
* How to choose the appropriate server replica/cache for a particular client
  * This problem is commonly known as server selection
* How to direct clients toward the appropriate replica once it’s selected
  * This problem is sometimes called content routing

Answer 11

How to choose the appropriate server replica/cache for a particular client

Answer 12

How to direct clients toward the appropriate replica once it’s selected

Answer 13

Determining which server to direct the client to. One could do this based on a number of criteria:
- The least loaded server
- The one with the lowest latency
  * CDNs typically aim for this, since latency plays a hugely significant role in the web performance the client sees
- Or simply any “alive” server, to help provide fault tolerance
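
The three criteria can be sketched as interchangeable policies over a list of replicas. This is a minimal sketch: the `Replica` fields and the policy names `"load"`, `"latency"`, and `"any"` are assumptions made for illustration, not something a particular CDN exposes.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    alive: bool
    load: float        # fraction of capacity in use (hypothetical metric)
    latency_ms: float  # measured latency from the client to this replica


def select_replica(replicas, policy="latency"):
    """Pick a replica among the live ones according to the given policy."""
    candidates = [r for r in replicas if r.alive]
    if not candidates:
        raise RuntimeError("no live replica available")
    if policy == "load":
        return min(candidates, key=lambda r: r.load)
    if policy == "latency":
        return min(candidates, key=lambda r: r.latency_ms)
    if policy == "any":
        return candidates[0]  # any live server is enough for fault tolerance
    raise ValueError(f"unknown policy: {policy}")
```

Dead replicas are filtered out before any policy applies, so a replica with excellent latency is never chosen if it is not alive.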

Answer 14

- Routing system
- Application-based
- Naming-based

Answer 15

(type of content routing) E.g. Anycast
- Number all the replicas with the same IP address and then rely on routing to take the client to the closest replica based on the routes that the internet routers choose
- Simple, but provides ISPs with very little control over which servers clients ultimately get redirected to, because the redirection is at the whim of internet routing
- Simple but coarse

Answer 16

(type of content routing) E.g. using an HTTP redirect
- Can be effective, but requires the client to first go to the origin server to get the redirect, increasing latency
- Fairly simple, but incurs significant delays, which both operators and users care about

Answer 17

(type of content routing) E.g. using DNS
- Most common method
- Client looks up a particular domain name, such as google.com, and the response contains an IP address of a nearby cache
- Provides significant flexibility in directing different clients to different server replicas
- Provides fine-grained control, and it’s also fast
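
Naming-based redirection can be sketched as a lookup table that returns different addresses depending on where the query comes from. This is a minimal sketch, not a real DNS server: the `ZONE` table, the region labels, and the `resolve` function are hypothetical, and the addresses come from the IP ranges reserved for documentation.

```python
# Hypothetical mapping from a name to the caches that serve each client region.
ZONE = {
    "www.example.com": {
        "atlanta": ["192.0.2.10", "192.0.2.11"],
        "boston": ["198.51.100.20", "198.51.100.21"],
    },
}


def resolve(name, client_region, default_region="atlanta"):
    """Return the cache addresses to hand to a client querying from a region."""
    by_region = ZONE[name]
    # Fall back to a default set of caches when the region is unknown.
    return by_region.get(client_region, by_region[default_region])
```

The same name resolves to different caches for different clients, which is where the fine-grained control comes from.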

Answer 18

Looked up symantec.com from 2 different locations
- The CNAME tells us to look up the following domain name in Akamai
- The same lookup from Boston gives us 2 IP addresses that are presumably more local to the Boston area
- This is how operators use DNS to redirect clients to different nearby caches

Answer 19

- CDNs and ISPs have a fairly symbiotic relationship
- CDNs like to peer with ISPs because peering directly with the ISP where a customer is located provides better throughput, since there are no intermediate AS hops and network latency is lower
  * Having more vectors to deliver content increases reliability
  * During large request events, having direct connectivity to multiple networks where the content is hosted allows an ISP to spread its traffic across multiple transit links, thereby reducing the 95th percentile and lowering its transit costs
- ISPs also like to peer with CDNs:
  * Providing content closer to the ISP’s customers allows the ISP to provide customers with good performance for a particular service. For example, GaTech placed a Google cache node in its network, resulting in very low latencies to Google, and thereby happy customers
  * Providing good performance to popular services is a major selling point for ISPs
  * Another reason to peer with CDNs (AKA host cache nodes locally) is to lower transit costs
  * E.g. if there were a huge demand for a particular video on YouTube and all of the requests and responses went over expensive transit links, the ISP’s costs could be prohibitively high
  * Peering with a CDN keeps that traffic off expensive links, thus reducing costs

Answer 20

- Lower transit costs
- Better performance for customers
Note: It may actually REDUCE predictability

Answer 21

- A peer-to-peer CDN
- Commonly used for file sharing and distribution of large files
Suppose we have a network with a bunch of clients, all of whom want a particular file, and the file might be particularly big
- They could all fetch the same file from the source/origin, but:
  * The origin may be overloaded
  * It can create congestion/overload at the network where the content is being hosted
- The solution is to fetch the content from other peers
  * Take the original file, chop it into many different pieces, and replicate different pieces on different peers in the network as soon as possible
  * Each peer assembles the file by picking up different pieces of it, and it can retrieve the pieces that it doesn’t have from the remaining peers in the network
  * By trading different pieces of the same file, everyone eventually gets the full file. The hope is that each client can assemble the full file by the time the clients have finished swapping

Answer 22

1. A peer creates a “torrent”, which contains metadata about a tracker and all of the pieces of the file in question, as well as a checksum for each piece of the file at the time the torrent was created
2. Some peers need to maintain a complete initial copy of the file. These are called seeders.
3. To download a file, a client first contacts the tracker, which provides this metadata about the file, including a list of seeders that hold an initial copy of the file.
4. The client starts to download parts of the file from the seeder. Once it has some chunks, which are hopefully different from those held by other clients in the network, the clients can begin to swap chunks.
  * Leecher: a client that holds an incomplete copy of the file.
  * The tracker allows peers to find each other, and returns a random list of peers that any particular leecher can use to swap chunks of the file
- Previous P2P systems used similar swapping techniques, but a problem that many of them faced, and that BitTorrent solved, is called Freeloading/Freeriding
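
Step 1 can be sketched as splitting a file into fixed-size pieces and recording one checksum per piece. This is a minimal sketch: the dictionary layout, the function names, and the choice of SHA-1 for the checksum are assumptions for illustration, and the result is not the real on-disk torrent file format.

```python
import hashlib


def make_torrent(data: bytes, piece_size: int, tracker_url: str) -> dict:
    """Build hypothetical torrent metadata: tracker plus one checksum per piece."""
    pieces = [data[i:i + piece_size] for i in range(0, len(data), piece_size)]
    return {
        "tracker": tracker_url,
        "piece_size": piece_size,
        "checksums": [hashlib.sha1(p).hexdigest() for p in pieces],
    }


def verify_piece(torrent: dict, index: int, piece: bytes) -> bool:
    """Check a downloaded piece against the checksum recorded at creation time."""
    return hashlib.sha1(piece).hexdigest() == torrent["checksums"][index]
```

Because every piece has its own checksum, a client can validate each piece as it arrives from any peer, without waiting for the whole file.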

Answer 23

A client might leave the network as soon as it finishes downloading a copy of the file, providing no benefit to other clients who also want the file

Answer 24

Choking: a type of game-theoretic strategy called tit-for-tat
- Temporary refusal to upload chunks to another peer that is requesting them. Downloading works as normal.
  * If a client can’t download from a peer, it simply doesn’t upload to that peer.
  * This ensures that nodes cooperate and eliminates the freerider problem. For more on the game theory behind this, read about the Repeated Prisoner’s Dilemma, where a TFT (tit-for-tat) strategy ensures cooperation among mutually distrustful parties.
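
The tit-for-tat idea can be sketched as an unchoke decision based on how much each peer has recently given us. This is a minimal sketch: `choose_unchoked`, the `download_rates` input, and the fixed number of `slots` are hypothetical, and refinements used by real clients, such as occasionally unchoking a random peer, are left out.

```python
def choose_unchoked(download_rates: dict, slots: int) -> set:
    """Return the peers we will upload to; everyone else stays choked.

    download_rates maps a peer name to the rate at which we recently
    downloaded from that peer.
    """
    # Peers that gave us nothing are never unchoked (no freeriding).
    contributors = {p: r for p, r in download_rates.items() if r > 0}
    # Prefer the peers that gave us the most; break ties by name.
    ranked = sorted(contributors, key=lambda p: (-contributors[p], p))
    return set(ranked[:slots])
```

A peer that uploads nothing stays choked no matter how many slots are free, which is the incentive to contribute.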

Answer 25

Rarest Piece First
- Allows a client to determine which pieces are the rarest among clients, and download those pieces first. Ensures that a large variety of pieces are downloaded from the seeder.
- A client that has nothing to trade needs a complete piece as soon as possible. Rare pieces are typically available at fewer peers initially, so downloading a rare piece first may not be a good idea. One policy that clients use is to select a random first piece of the file and download it from the seeder.
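
Rarest-piece-first selection can be sketched by counting how many peers advertise each piece. This is a minimal sketch: `rarest_first` and its inputs are hypothetical names, ties are broken by the lower piece index, and the random-first-piece policy for a client with nothing to trade is not modeled.

```python
from collections import Counter


def rarest_first(have: set, peer_pieces: dict):
    """Return the index of the rarest piece we still need, or None.

    peer_pieces maps a peer name to the set of piece indices it holds.
    """
    counts = Counter(i for pieces in peer_pieces.values() for i in pieces)
    wanted = [i for i in counts if i not in have]
    if not wanted:
        return None
    # Fewest holders first; the lower index wins a tie.
    return min(wanted, key=lambda i: (counts[i], i))
```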

Answer 26

The client actively requests any missing pieces from all peers, and redundant requests are cancelled when the missing piece arrives. This ensures that a single peer with a slow transfer rate doesn’t prevent the download from completing.

Answer 27

Enable a form of content overlay called a structured overlay

Answer 28

- A scalable, distributed lookup service
- Enabled by an underlying mechanism called consistent hashing

Answer 29

- Scalability
- Provable correctness
- Reasonably good performance (that's also fairly easy to reason about)

Answer 30

Any service that maps keys to values. Examples of lookup services on the internet include:
- DNS
- Directory services

Answer 31

The scalable location of data in a large distributed system
- A publisher might want to publish the location of a particular piece of data, such as an MP4, under a particular name. It needs to publish that location somewhere the client can find it, so that when the client performs a lookup for “Annie Hall”, it’s directed to the right location hosting the data. The key problem we need to solve here is lookup.

Answer 32

Hash table
- Specifically, a distributed hash table (DHT), which is not located in one place

Answer 33

Consistent hashing

Answer 34

Keys and nodes map to the same ID space
- Create a metric space such as a ring
- Put nodes on this ring, each with some ID
- A consistent hash function assigns the nodes and the keys an identifier in this space.
- A hash function such as SHA-1 might be used to assign these IDs
- This creates IDs that are uniformly distributed in the ID space.
Now, how do we map the key IDs to the node IDs so that we know which nodes are responsible for resolving the lookups for a particular key?

Answer 35

at its successor, which is the node with the next highest ID
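
Placing a key at its successor can be sketched with a sorted list of node IDs. This is a minimal sketch: the 16-bit ID space (`ID_BITS`) and the function names are assumptions for illustration, and "next highest" is taken to mean the first node whose ID is greater than or equal to the key's ID, wrapping around the ring.

```python
import bisect
import hashlib

ID_BITS = 16  # hypothetical, deliberately small ID space


def ident(name: str) -> int:
    """Hash a node name or key name into the shared ID space."""
    digest = hashlib.sha1(name.encode()).digest()
    return int.from_bytes(digest, "big") % (2 ** ID_BITS)


def successor(node_ids, key_id):
    """Return the node responsible for key_id; node_ids must be sorted."""
    i = bisect.bisect_left(node_ids, key_id)
    # Past the largest node ID, wrap around to the smallest one.
    return node_ids[i % len(node_ids)]
```

For example, with nodes at 10, 80, and 200, a key with ID 50 is stored at node 80, and a key with ID 201 wraps around to node 10.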

Answer 36

- Load balance
- Flexibility

Answer 37

all nodes receive roughly the same number of keys

Answer 38

when a node joins or leaves the network, only a fraction of the keys need to be moved to a different location
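
The effect of a join can be sketched by comparing key assignments before and after a node is added. This is a minimal sketch with hand-picked integer IDs rather than hashed ones; `assignment` and `moved_keys` are hypothetical helper names.

```python
import bisect


def successor(node_ids, key_id):
    """First node with ID >= key_id, wrapping around; node_ids must be sorted."""
    i = bisect.bisect_left(node_ids, key_id)
    return node_ids[i % len(node_ids)]


def assignment(node_ids, key_ids):
    """Map every key ID to the node responsible for it."""
    nodes = sorted(node_ids)
    return {k: successor(nodes, k) for k in key_ids}


def moved_keys(before, after):
    """Keys whose responsible node differs between two assignments."""
    return sorted(k for k in before if before[k] != after[k])


keys = [5, 20, 60, 90, 150, 250]
before = assignment([10, 80, 200], keys)
after = assignment([10, 80, 120, 200], keys)  # node 120 joins
print(moved_keys(before, after))  # → [90]
```

Only the key that falls between node 80 and the new node 120 moves, and it moves from the new node's successor (200) to the new node.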

Answer 39

Optimal
- The minimal number of keys need to be remapped to maintain load balance when a node joins or leaves the network

Answer 40

- One option: Every node knows the location of every other node.
- Another option: Each node only knows the location of its immediate successor in the ring

Answer 41

Lookups are fast (O(1)), but the routing tables are large: because every node needs to know the location of every other node in the network, each routing table is O(n), where n is the number of nodes in the network.

Answer 42

Results in a small table of size O(1), but locating the content requires O(n) hops, since a query may have to walk the ring one successor at a time
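
The linear cost can be sketched by walking the ring with successor pointers only and counting hops. This is a minimal sketch with hand-picked integer IDs; `between` and `lookup` are hypothetical names, and each node is assumed to be responsible for the keys in the interval from just after its predecessor up to and including its own ID.

```python
def between(x, a, b):
    """True if x lies in the ring interval (a, b], allowing wrap-around."""
    if a < b:
        return a < x <= b
    return x > a or x <= b


def lookup(node_ids, start, key_id):
    """Follow successor pointers from start; return (responsible node, hops)."""
    nodes = sorted(node_ids)
    # Each node knows only its immediate successor: O(1) state per node.
    succ = {n: nodes[(i + 1) % len(nodes)] for i, n in enumerate(nodes)}
    current, hops = start, 0
    while not between(key_id, current, succ[current]):
        current = succ[current]
        hops += 1
    return succ[current], hops
```

In the worst case the query is forwarded through almost every node on the ring, so the number of hops grows linearly with the number of nodes.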

Answer 43

A solution that provides the best of both worlds (of the 2 options to implement consistent hashing).
- Every node knows m other nodes in the ring, and the distance to the nodes that it knows increases exponentially
- When a new node is added, we must set up the fingers of this node and update the fingers of other nodes so that they can point to the new node
- Transfer keys from the successor to the new node
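
The exponentially spaced pointers can be sketched as a table with m entries per node. This is a minimal sketch assuming the Chord-style rule that entry i of node n points to the successor of (n + 2^i) mod 2^m; the function names and the tiny 3-bit ring in the example are chosen for illustration.

```python
import bisect


def successor(nodes, key_id):
    """First node with ID >= key_id, wrapping around; nodes must be sorted."""
    i = bisect.bisect_left(nodes, key_id)
    return nodes[i % len(nodes)]


def finger_table(node_id, node_ids, m):
    """Return the m fingers of node_id in an ID space of size 2**m."""
    nodes = sorted(node_ids)
    # Finger i targets a point 2**i past this node, so distances double.
    return [successor(nodes, (node_id + 2 ** i) % (2 ** m)) for i in range(m)]


print(finger_table(0, [0, 1, 3], m=3))  # → [1, 3, 0]
```

Each node stores only m entries, yet every hop can cover a large share of the remaining distance to the key, which is why this sits between the two extremes of knowing every node and knowing only the successor.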


(67 cards)