Lesson 8 - Content Distribution Flashcards

1
Q

CDN

A

Content Distribution Network

-An internet-wide tool that allows websites and network operators to deliver data quickly and efficiently

2
Q

HTTP

A

Hypertext transfer protocol

  • application layer protocol to transfer web content. It’s the protocol that your web browser uses to request web pages, and it’s also the protocol used when objects/pages/etc. are returned to your browser.
  • Web browser makes requests. Pages/objects on page come back as responses.
  • HTTP is typically layered on top of a byte stream protocol, which is almost always TCP
  • The server maintains no information about past client requests. Thus we say the server is stateless.
3
Q

Contents of an HTTP request

A
  • Request line:
    • Indicates a method for the request
    • GET: return the content associated with a URL
      • Can also be used to send data from the client to the server
    • POST: sends data to the server
    • HEAD: typically returns only the headers of the GET response, but not the content
    • URL:
      • It’s relative (something like index.html)
    • Version number of the HTTP protocol
  • Request also contains additional headers, many of which are optional:
    • Referer (the header name is spelled this way in HTTP): indicates the URL that caused the page to be requested, e.g. if an object is being requested as part of embedded content in another page, the referer might be the page that’s embedding the content.
    • User Agent: client software being used to fetch the page. For example: you might fetch a page using a particular version of Chrome, Firefox, etc. The user agent informs the server which client software is being used.
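A minimal sketch of how the parts listed above fit together on the wire. The helper name `build_request`, the host `example.com`, and the header values are illustrative assumptions; the `Host` header is added because HTTP/1.1 requires it.

```python
def build_request(method, path, host, headers=None, version="HTTP/1.1"):
    """Assemble a raw HTTP request: request line, headers, blank line."""
    # Request line: method, relative URL, protocol version.
    lines = [f"{method} {path} {version}", f"Host: {host}"]
    for name, value in (headers or {}).items():
        lines.append(f"{name}: {value}")
    # An empty line (CRLF CRLF) marks the end of the header section.
    return "\r\n".join(lines) + "\r\n\r\n"

# Hypothetical client and referring page, for illustration only.
request = build_request(
    "GET",
    "/index.html",
    "example.com",
    headers={
        "User-Agent": "demo-client/0.1",
        "Referer": "http://example.com/home.html",
    },
)
print(request)
```

Swapping `"GET"` for `"HEAD"` would ask the server for the same headers without the content.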
4
Q

Which HTTP header indicates client software?

A

User-agent

5
Q

HTTP response contains what parts?

A
  • Status line:
    • HTTP Version (e.g. “HTTP/1.1”)
    • Response Code (e.g. “200 OK”)
  • Other headers:
    • Location
    • Server
    • Allow
    • Content-encoding
    • Content-length
    • Expires
    • Last-modified
    • Other headers, like Set-Cookie
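As a rough illustration of the status line and headers listed above, the sketch below splits the head of a response into its parts. The function name `parse_response_head` and the sample response (including the `demo-server` value) are made up for the example.

```python
def parse_response_head(raw):
    """Split an HTTP response into version, code, reason phrase, and headers."""
    head, _, _body = raw.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    # Status line: HTTP version, numeric response code, reason phrase.
    version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return version, int(code), reason, headers

# Hypothetical response, for illustration only.
raw = (
    "HTTP/1.1 200 OK\r\n"
    "Server: demo-server\r\n"
    "Content-Length: 5\r\n"
    "\r\n"
    "hello"
)
version, code, reason, headers = parse_response_head(raw)
print(version, code, reason, headers)
```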
6
Q

Response code in 100s

A

Informational

7
Q

Response code in 200s

A

Success

8
Q

Response code 200

A

Means “OK”. Very common

9
Q

Response code in 300s

A

Redirection

10
Q

Response code 301

A

Moved permanently

11
Q

Response code in 400s

A

Client errors

12
Q

Response code 404

A

Not Found (the requested page or object does not exist)

13
Q

Response code in 500s

A

Server errors
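
The response-code classes from the cards above can be summarized in a few lines. This is a sketch: `status_class` is a hypothetical helper, while `HTTPStatus` comes from Python's standard `http` module and supplies the standard reason phrases.

```python
from http import HTTPStatus

def status_class(code):
    """Map a numeric status code to its class, using the hundreds digit."""
    classes = {
        1: "Informational",
        2: "Success",
        3: "Redirection",
        4: "Client error",
        5: "Server error",
    }
    return classes.get(code // 100, "Unknown")

for code in (200, 301, 304, 404):
    # HTTPStatus(code).phrase is the standard reason phrase for that code.
    print(code, status_class(code), HTTPStatus(code).phrase)
```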

14
Q

HTTP response header - Location

A

May be used in redirection

15
Q

HTTP response header - Server

A

Indicates server software

16
Q

HTTP response header - Allow

A

Indicates HTTP methods that are allowed, such as GET, HEAD, and so forth

17
Q

HTTP response header - Content-encoding

A

Describes how the content is encoded (for example, if it’s compressed)

18
Q

HTTP response header - Content-length

A

Indicates how long the content is in terms of bytes

19
Q

HTTP response header - Expires

A

Indicates the date and time after which the content is considered stale, and therefore how long it can be cached

20
Q

HTTP response header - Last-modified

A

Indicates the last time the page was modified

21
Q

Early HTTP

A
  • v0.9/1.0
  • Early versions only had 1 request/response for every TCP connection
  • On the plus side, this was simple to implement. The main drawback is that it required a new TCP connection for every request, which introduces a lot of overhead and slows transfers: every request requires a TCP handshake, and TCP must begin in slow start every time a connection opens
    • This is exacerbated by the fact that short transfers are very bad for TCP: the connection is stuck in slow start and never gets a chance to ramp up to steady-state transfer.
    • Also, since the TCP connection is terminated after every request completes, servers end up with many connections held in the TIME_WAIT state until their timers expire, so the server must keep resources reserved even after the connections have completed
  • A solution to increase efficiency and account for many of these drawbacks is to use something called persistent connections
22
Q

Persistent Connections

A

Multiple HTTP requests/responses are multiplexed onto a single TCP connection

  • Delimiters mark the end of each HTTP request
  • Content-length allows the receiver to identify how long a response is
  • So, the server needs to know the size of the transfer in advance.
  • Can be combined with pipelining
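To see why Content-length matters on a persistent connection, here is a minimal sketch that separates two back-to-back responses arriving on one byte stream. The function name `split_responses` and the sample stream are assumptions for illustration; a real client would also handle partial reads and other framing schemes.

```python
def split_responses(stream):
    """Split a byte stream of back-to-back responses into (head, body) pairs."""
    responses = []
    while stream:
        head, sep, rest = stream.partition(b"\r\n\r\n")
        if not sep:
            break  # incomplete head; a real client would wait for more bytes
        length = 0
        for line in head.split(b"\r\n")[1:]:
            name, _, value = line.partition(b":")
            if name.strip().lower() == b"content-length":
                length = int(value.strip())
        # Content-Length tells us exactly where this body ends
        # and where the next response begins.
        responses.append((head, rest[:length]))
        stream = rest[length:]
    return responses

# Two hypothetical responses sharing one TCP connection.
stream = (
    b"HTTP/1.1 200 OK\r\nContent-Length: 5\r\n\r\nhello"
    b"HTTP/1.1 200 OK\r\nContent-Length: 3\r\n\r\nbye"
)
bodies = [body for _, body in split_responses(stream)]
print(bodies)
```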
23
Q

Pipelining

A

The client sends the next request as soon as it encounters a referenced object, so there is as little as 1 RTT before all referenced objects begin to be fetched
-Persistent connections are the default behavior in HTTP/1.1, which also supports pipelining
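
A back-of-the-envelope sketch comparing the three connection styles by counting round trips. It assumes a base page that references `n_objects` objects, one connection at a time, and it ignores transmission time, slow start, and server processing; the function names are illustrative.

```python
def rtts_non_persistent(n_objects):
    # Base page and every object each cost a handshake RTT plus a request RTT.
    return 2 * (1 + n_objects)

def rtts_persistent(n_objects):
    # One handshake, then one RTT per request (base page, then each object in turn).
    return 1 + 1 + n_objects

def rtts_pipelined(n_objects):
    # One handshake, one RTT for the base page, about one RTT for all objects.
    return 1 + 1 + (1 if n_objects else 0)

n = 10
print(rtts_non_persistent(n), rtts_persistent(n), rtts_pipelined(n))
```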

24
Q

Why cache?

A

Improve performance

* We know that TCP throughput is inversely proportional to round trip time. So, the further away the web content is, the slower the webpage will load, both because latency is higher and because throughput is lower.
    * If the client can instead fetch content from a nearby cache, performance can improve drastically.
* Caching can also improve the performance when multiple clients are requesting the same content. Not only do the clients benefit, but the ISP also saves costs on transit, because it doesn’t have to pay to keep transferring the same content over the expensive links. Instead, it can simply serve the content to the clients locally.
25
Q

Where can caching occur?

A
  • Browser (locally on your machine)
  • In network: Sometimes your local ISP may have a web cache
    • Content distribution networks are a special type of web cache that can be used to improve performance
26
Q

How to make sure clients get most recent version of web page?

A
  • caches periodically expire content based on the Expires header.
  • caches can also check with the origin server (“cache checks”) to see whether the original content has been modified
  • If the content hasn’t been modified, the origin server would respond to a cache check request with a “304” (Not Modified) response.
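The expiry-then-check behavior described above might look like the sketch below. The `Origin` and `Cache` classes, the TTL standing in for the Expires header, and the use of the last fetch time as the "if modified since" value are all simplifying assumptions.

```python
from datetime import datetime, timedelta

class Origin:
    def __init__(self, body, last_modified):
        self.body, self.last_modified = body, last_modified

    def get(self, if_modified_since=None):
        # A cache check: answer 304 if nothing changed since the given time.
        if if_modified_since is not None and self.last_modified <= if_modified_since:
            return 304, None
        return 200, self.body

class Cache:
    def __init__(self, origin, ttl):
        self.origin, self.ttl = origin, ttl
        self.body = self.fetched_at = self.expires = None

    def get(self, now):
        if self.body is not None and now < self.expires:
            return "hit", self.body              # still fresh, no origin contact
        code, body = self.origin.get(self.fetched_at)
        if code == 200:
            self.body = body                     # new or changed content
        self.fetched_at, self.expires = now, now + self.ttl
        return ("revalidated" if code == 304 else "miss"), self.body

t0 = datetime(2024, 1, 1)
origin = Origin("v1", t0 - timedelta(days=1))
cache = Cache(origin, ttl=timedelta(seconds=60))
results = [cache.get(t0), cache.get(t0 + timedelta(seconds=30)),
           cache.get(t0 + timedelta(seconds=90))]
origin.body, origin.last_modified = "v2", t0 + timedelta(seconds=100)
results.append(cache.get(t0 + timedelta(seconds=200)))
print(results)
```

The third request shows the 304 case: the cached copy has expired, but the origin confirms it is unchanged, so no content is transferred again.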
27
Q

Response code 304

A

Not Modified

28
Q

How to direct client to a local cache?

A
  • Browser configuration: You can open your browser and specifically configure it to point to a local cache, so that all HTTP requests first are directed through the local cache before the request is forwarded to the origin.
  • Origin server (or service hosting the content) might direct your browser to a cache.
    * This can be done with a special reply to a DNS request. For example, with a DNS request for google.com, the response returns a number of IP addresses. When you ping those IP addresses, you see that they’re only 1 ms away, which indicates that that server is not far away, but is in fact very likely on the local network, probably even the GT campus network
29
Q

What are 2 benefits of caching? (choices: reduced transit cost for local ISP, more up-to-date content, Improved performance for local clients)

A

Reduced transit cost for local ISP, &

Improved performance for local clients

30
Q

What is a CDN?

A

Content Distribution Network

  • Overlay network of web caches that’s designed to deliver content to a client from the optimal location
    • In many cases, optimal means geographically closest, but sometimes, optimal is not the geographically closest cache, and we’ll see examples of when that’s the case.
  • Geographically disparate groups of servers, where each group can serve all the content on the CDN
31
Q

In many cases, there’s a concerted effort to place caches ____

A

As close geographically to users as possible

32
Q

Some CDN owners include:

A

-Content providers (Google)
-Networks (Level 3, LimeLight, AT&T) and ISPs
-Independent operators such as Akamai
Note: Non-network CDNs such as Akamai and Google can place servers in other ASes or ISPs

33
Q

How many unique front-end cache nodes for Google were found in the USC study?

A

Over 30,000
-As of about 2 years ago, the Akamai edge platform reported about 85,000 unique caching servers in nearly 1,000 unique networks around the world in 72 countries

34
Q

Challenges in running a CDN

A
  • The underlying goal is to replicate content on many servers so that it is close to the clients. This leaves many open questions:
    * How to replicate the content?
    * Where should it be replicated?
    * How should clients find the replicated content?
    * How to choose the appropriate server replica/cache for a particular client? This problem is commonly known as server selection
    * How to direct clients toward the appropriate replica once it’s selected? This problem is sometimes called content routing
35
Q

Server selection

A

How to choose the appropriate server replica/cache for a particular client

36
Q

Content routing

A

How to direct clients toward the appropriate replica once it’s selected

37
Q

Fundamental problem with server selection

A

Determining which server to direct the client to. One could do this based on a number of criteria:

  • Least loaded server
  • The one with the lowest latency
    * CDNs typically aim for this since latency plays a hugely significant role in the web performance the client can see
  • Or simply to any “alive” server to help provide fault tolerance
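One way to combine the criteria above, as a sketch: discard dead replicas, prefer the lowest latency, and break ties by load. The replica records, field names, and numbers are hypothetical.

```python
def select_server(replicas):
    """Pick a replica: must be alive, then lowest latency, then lowest load."""
    alive = [r for r in replicas if r["alive"]]
    if not alive:
        raise RuntimeError("no replica is reachable")
    return min(alive, key=lambda r: (r["latency_ms"], r["load"]))

# Hypothetical measurements for three replicas.
replicas = [
    {"name": "atl", "alive": True, "latency_ms": 5, "load": 0.9},
    {"name": "bos", "alive": True, "latency_ms": 5, "load": 0.2},
    {"name": "sfo", "alive": False, "latency_ms": 1, "load": 0.0},
]
chosen = select_server(replicas)
print(chosen["name"])
```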
38
Q

3 ways to do content routing

A
  • Routing system
  • Application-based
  • Naming-based
39
Q

Routing system

A

(type of content routing) E.g. Anycast

  • Give all the replicas the same IP address and then rely on routing to take the client to the closest replica based on the routes that the internet routers choose
  • Simple, but provides ISPs with very little control over which server a client is ultimately directed to, because the redirection is at the whim of internet routing
  • Simple but coarse
40
Q

Application-based

A

(type of content routing) E.g. using an HTTP redirect

  • Can be effective, but requires the client to first go to the origin server to get the redirect, increasing latency
  • Fairly simple but incurs significant delays which operators really care about, as well as users
41
Q

Naming-based

A

(type of content routing) E.g. using DNS

  • Most common method
  • Client looks up a particular domain name, such as google.com, and the response contains an IP address of a nearby cache
  • Provides significant flexibility in directing different clients to different server replicas
  • Provides fine-grained control, and it’s also fast
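A toy sketch of naming-based redirection: the answer to a lookup depends on where the query comes from. The prefixes, cache addresses (all from documentation-reserved ranges), and the `resolve` function are invented for illustration and do not reflect how any particular CDN maps clients.

```python
import ipaddress

# Hypothetical mapping from client-side prefixes to nearby cache addresses.
CACHE_MAP = [
    (ipaddress.ip_network("192.0.2.0/24"), "198.51.100.10"),    # e.g. Atlanta clients
    (ipaddress.ip_network("203.0.113.0/24"), "198.51.100.20"),  # e.g. Boston clients
]
DEFAULT_CACHE = "198.51.100.1"

def resolve(name, resolver_ip):
    """Return a cache IP for `name`, chosen by where the query came from."""
    addr = ipaddress.ip_address(resolver_ip)
    for network, cache_ip in CACHE_MAP:
        if addr in network:
            return cache_ip
    return DEFAULT_CACHE

print(resolve("www.example.com", "192.0.2.53"))
print(resolve("www.example.com", "203.0.113.53"))
```

The same name resolves to different addresses for the two resolvers, which is the flexibility the bullets above describe.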
42
Q

Naming-based redirection example

A

Looked up symantec.com from 2 different locations

  • CNAME tells us to look up the following domain name in Akamai
  • Same lookup from Boston gives us 2 IP addresses that are presumably more local to the Boston area
  • This is how operators use DNS to redirect clients to different caches nearby
43
Q

CDNs and ISPs

A
  • Have a fairly symbiotic relationship
  • CDNs like to peer with ISPs because peering directly with ISPs where a customer is located provides better throughput since there are no intermediate AS hops and network latency is lower
    * Having more vectors to deliver content increases reliability
    * During large request events, having direct connectivity to multiple networks where the content is hosted allows an ISP to spread its traffic across multiple transit links, thereby reducing the 95th percentile and lowering its transit costs
  • ISPs also like to peer with CDNs:
    * Providing content closer to the ISP’s customers allows the ISP to provide customers with good performance for a particular service. For example, you can see that GaTech placed a Google cache node in its network, resulting in very low latencies to Google, and thereby happy customers
    * Providing good performance to popular services is a major selling point for ISPs
    * Another reason to peer with CDNs (AKA host cache nodes locally) is to lower the transit costs
    * E.g. if there was a huge demand for a particular video on Youtube and all of the requests and responses were going over expensive transit links, then the ISP’s costs would be potentially prohibitively high
    * Peering with a CDN would prevent all that traffic from traversing expensive links, thus reducing costs
44
Q

Why do ISPs want to peer with CDNs?

Choices:
Lower transit costs
Better security
Better performance for customers
More predictability
A

Lower transit costs
Better performance for customers

*Note: It may actually REDUCE predictability

45
Q

What is Bit Torrent?

A
  • A peer-to-peer CDN
  • Commonly used for file sharing and distribution of large files

Suppose we have a network with a bunch of clients, all of whom want a particular file, and the file might be particularly big

  • They could all fetch the same file from the source/origin, but:
    * Origin may be overloaded
    * Can create congestion/overload at the network where the content is being hosted
  • Solution is to fetch the content from other peers
    * Take the original file and chop it into many different pieces and replicate different pieces on different peers in the network as soon as possible
    * Each peer is assembling the file, but doing so by picking up different pieces of the file, and it can retrieve the pieces that it doesn’t have from the remaining peers in the network
    * By trading different pieces of the same file, the hope is that every client has assembled the full file by the time the swapping is done
46
Q

Steps in bit torrent publishing

A
  1. Peer creates a “torrent” which contains metadata about a tracker and all of the pieces of the file in question, as well as a checksum for each piece of the file at the time the torrent was created
  2. Some peers need to maintain a complete initial copy of the file. These are called seeders.
  3. To download a file, a client first contacts the tracker which provides this metadata about the file, including a list of seeders which contain an initial copy of the file.
  4. Client starts to download parts of the file from the seeder. Once it has some chunks, which are hopefully different from those held by other clients in the network, it can begin to swap chunks with them.
    * Leecher: a client that holds an incomplete copy of the file.
    * Tracker allows peers to find each other, and returns a random list of peers that any particular leecher can use to swap chunks of the file
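
The per-piece checksums in step 1 can be sketched as follows. The use of SHA-1, the 4-byte piece size, and the function names are assumptions for the example; the point is only that a downloader can verify each piece independently, no matter which peer supplied it.

```python
import hashlib

def make_piece_hashes(data, piece_size):
    """Chop the file into pieces and compute one checksum per piece."""
    pieces = [data[i:i + piece_size] for i in range(0, len(data), piece_size)]
    return [hashlib.sha1(p).hexdigest() for p in pieces]

def verify_piece(index, piece, piece_hashes):
    """Check a downloaded piece against the checksum from the torrent metadata."""
    return hashlib.sha1(piece).hexdigest() == piece_hashes[index]

data = b"abcdefghij"                      # stand-in for a large file
piece_hashes = make_piece_hashes(data, piece_size=4)
print(len(piece_hashes))                  # pieces: b"abcd", b"efgh", b"ij"
print(verify_piece(1, b"efgh", piece_hashes))
print(verify_piece(1, b"XXXX", piece_hashes))
```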

-Previous P2P systems used similar swapping techniques, but a problem that many of them faced, and that Bit Torrent solved, is called Freeloading/Freeriding

47
Q

Freeloading/Freeriding

A

A client might leave the network as soon as it finishes downloading its copy of the file, providing no benefit to other clients who also want the file

48
Q

Solution to Freeriding

A

Choking: a type of game theoretic strategy called tit-for-tat

  • Temporary refusal to upload chunks to another peer that is requesting them. Downloading works as normal
    * If a client can’t download from a particular peer, it simply doesn’t upload to that peer.
    * This ensures that nodes cooperate and eliminates the freerider problem.

For more on the game theory behind this, read about the Repeated Prisoner’s Dilemma, where a TFT (tit for tat) strategy ensures cooperation among mutually distrustful parties.
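
A highly simplified sketch of the tit-for-tat idea: each round, a client uploads only to the peers that uploaded the most to it in the previous round, and chokes everyone else. The function name, the two upload slots, and the byte counts are assumptions; real clients use more elaborate policies.

```python
def choose_unchoked(received_last_round, slots=2):
    """Return the peers to upload to: those that uploaded the most to us."""
    # Peers that gave us nothing are never rewarded with an upload slot.
    contributors = {p: n for p, n in received_last_round.items() if n > 0}
    ranked = sorted(contributors, key=lambda p: (-contributors[p], p))
    return set(ranked[:slots])

# Hypothetical bytes received from each peer in the previous round.
received = {"a": 50, "b": 0, "c": 20, "d": 5}
unchoked = choose_unchoked(received)
print(sorted(unchoked))
```

Peer `b` contributed nothing, so it stays choked however many slots are free.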

49
Q

To ensure all clients have different chunks of file, Bit Torrent uses a policy called ________

A

Rarest Piece First

  • Allows client to determine which pieces are the most rare among clients, and download those pieces first. Ensures that a large variety of pieces are downloaded from the seeder.
  • It’s important to get a complete piece as soon as possible, since a new client has nothing to trade. Rare pieces are typically available at fewer peers, so starting with a rare piece may be slow. One policy that clients use is therefore to select a random first piece of the file and download it from the seeder.
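A minimal sketch of the rarest-piece-first choice, assuming the client already knows which pieces each peer holds. The function name and the sample peer sets are illustrative; ties are broken by piece index.

```python
from collections import Counter

def rarest_missing_piece(have, peer_pieces):
    """Pick the piece we lack that the fewest peers hold."""
    counts = Counter()
    for pieces in peer_pieces.values():
        counts.update(pieces)
    candidates = [p for p in counts if p not in have]
    if not candidates:
        return None
    return min(candidates, key=lambda p: (counts[p], p))

# Hypothetical view of which piece indices each peer holds.
peer_pieces = {"a": {0, 1, 2}, "b": {1, 2}, "c": {1, 3}}
choice = rarest_missing_piece(have={0}, peer_pieces=peer_pieces)
print(choice)
```

Piece 3 is held by only one peer, so it is requested before pieces 1 and 2.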
50
Q

End-game for building file on a client in Bit Torrent

A

Client actively requests any missing pieces from all peers, and redundant requests are cancelled when the missing piece arrives. This ensures that a single peer with a slow transfer rate doesn’t prevent the download from completing.

51
Q

Distributed Hash Table

A

Enables a form of content overlay called a structured overlay

52
Q

Chord

A
  • a scalable, distributed lookup service

- enabled by an underlying mechanism called consistent hashing

53
Q

Desired properties of Chord

A

Scalability
Provable correctness
Reasonably good performance (that’s also fairly easy to reason about)

54
Q

Lookup service

A

any service that maps keys to values. Examples of lookup services on the internet include:

  • DNS
  • Directory services
55
Q

Main motivation of Chord

A

the scalable location of data in a large distributed system
-A publisher might want to publish the location of a particular piece of data, such as an MP4 with a particular name. It needs to publish that location somewhere the client can find it, so that when the client performs a lookup for “Annie Hall”, it’s directed to the right location hosting the data. The key problem we need to solve here is lookup.

56
Q

What function needs to be provided for Chord?

A

Hash table

-Specifically, a distributed hash table (DHT), which is not located in one place

57
Q

What mechanism do we use to create a distributed hash table?

A

Consistent hashing

58
Q

Main idea of consistent hashing

A

keys and nodes map to the same ID space

  • Create a metric space such as a ring
  • Put nodes on this ring, each with some ID
  • Consistent hash function will assign the nodes and the keys an identifier in this space.
  • A hash function such as SHA-1 might be used to assign these IDs
  • This creates IDs that are uniformly distributed in the ID space. Now, how to map the key IDs to the node IDs so that we know which nodes are responsible for resolving the lookups for a particular key?
59
Q

Where is the key stored in Chord

A

At its successor: the first node whose ID is equal to or follows the key’s ID on the ring
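
A small sketch of the successor rule on a ring. It assumes a 16-bit ID space, SHA-1 truncated to that space, and "successor" meaning the first node ID at or after the key ID, wrapping around past the largest ID; the node IDs and names are hypothetical.

```python
import hashlib
from bisect import bisect_left

M = 16  # number of bits in the (assumed) identifier space

def chord_id(name, m=M):
    """Hash a node address or key name into the shared ID space."""
    digest = hashlib.sha1(name.encode()).digest()
    return int.from_bytes(digest, "big") % (2 ** m)

def successor(node_ids, key_id):
    """First node ID at or after key_id; node_ids must be sorted."""
    i = bisect_left(node_ids, key_id)
    return node_ids[i % len(node_ids)]   # wrap around the ring

node_ids = [10, 32, 60, 80, 100]         # hypothetical node IDs, sorted
print(successor(node_ids, 58))           # stored at node 60
print(successor(node_ids, 101))          # wraps around to node 10
print(successor(node_ids, chord_id("Annie Hall")))
```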

60
Q

What properties does consistent hashing offer?

A

Load balance

Flexibility

61
Q

Load balance (in consistent hashing)

A

all nodes receive roughly the same number of keys

62
Q

Flexibility (in consistent hashing)

A

when a node joins or leaves the network, only a fraction of the keys need to be moved to a different location

63
Q

Consistent hashing is provably _______

A

Optimal

-only the minimal number of keys needs to be remapped to maintain load balance when a node joins or leaves the network
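
To illustrate how few keys move when a node joins, the sketch below assigns keys to their successors before and after adding one node. The ring of 128 key IDs, the node IDs, and the helper names are assumptions chosen to keep the example small.

```python
from bisect import bisect_left

def successor(node_ids, key_id):
    """First node ID at or after key_id on the ring; node_ids must be sorted."""
    i = bisect_left(node_ids, key_id)
    return node_ids[i % len(node_ids)]

def assignment(node_ids, key_ids):
    """Map every key ID to the node responsible for it."""
    nodes = sorted(node_ids)
    return {k: successor(nodes, k) for k in key_ids}

keys = range(128)
before = assignment([10, 32, 60, 80, 100], keys)
after = assignment([10, 32, 60, 70, 80, 100], keys)   # node 70 joins

# Only keys between the new node's predecessor (60) and the new node move,
# and they all come from the new node's successor (80).
moved = [k for k in keys if before[k] != after[k]]
print(moved)
```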

64
Q

Implementing consistent hashing

A
  • One option: Every node knows the location of every other node.
  • Another option: Each node only knows the location of its immediate successor in the ring
65
Q

Consistent hashing implementation: every node knows location of every other node

A

Lookups are fast (O(1)), but the routing tables are large: every node needs to know the location of every other node, so each table is O(n), where n is the number of nodes in the network.

66
Q

Consistent hashing: each node only knows the location of its immediate successor

A

Results in a small table of size O(1), but locating the content can require O(n) hops around the ring

67
Q

Finger tables

A

Solution that provides best of both worlds (of the 2 options to implement consistent hashing).

  • Every node knows m other nodes in the ring, and the distance to the nodes that it knows increases exponentially, so each hop can roughly halve the remaining distance to the key
  • When a new node is added, we must update the fingers of this node, and update the fingers of other nodes so that they can point to the node with the new ID
  • Transfer keys from the successor to the new node
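
A sketch of finger tables and a lookup that uses them, assuming a 6-bit ID space and a fixed set of hypothetical node IDs. Finger i of node n is taken to be the successor of n + 2^i (mod 2^m); joins, failures, and key transfer are not modeled.

```python
from bisect import bisect_left

M = 6                                            # ID bits: ring has 2**6 = 64 positions
NODES = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]   # sorted, hypothetical node IDs

def successor(key_id):
    """First node at or after key_id, wrapping around the ring."""
    i = bisect_left(NODES, key_id % 2**M)
    return NODES[i % len(NODES)]

def finger_table(n):
    """Finger i points at the successor of n + 2**i (mod 2**M)."""
    return [successor((n + 2**i) % 2**M) for i in range(M)]

def between(x, a, b, inclusive_right=False):
    """True if x lies on the ring after a and before b (optionally at b)."""
    if inclusive_right and x == b:
        return True
    if a < b:
        return a < x < b
    return x > a or x < b                        # interval wraps past zero

def lookup(start, key_id):
    """Return (owner, path) for key_id, starting the search at node `start`."""
    node, path = start, [start]
    while True:
        fingers = finger_table(node)
        if between(key_id, node, fingers[0], inclusive_right=True):
            return fingers[0], path              # our successor owns the key
        # Jump to the farthest finger that still precedes the key.
        node = next((f for f in reversed(fingers) if between(f, node, key_id)),
                    fingers[0])
        path.append(node)

print(finger_table(8))
owner, path = lookup(8, 54)
print(owner, path)
```

Each jump covers a large share of the remaining distance, so the lookup for key 54 visits three nodes instead of walking past every node between 8 and 56.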