sdi 4 Flashcards

Question 1

Q

youtube - what are the different api calls we can have?

Answer

A

uploadVideo, searchVideo, streamVideo

Question 2

Q

uploadVideo api call

Answer

A

uploadVideo(api_dev_key, video_title, video_description, tags[], category_id, default_language,
recording_details, video_contents)

video_contents (stream): Video to be uploaded.

Returns: (string) - A successful upload will return HTTP 202 (request accepted) and once video encoding is completed, user is notified through email w/ a link to access the video.

Question 3

Q

searchVideo api call

Answer

A

searchVideo(api_dev_key, search_query, user_location, maximum_videos_to_return, page_token)

page_token (string): specifies a page in the result set that should be returned.

Returns: (JSON)
A JSON containing info about the list of video resources matching the search query. Each video resource will have a title, thumbnail, creation date, view count.

Question 4

Q

streamVideo api call

Answer

A

streamVideo(api_dev_key, video_id, offset, codec, resolution)

offset (number): time in seconds from beginning of video. If we support playing/pausing a video from multiple devices, we need to store the offset on the server.

codec (string) & resolution(string): needed to support play/pause from multiple devices. You’re watching a video on your TV’s
Netflix app, paused it, and started watching it on your phone’s Netflix app - these devices have a different res and use a different codec.

Returns: (STREAM)
A media stream (a video chunk) from the given offset.

Question 5

Q

youtube high level design

Answer

A

Processing Queue: Each uploaded video will be pushed to a q for encoding, thumbnail generation, and storage.
Encoder encodes uploaded videos into multiple formats.
Thumbnails generator
Video and Thumbnail storage
User Database (mysql)
Video metadata storage (also mysql): title, file path in the system, uploading user, total views, likes, dislikes, video comments.

Question 6

Q

youtube r/w ratio?

Answer

A

Read-heavy - we’ll focus on building a system that can retrieve videos quickly. 200/1

Question 7

Q

youtube - where do we store vids?

Answer

A

Videos can be stored in a distributed file storage system like HDFS or
GlusterFS.

Question 8

Q

youtube - where do we store thumbnails?

Answer

A

Thumbnails are small files, max 5KB each.
Read traffic for thumbnails will be huge compared to videos. Users watch 1 video at a time, but they might be looking at a page w/ 20 thumbnails.

If we store all the thumbnails on a disk, we have to perform a lot of seeks to different locations on the disk to read these files – inefficient and will cause low latencies.

Bigtable (key-value, wide-column store, ideal for fast access to very large amounts of data) can be a reasonable choice here as it combines multiple files into 1 block to store on the disk. Keeping hot thumbnails in the cache will also help in improving the latencies and, b/c thumbnails files are small, we can easily cache many.

Question 9

Q

youtube - How should we efficiently manage read traffic?

Answer

A

Segregate read traffic from write. Since we’ll have multiple copies of each video, we can distribute read traffic on different servers.
For metadata, we can have master-slave configurations where write go to master first and then gets applied at slaves. This can cause some staleness in data - when a
new video is added, its metadata would be inserted in the master first and before it gets applied at the slave our slaves would not be able to see it; stale results would be returned to the user.
This staleness might be acceptable as it would be very short-lived and the user would be able to see the new videos after a few ms.

Question 10

Q

youtube - sharding necessity

Answer

A

Since we have a huge number of new videos every day and our read load is extremely high, therefore,
we need to distribute our data onto multiple machines so that we can perform read/write operations
efficiently.

Question 11

Q

youtube - vid duplication

Answer

A

Huge number of users uploading a huge amount of video data - we must deal w/ widespread video duplication. Duplicate videos often differ in aspect ratios or encodings, can contain overlays or additional borders, or can be excerpts from a longer original video. The proliferation of duplicate videos:
1. wastes storage space
2. degrades cache efficiency
3. energy wastage

For the end user, these inefficiencies will be realized in the form of duplicate search results, longer video startup times, and interrupted streaming.

Question 12

Q

youtube deduplication

Answer

A

For our service, deduplication makes most sense inline, not post-process. will save us the resources that’d be used to encode, transfer, and store the duplicate copy. As soon as any user starts uploading a video, we run video matching algorithms (e.g., Block Matching, Phase
Correlation) to find duplicates. If duplicate detected, we can either stop the upload and use the existing copy OR continue upload and use new copy if it is of higher quality. If the newly uploaded video is a subpart of an existing video or vice versa, we can divide the video into smaller chunks so we only upload the missing chunks.

Question 13

Q

youtube load balancing - come back to this

Question 14

Q

youtube CDNs

Answer

A

Our service can move popular videos to CDNs:
* CDNs replicate content in multiple places. There’s a better chance of videos being closer to the
user and, with fewer hops, videos will stream from a friendlier network.
* CDN machines make heavy use of caching and can mostly serve videos out of memory.

Less popular videos (1-20 views per day) that are not cached by CDNs can be served by our servers in various data centers.

Question 15

Q

youtube fault tolerance

Answer

A

We should use Consistent Hashing for distribution among database servers. Consistent hashing will not only help in replacing a dead server, but also help in distributing load among servers.

Question 16

Q

typeahead suggestion - high level design

Answer

A

we have a lot of ‘strings’ that we need to store in such a way that users can search with any prefix. we must efficiently store our data such that it can be queried quickly - use trie

Question 17

Q

typeahead suggestion - how to find top suggestion?

Answer

A

store the count of searches that terminated at each node. So, given a prefix, we can traverse the sub-tree under it to find the top suggestions.

Question 18

Q

typeahead suggestion - how can we speed up our searches?

Answer

A

We can store top 10 suggestions at each node. We have to bear the big increase in our storage capacity to achieve the required efficiency.
We can optimize our storage by storing only references of the terminal nodes rather than the
entire phrase. To find the entire phrase, we need to traverse back using the parent reference from the terminal node. We’ll also store the frequency along with each reference.

Question 19

Q

typeahead suggestion - how do we build the trie?

Answer

A

We can efficiently build our trie bottom up. Each parent node will
recursively call their children to calculate their top suggestions and counts. Parents combine top suggestions from children to determine their top suggestions.

Question 20

Q

typeahead suggestion - What could be different ranking criteria for suggestions?

Answer

A

freshness, user location, language, demographics, personal history etc.

Question 21

Q

typeahead suggestion - how to update the trie?

Answer

A

Either we can log every query or do sampling and log every 1000th query. if we don’t want to show a term searched for <1000 times, it’s safe to log every 1000th searched term.
We can have a Map-Reduce (MR) set-up to process all the logging data periodically, say, every hour. Once we have frequencies of all searched terms in the past hour, we can update trie. 2 options:
1. We can make a copy of the trie on each server to update it offline. Once done, we can switch using the copy and discard the old one.
2. We can have a master-slave configuration for each trie server. We can update slave while master is serving traffic. Once the update is complete, we make the slave our
new master.

Question 22

Q

How can we update the frequencies of typeahead suggestions?

Answer

A

We can update only differences
in frequencies rather than recounting all search terms from scratch. Say we’re keeping count of all terms searched in last 10 days. Subtract counts from the time period no longer included and add the counts for new time period. We can add and subtract frequencies based on Exponential Moving Average (EMA) of each term. In EMA, we give more weight to the latest data.

After inserting a new term in the trie, we’ll go to the terminal node of the phrase and increase its
frequency. This new frq is in top 10; go up the tree and make updates at each parent as necessary.

Question 23

Q

How to store trie in a file so that we can rebuild our trie easily - this will be needed when a
machine restarts?

Answer

A

We can take a snapshot of our trie periodically and store it in a file. This will enable us to rebuild a trie if the server goes down. Start with the root node and save the trie level-by-level. Each node is represented by char + num of children.

Not storing top suggestions and top counts – it is hard to store
this information; since our trie is being stored top down, we don’t have child nodes created before the parent. We have to recalculate all the top terms with counts while we are building the trie.

Question 24

Q

typeahead suggestion - scale estimation

Answer

A

Since there will be a lot of duplicates in 5 billion queries, we can assume that only 20% of these will be unique. we can index top 50% search terms.
Let’s assume we will have 100 million unique terms for which we want to build an index. On avg, each query is 3 words, 5 chars long each.
we should also be removing some terms that aren’t searched anymore. assume we have 2% new queries every day and we maintain our index for the last 1 year.

Question 25

Q

typeahead suggestion - data sharding

Answer

A

Each term will be passed to a hash function, which will
generate a server number and we will store the term on that server. This will make our term distribution
random and hence minimize hotspots. To find typeahead suggestions for a term we have to ask all the
servers and then aggregate the results.

Question 26

Q

High level design for Rate Limiter

Answer

A

Rate Limiter will be responsible for deciding which request will be served by the API servers and
which request will be declined. Once a new request arrives, the Web Server first asks the Rate Limiter
to decide if it will be served or throttled. If the request is not throttled, then it’ll be passed to the API
servers.

Question 27

Q

two types of algorithms used for Rate Limiting

Answer

A

can’t used fixed window b/c Imagine if user sends 3 requests at the last second of a min, then she can send 3 more requests at the first second of the next min = 6 requests in 2 seconds.
Sliding Window takes a lot of memory compared to Fixed – scalability issue.

Question 28

Q

rate limiter algo design

Answer

A

combine fixed and sliding window algos -> sliding window w/ counts.

if we have an hourly rate limit we can keep a count for each min and calculate the sum of all counters in the past hour when we receive a new request.
We can store our counters in a Redis Hash since it offers incredibly efficient storage for fewer than 100 keys. When each request increments a counter in the hash, it also sets the hash to expire some specified time later.

Question 29

Q

rate limiter - caching

Answer

A

cache recent active users.
Write-back cache: update all counters and timestamps in cache
only. Writing to permanent storage can be done at fixed intervals. This ensures that the rate limiter adds
minimum latency to the user’s requests.
LRU

Question 30

Q

rate limiter - data sharding

Answer

A

We can shard based on the ‘UserID’. For fault tolerance and replication we should use Consistent Hashing.
If we want different throttling limits for different APIs, we can
choose to shard per user per API. If we do this, a practical consideration could be to have a separate (somewhat smaller) rate limiter for each API shard as well.

Question 31

Q

Should we rate limit by IP or by user?

Answer

A

by IP:
- when multiple users share a single public IP like in an internet cafe, one bad user can cause throttling to others
- it’s trivial to make a cache server run out of memory tracking IPv6 addresses

by user:
- hacker can perform a denial of service attack against a user by entering wrong credentials up to the limit; after that the actual user can’t login

best to do both, though this will result in more cache entries w/ more details per entry, requiring more memory and storag