sdi 4 Flashcards
youtube - what are the different api calls we can have?
uploadVideo, searchVideo, streamVideo
uploadVideo api call
uploadVideo(api_dev_key, video_title, video_description, tags[], category_id, default_language,
recording_details, video_contents)
video_contents (stream): Video to be uploaded.
Returns: (string) - A successful upload will return HTTP 202 (request accepted) and once video encoding is completed, user is notified through email w/ a link to access the video.
searchVideo api call
searchVideo(api_dev_key, search_query, user_location, maximum_videos_to_return, page_token)
page_token (string): specifies a page in the result set that should be returned.
Returns: (JSON)
A JSON containing info about the list of video resources matching the search query. Each video resource will have a title, thumbnail, creation date, view count.
streamVideo api call
streamVideo(api_dev_key, video_id, offset, codec, resolution)
offset (number): time in seconds from beginning of video. If we support playing/pausing a video from multiple devices, we need to store the offset on the server.
codec (string) & resolution(string): needed to support play/pause from multiple devices. You’re watching a video on your TV’s
Netflix app, paused it, and started watching it on your phone’s Netflix app - these devices have a different res and use a different codec.
Returns: (STREAM)
A media stream (a video chunk) from the given offset.
youtube high level design
- Processing Queue: Each uploaded video will be pushed to a q for encoding, thumbnail generation, and storage.
- Encoder encodes uploaded videos into multiple formats.
- Thumbnails generator
- Video and Thumbnail storage
- User Database (mysql)
- Video metadata storage (also mysql): title, file path in the system, uploading user, total views, likes, dislikes, video comments.
youtube r/w ratio?
Read-heavy - we’ll focus on building a system that can retrieve videos quickly. 200/1
youtube - where do we store vids?
Videos can be stored in a distributed file storage system like HDFS or
GlusterFS.
youtube - where do we store thumbnails?
- Thumbnails are small files, max 5KB each.
- Read traffic for thumbnails will be huge compared to videos. Users watch 1 video at a time, but they might be looking at a page w/ 20 thumbnails.
If we store all the thumbnails on a disk, we have to perform a lot of seeks to different locations on the disk to read these files – inefficient and will cause low latencies.
Bigtable (key-value, wide-column store, ideal for fast access to very large amounts of data) can be a reasonable choice here as it combines multiple files into 1 block to store on the disk. Keeping hot thumbnails in the cache will also help in improving the latencies and, b/c thumbnails files are small, we can easily cache many.
youtube - How should we efficiently manage read traffic?
Segregate read traffic from write. Since we’ll have multiple copies of each video, we can distribute read traffic on different servers.
For metadata, we can have master-slave configurations where write go to master first and then gets applied at slaves. This can cause some staleness in data - when a
new video is added, its metadata would be inserted in the master first and before it gets applied at the slave our slaves would not be able to see it; stale results would be returned to the user.
This staleness might be acceptable as it would be very short-lived and the user would be able to see the new videos after a few ms.
youtube - sharding necessity
Since we have a huge number of new videos every day and our read load is extremely high, therefore,
we need to distribute our data onto multiple machines so that we can perform read/write operations
efficiently.
youtube - vid duplication
Huge number of users uploading a huge amount of video data - we must deal w/ widespread video duplication. Duplicate videos often differ in aspect ratios or encodings, can contain overlays or additional borders, or can be excerpts from a longer original video. The proliferation of duplicate videos:
1. wastes storage space
2. degrades cache efficiency
3. energy wastage
For the end user, these inefficiencies will be realized in the form of duplicate search results, longer video startup times, and interrupted streaming.
youtube deduplication
For our service, deduplication makes most sense inline, not post-process. will save us the resources that’d be used to encode, transfer, and store the duplicate copy. As soon as any user starts uploading a video, we run video matching algorithms (e.g., Block Matching, Phase
Correlation) to find duplicates. If duplicate detected, we can either stop the upload and use the existing copy OR continue upload and use new copy if it is of higher quality. If the newly uploaded video is a subpart of an existing video or vice versa, we can divide the video into smaller chunks so we only upload the missing chunks.
youtube load balancing - come back to this
youtube CDNs
Our service can move popular videos to CDNs:
* CDNs replicate content in multiple places. There’s a better chance of videos being closer to the
user and, with fewer hops, videos will stream from a friendlier network.
* CDN machines make heavy use of caching and can mostly serve videos out of memory.
Less popular videos (1-20 views per day) that are not cached by CDNs can be served by our servers in various data centers.
youtube fault tolerance
We should use Consistent Hashing for distribution among database servers. Consistent hashing will not only help in replacing a dead server, but also help in distributing load among servers.