Session 5 Flashcards
You decide to build a simple web analytics application to better understand the behavior of your users.
Requirements:
R1 - The application should track the number of views for any video you tell it to track
-> The application’s web server gets a message every time a tracked video is watched
R2 - Additionally, the application should be able to tell you at any point what the top 100 videos are by number of views
Desired Properties of a Big Data System
- Robustness and fault tolerance
- Low latency
- Minimal maintenance
- Ad hoc queries
Batch Layer
- Manages the master dataset – an immutable, append-only set of raw data
- Pre-computes arbitrary query functions – called batch views
- Runs in a loop and continuously recomputes the batch views from scratch
- Very simple to use and understand
- Scales by adding new machines.
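A minimal sketch of a batch view recomputation for R1/R2 (the event fields, function names, and sample data below are illustrative, not from the source):

```python
# Minimal batch-layer sketch: the master dataset is assumed to be a list of raw
# "view" events like {"video_id": ..., "ts": ...}. The batch view
# (video_id -> total views) is recomputed from scratch on every pass.
from collections import Counter
from typing import Iterable

def compute_batch_view(master_dataset: Iterable[dict]) -> Counter:
    """Recompute the view-count batch view from the full, immutable dataset."""
    view_counts = Counter()
    for event in master_dataset:
        view_counts[event["video_id"]] += 1
    return view_counts

# In a real system this runs in a loop; here we do a single pass over toy data.
master_dataset = [
    {"video_id": "v1", "ts": 1}, {"video_id": "v2", "ts": 2}, {"video_id": "v1", "ts": 3},
]
batch_view = compute_batch_view(master_dataset)
print(batch_view.most_common(100))  # top 100 videos by views (requirement R2)
```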
Speed Layer
- Accommodates all requests that are subject to low latency requirements
- Its goal is to ensure new data is represented in query functions as quickly as needed for the application requirements
- Similar to the batch layer in that it produces views based on data it receives
-> One big difference is that the speed layer only looks at recent data, whereas the batch layer looks at all the data at once
- Does incremental computation instead of the recomputation done in the batch layer
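A matching sketch of the speed layer's incremental update, using the same illustrative event shape as above:

```python
# Minimal speed-layer sketch: instead of recomputing from scratch, the realtime
# view is updated incrementally as each new event arrives, covering only the
# recent data not yet absorbed by the batch layer.
from collections import Counter

realtime_view = Counter()

def handle_new_event(event: dict) -> None:
    """Fold one freshly received view event into the realtime view."""
    realtime_view[event["video_id"]] += 1

handle_new_event({"video_id": "v3", "ts": 4})
handle_new_event({"video_id": "v1", "ts": 5})
print(realtime_view)  # new data is reflected immediately, not after the next batch run
```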
Serving Layer
- Indexes batch views so that they can be queried with low latency
- The serving layer is a specialized distributed database that loads in a batch view and makes it possible to do random reads on it
- When new batch views are available, the serving layer automatically swaps those in so that more up-to-date results are available
- It does not need to support random writes to individual records
-> This is a very important point, as random writes cause most of the complexity in databases
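A toy sketch of a serving layer that bulk-loads batch views and merges them with the speed layer's realtime view at query time (in-memory dicts stand in for the distributed database; all names and sample data are illustrative):

```python
# Toy serving-layer sketch: new batch views are swapped in wholesale (bulk load,
# no per-record updates), and queries do random reads over the loaded view,
# merged with the speed layer's realtime view.
from collections import Counter

class ServingLayer:
    def __init__(self):
        self._batch_view = Counter()

    def swap_in(self, new_batch_view: Counter) -> None:
        """Replace the whole batch view in one step; no random writes needed."""
        self._batch_view = new_batch_view

    def top_videos(self, realtime_view: Counter, n: int = 100):
        """Merge the indexed batch view with recent data and return the top n."""
        merged = self._batch_view + realtime_view
        return merged.most_common(n)

batch_view = Counter({"v1": 2, "v2": 1})      # produced by the batch layer
realtime_view = Counter({"v3": 1, "v1": 1})   # produced by the speed layer
serving = ServingLayer()
serving.swap_in(batch_view)
print(serving.top_videos(realtime_view, n=100))
```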
The Hadoop Distributed File System (HDFS) is the open-source alternative to the Google File System
- Commodity hardware
- Tolerant to failure
(batch layer)
How do distributed file systems work?
- All files are broken into blocks (usually 64 to 256 MB)
- These blocks are replicated (typically 3 copies) among the HDFS servers (datanodes)
- The namenode provides a lookup service for clients accessing the data and ensures the nodes are correctly replicated across the cluster
- Broken into blocks
- Blocks are replicated
- Lookup service
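A toy model of block splitting, replication, and namenode lookup (this is not the real HDFS API; block size, replication factor, and node names are illustrative):

```python
# Toy HDFS-style placement: files are split into fixed-size blocks, each block
# is replicated to several datanodes, and a "namenode" table lets clients look
# up which nodes hold each block.
import itertools

BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB blocks (HDFS typically uses 64-256 MB)
REPLICATION = 3                 # typical replication factor

datanodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]
node_cycle = itertools.cycle(datanodes)
namenode = {}                   # (filename, block index) -> list of datanodes

def put_file(filename: str, size_bytes: int) -> None:
    """Register a file: split it into blocks and assign replicas round-robin."""
    n_blocks = -(-size_bytes // BLOCK_SIZE)   # ceiling division
    for i in range(n_blocks):
        namenode[(filename, i)] = [next(node_cycle) for _ in range(REPLICATION)]

put_file("views.log", 200 * 1024 * 1024)      # a 200 MB file -> 4 blocks
print(namenode[("views.log", 0)])             # lookup: which datanodes hold block 0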
Hadoop MapReduce
- Hadoop MapReduce is an open-source implementation of MapReduce, a distributed computing paradigm originally pioneered by Google
- Used to process data in the batch layer
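A single-process sketch of the MapReduce flow applied to view counting (not Hadoop's actual API; map_fn, reduce_fn, and the sample log are illustrative):

```python
# Sketch of the MapReduce flow: map emits (key, value) pairs, the framework
# groups them by key (shuffle), and reduce folds each group into a result.
# Here the job counts views per video.
from collections import defaultdict

def map_fn(line: str):
    video_id = line.strip()
    yield (video_id, 1)                 # one pair per recorded view

def reduce_fn(video_id: str, counts):
    return (video_id, sum(counts))      # total views for this video

raw_log = ["v1", "v2", "v1", "v3", "v1"]   # one video id per log line

# Map phase
pairs = [pair for line in raw_log for pair in map_fn(line)]
# Shuffle phase: group emitted values by key
groups = defaultdict(list)
for key, value in pairs:
    groups[key].append(value)
# Reduce phase
print([reduce_fn(k, v) for k, v in groups.items()])   # [('v1', 3), ('v2', 1), ('v3', 1)]
```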
The Split-Apply-Combine Approach
Each “Apply” operation can be performed independently of other “Apply” operations
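A sketch of split-apply-combine on a toy dataset; the thread pool is only there to make the point that each apply runs independently and could be sent to a different machine (all names and data are illustrative):

```python
# Split-apply-combine sketch: split records into groups, apply a function to
# each group independently (no shared state between applies), then combine the
# per-group results into one output.
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

records = [("v1", 12.0), ("v2", 3.5), ("v1", 8.0), ("v2", 1.5)]  # (video_id, minutes watched)

# Split: partition the records by video id
groups = defaultdict(list)
for video_id, minutes in records:
    groups[video_id].append(minutes)

# Apply: each group is processed on its own, here concurrently
def total_minutes(minutes_list):
    return sum(minutes_list)

with ThreadPoolExecutor() as pool:
    results = pool.map(total_minutes, groups.values())

# Combine: stitch the independent results back together
print(dict(zip(groups.keys(), results)))   # {'v1': 20.0, 'v2': 5.0}
```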
Big Data infrastructure as a service
Elastic clouds
- Elastic clouds allow you to rent hardware on demand rather than own your own hardware in your own location.
- Elastic clouds let you increase or decrease the size of your cluster nearly instantaneously, so if you have a big job you want to run, you can allocate the hardware temporarily.
- Elastic clouds dramatically simplify system administration. They also provide additional storage and hardware allocation options that can significantly drive down the price of your infrastructure.
- Rent hardware on demand
- Increase or decrease the size of your cluster nearly instantaneously
- Simplify system administration
Examples of elastic cloud providers
- Microsoft Azure
- Amazon Web Services
- Digital Ocean
Many machine learning tools can be used on top of this infrastructure (elastic clouds)
- Amazon ML
- Microsoft Azure ML
- H2O (from 0xdata)
All of these tools implement state-of-the-art ML algorithms out of the box. Some of them are also extensible, i.e., you can implement your own algorithms
A Client / Server Model
1 Clients connect to a server over the Internet
2 Clients send a request
3 The server issues a response
4 Clients display the response
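A minimal illustration of this request/response cycle using Python's standard library (the port, path, and in-memory counter are illustrative; a real analytics server would need persistence and scale-out):

```python
# Minimal client/server sketch: the server records a view for the video named
# in the request path and returns the current count; the client sends one
# request and displays the response.
import threading
import urllib.request
from collections import Counter
from http.server import BaseHTTPRequestHandler, HTTPServer

views = Counter()

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        video_id = self.path.strip("/")       # 2. client sends a request
        views[video_id] += 1
        body = f"{video_id}: {views[video_id]} views".encode()
        self.send_response(200)               # 3. server issues a response
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

server = HTTPServer(("localhost", 8080), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# 1. client connects to the server; 4. client displays the response
with urllib.request.urlopen("http://localhost:8080/v1") as resp:
    print(resp.read().decode())               # e.g. "v1: 1 views"
server.shutdown()
```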