Session 5 Flashcards

1
Q

You decide to build a simple web analytics application to better understand the behavior of your users.

Requirements:

A

R1 - The application should track the number of views for any video you tell it to track
-> The application’s web server gets a message every time a tracked video is watched

R2 - Additionally, the application should be able to tell you at any point what the top 100 videos are by number of views
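
A minimal in-memory sketch of R1 and R2, assuming a single process and no persistence. The class name `ViewTracker` and its methods are hypothetical, chosen only for illustration; a real system at scale would need the layered design covered in the following cards.

```python
import heapq
from collections import Counter


class ViewTracker:
    """Hypothetical single-process tracker (illustrative names only)."""

    def __init__(self):
        self.tracked = set()    # videos we were told to track
        self.views = Counter()  # video_id -> number of views

    def track(self, video_id):
        self.tracked.add(video_id)

    def record_view(self, video_id):
        # R1: the web server calls this for every "video watched" message;
        # views of untracked videos are ignored.
        if video_id in self.tracked:
            self.views[video_id] += 1

    def top(self, n=100):
        # R2: the n most-viewed videos, highest count first.
        return heapq.nlargest(n, self.views.items(), key=lambda kv: kv[1])


tracker = ViewTracker()
for vid in ("a", "b", "c"):
    tracker.track(vid)
for vid in ("a", "b", "a", "x", "a", "b"):
    tracker.record_view(vid)

top_two = tracker.top(2)
print(top_two)
```

Here `"x"` is never counted because it was not tracked, and `"c"` does not appear because it has no views yet.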

2
Q

Desired Properties of a Big Data System

A
  1. Robustness and fault tolerance
  2. Low latency
  3. Minimal maintenance
  4. Ad hoc queries
3
Q

Batch Layer

A
  1. Manages the master dataset – an immutable, append-only set of raw data
  2. Pre-computes arbitrary query functions – called batch views
  3. Runs in a loop and continuously recomputes the batch views from scratch
  4. Very simple to use and understand
  5. Scales by adding new machines.
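
The batch layer's behavior can be sketched as follows, assuming the view-counting example from card 1. The names `master_dataset`, `append_event`, and `compute_batch_view` are hypothetical, and a plain Python list stands in for distributed storage.

```python
from collections import Counter

# Master dataset: raw events are only ever appended, never updated or deleted.
# (A Python list is a stand-in for distributed storage in this sketch.)
master_dataset = []


def append_event(video_id):
    master_dataset.append({"video_id": video_id})


def compute_batch_view(dataset):
    # Recompute the view from scratch over ALL raw data on every run,
    # rather than patching the previous result.
    return dict(Counter(event["video_id"] for event in dataset))


append_event("a")
append_event("b")
batch_view = compute_batch_view(master_dataset)

append_event("a")
batch_view = compute_batch_view(master_dataset)  # next loop iteration
print(batch_view)
```

Because each run starts from the raw data, a bug in `compute_batch_view` can be fixed and the view simply recomputed.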
4
Q

Speed Layer

A
  1. Accommodates all requests that are subject to low latency requirements
  2. Its goal is to ensure new data is represented in query functions as quickly as needed for the application requirements
  3. Similar to the batch layer in that it produces views based on data it receives
    -> One big difference is that the speed layer only looks at recent data, whereas the batch layer looks at all the data at once
  4. Does incremental computation instead of the recomputation done in the batch layer
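
A minimal sketch of incremental computation in the speed layer, again assuming the view-counting example. The names `realtime_view`, `on_new_event`, and `query` are hypothetical. Answering a query by adding the batch result and the realtime result is an assumption based on the usual Lambda Architecture description; the cards do not state it.

```python
from collections import Counter

# Result of the last batch run (hypothetical sample values).
batch_view = {"a": 2, "b": 1}

# Realtime view: covers only events that arrived after that batch run.
realtime_view = Counter()


def on_new_event(video_id):
    # Incremental: update the existing count instead of recomputing everything.
    realtime_view[video_id] += 1


def query(video_id):
    # Assumption: a query combines the batch view with the realtime view.
    return batch_view.get(video_id, 0) + realtime_view[video_id]


on_new_event("a")
on_new_event("c")
print(query("a"), query("b"), query("c"))
```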
5
Q

Serving Layer

A
  1. Indexes batch views so that they can be queried with low latency
  2. The serving layer is a specialized distributed database that loads in a batch view and makes it possible to do random reads on it
  3. When new batch views are available, the serving layer automatically swaps those in so that more up-to-date results are available
  4. It does not need to support specific record updates
    -> This is a very important point, as random writes cause most of the complexity in databases
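
The points above can be sketched as a read-only store whose contents are replaced wholesale. The class `ServingLayer` and its methods are hypothetical, and an in-memory dict stands in for a distributed database.

```python
class ServingLayer:
    """Hypothetical read-only store: whole-view swaps, no per-record writes."""

    def __init__(self):
        self._view = {}

    def load_batch_view(self, new_view):
        # Swap in the complete new view in one step. There is deliberately
        # no method for updating a single record.
        self._view = dict(new_view)

    def get(self, key, default=0):
        # Random read by key.
        return self._view.get(key, default)


serving = ServingLayer()
serving.load_batch_view({"a": 2, "b": 1})
before = serving.get("a")

serving.load_batch_view({"a": 5, "b": 1, "c": 4})  # newer batch view arrives
after = serving.get("a")
print(before, after)
```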
6
Q

Hadoop Distributed File System (HDFS) is the open source alternative to Google File System

A
  • Commodity hardware
  • Tolerant to failure

(batch layer)

7
Q

How do distributed file systems work?

A
  1. All files are broken into blocks (usually 64 to 256 MB)
  2. These blocks are replicated (typically 3 copies) among the HDFS servers (datanodes)
  3. The namenode provides a lookup service for clients accessing the data and ensures the blocks are correctly replicated across the cluster
  1. Broken into blocks
  2. Blocks are replicated
  3. Lookup service
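
A toy model of the three steps, not the real HDFS API. The 4-byte block size, the four datanodes, and the round-robin replica placement are all assumptions made to keep the example small; `put` and `get` are hypothetical helper names.

```python
BLOCK_SIZE = 4    # bytes, toy value; real blocks are usually 64 to 256 MB
REPLICATION = 3   # copies of every block

datanodes = {f"dn{i}": {} for i in range(4)}  # node name -> {block_id: bytes}
namenode = {}     # filename -> [(block_id, [node names holding a replica])]


def put(filename, data):
    names = sorted(datanodes)
    entries = []
    for n, start in enumerate(range(0, len(data), BLOCK_SIZE)):
        block_id = f"{filename}#{n}"
        block = data[start:start + BLOCK_SIZE]                 # 1. break into blocks
        # Assumption: simple round-robin placement of replicas.
        targets = [names[(n + r) % len(names)] for r in range(REPLICATION)]
        for node in targets:
            datanodes[node][block_id] = block                  # 2. replicate
        entries.append((block_id, targets))
    namenode[filename] = entries                               # 3. lookup metadata


def get(filename, failed=()):
    data = b""
    for block_id, targets in namenode[filename]:
        alive = [node for node in targets if node not in failed]
        data += datanodes[alive[0]][block_id]  # read from any surviving replica
    return data


put("log.txt", b"hello world")
print(get("log.txt"))
print(get("log.txt", failed={"dn0", "dn1"}))
```

With three replicas spread over four nodes, the file is still readable after two nodes fail.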
8
Q

Hadoop MapReduce

A
  • MapReduce is a distributed computing paradigm originally pioneered by Google; Hadoop MapReduce is its open source implementation
  • Used to process data in the batch layer
9
Q

The Split-Apply-Combine Approach

A

Each “Apply” operation can be performed independently of other “Apply” operations
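
A small sketch of split-apply-combine on made-up view events. Because each group is processed independently, the "apply" step can run in parallel; the thread pool here is only an illustration of that property, and the function name `apply_group` is hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sample data: (video_id, views) pairs.
events = [("a", 1), ("b", 1), ("a", 1), ("c", 1), ("a", 1)]

# Split: group the values by key.
groups = defaultdict(list)
for key, value in events:
    groups[key].append(value)


# Apply: each group is handled on its own, with no shared state.
def apply_group(item):
    key, values = item
    return key, sum(values)


with ThreadPoolExecutor() as pool:
    partials = list(pool.map(apply_group, groups.items()))

# Combine: merge the per-group results into one answer.
result = dict(partials)
print(result)
```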

10
Q

Big Data infrastructure as a service

Elastic clouds

A
  1. Elastic clouds allow you to rent hardware on demand rather than own your own hardware in your own location.
  2. Elastic clouds let you increase or decrease the size of your cluster nearly instantaneously, so if you have a big job you want to run, you can allocate the hardware temporarily.
  3. Elastic clouds dramatically simplify system administration. They also provide additional storage and hardware allocation options that can significantly drive down the price of your infrastructure.
  1. Rent hardware on demand
  2. Increase or decrease the size of your cluster nearly instantaneously
  3. Simplify system administration
11
Q

Examples of elastic cloud suppliers

A
  • Microsoft Azure
  • Amazon Web Services
  • DigitalOcean
12
Q

Many machine learning tools can be used on top of this infrastructure (elastic clouds)

A
  • Amazon ML
  • Microsoft Azure/ML
  • H2O (from 0xdata)

All of these tools implement state-of-the-art ML algorithms out of the box. Some of them are also extensible, i.e., you can implement your own algorithms

13
Q

A Client / Server Model

A

1 Clients connect to a server over the Internet

2 Clients perform a request

3 Server issues response

4 Clients display response
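
The four steps can be tried locally with Python's standard library. This is a sketch only: the server runs on the loopback interface instead of the Internet, and the `/videos` path and the response text are made up for the example.

```python
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Step 3: the server issues a response.
        body = f"you asked for {self.path}".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example output quiet


# Port 0 lets the operating system pick a free port.
server = HTTPServer(("127.0.0.1", 0), Handler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

conn = HTTPConnection("127.0.0.1", port, timeout=5)  # Step 1: connect
conn.request("GET", "/videos")                       # Step 2: perform a request
response = conn.getresponse()
status = response.status
text = response.read().decode()
conn.close()

print(status, text)                                  # Step 4: display response

server.shutdown()
server.server_close()
```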
