Session 5.1 Flashcards
We decide to build a simple web analytics application to better understand the behavior of our users.
What kind of system should we put in place to fulfill the requirements?
We start with a traditional relational schema for the pageviews
Analytics Server -> Database
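The relational starting point might look like the following minimal sketch (using SQLite for illustration; the table and column names are assumptions, not taken from the source):

```python
import sqlite3

# Illustrative schema: one counter row per (user, URL) pair.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE pageviews (
        id        INTEGER PRIMARY KEY,
        user_id   INTEGER NOT NULL,
        url       TEXT NOT NULL,
        pageviews INTEGER NOT NULL DEFAULT 0,
        UNIQUE (user_id, url)
    )
""")

def record_pageview(user_id, url):
    # The analytics server increments a counter on every view.
    cur = conn.execute(
        "UPDATE pageviews SET pageviews = pageviews + 1 "
        "WHERE user_id = ? AND url = ?", (user_id, url))
    if cur.rowcount == 0:
        # First view of this URL by this user: create the row.
        conn.execute(
            "INSERT INTO pageviews (user_id, url, pageviews) VALUES (?, ?, 1)",
            (user_id, url))

record_pageview(42, "/home")
record_pageview(42, "/home")
```

Every pageview turns into one write against a single database, which is exactly what breaks down in the next card.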
Which problems can emerge from this approach?
Scaling problems
Our startup is a huge success and traffic is growing rapidly
Our main application is fine: we host it on Amazon Web Services, which can handle the traffic
However, our analytics application is struggling to keep up with the traffic
We look at the logs and see that the problem is in the database: it cannot keep up with the rate of incoming requests
Analytics Server -> Database
How to deal with scaling problems?
The best approach is to use multiple database servers and spread the table across all servers. Each server will have a subset of the data.
Hash function
a function that decides which database should keep information about a user
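Such a hash function can be sketched in a few lines (a minimal illustration; the server count and function names are assumptions):

```python
import hashlib

NUM_SERVERS = 4  # illustrative number of database servers

def shard_for(user_id):
    """Pick which database server holds this user's rows.

    Uses a stable hash (MD5 of the user id) rather than Python's
    built-in hash(), which is randomized between runs, so every
    application server maps the same user to the same database.
    """
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SERVERS
```

The application then routes every read and write for a user to the server `shard_for(user_id)` returns.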
As the application becomes more popular we only need to deploy more database servers
- We also need to use a different hash function so that the new database servers get some load as well
- We also need to redistribute the users according to the new hash function. While we do this, the queue of pending writes keeps growing
- Finally we need to change the code of our application so that it knows how to find a specific user in the databases
- Use a different hash function
- Redistribute users
- Change the code
Every time we add one more database, this process becomes more and more painful
- More data takes longer to redistribute and our queue gets longer
- If we forget to update the hash function everywhere, we can start writing to and reading from the wrong databases
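The redistribution pain can be made concrete. With plain modulo-based hashing, adding a single server reassigns the majority of users (a rough sketch; the user count and server numbers are illustrative):

```python
import hashlib

def shard(user_id, num_servers):
    # Same stable modulo-based placement as before.
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % num_servers

users = range(10_000)
moved = sum(1 for u in users if shard(u, 4) != shard(u, 5))
fraction_moved = moved / len(users)
# With modulo hashing, roughly 4 out of 5 users land on a
# different server after growing from 4 to 5 machines -- all of
# their rows must be copied while the write queue keeps growing.
```

Schemes such as consistent hashing exist precisely to keep this fraction small, but the flashcards' point stands: with the naive approach, every expansion triggers a large, error-prone migration.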
Fault-tolerance issues
Once we have many database servers, hard-drive failures on individual machines become a frequent occurrence
- We need to deal with having one of the databases down
- We need to add backups to each of the databases
Our system is not resilient to hardware errors
Data corruption issues
At some point we deploy code with a bug: instead of incrementing each video viewership by one unit, our code increments by two units. We notice the mistake only 24 hours later.
- Now we have corrupted data: every video watched in the past 24 hours has its viewership inflated
- How do we solve this?
Our system is not resilient to human errors
The desired properties of Big Data systems are related both to
complexity and scalability
Complexity
generally used to characterize something with many parts that interact with each other in multiple ways
Scalability
ability to maintain performance in the face of increasing data or load by adding resources to the system
A Big Data system must …
- perform well
- be resource-efficient
- be easy to reason about
Desired properties of a Big Data system
- Robustness and fault tolerance
Systems need to behave correctly despite:
- machines going down randomly
- complex semantics of consistency in distributed databases
- duplicated data
- concurrency
- human errors
These challenges make it difficult to reason about what a system is doing
- Part of making a Big Data system robust is avoiding these complexities so that you can easily reason about the system
Desired properties of a Big Data system
- Low latency
Latency is the time between a request and a response
The vast majority of applications require reads to be satisfied with very low latency, typically between a few milliseconds and a few hundred milliseconds
Latency requirements vary a great deal between applications:
- Some applications require updates to propagate immediately, but in other applications a latency of a few hours is fine
e.g., a Facebook post vs. a bank account balance
Desired properties of a Big Data system
- Minimal maintenance
Maintenance is the work required to keep a system running smoothly
- This includes anticipating when to add machines to scale, keeping processes up and running, and debugging anything that goes wrong in production
An important part of minimizing maintenance is choosing components that have as little implementation complexity as possible