Session 5.1 Flashcards

1
Q

We decide to build a simple web analytics application to better understand the behavior of our users.

What kind of system should we put in place to fulfill the requirements?

A

We start with a traditional relational schema for the pageviews
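
As a minimal sketch of what such a schema might look like, the snippet below uses Python's built-in `sqlite3` module. The table and column names (`pageviews`, `user_id`, `url`, `views`) and the helper `record_pageview` are assumptions made for illustration; the card does not specify them.

```python
import sqlite3

# In-memory database so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE pageviews (
        user_id INTEGER NOT NULL,
        url     TEXT    NOT NULL,
        views   INTEGER NOT NULL DEFAULT 0,
        PRIMARY KEY (user_id, url)
    )
    """
)

def record_pageview(conn, user_id, url):
    """Increment the counter for (user_id, url), creating the row if needed."""
    conn.execute(
        "INSERT OR IGNORE INTO pageviews (user_id, url, views) VALUES (?, ?, 0)",
        (user_id, url),
    )
    conn.execute(
        "UPDATE pageviews SET views = views + 1 WHERE user_id = ? AND url = ?",
        (user_id, url),
    )

record_pageview(conn, 1, "/home")
record_pageview(conn, 1, "/home")
record_pageview(conn, 2, "/home")

rows = conn.execute(
    "SELECT user_id, url, views FROM pageviews ORDER BY user_id"
).fetchall()
print(rows)  # [(1, '/home', 2), (2, '/home', 1)]
```

Every pageview turns into a write against this one table, which is why the database becomes the bottleneck as traffic grows.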

2
Q

Analytics Server -> Database

Which problems can emerge from this approach?

A

Scaling problems

Our startup is a huge success and traffic is growing rapidly

Our main application is fine: we have hosted it on Amazon Web Services,
and they are able to handle the traffic

However, our analytics application is struggling to keep up with the traffic

The logs show that the bottleneck is the database: it cannot keep up with the rate of incoming requests

3
Q

Analytics Server -> Database

How to deal with scaling problems?

A

A common approach is horizontal partitioning (sharding): use multiple database servers and spread the table across them, so that each server holds a subset of the data.

4
Q

Hash function

A

A function that decides which database server stores the information about a given user, e.g., by hashing the user ID and taking the result modulo the number of servers
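
One possible way to write such a function is sketched below. The name `shard_for`, the use of MD5, and the server names `db0` to `db3` are assumptions for illustration, not something the card prescribes. A digest from `hashlib` is used instead of Python's built-in `hash()` because the latter is randomized per process for strings, so it would not route the same user to the same database across restarts.

```python
import hashlib

def shard_for(user_id: str, num_shards: int) -> int:
    """Map a user ID to a shard index in the range [0, num_shards)."""
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer, then reduce modulo the shard count.
    return int.from_bytes(digest[:8], "big") % num_shards

# Hypothetical database servers.
shards = [f"db{i}" for i in range(4)]

for user in ["alice", "bob", "carol"]:
    print(user, "->", shards[shard_for(user, len(shards))])
```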

5
Q

As the application becomes more popular we only need to deploy more database servers

A
  • We also need to use a different hash function so that the new database servers get some load as well
  • We also need to redistribute the users according to the new hash function. While we do this, our queue of pending requests just keeps growing
  • Finally we need to change the code of our application so that it knows how to find a specific user in the databases
  1. Use different hash function
  2. Redistribute users
  3. Change the code
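
To see why redistribution is expensive, the sketch below counts how many users change database when going from 4 to 5 servers. It assumes a simple modulo-based hash function (the hypothetical `shard_for`) and made-up user IDs. With this scheme a user stays put only if the hash gives the same remainder modulo 4 and modulo 5, which for a uniform hash happens about 20% of the time, so roughly 80% of users are expected to move.

```python
import hashlib

def shard_for(user_id: str, num_shards: int) -> int:
    """Hypothetical modulo-based hash function mapping a user to a shard."""
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

users = [f"user{i}" for i in range(10_000)]

# A user must be moved if the old and new hash functions disagree.
moved = sum(1 for u in users if shard_for(u, 4) != shard_for(u, 5))
fraction_moved = moved / len(users)

print(f"{fraction_moved:.0%} of users must be moved to a different database")
```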
6
Q

Every time we add one more database this process becomes more and more painful

A
  • More data takes longer to redistribute and our queue gets longer
  • If we forget to update the hash function in the application code, we start writing to and reading from the wrong databases
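
The second failure mode can be illustrated with a small simulation. In this sketch the databases are plain dictionaries, the writer already uses 5 shards, and a reader that was not updated still assumes 4. All names are hypothetical.

```python
import hashlib

def shard_for(user_id: str, num_shards: int) -> int:
    """Hypothetical modulo-based hash function mapping a user to a shard."""
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

NEW_SHARD_COUNT = 5    # what the writer uses after adding a server
STALE_SHARD_COUNT = 4  # what a forgotten piece of code still uses

databases = [dict() for _ in range(NEW_SHARD_COUNT)]
users = [f"user{i}" for i in range(1000)]

# Writes are routed with the new hash function.
for user in users:
    databases[shard_for(user, NEW_SHARD_COUNT)][user] = 1

# Reads routed with the stale hash function look in the wrong place.
missing = sum(
    1 for user in users
    if user not in databases[shard_for(user, STALE_SHARD_COUNT)]
)
print(f"{missing} of {len(users)} lookups went to the wrong database")
```

Nothing raises an error here: the stale reader simply finds no data, which is what makes this kind of mistake easy to miss.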
7
Q

Fault-tolerance issues

A

With many database servers, it becomes common for the hard drive of one of them to fail

  • We need to deal with having one of the databases down
  • We need to add backups to each of the databases

Our system is not resilient to hardware errors

8
Q

Data corruption issues

A

At some point we deploy code with a bug: instead of incrementing each video viewership by one unit, our code increments by two units. We notice the mistake only 24 hours later.

  • Now we have corrupted data: every video watched in the past 24 hours has its viewership inflated
  • How do we solve this?

Our system is not resilient to human errors

9
Q

The desired properties of Big Data systems are related both to

A

complexity and scalability

10
Q

Complexity

A

generally used to characterize something with many parts where those parts interact with each other in multiple ways

11
Q

Scalability

A

ability to maintain performance in the face of increasing data or load by adding resources to the system

12
Q

A Big Data system must …

A
  1. perform well
  2. be resource-efficient
  3. be easy to reason about
13
Q

Desired properties of a Big Data system

  1. Robustness and fault tolerance
A

Systems need to behave correctly despite:

  • machines going down randomly
  • complex semantics of consistency in distributed databases
  • duplicated data
  • concurrency
  • human errors

These challenges make it difficult to reason about what a system is doing
  • Part of making a Big Data system robust is avoiding these complexities so that you can easily reason about the system

14
Q

Desired properties of a Big Data system

  1. Low latency
A

Latency is the time between a request and a response

The vast majority of applications require reads to be satisfied with very low latency, typically between a few milliseconds and a few hundred milliseconds

Latency requirements vary a great deal between applications:

  • Some applications require updates to propagate immediately, but in other applications a latency of a few hours is fine
    e.g., Facebook post vs bank account balance
15
Q

Desired properties of a Big Data system

  1. Minimal maintenance
A

Maintenance is the work required to keep a system running smoothly

  • This includes anticipating when to add machines to scale, keeping processes up and running, and debugging anything that goes wrong in production

An important part of minimizing maintenance is choosing components that have as little implementation complexity as possible

16
Q

Desired properties of a Big Data system

  1. Ad hoc queries
A

Being able to do ad hoc queries on your data is extremely important

  • Nearly every large dataset has unanticipated value within it
  • Being able to mine a dataset arbitrarily gives opportunities for business optimization and new applications