Session 5.1 Flashcards
We decide to build a simple web analytics application to better understand the behavior of our users.
What kind of system should we put in place to fulfill the requirements?
We start with a traditional relational schema for the pageviews
Analytics Server -> Database
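The relational starting point might look like the following minimal sketch (using SQLite for illustration; the table and column names are assumptions, not taken from the source):

```python
import sqlite3

# Illustrative schema: one counter row per (user, URL) pair.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE pageviews (
        id        INTEGER PRIMARY KEY,
        user_id   INTEGER NOT NULL,
        url       TEXT NOT NULL,
        pageviews INTEGER NOT NULL DEFAULT 0,
        UNIQUE (user_id, url)
    )
""")

def record_pageview(user_id, url):
    # The analytics server increments a counter on every view.
    cur = conn.execute(
        "UPDATE pageviews SET pageviews = pageviews + 1 "
        "WHERE user_id = ? AND url = ?", (user_id, url))
    if cur.rowcount == 0:
        # First view of this URL by this user: create the row.
        conn.execute(
            "INSERT INTO pageviews (user_id, url, pageviews) VALUES (?, ?, 1)",
            (user_id, url))

record_pageview(42, "/home")
record_pageview(42, "/home")
```

Every pageview turns into one write against a single database, which is exactly what breaks down in the next card.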
Which problems can emerge from this approach?
Scaling problems
Our startup is a huge success and traffic is growing rapidly
Our main application is fine: we host it on Amazon Web Services, which can handle the traffic
However, our analytics application is struggling to keep up with the traffic
We look at the logs and see that the problem is in the database: it cannot keep up with the rate of incoming requests
Analytics Server -> Database
How to deal with scaling problems?
The best approach is to use multiple database servers and spread the table across all servers. Each server will have a subset of the data.
Hash function
a function that decides which database should keep information about a user
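Such a hash function can be sketched in a few lines (a minimal illustration; the server count and function names are assumptions):

```python
import hashlib

NUM_SERVERS = 4  # illustrative number of database servers

def shard_for(user_id):
    """Pick which database server holds this user's rows.

    Uses a stable hash (MD5 of the user id) rather than Python's
    built-in hash(), which is randomized between runs, so every
    application server maps the same user to the same database.
    """
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SERVERS
```

The application then routes every read and write for a user to the server `shard_for(user_id)` returns.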
As the application becomes more popular we only need to deploy more database servers
- We also need to use a different hash function so that the new database servers get some load as well
- We also need to redistribute the users according to the new hash function. While we do this, the queue of pending writes keeps growing
- Finally we need to change the code of our application so that it knows how to find a specific user in the databases
- Use a different hash function
- Redistribute users
- Change the code
Every time we add one more database, this process becomes more and more painful
- More data takes longer to redistribute and our queue gets longer
- If we forget to update the hash function everywhere, we can start writing to and reading from the wrong databases
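The redistribution pain can be made concrete. With plain modulo-based hashing, adding a single server reassigns the majority of users (a rough sketch; the user count and server numbers are illustrative):

```python
import hashlib

def shard(user_id, num_servers):
    # Same stable modulo-based placement as before.
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % num_servers

users = range(10_000)
moved = sum(1 for u in users if shard(u, 4) != shard(u, 5))
fraction_moved = moved / len(users)
# With modulo hashing, roughly 4 out of 5 users land on a
# different server after growing from 4 to 5 machines -- all of
# their rows must be copied while the write queue keeps growing.
```

Schemes such as consistent hashing exist precisely to keep this fraction small, but the flashcards' point stands: with the naive approach, every expansion triggers a large, error-prone migration.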
Fault-tolerance issues
Once we have many database servers, hard-drive failures on individual machines become a frequent occurrence
- We need to deal with having one of the databases down
- We need to add backups to each of the databases
Our system is not resilient to hardware errors
Data corruption issues
At some point we deploy code with a bug: instead of incrementing each video viewership by one unit, our code increments by two units. We notice the mistake only 24 hours later.
- Now we have corrupted data: every video watched in the past 24 hours has its viewership inflated
- How do we solve this?
Our system is not resilient to human errors
The desired properties of Big Data systems are related both to
complexity and scalability
Complexity
generally used to characterize something with many parts that interact with each other in multiple ways
Scalability
ability to maintain performance in the face of increasing data or load by adding resources to the system
A Big Data system must …
- perform well
- be resource-efficient
- be easy to reason about
Desired properties of a Big Data system
- Robustness and fault tolerance
Systems need to behave correctly despite:
- machines going down randomly
- complex semantics of consistency in distributed databases
- duplicated data
- concurrency
- human errors
These challenges make it difficult to reason about what a system is doing
- Part of making a Big Data system robust is avoiding these complexities so that you can easily reason about the system
Desired properties of a Big Data system
- Low latency
Latency is the time between a request and a response
The vast majority of applications require reads to be satisfied with very low latency, typically between a few milliseconds and a few hundred milliseconds
Latency requirements vary a great deal between applications:
- Some applications require updates to propagate immediately, but in other applications a latency of a few hours is fine
e.g., a Facebook post vs. a bank account balance
Desired properties of a Big Data system
- Minimal maintenance
Maintenance is the work required to keep a system running smoothly
- This includes anticipating when to add machines to scale, keeping processes up and running, and debugging anything that goes wrong in production
An important part of minimizing maintenance is choosing components that have as little implementation complexity as possible