Sys Design and Distributed Systems Flashcards
What is availability?
The likelihood of your system being operational and accessible to users when needed. It’s a measure of uptime, often expressed as a percentage. Cloud providers such as Google Cloud offer high availability through features like regions and zones, with Service Level Agreements (SLAs) guaranteeing a specific uptime target for various services.
What is latency?
It refers to the time it takes for data to travel between two points in your system. Think of it like how long it takes for a knight’s message to get from your castle (user) to the king’s advisors (server) and back. Lower latency means messages get delivered faster, resulting in a snappier user experience.
Google Cloud minimizes latency with features like its global network infrastructure and regional deployments. You can also use tools like Cloud Monitoring to track and optimize latency within your applications.
What is RPC?
Imagine you, the programmer, are a knight calling upon a powerful API (Application Programming Interface) in another castle (server). RPC, or Remote Procedure Call, is like your trusty squire.
The squire races to the castle (server) with your request (function call), waits for the API (server) to complete the task, and then sprints back with the results, all without you having to leave your comfy coding zone.
Note: The RPC runtime is responsible for transmitting messages between client and server via the network. The responsibilities of RPC runtime also include retransmission, acknowledgment, and encryption
RPC Summary
The RPC method is similar to calling a local procedure, except that the called procedure is usually executed in a different process and on a different computer.
RPC allows developers to build applications on top of distributed systems. Developers can use the RPC method without knowing the network communication details. As a result, they can concentrate on the design aspects, rather than the machine and communication-level specifics.
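The remote-call-looks-local idea can be sketched with Python’s standard-library xmlrpc modules, used here purely as an illustration of the RPC pattern (the server address and the `add` procedure are made up for this example):

```python
import threading
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

# Server side: register a procedure that clients may call remotely.
# Port 0 asks the OS for any free port.
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
port = server.server_address[1]
server.register_function(lambda a, b: a + b, "add")
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: the proxy makes the remote call look like a local one.
# The RPC runtime handles serialization and transport over the network.
proxy = ServerProxy(f"http://127.0.0.1:{port}")
result = proxy.add(2, 3)
print(result)  # 5
```

Note that `proxy.add(2, 3)` reads exactly like a local function call, which is the whole point of RPC.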
Local Procedure Call (LPC)
Local Procedure Call (LPC): This refers to a mechanism for communication between different parts of a program running on the same computer. It allows them to exchange data and synchronize their actions. Think of it as two colleagues working on the same project within the same office, easily passing information back and forth.
RPC + LPC For an app
1. Remote Procedure Call (RPC):
Local: The mobile app (client).
Remote: The machine learning model server.
The process:
User takes a picture.
The mobile app (client) sends an RPC containing the image data to the machine learning model server (remote). This RPC acts as a messenger, carrying the image data across the network to the server.
The server receives the RPC, processes the image data using the machine learning model, and identifies the objects.
The server sends a response back to the mobile app through the same RPC channel, containing the identified objects.
The mobile app receives the response and displays the identified objects to the user.
2. Local Procedure Call (LPC):
Local: The mobile app itself.
An LPC might come into play within the mobile app’s image processing pipeline before the RPC is sent:
The app receives the image from the camera.
An LPC might be used to call a local image pre-processing function within the app. This function could resize the image, convert it to the format expected by the server, or perform other necessary transformations.
The pre-processed image data is then packaged and sent through the RPC to the server for object identification.
In essence, RPC facilitates communication between the app (client) and the separate server hosting the machine learning model, while LPC enables communication between different parts of the app itself running on the same device.
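The pipeline above can be sketched in a few lines. Everything here is hypothetical stand-in code: `preprocess` represents the in-process LPC step, and `classify_remote` stubs out the network call to the model server:

```python
def preprocess(image: bytes) -> bytes:
    """LPC: an in-process function call inside the mobile app.
    Stand-in for resizing / format conversion."""
    return image.strip()

def classify_remote(image: bytes) -> list:
    """Stub for the RPC to the model server; a real client would
    serialize the image and send it over the network."""
    return ["cat"] if b"cat" in image else ["unknown"]

def handle_photo(raw: bytes) -> list:
    prepared = preprocess(raw)        # LPC: same process, same device
    return classify_remote(prepared)  # RPC: crosses the network in reality

print(handle_photo(b"  cat  "))  # ['cat']
```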
What is ACID consistency?
Imagine you’re playing a game with your friends where you all take turns adding stickers to a picture (database). ACID helps make sure the picture doesn’t get messed up:
Atomicity: It’s like adding all your stickers at once (transaction). You either finish adding them all or none at all, so the picture doesn’t end up half-decorated and confusing.
Consistency: It’s like having rules about the picture (data stays valid). Maybe you can only use specific colors or shapes, so the picture always looks good and makes sense.
Isolation: Even if your friends (other transactions) try to add stickers at the same time, ACID makes sure each person only sees the picture one way at a time (transaction isolation). This avoids any sticker fights!
Durability: Once you stick on your stickers (update the data), they stay stuck forever (data persistence). Even if you accidentally knock over the picture (system crash), the stickers stay put!
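Atomicity is easy to see in code. This sketch uses Python’s built-in sqlite3 module (the `scores` table is invented for the example): a transaction that fails halfway leaves no trace.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (player TEXT PRIMARY KEY, points INTEGER)")
conn.execute("INSERT INTO scores VALUES ('alice', 10)")
conn.commit()

# Atomicity: both statements succeed together, or neither does.
try:
    with conn:  # opens a transaction; rolls back on exception
        conn.execute("UPDATE scores SET points = points - 5 WHERE player='alice'")
        conn.execute("INSERT INTO scores VALUES ('alice', 99)")  # violates the PK
except sqlite3.IntegrityError:
    pass  # the whole transaction was rolled back

points = conn.execute(
    "SELECT points FROM scores WHERE player='alice'").fetchone()[0]
print(points)  # 10 — the failed transaction left no half-applied update
```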
Explain the difference between ACID consistency and CAP consistency
ACID and CAP deal with consistency in different ways:
ACID (Atomicity, Consistency, Isolation, Durability):
Focuses on data integrity within a single database.
Ensures reliable updates, like your favorite game keeping your high score safe.
Imagine it as a strict teacher in the classroom (database) making sure everyone follows the rules (data stays valid) when updating the board (data).
CAP (Consistency, Availability, Partition Tolerance):
Deals with distributed systems where data is spread across multiple locations.
Focuses on trade-offs between keeping data consistent everywhere (Consistency), being always available (Availability), and tolerating network problems (Partition Tolerance).
Imagine a game with multiple scoreboards (data) in different schools (servers). When the network between schools fails (a partition), you must choose:
Refuse updates until every scoreboard can be kept in sync (sacrificing Availability).
Keep accepting updates at every school, letting scoreboards temporarily disagree (sacrificing Consistency).
Because real networks inevitably partition, Partition Tolerance isn’t really optional: the practical choice is between Consistency and Availability while a partition lasts.
Eventual Consistency
Eventual consistency is like waiting for the mail to deliver gossip in a big town. Updates are sent out (replicated) but might take a while to reach everyone (all servers). Eventually, everyone will have the latest news (consistent data), but there might be a short delay.
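A toy sketch of the gossip-by-mail idea: writes land on a primary immediately and reach the replica only when the (simulated) asynchronous replication runs. All names here are invented for the illustration.

```python
from collections import deque

primary = {}
replica = {}
replication_log = deque()  # updates shipped to the replica later, not immediately

def write(key, value):
    primary[key] = value
    replication_log.append((key, value))

def replicate_once():
    """Apply one pending update; a real system does this asynchronously."""
    if replication_log:
        k, v = replication_log.popleft()
        replica[k] = v

write("score", 42)
print(replica.get("score"))  # None — the replica hasn't caught up yet
replicate_once()
print(replica.get("score"))  # 42 — eventually consistent
```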
Which consistency model offers the highest availability?
Eventual consistency
What is the weakest consistency model?
Eventual consistency
SQL v NoSQL
Imagine your data is a collection of items in a classroom. SQL databases are like filing cabinets with neat rows and columns, perfect for things that fit in folders, like names and grades (structured data). NoSQL databases are like big boxes where you can store all sorts of things, like drawings, projects, and maybe even a toy robot (unstructured data)! They’re more flexible for messy data that doesn’t fit neatly in rows and columns.
SQL v NoSQL as explained by solution architect
Structure:
SQL: Enforces a predefined schema with rigid table structures and data types. Think of it as a strictly organized library with specific sections for books, DVDs, and audiobooks.
NoSQL: Offers flexible schema with various data models like documents, key-value pairs, or graphs. Imagine a modern library with designated areas for different media, but items within each section can be diverse.
Scalability:
SQL: Primarily scales vertically by adding more processing power to a single server. It can become expensive for massive datasets. Think of adding more shelves to a single, overflowing bookcase.
NoSQL: Scales horizontally by adding more servers to distribute the data load. Ideal for handling constantly growing datasets. Imagine adding more bookcases to a library as the collection expands.
Use Cases:
SQL: Excellent for structured data with complex queries and transactional consistency (think banking or e-commerce). It’s the go-to for relational data with established schemas.
NoSQL: Perfect for unstructured or semi-structured data with high availability and performance needs (think social media or IoT sensor data). Ideal for large, evolving datasets where flexibility is crucial.
Choosing the Right Tool:
Consider data structure, scalability requirements, and query patterns. If data is relational and requires complex joins, SQL might be ideal. For vast, evolving data with high availability needs, NoSQL could be a better fit.
Ultimately, the best choice depends on the specific needs of your application and data.
Strong Consistency:
This is the gold standard, guaranteeing that all reads always reflect the latest write across all replicas of the data. Imagine a single source of truth, like a master document everyone can access simultaneously. It offers the highest data integrity but can impact performance and scalability.
Read Your Writes Consistency:
This model ensures that a client can always read its own successful writes immediately. Think of it like writing a note and then immediately being able to read it back yourself. However, other clients might not see the update yet. This model offers a balance between availability and consistency and is suitable for scenarios where immediate access to self-generated data is important.
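One common way to get read-your-writes is to route a client’s reads for keys it has written back to the primary, while other clients read from a lagging replica. This is a hypothetical sketch of that routing idea, not any particular database’s API:

```python
primary, replica = {}, {}
session_writes = {}  # keys this client wrote during its session

def write(session, key, value):
    primary[key] = value
    session[key] = value  # remember what this client wrote

def read(session, key):
    if key in session:         # read-your-writes: serve from the primary
        return primary[key]
    return replica.get(key)    # other keys may come from a stale replica

write(session_writes, "bio", "hello")
print(read(session_writes, "bio"))  # 'hello' — this client sees its own write
print(replica.get("bio"))           # None — other clients may not, yet
```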
How do I choose my consistency model?
Data Integrity: How critical is it for all data to be immediately consistent across all replicas?
Availability: Can the system tolerate any downtime or lag in data updates?
Performance: How important are fast read and write operations?
Scalability: Will your data volume grow significantly over time?
Architecting with Consistency Models:
Strong consistency might be ideal for financial transactions or critical real-time systems requiring absolute data accuracy.
Eventual consistency is well-suited for social media platforms or e-commerce sites where immediate data updates are less crucial than high availability.
What are some types of failures?
Single Point of Failure (SPOF): This occurs when a single component’s failure cripples the entire system. Imagine a bridge with only one lane – if that lane collapses, the entire bridge is unusable. Solutions include redundancy, like building additional lanes or finding alternative routes.
Cascading Failure: This occurs when the failure of one component triggers failures in other dependent components, creating a domino effect. Think of a power outage that shuts down critical servers, leading to data loss and service disruptions throughout the system. Mitigation strategies involve isolating components, designing graceful degradation, and implementing fault tolerance mechanisms.
Resource Exhaustion: This occurs when a system runs out of critical resources like CPU, memory, or storage, causing performance degradation or complete system crashes. Imagine a car running out of gas – it simply stops functioning. Solutions involve resource monitoring, auto-scaling capabilities, and capacity planning.
Byzantine Failures: This complex model describes situations where failing components can exhibit unpredictable behavior, sending misleading or inconsistent information. Imagine a group of unreliable witnesses to an event, each providing conflicting accounts. Byzantine fault tolerance is a challenging area of distributed systems design.
How do you mitigate failures?
Redundancy: Introduce backups, failover mechanisms, or load balancing to avoid SPOFs.
Isolation: Design your system with loosely coupled components to limit the impact of cascading failures.
Monitoring and resource management: Proactively monitor resource usage and implement scaling mechanisms to prevent exhaustion.
Error handling and recovery: Build robust error handling and recovery routines to gracefully handle failures and minimize downtime.
Test and validate: Regularly test your system under simulated failure conditions to verify the effectiveness of your mitigation strategies.
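The redundancy/failover idea can be sketched as a client that tries replicas in order until one responds, so no single replica is a single point of failure. The replica functions here are made up for the example:

```python
def call_with_failover(replicas, request):
    """Try each replica in turn; raise only if all of them fail."""
    last_error = None
    for replica in replicas:
        try:
            return replica(request)
        except ConnectionError as e:
            last_error = e  # record the failure and try the next replica
    raise last_error

def down(_):
    raise ConnectionError("replica down")

def healthy(req):
    return f"handled {req}"

print(call_with_failover([down, healthy], "ping"))  # 'handled ping'
```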
Availability: What is availability?
Availability is the percentage of time that a service or infrastructure is accessible to clients and operating under normal conditions. For example, if a service has 100% availability, it functions and responds as intended (operates normally) all the time.
Non-functional Sys Char: How do we measure availability?
Availability (%) = ((total time − downtime) / total time) × 100
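The same formula in code, with a worked example (8,760 hours in a 365-day year):

```python
def availability_pct(total_time: float, downtime: float) -> float:
    """Availability (%) = ((total time - downtime) / total time) * 100."""
    return (total_time - downtime) / total_time * 100

# 8.76 hours of downtime in an 8,760-hour year is "three nines".
print(round(availability_pct(8760, 8.76), 3))  # 99.9
```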
The nines of availability
99% (“two nines”): ~3.65 days of downtime per year
99.9% (“three nines”): ~8.76 hours per year
99.99% (“four nines”): ~52.6 minutes per year
99.999% (“five nines”): ~5.26 minutes per year
Non-func req: What is reliability?
The probability that the service will perform its functions without failure for a specified time. Reliability measures how the service performs under varying operating conditions.
Metrics to measure reliability
Common reliability metrics include MTBF (mean time between failures) and MTTR (mean time to repair).
What drives the measurement of availability?
time loss
What drives the measurement of reliability?
frequency and impact of failures
Scalability
Ability to handle an increase in amount of workload without compromising performance. A search engine, for example, must accommodate increasing numbers of users, as well as the amount of data it indexes.
What are the two types of workload?
Request workload: This is the number of requests served by the system.
Data/storage workload: This is the amount of data stored by the system.
Dimensions of scalability
Size scalability: A system is scalable in size if we can simply add additional users and resources to it.
Administrative scalability: This is the capacity for a growing number of organizations or users to share a single distributed system with ease.
Geographical scalability: This relates to how easily the program can cater to other regions while maintaining acceptable performance constraints. In other words, the system can readily service a broad geographical region, as well as a smaller one.
Vertical Scaling
Vertical scaling, also known as “scaling up,” refers to scaling by providing additional capabilities (for example, additional CPUs or RAM) to an existing device. Vertical scaling allows us to expand our present hardware or software capacity, but we can only grow it to the limitations of our server. The dollar cost of vertical scaling is usually high because we might need exotic components to scale up.
Horizontal Scaling
Horizontal scaling, also known as “scaling out,” refers to increasing the number of machines in the network. We use commodity nodes for this purpose because of their attractive dollar-cost benefits. The catch here is that we need to build a system such that many nodes could collectively work as if we had a single, huge server.
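Making many commodity nodes act like one big server usually starts with sharding: hashing each key to one of N nodes so added machines share the load. A minimal sketch (the node layout and user data are invented for the example; real systems typically use consistent hashing so that adding a node doesn’t remap every key):

```python
import hashlib

def shard_for(key: str, num_nodes: int) -> int:
    """Deterministically map a key to one of num_nodes shards."""
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest, 16) % num_nodes

nodes = [dict() for _ in range(3)]  # three commodity nodes
for user in ["alice", "bob", "carol", "dave"]:
    nodes[shard_for(user, len(nodes))][user] = {"profile": "..."}

# Every key always lands on the same node, so reads know where to look.
print(shard_for("alice", 3) == shard_for("alice", 3))  # True
```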
What is maintainability?
Maintainability refers to the ease with which a system can be modified, extended, and debugged throughout its lifecycle.
What is concept of operability in maintainability?
This is the ease with which we can keep the system running smoothly under normal circumstances and restore it to normal conditions after a fault.