Sys Design and Distributed Systems Flashcards
What is availability?
The likelihood of your system being operational and accessible to users when needed. It’s a measure of uptime, often expressed as a percentage. Cloud providers such as Google Cloud offer high availability through features like regions and zones, with Service Level Agreements (SLAs) guaranteeing a specific uptime target for various services.
What is latency?
It refers to the time it takes for data to travel between two points in your system. Think of it like how long it takes for a knight’s message to get from your castle (user) to the king’s advisors (server) and back. Lower latency means messages get delivered faster, resulting in a snappier user experience.
Google Cloud minimizes latency with features like its global network infrastructure and regional deployments. You can also use tools like Cloud Monitoring to track and optimize latency within your applications.
What is RPC?
Imagine you, the programmer, are a knight calling upon a powerful API (Application Programming Interface) in another castle (server). RPC, or Remote Procedure Call, is like your trusty squire.
The squire races to the castle (server) with your request (function call), waits for the API (server) to complete the task, and then sprints back with the results, all without you having to leave your comfy coding zone.
Note: The RPC runtime is responsible for transmitting messages between client and server via the network. The responsibilities of RPC runtime also include retransmission, acknowledgment, and encryption
RPC Summary
The RPC method is similar to calling a local procedure, except that the called procedure is usually executed in a different process and on a different computer.
RPC allows developers to build applications on top of distributed systems. Developers can use the RPC method without knowing the network communication details. As a result, they can concentrate on the design aspects, rather than the machine and communication-level specifics.
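The remote-call-looks-local idea can be sketched with Python’s standard-library xmlrpc modules, used here purely as an illustration of the RPC pattern (the server address and the `add` procedure are made up for this example):

```python
import threading
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

# Server side: register a procedure that clients may call remotely.
# Port 0 asks the OS for any free port.
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
port = server.server_address[1]
server.register_function(lambda a, b: a + b, "add")
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: the proxy makes the remote call look like a local one.
# The RPC runtime handles serialization and transport over the network.
proxy = ServerProxy(f"http://127.0.0.1:{port}")
result = proxy.add(2, 3)
print(result)  # 5
```

Note that `proxy.add(2, 3)` reads exactly like a local function call, which is the whole point of RPC.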
Local Procedure Call (LPC)
Local Procedure Call (LPC): This refers to a mechanism for communication between different parts of a program running on the same computer. It allows them to exchange data and synchronize their actions. Think of it as two colleagues working on the same project within the same office, easily passing information back and forth.
RPC + LPC For an app
1. Remote Procedure Call (RPC):
Local: The mobile app (client).
Remote: The machine learning model server.
The process:
User takes a picture.
The mobile app (client) sends an RPC containing the image data to the machine learning model server (remote). This RPC acts as a messenger, carrying the image data across the network to the server.
The server receives the RPC, processes the image data using the machine learning model, and identifies the objects.
The server sends a response back to the mobile app through the same RPC channel, containing the identified objects.
The mobile app receives the response and displays the identified objects to the user.
2. Local Procedure Call (LPC):
Local: The mobile app itself.
An LPC might come into play within the mobile app’s image processing pipeline before the RPC is sent:
The app receives the image from the camera.
An LPC might be used to call a local image pre-processing function within the app. This function could resize the image, convert it to the format expected by the server, or perform other necessary transformations.
The pre-processed image data is then packaged and sent through the RPC to the server for object identification.
In essence, RPC facilitates communication between the app (client) and the separate server hosting the machine learning model, while LPC enables communication between different parts of the app itself running on the same device.
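The pipeline above can be sketched in a few lines. Everything here is hypothetical stand-in code: `preprocess` represents the in-process LPC step, and `classify_remote` stubs out the network call to the model server:

```python
def preprocess(image: bytes) -> bytes:
    """LPC: an in-process function call inside the mobile app.
    Stand-in for resizing / format conversion."""
    return image.strip()

def classify_remote(image: bytes) -> list:
    """Stub for the RPC to the model server; a real client would
    serialize the image and send it over the network."""
    return ["cat"] if b"cat" in image else ["unknown"]

def handle_photo(raw: bytes) -> list:
    prepared = preprocess(raw)        # LPC: same process, same device
    return classify_remote(prepared)  # RPC: crosses the network in reality

print(handle_photo(b"  cat  "))  # ['cat']
```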
What is ACID consistency?
Imagine you’re playing a game with your friends where you all take turns adding stickers to a picture (database). ACID helps make sure the picture doesn’t get messed up:
Atomicity: It’s like adding all your stickers at once (transaction). You either finish adding them all or none at all, so the picture doesn’t end up half-decorated and confusing.
Consistency: It’s like having rules about the picture (data stays valid). Maybe you can only use specific colors or shapes, so the picture always looks good and makes sense.
Isolation: Even if your friends (other transactions) try to add stickers at the same time, ACID makes sure each person only sees the picture one way at a time (transaction isolation). This avoids any sticker fights!
Durability: Once you stick on your stickers (update the data), they stay stuck forever (data persistence). Even if you accidentally knock over the picture (system crash), the stickers stay put!
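Atomicity is easy to see in code. This sketch uses Python’s built-in sqlite3 module (the `scores` table is invented for the example): a transaction that fails halfway leaves no trace.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (player TEXT PRIMARY KEY, points INTEGER)")
conn.execute("INSERT INTO scores VALUES ('alice', 10)")
conn.commit()

# Atomicity: both statements succeed together, or neither does.
try:
    with conn:  # opens a transaction; rolls back on exception
        conn.execute("UPDATE scores SET points = points - 5 WHERE player='alice'")
        conn.execute("INSERT INTO scores VALUES ('alice', 99)")  # violates the PK
except sqlite3.IntegrityError:
    pass  # the whole transaction was rolled back

points = conn.execute(
    "SELECT points FROM scores WHERE player='alice'").fetchone()[0]
print(points)  # 10 — the failed transaction left no half-applied update
```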
Explain the difference between ACID consistency and CAP consistency
ACID and CAP deal with consistency in different ways:
ACID (Atomicity, Consistency, Isolation, Durability):
Focuses on data integrity within a single database.
Ensures reliable updates, like your favorite game keeping your high score safe.
Imagine it as a strict teacher in the classroom (database) making sure everyone follows the rules (data stays valid) when updating the board (data).
CAP (Consistency, Availability, Partition Tolerance):
Deals with distributed systems where data is spread across multiple locations.
Focuses on trade-offs between keeping data consistent everywhere (Consistency), being always available (Availability), and tolerating network problems (Partition Tolerance).
Imagine a game with multiple scoreboards (data) in different schools (servers). When the network between schools fails (a partition), you must choose:
Refuse updates until every scoreboard can be kept in sync (sacrificing Availability).
Keep accepting updates at every school, letting scoreboards temporarily disagree (sacrificing Consistency).
Because real networks inevitably partition, Partition Tolerance isn’t really optional: the practical choice is between Consistency and Availability while a partition lasts.
Eventual Consistency
Eventual consistency is like waiting for the mail to deliver gossip in a big town. Updates are sent out (replicated) but might take a while to reach everyone (all servers). Eventually, everyone will have the latest news (consistent data), but there might be a short delay.
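A toy sketch of the gossip-by-mail idea: writes land on a primary immediately and reach the replica only when the (simulated) asynchronous replication runs. All names here are invented for the illustration.

```python
from collections import deque

primary = {}
replica = {}
replication_log = deque()  # updates shipped to the replica later, not immediately

def write(key, value):
    primary[key] = value
    replication_log.append((key, value))

def replicate_once():
    """Apply one pending update; a real system does this asynchronously."""
    if replication_log:
        k, v = replication_log.popleft()
        replica[k] = v

write("score", 42)
print(replica.get("score"))  # None — the replica hasn't caught up yet
replicate_once()
print(replica.get("score"))  # 42 — eventually consistent
```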
Which consistency model offers the highest availability?
Eventual consistency
What is the weakest consistency model?
Eventual consistency
SQL v NoSQL
Imagine your data is a collection of items in a classroom. SQL databases are like filing cabinets with neat rows and columns, perfect for things that fit in folders, like names and grades (structured data). NoSQL databases are like big boxes where you can store all sorts of things, like drawings, projects, and maybe even a toy robot (unstructured data)! They’re more flexible for messy data that doesn’t fit neatly in rows and columns.
SQL v NoSQL as explained by solution architect
Structure:
SQL: Enforces a predefined schema with rigid table structures and data types. Think of it as a strictly organized library with specific sections for books, DVDs, and audiobooks.
NoSQL: Offers flexible schema with various data models like documents, key-value pairs, or graphs. Imagine a modern library with designated areas for different media, but items within each section can be diverse.
Scalability:
SQL: Primarily scales vertically by adding more processing power to a single server. It can become expensive for massive datasets. Think of adding more shelves to a single, overflowing bookcase.
NoSQL: Scales horizontally by adding more servers to distribute the data load. Ideal for handling constantly growing datasets. Imagine adding more bookcases to a library as the collection expands.
Use Cases:
SQL: Excellent for structured data with complex queries and transactional consistency (think banking or e-commerce). It’s the go-to for relational data with established schemas.
NoSQL: Perfect for unstructured or semi-structured data with high availability and performance needs (think social media or IoT sensor data). Ideal for large, evolving datasets where flexibility is crucial.
Choosing the Right Tool:
Consider data structure, scalability requirements, and query patterns. If data is relational and requires complex joins, SQL might be ideal. For vast, evolving data with high availability needs, NoSQL could be a better fit.
Ultimately, the best choice depends on the specific needs of your application and data.
Strong Consistency:
This is the gold standard, guaranteeing that all reads always reflect the latest write across all replicas of the data. Imagine a single source of truth, like a master document everyone can access simultaneously. It offers the highest data integrity but can impact performance and scalability.
Read Your Writes Consistency:
This model ensures that a client can always read its own successful writes immediately. Think of it like writing a note and then immediately being able to read it back yourself. However, other clients might not see the update yet. This model offers a balance between availability and consistency and is suitable for scenarios where immediate access to self-generated data is important.
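One common way to get read-your-writes is to route a client’s reads for keys it has written back to the primary, while other clients read from a lagging replica. This is a hypothetical sketch of that routing idea, not any particular database’s API:

```python
primary, replica = {}, {}
session_writes = {}  # keys this client wrote during its session

def write(session, key, value):
    primary[key] = value
    session[key] = value  # remember what this client wrote

def read(session, key):
    if key in session:         # read-your-writes: serve from the primary
        return primary[key]
    return replica.get(key)    # other keys may come from a stale replica

write(session_writes, "bio", "hello")
print(read(session_writes, "bio"))  # 'hello' — this client sees its own write
print(replica.get("bio"))           # None — other clients may not, yet
```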
How do I choose my consistency model?
Data Integrity: How critical is it for all data to be immediately consistent across all replicas?
Availability: Can the system tolerate any downtime or lag in data updates?
Performance: How important are fast read and write operations?
Scalability: Will your data volume grow significantly over time?
Architecting with Consistency Models:
Strong consistency might be ideal for financial transactions or critical real-time systems requiring absolute data accuracy.
Eventual consistency is well-suited for social media platforms or e-commerce sites where immediate data updates are less crucial than high availability.
What are some types of failures?
Single Point of Failure (SPOF): This occurs when a single component’s failure cripples the entire system. Imagine a bridge with only one lane – if that lane collapses, the entire bridge is unusable. Solutions include redundancy, like building additional lanes or finding alternative routes.
Cascading Failure: This occurs when the failure of one component triggers failures in other dependent components, creating a domino effect. Think of a power outage that shuts down critical servers, leading to data loss and service disruptions throughout the system. Mitigation strategies involve isolating components, designing graceful degradation, and implementing fault tolerance mechanisms.
Resource Exhaustion: This occurs when a system runs out of critical resources like CPU, memory, or storage, causing performance degradation or complete system crashes. Imagine a car running out of gas – it simply stops functioning. Solutions involve resource monitoring, auto-scaling capabilities, and capacity planning.
Byzantine Failures: This complex model describes situations where failing components can exhibit unpredictable behavior, sending misleading or inconsistent information. Imagine a group of unreliable witnesses to an event, each providing conflicting accounts. Byzantine fault tolerance is a challenging area of distributed systems design.
How do you mitigate failures?
Redundancy: Introduce backups, failover mechanisms, or load balancing to avoid SPOFs.
Isolation: Design your system with loosely coupled components to limit the impact of cascading failures.
Monitoring and resource management: Proactively monitor resource usage and implement scaling mechanisms to prevent exhaustion.
Error handling and recovery: Build robust error handling and recovery routines to gracefully handle failures and minimize downtime.
Test and validate: Regularly test your system under simulated failure conditions to verify the effectiveness of your mitigation strategies.
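The redundancy/failover idea can be sketched as a client that tries replicas in order until one responds, so no single replica is a single point of failure. The replica functions here are made up for the example:

```python
def call_with_failover(replicas, request):
    """Try each replica in turn; raise only if all of them fail."""
    last_error = None
    for replica in replicas:
        try:
            return replica(request)
        except ConnectionError as e:
            last_error = e  # record the failure and try the next replica
    raise last_error

def down(_):
    raise ConnectionError("replica down")

def healthy(req):
    return f"handled {req}"

print(call_with_failover([down, healthy], "ping"))  # 'handled ping'
```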
Availability: What is availability?
Availability is the percentage of time that a service or infrastructure is accessible to clients and operating under normal conditions. For example, if a service has 100% availability, it functions and responds as intended (operates normally) all the time.
Non-functional Sys Char: How do we measure availability?
Availability (%) = ((total time − downtime) / total time) × 100
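The same formula in code, with a worked example (8,760 hours in a 365-day year):

```python
def availability_pct(total_time: float, downtime: float) -> float:
    """Availability (%) = ((total time - downtime) / total time) * 100."""
    return (total_time - downtime) / total_time * 100

# 8.76 hours of downtime in an 8,760-hour year is "three nines".
print(round(availability_pct(8760, 8.76), 3))  # 99.9
```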
The nines of availability
99% (“two nines”): ~3.65 days of downtime per year
99.9% (“three nines”): ~8.76 hours per year
99.99% (“four nines”): ~52.6 minutes per year
99.999% (“five nines”): ~5.26 minutes per year
Non-func req: What is reliability?
The probability that the service will perform its functions without failure for a specified time. Reliability measures how the service performs under varying operating conditions.
Metrics to measure reliability
Common reliability metrics include MTBF (mean time between failures) and MTTR (mean time to repair).
What drives the measurement of availability?
time loss
What drives the measurement of reliability?
frequency and impact of failures
Scalability
Ability to handle an increase in amount of workload without compromising performance. A search engine, for example, must accommodate increasing numbers of users, as well as the amount of data it indexes.
What are the two types of workload?
Request workload: This is the number of requests served by the system.
Data/storage workload: This is the amount of data stored by the system.
Dimensions of scalability
Size scalability: A system is scalable in size if we can simply add additional users and resources to it.
Administrative scalability: This is the capacity for a growing number of organizations or users to share a single distributed system with ease.
Geographical scalability: This relates to how easily the program can cater to other regions while maintaining acceptable performance constraints. In other words, the system can readily service a broad geographical region, as well as a smaller one.
Vertical Scaling
Vertical scaling, also known as “scaling up,” refers to scaling by providing additional capabilities (for example, additional CPUs or RAM) to an existing device. Vertical scaling allows us to expand our present hardware or software capacity, but we can only grow it to the limitations of our server. The dollar cost of vertical scaling is usually high because we might need exotic components to scale up.
Horizontal Scaling
Horizontal scaling, also known as “scaling out,” refers to increasing the number of machines in the network. We use commodity nodes for this purpose because of their attractive dollar-cost benefits. The catch here is that we need to build a system such that many nodes could collectively work as if we had a single, huge server.
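Making many commodity nodes act like one big server usually starts with sharding: hashing each key to one of N nodes so added machines share the load. A minimal sketch (the node layout and user data are invented for the example; real systems typically use consistent hashing so that adding a node doesn’t remap every key):

```python
import hashlib

def shard_for(key: str, num_nodes: int) -> int:
    """Deterministically map a key to one of num_nodes shards."""
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest, 16) % num_nodes

nodes = [dict() for _ in range(3)]  # three commodity nodes
for user in ["alice", "bob", "carol", "dave"]:
    nodes[shard_for(user, len(nodes))][user] = {"profile": "..."}

# Every key always lands on the same node, so reads know where to look.
print(shard_for("alice", 3) == shard_for("alice", 3))  # True
```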
What is maintainability?
Maintainability refers to the ease with which a system can be modified, extended, and debugged throughout its lifecycle.
What is concept of operability in maintainability?
This is the ease with which we can keep the system running smoothly under normal circumstances and restore it to normal conditions after a fault.