General Flashcards

Question

CAP Theorem

Answer 1

- An abstract framework for understanding the trade-offs between three essential properties of distributed systems: consistency, availability, and partition tolerance
- States that a distributed system cannot simultaneously provide more than two of these three guarantees

Consistency:
- All clients see the same data
- Choosing it sacrifices availability during a partition
- Ramification: an error is returned to the client

Availability:
- Any client that requests data gets a response, even if some of the nodes are down
- Choosing it sacrifices consistency during a partition
- Ramification: reads continue to be allowed, but stale data may be returned

Partition tolerance:
- A partition is a communication break between two nodes
- Partition tolerance means the system continues to operate despite network partitions
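The two ramifications above can be sketched with a toy replica. This is only an illustration: the `Replica` class, the `"CP"`/`"AP"` mode labels, and the `partitioned` flag are hypothetical names, not part of any real database API.

```python
class Replica:
    """Toy replica that may be cut off from the primary by a partition."""

    def __init__(self, mode: str):
        self.mode = mode          # "CP" or "AP" (hypothetical labels)
        self.value = "v1"         # last value replicated from the primary
        self.partitioned = False  # True when the primary is unreachable

    def read(self) -> str:
        if self.partitioned and self.mode == "CP":
            # Consistency chosen: refuse to answer rather than risk stale data.
            raise RuntimeError("unavailable: cannot confirm latest value")
        # Availability chosen (or no partition): answer with the local copy.
        return self.value


cp, ap = Replica("CP"), Replica("AP")
cp.partitioned = ap.partitioned = True  # assume the primary has moved on

print(ap.read())  # still answers, but the value may be stale
try:
    cp.read()
except RuntimeError as err:
    print(err)    # the client gets an error instead of stale data
```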

Answer 2

- Open source remote procedure call (RPC) framework
- Built on top of, and dependent on, HTTP/2
- Lets you write software as if you were making the function call locally
- Uses an IDL (Interface Definition Language) to create a contract on the data types and methods to be invoked
- gRPC uses Protocol Buffers (protobuf) for serialization and deserialization. Data is encoded in a binary format, which has a smaller footprint and is therefore faster to transmit
- Streaming: allows several messages to be exchanged within a single request. This is possible through HTTP/2's multiplexing feature. Supports server-side, client-side, and bidirectional streaming
- Code generation: client and server code is generated from the .proto file, which defines the data formats/types and application endpoints
- gRPC relies heavily on HTTP/2, so you cannot call a gRPC service directly from a web browser like you would with REST

gRPC internals and communication between 2 microservices
- The service that provides the data is the gRPC server
- The other service, which requires data from the provider, is the gRPC consumer
- A gRPC server can also be a consumer when it needs data from another microservice

1) Service definition, e.g. a .proto file. Defines the contract between client and server
2) gRPC server stubs. The service also runs a gRPC server to handle client calls
3) gRPC client stubs. Convert local code into remote function invocations

gRPC use cases
- Designed for low-latency, high-throughput communication. Works very well with microservices
- Point-to-point real-time communication. Supports bidirectional streaming. gRPC services can push messages in real time without polling
- Messages are serialized with Protobuf, a lightweight message format
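To get a feel for why a binary encoding has a smaller footprint than text, here is a rough comparison. It does not use protobuf or its wire format; Python's standard `struct` module stands in as an assumed fixed binary layout, and the `record` fields are made up for the example.

```python
import json
import struct

# Hypothetical message with three fields.
record = {"user_id": 12345, "score": 98.5, "active": True}

# Text encoding: field names and punctuation travel with every message.
as_json = json.dumps(record).encode("utf-8")

# Binary encoding: both sides agree on the layout up front (like a schema),
# so only the values are sent: uint32 + float64 + bool, little-endian.
as_binary = struct.pack("<Id?", record["user_id"], record["score"], record["active"])

print(len(as_json), len(as_binary))

# The receiver decodes using the same agreed layout.
user_id, score, active = struct.unpack("<Id?", as_binary)
```

The agreed layout plays the role the .proto contract plays in gRPC: without it the receiver cannot interpret the bytes.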

Answer 3

Different types of caching:
- In-memory caching. Faster than fetching from disk
- Disk caching. Faster than retrieving from a remote source
- CDN caching, for faster retrieval of static content
- Client caching (e.g. browser)
- DNS caching, for faster domain resolution
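A minimal sketch of the in-memory case, using `functools.lru_cache` from the Python standard library. The `load_profile` function and the `calls` counter are hypothetical; the function stands in for any slow disk or network fetch.

```python
from functools import lru_cache

calls = 0  # counts how often the slow path actually runs


@lru_cache(maxsize=128)
def load_profile(user_id: int) -> str:
    """Hypothetical slow lookup standing in for a disk or network fetch."""
    global calls
    calls += 1
    return f"profile-for-user-{user_id}"


load_profile(1)  # miss: runs the function and stores the result
load_profile(1)  # hit: served from memory, the function body is skipped
load_profile(2)  # miss: different key

print(calls)                      # number of times the slow path ran
print(load_profile.cache_info())  # hit/miss statistics kept by lru_cache
```

`maxsize` bounds memory use: once 128 distinct keys are cached, the least recently used entry is evicted.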

Answer 4

- Technique in distributed systems and DBs to divide a large dataset into smaller parts referred to as partitions
- Each partition is assigned to a separate node
- Improves the performance and scalability of large-scale data processing applications, as it allows processing to be distributed across multiple nodes
- The workload can also be balanced, so the system can handle more requests and process data efficiently

Partition: A smaller, more manageable part of a larger dataset, created as a result of data partitioning

Partition key: A data attribute used to determine how data is distributed across partitions. An effective partition key should provide an even distribution of data and support efficient query patterns

Shard: Often used interchangeably with partition, particularly in the context of horizontal partitioning

Partitioning Methods
-----
1) Horizontal Partitioning (also known as sharding)
- Each shard contains a subset of the rows
- Each shard is typically assigned to a different server, which allows for parallel processing and faster query execution times

2) Vertical Partitioning
- Each partition contains a subset of the columns
- Optimizes performance by reducing the amount of data that needs to be scanned

Data Sharding Techniques
------
1) Range based sharding
- Data is divided based on a specific range of values for a given partition key
- Example: order dates, IDs

2) Hash based sharding
- Applies a consistent hash function to the partition key
- Particularly useful when the key has a large number of distinct values or is not easily divided into ranges
- Example: shard based on User IDs

3) Directory based sharding
- Uses a custom lookup table to map each data entry to a specific shard
- Greater flexibility, but introduces a layer of complexity as the directory must be maintained

4) Geographical sharding
- Shard by US State, Country, Zip

Benefits of Data Partitioning
- Improved query performance
- Enhanced scalability
- Load balancing. Helps distribute the workload evenly
- Data isolation
- Parallel processing
- Storage efficiency
- Faster data recovery

Problems of Data Partitioning
- Complexity
- Data skew. Uneven data distribution across partitions
- Cross partition queries. When queries need to access data across multiple partitions, performance can suffer as the system must search and aggregate data from several partitions
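The range, hash, and directory techniques can be sketched as three small routing functions. Everything here is an assumption for illustration: the shard count, the range boundaries, and the tenant names are invented, and the hash version uses a plain modulo rather than a consistent-hashing ring, so changing `NUM_SHARDS` would remap most keys.

```python
import bisect
import hashlib

NUM_SHARDS = 4  # hypothetical cluster size


def hash_shard(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Hash based: a stable hash of the partition key, modulo the shard count."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Range based: exclusive upper bounds of each shard's ID range.
# shard 0: < 1000, shard 1: 1000-1999, shard 2: 2000-2999, shard 3: >= 3000
RANGE_BOUNDS = [1000, 2000, 3000]


def range_shard(order_id: int) -> int:
    """Range based: find which range the key falls into."""
    return bisect.bisect_right(RANGE_BOUNDS, order_id)


# Directory based: an explicit lookup table that must be maintained.
DIRECTORY = {"tenant-a": 0, "tenant-b": 2}


def directory_shard(tenant: str) -> int:
    """Directory based: look the key up; unknown keys raise KeyError."""
    return DIRECTORY[tenant]


print(hash_shard("user-42"))
print(range_shard(999), range_shard(1000), range_shard(3500))
print(directory_shard("tenant-b"))
```

`hashlib` is used instead of the built-in `hash()` because Python randomizes string hashes per process, which would send the same key to different shards after a restart.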

Answer 5

**Common Characteristics of NoSQL DBs**
* Do not use the relational model
* Run well on clusters and are sharded. Horizontally scalable
* Suited to storing massive amounts of data
* Most NoSQL stores don't support joins
* Schema-less (dynamic) design, which allows for greater flexibility of data
* Perform well under specific workloads, such as high write loads or large scale data storage and retrieval

**Types of NoSQL Storage Models**
* KV. Excel at high write and read throughput for simple data models like session management and real-time analytics. Examples: DynamoDB, Riak
* In-memory KV. Excel at low latency. Examples: Redis, Memcached
* Document. Data is stored as documents, typically in JSON or a JSON-like format. Each document can contain nested fields and complex data structures. Examples: Elasticsearch, MongoDB, CouchDB
* Graph. Stores data as nodes and the edges (relationships) between them
* Columnar. Primary use case is large scale analytics. Examples: Cassandra (wide-column), Vertica, Redshift
* Time series. Store data in time-ordered streams, sorted by timestamps. Examples: Graphite, Prometheus, AWS Timestream

**Common Characteristics of SQL**
* Relational DBs are ACID compliant, which provides strong safety guarantees, reliable transactions, and consistency of the data (these properties guarantee that any operation on the data will either be completed in its entirety or not at all)
* Most NoSQL stores sacrifice ACID compliance for availability, performance, and scalability

Answer 6

- Searches a tree data structure one level of depth at a time
- Means we explore all nodes at the current depth before exploring any of their children
- Uses a queue
- A common application of BFS is path finding
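A minimal BFS sketch over an adjacency-list dictionary. The `tree` data and the `bfs` function name are made up for the example.

```python
from collections import deque


def bfs(graph: dict, start: str) -> list:
    """Return nodes in the order BFS visits them, level by level."""
    order = []
    seen = {start}          # guards against revisiting nodes in a general graph
    queue = deque([start])  # FIFO queue: oldest discovered node is handled first
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return order


# Hypothetical tree: A is the root, B and C its children, and so on.
tree = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F"], "D": [], "E": [], "F": []}
print(bfs(tree, "A"))  # ['A', 'B', 'C', 'D', 'E', 'F']
```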

Answer 7

- Traverses as far as possible along each branch before backtracking, exploring until we reach a node without edges or a node that we've previously visited
- DFS uses a stack rather than a queue to track locations to search next
- A common application is topological sorting
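A minimal iterative DFS sketch with an explicit stack, using an adjacency-list dictionary. The `tree` data and the `dfs` function name are made up for the example; a recursive version would use the call stack instead.

```python
def dfs(graph: dict, start: str) -> list:
    """Return nodes in the order DFS visits them, going deep before wide."""
    order = []
    seen = set()
    stack = [start]  # LIFO stack: most recently discovered node is handled first
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # previously visited: backtrack
        seen.add(node)
        order.append(node)
        # Push in reverse so the first-listed neighbor is popped first.
        for neighbor in reversed(graph.get(node, [])):
            if neighbor not in seen:
                stack.append(neighbor)
    return order


# Hypothetical tree: A is the root, B and C its children, and so on.
tree = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F"], "D": [], "E": [], "F": []}
print(dfs(tree, "A"))  # ['A', 'B', 'D', 'E', 'C', 'F']
```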

Answer 8

- In a distributed environment, a quorum is the minimum number of servers on which a distributed operation must succeed before the overall operation is declared successful
- It enforces the consistency requirement needed for distributed operations
- What value should we choose for a quorum? A majority of the nodes, i.e. more than half (floor(N/2) + 1 for N nodes)
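The majority rule works out as follows. The function names are hypothetical, and this only covers the simple majority case, not tunable read/write quorums.

```python
def majority_quorum(num_nodes: int) -> int:
    """Smallest number of nodes that is strictly more than half the cluster."""
    return num_nodes // 2 + 1


def is_successful(acks: int, num_nodes: int) -> bool:
    """An operation succeeds once a majority of nodes have acknowledged it."""
    return acks >= majority_quorum(num_nodes)


for n in (3, 4, 5):
    q = majority_quorum(n)
    print(f"{n} nodes: quorum = {q}, tolerates {n - q} failed node(s)")
```

Note that 4 nodes need the same quorum as 5 (three acknowledgements) while tolerating one fewer failure, which is why odd cluster sizes are common.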

Answer 9

- An architecture that structures an application as a collection of loosely coupled services
- Each service is independently deployable

Pros
- Each microservice can be independently scaled
- Flexible. Each microservice can be developed, deployed, and updated independently, leading to faster iteration
- Resilient. Failure in one microservice doesn't necessarily impact the broader system
- Technology diversity
- Autonomy

Cons
- Complexity. Numerous systems, cognitive overload, monitoring, deploys, logging
- Latency. Communication between microservices over a network can introduce latency
- Data management and schema migrations
- Deployment overhead
- Operational overhead