Chapter 4, Data Management Patterns GPT Flashcards

1
Q

What are the unique characteristics of cloud native data compared to traditional data processing practices?

A

Cloud native data can be stored in many forms, in a variety of data formats and data stores, does not maintain a fixed schema, and is encouraged to have duplicate data to facilitate availability and performance over consistency. Multiple services are encouraged to call respective service APIs that own the data store rather than accessing the same database directly. This provides separation of concerns and allows cloud native data to scale out.
Page 162

Just as cloud native microservices have characteristics such as being scalable, resilient, and manageable, cloud native data has its own unique characteristics that are quite different from traditional data processing practices. Most important, cloud native data can be stored in many forms, in a variety of data formats and data stores. They are not expected to maintain a fixed schema and are encouraged to have duplicate data to facilitate availability and performance over consistency. Furthermore, in cloud native applications, multiple services are not encouraged to access the same database; instead, they should call respective service APIs that own the data store to access the data. All these provide separation of concerns and allow cloud native data to scale out.

2
Q

What are stateless applications, and why are they simpler to implement and scale compared to stateful applications?

A

Stateless applications depend only on input and configuration data, making their failure or restart have almost no impact on execution. In contrast, stateful applications depend on input, config, and state data, which makes them more complex to implement and scale, as application failures can corrupt their state leading to incorrect execution.
Page 162

Applications that depend only on input and configuration (config) data are called stateless applications. These applications are relatively simple to implement and scale because their failure or restart has almost no impact on their execution. In contrast, applications that depend on input, config, and state data—stateful applications—are much more complex to implement and scale. The state of the application is stored in data stores, so application failures can result in partial writes that corrupt their state, which can lead to incorrect execution of the application.

3
Q

What are relational databases best suited for, and what principle do they follow for schema definition?

A

Relational databases are ideal for storing structured data that has a predefined schema and use Structured Query Language (SQL) for processing, storing, and accessing data. They follow the principle of defining schema on write, meaning the data schema is defined before writing the data to the database.
Page 165

Relational databases are ideal for storing structured data that has a predefined schema. These databases use Structured Query Language (SQL) for processing, storing, and accessing data. They also follow the principle of defining schema on write: the data schema is defined before writing the data to the database.
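Schema on write can be sketched with Python's built-in `sqlite3` module: the table structure must exist before any row is inserted, and every insert must conform to it. The table and column names here are illustrative only.

```python
import sqlite3

# Schema on write: the table structure is defined before any data is written.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer TEXT NOT NULL,
        total    REAL NOT NULL
    )
""")

# Inserts must match the predefined schema; a row violating it is rejected.
conn.execute("INSERT INTO orders (customer, total) VALUES (?, ?)", ("alice", 42.50))
conn.commit()

row = conn.execute("SELECT customer, total FROM orders").fetchone()
print(row)  # ('alice', 42.5)
```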

4
Q

What are the advantages of using relational databases for cloud native application data?

A

Relational databases can optimally store and retrieve data using database indexing and normalization, provide transaction guarantees through ACID properties, and help deploy and scale the data along with microservices as a single deployment unit.
Page 165-166

Relational databases can optimally store and retrieve data by using database indexing and normalization. Because these databases support atomicity, consistency, isolation, and durability (ACID) properties, they can also provide transaction guarantees.

Relational databases are a good option for storing cloud native application data. We recommend using a relational database per microservice, as this will help deploy and scale the data along with the microservice as a single deployment unit.

5
Q

What is the principle of schema on read, and which type of databases follow this principle?

A

The principle of schema on read means that the schema of the data is defined only at the time of accessing the data for processing, not when it is written to the disk. NoSQL databases follow this principle.
Page 166

NoSQL databases follow the principle of schema on read: the schema of the data is defined only at the time of accessing the data for processing, and not when it is written to the disk.
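A minimal sketch of schema on read, using a plain Python list of JSON documents as a stand-in for a NoSQL store: nothing is enforced at write time, and each record may have a different shape; the "schema" is whatever fields the reader chooses to interpret.

```python
import json

# Records are stored as opaque documents with no enforced structure.
store = []
store.append(json.dumps({"name": "alice", "email": "a@example.com"}))
store.append(json.dumps({"name": "bob", "age": 30, "tags": ["vip"]}))  # different shape

# The schema is applied only at read time: we decide here which fields matter.
names = [json.loads(doc).get("name") for doc in store]
print(names)  # ['alice', 'bob']
```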

6
Q

Why are NoSQL databases suitable for handling big data, and what is a general recommendation regarding their use for transaction guarantees?

A

NoSQL databases are designed for scalability and performance, making them suitable for handling big data. However, it is generally not recommended to store data that needs transaction guarantees in NoSQL stores.
Page 166-167

NoSQL databases are best suited to handling big data, as they are designed for scalability and performance.

However, it is generally not recommended to store data that needs transaction guarantees in NoSQL stores.

7
Q

How does a column store database manage data, and what are some common examples?

A

A column store database stores multiple key (column) and value pairs in each of its rows, allowing for writing any number of columns during the write phase and specifying only the columns of interest during data retrieval. Examples include Apache Cassandra and Apache HBase.
Page 167

A column store stores multiple key (column) and value pairs in each of its rows, as shown in Figure 4-2. These stores are a good example of schema on read: we can write any number of columns during the write phase, and when data is retrieved, we can specify only the columns we are interested in processing. The most widely used column store is Apache Cassandra. For those who use big data and Apache Hadoop infrastructure, Apache HBase can be an option, as it is part of the Hadoop ecosystem.
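The wide-row idea can be sketched in a few lines of Python, with a dict of dicts standing in for a column store like Cassandra (the row keys and column names are made up for illustration): rows need not share the same columns, and a read names only the columns it wants.

```python
# Minimal in-memory sketch of a column (wide-row) store: each row key maps to an
# arbitrary set of column/value pairs, so rows need not have the same columns.
table = {
    "user:1": {"name": "alice", "email": "a@example.com"},
    "user:2": {"name": "bob", "city": "Oslo", "phone": "555-0100"},  # extra columns
}

def read(row_key, columns):
    # Schema on read: the caller specifies only the columns of interest.
    row = table.get(row_key, {})
    return {c: row[c] for c in columns if c in row}

print(read("user:2", ["name", "city"]))  # {'name': 'bob', 'city': 'Oslo'}
```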

8
Q

What type of data is stored in a document store, and which databases are popular for this purpose?

A

A document store can store semi-structured data such as JSON and XML documents, allowing processing with JSON and XML path expressions. Popular document stores include MongoDB, Apache CouchDB, and Couchbase.
Page 169

A document store can hold semi-structured data such as JSON and XML documents, and it allows us to process stored documents by using JSON and XML path expressions. These data stores are popular because they can store the JSON and XML messages usually used by frontend applications and APIs for communication. MongoDB, Apache CouchDB, and Couchbase are popular options for storing JSON documents.

9
Q

What is the CAP theorem, and how does it apply to NoSQL stores?

A

The CAP theorem states that a distributed application can provide either full availability or consistency, but not both, while ensuring partition tolerance. Availability means the system is fully functional when some nodes are down, consistency means an update/change in one node is immediately propagated to others, and partition tolerance means the system can work even when some nodes cannot connect to each other.
Page 169

NoSQL stores are distributed, so they need to adhere to the CAP theorem; CAP stands for consistency, availability, and partition tolerance. This theorem states that a distributed application can provide either full availability or consistency; we cannot achieve both while providing network partition tolerance. Here, availability means that the system is fully functional when some of its nodes are down, consistency means an update/change in one node is immediately propagated to other nodes, and partition tolerance means that the system can continue to work even when some nodes cannot connect to each other.

10
Q

Why is filesystem storage preferred for unstructured data in cloud native applications?

A

Filesystem storage is preferred for unstructured data because it optimizes data storage and retrieval without trying to understand the data. It can also be used to store large application data as a cache, which can be cheaper than retrieving data repeatedly over the network.
Page 171

Filesystem storage is the best for storing unstructured data in cloud native applications. Unlike NoSQL stores, it does not try to understand the data but rather purely optimizes data storage and retrieval. We can also use filesystem storage to store large application data as a cache, as it can be cheaper than retrieving data repeatedly over the network.

11
Q

When should cloud native applications use relational data stores, NoSQL stores, or filesystem storage?

A

Cloud native applications should use relational data stores when they need transactional guarantees and the data needs to be tightly coupled with the application. Semi-structured or unstructured fields should be separated into NoSQL or filesystem stores to achieve scalability while preserving transactional guarantees. NoSQL is also suitable when the data quantity is extremely large, when querying capability is needed, or for specialized use cases such as graph processing.
Page 172

Cloud native applications should use relational data stores when they need transactional guarantees and when data needs to be tightly coupled with the application.
When data contains semi-structured or unstructured fields, they can be separated and stored in NoSQL or filesystem stores to achieve scalability while still preserving transactional guarantees. Applications can choose NoSQL when the data quantity is extremely large, when querying capability is needed, when the data is semi-structured, or when the data store is specialized enough to handle a specific use case such as graph processing.

12
Q

What are the advantages and disadvantages of centralized data management in traditional data-centric applications?

A

Centralized data management allows data normalization for high consistency, enables running stored procedures across multiple tables for faster retrieval, and provides tight coupling between applications. However, it hinders the ability to evolve applications independently and is considered an antipattern for cloud native applications.
Page 172

Centralized data management is the most common type in traditional data-centric applications. In this approach, all data is stored in a single database, and multiple components of the application are allowed to access the data for processing (Figure 4-3).
This approach has several advantages; for instance, the data in these database tables can be normalized, providing high data consistency. Furthermore, as components can access all the tables, the centralized data storage provides the ability to run stored procedures across multiple tables and to retrieve results faster. On the other hand, this provides tight coupling between applications, and hinders the ability to evolve the applications independently. Therefore, it is considered an antipattern when building cloud native applications.

13
Q

How does decentralized data management benefit microservices, and what are its potential disadvantages?

A

Decentralized data management allows scaling microservices independently, improving development time and release cycles, and solving data management and ownership problems. However, it can increase the cost of running separate data stores for each service.
Page 174

In decentralized data management, each independent functional component is modeled as a microservice with its own data store, exclusive to it. This approach, illustrated in Figure 4-4, allows us to scale microservices independently without impacting other microservices.
Although application owners have less freedom to manage or evolve the data, segregating the data in each microservice so that it is managed by its own team not only solves data management and ownership problems, but also improves the development time of new feature implementations and release cycles.
Decentralized data management also allows services to choose the most appropriate data store for their use case. For example, a Payment service may use a relational database to perform transactions, an Inquiry service may use a document store to hold inquiry details, and a Shopping Cart service may use a distributed key-value store for the items picked by the customer.
One disadvantage of decentralized data management is the cost of running separate data stores for each service.

14
Q

What is hybrid data management, and how does it help with data protection and security enforcement?

A

Hybrid data management helps achieve compliance with modern data-protection laws and ease security enforcement by having customer data managed via a few microservices within a secured bounded context. It provides ownership of the data to one or a few well-trained teams to apply data-protection policies.
Page 175

Hybrid data management helps achieve compliance with modern data-protection laws and eases security enforcement, as the data resides in a central place. Therefore, it is advisable to have all customer data managed via a few microservices within a secured bounded context, and to give ownership of the data to one or a few well-trained teams that apply data-protection policies.

15
Q

What benefits does exposing data as a data service provide, and in what situations is the Data Service Pattern useful?

A

Exposing data as a data service allows control over data presentation, security, and priority-based throttling. The Data Service Pattern is useful when data does not belong to a specific microservice and multiple microservices depend on it, or for exposing legacy on-premises or proprietary data stores to cloud native applications.
Page 180, 182

Exposing data as a data service, shown in Figure 4-10, gives us more control over that data. It allows us to present data in various compositions to various clients, apply security, and enforce priority-based throttling, allowing only critical services to access data during resource-constrained situations such as load spikes or system failures.
These data services can perform simple read and write operations to a database or even perform complex logic such as joining multiple tables or running stored procedures to build responses much more efficiently. These data services can also utilize caching to enhance their read performance.

We can use the Data Service Pattern when the data does not belong to any particular microservice; no microservice is the rightful owner of that data, yet multiple microservices are depending on it for their operation. In such cases, the common data should be exposed as an independent data service, allowing all dependent applications to access the data via APIs.
We can also use the Data Service Pattern to expose legacy on-premises or proprietary data stores to other cloud native applications.
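A minimal sketch of the Data Service idea: dependent services reach shared reference data only through this API, never the underlying store directly. The class, method names, and discount-code domain are all illustrative assumptions, not from the book.

```python
class DiscountDataService:
    """Hypothetical data service owning shared discount data.

    Consumers call these methods (in practice, HTTP/gRPC endpoints) instead of
    querying the database directly, which keeps them decoupled from the store.
    """

    def __init__(self):
        self._store = {"SUMMER10": 0.10}  # stands in for the real database

    def get_discount(self, code):
        # A real service could add caching, security, and throttling here.
        return self._store.get(code, 0.0)

    def set_discount(self, code, rate):
        self._store[code] = rate

svc = DiscountDataService()
svc.set_discount("VIP20", 0.20)
print(svc.get_discount("VIP20"))  # 0.2
```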

16
Q

Why is accessing the same data via multiple microservices considered an antipattern, and how can the Data Service Pattern help?

A

Accessing the same data via multiple microservices introduces tight coupling and hinders scalability and independent evolution of microservices. The Data Service Pattern helps reduce coupling by providing managed APIs to access data.
Page 183

Considerations: When building cloud native applications, accessing the same data via multiple microservices is considered an antipattern. It introduces tight coupling between the microservices and prevents them from scaling and evolving on their own. The Data Service Pattern can help reduce coupling by providing managed APIs to access data.
The Data Service Pattern should not be used when the data can clearly be associated with an existing microservice, as introducing unnecessary microservices will cause additional management complexity.

17
Q

What is the primary purpose of the Sharding Pattern, and what should be avoided when generating shard keys?

A

The primary purpose of the Sharding Pattern is to improve data retrieval time by distributing data across multiple shards. When generating shard keys, avoid using auto-incrementing fields and ensure the fields that contribute to the shard key remain fixed to avoid time-consuming data migration.
Page 198, 200

For sharding to be useful, the data should contain one or a collection of fields that uniquely identifies the data or meaningfully groups it into subsets. The combination of these fields generates the shard/partition key that will be used to locate the data. The values stored in the fields that contribute to the shard key should be fixed and never be changed upon data updates. This is because when they change, they will also change the shard key, and if the updated shard key now points to a different shard location, the data also needs to be migrated from the current shard to the new shard location. Moving data among shards is time-consuming, so this should be avoided at all costs.

We don’t recommend using auto-incrementing fields when generating shard keys. Shards do not communicate with each other, and because of the use of auto-incrementing fields, multiple shards may have generated the same keys and refer to different data with those keys locally. This can become a problem when the data is redistributed during data-rebalancing operations.
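The routing step above can be sketched with a stable hash over the immutable fields, rather than an auto-incrementing id. The field values and shard count here are assumptions for illustration.

```python
import hashlib

NUM_SHARDS = 4

def shard_for(key_fields):
    # The shard key is derived from fields that never change on update.
    # A deterministic hash (never an auto-incrementing id) picks the shard,
    # so every node routes the same key to the same shard with no coordination.
    shard_key = ":".join(key_fields)
    digest = hashlib.sha256(shard_key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Same key fields -> same shard, every time, on any node.
print(shard_for(["alice", "2023-05"]))
```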

18
Q

How does the Command and Query Responsibility Segregation (CQRS) Pattern enhance performance and scalability?

A

The CQRS Pattern enhances performance and scalability by segregating command (update/write) and query (read) operations into different services, allowing them to run on different nodes, optimize for their specific operations, and independently scale. It reduces data store contention and isolates operations needing higher security enforcement.
Page 203-204

We can separate commands (updates/writes) and queries (reads) by creating different services responsible for each (Figure 4-16). This not only facilitates running the services related to updates and reads on different nodes, but also helps model services appropriate for those operations and scale them independently.
The command and query should not have data store–specific information but rather have high-level data relevant to the application. When a command is issued to a service, it extracts the information from the message and updates the data store. Then it will send that information as an event asynchronously to the services that serve the queries, such that they can build their data model. The Event Sourcing pattern using a log-based queue system like Kafka can be used to pass the events between services. Through this, the query services can read data from the event queues and perform bulk updates on their local stores, in the optimal format for serving that data.

Distribute operations and reduce data contention
The Command and Query Responsibility Segregation Pattern can be used when cloud native applications have performance-intensive update operations, such as data and security validations or message transformations, or performance-intensive query operations containing complex joins or data mapping. When the same instance of the data store is used for both commands and queries, it can produce poor overall performance due to higher load on the data store. Therefore, by splitting the command and query operations, CQRS not only eliminates the impact of one on the other, improving the performance and scalability of the system, but also helps isolate operations that need higher security enforcement.
Because the Command and Query Responsibility Segregation Pattern allows commands and queries to be executed in different stores, it also enables the command and query systems to have different scaling requirements.
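The flow described above can be sketched as follows, with an in-process queue standing in for a log-based broker such as Kafka. The service classes, the order domain, and the event shape are all illustrative assumptions.

```python
from queue import Queue

events = Queue()  # stands in for a log-based event broker (e.g., Kafka)

class CommandService:
    """Handles writes and publishes change events for the query side."""
    def __init__(self):
        self.write_store = {}  # the command side's own data store

    def place_order(self, order_id, item):
        self.write_store[order_id] = item               # update the write model
        events.put(("order_placed", order_id, item))    # publish asynchronously

class QueryService:
    """Builds a read-optimized model from events; eventually consistent."""
    def __init__(self):
        self.read_model = {}  # denormalized store optimized for queries

    def consume(self):
        while not events.empty():
            _, order_id, item = events.get()
            self.read_model[order_id] = {"item": item, "status": "placed"}

    def get_order(self, order_id):
        return self.read_model.get(order_id)

cmd, qry = CommandService(), QueryService()
cmd.place_order("o-1", "book")
qry.consume()  # reads lag behind writes until events are applied
print(qry.get_order("o-1"))  # {'item': 'book', 'status': 'placed'}
```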

19
Q

Why is CQRS not recommended for applications requiring high consistency between command and query operations?

A

CQRS is not recommended for applications requiring high consistency between command and query operations because it achieves eventual consistency by sending updates asynchronously to query stores via events, which can introduce lock contention and high latencies if synchronous data replication is used.
Page 205

Considerations: Because the Command and Query Responsibility Segregation Pattern segregates the command and query operations, it can provide high availability. Even if some command or query services become unavailable, the full system will not be halted. In the Command and Query Responsibility Segregation Pattern, we can scale the query operations infinitely, and with an appropriate number of replications, the query operations can provide guarantees of zero downtime. When scaling command operations, we might need to use patterns such as Data Sharding to partition data and eliminate potential merge conflicts.
CQRS is not recommended when high consistency is required between command and query operations. When data is updated, the updates are sent asynchronously to the query stores via events by using patterns such as Event Sourcing. Hence, use CQRS only when eventual consistency is tolerable. Achieving high consistency with synchronous data replication is not recommended in cloud native application environments as it can cause lock contention and introduce high latencies.
When using the Command and Query Responsibility Segregation Pattern, we may not be able to automatically generate separate command and query models by using tools such as object-relational mapping (ORM). Most of these tools use database schemas and usually produce combined models, so we may need to manually modify the models or write them from scratch.

20
Q

What does the Materialized View Pattern accomplish, and how does it improve service performance?

A

The Materialized View Pattern replicates and moves data from dependent services to its local data store, building materialized views for efficient querying. It improves service performance by reducing the time to retrieve data, simplifying service logic, and providing resiliency by allowing operations to continue even when the source service is unavailable.
Page 210-212

The Materialized View Pattern replicates and moves data from dependent services to its local data store and builds materialized views (Figure 4-17). It also builds optimal views to efficiently query the data, similar to the Composite Data Services pattern.
The Materialized View Pattern asynchronously replicates data from the dependent services. If the databases support asynchronous data replication, we can use that to transfer data from one data store to another. Failing this, we need to use the Event Sourcing pattern and use event streams to replicate the data: the source service pushes each insert, delete, and update operation asynchronously to an event stream, and these events get propagated to the services that build materialized views, which fetch and load the data into their local stores.

Even when we bring data into the same database, at times joining multiple tables can still be costly. In this case, we can use techniques like relational database views to consolidate data into an easily queryable materialized view.
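The consolidation step can be sketched with `sqlite3` (the tables and figures are made up; SQLite has no native materialized views, so a plain table built from the join stands in for one): the costly join runs once, and subsequent queries hit the precomputed result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, customer TEXT, total REAL);
    CREATE TABLE customers (name TEXT, region TEXT);
    INSERT INTO orders VALUES (1, 'alice', 30.0), (2, 'alice', 12.0);
    INSERT INTO customers VALUES ('alice', 'EU');
""")

# Materialize the costly join once; reads then query the precomputed table
# instead of re-joining on every request.
conn.execute("""
    CREATE TABLE order_summary AS
    SELECT c.region, SUM(o.total) AS revenue
    FROM orders o JOIN customers c ON o.customer = c.name
    GROUP BY c.region
""")

summary = conn.execute("SELECT region, revenue FROM order_summary").fetchall()
print(summary)  # [('EU', 42.0)]
```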

Provide access to nonsensitive data hosted in secure systems
In some use cases, our caller service might depend on nonsensitive data that sits behind a security layer, requiring the service to authenticate and go through validation checks before retrieving the data. Through the Materialized View Pattern, we can replicate the nonsensitive data relevant to the service and allow the caller service to access it directly from its local store. This approach not only removes unnecessary security checks and validations but also improves performance.

21
Q

When is the Data Locality Pattern especially useful, and what considerations should be taken into account when using it?

A

The Data Locality Pattern is especially useful when retrieving data from multiple sources to perform data aggregation or filtering operations, as it reduces data transfer and improves bandwidth utilization. Consider not overloading data nodes and balance the trade-off between bandwidth savings and additional execution cost at data nodes.
Page 216-217

Reduce bandwidth usage when retrieving data
The Data Locality Pattern is especially useful when we need to retrieve data from multiple sources to perform data aggregation or filtering operations. The output of these queries will be significantly smaller than their input, so by running the execution closer to the data source, we need to transfer only a small amount of data, which improves bandwidth utilization. This is especially useful when data stores are huge and clients are geographically distributed, and it is a good approach when cloud native applications are experiencing bandwidth bottlenecks.

Considerations: Applying the Data Locality pattern can also help utilize idle CPU resources at the data nodes. Most data nodes are I/O intensive, and when the queries they perform are simple enough, they might have plenty of CPU resources idling. Moving execution to the data node can better utilize resources and optimize overall performance. We should be careful to not move all executions to the data nodes, as this can overload them and cause issues with data retrieval.
The Data Locality Pattern is not ideal when queries output most of their input. Such cases overload the data nodes without any savings in bandwidth or performance. Deciding when to use the Data Locality Pattern depends on the trade-off between bandwidth and CPU utilization: we recommend it when the gains from reducing data transfer are much greater than the additional execution cost incurred at the data nodes.

22
Q

What are the key functions and benefits of caching in data retrieval?

A

Caching stores previously processed or retrieved data so it can be reused without reprocessing or retrieving it again. This improves data-retrieval time, speeds up static content loading, and reduces data store contention. Caching can also achieve high availability by relaxing the data store dependency.
Page 219-221

How it works
A cache is usually an in-memory data store used to store previously processed or retrieved data so we can reuse that data when required without reprocessing or retrieving it again. When a request is made to retrieve data, and we can find the necessary data stored in the cache, we have a cache hit. If the data is not available in the cache, we have a cache miss.
When a cache miss occurs, the system usually needs to process or fetch data from the data store, as well as update the cache with the retrieved data for future reference. This process is called a read-through cache operation. Similarly, when a request is made to update the data, we should update it in the data store and remove or invalidate any relevant previously fetched entries stored in the cache. This process is called a write-through cache operation. Here, invalidation is important, because when that data is requested again, the cache should not return the old data but should retrieve updated data from the store by using the read-through cache operation. This reading and updating behavior is commonly referred to as a cache aside, and most commercial caches support this feature by default.
Caching data can happen on either the client or server side, or both, and the cache itself can be local (storing data in one instance) or shared (storing data in a distributed manner).
Especially when the cache is not shared, it cannot keep on adding data, as it will eventually exhaust available memory. Hence, it uses eviction policies to remove some records to accommodate new ones. The most popular eviction policy is least recently used (LRU), which removes data that is not used for a long period to accommodate new entries. Other policies include first in, first out (FIFO), which removes the oldest loaded entry; most recently used (MRU), which removes the last-used entry; and trigger-based options that remove entries based on values in the trigger event. We should use the eviction policy appropriate for our use case.
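The LRU policy mentioned above can be sketched in a few lines with `collections.OrderedDict` (the capacity and keys are arbitrary): every access moves an entry to the "recent" end, and when the cache is full, the entry at the "old" end is evicted.

```python
from collections import OrderedDict

class LRUCache:
    """Least-recently-used eviction: the oldest-accessed entry is dropped when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)        # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the least recently used entry

c = LRUCache(2)
c.put("a", 1)
c.put("b", 2)
c.get("a")       # "a" is now the most recently used
c.put("c", 3)    # capacity exceeded: "b" (least recently used) is evicted
print(list(c.data))  # ['a', 'c']
```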
When data is cached, data stored in the data store can be updated by other applications, so holding data for a long period in the cache can cause inconsistencies between the data in the cache and the store. This is handled by using an expiry time for each cache entry. This helps reload the data from the data store upon time-out and improves consistency between the cache and data store.
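The read-through, write-through invalidation, and expiry behavior described above can be sketched as follows; the class name and the 60-second TTL are illustrative choices.

```python
import time

class CacheAside:
    """Minimal cache-aside sketch: read-through on miss, invalidate on write,
    and per-entry expiry to limit staleness against the backing store."""

    def __init__(self, backing_store, ttl_seconds=60.0):
        self.store = backing_store   # the authoritative data store
        self.ttl = ttl_seconds
        self.cache = {}              # key -> (value, expiry_time)

    def read(self, key):
        entry = self.cache.get(key)
        if entry and entry[1] > time.monotonic():
            return entry[0]                  # cache hit
        value = self.store[key]              # cache miss: read-through
        self.cache[key] = (value, time.monotonic() + self.ttl)
        return value

    def write(self, key, value):
        self.store[key] = value              # write-through to the store...
        self.cache.pop(key, None)            # ...and invalidate the stale entry

db = {"greeting": "hello"}
c = CacheAside(db)
c.read("greeting")                    # miss: loaded from the store
db["greeting"] = "changed elsewhere"  # the store changes behind the cache
print(c.read("greeting"))             # hello  (still cached until expiry)
c.write("greeting", "hi")             # invalidation forces a fresh read-through
print(c.read("greeting"))             # hi
```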

Improve time to retrieve data
Caching can be used when retrieving data from the data store requires much more time than retrieving from the cache. This is especially useful when the original store needs to perform complex operations or is deployed in a remote location, and hence the network latency is high.

Improve static content loading
Caching is best for static data or for data that is rarely updated. Especially when the data is static and can be stored in memory, we can load the full data set to the cache and configure the cache not to expire. This drastically improves data-retrieval time and eliminates the need to load the data from the original data source.

Reduce data store contention
Because it reduces the number of calls to the data store, we can use the Caching Pattern to reduce data store contention or when the store is overloaded with many concurrent requests. If the application consuming the data can tolerate inconsistencies, such as data being outdated by a few minutes, we can also deploy the Caching Pattern on write-intensive data stores to reduce the read load and improve the stability of the system. In this case, the data in the cache will eventually become consistent when the cache times out.

23
Q

What are the benefits of prefetching data in a cache, and when should this technique be used?

A

Prefetching data improves data-retrieval time by loading the cache with data likely to be queried, reducing initial cache misses, and stress on the service and data store. This technique should be used when predictable query patterns exist, such as processing recent orders or anticipating user actions like fetching the next set of search results.
Page 221

Prefetch data to improve data-retrieval time
We can preload the cache fully or partially when we know the kind of queries that are more likely to be issued. For example, if we are processing orders and know that the applications will mostly call last week’s data, we can preload the cache with last week’s data when we start the service. This can provide better performance than loading data on demand. When preloading is omitted, the service and the data store can encounter high stress, as most of the initial requests will result in a cache miss.
The Caching Pattern can also be used when we know what data will be queried next. For example, if a user is searching for products on a retail website and we are rendering only the first 10 entries, the user will likely request the next 10 entries. Preloading the next 10 entries into the cache can save time when that data is needed.

24
Q

How can caching achieve high availability, and what are the benefits of using a distributed cache?

A

Caching can achieve high availability by handling service calls with cached data even when the backend data store is unavailable, and using a fallback mechanism with a shared or distributed cache. Distributed caches provide scalability and resiliency by partitioning and replicating data, bringing data closer to clients and yielding faster response times.
Page 222-223

Achieve high availability by relaxing the data store dependency
Caching can also be used to achieve high availability, especially when service availability is more important than the consistency of the data. We can handle service calls with cached data even when the backend data store is not available. As shown in Figure 4-20, we can also extend the Caching Pattern by making the local cache fall back on a shared or distributed cache, which in turn can fall back to the data store when the data is not present. The Caching Pattern can incorporate the Resilient Connectivity pattern with a circuit breaker, discussed in Chapter 3, for the fallback calls so that they retry and gracefully reconnect when the backends become available after a failure.
When using a shared cache, we can also introduce a secondary cache instance as a standby and replicate the data to it, to improve availability. This allows our applications to fall back to the standby when the primary cache fails.
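The fallback chain above can be sketched as follows; each tier here is just a dict standing in for the local cache, the shared cache, and the data store (production code would add the circuit breakers and replication described in the text):

```python
class TieredCache:
    """Sketch of a local cache falling back to a shared cache, which falls
    back to the data store. In production the shared tier would be something
    like Redis and the fallback calls would be wrapped in retries/circuit
    breakers; here each tier is a plain dict."""

    def __init__(self, load_fn):
        self.local = {}
        self.shared = {}              # stand-in for a shared/distributed cache
        self.load_fn = load_fn        # stand-in for the data store call

    def get(self, key):
        if key in self.local:
            return self.local[key]
        if key in self.shared:        # local miss: try the shared tier
            value = self.shared[key]
        else:                         # shared miss: last resort, the data store
            value = self.load_fn(key)
            self.shared[key] = value
        self.local[key] = value       # populate the faster tier on the way back
        return value

def load(key):
    return {"user:1": "Alice"}[key]

tiers = TieredCache(load)
first = tiers.get("user:1")
tiers.load_fn = None                  # simulate the data store going down
second = tiers.get("user:1")          # still answered from the cache
```

The second lookup succeeds even though the "data store" is gone, which is the availability property the pattern is after.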

Cache more data than a single node can hold
Distributed caching systems are an alternative when the local cache or shared cache cannot contain all the needed data. They also provide scalability and resiliency by partitioning and replicating data. These systems support read-through and write-through operations and can make direct calls to the data stores to retrieve and update data. We can also scale them by simply adding more cache servers as needed.
Though distributed caches can store lots of data, they are not as fast as the local cache and add more complexity to the system. We might need additional network hops to retrieve data, and we now need to manage an additional set of nodes. Most important, all nodes participating in the distributed cache should be within the same network and have relatively high bandwidth among one another; otherwise, they can also suffer data-synchronization delays. In contrast, when the clients are geographically distributed, a distributed cache can bring the data closer to the clients, yielding faster response times.

25
Q

What precautions should be taken when setting cache time-outs, and how can unnecessary layers of cache impact performance?

A

Cache time-outs should be set at an optimum level to balance consistency and reloading frequency. Unnecessary layers of cache can cause high memory consumption, reduce performance, and cause data inconsistencies. Load testing and monitoring cache hit percentage, performance, CPU, and memory usage are recommended to ensure cache effectiveness.
Page 224, 226

Things to note: The cache time-out should be set at an optimum level, neither too long nor too short. Setting too long a time-out causes higher inconsistency, while setting too short a time-out is also detrimental, as it reloads the data too often and defeats the purpose of caching. However, a long time-out can be beneficial when the cost of data retrieval is significantly higher than the cost of the data being inconsistent.

Things to note: Introducing unnecessary layers of cache can cause high memory consumption, reduce performance, and cause data inconsistencies. We highly recommend performing a load test when introducing any caching solution, and especially monitoring the percentage of cache hits, along with performance, CPU, and memory usage. A lower cache-hit percentage can indicate that the cache is not effective. In this case, either modify the cache to achieve a higher percentage of cache hits or choose other alternatives. Increasing the size of the cache, reducing cache expiry, and preloading the cache are options that we can use to improve cache hits.
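To make the recommended monitoring concrete, here is a hedged sketch of a cache that tracks its own hit percentage (names are illustrative):

```python
class InstrumentedCache:
    """Cache wrapper that tracks the hit percentage, the metric the text
    recommends monitoring: a low ratio suggests the cache is ineffective."""

    def __init__(self):
        self.data = {}
        self.hits = 0
        self.misses = 0

    def get(self, key, load_fn):
        if key in self.data:
            self.hits += 1
            return self.data[key]
        self.misses += 1
        value = load_fn(key)          # cache miss: load from the source
        self.data[key] = value
        return value

    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

c = InstrumentedCache()
for key in ["a", "b", "a", "a"]:      # two misses, then two hits
    c.get(key, lambda k: k.upper())
```

In a real deployment this ratio would be exported to a metrics system and watched alongside CPU and memory usage, as the text advises.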


26
Q

What are the benefits of batch data updates to caches, and what approaches can be used to handle concurrent updates?

A

Batch data updates optimize bandwidth and improve performance under high load. For concurrent updates, an optimistic approach assumes no concurrent updates and checks for concurrent writes before updating, while a pessimistic approach locks the cache for the update duration, though it is not scalable and suitable only for short-lived operations.
Page 227

Whenever possible, we recommend batch data updates to caches, as is done in data stores. This optimizes bandwidth and improves performance when the load is high. When multiple cache entries are updated at the same time, the updates can follow either an optimistic or a pessimistic approach. In the optimistic approach, we assume that no concurrent updates will occur and check the cache only for a concurrent write before updating the cache. But in the pessimistic approach, we lock the cache for the full update period so no concurrent updates can occur. The latter approach is not scalable, so you should use this only for very short-lived operations.
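The optimistic approach can be sketched with a version number per entry: a write is accepted only if no concurrent update happened in between (a compare-and-set; all names are hypothetical):

```python
class VersionedCache:
    """Optimistic concurrency sketch: each entry carries a version number.
    A writer reads the version, computes the new value, and the write is
    accepted only if the version is unchanged (compare-and-set). On failure
    the caller re-reads and retries instead of holding a lock."""

    def __init__(self):
        self.entries = {}             # key -> (value, version)

    def read(self, key):
        return self.entries.get(key, (None, 0))

    def write(self, key, value, expected_version):
        _, current = self.read(key)
        if current != expected_version:
            return False              # concurrent update detected: caller retries
        self.entries[key] = (value, current + 1)
        return True

cache = VersionedCache()
_, v = cache.read("stock")
ok_first = cache.write("stock", 10, v)    # succeeds: version still matches
ok_stale = cache.write("stock", 99, v)    # fails: another write got in first
```

Unlike the pessimistic approach, no lock is held between read and write, so throughput scales; the cost is the occasional retry.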

27
Q

How can data security be enforced for caches, and why should caches not be exposed directly to external systems?

A

Data security for caches can be enforced by adding a data service with API security on top of the cache, using the Data Service pattern. Caches should not be exposed directly to external systems because they are usually not designed for security, and adding a data service helps protect the data and control access.
Page 228

Some commercial cache services can provide data security by using the Vault Key pattern, covered later in this chapter. But most caches are not designed for security, and they should not be directly exposed to external systems. To achieve security, we can add a data service on top of the cache by using the Data Service pattern and apply API security to the data service (Figure 4-22). This adds data protection and allows only authorized services to read and write data to the cache.
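A minimal sketch of fronting a cache with such a data service, using a simple API-key check as a stand-in for real API security (OAuth2, mTLS, and so on); the key names and cache contents are hypothetical:

```python
# The cache itself is never exposed; all access goes through the data
# service, which is the single point where authorization is enforced.
AUTHORIZED_KEYS = {"svc-orders-key"}      # hypothetical service credentials
cache = {"session:42": "alice"}           # backing cache, kept private

def data_service_get(api_key, cache_key):
    if api_key not in AUTHORIZED_KEYS:
        raise PermissionError("unauthorized caller")
    return cache.get(cache_key)           # only reached by authorized services
```

The design point is that the security check lives in one place (the data service), not in every cache client.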

28
Q

What is the Static Content Hosting Pattern, and how does it benefit cloud native web services?

A

The Static Content Hosting Pattern allows direct serving of static content from storage services such as CDNs, reducing resource utilization on rendering services and providing faster static content delivery by replicating and caching data in multiple locations closer to clients.
Page 231-232

Cloud native web services are used to create dynamic content based on clients' requests. Some clients, especially browsers, also require a lot of static content, such as static HTML pages, JavaScript and CSS files, images, and files for downloads. Rather than using microservices to serve static content, the Static Content Hosting Pattern allows us to serve it directly from storage services such as content delivery networks (CDNs).

Provide faster static content delivery
Because static content does not change, the Static Content Hosting Pattern replicates and caches data in multiple environments and geographical locations to move it closer to the clients. This helps serve static data with low latency.

Reduce resource utilization on rendering services
When we need to send both static and dynamic data to clients, as discussed in the preceding web browser example, we can separate the static data and move it to a storage system such as a CDN or an S3 bucket, and let clients directly fetch that data. This reduces the resource utilization of the microservice that renders the dynamic content, as it does not need to pack all the static content in its response.

29
Q

Under what circumstances is the Static Content Hosting Pattern not recommended, and what are the additional considerations for its use?

A

The Static Content Hosting Pattern is not recommended if the static content needs updating before delivery or when the amount of static data is small, as requesting data from multiple sources can incur more latency. Additionally, it requires more complex client implementations and might need secure storage if authorized access is required.
Page 233

Considerations: We cannot use the Static Content Hosting Pattern if the static content needs to be updated before delivering it to the clients, such as adding the current access time and location to the web response. Further, this is not a feasible solution when the amount of static data to be served is small; requesting data from multiple sources can incur more latency than serving it directly from the service. When you need to send both static and dynamic content, we recommend using the Static Content Hosting Pattern only when it provides a significant performance advantage.
When using the Static Content Hosting Pattern, remember that you might need more-complex client implementations. This is because, based on the dynamic data that arrives, the client should be able to retrieve the appropriate static content and combine both types of data at the client side. If we are using the Static Content Hosting Pattern for use cases other than web-page rendering in a browser, we have to also be able to build and execute complex clients to fulfill that use case.
Sometimes we might need to store static data securely. If we need to allow authorized users to access static data via the Static Content Hosting Pattern, we can use the Data Service pattern along with API security or the Vault Key pattern to provide security for the data store.

30
Q

What are the different levels of transaction isolation, and what type of operations can the Transaction Pattern combine?

A

The different levels of transaction isolation are serializable isolation, repeatable reads isolation, read committed isolation, and read uncommitted isolation. The Transaction Pattern can combine multiple operations as a single unit of work across multiple systems, such as consuming an event from an event queue, performing an update to a data store, and passing the message to another event queue.
Page 238-239

We can achieve transaction isolation at different levels. Serializable isolation provides the highest level. This blocks data access on selected data for parallel read and write queries during the transaction, and blocks addition and removal of data that might fall into the transaction data range.
Repeatable reads isolation provides the second-best level of isolation. This blocks data access on selected data for read and write queries during the transaction, but allows addition and removal of new data in the transaction data range. At the same time, read committed isolation blocks only data writes, while read uncommitted isolation allows reading noncommitted updates made by other transactions.
Transactions are commonly used with only a single data store, such as a relational database, but we can also coordinate operations across multiple systems, such as databases, event streams, and queuing systems.

Combine multiple operations as a single unit of work: We can use the Transaction pattern to combine multiple steps that must all complete for the operation to be considered valid.
We can also make sure that multiple transactions do not interfere with one another; for example, Bob and Eve can both transfer money to Alice's account at the same time, in parallel.

Combine operations across multiple systems: The Transaction pattern can be used when we want to consume an event from an event queue, perform an update based on it against a data store, and pass the message to another event queue for further processing, all in a single transaction, as depicted in Figure 4-24. To synchronize operations between multiple systems, we can use an XA transaction, which uses a two-phase commit protocol. Most databases and event-queuing systems natively support XA transactions, and through this we can ensure that the event will not be lost even if the processing system fails in the middle of its execution.
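A full XA two-phase commit needs real brokers and databases, but the all-or-nothing semantics of a single unit of work can be sketched against an in-memory SQLite store (the account names and schema are illustrative):

```python
import sqlite3

# Both account updates commit together or, on any failure, roll back
# together. Coordinating the same guarantee across multiple systems
# (database + event queues) is what an XA two-phase commit adds.
conn = sqlite3.connect(":memory:")
with conn:
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                     [("bob", 100), ("alice", 50)])

def transfer(conn, src, dst, amount):
    try:
        with conn:                    # opens a transaction, commits on success
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
            row = conn.execute("SELECT balance FROM accounts WHERE name = ?",
                               (src,)).fetchone()
            if row[0] < 0:
                raise ValueError("insufficient funds")  # triggers rollback
    except ValueError:
        pass                          # both updates were rolled back together

transfer(conn, "bob", "alice", 30)    # commits: bob 70, alice 80
transfer(conn, "bob", "alice", 500)   # rolls back: balances unchanged
```

The failed transfer leaves no partial write behind; neither the debit nor the credit survives the rollback.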

31
Q

When is it advisable to use the Transaction Pattern, and when should alternative patterns like Saga be considered?

A

The Transaction Pattern should be used when all steps must be performed as an atomic operation and are relatively short-lived. The Saga pattern should be considered when transactions involve more than three systems, and compensation transactions are possible, as it reduces latency and coupling compared to XA transactions.
Page 240

Considerations: We do not need to use the Transaction pattern when the operation has only a single step, or when there are multiple steps but failure of some is considered acceptable.
It is important to note that the use of consensus algorithms such as XA transactions will synchronize operations and introduce latency. We recommend using the Transaction pattern only when the transaction is relatively short lived, and only if it involves few systems.
Whenever possible, make the operation idempotent; this will help eliminate the need for using any transactions and simplifies the system. This is because with idempotent updates, even when the same operation is performed multiple times, the results will be the same.
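The difference idempotency makes can be shown with a tiny sketch: an absolute update can be retried safely, while a delta-based one cannot (all names are illustrative):

```python
# Idempotent vs non-idempotent updates. Setting an absolute value can be
# retried safely after a timeout or failure; applying a delta compounds
# on retry, which is why non-idempotent operations need transactions.
balance = {"alice": 50}

def set_balance(account, value):      # idempotent: repeating has no extra effect
    balance[account] = value

def add_to_balance(account, delta):   # NOT idempotent: repeats compound
    balance[account] += delta

set_balance("alice", 80)
set_balance("alice", 80)              # safe retry, state unchanged
```

When a caller cannot know whether its first attempt succeeded, only the idempotent form can be blindly retried.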
When we need to synchronize execution across more than three systems, we recommend using the Saga pattern discussed in Chapter 3. The Saga pattern is useful for coordinating transactions among multiple data stores, microservices, and message brokers. It enables us to execute multiple transactions in order, and to compensate previous transactions when a later transaction fails. This can also reduce the high latency and coupling that can result from the distributed locks used by XA transactions. But we can use Saga only when all the participating transactions can be reverted, in the event of a failure, by using a compensation transaction. This can become a problem especially when we are integrating with third-party systems and might not have a way to compensate them in the case of failure.
We recommend using XA transactions over Saga when all updates need to be done in a single data store or when all steps must be performed at the same time as an atomic operation. While Saga performs transactions in order, other systems can access data from the data stores and microservices in parallel. They can then get inconsistent results if they retrieve one part of the data from a data store that has already performed the transaction and another from a data store that has not yet processed the transaction.

32
Q

What is the purpose of the Vault Key Pattern, and what should be considered when using it?

A

The Vault Key Pattern provides a mechanism to control data store access and enforce security. It requires data store support for key validation, and considerations include setting moderate expiry times to reduce damage from compromised keys, and using alternative approaches if data stores can’t validate access based on keys.
Page 244

Considerations: Once the caller service gets access to the data store, the application that governs the service usually loses control. The Vault Key Pattern provides a mechanism to retain control over the data store and enforce security. But we can apply the Vault Key Pattern only when the data store supports key validation; this is important to ensure that the token was issued by the identity provider and has not expired. Some advanced data stores also support access scopes; they can identify which section of the data store, such as a table or row, can be accessed by the incoming request. When the data store can't validate access based on keys, use alternative approaches, such as fronting the store with a data service protected by API security.
Sometimes the issued vault key can be compromised. In these cases, it is usually not possible to block the use of that token, as most data stores do not support this functionality. We can reduce the damage that can be caused by a compromised vault key by setting the expiry time to a moderate value.
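A hedged sketch of expiry-based vault keys (the structure of the key and the field names are assumptions; a real key would also be signed by the identity provider so the data store can verify its origin):

```python
import time

# The identity provider issues a short-lived key; the data store checks the
# expiry (and, in reality, the signature) before granting access. A moderate
# TTL caps the damage if a key leaks, since the key cannot be revoked.
def issue_vault_key(subject, ttl_seconds=300):
    return {"sub": subject, "exp": time.time() + ttl_seconds}

def validate_vault_key(key, now=None):
    now = time.time() if now is None else now
    return now < key["exp"]           # expired keys are rejected

key = issue_vault_key("orders-service", ttl_seconds=300)
valid_now = validate_vault_key(key)
valid_later = validate_vault_key(key, now=time.time() + 600)  # after expiry
```

Because the store only checks local state (the expiry), compromised keys die on their own without any revocation infrastructure.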

33
Q

What is the primary recommendation for using relational database management systems (RDBMS) in cloud native applications?

A

For cloud native applications, it is highly recommended to use a managed version of RDBMS provided by a cloud vendor, such as Amazon RDS, Google Cloud SQL, or Azure SQL, to reduce the complexity of managing databases and ensure better tuning for the environment.
Page 245

Relational Database Management Systems: Most traditional databases fall under the category of relational database management systems (RDBMSs), which includes MySQL, Oracle, MSSQL, Postgres, H2, and more. These relational databases provide the ACID properties and, through SQL, support very complex data access patterns. However, if you have nonrelational data such as XML, JSON, or binary formats, an RDBMS may not be the best option, and you might need to select a distributed filesystem or NoSQL database to store the data, as discussed previously in "Relational Databases".
When building cloud native applications, instead of deploying the database yourself on the cloud infrastructure, we highly recommend using a managed version of RDBMSs provided by a cloud vendor, such as Amazon Relational Database Service (RDS), Google Cloud SQL, or Azure SQL. This will not only reduce the complexity of managing the databases, but also be better tuned for the environment.
To scale RDBMSs, we can deploy them as primary and replica databases, as discussed in the Materialized View pattern, or shard the data as in the Sharding pattern. In the worst case, if we still have issues with space, we can also periodically back up older, rarely used data to an archive such as NoSQL, and delete it from the store.

34
Q

What are the key features and limitations of Apache Cassandra as a NoSQL database?

A

Apache Cassandra is known for continuous availability, high performance, and linear scalability. It offers replication across data centers, handles large amounts of data, and supports eventual consistency with adjustable levels. However, it has limited performance for frequent updates or deletes and is inefficient for joining two column families.
Page 246

Apache Cassandra: Apache Cassandra is a distributed NoSQL database that began internally at Facebook and was released as an open source project in July 2008. The Cassandra column store is well-known for its continuous availability (zero downtime), high performance, and linear scalability, which modern applications and microservices require. It also offers replication across data centers and geographies to guarantee availability across regions. Cassandra can handle petabytes of information and thousands of concurrent operations per second, enabling you to manage large amounts of data across hybrid cloud and multicloud environments. For cloud native application deployment, we recommend using managed Cassandra deployments such as Amazon Keyspaces and DataStax Astra on Google Cloud. Cassandra's write performance is very high compared to its read performance. As discussed previously in "NoSQL Databases", it provides eventual consistency by design. However, it also lets us change its consistency level to achieve weak or strong consistency based on the use case.
The performance of Cassandra also depends on how we store and query data. If we will be querying data based on a set of keys, we should use its row key (partition key). If we need to query data by different keys, we can create secondary indexes, but we should not overuse them; they can slow the data store, as each insertion also has to update the indexes. Further, Cassandra is not efficient at joining two column families, and we should not use it if we need to update or delete data frequently.

35
Q

What are the strengths and weaknesses of Apache HBase compared to Apache Cassandra?

A

Apache HBase provides linear scalability and real-time read/write access to large data sets, and it supports dynamic database schema. However, it has a complex interdependent system and a single point of failure due to its master/worker deployment. HBase is more suitable for high data consistency requirements, unlike Cassandra, which excels in high availability.
Page 274

Apache HBase: Apache HBase is a distributed, scalable, NoSQL column store that runs on top of the HDFS. HBase can host very large tables with billions of rows and millions of columns, and can also provide real-time, random read/write access to Hadoop data. It scales linearly across very large data sets and easily combines data sources with different structures and schemas.
As HBase is a column store, it supports dynamic database schemas, and as it runs on top of HDFS, it can also be used in MapReduce jobs. However, HBase is a complex, interdependent system that is more difficult to configure, secure, and maintain.
Unlike Cassandra, HBase uses “master/worker” deployment, and so can suffer a single point of failure. If your application requires high availability, choose Cassandra over HBase. However, when we depend heavily on data consistency, HBase will be more suitable because it writes data to only one place and always knows where to find it (because data replication is done “externally” by HDFS). Similar to Cassandra, HBase also does not perform well for frequent data deletes or updates.

36
Q

What are the key features of MongoDB, and what scenarios is it particularly suited for?

A

MongoDB is a document store that supports JSON-like documents, allowing flexible schema definition and various data operations. It is suited for mobile applications, content management, real-time analytics, and IoT applications. It favors consistency over availability, with multiple secondary replicas and an automatic primary election mechanism.
Page 248

MongoDB: MongoDB is a document store that supports storing data in JSON-like documents, as discussed in “NoSQL Databases”. Documents and collections in MongoDB are comparable to records and tables in relational databases. It uses MongoDB query language to access the stored data, perform aggregation filtering and sorting based on any document fields, and insert and delete fields without restructuring documents. MongoDB Cloud provides MongoDB as a hosted solution for cloud native application usage.
Unlike Cassandra or RDBMSs, MongoDB prefers more indexes. When not indexed, its performance can suffer, as it needs to search the entire collection. MongoDB also favors consistency over availability. It achieves availability by using a single read/write primary and multiple secondary replicas. When a primary becomes unavailable, the read/write operations will be temporarily halted for about 10 to 40 seconds while MongoDB automatically elects one of its secondary replicas as the primary.
MongoDB is heavily used for mobile applications, content management, real-time analytics, and IoT applications. MongoDB is also a good choice if you have no clear schema definition with your JSON documents, and you can tolerate some data store unavailability. However, like other NoSQL databases, it is not suitable for transactional data.

37
Q

What are the key characteristics and limitations of Amazon DynamoDB?

A

Amazon DynamoDB is a key-value and document database known for low latency, high scalability, automatic partitioning, and replication across multiple availability zones. It supports fine-grained access control but has limited querying capabilities and does not support relational database features like table joins and foreign-key concepts.
Page 249

Amazon DynamoDB: DynamoDB is a key-value and document database that can be used to store and retrieve data with low latency and high scalability. It can handle more than 10 trillion requests per day and more than 20 million requests per second during peaks. Data in DynamoDB is stored on solid-state disks (SSDs), automatically partitioned, and replicated across multiple availability zones. It also provides fine-grained access control and uses proven secured methods to authenticate users and prevent unauthorized data access.
DynamoDB, a service provided by AWS, cannot be installed on a local server or in clouds other than AWS. Use DynamoDB only if you are using AWS as the primary cloud infrastructure for your cloud native applications. Further, DynamoDB has limited querying capability compared to relational stores and does not support relational database features such as table joins and foreign-key concepts; instead, it advocates using non-normalized data with redundancy for performance.

38
Q

What are the primary uses and limitations of Apache HDFS?

A

Apache HDFS is used for storing analytical data due to its high data resiliency and optimization for writing and reading data in a streaming manner. It supports storing large files efficiently but has limitations with random reads and can suffer unavailability if the single-name node is down.
Page 250

Apache HDFS: The Apache Hadoop Distributed File System (HDFS) is a widely used distributed filesystem designed to run on cheap commodity hardware while providing high data resiliency by storing at least three copies of data in a distributed manner. HDFS is commonly used to store analytical data because the data stored in HDFS is immutable and is optimized to write and read data in a streaming manner. This also allows HDFS to be used as the data source for Hadoop MapReduce jobs for efficient processing of large data. Cloudera and major cloud vendors provide HDFS as a hosted service to use with cloud native applications.
HDFS stores data in multiple data nodes, and stores all its metadata in memory in a single name node. When that node is not available, HDFS can fail new reads and writes, causing unavailability. The capacity of the name node's memory also imposes an upper limit on the number of files HDFS can store. We recommend using HDFS to store a small number of large files instead of a large number of small files. Because it is optimized to read data sequentially, it is not the best solution when we need random reads.

39
Q

What are the benefits of using Amazon S3 for cloud native applications, and what additional features does it offer?

A

Amazon S3 is beneficial for cloud native applications as it provides highly available object storage with fine-grained data access control, supports running analytics on data nodes using standard SQL expressions, and allows retrieval of subsets of object data to improve performance. It is recommended for use with AWS as the primary cloud platform.
Page 251

Amazon S3: Amazon Simple Storage Service (S3) is an object storage that is part of AWS. It can be used in a data lake, as storage for cloud native applications, as a data backup or archive, and for big data analytics. It also supports the Data Locality pattern by running analytics on data nodes using standard SQL expressions of Amazon Athena. We can use S3 Select to retrieve subsets of object data instead of the entire object. This can improve data-access performance by up to four times. Amazon S3 is highly available and provides fine-grained data access control. We recommend using it when you use AWS as your primary cloud native application platform.

40
Q

What are the primary features of Azure Cosmos DB, and what are its usage constraints?

A

Azure Cosmos DB is a fully managed NoSQL data store supporting key-value, document, column, and graph database semantics, providing low-latency data retrieval, enterprise-grade security, and open-source APIs for MongoDB and Cassandra. It can only be used on the Azure cloud platform and provides limited transactional support within logical data partitions.
Page 251

Azure Cosmos DB: Azure Cosmos DB is a fully managed NoSQL data store that supports key-value, document, column, and graph database semantics. It can store and retrieve data with low latency, and provides enterprise-grade security with end-to-end encryption and access control. It also provides open source APIs for MongoDB and Cassandra, enabling clients to leverage the cloud without changing their application.
Cosmos DB, a service provided by Azure, cannot be installed on a local server or in clouds other than Azure. Use Cosmos DB only if you are using Azure as the primary cloud infrastructure for your cloud native applications. Still, Cosmos DB provides some flexibility by providing migration and synchronization of data with your on-premises Cassandra cluster. Though Cosmos DB can provide transactional support, it is limited within the logical data partition.

41
Q

What distinguishes Google Cloud Spanner as a relational data store, and what limitations does it have?

A

Google Cloud Spanner supports unlimited scale, strong consistency, and the capability to run SQL queries with support for transactions across all cluster nodes. It provides security through data-layer encryption and access controls. It is only available on the Google Cloud platform and requires changes to applications due to partial ANSI SQL support.
Page 251

Google Cloud Spanner: Google Cloud Spanner is a fully managed relational data store that supports unlimited scale and strong consistency. It provides the capability to run SQL queries while providing support for transactions across all the nodes in the cluster. It also linearly scales write and read transactions and provides security through data-layer encryption and access controls.
Because Cloud Spanner is a service provided by Google, it cannot be installed on a local server or in clouds other than Google. Use Spanner only if you are using Google as the primary cloud infrastructure for your cloud native applications. Though it provides SQL support, it does not fully support the American National Standards Institute (ANSI) SQL spec and so requires changes to applications before migrating from standard relational databases to Spanner.

42
Q

What key practices are recommended for ensuring data security in cloud native applications?

A

Recommended practices include enforcing physical and software security for data at rest using the Vault Key pattern and API security, encrypting sensitive data before storage, separating sensitive data for additional protection, using secure transmission channels like HTTPS for data in transit, and encrypting only sensitive parts of messages to protect data without segmenting messages.
Page 256

Security:
Protecting data and allowing only the appropriate people and systems to access relevant data is key to the successful execution of a cloud native application, and to the success of an organization in general. The security of data should be enforced both when data is at rest and when data is on the move.
We can enforce data security at rest both physically and through software. Data servers should be guarded and accessed only by authorized persons. Data stores running in the servers should also enforce security via the Vault Key pattern and API security to control data access. When storing sensitive data, we recommend encrypting it before storing it in the data store. We also recommend encrypting the filesystem in which the data is stored as an added layer of protection.
We recommend separating sensitive data from other data so that sensitive data can be governed with additional layers of protection, along with audit trails to monitor suspicious behavior. Don’t collect and store unnecessary sensitive information. When needed, mask all sensitive information such as usernames and email addresses. This can be done by replacing sensitive data with unique identifiers and storing their mapping in a protected data store. This can enable us to continuously analyze and audit user behavior while providing the capability to delete all sensitive user data by simply deleting the data mapping. This will also help enforce privacy and data regulations such as Europe’s General Data Protection Regulation (GDPR).
When it comes to data in transit, we should always transmit the data via secure data transmission channels such as HTTPS. For added security, we can encrypt the messages with asymmetric keys so that the intermediary hosts will not have access to the content.
To protect sensitive information without segmenting messages, we can encrypt only the part of the message that has sensitive information. The whole message will be delivered to each client, but only the clients with the relevant key for the sensitive data can decrypt it, while others can’t access that data.
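The field-level encryption idea above can be sketched as follows: the message travels whole, non-sensitive fields stay readable to every recipient, and only key holders can recover the protected field. To keep the sketch dependency-free, base64 stands in for a real cipher here — base64 is an encoding, NOT encryption; a real implementation would use something like AES-GCM. All names are illustrative.

```python
import base64
import json

def encrypt_field(plaintext: str) -> str:
    # Placeholder for a real cipher (e.g., AES-GCM); base64 is NOT encryption.
    return base64.b64encode(plaintext.encode()).decode()

def decrypt_field(ciphertext: str) -> str:
    # Inverse of the placeholder cipher above.
    return base64.b64decode(ciphertext.encode()).decode()

def build_order(order_id: str, card_number: str) -> str:
    # Only the sensitive field is protected; the rest of the message is
    # readable by intermediaries and clients without the key.
    return json.dumps({
        "order_id": order_id,
        "card_number": encrypt_field(card_number),
    })

message = build_order("o-1001", "4111-1111-1111-1111")
parsed = json.loads(message)
assert parsed["order_id"] == "o-1001"                  # readable by everyone
assert parsed["card_number"] != "4111-1111-1111-1111"  # opaque in transit
assert decrypt_field(parsed["card_number"]) == "4111-1111-1111-1111"
```

Because the message is never split, routing and delivery stay simple; only the decryption step differs between clients that hold the key and those that don't.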

43
Q

What makes Redis suitable as a cache, and what are its limitations compared to relational data stores?

A

Redis is suitable as a cache due to its in-memory data storage, support for various data structures, transactions, LRU eviction, automatic failover, and persistence options. However, it does not support efficient querying, complex data manipulation, or aggregation operations, making it unsuitable as a NoSQL replacement for relational data stores.
Page 248

Redis: Redis is an in-memory key-value data store commonly used as a cache, as discussed in the “Caching Pattern”. It supports string keys and a variety of value types, such as strings, lists, sets, sorted sets, hashes, bit arrays, and much more. This makes applications less complex, as they can store their internal data structures directly in Redis. Redis is ideal as a cache because it supports transactions, keys with a limited time to live (TTL), LRU eviction of keys, automatic failover, and the ability to spill excess data to disk. Redis also has plenty of cloud hosting options for cloud native applications to use, including AWS, Google, Redis Labs, and IBM.
Redis supports two persistence options: Redis Database Backup (RDB) and Append Only File (AOF). By using both, we can achieve good write performance and a good degree of data safety upon system failures. Redis provides high availability through a single master and multiple replicas, as in the CQRS pattern, and scalability by sharding masters and replicas, as discussed in the “Data Sharding Pattern”.
However, Redis is not a NoSQL replacement for relational data stores, as it does not support many standard relational data store features, such as efficient querying and complex data manipulation and aggregation operations.
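The TTL and LRU-eviction behavior that makes Redis suitable as a cache (SET with an expiry, plus an LRU eviction policy when memory is full) can be illustrated with a minimal in-process sketch. No live Redis server is assumed; the class and parameter names are illustrative, not part of any Redis client API.

```python
import time
from collections import OrderedDict

class TinyLruTtlCache:
    """In-process sketch of TTL expiry + LRU eviction, which Redis provides
    natively (per-key TTLs and the allkeys-lru maxmemory policy)."""

    def __init__(self, max_entries: int):
        self.max_entries = max_entries
        self._data = OrderedDict()  # key -> (value, expiry timestamp)

    def set(self, key, value, ttl_seconds):
        self._data[key] = (value, time.monotonic() + ttl_seconds)
        self._data.move_to_end(key)            # mark as most recently used
        if len(self._data) > self.max_entries:
            self._data.popitem(last=False)     # evict least recently used

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        value, expiry = item
        if time.monotonic() > expiry:          # expired, like a Redis TTL
            del self._data[key]
            return None
        self._data.move_to_end(key)            # a read also refreshes recency
        return value

cache = TinyLruTtlCache(max_entries=2)
cache.set("a", 1, ttl_seconds=60)
cache.set("b", 2, ttl_seconds=60)
cache.get("a")                   # touch "a" so "b" becomes least recent
cache.set("c", 3, ttl_seconds=60)
assert cache.get("b") is None    # "b" was evicted (LRU)
assert cache.get("a") == 1 and cache.get("c") == 3
```

With a real Redis deployment, all of this logic lives server-side: the client simply sets keys with an expiry, and eviction under memory pressure is a configuration choice rather than application code.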

44
Q

How are NoSQL data stores categorized in terms of consistency and availability?

A

Table 4-1 categorizes NoSQL data stores in terms of consistency and availability.

Page 170