Chapter 4, Data Management Patterns Flashcards
What are Data Sources?
Here, data sources are cloud native applications that feed data such as user inputs and sensor readings. They sometimes feed data into data-ingestion systems such as message brokers or, when possible, write directly to data stores. Data-ingestion systems can transfer data as events/messages to other applications or data stores.
160 Figure 4-1. Data architecture for cloud native applications
What do batch-processing systems do?
Batch-processing systems process data from data sources in batches, and write the processed output back to the data stores so it can be used for reporting or exposed via APIs.
161 Figure 4-1. Data architecture for cloud native applications
What are the three main types of data that influence Application behavior?
- Input data
Sent as part of the input message by the user or client. Most commonly, this data is a JSON or XML message, though binary formats such as Protocol Buffers (used by gRPC) and Thrift are gaining traction.
- Configuration data
Provided by the environment as variables. XML was used as the configuration language for a long time, and now YAML configs have become the de facto standard for cloud native applications.
- State data
The data stored by the application itself, regarding its status, based on all messages and events that occurred before the current time. By persisting the state data and loading it on startup, the application will be able to seamlessly resume its functionality upon restart.
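As a minimal sketch of state persistence (the `CounterService` class and the file layout are illustrative, not from the book), a service can write its state after each event and reload it on startup to resume seamlessly:

```python
import json
import os
import tempfile

class CounterService:
    """Toy service whose only state is a message counter."""
    def __init__(self, state_file):
        self.state_file = state_file
        # On startup, load persisted state so processing resumes where it left off.
        if os.path.exists(state_file):
            with open(state_file) as f:
                self.count = json.load(f)["count"]
        else:
            self.count = 0

    def handle_message(self, _msg):
        self.count += 1
        # Persist state after every event so a restart loses nothing.
        with open(self.state_file, "w") as f:
            json.dump({"count": self.count}, f)

path = os.path.join(tempfile.gettempdir(), "counter_state.json")
if os.path.exists(path):
    os.remove(path)

svc = CounterService(path)
svc.handle_message("a")
svc.handle_message("b")

restarted = CounterService(path)   # simulate a restart
print(restarted.count)             # 2
```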
162
What are the three categories of data that Cloud native applications use?
- Structured data
Can fit a predefined schema. For example, the data on a typical user registration form can be comfortably stored in a relational database.
- Semi-structured data
Has some form of structure. For example, each field in a data entry may have a corresponding key or name that we can use to refer to it, but when we take all the entries, there is no guarantee that each entry will have the same number of fields or even common keys. This data can be easily represented through JSON, XML, and YAML formats.
- Unstructured data
Does not contain any meaningful fields. Images, videos, and raw text content are examples. Usually, this data is stored without any understanding of its content.
164
What are ACID properties?
- Atomicity
- Consistency
- Isolation
- Durability
165
Define Atomicity from ACID
atomicity guarantees that all operations within a transaction are executed as a single unit
165
Define Consistency from ACID
consistency ensures that the data is consistent before and after the transaction
165
Define Isolation from ACID
Isolation makes the intermediate state of a transaction invisible to other transactions
165
Define Durability from ACID
Durability guarantees that after a successful transaction, the data is persistent even in the event of a system failure
165
What does the CAP in CAP theorem stand for?
CAP stands for consistency, availability, and partition tolerance. This theorem states that a distributed application can provide either full availability or consistency; we cannot achieve both while providing network partition tolerance. Here, availability means that the system is fully functional when some of its nodes are down, consistency means an update/change in one node is immediately propagated to other nodes, and partition tolerance means that the system can continue to work even when some nodes cannot connect to each other.
169
What are three types of data store?
- Relational
- NoSQL
- Filesystem
172
What are the three techniques in which data can be managed?
- Centralized
- Decentralized
- Hybrid
172
Describe the Data Service Pattern
The Data Service pattern exposes data in the database as a service, referred to as a data service. The data service becomes the owner, responsible for adding and removing data from the data store. The service may perform simple lookups or even encapsulate complex operations when constructing responses for data requests.
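A minimal Python sketch of the idea (the `OrderDataService` name and the in-memory store are hypothetical stand-ins for a real database-backed service):

```python
class OrderDataService:
    """Sole owner of the order store; other services go through these methods."""
    def __init__(self):
        self._store = {}      # stand-in for the underlying database

    def add_order(self, order_id, order):
        self._store[order_id] = order

    def get_order(self, order_id):
        # A simple lookup exposed to consumers.
        return self._store.get(order_id)

    def orders_by_status(self, status):
        # A slightly richer query the service encapsulates on behalf of consumers.
        return [o for o in self._store.values() if o["status"] == status]

svc = OrderDataService()
svc.add_order("o1", {"status": "shipped"})
svc.add_order("o2", {"status": "pending"})
print(svc.orders_by_status("pending"))   # [{'status': 'pending'}]
```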
180
How is the Data Service Pattern used?
This pattern can be used when we need to allow access to data that does not belong to a single microservice, or when we need to abstract legacy/proprietary data stores to other cloud native applications.
181
What are some related patterns to the Data Service pattern?
- Caching pattern
Provides an opportunity to optimize the efficiency of data retrieval by using local or distributed caching when exposing data via a service.
- Performance optimization patterns
Apart from caching data, these execute complex queries such as table joins and running stored procedures directly in the database to improve performance.
- Materialized View pattern
Accessing data via an API can still be performance-intensive. For use cases that need joins to be performed with data that resides in stores belonging to other services, having that data replicated in its local store and building a materialized view can help improve query performance.
- Vault Key pattern
Along with API security, knowing who is accessing the data can help identify the caller and enforce adequate security and data protection.
183
Describe the Composite Data Services Pattern
The Composite Data Services pattern performs data composition by combining data from more than one data service and, when needed, performs fairly complex aggregation to provide a richer and more concise response. This pattern is also called the Server-Side Mashup pattern, as data composition happens at the service and not at the data consumer.
185
How does the Composite Data Services Pattern work?
The Composite Data Services Pattern combines data from various services and its own data store into one composite data service. This pattern not only eliminates the need for multiple microservices to perform data composition operations, but also allows the combined data to be cached for improving performance (Figure 4-11).
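A hedged sketch of such a composition in Python (the stub services and the `CompositeProfileService` name are invented for illustration):

```python
def user_service(user_id):
    # Stand-in for a fine-grained data service owning user records.
    return {"id": user_id, "name": "Alice"}

def order_service(user_id):
    # Stand-in for a fine-grained data service owning orders.
    return [{"order": 1}, {"order": 2}]

class CompositeProfileService:
    """Combines user and order data so clients make a single call."""
    def __init__(self):
        self._cache = {}

    def profile(self, user_id):
        if user_id in self._cache:           # serve the cached composition
            return self._cache[user_id]
        composite = {**user_service(user_id), "orders": order_service(user_id)}
        self._cache[user_id] = composite
        return composite

svc = CompositeProfileService()
print(svc.profile("u1")["name"])   # Alice
```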
185 Figure 4-11. Composite Data Services pattern
How is the Composite Data Services Pattern used in practice?
This pattern can be used when we need to eliminate multiple microservices repeating the same data composition. Data services that are fine-grained force clients to query multiple services to build their desired data. We can use this pattern to reduce duplicate work done by the clients and consolidate it into a common service.
187
What are some considerations when using the Composite Data Services Pattern?
Use this pattern only when the consolidation is generic enough and other microservices will be able to reuse the consolidated data. We do not recommend introducing unnecessary layers of services if they do not provide meaningful data compositions that can be reused. Weigh the benefits of reusability and simplicity of the clients against the additional latency and management complexity added by the service layers.
187
What are some patterns related to The Composite Data Services pattern?
- Caching pattern
Provides an opportunity to optimize the efficiency of data retrieval and helps achieve resiliency by serving data from the cache when backends are not available.
- Client-Side Mashup pattern
Allows the data mashup to happen at the client side, such as in the user’s browser. This can be a good solution when asynchronous data loading is feasible and when meaningful data composition can be performed with partial data.
187
Describe the Client-Side Mashup Pattern
In the Client-Side Mashup pattern, data is retrieved from various services and consolidated at the client side. The client is usually a browser loading data via asynchronous Ajax calls.
188
How does the Client-Side Mashup Pattern work?
This pattern utilizes asynchronous data loading, as shown in Figure 4-12. For example, when a browser using this pattern loads a web page, it renders the parts that arrive first while continuing to load the rest. It uses client-side scripts such as JavaScript to load the content asynchronously in the web browser.
Rather than letting the user wait for a longer time by loading all content on the website at once, this pattern uses multiple asynchronous calls to fetch different parts of the website and renders each fragment when it arrives. These applications are also referred to as rich internet applications (RIAs).
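The same idea can be sketched outside the browser with Python's asyncio (the fragment names and delays are made up; in a real page these would be Ajax calls handled by client-side JavaScript):

```python
import asyncio

async def load_fragment(name, delay):
    await asyncio.sleep(delay)                 # stands in for an Ajax call
    return f"<div>{name}</div>"

async def render_page():
    rendered = []
    # Fire all fragment requests at once and render each fragment as it
    # arrives, instead of blocking on the slowest one.
    tasks = [load_fragment("header", 0.01),
             load_fragment("feed", 0.05),
             load_fragment("ads", 0.03)]
    for coro in asyncio.as_completed(tasks):
        rendered.append(await coro)
    return rendered

result = asyncio.run(render_page())
print(result)
```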
188 Figure 4-12. Client-Side Mashup at a web browser
How is the Client-Side Mashup Pattern used in practice?
This pattern can be used when we need to present available data as soon as possible, while providing more detail later, or when we want to give a perception that the web page is loading much faster.
190
What are some considerations for the Client-Side Mashup Pattern?
Use this pattern only when the partial data loaded first can be presented to the user or used in a meaningful way. We do not advise using this pattern when the retrieved data needs to be combined and transformed with later data via some sort of a join before it can be presented to the user.
191
What are some of the related patterns to the Client-Side Mashup Pattern?
- Composite Data Services pattern
This is useful when content needs to be mashed synchronously and the composite data is common enough to be used by multiple services.
- Caching pattern
Provides an opportunity to cache data to improve the overall latency.
191
When to use the Data Service pattern?
Data is not owned by a single microservice, yet multiple microservices are depending on the data for their operation.
192
When not to use the Data Service pattern?
Data can clearly be associated with an existing microservice, as introducing unnecessary microservices can also cause management complexity.
192
What are the benefits of using the Data Service pattern?
Reduces the coupling between services.
Provides more control/security on the operations that can be performed on the shared data.
192
When to use the Composite Data Services pattern?
Many clients query multiple services to consolidate their desired data, and this consolidation is generic enough to be reused among the clients.
192
When not to use the Composite Data Services pattern?
Only one client needs the consolidation.
Operations performed by clients cannot be generalized to be reused by many clients.
192
What are the benefits of using the Composite Data Services pattern?
Reduces duplicate work done by the clients and consolidates it into a common service.
Provides more data resiliency by using caches or static data.
192
When to use the Client-Side Mashup pattern?
Some meaningful operations can be performed with partial data; for example, rendering nondependent data in web browsers.
192
When not to use the Client-Side Mashup pattern?
Processing, such as a join, is required on the independently retrieved data before sending the response.
192
What are the benefits of using the Client-Side Mashup pattern?
Results in more-responsive applications.
Reduces the wait time.
192
Describe the Data Sharding Pattern
In the Data Sharding pattern, the data store is divided into shards, which allows it to be easily stored and retrieved at scale. The data is partitioned by one or more of its attributes so we can easily identify the shard in which it resides.
193
In what ways can you shard data?
To shard the data, we can use horizontal, vertical, or functional approaches. Let’s look at these three options in detail:
193
Describe Horizontal data sharding
Each shard has the same schema, but contains distinct data records based on its sharding key. A table in a database is split across multiple nodes based on these sharding keys. For example, user orders can be sharded by hashing the order ID into three shards, as depicted in Figure 4-13.
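A small Python sketch of this hashing scheme (the modulo-3 routing and in-memory shard dictionaries are illustrative assumptions):

```python
import hashlib

NUM_SHARDS = 3
shards = [dict() for _ in range(NUM_SHARDS)]   # three nodes holding one table

def shard_for(order_id):
    # Hash the sharding key so records spread evenly across the shards.
    digest = hashlib.sha256(str(order_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def put_order(order_id, order):
    shards[shard_for(order_id)][order_id] = order

def get_order(order_id):
    # The same hash locates the shard again on reads.
    return shards[shard_for(order_id)].get(order_id)

for oid in range(10):
    put_order(oid, {"id": oid})
print(get_order(7))   # {'id': 7}
```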
193 Figure 4-13. Horizontal data sharding using hashing
Describe Vertical data sharding
Each shard does not need to have an identical schema and can contain various data fields. Each shard can contain a set of tables that do not need to be in another shard. This is useful when we need to partition the data based on the frequency of data access; we can put the most frequently accessed data in one shard and move the rest into a different shard. Figure 4-14 depicts how frequently accessed user data is sharded from the other data.
194 Figure 4-14. Vertical data sharding based on frequency of data access
Describe Functional data sharding
Data is partitioned by functional use cases. Rather than keeping all the data together, the data can be segregated in different shards based on different functionalities. This also aligns with the process of segregating functions into separate functional services in the cloud native application architecture. Figure 4-15 shows how product details and reviews are sharded into two data stores.
196 Figure 4-15. Functional data sharding by segregating product details and reviews into two data stores
When using horizontal data sharding, what are the techniques we can deploy to locate where we have stored data?
- Lookup-based data sharding
- Range-based data sharding
- Hash-based data sharding
197
Describe Lookup-based data sharding
A lookup service or distributed cache is used to store the mapping of the shard key and the actual location of the physical data. When retrieving the data, the client application will first check the lookup service to resolve the actual physical location for the intended shard key, and then access the data from that location. If the data gets rebalanced or resharded later, the client has to again look up the updated data location.
197
Describe Range-based data sharding
This special type of sharding approach can be applied when the sharding key has sequential characters. The data is sharded in ranges, and as in lookup-based sharding, a lookup service can be used to determine where the given data range is available. This approach yields the best results for sharding keys based on date and time. A data range of a month, for example, may reside in the same shard, allowing the service to retrieve all the data in one go, rather than querying multiple shards.
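A hedged sketch of range-based sharding in Python (the month-to-shard table standing in for the lookup service is an invented example):

```python
from datetime import date

# Month -> shard mapping plays the role of the lookup service.
RANGES = {1: "shard-a", 2: "shard-a", 3: "shard-b", 4: "shard-b"}
stores = {"shard-a": [], "shard-b": []}

def put_event(day, event):
    stores[RANGES[day.month]].append((day, event))

def events_for_month(month):
    # A whole month lives in one shard, so a single store answers
    # the range query in one go.
    shard = stores[RANGES[month]]
    return [e for d, e in shard if d.month == month]

put_event(date(2024, 1, 5), "a")
put_event(date(2024, 3, 9), "b")
print(events_for_month(3))   # ['b']
```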
197
Describe Hash-based data sharding
Constructing a shard key based on the data fields or dividing the data by date range may not always result in balanced shards. At times we need to distribute the data randomly to generate better-balanced shards. This can be done by using hash-based data sharding, which creates hashes based on the shard key and uses them to determine the shard data location. This approach is not the best when data is queried in ranges, but is ideal when individual records are queried. Here, we can also use a lookup service to store the hash key and the shard location mapping, to facilitate data loading.
197
How is the Data Sharding Pattern used in practice?
This pattern can be used when we can no longer store data in a single node, or when we need data to be distributed so we can access it with lower latency.
198
What are some patterns that are related to the Data Sharding Pattern?
- Materialized View pattern
This can be used to replicate the dependent data of each shard to the local stores of the service, to improve data-querying performance and eliminate multiple lookup calls to data stores or services. This data can be replicated with only eventual consistency, so this approach is useful only if consistency on the dependent data is not business-critical for the applications.
- Data Locality pattern
Having all the relevant data at the shard will allow the creation of indexes and execution of stored procedures for efficient data retrieval.
202
Describe the Command and Query Responsibility Segregation Pattern
The Command and Query Responsibility Segregation (CQRS) pattern separates updates and query operations of a data set, and allows them to run on different data stores. This results in faster data update and retrieval. It also facilitates modeling data to handle multiple use cases, achieves high scalability and security, and allows update and query models to evolve independently with minimal interactions.
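One way to sketch the separation in Python (the order-total domain and the direct notification between the services are assumptions; real systems typically propagate updates through an event stream, making the query model eventually consistent):

```python
class QueryService:
    """Serves reads from its own read-optimized store."""
    def __init__(self):
        self.totals_by_order = {}

    def on_order_placed(self, order_id, total):
        # Updates arrive from the command side; the query model lags
        # the write store until the notification is processed.
        self.totals_by_order[order_id] = total

    def order_total(self, order_id):
        return self.totals_by_order.get(order_id)

class CommandService:
    """Handles updates against a separate write-optimized store."""
    def __init__(self, subscribers):
        self.write_store = {}
        self.subscribers = subscribers

    def place_order(self, order_id, total):
        self.write_store[order_id] = total
        for notify in self.subscribers:      # propagate the change
            notify(order_id, total)

query = QueryService()
command = CommandService([query.on_order_placed])
command.place_order("o1", 30)
print(query.order_total("o1"))   # 30
```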
202
How is the Command and Query Responsibility Segregation Pattern used in practice?
We can use this pattern when we want to use different domain models for commands and queries, and when we need to separate updates and data retrieval for performance and security reasons.
204
What are some related patterns to the Command and Query Responsibility Segregation Pattern?
- Event Sourcing pattern
Allows command services to communicate updates to query services, and allows both command and query models to reside on different data stores. This provides only eventual consistency between command and query models and adds complexity to the system architecture. Chapter 5 covers this pattern in detail.
- Materialized View pattern
Recommended over the CQRS pattern to achieve scalability, when command and query models are simple enough; Materialized View is covered in the next section.
- Data Sharding pattern
Helps scale commands by partitioning the data (as covered previously in this chapter). As query operations can simply be replicated, applying this pattern for queries may not produce any performance benefit.
- API security
Can be applied to enforce security for both command and query services.
206
When to use the Data Sharding pattern?
Data contains one or a collection of fields that uniquely identify the data or meaningfully group the data into subsets.
207
When not to use the Data Sharding pattern?
Shard key cannot produce evenly balanced shards.
The operations performed in the data require the whole set of data to be processed; for example, obtaining a median from the data set.
207
Benefits of using the Data Sharding pattern
Groups shards based on the preferred set of fields that produce the shard key.
Creates geographically optimized shards that can be moved closer to the clients.
Builds hierarchical shards or time-range-based shards to optimize the search time.
Uses secondary indexes to query data by using nonshard keys.
207
When to use the Command and Query Responsibility Segregation (CQRS) pattern?
Applications have performance-intensive update operations with:
- Data validations
- Security validations
- Message transformations
Applications have performance-intensive query operations, such as complex joins or data mapping.
207
When not to use the Command and Query Responsibility Segregation (CQRS) pattern?
High consistency is required between command (update) and query (read).
Command and query models are similar to each other.
207
Benefits of using the Command and Query Responsibility Segregation (CQRS) pattern
Reduces the impact between command and query operations.
Stores command and query data in two different data stores that suit their use cases.
Enforces separated command/query security policies.
Enables different teams to own applications that are responsible for command and query operations.
Provides high availability.
207
Describe the Materialized View Pattern
The Materialized View pattern provides the ability to retrieve data efficiently upon querying, by moving data closer to the execution and prepopulating materialized views. This pattern stores all relevant data of a service in its local data store and formats the data optimally to serve the queries, rather than letting that service call dependent services for data when required.
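A small Python sketch of prepopulating such a view (the product/review data and the average-rating view are invented for illustration):

```python
# Source-of-truth stores owned by other services (assumed layout).
products = {1: {"name": "pen"}, 2: {"name": "book"}}
reviews = [(1, 4), (1, 5), (2, 3)]   # (product_id, rating)

# The materialized view: the join and the average are computed ahead of
# time, so queries become single-key lookups in the local store.
view = {}
for pid, product in products.items():
    ratings = [r for p, r in reviews if p == pid]
    view[pid] = {
        "name": product["name"],
        "avg_rating": sum(ratings) / len(ratings) if ratings else None,
    }

print(view[1])   # {'name': 'pen', 'avg_rating': 4.5}
```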
209
How is the Materialized View Pattern used in practice?
We can use this pattern when we want to improve data-retrieval efficiency by eliminating complex joins and to reduce coupling with dependent services.
211
What are some patterns that are related to the Materialized View Pattern?
- Data Locality pattern
Enables efficient data retrieval by moving the execution closer to the data.
- Composite Data Services pattern
This can be used instead of the Materialized View pattern when data compositions can be done at the service level, or when dependent services have static data that can be cached locally at the service.
- Command and Query Responsibility Segregation (CQRS) pattern
The Materialized View pattern can be used to serve query responses in the CQRS pattern. The command (modifications to the data) will be done through the dependent service, and the query (serving of read requests) can be performed by query services constructing the materialized views.
- Event Sourcing pattern
Provides an approach to replicate data from one source to another. Changes on dependent data are pushed as events through event streams, which are stored sequentially at a reliable log-based event queue such as Kafka, and then the services that serve the data read those event streams and constantly update their local storage to serve updated information. Chapter 5 covers this pattern.
213
When is the Materialized View Pattern used?
This pattern is used when part of the data is available locally and the rest needs to be fetched from external sources that incur high latency.
211
Describe the Data Locality Pattern
The goal of the Data Locality pattern is to move execution closer to the data. This is done by colocating the services with the data or by performing the execution in the data store itself. This allows the execution to access data with fewer limitations, helping to quicken execution, and to reduce bandwidth by sending aggregated results.
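For example, pushing an aggregation into the data store (here Python's built-in sqlite3; the sensor table is an invented example) returns only the small aggregated result instead of shipping every row over the network:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?)",
                 [("s1", 10.0), ("s1", 14.0), ("s2", 7.0)])

# Data locality: the store computes the aggregate next to the data and
# sends back two small rows, not the full readings table.
rows = conn.execute(
    "SELECT sensor, AVG(value) FROM readings GROUP BY sensor ORDER BY sensor"
).fetchall()
print(rows)   # [('s1', 12.0), ('s2', 7.0)]
```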
214
How is the Data Locality Pattern used in practice?
This pattern encourages coupling execution with data to reduce latency and save bandwidth, enabling distributed cloud native applications to operate efficiently over the network.
216
What are some related patterns to the Data Locality Pattern?
- Materialized View pattern
Provides an alternative approach for this pattern, by moving data closer to the place of execution. This pattern is ideal when the data is small or when CPU-intensive operations such as complex joins and data transformations are needed during reads.
- Caching pattern
Complements this pattern by storing preprocessed data and serving it during repeated queries.
218
Define the Caching Pattern
The Caching pattern stores previously processed or retrieved data in memory, and serves this data for similar queries issued in the future. This not only reduces repeated data processing at the services, but also eliminates calls to dependent services when the response is already stored in the service.
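A minimal cache-aside sketch in Python (the `read_from_db` stub and the call counter are illustrative assumptions):

```python
calls = {"db": 0}

def read_from_db(key):
    calls["db"] += 1          # count backend hits to show the cache working
    return f"value-for-{key}"

cache = {}

def get(key):
    # Cache-aside: check the cache first, fall back to the store on a miss,
    # then populate the cache for future queries.
    if key in cache:
        return cache[key]
    value = read_from_db(key)
    cache[key] = value
    return value

get("a")
get("a")
get("b")
print(calls["db"])   # 2: the repeated query for "a" was served from memory
```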
218
How is the Caching Pattern used in practice?
This pattern is usually applied when the same query can be repeatedly called multiple times by one or more clients, especially when we don’t have enough knowledge about what data will be queried next.
220
What are some related patterns to the Caching Pattern?
- Data Sharding pattern
Enables the cache to be scaled similarly to the way we can scale data stores. This also enables distributing data geographically so relevant data in the cache can be closer to the services that operate on it.
- Resilient Connectivity pattern
Provides a mechanism to serve requests from the data sources when data is not available in the cache. Chapter 3 discusses this pattern.
- Data Service pattern
Along with API security, can be used to provide a service layer for distributed caches, providing more business-centric APIs for data consumers.
- Vault Key pattern
Provides the capability to secure the caches by using access tokens, enabling third parties to access the data directly from caches. This can be used only if the caching systems support this functionality. Otherwise, we need to fall back on using the Data Service pattern with API security.
- Event Sourcing pattern
Propagates cache-invalidation requests to all local caches. This enables eventual consistency of cache data and reduces the chance of data being obsolete as data sources are updated by multiple services. Chapter 5 details this pattern.
229
Describe the Static Content Hosting Pattern
The Static Content Hosting pattern deploys static content in data stores that are closer to clients so content can be delivered directly to the client with low latency and without consuming excess computational resources.
230
How is the Static Content Hosting Pattern used in practice?
This pattern is used when we need to quickly deliver static content to clients with low response time, and when we need to reduce the load on rendering services.
232
What are some related patterns to the Static Content Hosting Pattern?
- Data Sharding pattern
Can be used to shard data when you have a lot of static data.
- Caching pattern
Caches content for faster data access. The cache expiration based on time-out is not necessary, as static data will not become outdated.
- Vault Key pattern
Provides security to systems hosting static content.
- Data Service pattern
Along with API security, provides a service layer on top of the content to control data access.
233
When to use the Materialized View pattern?
Part of the data is available locally, and the rest of the data needs to be fetched from external sources that incur high latency.
The data that needs to be moved is small and rarely updated.
Provides access to nonsensitive data that is hosted in secure systems.
234
When not to use the Materialized View pattern?
Data can be retrieved from dependent services with low latency.
Data in the dependent services is changing quickly.
Consistency of the data is considered important for the response.
234
Benefits of using the Materialized View pattern
Can store the data in any database that is suitable for the application.
Increases resiliency of the service by replicating the data to local stores.
234
When to use the Data Locality pattern?
To read data from multiple data sources and perform a join or data aggregation in memory.
The data stores are huge, and the clients are geographically distributed.
234
When not to use the Data Locality pattern?
Queries output most of their input.
Additional execution cost incurred at the data nodes is higher than the cost of data transfer over the network.
234
Benefits of using the Data Locality pattern
Reduces network bandwidth utilization and data-retrieval latency.
Better utilizes CPU resources and optimizes overall performance.
Caches results and serves requests more efficiently.
234
When to use the Caching pattern?
Best for static data or data that is read more frequently than it is updated.
Application has the same query that can be repeatedly called multiple times by one or more clients, especially when we do not have enough knowledge about what data will be queried next.
The data store is subject to a high level of contention or cannot handle the number of concurrent requests it is receiving from multiple clients.
234
When not to use the Caching pattern?
The data is updated frequently.
As the means of storing application state, since the cache should not be considered the single source of truth.
The data is critical, and the system cannot tolerate data inconsistencies.
234
Benefits of using the Caching pattern
Can choose which part of the data to cache to improve performance.
Using a cache aside improves performance by reducing redundant computations.
Can preload static data into the cache.
Combined with eviction policy, the cache can hold the recent/required data.
234
When to use the Static Content Hosting pattern?
All or some of the data requested by the client is static.
The static data needs to be available in multiple environments or geographic locations.
234
When not to use the Static Content Hosting pattern?
The static content needs to be updated before delivering to the clients, such as adding the access time and location.
The amount of data that needs to be served is small.
Clients cannot retrieve and combine static and dynamic content together.
234
Benefits of using the Static Content Hosting pattern
Geographically partitioning and storing closer to clients provides shorter response times and faster access/download speed.
Reduces resource utilization on rendering services.
234
Describe the Transaction Pattern
The Transaction pattern uses transactions to perform a set of operations as a single unit of work, so all operations are completed or undone as a unit. This helps maintain the integrity of the data, and error-proofs execution of services. This is critical for the successful execution of financial applications.
236
How does the Transaction pattern work?
This pattern wraps multiple individual operations into a single large operation, providing a guarantee that either all operations or no operation will succeed. All transactions follow these steps:
1 - System initiates a transaction.
2 - Various data manipulation operations are executed.
3 - Commit is used to indicate the end of the transaction.
4 - If there are no errors, the commit will succeed, the transaction will finish successfully, and the changes will be reflected in the data stores.
If there are errors, all the operations in the transaction will be rolled back, and the transaction will fail. No changes will be reflected in the data stores.
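The steps above can be sketched with Python's built-in sqlite3, whose connection context manager opens a transaction and commits on success or rolls back on error (the account-transfer example is invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INT)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 0)])
conn.commit()

try:
    with conn:  # step 1: a transaction is initiated
        # step 2: data manipulation operations are executed
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'a'")
        conn.execute("UPDATE accounts SET balance = balance + 50 WHERE name = 'b'")
        # simulate an error before the commit (steps 3-4 would otherwise succeed)
        raise RuntimeError("failure before commit")
except RuntimeError:
    pass  # the context manager rolled the transaction back

# Both updates were undone as a unit: no partial transfer is visible.
print(conn.execute("SELECT balance FROM accounts ORDER BY name").fetchall())
# [(100,), (0,)]
```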
236
What are the ACID properties?
- Atomic
All operations must occur at once, or none should occur.
- Consistent
Before and after the transaction, the system will be in a valid state.
- Isolation
The results produced by concurrent transactions will be identical to those transactions being executed in sequential order.
- Durable
When the transaction is finished, the committed changes will remain committed even during system failures.
237
How is the Transaction pattern used in practice?
Transactions can be used to combine multiple operations as a single unit of work, and to coordinate the operations of multiple systems.
238
What are some related patterns to the Transaction pattern?
The Transaction pattern has one related pattern, the Saga pattern. This pattern, covered in Chapter 3, reliably coordinates execution of multiple systems.
241
When to use the Transaction pattern?
An operation contains multiple steps, and all the steps should be processed atomically to consider the operation valid.
241
When not to use the Transaction pattern?
The application has only a single step in the operation.
The application has multiple steps, and failure of some steps is considered acceptable.
241
What are the benefits of using the Transaction pattern?
Adheres to ACID properties.
Processes multiple independent transactions.
241
Describe the Vault Key Pattern
The Vault Key pattern provides direct access to data stores via a trusted token, commonly named the vault key. Some of the popular cloud data stores support this functionality.
242
How does the Vault Key Pattern work?
The Vault Key pattern is based on a trusted token being presented by the client and being validated by the data store. In this pattern, the application determines who can access which part of the data.
242 Figure 4-25. Actions performed by clients to retrieve data in the Vault Key pattern
How is the Vault Key Pattern used in practice?
This pattern can be used when the data store cannot reach the identity provider to authenticate and authorize the client upon data access. In this pattern, the data store will contain the certificate of the identity provider, so it will be able to decrypt the token and validate its authenticity without calling the identity provider. Because it does not need to make remote service calls for validation, it can also perform authentication operations with minimal latency.
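A hedged sketch of local token validation in Python (HMAC with a shared secret is used here for brevity; as the text notes, real data stores typically validate tokens against the identity provider's certificate, i.e., asymmetrically):

```python
import base64
import hashlib
import hmac
import json

# Key material the data store holds so it can validate tokens locally,
# without a remote call to the identity provider.
SECRET = b"demo-signing-key"

def issue_key(claims):
    # The identity provider signs the claims and hands the token to the client.
    body = base64.urlsafe_b64encode(json.dumps(claims).encode()).decode()
    sig = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    return f"{body}.{sig}"

def validate_key(token):
    # The data store checks the signature itself, with no remote calls.
    body, sig = token.rsplit(".", 1)
    expected = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None                        # reject tampered tokens
    return json.loads(base64.urlsafe_b64decode(body))

token = issue_key({"path": "/orders", "op": "read"})
print(validate_key(token))                 # {'path': '/orders', 'op': 'read'}
print(validate_key(token + "x"))           # None
```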
243
What are some related patterns to the Vault Key Pattern?
The Vault Key pattern has one related pattern, Data Service (covered at the start of this chapter). Along with API security, the Data Service pattern provides an alternative approach for providing security when the Vault Key pattern is not feasible.
244
When to use the Vault Key Pattern?
To securely access remote data with minimal latency.
The store has a limited computing capability to perform service calls for authentication and authorization.
244
When not to use the Vault Key Pattern?
Need fine-grained data protection.
Need to restrict what queries should be executed on the data store with high precision.
The exposed data store cannot validate access based on keys.
244
What are the benefits of using the Vault Key Pattern?
Accesses data stores directly by using a trusted token, a vault key
Has minimal operational costs compared to calling the central identity service for validation
244
When to use Relational database management system (RDBMS)?
- Need transactions and ACID properties.
- Interrelationship with data is required to be maintained.
- Working with small to medium amounts of data.
252
When not to use Relational database management system (RDBMS)?
- Data needs to be highly scalable, such as IoT data.
- Working with XML, JSON, and binary data format.
- Solution cannot tolerate some level of unavailability.
252
When to use Apache Cassandra?
- Need high availability.
- Need scalability.
- Need a decentralized solution.
- Need faster writes than reads.
- Read access can be mostly performed by partition key.
252
When not to use Apache Cassandra?
- Existing data is updated frequently.
- Need to access data by columns that are not part of the partition key.
- Require relational features, such as transactions, complex joins, and ACID properties.
252
When to use Apache HBase?
- Need consistency.
- Need scalability.
- Need a decentralized solution.
- Need high read performance.
- Need both random and real-time access to data.
- Need to store petabytes of data.
252
When not to use Apache HBase?
- Solution cannot tolerate some level of unavailability.
- Existing data is updated very frequently.
- Require relational features, such as transactions, complex joins, and ACID properties.
252
When to use MongoDB?
- Need consistency.
- Need a decentralized solution.
- Need a document store.
- Need data lookup based on multiple keys.
- Need high write performance.
252
When not to use MongoDB?
- Solution cannot tolerate some level of unavailability.
- Require relational features, such as transactions, complex joins, and ACID properties.
252
When to use Redis?
- Need scalability.
- Need an in-memory database.
- Need a persistent option to restore the data.
- As a cache, queue, and real-time storage.
252
When not to use Redis?
- As a typical database to store and query with complex operations.
252
When to use Amazon DynamoDB?
- Need a highly scalable solution.
- Need a document store.
- Need a key-value store.
- Need high write performance.
- Fine-grained access control.
252
When not to use Amazon DynamoDB?
- Use in platforms other than AWS.
- Require relational features, such as complex joins, and foreign keys.
252
When to use Apache HDFS?
- Need a filesystem.
- Store large files.
- Store data once and read it multiple times.
- Perform MapReduce operation on files.
- Need scalability.
- Need data resiliency.
252
When not to use Apache HDFS?
- Store small files.
- Need to update files.
- Need to perform random data reads.
252
When to use Amazon S3?
- Need an object store.
- Perform MapReduce operations on objects.
- Need a highly scalable solution.
- Read part of the object data.
- Fine-grained access control.
252
When not to use Amazon S3?
- Use in platforms other than AWS.
- Need to run complex queries.
252
When to use Azure Cosmos DB?
- Need a highly scalable solution.
- Need a document store.
- Need a key-value store.
- Need a graph store.
- Need a column store.
- Fine-grained access control.
- Connectivity via MongoDB and Cassandra clients.
252
When not to use Azure Cosmos DB?
- Use in platforms other than Azure.
- Perform transactions across data partitions.
252
When to use Google Cloud Spanner?
- Need a highly scalable solution.
- Need a relational store.
- Need support for SQL query processing.
- Need transaction support across all nodes in the cluster.
252
When not to use Google Cloud Spanner?
- Use in platforms other than Google Cloud.
- Need support for the full ANSI SQL spec.
252
We can use test data stores to test data-service interactions. Though data services can have complex or simple logic, they can still cause bottlenecks in production. What are useful recommendations for overcoming these issues?
- Tests should be performed with both clean and prepopulated data stores, as the former will test for data initialization code and the latter will test for data consistency during operation.
- Test all data store types and versions that will be used in production to eliminate any surprises. We can implement test data stores as Docker instances that will help run tests in multiple environments with quick startup and proper cleanup after the test.
- Test data mapping and make sure all fields are properly mapped when calling the data store.
- Validate whether the service is performing inserts, writes, deletions, and updates on the data stores in an expected manner by checking the state of the data store via test clients that can access the database directly.
- Validate that relational constraints, triggers, and stored procedures are producing correct results.
In addition, it is important to do a load test on the data service along with the data store in a production-like environment with multiple clients. This will help identify any database lock, data consistency, or other performance-related bottlenecks present in the cloud native application. It will also show how much load the application can handle and how that will be affected when various data scaling patterns and techniques are deployed.
254
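A minimal sketch of the first recommendation (testing against both clean and prepopulated stores), using an in-memory SQLite store as a stand-in for the real data store; the table, rows, and function names are hypothetical. In practice, the same test suite would be pointed at Dockerized instances of each production store type and version.

```python
import sqlite3

SCHEMA = "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)"

def new_store(seed_rows=()):
    """Spin up a throwaway store; no rows gives a clean store,
    seed rows give a prepopulated one."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany("INSERT INTO users VALUES (?, ?)", seed_rows)
    return conn

def register_user(conn, user_id, email):
    """The data-service operation under test."""
    conn.execute("INSERT INTO users VALUES (?, ?)", (user_id, email))

# Clean store: exercises the data initialization code paths.
clean = new_store()
register_user(clean, 1, "a@example.com")
assert clean.execute("SELECT COUNT(*) FROM users").fetchone()[0] == 1

# Prepopulated store: exercises consistency against existing data.
seeded = new_store([(1, "a@example.com")])
try:
    register_user(seeded, 1, "dup@example.com")  # violates the primary key
except sqlite3.IntegrityError:
    print("duplicate insert correctly rejected")
```

The test client queries the store directly, independently of the service code, which is how the state-validation recommendation above is usually carried out.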
Describe Observability and Monitoring
Observability and monitoring help us identify the performance of data stores and take corrective actions when they deviate because of load or changes to the application. In most applications, incoming requests interact with the data stores. Any performance or availability issues in the data store will resonate across all layers of the system, affecting the overall user experience.
257
What are some key metrics to observe in a data store?
- Application metrics:
  - Data store uptime/health: To identify whether each node in the data store is up and running.
  - Query execution time: Five types of issues can cause high query execution times:
    - Inefficient queries: Use of nonoptimized queries, including multiple complex joins and tables that are not properly indexed.
    - Data growth in the data store: Data stores containing more data than they can handle.
    - Concurrency: Concurrent operations on the same table/row, locking data stores and impacting their performance.
    - Lack of system resources such as CPU/memory/disk space: Data store nodes not having enough resources to serve requests efficiently.
    - Unavailability of a dependent system or replica: In distributed data stores, when a replica or another dependent system, such as a lookup service, is unavailable, requests may take more time because the store needs to provision a new instance or discover and route the request to another instance.
  - Query execution response: Whether the query execution is successful. If a query is failing, we may need to look at the logs for more detail (depending on the failure).
  - Audit of the query operations: Malicious queries or user operations can cause an unexpected reduction in data store performance. We can use audit logs to identify and mitigate them.
- System metrics: To identify a lack of system resources for efficient processing via CPU consumption, memory consumption, availability of disk space, network utilization, and disk I/O speed.
- Data store logs.
- Time taken and throughput when communicating with primary and replicas: Helps to identify networking issues and bad data store nodes.
When analyzing metrics, we can use percentiles to compare historical and current behaviors. This can identify anomalies and deviations, so we can quickly identify the root cause of the problem.
257
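The percentile comparison mentioned above can be sketched as follows. The 1.5× threshold and the function names are illustrative assumptions, not a prescribed rule; monitoring systems typically track these percentiles continuously rather than over two fixed windows.

```python
import statistics

def latency_percentiles(samples_ms):
    """p50/p95/p99 of query execution times, in milliseconds."""
    qs = statistics.quantiles(samples_ms, n=100)  # 99 cut points: p1..p99
    return {"p50": qs[49], "p95": qs[94], "p99": qs[98]}

def is_anomalous(current, baseline, threshold=1.5):
    """Flag a deviation if the current p99 drifts well past the historical p99."""
    return current["p99"] > threshold * baseline["p99"]

# Historical window: steady latencies around 9-14 ms.
baseline = latency_percentiles([10, 12, 11, 13, 9, 10, 14, 12, 11, 10] * 10)
# Current window: a chunk of queries now taking 55-61 ms.
current = latency_percentiles([10, 12, 55, 60, 9, 58, 14, 61, 11, 57] * 10)

print(is_anomalous(current, baseline))  # True: the tail latency has spiked
```

Comparing percentiles rather than averages makes tail-latency regressions visible even when the median stays flat, which is why they suit root-cause analysis.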
What are some steps and key considerations for deploying and managing data stores?
1 - Select data store types. Select the data store type (relational, NoSQL, or filesystem) and its vendor to match our use case.
2 - Configure the deployment pattern. This can be influenced by the patterns applied in the cloud native application and the type of data store we have selected. Based on this selection, high availability and scalability should be determined by answering the following questions:
1 - Who are the clients?
2 - How many nodes?
3 - Are we going to use a data store managed by the cloud vendor or deploy our own?
4 - How does the replication work?
5 - How do we back up the data?
6 - How does it handle disaster recovery?
7 - How do we secure the data store?
8 - How do we monitor the data store?
9 - How much does the data store/management cost?
3 - Enforce security. Data stores should be protected because they contain business-critical information. This can be enforced by applying relevant physical and software security as discussed in the preceding section. This may include enabling strict access control, data encryption, and use of audit logs.
4 - Set up observability and monitoring. Like microservices, data stores should be configured with observability and monitoring tools to guarantee continuous operation. This can provide early insights on possible scaling problems, such as a requirement to rebalance data shards, or to apply a different design pattern altogether to improve scalability and performance of the application.
5 - Automate continuous delivery. When it comes to data stores, automation and continuous delivery are not straightforward. Although we can easily come up with an initial data store schema, maintaining backward compatibility is difficult as the application evolves. Backward compatibility is critical; without it, we will not be able to achieve smooth application updates and rollbacks during failures. To improve productivity, we should always use proper automation tools, such as scripts, to automate continuous delivery. We also recommend having guardrails and using multiple deployment environments, such as development and staging/preproduction, to reduce the impact of changes and to validate the application before moving it to production.
By following these steps, we can safely deploy and maintain cloud native applications while allowing rapid innovation and adoption to other systems.
260
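The backward-compatibility concern in step 5 can be illustrated with a small sketch (the schema and values are hypothetical, and SQLite stands in for the production store): adding a nullable column with a default is an additive migration, so application versions that predate the change keep working unmodified, which is what makes smooth updates and rollbacks possible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
conn.execute("INSERT INTO orders VALUES (1, 9.99)")

# Backward-compatible migration: add a column with a default, so old
# application versions that don't know about it keep working, and
# existing rows read back with the default value.
conn.execute("ALTER TABLE orders ADD COLUMN currency TEXT DEFAULT 'USD'")

# Old code path: still valid, simply ignores the new column.
conn.execute("INSERT INTO orders (id, total) VALUES (2, 5.00)")
# New code path: writes the new column explicitly.
conn.execute("INSERT INTO orders (id, total, currency) VALUES (3, 4.00, 'EUR')")

rows = conn.execute("SELECT id, total, currency FROM orders ORDER BY id").fetchall()
print(rows)  # [(1, 9.99, 'USD'), (2, 5.0, 'USD'), (3, 4.0, 'EUR')]
```

By contrast, renaming or dropping a column in place would break the old code path immediately, which is why such changes are usually staged across multiple releases.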