Chapter 4, Data Management Patterns Flashcards
What are Data Sources?
Here, data sources are cloud native applications that feed data such as user inputs and sensor readings. They sometimes feed data into data-ingestion systems such as message brokers or, when possible, write directly to data stores. Data-ingestion systems can transfer data as events/messages to other applications or data stores.
160 Figure 4-1. Data architecture for cloud native applications
What do batch-processing systems do?
Batch-processing systems process data from data sources in batches, and write the processed output back to the data stores so it can be used for reporting or exposed via APIs.
161 Figure 4-1. Data architecture for cloud native applications
What are the three main types of data that influence Application behavior?
- Input data
Sent as part of the input message by the user or client. Most commonly, this data is a JSON or XML message, though binary formats such as Protocol Buffers (used by gRPC) and Thrift are gaining traction.
- Configuration data
Provided by the environment as variables. XML was used as the configuration language for a long time, and now YAML configs have become the de facto standard for cloud native applications.
- State data
The data stored by the application itself, regarding its status, based on all messages and events that occurred before the current time. By persisting the state data and loading it on startup, the application will be able to seamlessly resume its functionality upon restart.
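As a minimal sketch of state persistence (the `CounterService` class and the file layout are illustrative, not from the book), a service can write its state after each event and reload it on startup to resume seamlessly:

```python
import json
import os
import tempfile

class CounterService:
    """Toy service whose only state is a message counter."""
    def __init__(self, state_file):
        self.state_file = state_file
        # On startup, load persisted state so processing resumes where it left off.
        if os.path.exists(state_file):
            with open(state_file) as f:
                self.count = json.load(f)["count"]
        else:
            self.count = 0

    def handle_message(self, _msg):
        self.count += 1
        # Persist state after every event so a restart loses nothing.
        with open(self.state_file, "w") as f:
            json.dump({"count": self.count}, f)

path = os.path.join(tempfile.gettempdir(), "counter_state.json")
if os.path.exists(path):
    os.remove(path)

svc = CounterService(path)
svc.handle_message("a")
svc.handle_message("b")

restarted = CounterService(path)   # simulate a restart
print(restarted.count)             # 2
```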
162
What are the three categories of data that Cloud native applications use?
- Structured data
Can fit a predefined schema. For example, the data on a typical user registration form can be comfortably stored in a relational database.
- Semi-structured data
Has some form of structure. For example, each field in a data entry may have a corresponding key or name that we can use to refer to it, but when we take all the entries, there is no guarantee that each entry will have the same number of fields or even common keys. This data can be easily represented through JSON, XML, and YAML formats.
- Unstructured data
Does not contain any meaningful fields. Images, videos, and raw text content are examples. Usually, this data is stored without any understanding of its content.
164
What are ACID properties?
- Atomicity
- Consistency
- Isolation
- Durability
165
Define Atomicity from ACID
atomicity guarantees that all operations within a transaction are executed as a single unit
165
Define Consistency from ACID
consistency ensures that the data is consistent before and after the transaction
165
Define Isolation from ACID
Isolation makes the intermediate state of a transaction invisible to other transactions
165
Define Durability from ACID
Durability guarantees that after a successful transaction, the data is persistent even in the event of a system failure
165
What does the CAP in CAP theorem stand for?
CAP stands for consistency, availability, and partition tolerance. This theorem states that a distributed application can provide either full availability or consistency; we cannot achieve both while providing network partition tolerance. Here, availability means that the system is fully functional when some of its nodes are down, consistency means an update/change in one node is immediately propagated to other nodes, and partition tolerance means that the system can continue to work even when some nodes cannot connect to each other.
169
What are three types of data store?
- Relational
- NoSQL
- Filesystem
172
What are the three techniques in which data can be managed?
- Centralized
- Decentralized
- Hybrid
172
Describe the Data Service Pattern
The Data Service pattern exposes data in the database as a service, referred to as a data service. The data service becomes the owner, responsible for adding and removing data from the data store. The service may perform simple lookups or even encapsulate complex operations when constructing responses for data requests.
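A minimal Python sketch of the idea (the `OrderDataService` name and the in-memory store are hypothetical stand-ins for a real database-backed service):

```python
class OrderDataService:
    """Sole owner of the order store; other services go through these methods."""
    def __init__(self):
        self._store = {}      # stand-in for the underlying database

    def add_order(self, order_id, order):
        self._store[order_id] = order

    def get_order(self, order_id):
        # A simple lookup exposed to consumers.
        return self._store.get(order_id)

    def orders_by_status(self, status):
        # A slightly richer query the service encapsulates on behalf of consumers.
        return [o for o in self._store.values() if o["status"] == status]

svc = OrderDataService()
svc.add_order("o1", {"status": "shipped"})
svc.add_order("o2", {"status": "pending"})
print(svc.orders_by_status("pending"))   # [{'status': 'pending'}]
```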
180
How is the Data Service Pattern used?
This pattern can be used when we need to allow access to data that does not belong to a single microservice, or when we need to abstract legacy/proprietary data stores to other cloud native applications.
181
What are some related patterns to the Data Service pattern?
- Caching pattern
Provides an opportunity to optimize the efficiency of data retrieval by using local or distributed caching when exposing data via a service.
- Performance optimization patterns
Apart from caching data, these execute complex queries such as table joins and running stored procedures directly in the database to improve performance.
- Materialized View pattern
Accessing data via an API can still be performance-intensive. For use cases that need joins to be performed with data that resides in stores belonging to other services, having that data replicated in its local store and building a materialized view can help improve query performance.
- Vault Key pattern
Along with API security, knowing who is accessing the data can help identify the caller and enforce adequate security and data protection.
183
Describe the Composite Data Services Pattern
The Composite Data Services pattern performs data composition by combining data from more than one data service and, when needed, performs fairly complex aggregation to provide a richer and more concise response. This pattern is also called the Server-Side Mashup pattern, as data composition happens at the service and not at the data consumer.
185
How does the Composite Data Services Pattern work?
The Composite Data Services Pattern combines data from various services and its own data store into one composite data service. This pattern not only eliminates the need for multiple microservices to perform data composition operations, but also allows the combined data to be cached for improving performance (Figure 4-11).
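A hedged sketch of such a composition in Python (the stub services and the `CompositeProfileService` name are invented for illustration):

```python
def user_service(user_id):
    # Stand-in for a fine-grained data service owning user records.
    return {"id": user_id, "name": "Alice"}

def order_service(user_id):
    # Stand-in for a fine-grained data service owning orders.
    return [{"order": 1}, {"order": 2}]

class CompositeProfileService:
    """Combines user and order data so clients make a single call."""
    def __init__(self):
        self._cache = {}

    def profile(self, user_id):
        if user_id in self._cache:           # serve the cached composition
            return self._cache[user_id]
        composite = {**user_service(user_id), "orders": order_service(user_id)}
        self._cache[user_id] = composite
        return composite

svc = CompositeProfileService()
print(svc.profile("u1")["name"])   # Alice
```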
185 Figure 4-11. Composite Data Services pattern
How is the Composite Data Services Pattern used in practice?
This pattern can be used when we need to eliminate multiple microservices repeating the same data composition. Data services that are fine-grained force clients to query multiple services to build their desired data. We can use this pattern to reduce duplicate work done by the clients and consolidate it into a common service.
187
What are some considerations when using the Composite Data Services Pattern?
Use this pattern only when the consolidation is generic enough and other microservices will be able to reuse the consolidated data. We do not recommend introducing unnecessary layers of services if they do not provide meaningful data compositions that can be reused. Weigh the benefits of reusability and simplicity of the clients against the additional latency and management complexity added by the service layers.
187
What are some patterns related to The Composite Data Services pattern?
- Caching pattern
Provides an opportunity to optimize the efficiency of data retrieval and helps achieve resiliency by serving data from the cache when backends are not available.
- Client-Side Mashup pattern
Allows the data mashup to happen at the client side, such as in the user’s browser. This can be a good solution when asynchronous data loading is feasible and when meaningful data composition can be performed with partial data.
187
Describe the Client-Side Mashup Pattern
In the Client-Side Mashup pattern, data is retrieved from various services and consolidated at the client side. The client is usually a browser loading data via asynchronous Ajax calls.
188
How does the Client-Side Mashup Pattern work?
This pattern utilizes asynchronous data loading, as shown in Figure 4-12. For example, when a browser using this pattern loads a web page, it renders the parts that arrive first while continuing to load the rest. It uses client-side scripts such as JavaScript to load the content asynchronously in the web browser.
Rather than letting the user wait for a longer time by loading all content on the website at once, this pattern uses multiple asynchronous calls to fetch different parts of the website and renders each fragment when it arrives. These applications are also referred to as rich internet applications (RIAs).
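The same idea can be sketched outside the browser with Python's asyncio (the fragment names and delays are made up; in a real page these would be Ajax calls handled by client-side JavaScript):

```python
import asyncio

async def load_fragment(name, delay):
    await asyncio.sleep(delay)                 # stands in for an Ajax call
    return f"<div>{name}</div>"

async def render_page():
    rendered = []
    # Fire all fragment requests at once and render each fragment as it
    # arrives, instead of blocking on the slowest one.
    tasks = [load_fragment("header", 0.01),
             load_fragment("feed", 0.05),
             load_fragment("ads", 0.03)]
    for coro in asyncio.as_completed(tasks):
        rendered.append(await coro)
    return rendered

result = asyncio.run(render_page())
print(result)
```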
188 Figure 4-12. Client-Side Mashup at a web browser
How is the Client-Side Mashup Pattern used in practice?
This pattern can be used when we need to present available data as soon as possible, while providing more detail later, or when we want to give a perception that the web page is loading much faster.
190
What are some considerations for the Client-Side Mashup Pattern?
Use this pattern only when the partial data loaded first can be presented to the user or used in a meaningful way. We do not advise using this pattern when the retrieved data needs to be combined and transformed with later data via some sort of a join before it can be presented to the user.
191
What are some of the related patterns to the Client-Side Mashup Pattern?
- Composite Data Services pattern
This is useful when content needs to be mashed synchronously and the composite data is common enough to be used by multiple services.
- Caching pattern
Provides an opportunity to cache data to improve the overall latency.
191
When to use the Data Service pattern?
Data is not owned by a single microservice, yet multiple microservices are depending on the data for their operation.
192
When not to use the Data Service pattern?
Data can clearly be associated with an existing microservice, as introducing unnecessary microservices can also cause management complexity.
192
What are the benefits of using the Data Service pattern?
Reduces the coupling between services.
Provides more control/security on the operations that can be performed on the shared data.
192
When to use the Composite Data Services pattern?
Many clients query multiple services to consolidate their desired data, and this consolidation is generic enough to be reused among the clients.
192
When not to use the Composite Data Services pattern?
Only one client needs the consolidation.
Operations performed by clients cannot be generalized to be reused by many clients.
192
What are the benefits of using the Composite Data Services pattern?
Reduces duplicate work done by the clients and consolidates it into a common service.
Provides more data resiliency by using caches or static data.
192
When to use the Client-Side Mashup pattern?
Some meaningful operations can be performed with partial data; for example, rendering nondependent data in web browsers.
192
When not to use the Client-Side Mashup pattern?
Processing, such as a join, is required on the independently retrieved data before sending the response.
192
What are the benefits of using the Client-Side Mashup pattern?
Results in more-responsive applications.
Reduces the wait time.
192
Describe the Data Sharding Pattern
In the Data Sharding pattern, the data store is divided into shards, which allows it to be easily stored and retrieved at scale. The data is partitioned by one or more of its attributes so we can easily identify the shard in which it resides.
193
In what ways can you shard data?
To shard the data, we can use horizontal, vertical, or functional approaches. Let’s look at these three options in detail:
193
Describe Horizontal data sharding
Each shard has the same schema, but contains distinct data records based on its sharding key. A table in a database is split across multiple nodes based on these sharding keys. For example, user orders can be sharded by hashing the order ID into three shards, as depicted in Figure 4-13.
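A small Python sketch of this hashing scheme (the modulo-3 routing and in-memory shard dictionaries are illustrative assumptions):

```python
import hashlib

NUM_SHARDS = 3
shards = [dict() for _ in range(NUM_SHARDS)]   # three nodes holding one table

def shard_for(order_id):
    # Hash the sharding key so records spread evenly across the shards.
    digest = hashlib.sha256(str(order_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def put_order(order_id, order):
    shards[shard_for(order_id)][order_id] = order

def get_order(order_id):
    # The same hash locates the shard again on reads.
    return shards[shard_for(order_id)].get(order_id)

for oid in range(10):
    put_order(oid, {"id": oid})
print(get_order(7))   # {'id': 7}
```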
193 Figure 4-13. Horizontal data sharding using hashing
Describe Vertical data sharding
Each shard does not need to have an identical schema and can contain various data fields. Each shard can contain a set of tables that do not need to be in another shard. This is useful when we need to partition the data based on the frequency of data access; we can put the most frequently accessed data in one shard and move the rest into a different shard. Figure 4-14 depicts how frequently accessed user data is sharded from the other data.
194 Figure 4-14. Vertical data sharding based on frequency of data access
Describe Functional data sharding
Data is partitioned by functional use cases. Rather than keeping all the data together, the data can be segregated in different shards based on different functionalities. This also aligns with the process of segregating functions into separate functional services in the cloud native application architecture. Figure 4-15 shows how product details and reviews are sharded into two data stores.
196 Figure 4-15. Functional data sharding by segregating product details and reviews into two data stores
When using horizontal data sharding, what are the techniques we can deploy to locate where we have stored data?
- Lookup-based data sharding
- Range-based data sharding
- Hash-based data sharding
197
Describe Lookup-based data sharding
A lookup service or distributed cache is used to store the mapping of the shard key and the actual location of the physical data. When retrieving the data, the client application will first check the lookup service to resolve the actual physical location for the intended shard key, and then access the data from that location. If the data gets rebalanced or resharded later, the client has to again look up the updated data location.
197
Describe Range-based data sharding
This special type of sharding approach can be applied when the sharding key has sequential characters. The data is sharded in ranges, and as in lookup-based sharding, a lookup service can be used to determine where the given data range is available. This approach yields the best results for sharding keys based on date and time. A data range of a month, for example, may reside in the same shard, allowing the service to retrieve all the data in one go, rather than querying multiple shards.
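A hedged sketch of range-based sharding in Python (the month-to-shard table standing in for the lookup service is an invented example):

```python
from datetime import date

# Month -> shard mapping plays the role of the lookup service.
RANGES = {1: "shard-a", 2: "shard-a", 3: "shard-b", 4: "shard-b"}
stores = {"shard-a": [], "shard-b": []}

def put_event(day, event):
    stores[RANGES[day.month]].append((day, event))

def events_for_month(month):
    # A whole month lives in one shard, so a single store answers
    # the range query in one go.
    shard = stores[RANGES[month]]
    return [e for d, e in shard if d.month == month]

put_event(date(2024, 1, 5), "a")
put_event(date(2024, 3, 9), "b")
print(events_for_month(3))   # ['b']
```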
197
Describe Hash-based data sharding
Constructing a shard key based on the data fields or dividing the data by date range may not always result in balanced shards. At times we need to distribute the data randomly to generate better-balanced shards. This can be done by using hash-based data sharding, which creates hashes based on the shard key and uses them to determine the shard data location. This approach is not the best when data is queried in ranges, but is ideal when individual records are queried. Here, we can also use a lookup service to store the hash key and the shard location mapping, to facilitate data loading.
197
How is the Data Sharding Pattern used in practice?
This pattern can be used when we can no longer store data in a single node, or when we need data to be distributed so we can access it with lower latency.
198
What are some patterns that are related to the Data Sharding Pattern?
- Materialized View pattern
This can be used to replicate the dependent data of each shard to the local stores of the service, to improve data-querying performance and eliminate multiple lookup calls to data stores or services. This data can be replicated with only eventual consistency, so this approach is useful only if consistency on the dependent data is not business-critical for the applications.
- Data Locality pattern
Having all the relevant data at the shard will allow the creation of indexes and execution of stored procedures for efficient data retrieval.
202
Describe the Command and Query Responsibility Segregation Pattern
The Command and Query Responsibility Segregation (CQRS) pattern separates updates and query operations of a data set, and allows them to run on different data stores. This results in faster data update and retrieval. It also facilitates modeling data to handle multiple use cases, achieves high scalability and security, and allows update and query models to evolve independently with minimal interactions.
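One way to sketch the separation in Python (the order-total domain and the direct notification between the services are assumptions; real systems typically propagate updates through an event stream, making the query model eventually consistent):

```python
class QueryService:
    """Serves reads from its own read-optimized store."""
    def __init__(self):
        self.totals_by_order = {}

    def on_order_placed(self, order_id, total):
        # Updates arrive from the command side; the query model lags
        # the write store until the notification is processed.
        self.totals_by_order[order_id] = total

    def order_total(self, order_id):
        return self.totals_by_order.get(order_id)

class CommandService:
    """Handles updates against a separate write-optimized store."""
    def __init__(self, subscribers):
        self.write_store = {}
        self.subscribers = subscribers

    def place_order(self, order_id, total):
        self.write_store[order_id] = total
        for notify in self.subscribers:      # propagate the change
            notify(order_id, total)

query = QueryService()
command = CommandService([query.on_order_placed])
command.place_order("o1", 30)
print(query.order_total("o1"))   # 30
```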
202
How is the Command and Query Responsibility Segregation Pattern used in practice?
We can use this pattern when we want to use different domain models for commands and queries, and when we need to separate updates and data retrieval for performance and security reasons.
204
What are some related patterns to the Command and Query Responsibility Segregation Pattern?
- Event Sourcing pattern
Allows command services to communicate updates to query services, and allows both command and query models to reside on different data stores. This provides only eventual consistency between command and query models and adds complexity to the system architecture. Chapter 5 covers this pattern in detail.
- Materialized View pattern
Recommended over the CQRS pattern to achieve scalability, when command and query models are simple enough; Materialized View is covered in the next section.
- Data Sharding pattern
Helps scale commands by partitioning the data (as covered previously in this chapter). As query operations can simply be replicated, applying this pattern for queries may not produce any performance benefit.
- API security
Can be applied to enforce security for both command and query services.
206
When to use the Data Sharding pattern?
Data contains one or a collection of fields that uniquely identify the data or meaningfully group the data into subsets.
207
When not to use the Data Sharding pattern?
Shard key cannot produce evenly balanced shards.
The operations performed in the data require the whole set of data to be processed; for example, obtaining a median from the data set.
207
Benefits of using the Data Sharding pattern
Groups shards based on the preferred set of fields that produce the shard key.
Creates geographically optimized shards that can be moved closer to the clients.
Builds hierarchical shards or time-range-based shards to optimize the search time.
Uses secondary indexes to query data by using nonshard keys.
207
When to use the Command and Query Responsibility Segregation (CQRS) pattern?
Applications have performance-intensive update operations with:
- Data validations
- Security validations
- Message transformations
Applications have performance-intensive query operations, such as complex joins or data mapping.
207
When not to use the Command and Query Responsibility Segregation (CQRS) pattern?
High consistency is required between command (update) and query (read).
Command and query models are similar to each other.
207
Benefits of using the Command and Query Responsibility Segregation (CQRS) pattern
Reduces the impact between command and query operations.
Stores command and query data in two different data stores that suit their use cases.
Enforces separated command/query security policies.
Enables different teams to own applications that are responsible for command and query operations.
Provides high availability.
207
Describe the Materialized View Pattern
The Materialized View pattern provides the ability to retrieve data efficiently upon querying, by moving data closer to the execution and prepopulating materialized views. This pattern stores all relevant data of a service in its local data store and formats the data optimally to serve the queries, rather than letting that service call dependent services for data when required.
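A small Python sketch of prepopulating such a view (the product/review data and the average-rating view are invented for illustration):

```python
# Source-of-truth stores owned by other services (assumed layout).
products = {1: {"name": "pen"}, 2: {"name": "book"}}
reviews = [(1, 4), (1, 5), (2, 3)]   # (product_id, rating)

# The materialized view: the join and the average are computed ahead of
# time, so queries become single-key lookups in the local store.
view = {}
for pid, product in products.items():
    ratings = [r for p, r in reviews if p == pid]
    view[pid] = {
        "name": product["name"],
        "avg_rating": sum(ratings) / len(ratings) if ratings else None,
    }

print(view[1])   # {'name': 'pen', 'avg_rating': 4.5}
```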
209
How is the Materialized View Pattern used in practice?
We can use this pattern when we want to improve data-retrieval efficiency by eliminating complex joins and to reduce coupling with dependent services.
211
What are some patterns that are related to the Materialized View Pattern?
- Data Locality pattern
Enables efficient data retrieval by moving the execution closer to the data.
- Composite Data Services pattern
This can be used instead of the Materialized View pattern when data compositions can be done at the service level, or when dependent services have static data that can be cached locally at the service.
- Command and Query Responsibility Segregation (CQRS) pattern
The Materialized View pattern can be used to serve query responses in the CQRS pattern. The command (modifications to the data) will be done through the dependent service, and the query (serving of read requests) can be performed by query services constructing the materialized views.
- Event Sourcing pattern
Provides an approach to replicate data from one source to another. Changes on dependent data are pushed as events through event streams, which are stored sequentially at a reliable log-based event queue such as Kafka, and then the services that serve the data read those event streams and constantly update their local storage to serve updated information. Chapter 5 covers this pattern.
213
When is the Materialized View Pattern used?
This pattern is used when part of the data is available locally and the rest needs to be fetched from external sources that incur high latency.
211
Describe the Data Locality Pattern
The goal of the Data Locality pattern is to move execution closer to the data. This is done by colocating the services with the data or by performing the execution in the data store itself. This allows the execution to access data with fewer limitations, helping to quicken execution, and to reduce bandwidth by sending aggregated results.
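For example, pushing an aggregation into the data store (here Python's built-in sqlite3; the sensor table is an invented example) returns only the small aggregated result instead of shipping every row over the network:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?)",
                 [("s1", 10.0), ("s1", 14.0), ("s2", 7.0)])

# Data locality: the store computes the aggregate next to the data and
# sends back two small rows, not the full readings table.
rows = conn.execute(
    "SELECT sensor, AVG(value) FROM readings GROUP BY sensor ORDER BY sensor"
).fetchall()
print(rows)   # [('s1', 12.0), ('s2', 7.0)]
```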
214
How is the Data Locality Pattern used in practice?
This pattern encourages coupling execution with data to reduce latency and save bandwidth, enabling distributed cloud native applications to operate efficiently over the network.
216
What are some related patterns to the Data Locality Pattern?
- Materialized View pattern
Provides an alternative approach for this pattern, by moving data closer to the place of execution. This pattern is ideal when the data is small or when CPU-intensive operations such as complex joins and data transformations are needed during reads.
- Caching pattern
Complements this pattern by storing preprocessed data and serving it during repeated queries.
218
Define the Caching Pattern
The Caching pattern stores previously processed or retrieved data in memory, and serves this data for similar queries issued in the future. This not only reduces repeated data processing at the services, but also eliminates calls to dependent services when the response is already stored in the service.
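A minimal cache-aside sketch in Python (the `read_from_db` stub and the call counter are illustrative assumptions):

```python
calls = {"db": 0}

def read_from_db(key):
    calls["db"] += 1          # count backend hits to show the cache working
    return f"value-for-{key}"

cache = {}

def get(key):
    # Cache-aside: check the cache first, fall back to the store on a miss,
    # then populate the cache for future queries.
    if key in cache:
        return cache[key]
    value = read_from_db(key)
    cache[key] = value
    return value

get("a")
get("a")
get("b")
print(calls["db"])   # 2: the repeated query for "a" was served from memory
```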
218
How is the Caching Pattern used in practice?
This pattern is usually applied when the same query can be repeatedly called multiple times by one or more clients, especially when we don’t have enough knowledge about what data will be queried next.
220
What are some related patterns to the Caching Pattern?
- Data Sharding pattern
Enables the cache to be scaled similarly to the way we can scale data stores. This also enables distributing data geographically so relevant data in the cache can be closer to the services that operate on it.
- Resilient Connectivity pattern
Provides a mechanism to serve requests from the data sources when data is not available in the cache. Chapter 3 discusses this pattern.
- Data Service pattern
Along with API security, can be used to provide a service layer for distributed caches, providing more business-centric APIs for data consumers.
- Vault Key pattern
Provides the capability to secure the caches by using access tokens, enabling third parties to access the data directly from caches. This can be used only if the caching systems support this functionality. Otherwise, we need to fall back on using the Data Service pattern with API security.
- Event Sourcing pattern
Propagates cache-invalidation requests to all local caches. This enables eventual consistency of cache data and reduces the chance of data being obsolete as data sources are updated by multiple services. Chapter 5 details this pattern.
229
Describe the Static Content Hosting Pattern
The Static Content Hosting pattern deploys static content in data stores that are closer to clients so content can be delivered directly to the client with low latency and without consuming excess computational resources.
230
How is the Static Content Hosting Pattern used in practice?
This pattern is used when we need to quickly deliver static content to clients with low response time, and when we need to reduce the load on rendering services.
232
What are some related patterns to the Static Content Hosting Pattern?
- Data Sharding pattern
Can be used to shard data when you have a lot of static data.
- Caching pattern
Caches content for faster data access. The cache expiration based on time-out is not necessary, as static data will not become outdated.
- Vault Key pattern
Provides security to systems hosting static content.
- Data Service pattern
Along with API security, provides a service layer on top of the content to control data access.
233
When to use the Materialized View pattern?
Part of the data is available locally, and the rest of the data needs to be fetched from external sources that incur high latency.
The data that needs to be moved is small and rarely updated.
Provides access to nonsensitive data that is hosted in secure systems.
234
When not to use the Materialized View pattern?
Data can be retrieved from dependent services with low latency.
Data in the dependent services is changing quickly.
Consistency of the data is considered important for the response.
234
Benefits of using the Materialized View pattern
Can store the data in any database that is suitable for the application.
Increases resiliency of the service by replicating the data to local stores.
234
When to use the Data Locality pattern?
To read data from multiple data sources and perform a join or data aggregation in memory.
The data stores are huge, and the clients are geographically distributed.
234
When not to use the Data Locality pattern?
Queries output most of their input.
Additional execution cost incurred at the data nodes is higher than the cost of data transfer over the network.
234
Benefits of using the Data Locality pattern
Reduces network bandwidth utilization and data-retrieval latency.
Better utilizes CPU resources and optimizes overall performance.
Caches results and serves requests more efficiently.
234
When to use the Caching pattern?
Best for static data or data that is read more frequently than it is updated.
Application has the same query that can be repeatedly called multiple times by one or more clients, especially when we do not have enough knowledge about what data will be queried next.
The data store is subject to a high level of contention or cannot handle the number of concurrent requests it is receiving from multiple clients.
234
When not to use the Caching pattern?
The data is updated frequently.
As the means of storing application state, since the cache should not be considered the single source of truth.
The data is critical, and the system cannot tolerate data inconsistencies.
234
Benefits of using the Caching pattern
Can choose which part of the data to cache to improve performance.
Using a cache aside improves performance by reducing redundant computations.
Can preload static data into the cache.
Combined with eviction policy, the cache can hold the recent/required data.
234
When to use the Static Content Hosting pattern?
All or some of the data requested by the client is static.
The static data needs to be available in multiple environments or geographic locations.
234
When not to use the Static Content Hosting pattern?
The static content needs to be updated before delivering to the clients, such as adding the access time and location.
The amount of data that needs to be served is small.
Clients cannot retrieve and combine static and dynamic content together.
234
Benefits of using the Static Content Hosting pattern
Geographically partitioning and storing closer to clients provides shorter response times and faster access/download speed.
Reduces resource utilization on rendering services.
234
Describe the Transaction Pattern
The Transaction pattern uses transactions to perform a set of operations as a single unit of work, so all operations are completed or undone as a unit. This helps maintain the integrity of the data, and error-proofs execution of services. This is critical for the successful execution of financial applications.
236
How does the Transaction pattern work?
This pattern wraps multiple individual operations into a single large operation, providing a guarantee that either all operations or no operation will succeed. All transactions follow these steps:
1 - System initiates a transaction.
2 - Various data manipulation operations are executed.
3 - Commit is used to indicate the end of the transaction.
4 - If there are no errors, the commit will succeed, the transaction will finish successfully, and the changes will be reflected in the data stores.
If there are errors, all the operations in the transaction will be rolled back, and the transaction will fail. No changes will be reflected in the data stores.
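The steps above can be sketched with Python's built-in sqlite3, whose connection context manager opens a transaction and commits on success or rolls back on error (the account-transfer example is invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INT)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 0)])
conn.commit()

try:
    with conn:  # step 1: a transaction is initiated
        # step 2: data manipulation operations are executed
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'a'")
        conn.execute("UPDATE accounts SET balance = balance + 50 WHERE name = 'b'")
        # simulate an error before the commit (steps 3-4 would otherwise succeed)
        raise RuntimeError("failure before commit")
except RuntimeError:
    pass  # the context manager rolled the transaction back

# Both updates were undone as a unit: no partial transfer is visible.
print(conn.execute("SELECT balance FROM accounts ORDER BY name").fetchall())
# [(100,), (0,)]
```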
236
What are the ACID properties?
- Atomic
All operations must occur at once, or none should occur.
- Consistent
Before and after the transaction, the system will be in a valid state.
- Isolation
The results produced by concurrent transactions will be identical to those transactions being executed in sequential order.
- Durable
When the transaction is finished, the committed changes will remain committed even during system failures.
237
How is the Transaction pattern used in practice?
Transactions can be used to combine multiple operations as a single unit of work, and to coordinate the operations of multiple systems.
238
What are some related patterns to the Transaction pattern?
The Transaction pattern has one related pattern, the Saga pattern. This pattern, covered in Chapter 3, reliably coordinates execution of multiple systems.
241
When to use the Transaction pattern?
An operation contains multiple steps, and all the steps should be processed atomically to consider the operation valid.
241
When not to use the Transaction pattern?
The application has only a single step in the operation.
The application has multiple steps, and failure of some steps is considered acceptable.
241
What are the benefits of using the Transaction pattern?
Adheres to ACID properties.
Processes multiple independent transactions.
241
Describe the Vault Key Pattern
The Vault Key pattern provides direct access to data stores via a trusted token, commonly named the vault key. Some of the popular cloud data stores support this functionality.
242
How does the Vault Key Pattern work?
The Vault Key pattern is based on a trusted token being presented by the client and being validated by the data store. In this pattern, the application determines who can access which part of the data.
242 Figure 4-25. Actions performed by clients to retrieve data in the Vault Key pattern
How is the Vault Key Pattern used in practice?
This pattern can be used when the data store cannot reach the identity provider to authenticate and authorize the client upon data access. In this pattern, the data store will contain the certificate of the identity provider, so it will be able to decrypt the token and validate its authenticity without calling the identity provider. Because it does not need to make remote service calls for validation, it can also perform authentication operations with minimal latency.
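A hedged sketch of local token validation in Python (HMAC with a shared secret is used here for brevity; as the text notes, real data stores typically validate tokens against the identity provider's certificate, i.e., asymmetrically):

```python
import base64
import hashlib
import hmac
import json

# Key material the data store holds so it can validate tokens locally,
# without a remote call to the identity provider.
SECRET = b"demo-signing-key"

def issue_key(claims):
    # The identity provider signs the claims and hands the token to the client.
    body = base64.urlsafe_b64encode(json.dumps(claims).encode()).decode()
    sig = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    return f"{body}.{sig}"

def validate_key(token):
    # The data store checks the signature itself, with no remote calls.
    body, sig = token.rsplit(".", 1)
    expected = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None                        # reject tampered tokens
    return json.loads(base64.urlsafe_b64decode(body))

token = issue_key({"path": "/orders", "op": "read"})
print(validate_key(token))                 # {'path': '/orders', 'op': 'read'}
print(validate_key(token + "x"))           # None
```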
243
What are some related patterns to the Vault Key Pattern?
The Vault Key pattern has one related pattern, Data Service (covered at the start of this chapter). Along with API security, the Data Service pattern provides an alternative approach for providing security when the Vault Key pattern is not feasible.
244
When to use the Vault Key Pattern?
To securely access remote data with minimal latency.
The store has a limited computing capability to perform service calls for authentication and authorization.
244
When not to use the Vault Key Pattern?
Need fine-grained data protection.
Need to restrict what queries should be executed on the data store with high precision.
The exposed data store cannot validate access based on keys.
244
What are the benefits of using the Vault Key Pattern?
Accesses data stores directly by using a trusted token, a vault key
Has minimal operational costs compared to calling the central identity service for validation
244
When to use Relational database management system (RDBMS)?
- Need transactions and ACID properties.
- Interrelationship with data is required to be maintained.
- Working with small to medium amounts of data.
252
When not to use Relational database management system (RDBMS)?
- Data needs to be highly scalable, such as IoT data.
- Working with XML, JSON, and binary data format.
- Solution cannot tolerate some level of unavailability.
252
When to use Apache Cassandra?
- Need high availability.
- Need scalability.
- Need a decentralized solution.
- Need faster writes than reads.
- Read access can be mostly performed by partition key.
252
When not to use Apache Cassandra?
- Existing data is updated frequently.
- Need to access data by columns that are not part of the partition key.
- Require relational features, such as transactions, complex joins, and ACID properties.
252
When to use Apache HBase?
- Need consistency.
- Need scalability.
- Need a decentralized solution.
- Need high read performance.
- Need both random and real-time access to data.
- Need to store petabytes of data.
252
When not to use Apache HBase?
- Solution cannot tolerate some level of unavailability.
- Existing data is updated very frequently.
- Require relational features, such as transactions, complex joins, and ACID properties.
252
When to use MongoDB?
- Need consistency.
- Need a decentralized solution.
- Need a document store.
- Need data lookup based on multiple keys.
- Need high write performance.
252
When not to use MongoDB?
- Solution cannot tolerate some level of unavailability.
- Require relational features, such as transactions, complex joins, and ACID properties.
252
When to use Redis?
- Need scalability.
- Need an in-memory database.
- Need a persistent option to restore the data.
- As a cache, queue, and real-time storage.
252
When not to use Redis?
- As a typical database to store and query with complex operations.
252
When to use Amazon DynamoDB?
- Need a highly scalable solution.
- Need a document store.
- Need a key-value store.
- Need high write performance.
- Fine-grained access control.
252
When not to use Amazon DynamoDB?
- Use in platforms other than AWS.
- Require relational features, such as complex joins, and foreign keys.
252
When to use Apache HDFS?
- Need a filesystem.
- Store large files.
- Store data once and read it multiple times.
- Perform MapReduce operation on files.
- Need scalability.
- Need data resiliency.
252
When not to use Apache HDFS?
- Store small files.
- Need to update files.
- Need to perform random data reads.
252
When to use Amazon S3?
- Need an object store.
- Perform MapReduce operations on objects.
- Need a highly scalable solution.
- Read part of the object data.
- Fine-grained access control.
252
When not to use Amazon S3?
- Use in platforms other than AWS.
- Need to run complex queries.
252
When to use Azure Cosmos DB?
- Need a highly scalable solution.
- Need a document store.
- Need a key-value store.
- Need a graph store.
- Need a column store.
- Fine-grained access control.
- Connectivity via MongoDB and Cassandra clients.
252
When not to use Azure Cosmos DB?
- Use in platforms other than Azure.
- Perform transactions across data partitions.
252
When to use Google Cloud Spanner?
- Need a highly scalable solution.
- Need a relational store.
- Need support for SQL query processing.
- Need transaction support across all nodes in the cluster.
252
When not to use Google Cloud Spanner?
- Use in platforms other than Google Cloud.
- Need support for the full ANSI SQL spec.
252
We can use test data stores to test data-service interactions. Though data services can have complex or simple logic, they can still cause bottlenecks in production. What are useful recommendations for overcoming these issues?
- Tests should be performed with both clean and prepopulated data stores, as the former will test for data initialization code and the latter will test for data consistency during operation.
- Test all data store types and versions that will be used in production to eliminate any surprises. We can implement test data stores as Docker instances that will help run tests in multiple environments with quick startup and proper cleanup after the test.
- Test data mapping and make sure all fields are properly mapped when calling the data store.
- Validate whether the service is performing inserts, writes, deletions, and updates on the data stores in an expected manner by checking the state of the data store via test clients that can access the database directly.
- Validate that relational constraints, triggers, and stored procedures are producing correct results.
In addition, it is important to do a load test on the data service along with the data store in a production-like environment with multiple clients. This will help identify any database lock, data consistency, or other performance-related bottlenecks present in the cloud native application. It will also show how much load the application can handle and how that will be affected when various data scaling patterns and techniques are deployed.
254
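A minimal sketch of the first recommendation (testing against both clean and prepopulated stores), using an in-memory SQLite store as a stand-in for the real data store; the table, rows, and function names are hypothetical. In practice, the same test suite would be pointed at Dockerized instances of each production store type and version.

```python
import sqlite3

SCHEMA = "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)"

def new_store(seed_rows=()):
    """Spin up a throwaway store; no rows gives a clean store,
    seed rows give a prepopulated one."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany("INSERT INTO users VALUES (?, ?)", seed_rows)
    return conn

def register_user(conn, user_id, email):
    """The data-service operation under test."""
    conn.execute("INSERT INTO users VALUES (?, ?)", (user_id, email))

# Clean store: exercises the data initialization code paths.
clean = new_store()
register_user(clean, 1, "a@example.com")
assert clean.execute("SELECT COUNT(*) FROM users").fetchone()[0] == 1

# Prepopulated store: exercises consistency against existing data.
seeded = new_store([(1, "a@example.com")])
try:
    register_user(seeded, 1, "dup@example.com")  # violates the primary key
except sqlite3.IntegrityError:
    print("duplicate insert correctly rejected")
```

The test client queries the store directly, independently of the service code, which is how the state-validation recommendation above is usually carried out.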
Describe Observability and Monitoring
Observability and monitoring help us identify the performance of data stores and take corrective actions when they deviate because of load or changes to the application. In most applications, incoming requests interact with the data stores. Any performance or availability issues in the data store will resonate across all layers of the system, affecting the overall user experience.
257
What are some key metrics to observe in a data store?
- Application metrics:
  - Data store uptime/health: To identify whether each node in the data store is up and running.
  - Query execution time: Five types of issues can cause high query execution times:
    - Inefficient queries: Use of nonoptimized queries, including multiple complex joins and tables that are not properly indexed.
    - Data growth in the data store: Data stores containing more data than they can handle.
    - Concurrency: Concurrent operations on the same table/row, locking data stores and impacting their performance.
    - Lack of system resources such as CPU/memory/disk space: Data store nodes not having enough resources to serve requests efficiently.
    - Unavailability of a dependent system or replica: In distributed data stores, when a replica or another dependent system, such as a lookup service, is unavailable, requests may take more time because the store needs to provision a new instance or discover and route the request to another instance.
  - Query execution response: Whether the query execution is successful. If a query is failing, we may need to look at the logs for more detail (depending on the failure).
  - Audit of the query operations: Malicious queries or user operations can cause an unexpected reduction in data store performance. We can use audit logs to identify and mitigate them.
- System metrics: To identify a lack of system resources for efficient processing via CPU consumption, memory consumption, availability of disk space, network utilization, and disk I/O speed.
- Data store logs.
- Time taken and throughput when communicating with primary and replicas: Helps to identify networking issues and bad data store nodes.
When analyzing metrics, we can use percentiles to compare historical and current behaviors. This can identify anomalies and deviations, so we can quickly identify the root cause of the problem.
257
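The percentile comparison mentioned above can be sketched as follows. The 1.5× threshold and the function names are illustrative assumptions, not a prescribed rule; monitoring systems typically track these percentiles continuously rather than over two fixed windows.

```python
import statistics

def latency_percentiles(samples_ms):
    """p50/p95/p99 of query execution times, in milliseconds."""
    qs = statistics.quantiles(samples_ms, n=100)  # 99 cut points: p1..p99
    return {"p50": qs[49], "p95": qs[94], "p99": qs[98]}

def is_anomalous(current, baseline, threshold=1.5):
    """Flag a deviation if the current p99 drifts well past the historical p99."""
    return current["p99"] > threshold * baseline["p99"]

# Historical window: steady latencies around 9-14 ms.
baseline = latency_percentiles([10, 12, 11, 13, 9, 10, 14, 12, 11, 10] * 10)
# Current window: a chunk of queries now taking 55-61 ms.
current = latency_percentiles([10, 12, 55, 60, 9, 58, 14, 61, 11, 57] * 10)

print(is_anomalous(current, baseline))  # True: the tail latency has spiked
```

Comparing percentiles rather than averages makes tail-latency regressions visible even when the median stays flat, which is why they suit root-cause analysis.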
What are some steps and key considerations for deploying and managing data stores?
1 - Select data store types. Select the data store type (relational, NoSQL, or filesystem) and its vendor to match our use case.
2 - Configure the deployment pattern. This can be influenced by the patterns applied in the cloud native application and the type of data store we have selected. Based on this selection, high availability and scalability should be determined by answering the following questions:
1 - Who are the clients?
2 - How many nodes?
3 - Are we going to use a data store managed by the cloud vendor or deploy our own?
4 - How does the replication work?
5 - How do we back up the data?
6 - How does it handle disaster recovery?
7 - How do we secure the data store?
8 - How do we monitor the data store?
9 - How much does the data store/management cost?
3 - Enforce security. Data stores should be protected because they contain business-critical information. This can be enforced by applying relevant physical and software security as discussed in the preceding section. This may include enabling strict access control, data encryption, and use of audit logs.
4 - Set up observability and monitoring. Like microservices, data stores should be configured with observability and monitoring tools to guarantee continuous operation. This can provide early insights on possible scaling problems, such as a requirement to rebalance data shards, or to apply a different design pattern altogether to improve scalability and performance of the application.
5 - Automate continuous delivery. When it comes to data stores, automation and continuous delivery are not straightforward. Although we can easily come up with an initial data store schema, maintaining backward compatibility is difficult as the application evolves. Backward compatibility is critical; without it, we will not be able to achieve smooth application updates and rollbacks during failures. To improve productivity, we should always use proper automation tools, such as scripts, to automate continuous delivery. We also recommend having guardrails and using multiple deployment environments, such as development and staging/preproduction, to reduce the impact of changes and to validate the application before moving it to production.
By following these steps, we can safely deploy and maintain cloud native applications while allowing rapid innovation and adoption to other systems.
260
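The backward-compatibility concern in step 5 can be illustrated with a small sketch (the schema and values are hypothetical, and SQLite stands in for the production store): adding a nullable column with a default is an additive migration, so application versions that predate the change keep working unmodified, which is what makes smooth updates and rollbacks possible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
conn.execute("INSERT INTO orders VALUES (1, 9.99)")

# Backward-compatible migration: add a column with a default, so old
# application versions that don't know about it keep working, and
# existing rows read back with the default value.
conn.execute("ALTER TABLE orders ADD COLUMN currency TEXT DEFAULT 'USD'")

# Old code path: still valid, simply ignores the new column.
conn.execute("INSERT INTO orders (id, total) VALUES (2, 5.00)")
# New code path: writes the new column explicitly.
conn.execute("INSERT INTO orders (id, total, currency) VALUES (3, 4.00, 'EUR')")

rows = conn.execute("SELECT id, total, currency FROM orders ORDER BY id").fetchall()
print(rows)  # [(1, 9.99, 'USD'), (2, 5.0, 'USD'), (3, 4.0, 'EUR')]
```

By contrast, renaming or dropping a column in place would break the old code path immediately, which is why such changes are usually staged across multiple releases.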