Chapter 4, Data Management Patterns GPT Flashcards
What are the unique characteristics of cloud native data compared to traditional data processing practices?
Cloud native data can be stored in many forms, in a variety of data formats and data stores, does not maintain a fixed schema, and is encouraged to have duplicate data, favoring availability and performance over consistency. Rather than accessing the same database directly, services are encouraged to call the APIs of the service that owns each data store. This provides separation of concerns and allows cloud native data to scale out.
Page 162
Just as cloud native microservices have characteristics such as being scalable, resilient, and manageable, cloud native data has its own unique characteristics that are quite different from traditional data processing practices. Most important, cloud native data can be stored in many forms, in a variety of data formats and data stores. It is not expected to maintain a fixed schema, and duplicate data is encouraged to facilitate availability and performance over consistency. Furthermore, in cloud native applications, multiple services are not encouraged to access the same database; instead, they should access the data by calling the APIs of the respective services that own the data stores. All of this provides separation of concerns and allows cloud native data to scale out.
What are stateless applications, and why are they simpler to implement and scale compared to stateful applications?
Stateless applications depend only on input and configuration data, so their failure or restart has almost no impact on execution. In contrast, stateful applications depend on input, config, and state data, which makes them more complex to implement and scale, as application failures can corrupt their state and lead to incorrect execution.
Page 162
Applications that depend only on input and configuration (config) data are called stateless applications. These applications are relatively simple to implement and scale because their failure or restart has almost no impact on their execution. In contrast, applications that depend on input, config, and state data—stateful applications—are much more complex to implement and scale. The state of the application is stored in data stores, so application failures can result in partial writes that corrupt their state, which can lead to incorrect execution of the application.
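Below is a minimal Python sketch (not from the book) contrasting the two: the pricing function depends only on its input and a config value, while the counter also keeps state that a crash or restart can lose. The names and the tax-rate config are made up for illustration.

```python
TAX_RATE = 0.08  # configuration data


def quote_price(amount: float) -> float:
    """Stateless: the result depends only on the input and the config."""
    return round(amount * (1 + TAX_RATE), 2)


class OrderCounter:
    """Stateful: the running total lives in the process (or a data store),
    so a crash or restart can lose or corrupt it."""

    def __init__(self) -> None:
        self.total_orders = 0  # state data

    def record_order(self) -> int:
        self.total_orders += 1
        return self.total_orders


if __name__ == "__main__":
    print(quote_price(100.0))      # always 108.0 for the same input
    counter = OrderCounter()
    print(counter.record_order())  # 1 -- but this value is gone after a restart
```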
What are relational databases best suited for, and what principle do they follow for schema definition?
Relational databases are ideal for storing structured data that has a predefined schema and use Structured Query Language (SQL) for processing, storing, and accessing data. They follow the principle of defining schema on write, meaning the data schema is defined before writing the data to the database.
Page 165
Relational databases are ideal for storing structured data that has a predefined schema. These databases use Structured Query Language (SQL) for processing, storing, and accessing data. They also follow the principle of defining schema on write: the data schema is defined before writing the data to the database.
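The following sketch uses Python's built-in sqlite3 module to illustrate schema on write; the orders table and its columns are hypothetical. The schema must exist before any row is written, and a write that violates it is rejected at write time.

```python
import sqlite3

# Schema on write: the table's schema is defined before any data is written.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE orders (
           order_id INTEGER PRIMARY KEY,
           customer TEXT NOT NULL,
           total    REAL NOT NULL
       )"""
)
conn.execute("INSERT INTO orders (customer, total) VALUES (?, ?)", ("Alice", 42.5))

# A write that does not match the predefined schema fails at write time.
try:
    conn.execute("INSERT INTO orders (customer) VALUES (?)", ("Bob",))  # missing total
except sqlite3.IntegrityError as err:
    print("rejected on write:", err)
```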
What are the advantages of using relational databases for cloud native application data?
Relational databases can optimally store and retrieve data using database indexing and normalization, provide transaction guarantees through ACID properties, and help deploy and scale the data along with microservices as a single deployment unit.
Page 165-166
Relational databases can optimally store and retrieve data by using database indexing and normalization. Because these databases support atomicity, consistency, isolation, and durability (ACID) properties, they can also provide transaction guarantees.
Relational databases are a good option for storing cloud native application data. We recommend using a relational database per microservice, as this will help deploy and scale the data along with the microservice as a single deployment unit.
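As a hedged illustration of the transaction guarantee, the sketch below wraps a hypothetical funds transfer in a single sqlite3 transaction, so either both updates apply or neither does; the accounts table and the insufficient-funds check are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL NOT NULL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100.0), ("bob", 0.0)])
conn.commit()


def transfer(db, src, dst, amount):
    try:
        with db:  # opens a transaction; commits on success, rolls back on error
            db.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src))
            db.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst))
            # Simulate a business-rule failure: the debit above is rolled back too.
            if db.execute("SELECT balance FROM accounts WHERE name = ?", (src,)).fetchone()[0] < 0:
                raise ValueError("insufficient funds")
    except ValueError as err:
        print("transfer aborted:", err)


transfer(conn, "alice", "bob", 500.0)  # aborted, so both balances stay unchanged
print(conn.execute("SELECT * FROM accounts ORDER BY name").fetchall())
```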
What is the principle of schema on read, and which type of databases follow this principle?
The principle of schema on read means that the schema of the data is defined only at the time of accessing the data for processing, not when it is written to the disk. NoSQL databases follow this principle.
Page 166
NoSQL databases follow the principle of schema on read: the schema of the data is defined only at the time of accessing the data for processing, and not when it is written to the disk.
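A small Python sketch of schema on read, assuming records are stored as raw JSON strings: documents with different shapes are written as-is, and the reader projects them onto only the fields it cares about at access time. The records and field names are illustrative.

```python
import json

# Records are written with no fixed schema.
raw_store = [
    json.dumps({"id": 1, "name": "Alice", "email": "alice@example.com"}),
    json.dumps({"id": 2, "name": "Bob", "phone": "555-0100", "vip": True}),
]


def read_with_schema(record: str, fields: list) -> dict:
    """Apply a schema only at read time by projecting onto the requested fields."""
    doc = json.loads(record)
    return {field: doc.get(field) for field in fields}


for rec in raw_store:
    print(read_with_schema(rec, ["id", "name", "email"]))
```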
Why are NoSQL databases suitable for handling big data, and what is a general recommendation regarding their use for transaction guarantees?
NoSQL databases are designed for scalability and performance, making them suitable for handling big data. However, it is generally not recommended to store data that needs transaction guarantees in NoSQL stores.
Page 166-167
NoSQL databases are best suited to handling big data, as they are designed for scalability and performance.
It is generally not recommended to store data that needs transaction guarantees in NoSQL stores.
How does a column store database manage data, and what are some common examples?
A column store database stores multiple key (column) and value pairs in each of its rows, allowing for writing any number of columns during the write phase and specifying only the columns of interest during data retrieval. Examples include Apache Cassandra and Apache HBase.
Page 167
A column store keeps multiple key (column) and value pairs in each of its rows, as shown in Figure 4-2. These stores are a good example of schema on read: we can write any number of columns during the write phase, and when data is retrieved, we can specify only the columns we are interested in processing. The most widely used column store is Apache Cassandra. For those who use big data and Apache Hadoop infrastructure, Apache HBase can be an option, as it is part of the Hadoop ecosystem.
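The toy in-memory model below (plain Python, not Cassandra's API) shows the idea: each row key holds whatever column/value pairs were written for it, and a read names only the columns of interest. The row keys and columns are made up.

```python
column_store = {}  # row key -> {column: value}


def write(row_key, **columns):
    """Add or update any set of column/value pairs for a row."""
    column_store.setdefault(row_key, {}).update(columns)


def read(row_key, *columns):
    """Return only the requested columns for a row (schema on read)."""
    row = column_store.get(row_key, {})
    return {col: row.get(col) for col in columns}


# Rows may carry different numbers of columns.
write("user:1", name="Alice", city="Oslo")
write("user:2", name="Bob", city="Lima", last_login="2023-01-01", device="mobile")

print(read("user:2", "name", "last_login"))  # only the columns we ask for
```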
What type of data is stored in a document store, and which databases are popular for this purpose?
A document store can store semi-structured data such as JSON and XML documents, allowing processing with JSON and XML path expressions. Popular document stores include MongoDB, Apache CouchDB, and Couchbase.
Page 169
A document store can store semi-structured data such as JSON and XML documents. This also allows us to process stored documents by using JSON and XML path expressions. These data stores are popular as they can store JSON and XML messages, which are usually used by frontend applications and APIs for communication. MongoDB, Apache CouchDB, and Couchbase are popular options for storing JSON documents.
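A self-contained sketch of the idea: JSON documents are stored as-is and queried with a simple dotted-path helper, which stands in for the JSON path support that document stores such as MongoDB provide. The get_path helper and the sample documents are illustrative, not a real driver API.

```python
import json

documents = [
    json.loads('{"order": {"id": 1, "customer": {"name": "Alice"}, "total": 42.5}}'),
    json.loads('{"order": {"id": 2, "customer": {"name": "Bob"}, "total": 17.0}}'),
]


def get_path(doc, path):
    """Resolve a dotted path like 'order.customer.name' against a document."""
    current = doc
    for key in path.split("."):
        current = current.get(key, {}) if isinstance(current, dict) else {}
    return current or None


for doc in documents:
    print(get_path(doc, "order.customer.name"), get_path(doc, "order.total"))
```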
What is the CAP theorem, and how does it apply to NoSQL stores?
The CAP theorem states that a distributed application can provide either full availability or consistency, but not both, while ensuring partition tolerance. Availability means the system is fully functional when some nodes are down, consistency means an update/change in one node is immediately propagated to others, and partition tolerance means the system can work even when some nodes cannot connect to each other.
Page 169
NoSQL stores are distributed, so they need to adhere to the CAP theorem; CAP stands for consistency, availability, and partition tolerance. This theorem states that a distributed application can provide either full availability or consistency; we cannot achieve both while providing network partition tolerance. Here, availability means that the system is fully functional when some of its nodes are down, consistency means an update/change in one node is immediately propagated to other nodes, and partition tolerance means that the system can continue to work even when some nodes cannot connect to each other.
Why is filesystem storage preferred for unstructured data in cloud native applications?
Filesystem storage is preferred for unstructured data because it optimizes data storage and retrieval without trying to understand the data. It can also be used to store large application data as a cache, which can be cheaper than retrieving data repeatedly over the network.
Page 171
Filesystem storage is the best for storing unstructured data in cloud native applications. Unlike NoSQL stores, it does not try to understand the data but rather purely optimizes data storage and retrieval. We can also use filesystem storage to store large application data as a cache, as it can be cheaper than retrieving data repeatedly over the network.
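A minimal sketch of filesystem storage used as a cache, assuming a placeholder fetch_report_from_network function stands in for an expensive network call; the cache directory and file naming are illustrative.

```python
from pathlib import Path

CACHE_DIR = Path("/tmp/app-cache")  # illustrative cache location


def fetch_report_from_network(report_id: str) -> bytes:
    """Placeholder for an expensive network call (e.g., an HTTP download)."""
    return f"large report payload for {report_id}".encode()


def get_report(report_id: str) -> bytes:
    """Return the report from the filesystem cache, fetching it only on a miss."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    cache_file = CACHE_DIR / f"{report_id}.bin"
    if cache_file.exists():
        return cache_file.read_bytes()               # cheap local read
    payload = fetch_report_from_network(report_id)   # expensive network fetch
    cache_file.write_bytes(payload)
    return payload


print(get_report("monthly-sales"))  # first call fetches; later calls hit the cache
```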
When should cloud native applications use relational data stores, NoSQL stores, or filesystem storage?
Cloud native applications should use relational data stores when they need transactional guarantees and the data needs to be tightly coupled with the application. Semi-structured or unstructured fields can be separated out into NoSQL or filesystem stores to achieve scalability while preserving transactional guarantees. NoSQL is also suitable when the data is extremely large, when a querying capability is needed, or for specialized application use cases such as graph processing.
Page 172
Cloud native applications should use relational data stores when they need transactional guarantees and when data needs to be tightly coupled with the application.
When data contains semi-structured or unstructured fields, those fields can be separated and stored in NoSQL or filesystem stores to achieve scalability while still preserving transactional guarantees. Applications can choose to store data in NoSQL stores when the data quantity is extremely large, needs a querying capability, or is semi-structured, or when the data store is specialized enough to handle a specific application use case such as graph processing.
What are the advantages and disadvantages of centralized data management in traditional data-centric applications?
Centralized data management allows data normalization for high consistency and enables running stored procedures across multiple tables for faster retrieval. However, it tightly couples applications, hinders their ability to evolve independently, and is considered an antipattern for cloud native applications.
Page 172
Centralized data management is the most common type in traditional data-centric applications. In this approach, all data is stored in a single database, and multiple components of the application are allowed to access the data for processing (Figure 4-3).
This approach has several advantages; for instance, the data in these database tables can be normalized, providing high data consistency. Furthermore, as components can access all the tables, centralized data storage provides the ability to run stored procedures across multiple tables and to retrieve results faster. On the other hand, this approach introduces tight coupling between applications and hinders their ability to evolve independently. Therefore, it is considered an antipattern when building cloud native applications.
How does decentralized data management benefit microservices, and what are its potential disadvantages?
Decentralized data management allows microservices to be scaled independently, improves development time and release cycles, and solves data management and ownership problems. However, it can increase the cost of running separate data stores for each service.
Page 174
In decentralized data management, each independent functional component can be modeled as a microservice with its own separate data store, exclusive to that service. This decentralized data management approach, illustrated in Figure 4-4, allows us to scale microservices independently without impacting other microservices.
Although application owners have less freedom to manage or evolve the data, segregating it in each microservice so that it’s managed by its teams/owners not only solves data management and ownership problems, but also improves the development time of new feature implementations and release cycles.
Decentralized data management allows services to choose the most appropriate data store for their use case. For example, a Payment service may use a relational database to perform transactions, while an Inquiry service may use a document store to store the details of the inquiry, and a Shopping Cart service may use a distributed key-value store to store the items picked by the customer.
One of the disadvantages of decentralized data management is the cost of running separate data stores for each service.
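A minimal sketch of the Shopping Cart example above, where the service owns its own key-value style store (a plain dict stands in for a distributed key-value store) and other services would reach it only through its methods or APIs. The class and keys are hypothetical.

```python
class ShoppingCartService:
    """Owns its data store exclusively; no other service touches it directly."""

    def __init__(self) -> None:
        self._store = {}  # key: customer id, value: list of items

    def add_item(self, customer_id: str, item: str) -> None:
        self._store.setdefault(customer_id, []).append(item)

    def get_cart(self, customer_id: str) -> list:
        return list(self._store.get(customer_id, []))


carts = ShoppingCartService()
carts.add_item("customer-42", "book")
carts.add_item("customer-42", "coffee")
print(carts.get_cart("customer-42"))
```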
What is hybrid data management, and how does it help with data protection and security enforcement?
Hybrid data management helps achieve compliance with modern data-protection laws and ease security enforcement by having customer data managed via a few microservices within a secured bounded context. It provides ownership of the data to one or a few well-trained teams to apply data-protection policies.
Page 175
Hybrid Data Management helps achieve compliance with modern data-protection laws and ease security enforcement as data resides in a central place. Therefore, it is advisable to have all customer data managed via a few microservices within a secured bounded context, and to provide ownership of the data to one or a few well-trained teams to apply data-protection policies.
What benefits does exposing data as a data service provide, and in what situations is the Data Service Pattern useful?
Exposing data as a data service allows control over data presentation, security, and priority-based throttling. The Data Service Pattern is useful when data does not belong to a specific microservice and multiple microservices depend on it, or for exposing legacy on-premises or proprietary data stores to cloud native applications.
Page 180, 182
Exposing data as a data service, shown in Figure 4-10, provides us more control over that data. This allows us to present data in various compositions to various clients, apply security, and enforce priority-based throttling, allowing only critical services to access data during resource-constrained situations such as load spikes or system failures.
These data services can perform simple read and write operations to a database or even perform complex logic such as joining multiple tables or running stored procedures to build responses much more efficiently. These data services can also utilize caching to enhance their read performance.
We can use the Data Service Pattern when the data does not belong to any particular microservice; no microservice is the rightful owner of that data, yet multiple microservices depend on it for their operation. In such cases, the common data should be exposed as an independent data service, allowing all dependent applications to access the data via APIs.
We can also use the Data Service Pattern to expose legacy on-premises or proprietary data stores to other cloud native applications.
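The sketch below outlines what such a data service might look like, assuming Flask is available; the /inventory endpoints, the in-memory "database", and the read cache are illustrative stand-ins, not the book's implementation.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)
inventory_db = {"item-1": {"name": "keyboard", "stock": 12}}  # stands in for the shared data store
cache = {}  # simple read cache to enhance read performance


@app.route("/inventory/<item_id>", methods=["GET"])
def read_item(item_id):
    if item_id in cache:                   # serve repeated reads from the cache
        return jsonify(cache[item_id])
    item = inventory_db.get(item_id)
    if item is None:
        return jsonify({"error": "not found"}), 404
    cache[item_id] = item
    return jsonify(item)


@app.route("/inventory/<item_id>", methods=["PUT"])
def write_item(item_id):
    inventory_db[item_id] = request.get_json()  # simple write operation
    cache.pop(item_id, None)                    # invalidate stale cache entries
    return jsonify(inventory_db[item_id])


if __name__ == "__main__":
    app.run(port=8080)  # dependent services call these APIs instead of the database
```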
Why is accessing the same data via multiple microservices considered an antipattern, and how can the Data Service Pattern help?
Accessing the same data via multiple microservices introduces tight coupling and hinders scalability and independent evolution of microservices. The Data Service Pattern helps reduce coupling by providing managed APIs to access data.
Page 183
Considerations: When building cloud native applications, accessing the same data via multiple microservices is considered an antipattern. It introduces tight coupling between the microservices and prevents them from scaling and evolving on their own. The Data Service Pattern can help reduce coupling by providing managed APIs to access data.
The Data Service Pattern should not be used when the data can clearly be associated with an existing microservice, as introducing unnecessary microservices will cause additional management complexity.
What is the primary purpose of the Sharding Pattern, and what should be avoided when generating shard keys?
The primary purpose of the Sharding Pattern is to improve data retrieval time by distributing data across multiple shards. When generating shard keys, avoid using auto-incrementing fields and ensure the fields that contribute to the shard key remain fixed to avoid time-consuming data migration.
Page 198, 200
For sharding to be useful, the data should contain one or a collection of fields that uniquely identifies the data or meaningfully groups it into subsets. The combination of these fields generates the shard/partition key that will be used to locate the data. The values stored in the fields that contribute to the shard key should be fixed and never be changed upon data updates. This is because when they change, they will also change the shard key, and if the updated shard key now points to a different shard location, the data also needs to be migrated from the current shard to the new shard location. Moving data among shards is time-consuming, so this should be avoided at all costs.
We don’t recommend using auto-incrementing fields when generating shard keys. Shards do not communicate with each other, and because of the use of auto-incrementing fields, multiple shards may have generated the same keys and refer to different data with those keys locally. This can become a problem when the data is redistributed during data-rebalancing operations.