Data Store Management Flashcards
A data engineer is designing an application that will transform data in containers managed by Amazon Elastic Kubernetes Service (Amazon EKS). The containers run on Amazon EC2 nodes. Each containerized application will transform an independent dataset and then store the result in a data lake. The data does not need to be shared with other containers. The data engineer must decide where to store the data until the transformation is complete.
Which solution will meet these requirements with the LOWEST latency?
Containers should use an ephemeral volume provided by the node’s RAM.
Amazon EKS is a container orchestrator that provides Kubernetes as a managed service. Containers run in pods, and pods run on nodes. Nodes can be EC2 instances, or pods can run on AWS Fargate. An ephemeral volume shares the pod’s lifecycle: it is created with the pod and deleted when the pod is removed. Ephemeral volumes can use drives or memory that is local to the node. Because the data does not need to be shared and only has to persist until the transformation is complete, node-local storage is sufficient. A volume backed by the node’s RAM avoids both network and disk access. Therefore, this solution will have lower latency than storage that is external to the node.
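As an illustration of a RAM-backed ephemeral volume, the sketch below builds a Kubernetes Pod manifest that uses an emptyDir volume with medium set to Memory, which Kubernetes mounts as tmpfs. The pod name, image, mount path, and size limit are hypothetical values chosen for the example, not part of the original scenario.

```python
import json


def build_pod_manifest(name, image, size_limit="1Gi"):
    """Return a Pod manifest with a RAM-backed emptyDir scratch volume."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {
                    "name": "transform",
                    "image": image,
                    # The container writes intermediate data here (hypothetical path).
                    "volumeMounts": [{"name": "scratch", "mountPath": "/scratch"}],
                }
            ],
            "volumes": [
                {
                    "name": "scratch",
                    # medium "Memory" backs the volume with the node's RAM (tmpfs).
                    # The volume is deleted when the pod is removed.
                    "emptyDir": {"medium": "Memory", "sizeLimit": size_limit},
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical pod name and image, for illustration only.
    manifest = build_pod_manifest("transform-job", "example.com/transform:latest")
    print(json.dumps(manifest, indent=2))
```

Data written to a memory-backed emptyDir counts against the container's memory usage, so a size limit is a reasonable safeguard.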
A company has data in an on-premises NFS file share. The company plans to migrate to AWS. The company uses the data for data analysis. The company has written AWS Lambda functions to analyze the data. The company wants to continue to use NFS for the file system that Lambda accesses. The data must be shared across all concurrently running Lambda functions.
Which solution should the company use for this data migration?
Migrate the data to Amazon Elastic File System (Amazon EFS). Configure the Lambda functions to mount the file system.
Amazon EFS is a scalable file storage service that you can integrate with Lambda or other compute options. A solution that uses Amazon EFS for file storage meets the requirements. Lambda can access the data by using NFS. Additionally, the data is accessible from all concurrently running Lambda functions.
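From the function's point of view, a mounted EFS file system is an ordinary directory. The following is a minimal sketch of a handler that lists the shared files. The mount path /mnt/data and the EFS_MOUNT_PATH environment variable are assumptions for illustration; the real path is whatever local mount path (beginning with /mnt/) is configured on the function.

```python
import os

# Hypothetical mount path; Lambda local mount paths for EFS start with /mnt/.
MOUNT_PATH = os.environ.get("EFS_MOUNT_PATH", "/mnt/data")


def list_shared_files(root):
    """Return the sorted relative paths of all files under root."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            full_path = os.path.join(dirpath, filename)
            found.append(os.path.relpath(full_path, root))
    return sorted(found)


def handler(event, context):
    """Lambda entry point: report the files visible on the shared file system."""
    files = list_shared_files(MOUNT_PATH)
    return {"mount_path": MOUNT_PATH, "file_count": len(files), "files": files}
```

Every concurrent invocation that mounts the same file system sees the same directory tree, which is what satisfies the sharing requirement.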
A company is running an Amazon Redshift cluster. A data engineer must design a solution that lets the company run analysis in a separate Amazon Redshift test environment. The solution will use the data from the main Redshift cluster. The test environment is expected to be used for only 2 hours every 2 weeks as part of a new testing process.
Which solution will meet these requirements in the MOST cost-effective manner?
Create a data share from the main Redshift cluster to the Redshift test cluster. Use Redshift Serverless for the test environment.
Redshift data sharing gives you the ability to share live data across Redshift clusters and Redshift Serverless endpoints at no additional cost, without copying the data. Redshift Serverless automatically provisions and scales data warehouse capacity to run the test workloads. You pay only for the compute capacity that the workloads use. There are no compute costs when no workloads are running. The test environment is used for only 2 hours every 2 weeks. Therefore, a solution that uses Redshift Serverless for the test environment will help reduce compute costs.
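The size of the potential saving follows from simple arithmetic on the usage pattern. The sketch below compares the hours the test environment is active with the hours in a year. It deliberately uses no prices, because actual rates vary by Region and configuration; the 365-day year is a simplifying assumption.

```python
HOURS_PER_RUN = 2        # from the scenario: 2 hours per test run
DAYS_BETWEEN_RUNS = 14   # from the scenario: one run every 2 weeks
DAYS_PER_YEAR = 365      # simplifying assumption


def active_hours_per_year(hours_per_run, days_between_runs):
    """Hours per year that the test environment actually runs workloads."""
    runs_per_year = DAYS_PER_YEAR // days_between_runs  # whole runs only
    return runs_per_year * hours_per_run


def utilization(hours_per_run, days_between_runs):
    """Fraction of the year during which compute is in use."""
    total_hours = DAYS_PER_YEAR * 24
    return active_hours_per_year(hours_per_run, days_between_runs) / total_hours


if __name__ == "__main__":
    hours = active_hours_per_year(HOURS_PER_RUN, DAYS_BETWEEN_RUNS)
    share = utilization(HOURS_PER_RUN, DAYS_BETWEEN_RUNS)
    print(f"Active hours per year: {hours}")   # 52
    print(f"Utilization: {share:.2%}")         # 0.59%
```

An always-on test cluster would sit idle for more than 99% of the year, which is why paying only while workloads run is the cost-effective choice here.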
An ecommerce company is running an application on AWS. The application sources recent data from tables in Amazon Redshift. Data that is older than 1 year is accessible in Amazon S3. Recently, a new report has been written in SQL. The report needs to compare a few columns from the current year sales table with the same columns from tables with sales data from previous years. The report runs slowly, with poor performance and long wait times to get results.
A data engineer must optimize the back-end storage to accelerate the query.
Which solution will meet these requirements MOST efficiently?
Run the report SQL statement to gather the data from S3. Store the result set in a Redshift materialized view. Configure the report to run the SQL REFRESH MATERIALIZED VIEW statement. Then, query the materialized view.
You can use Redshift materialized views to speed up queries that are predictable and repeated. A materialized view stores the precomputed result set, so the report reads the stored results instead of scanning the historical data in S3 on every run. A solution that runs REFRESH MATERIALIZED VIEW on the materialized view would ensure that the latest data from the current sales table is included in the report.
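The sequence of create, refresh, then query can be sketched as three SQL statements. The helper below only assembles the statement text and does not connect to a database. All table, schema, column, and view names are hypothetical, and the sketch assumes that the historical S3 data is exposed through an external schema.

```python
def materialized_view_statements(view_name, select_sql):
    """Build the create, refresh, and query statements for a materialized view.

    Identifiers are inserted as-is, which is acceptable only for this illustration.
    """
    select_sql = select_sql.strip().rstrip(";")
    return {
        "create": f"CREATE MATERIALIZED VIEW {view_name} AS\n{select_sql};",
        "refresh": f"REFRESH MATERIALIZED VIEW {view_name};",
        "query": f"SELECT * FROM {view_name};",
    }


# Hypothetical report query: compares current sales with historical sales in S3.
REPORT_SELECT = """
SELECT cur.product_id,
       cur.total_sales  AS current_sales,
       hist.total_sales AS prior_sales
FROM sales_current AS cur
JOIN spectrum_schema.sales_history AS hist
  ON cur.product_id = hist.product_id
"""

if __name__ == "__main__":
    statements = materialized_view_statements("mv_sales_comparison", REPORT_SELECT)
    for step in ("create", "refresh", "query"):
        print(statements[step], end="\n\n")
```

The create statement runs once. After that, the report issues the refresh statement followed by the query on each run.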
A company is running an Amazon Redshift data warehouse on AWS. The company has recently started using a software as a service (SaaS) sales application that is supported by several AWS services. The company wants to transfer some of the data in the SaaS application to Amazon Redshift for reporting purposes.
A data engineer must configure a solution that can continuously send data from the SaaS application to Amazon Redshift.
Which solution will meet these requirements with the LEAST operational overhead?
Create an Amazon AppFlow flow to ingest the selected source data into Amazon Redshift. Configure the flow to run on event.
With Amazon AppFlow, a flow transfers data between a source and a destination. Amazon AppFlow supports many AWS services and SaaS applications as sources or destinations. A solution that uses Amazon AppFlow can continuously send data from the SaaS application to Amazon Redshift with the least operational overhead.
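An AppFlow flow can be triggered on demand, on a schedule, or on event. The sketch below builds only the trigger portion of a flow definition, in the shape used by the AppFlow CreateFlow API (triggerConfig with a triggerType). The helper name is hypothetical, and the source, destination, and task settings that a complete flow needs are omitted.

```python
VALID_TRIGGER_TYPES = {"Scheduled", "Event", "OnDemand"}


def build_trigger_config(trigger_type, schedule_expression=None):
    """Return the triggerConfig portion of an AppFlow flow definition."""
    if trigger_type not in VALID_TRIGGER_TYPES:
        raise ValueError(f"Unsupported trigger type: {trigger_type!r}")
    config = {"triggerType": trigger_type}
    if trigger_type == "Scheduled":
        # Scheduled flows also need a schedule expression.
        if not schedule_expression:
            raise ValueError("Scheduled flows require a schedule expression")
        config["triggerProperties"] = {
            "Scheduled": {"scheduleExpression": schedule_expression}
        }
    return config


if __name__ == "__main__":
    # Event-triggered flow, as in the answer above.
    print(build_trigger_config("Event"))  # {'triggerType': 'Event'}
```

An event trigger runs the flow when the source application emits a change event, so no polling or custom scheduling code has to be maintained.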