DA Flashcards
A near-real-time solution is needed that filters the confidential fields out of a sensitive streaming data source and stores the remaining data in durable storage.
Use Amazon Kinesis Data Firehose to ingest the streaming data and enable record transformation with an AWS Lambda function that removes the sensitive fields. Store the processed data in Amazon S3.
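A Firehose transformation Lambda for this pattern might look like the following sketch, assuming the records are JSON objects and that `ssn` and `email` are hypothetical names for the confidential fields:

```python
import base64
import json

# Hypothetical field names for the confidential data to drop.
SENSITIVE_FIELDS = {"ssn", "email"}

def lambda_handler(event, context):
    """Firehose record-transformation handler: decode each record,
    strip the sensitive fields, and return it re-encoded."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        # Keep only the non-confidential fields.
        cleaned = {k: v for k, v in payload.items() if k not in SENSITIVE_FIELDS}
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",  # tells Firehose the transformation succeeded
            "data": base64.b64encode(json.dumps(cleaned).encode()).decode(),
        })
    return {"records": output}
```

Firehose then delivers only the cleaned payloads to the S3 destination.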
Large files are compressed into a single GZIP file and uploaded to an S3 bucket. You need to speed up the COPY process that loads the data into Amazon Redshift.
Split the GZIP file into smaller files, making sure their number is a multiple of the number of slices in the Redshift cluster, so COPY loads them in parallel across all slices.
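A small helper can pick the file count. The 128 MB target below is illustrative (AWS guidance is roughly 1 MB–1 GB per compressed file after splitting); the point is rounding up to a multiple of the slice count so every slice loads the same number of files:

```python
import math

def split_count(total_size_mb, slices, target_file_mb=128):
    """Choose how many files to split the archive into: roughly
    total_size / target_size files, rounded up to the next multiple
    of the cluster's slice count for even parallel loading."""
    raw = max(1, math.ceil(total_size_mb / target_file_mb))
    return math.ceil(raw / slices) * slices
```

For example, a 5 GB archive on a cluster with 16 slices would be split into 48 files, so each slice loads exactly three.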
An Amazon EMR cluster needs to use a centralized metadata layer that will expose data in Amazon S3 as tables.
Use the AWS Glue Data Catalog.
Ways to fix Amazon Kinesis Data Streams throttling issues on write requests.
Increase the number of shards using the UpdateShardCount API operation.
Use random partition keys to distribute writes evenly across shards.
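A UUID makes an effectively random partition key. This is a minimal sketch; the stream name and the boto3 call in the comment are illustrative:

```python
import uuid

def build_entries(payloads):
    """Build PutRecords entries with random (UUID) partition keys so
    writes spread evenly across all shards instead of hot-spotting one."""
    return [
        {"Data": p, "PartitionKey": str(uuid.uuid4())}
        for p in payloads
    ]

# Usage with boto3 (not executed here):
# kinesis = boto3.client("kinesis")
# kinesis.put_records(StreamName="my-stream", Records=build_entries(batch))
```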
A company needs a cost-effective solution for detecting anomalous data coming from an Amazon Kinesis Data stream.
Create a Kinesis Data Analytics application and use the RANDOM_CUT_FOREST function for anomaly detection.
A company wants a cost-effective solution that will enable them to query a subset of data from a CSV file.
Use Amazon S3 Select
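A sketch of the S3 Select request parameters, assuming a CSV object with a header row and no compression; the bucket, key, and query in the usage comment are illustrative:

```python
def select_params(bucket, key, expression):
    """Build the arguments for s3.select_object_content: run a SQL
    expression server-side so only the matching subset is returned."""
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": expression,
        "InputSerialization": {
            "CSV": {"FileHeaderInfo": "USE"},  # first row holds column names
            "CompressionType": "NONE",
        },
        "OutputSerialization": {"CSV": {}},
    }

# Usage with boto3 (not executed here):
# s3 = boto3.client("s3")
# resp = s3.select_object_content(**select_params(
#     "my-bucket", "data.csv",
#     "SELECT s.id, s.total FROM S3Object s WHERE s.total > '100'"))
```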
You need to populate a data catalog using data stored in Amazon S3, Amazon RDS, and Amazon DynamoDB.
Use scheduled AWS Glue crawlers.
A Data Analyst used the COPY command to migrate CSV files into a Redshift cluster. However, no data was imported and no errors were reported after the process finished.
The CSV files use carriage returns as line terminators.
The IGNOREHEADER parameter was included in the COPY command.
What is a cost-effective solution to save Redshift query results to an external storage?
Use the Amazon Redshift UNLOAD command
A company is using Amazon S3 Standard-IA and Amazon S3 Glacier as its data storage.
Some of the data cannot be accessed with Amazon Athena queries. Which best explains this behavior?
Amazon Athena is trying to access data stored in Amazon S3 Glacier.
A company uses an Amazon EMR cluster to process 10 batch jobs every day. Each job takes about 20 minutes to complete. A solution must be implemented to lower the cost of the EMR cluster.
Use transient Amazon EMR clusters
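With boto3, a transient cluster is one launched with `KeepJobFlowAliveWhenNoSteps` disabled, so it terminates itself once the submitted steps finish and you pay only while the batch jobs run. The cluster name, release label, and instance sizes below are illustrative:

```python
def transient_cluster_config(steps):
    """Sketch of run_job_flow arguments for a transient EMR cluster
    that runs the given steps and then terminates itself."""
    return {
        "Name": "nightly-batch",        # hypothetical name
        "ReleaseLabel": "emr-6.10.0",   # hypothetical release
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            # Key setting: do NOT keep the cluster alive after the steps finish.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": steps,
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

# Usage with boto3 (not executed here):
# emr = boto3.client("emr")
# emr.run_job_flow(**transient_cluster_config(my_steps))
```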
An Amazon Kinesis Client Library (KCL) application checkpoints to a DynamoDB table that has provisioned write capacity. The application's latency increases during peak times and must be resolved immediately.
Increase the DynamoDB table's write throughput.
Thousands of files are being loaded into a central fact table hosted on Amazon Redshift. You need to optimize cluster resource utilization when loading data into the fact table.
Use a single COPY command to load data.
A Lambda function processes data from a Kinesis data stream and delivers the results to Amazon ES. During peak hours, processing slows down.
Use multiple Lambda functions to process the data concurrently.
A Data Analyst needs to join data stored in Amazon Redshift and data stored in Amazon S3. The Analyst wants a serverless solution that will reduce the workload of the Redshift cluster.
Create an external table using Amazon Redshift Spectrum for the S3 data and use Redshift SQL queries for join operations.