DEA-C01 Flashcards

Question 1

Q

A data engineer is configuring an AWS Glue job to read data from an Amazon S3 bucket. The data engineer has set up the necessary AWS Glue connection details and an associated IAM role. However, when the data engineer attempts to run the AWS Glue job, the data engineer receives an error message that indicates that there are problems with the Amazon S3 VPC gateway endpoint.

The data engineer must resolve the error and connect the AWS Glue job to the S3 bucket.

Which solution will meet this requirement?

A.

Update the AWS Glue security group to allow inbound traffic from the Amazon S3 VPC gateway endpoint.

B.

Configure an S3 bucket policy to explicitly grant the AWS Glue job permissions to access the S3 bucket.

C.

Review the AWS Glue job code to ensure that the AWS Glue connection details include a fully qualified domain name.

D.

Verify that the VPC’s route table includes inbound and outbound routes for the Amazon S3 VPC gateway endpoint.

Answer

A

Verify that the VPC’s route table includes inbound and outbound routes for the Amazon S3 VPC gateway endpoint.

Question 2

Q

A data engineer needs to create an AWS Lambda function that converts the format of data from .csv to Apache Parquet. The Lambda function must run only if a user uploads a .csv file to an Amazon S3 bucket.

Which solution will meet these requirements with the LEAST operational overhead?

A.

Create an S3 event notification that has an event type of s3:ObjectCreated:*. Use a filter rule to generate notifications only when the suffix includes .csv. Set the Amazon Resource Name (ARN) of the Lambda function as the destination for the event notification.

B.

Create an S3 event notification that has an event type of s3:ObjectTagging:* for objects that have a tag set to .csv. Set the Amazon Resource Name (ARN) of the Lambda function as the destination for the event notification.

C.

Create an S3 event notification that has an event type of s3:*. Use a filter rule to generate notifications only when the suffix includes .csv. Set the Amazon Resource Name (ARN) of the Lambda function as the destination for the event notification.

D.

Create an S3 event notification that has an event type of s3:ObjectCreated:*. Use a filter rule to generate notifications only when the suffix includes .csv. Set an Amazon Simple Notification Service (Amazon SNS) topic as the destination for the event notification. Subscribe the Lambda function to the SNS topic.

Answer

A

Create an S3 event notification that has an event type of s3:ObjectCreated:*. Use a filter rule to generate notifications only when the suffix includes .csv. Set the Amazon Resource Name (ARN) of the Lambda function as the destination for the event notification.
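
As a rough illustration of the answer above, the sketch below builds the notification configuration as a plain Python dictionary in the shape that the boto3 `put_bucket_notification_configuration` call accepts. The function ARN and bucket name are hypothetical placeholders, and the boto3 call is left commented out so the block runs without AWS credentials.

```python
import json


def build_csv_notification(lambda_arn: str) -> dict:
    """Return an S3 notification configuration that targets one Lambda function."""
    return {
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": lambda_arn,
                # Fire on every kind of object creation (PUT, POST, COPY, multipart).
                "Events": ["s3:ObjectCreated:*"],
                # Only keys that end in .csv generate a notification.
                "Filter": {
                    "Key": {"FilterRules": [{"Name": "suffix", "Value": ".csv"}]}
                },
            }
        ]
    }


# Hypothetical ARN, for illustration only.
config = build_csv_notification(
    "arn:aws:lambda:us-east-1:111122223333:function:csv-to-parquet"
)
print(json.dumps(config, indent=2))

# With boto3 installed and credentials configured, the dictionary could be
# applied like this (the bucket name is hypothetical):
# import boto3
# boto3.client("s3").put_bucket_notification_configuration(
#     Bucket="example-upload-bucket", NotificationConfiguration=config
# )
```

Amazon S3 also needs permission to invoke the function through the function's resource-based policy before the notification can be saved.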

Question 3

Q

An insurance company stores transaction data that the company compressed with gzip.

The company needs to query the transaction data for occasional audits.

Which solution will meet this requirement in the MOST cost-effective way?

A.

Store the data in Amazon S3 Glacier Flexible Retrieval. Use Amazon S3 Glacier Select to query the data.

B.

Store the data in Amazon S3. Use Amazon S3 Select to query the data.

C.

Store the data in Amazon S3. Use Amazon Athena to query the data.

D.

Store the data in Amazon S3 Glacier Instant Retrieval. Use Amazon Athena to query the data.

Answer

A

Store the data in Amazon S3. Use Amazon S3 Select to query the data.

Question 4

Q

A data engineer finished testing an Amazon Redshift stored procedure that processes and inserts data into a table that is not mission critical. The engineer wants to automatically run the stored procedure on a daily basis.

Which solution will meet this requirement in the MOST cost-effective way?

A.

Create an AWS Lambda function to schedule a cron job to run the stored procedure.

B.

Schedule and run the stored procedure by using the Amazon Redshift Data API in an Amazon EC2 Spot Instance.

C.

Use query editor v2 to run the stored procedure on a schedule.

D.

Schedule an AWS Glue Python shell job to run the stored procedure.

Answer

A

Use query editor v2 to run the stored procedure on a schedule.

Question 5

Q

A marketing company collects clickstream data. The company sends the clickstream data to Amazon Kinesis Data Firehose and stores the clickstream data in Amazon S3. The company wants to build a series of dashboards that hundreds of users from multiple departments will use.

The company will use Amazon QuickSight to develop the dashboards. The company wants a solution that can scale and provide daily updates about clickstream activity.

Which combination of steps will meet these requirements MOST cost-effectively? (Choose two.)

A.

Use Amazon Redshift to store and query the clickstream data.

B.

Use Amazon Athena to query the clickstream data.

C.

Use Amazon S3 analytics to query the clickstream data.

D.

Access the query data through a QuickSight direct SQL query.

E.

Access the query data through QuickSight SPICE (Super-fast, Parallel, In-memory Calculation Engine). Configure a daily refresh for the dataset.

Answer

A

Use Amazon Athena to query the clickstream data.

Access the query data through QuickSight SPICE (Super-fast, Parallel, In-memory Calculation Engine). Configure a daily refresh for the dataset.

Question 6

Q

A data engineer is building a data orchestration workflow. The data engineer plans to use a hybrid model that includes some on-premises resources and some resources that are in the cloud. The data engineer wants to prioritize portability and open source resources.

Which service should the data engineer use in both the on-premises environment and the cloud-based environment?

A.

AWS Data Exchange

B.

Amazon Simple Workflow Service (Amazon SWF)

C.

Amazon Managed Workflows for Apache Airflow (Amazon MWAA)

D.

AWS Glue

Answer

A

Amazon Managed Workflows for Apache Airflow (Amazon MWAA)

Question 7

Q

A gaming company uses a NoSQL database to store customer information. The company is planning to migrate to AWS.

The company needs a fully managed AWS solution that will handle high online transaction processing (OLTP) workload, provide single-digit millisecond performance, and provide high availability around the world.

Which solution will meet these requirements with the LEAST operational overhead?

A.

Amazon Keyspaces (for Apache Cassandra)

B.

Amazon DocumentDB (with MongoDB compatibility)

C.

Amazon DynamoDB

D.

Amazon Timestream

Answer

A

Amazon DynamoDB

Question 8

Q

A data engineer creates an AWS Lambda function that an Amazon EventBridge event will invoke. When the data engineer tries to invoke the Lambda function by using an EventBridge event, an AccessDeniedException message appears.

How should the data engineer resolve the exception?

A.

Ensure that the trust policy of the Lambda function execution role allows EventBridge to assume the execution role.

B.

Ensure that both the IAM role that EventBridge uses and the Lambda function’s resource-based policy have the necessary permissions.

C.

Ensure that the subnet where the Lambda function is deployed is configured to be a private subnet.

D.

Ensure that EventBridge schemas are valid and that the event mapping configuration is correct.

Answer

A

Ensure that both the IAM role that EventBridge uses and the Lambda function’s resource-based policy have the necessary permissions.
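
The Lambda side of this answer can be sketched as the arguments for the boto3 `add_permission` call, which adds a statement to the function's resource-based policy. The function name, statement ID, and rule ARN below are hypothetical, and the call itself is commented out so the block runs without AWS access.

```python
def build_invoke_permission(function_name: str, rule_arn: str) -> dict:
    """Return keyword arguments that let one EventBridge rule invoke a function."""
    return {
        "FunctionName": function_name,
        "StatementId": "allow-eventbridge-invoke",  # hypothetical statement ID
        "Action": "lambda:InvokeFunction",
        # EventBridge service principal.
        "Principal": "events.amazonaws.com",
        # Restrict the grant to a single rule instead of every rule in the account.
        "SourceArn": rule_arn,
    }


# Hypothetical names, for illustration only.
permission = build_invoke_permission(
    "nightly-report",
    "arn:aws:events:us-east-1:111122223333:rule/nightly-report-schedule",
)
print(permission)

# With boto3 installed and credentials configured:
# import boto3
# boto3.client("lambda").add_permission(**permission)
```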

Question 9

Q

A company uses a data lake that is based on an Amazon S3 bucket. To comply with regulations, the company must apply two layers of server-side encryption to files that are uploaded to the S3 bucket. The company wants to use an AWS Lambda function to apply the necessary encryption.

Which solution will meet these requirements?

A.

Use both server-side encryption with AWS KMS keys (SSE-KMS) and the Amazon S3 Encryption Client.

B.

Use dual-layer server-side encryption with AWS KMS keys (DSSE-KMS).

C.

Use server-side encryption with customer-provided keys (SSE-C) before files are uploaded.

D.

Use server-side encryption with AWS KMS keys (SSE-KMS).

Answer

A

Use dual-layer server-side encryption with AWS KMS keys (DSSE-KMS).
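
A minimal sketch of how a Lambda function might request DSSE-KMS on upload, assuming it writes objects with the boto3 `put_object` call. The bucket, key, and KMS key alias are hypothetical, and the upload itself is commented out so the block runs offline.

```python
def build_dsse_put_args(bucket: str, key: str, body: bytes, kms_key_id: str) -> dict:
    """Return put_object keyword arguments that request dual-layer encryption."""
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        # "aws:kms:dsse" selects dual-layer server-side encryption with AWS KMS keys.
        "ServerSideEncryption": "aws:kms:dsse",
        "SSEKMSKeyId": kms_key_id,
    }


# Hypothetical names, for illustration only.
args = build_dsse_put_args(
    "example-data-lake", "raw/file.csv", b"a,b\n1,2\n", "alias/example-key"
)
print({k: v for k, v in args.items() if k != "Body"})

# With boto3 installed and credentials configured:
# import boto3
# boto3.client("s3").put_object(**args)
```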

Question 10

Q

A data engineer notices that Amazon Athena queries are held in a queue before the queries run.

How can the data engineer prevent the queries from queueing?

A.

Increase the query result limit.

B.

Configure provisioned capacity for an existing workgroup.

C.

Use federated queries.

D.

Allow users who run the Athena queries to an existing workgroup.

Answer

A

Configure provisioned capacity for an existing workgroup.

Question 11

Q

A data engineer needs to debug an AWS Glue job that reads from Amazon S3 and writes to Amazon Redshift. The data engineer enabled the bookmark feature for the AWS Glue job.

The data engineer has set the maximum concurrency for the AWS Glue job to 1.

The AWS Glue job is successfully writing the output to Amazon Redshift. However, the Amazon S3 files that were loaded during previous runs of the AWS Glue job are being reprocessed by subsequent runs.

What is the likely reason the AWS Glue job is reprocessing the files?

A.

The AWS Glue job does not have the s3:GetObjectAcl permission that is required for bookmarks to work correctly.

B.

The maximum concurrency for the AWS Glue job is set to 1.

C.

The data engineer incorrectly specified an older version of AWS Glue for the Glue job.

D.

The AWS Glue job does not have a required commit statement.

Answer

A

The AWS Glue job does not have the s3:GetObjectAcl permission that is required for bookmarks to work correctly.

Question 12

Q

An ecommerce company wants to use AWS to migrate data pipelines from an on-premises environment into the AWS Cloud. The company currently uses a third-party tool in the on-premises environment to orchestrate data ingestion processes.

The company wants a migration solution that does not require the company to manage servers. The solution must be able to orchestrate Python and Bash scripts. The solution must not require the company to refactor any code.

Which solution will meet these requirements with the LEAST operational overhead?

A.

AWS Lambda

B.

Amazon Managed Workflows for Apache Airflow (Amazon MWAA)

C.

AWS Step Functions

D.

AWS Glue

Answer

A

Amazon Managed Workflows for Apache Airflow (Amazon MWAA)

Question 13

Q

A data engineer needs Amazon Athena queries to finish faster. The data engineer notices that all the files the Athena queries use are currently stored in uncompressed .csv format. The data engineer also notices that users perform most queries by selecting a specific column.

Which solution will MOST speed up the Athena query performance?

A.

Change the data format from .csv to JSON format. Apply Snappy compression.

B.

Compress the .csv files by using Snappy compression.

C.

Change the data format from .csv to Apache Parquet. Apply Snappy compression.

D.

Compress the .csv files by using gzip compression.

Answer

A

Change the data format from .csv to Apache Parquet. Apply Snappy compression.

Question 14

Q

A retail company stores data from a product lifecycle management (PLM) application in an on-premises MySQL database. The PLM application frequently updates the database when transactions occur.

The company wants to gather insights from the PLM application in near real time. The company wants to integrate the insights with other business datasets and to analyze the combined dataset by using an Amazon Redshift data warehouse.

The company has already established an AWS Direct Connect connection between the on-premises infrastructure and AWS.

Which solution will meet these requirements with the LEAST development effort?

A.

Run a scheduled AWS Glue extract, transform, and load (ETL) job to get the MySQL database updates by using a Java Database Connectivity (JDBC) connection. Set Amazon Redshift as the destination for the ETL job.

B.

Run a full load plus CDC task in AWS Database Migration Service (AWS DMS) to continuously replicate the MySQL database changes. Set Amazon Redshift as the destination for the task.

C.

Use the Amazon AppFlow SDK to build a custom connector for the MySQL database to continuously replicate the database changes. Set Amazon Redshift as the destination for the connector.

D.

Run scheduled AWS DataSync tasks to synchronize data from the MySQL database. Set Amazon Redshift as the destination for the tasks.

Answer

A

Run a full load plus CDC task in AWS Database Migration Service (AWS DMS) to continuously replicate the MySQL database changes. Set Amazon Redshift as the destination for the task.

Question 15

Q

A marketing company uses Amazon S3 to store clickstream data. The company queries the data at the end of each day by using a SQL JOIN clause on S3 objects that are stored in separate buckets.

The company creates key performance indicators (KPIs) based on the objects. The company needs a serverless solution that will give users the ability to query data by partitioning the data. The solution must maintain the atomicity, consistency, isolation, and durability (ACID) properties of the data.

Which solution will meet these requirements MOST cost-effectively?

A.

Amazon S3 Select

B.

Amazon Redshift Spectrum

C.

Amazon Athena

D.

Amazon EMR

Answer

A

Amazon Athena

Question 16

Q

A company wants to migrate data from an Amazon RDS for PostgreSQL DB instance in the eu-east-1 Region of an AWS account named Account_A. The company will migrate the data to an Amazon Redshift cluster in the eu-west-1 Region of an AWS account named Account_B.

Which solution will give AWS Database Migration Service (AWS DMS) the ability to replicate data between two data stores?

A.

Set up an AWS DMS replication instance in Account_B in eu-west-1.

B.

Set up an AWS DMS replication instance in Account_B in eu-east-1.

C.

Set up an AWS DMS replication instance in a new AWS account in eu-west-1.

D.

Set up an AWS DMS replication instance in Account_A in eu-east-1.

Answer

A

Set up an AWS DMS replication instance in Account_B in eu-west-1.

Question 17

Q

A company uses Amazon S3 as a data lake. The company sets up a data warehouse by using a multi-node Amazon Redshift cluster. The company organizes the data files in the data lake based on the data source of each data file.

The company loads all the data files into one table in the Redshift cluster by using a separate COPY command for each data file location. This approach takes a long time to load all the data files into the table. The company must increase the speed of the data ingestion. The company does not want to increase the cost of the process.

Which solution will meet these requirements?

A.

Use a provisioned Amazon EMR cluster to copy all the data files into one folder. Use a COPY command to load the data into Amazon Redshift.

B.

Load all the data files in parallel into Amazon Aurora. Run an AWS Glue job to load the data into Amazon Redshift.

C.

Use an AWS Glue job to copy all the data files into one folder. Use a COPY command to load the data into Amazon Redshift.

D.

Create a manifest file that contains the data file locations. Use a COPY command to load the data into Amazon Redshift.

Answer

A

Create a manifest file that contains the data file locations. Use a COPY command to load the data into Amazon Redshift.
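
To make the manifest approach concrete, the sketch below builds a manifest in the JSON layout that the Redshift COPY command reads, plus a matching COPY statement. The bucket names, table name, and role ARN are hypothetical, and any data format options (for example CSV or a delimiter) are omitted because the question does not state the file format.

```python
import json


def build_manifest(file_urls: list) -> dict:
    """Return a COPY manifest that lists each data file explicitly."""
    # mandatory=True makes COPY fail if a listed file is missing.
    return {"entries": [{"url": url, "mandatory": True} for url in file_urls]}


def build_copy_statement(table: str, manifest_url: str, iam_role_arn: str) -> str:
    """Return one COPY statement that loads every file named in the manifest."""
    return (
        f"COPY {table}\n"
        f"FROM '{manifest_url}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "MANIFEST;"
    )


# Hypothetical file locations, grouped by data source as in the question.
manifest = build_manifest(
    [
        "s3://example-lake/source-a/part-0000.csv",
        "s3://example-lake/source-b/part-0000.csv",
    ]
)
print(json.dumps(manifest, indent=2))
print(
    build_copy_statement(
        "sales",
        "s3://example-lake/manifests/sales.manifest",
        "arn:aws:iam::111122223333:role/example-redshift-copy",
    )
)
```

A single COPY over the manifest lets the cluster split the files across its slices and load them in parallel, which is why it is faster than one COPY per location.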

Question 18

Q

A company plans to use Amazon Kinesis Data Firehose to store data in Amazon S3. The source data consists of 2 MB .csv files. The company must convert the .csv files to JSON format. The company must store the files in Apache Parquet format.

Which solution will meet these requirements with the LEAST development effort?

A.

Use Kinesis Data Firehose to convert the .csv files to JSON. Use an AWS Lambda function to store the files in Parquet format.

B.

Use Kinesis Data Firehose to convert the .csv files to JSON and to store the files in Parquet format.

C.

Use Kinesis Data Firehose to invoke an AWS Lambda function that transforms the .csv files to JSON and stores the files in Parquet format.

D.

Use Kinesis Data Firehose to invoke an AWS Lambda function that transforms the .csv files to JSON. Use Kinesis Data Firehose to store the files in Parquet format.

Answer

A

Use Kinesis Data Firehose to convert the .csv files to JSON and to store the files in Parquet format.

Question 19

Q

A company is using an AWS Transfer Family server to migrate data from an on-premises environment to AWS. Company policy mandates the use of TLS 1.2 or above to encrypt the data in transit.

Which solution will meet these requirements?

A.

Generate new SSH keys for the Transfer Family server. Make the old keys and the new keys available for use.

B.

Update the security group rules for the on-premises network to allow only connections that use TLS 1.2 or above.

C.

Update the security policy of the Transfer Family server to specify a minimum protocol version of TLS 1.2.

D.

Install an SSL certificate on the Transfer Family server to encrypt data transfers by using TLS 1.2.

Answer

A

Update the security policy of the Transfer Family server to specify a minimum protocol version of TLS 1.2.

Question 20

Q

A company wants to migrate an application and an on-premises Apache Kafka server to AWS. The application processes incremental updates that an on-premises Oracle database sends to the Kafka server. The company wants to use the replatform migration strategy instead of the refactor strategy.

Which solution will meet these requirements with the LEAST management overhead?

A.

Amazon Kinesis Data Streams

B.

Amazon Managed Streaming for Apache Kafka (Amazon MSK) provisioned cluster

C.

Amazon Kinesis Data Firehose

D.

Amazon Managed Streaming for Apache Kafka (Amazon MSK) Serverless

Answer

A

Amazon Managed Streaming for Apache Kafka (Amazon MSK) Serverless

Question 21

Q

A data engineer is building an automated extract, transform, and load (ETL) ingestion pipeline by using AWS Glue. The pipeline ingests compressed files that are in an Amazon S3 bucket. The ingestion pipeline must support incremental data processing.

Which AWS Glue feature should the data engineer use to meet this requirement?

A.

Workflows

B.

Triggers

C.

Job bookmarks

D.

Classifiers

Answer

A

Job bookmarks

Question 22

Q

A banking company uses an application to collect large volumes of transactional data. The company uses Amazon Kinesis Data Streams for real-time analytics. The company’s application uses the PutRecord action to send data to Kinesis Data Streams.

A data engineer has observed network outages during certain times of day. The data engineer wants to configure exactly-once delivery for the entire processing pipeline.

Which solution will meet this requirement?

A.

Design the application so it can remove duplicates during processing by embedding a unique ID in each record at the source.

B.

Update the checkpoint configuration of the Amazon Managed Service for Apache Flink (previously known as Amazon Kinesis Data Analytics) data collection application to avoid duplicate processing of events.

C.

Design the data source so events are not ingested into Kinesis Data Streams multiple times.

D.

Stop using Kinesis Data Streams. Use Amazon EMR instead. Use Apache Flink and Apache Spark Streaming in Amazon EMR.

Answer

A

Design the application so it can remove duplicates during processing by embedding a unique ID in each record at the source.
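
The idea behind this answer can be sketched in a few lines: the producer stamps each record with a unique ID once, and the consumer skips any ID it has already handled, so a retried PutRecord does not lead to double processing. The in-memory set below is an assumption for illustration; a real pipeline would need a durable store for the seen IDs.

```python
import uuid


def make_record(payload: dict) -> dict:
    """Attach a unique ID at the source, before the first send attempt."""
    return {"id": str(uuid.uuid4()), "payload": payload}


class Deduplicator:
    """Track processed record IDs and reject repeats."""

    def __init__(self) -> None:
        self._seen = set()

    def should_process(self, record: dict) -> bool:
        if record["id"] in self._seen:
            return False  # duplicate delivery, for example after a producer retry
        self._seen.add(record["id"])
        return True


record = make_record({"account": "12345", "amount": 100})
# Simulate a network timeout that makes the producer send the same record twice.
deliveries = [record, record]

dedup = Deduplicator()
processed = [r for r in deliveries if dedup.should_process(r)]
print(len(deliveries), "delivered,", len(processed), "processed")
```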

Question 23

Q

A company stores logs in an Amazon S3 bucket. When a data engineer attempts to access several log files, the data engineer discovers that some files have been unintentionally deleted.

The data engineer needs a solution that will prevent unintentional file deletion in the future.

Which solution will meet this requirement with the LEAST operational overhead?

A.

Manually back up the S3 bucket on a regular basis.

B.

Enable S3 Versioning for the S3 bucket.

C.

Configure replication for the S3 bucket.

D.

Use an Amazon S3 Glacier storage class to archive the data that is in the S3 bucket.

Answer

A

Enable S3 Versioning for the S3 bucket.

Question 24

Q

A manufacturing company collects sensor data from its factory floor to monitor and enhance operational efficiency. The company uses Amazon Kinesis Data Streams to publish the data that the sensors collect to a data stream. Then Amazon Kinesis Data Firehose writes the data to an Amazon S3 bucket.

The company needs to display a real-time view of operational efficiency on a large screen in the manufacturing facility.

Which solution will meet these requirements with the LOWEST latency?

A.

Use Amazon Managed Service for Apache Flink (previously known as Amazon Kinesis Data Analytics) to process the sensor data. Use a connector for Apache Flink to write data to an Amazon Timestream database. Use the Timestream database as a source to create a Grafana dashboard.

B.

Configure the S3 bucket to send a notification to an AWS Lambda function when any new object is created. Use the Lambda function to publish the data to Amazon Aurora. Use Aurora as a source to create an Amazon QuickSight dashboard.

C.

Use Amazon Managed Service for Apache Flink (previously known as Amazon Kinesis Data Analytics) to process the sensor data. Create a new Data Firehose delivery stream to publish data directly to an Amazon Timestream database. Use the Timestream database as a source to create an Amazon QuickSight dashboard.

D.

Use AWS Glue bookmarks to read sensor data from the S3 bucket in real time. Publish the data to an Amazon Timestream database. Use the Timestream database as a source to create a Grafana dashboard.

Answer

A

Use Amazon Managed Service for Apache Flink (previously known as Amazon Kinesis Data Analytics) to process the sensor data. Use a connector for Apache Flink to write data to an Amazon Timestream database. Use the Timestream database as a source to create a Grafana dashboard.

Question 25

Q

A telecommunications company collects network usage data throughout each day at a rate of several thousand data points each second. The company runs an application to process the usage data in real time. The company aggregates and stores the data in an Amazon Aurora DB instance.

Sudden drops in network usage usually indicate a network outage. The company must be able to identify sudden drops in network usage so the company can take immediate remedial actions.

Which solution will meet this requirement with the LEAST latency?

A.

Create an AWS Lambda function to query Aurora for drops in network usage. Use Amazon EventBridge to automatically invoke the Lambda function every minute.

B.

Modify the processing application to publish the data to an Amazon Kinesis data stream. Create an Amazon Managed Service for Apache Flink (previously known as Amazon Kinesis Data Analytics) application to detect drops in network usage.

C.

Replace the Aurora database with an Amazon DynamoDB table. Create an AWS Lambda function to query the DynamoDB table for drops in network usage every minute. Use DynamoDB Accelerator (DAX) between the processing application and DynamoDB table.

D.

Create an AWS Lambda function within the Database Activity Streams feature of Aurora to detect drops in network usage.

Answer

A

Modify the processing application to publish the data to an Amazon Kinesis data stream. Create an Amazon Managed Service for Apache Flink (previously known as Amazon Kinesis Data Analytics) application to detect drops in network usage.

Question 26

Q

A data engineer is processing and analyzing multiple terabytes of raw data that is in Amazon S3. The data engineer needs to clean and prepare the data. Then the data engineer needs to load the data into Amazon Redshift for analytics.

The data engineer needs a solution that will give data analysts the ability to perform complex queries. The solution must eliminate the need to perform complex extract, transform, and load (ETL) processes or to manage infrastructure.

Which solution will meet these requirements with the LEAST operational overhead?

A.

Use Amazon EMR to prepare the data. Use AWS Step Functions to load the data into Amazon Redshift. Use Amazon QuickSight to run queries.

B.

Use AWS Glue DataBrew to prepare the data. Use AWS Glue to load the data into Amazon Redshift. Use Amazon Redshift to run queries.

C.

Use AWS Lambda to prepare the data. Use Amazon Kinesis Data Firehose to load the data into Amazon Redshift. Use Amazon Athena to run queries.

D.

Use AWS Glue to prepare the data. Use AWS Database Migration Service (AWS DMS) to load the data into Amazon Redshift. Use Amazon Redshift Spectrum to run queries.

Answer

A

Use AWS Glue DataBrew to prepare the data. Use AWS Glue to load the data into Amazon Redshift. Use Amazon Redshift to run queries.

Question 27

Q

A company uses an AWS Lambda function to transfer files from a legacy SFTP environment to Amazon S3 buckets. The Lambda function is VPC enabled to ensure that all communications between the Lambda function and other AWS services that are in the same VPC environment will occur over a secure network.

The Lambda function is able to connect to the SFTP environment successfully. However, when the Lambda function attempts to upload files to the S3 buckets, the Lambda function returns timeout errors. A data engineer must resolve the timeout issues in a secure way.

Which solution will meet these requirements in the MOST cost-effective way?

A.

Create a NAT gateway in the public subnet of the VPC. Route network traffic to the NAT gateway.

B.

Create a VPC gateway endpoint for Amazon S3. Route network traffic to the VPC gateway endpoint.

C.

Create a VPC interface endpoint for Amazon S3. Route network traffic to the VPC interface endpoint.

D.

Use a VPC internet gateway to connect to the internet. Route network traffic to the VPC internet gateway.

Answer

A

Create a VPC gateway endpoint for Amazon S3. Route network traffic to the VPC gateway endpoint.
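
As an illustration only, the sketch below assembles the arguments for the boto3 EC2 `create_vpc_endpoint` call that creates an S3 gateway endpoint and attaches it to a route table. The VPC ID, route table ID, and Region are hypothetical, and the call is commented out so the block runs without AWS access.

```python
def build_s3_gateway_endpoint_args(vpc_id: str, region: str, route_table_ids: list) -> dict:
    """Return create_vpc_endpoint keyword arguments for an S3 gateway endpoint."""
    return {
        "VpcEndpointType": "Gateway",
        "VpcId": vpc_id,
        # S3 service name for the given Region.
        "ServiceName": f"com.amazonaws.{region}.s3",
        # S3 routes are added to these route tables, so traffic from the
        # associated subnets reaches S3 without a NAT gateway.
        "RouteTableIds": route_table_ids,
    }


# Hypothetical identifiers, for illustration only.
endpoint_args = build_s3_gateway_endpoint_args(
    "vpc-0123456789abcdef0", "us-east-1", ["rtb-0123456789abcdef0"]
)
print(endpoint_args)

# With boto3 installed and credentials configured:
# import boto3
# boto3.client("ec2").create_vpc_endpoint(**endpoint_args)
```

Gateway endpoints for Amazon S3 carry no additional charge, which is what makes this option cheaper than a NAT gateway or an interface endpoint.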

Question 28

Q

A company reads data from customer databases that run on Amazon RDS. The databases contain many inconsistent fields. For example, a customer record field that is named place_id in one database is named location_id in another database. The company needs to link customer records across different databases, even when customer record fields do not match.

Which solution will meet these requirements with the LEAST operational overhead?

A.

Create a provisioned Amazon EMR cluster to process and analyze data in the databases. Connect to the Apache Zeppelin notebook. Use the FindMatches transform to find duplicate records in the data.

B.

Create an AWS Glue crawler to crawl the databases. Use the FindMatches transform to find duplicate records in the data. Evaluate and tune the transform by evaluating the performance and results.

C.

Create an AWS Glue crawler to crawl the databases. Use Amazon SageMaker to construct Apache Spark ML pipelines to find duplicate records in the data.

D.

Create a provisioned Amazon EMR cluster to process and analyze data in the databases. Connect to the Apache Zeppelin notebook. Use an Apache Spark ML model to find duplicate records in the data. Evaluate and tune the model by evaluating the performance and results.

Answer

A

Create an AWS Glue crawler to crawl the databases. Use the FindMatches transform to find duplicate records in the data. Evaluate and tune the transform by evaluating the performance and results.

Question 29

Q

A finance company receives data from third-party data providers and stores the data as objects in an Amazon S3 bucket.

The company ran an AWS Glue crawler on the objects to create a data catalog. The AWS Glue crawler created multiple tables. However, the company expected that the crawler would create only one table.

The company needs a solution that will ensure the AWS Glue crawler creates only one table.

Which combination of solutions will meet this requirement? (Choose two.)

A.

Ensure that the object format, compression type, and schema are the same for each object.

B.

Ensure that the object format and schema are the same for each object. Do not enforce consistency for the compression type of each object.

C.

Ensure that the schema is the same for each object. Do not enforce consistency for the file format and compression type of each object.

D.

Ensure that the structure of the prefix for each S3 object name is consistent.

E.

Ensure that all S3 object names follow a similar pattern.

Answer

A

Ensure that the object format, compression type, and schema are the same for each object.

Ensure that the structure of the prefix for each S3 object name is consistent.

Question 30

Q

An application consumes messages from an Amazon Simple Queue Service (Amazon SQS) queue. The application experiences occasional downtime. As a result of the downtime, messages within the queue expire and are deleted after 1 day. The message deletions cause data loss for the application.

Which solutions will minimize data loss for the application? (Choose two.)

A.

Increase the message retention period.

B.

Increase the visibility timeout.

C.

Attach a dead-letter queue (DLQ) to the SQS queue.

D.

Use a delay queue to delay message delivery.

E.

Reduce message processing time.

Answer

A

Increase the message retention period.

Attach a dead-letter queue (DLQ) to the SQS queue.
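
Both parts of this answer map to SQS queue attributes. The sketch below builds the attribute dictionary in the form that the boto3 `set_queue_attributes` call expects (string values, with the redrive policy as a JSON string). The DLQ ARN, queue URL, and the chosen receive count are hypothetical.

```python
import json

MIN_RETENTION_SECONDS = 60
MAX_RETENTION_SECONDS = 14 * 24 * 60 * 60  # SQS allows at most 14 days


def build_queue_attributes(retention_days: int, dlq_arn: str, max_receive_count: int) -> dict:
    """Return SQS attributes that extend retention and attach a dead-letter queue."""
    retention_seconds = retention_days * 24 * 60 * 60
    if not MIN_RETENTION_SECONDS <= retention_seconds <= MAX_RETENTION_SECONDS:
        raise ValueError("retention must be between 60 seconds and 14 days")
    return {
        "MessageRetentionPeriod": str(retention_seconds),
        # Messages received more than max_receive_count times move to the DLQ.
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": max_receive_count}
        ),
    }


# Hypothetical DLQ ARN, for illustration only.
attributes = build_queue_attributes(
    14, "arn:aws:sqs:us-east-1:111122223333:example-dlq", 5
)
print(attributes)

# With boto3 installed and credentials configured (queue URL is hypothetical):
# import boto3
# boto3.client("sqs").set_queue_attributes(
#     QueueUrl="https://sqs.us-east-1.amazonaws.com/111122223333/example-queue",
#     Attributes=attributes,
# )
```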

Question 31

Q

A company is creating near real-time dashboards to visualize time series data. The company ingests data into Amazon Managed Streaming for Apache Kafka (Amazon MSK). A customized data pipeline consumes the data. The pipeline then writes data to Amazon Keyspaces (for Apache Cassandra), Amazon OpenSearch Service, and Apache Avro objects in Amazon S3.

Which solution will make the data available for the data visualizations with the LEAST latency?

A.

Create OpenSearch Dashboards by using the data from OpenSearch Service.

B.

Use Amazon Athena with an Apache Hive metastore to query the Avro objects in Amazon S3. Use Amazon Managed Grafana to connect to Athena and to create the dashboards.

C.

Use Amazon Athena to query the data from the Avro objects in Amazon S3. Configure Amazon Keyspaces as the data catalog. Connect Amazon QuickSight to Athena to create the dashboards.

D.

Use AWS Glue to catalog the data. Use S3 Select to query the Avro objects in Amazon S3. Connect Amazon QuickSight to the S3 bucket to create the dashboards.

Answer

A

Create OpenSearch Dashboards by using the data from OpenSearch Service.

Question 32

Q

A company stores petabytes of data in thousands of Amazon S3 buckets in the S3 Standard storage class. The data supports analytics workloads that have unpredictable and variable data access patterns.

The company does not access some data for months. However, the company must be able to retrieve all data within milliseconds. The company needs to optimize S3 storage costs.

Which solution will meet these requirements with the LEAST operational overhead?

A.

Use S3 Storage Lens standard metrics to determine when to move objects to more cost-optimized storage classes. Create S3 Lifecycle policies for the S3 buckets to move objects to cost-optimized storage classes. Continue to refine the S3 Lifecycle policies in the future to optimize storage costs.

B.

Use S3 Storage Lens activity metrics to identify S3 buckets that the company accesses infrequently. Configure S3 Lifecycle rules to move objects from S3 Standard to the S3 Standard-Infrequent Access (S3 Standard-IA) and S3 Glacier storage classes based on the age of the data.

C.

Use S3 Intelligent-Tiering. Activate the Deep Archive Access tier.

D.

Use S3 Intelligent-Tiering. Use the default access tier.

Answer

A

Use S3 Intelligent-Tiering. Use the default access tier.

Question 33

Q

A media company wants to use Amazon OpenSearch Service to analyze real-time data about popular musical artists and songs. The company expects to ingest millions of new data events every day. The new data events will arrive through an Amazon Kinesis data stream. The company must transform the data and then ingest the data into the OpenSearch Service domain.

Which method should the company use to ingest the data with the LEAST operational overhead?

A.

Use Amazon Kinesis Data Firehose and an AWS Lambda function to transform the data and deliver the transformed data to OpenSearch Service.

B.

Use a Logstash pipeline that has prebuilt filters to transform the data and deliver the transformed data to OpenSearch Service.

C.

Use an AWS Lambda function to call the Amazon Kinesis Agent to transform the data and deliver the transformed data to OpenSearch Service.

D.

Use the Kinesis Client Library (KCL) to transform the data and deliver the transformed data to OpenSearch Service.

Answer

A

Use Amazon Kinesis Data Firehose and an AWS Lambda function to transform the data and deliver the transformed data to OpenSearch Service.
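
A minimal sketch of the Lambda transformation step, assuming each incoming record carries a JSON payload with a hypothetical `artist` field. Firehose passes records as base64-encoded data and expects each `recordId` back with a `result` and the re-encoded `data`; the normalization itself is only a placeholder.

```python
import base64
import json


def lambda_handler(event, context):
    """Transform each Firehose record and return it in the response envelope."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            # Hypothetical transformation: tidy up the artist name.
            payload["artist"] = payload.get("artist", "").strip().title()
            encoded = base64.b64encode(json.dumps(payload).encode("utf-8"))
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": encoded.decode("utf-8"),
            })
        except (ValueError, KeyError, AttributeError):
            # Records that cannot be decoded or parsed are reported as failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record.get("data", ""),
            })
    return {"records": output}
```

Firehose then delivers the records marked `Ok` to the configured destination, so no custom consumer has to be operated.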

Question 34

Q

A company stores customer data tables that include customer addresses in an AWS Lake Formation data lake. To comply with new regulations, the company must ensure that users cannot access data for customers who are in Canada.

The company needs a solution that will prevent user access to rows for customers who are in Canada.

Which solution will meet this requirement with the LEAST operational effort?

A.

Set a row-level filter to prevent user access to a row where the country is Canada.

B.

Create an IAM role that restricts user access to an address where the country is Canada.

C.

Set a column-level filter to prevent user access to a row where the country is Canada.

D.

Apply a tag to all rows where Canada is the country. Prevent user access where the tag is equal to “Canada”.

Answer

A

Set a row-level filter to prevent user access to a row where the country is Canada.
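
As an illustration, a row-level filter can be described as a data cells filter. The sketch below only builds the request payload; the filter name and the `country` column are assumptions, and the resulting dictionary would be passed as `TableData` to the Lake Formation `create_data_cells_filter` call.

```python
def canada_row_filter(catalog_id, database, table):
    """Build a data cells filter that hides rows for customers in Canada."""
    return {
        "TableCatalogId": catalog_id,
        "DatabaseName": database,
        "TableName": table,
        "Name": "exclude_canada",  # hypothetical filter name
        # Assumes the table has a column named "country".
        "RowFilter": {"FilterExpression": "country <> 'Canada'"},
        # An empty wildcard keeps every column visible.
        "ColumnWildcard": {},
    }


if __name__ == "__main__":
    print(canada_row_filter("111122223333", "customers_db", "customers"))
```

Users who are granted SELECT through this filter see every column but none of the Canadian rows.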

Question 35

Q

A company stores daily records of the financial performance of investment portfolios in .csv format in an Amazon S3 bucket. A data engineer uses AWS Glue crawlers to crawl the S3 data.

The data engineer must make the S3 data accessible daily in the AWS Glue Data Catalog.

Which solution will meet these requirements?

A.

Create an IAM role that includes the AmazonS3FullAccess policy. Associate the role with the crawler. Specify the S3 bucket path of the source data as the crawler’s data store. Create a daily schedule to run the crawler. Configure the output destination to a new path in the existing S3 bucket.

B.

Create an IAM role that includes the AWSGlueServiceRole policy. Associate the role with the crawler. Specify the S3 bucket path of the source data as the crawler’s data store. Create a daily schedule to run the crawler. Specify a database name for the output.

C.

Create an IAM role that includes the AmazonS3FullAccess policy. Associate the role with the crawler. Specify the S3 bucket path of the source data as the crawler’s data store. Allocate data processing units (DPUs) to run the crawler every day. Specify a database name for the output.

D.

Create an IAM role that includes the AWSGlueServiceRole policy. Associate the role with the crawler. Specify the S3 bucket path of the source data as the crawler’s data store. Allocate data processing units (DPUs) to run the crawler every day. Configure the output destination to a new path in the existing S3 bucket.

Answer

A

Create an IAM role that includes the AWSGlueServiceRole policy. Associate the role with the crawler. Specify the S3 bucket path of the source data as the crawler’s data store. Create a daily schedule to run the crawler. Specify a database name for the output.
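
A sketch of the crawler settings described in the answer, written as the keyword arguments that the boto3 Glue client's `create_crawler` call accepts. The 01:00 UTC schedule is an arbitrary assumption, and all names passed in are placeholders.

```python
def daily_crawler_config(name, role_arn, database, s3_path):
    """Build create_crawler arguments for a crawler that runs once a day."""
    return {
        "Name": name,
        "Role": role_arn,          # role that includes AWSGlueServiceRole
        "DatabaseName": database,  # Data Catalog database for the output tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        # Six-field cron expression: every day at 01:00 UTC (assumed time).
        "Schedule": "cron(0 1 * * ? *)",
    }


if __name__ == "__main__":
    print(daily_crawler_config(
        "portfolio-crawler",
        "arn:aws:iam::111122223333:role/ExampleGlueCrawlerRole",
        "portfolio_db",
        "s3://example-bucket/portfolios/",
    ))
```

The crawler writes table definitions to the named database, not to an S3 output path, which is why the options that mention an output path do not fit.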

Question 36

Q

A company has implemented a lake house architecture in Amazon Redshift. The company needs to give users the ability to authenticate into Redshift query editor by using a third-party identity provider (IdP).

A data engineer must set up the authentication mechanism.

What is the first step the data engineer should take to meet this requirement?

A.

Register the third-party IdP as an identity provider in the configuration settings of the Redshift cluster.

B.

Register the third-party IdP as an identity provider from within Amazon Redshift.

C.

Register the third-party IdP as an identity provider for AWS Secrets Manager. Configure Amazon Redshift to use Secrets Manager to manage user credentials.

D.

Register the third-party IdP as an identity provider for AWS Certificate Manager (ACM). Configure Amazon Redshift to use ACM to manage user credentials.

Answer

A

Register the third-party IdP as an identity provider in the configuration settings of the Redshift cluster.

Question 37

Q

A company currently uses a provisioned Amazon EMR cluster that includes general purpose Amazon EC2 instances. The EMR cluster uses EMR managed scaling between one and five task nodes for the company’s long-running Apache Spark extract, transform, and load (ETL) job. The company runs the ETL job every day.

When the company runs the ETL job, the EMR cluster quickly scales up to five nodes. The EMR cluster often reaches maximum CPU usage, but the memory usage remains under 30%.

The company wants to modify the EMR cluster configuration to reduce the EMR costs to run the daily ETL job.

Which solution will meet these requirements MOST cost-effectively?

A.

Increase the maximum number of task nodes for EMR managed scaling to 10.

B.

Change the task node type from general purpose EC2 instances to memory optimized EC2 instances.

C.

Switch the task node type from general purpose EC2 instances to compute optimized EC2 instances.

D.

Reduce the scaling cooldown period for the provisioned EMR cluster.

Answer

A

Switch the task node type from general purpose EC2 instances to compute optimized EC2 instances.

Question 38

Q

A company uploads .csv files to an Amazon S3 bucket. The company’s data platform team has set up an AWS Glue crawler to perform data discovery and to create the tables and schemas.

An AWS Glue job writes processed data from the tables to an Amazon Redshift database. The AWS Glue job handles column mapping and creates the Amazon Redshift tables in the Redshift database appropriately.

If the company reruns the AWS Glue job for any reason, duplicate records are introduced into the Amazon Redshift tables. The company needs a solution that will update the Redshift tables without duplicates.

Which solution will meet these requirements?

A.

Modify the AWS Glue job to copy the rows into a staging Redshift table. Add SQL commands to update the existing rows with new values from the staging Redshift table.

B.

Modify the AWS Glue job to load the previously inserted data into a MySQL database. Perform an upsert operation in the MySQL database. Copy the results to the Amazon Redshift tables.

C.

Use Apache Spark’s DataFrame dropDuplicates() API to eliminate duplicates. Write the data to the Redshift tables.

D.

Use the AWS Glue ResolveChoice built-in transform to select the value of the column from the most recent record.

Answer

A

Modify the AWS Glue job to copy the rows into a staging Redshift table. Add SQL commands to update the existing rows with new values from the staging Redshift table.
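
One common way to apply the staging-table pattern is to delete the matching target rows and then insert everything from staging inside a single transaction. The sketch below only generates the SQL; the table and key names are placeholders, and joining the statements into the Glue Redshift writer's `postactions` option is an assumption about how the job is wired up.

```python
def staging_merge_sql(target, staging, key):
    """Return SQL statements that merge a staging table into the target.

    Identifiers are interpolated directly, so pass trusted names only.
    """
    return [
        "BEGIN;",
        # Remove target rows that are about to be replaced by staged rows.
        f"DELETE FROM {target} USING {staging} "
        f"WHERE {target}.{key} = {staging}.{key};",
        # Insert the fresh copy of every staged row.
        f"INSERT INTO {target} SELECT * FROM {staging};",
        "END;",
    ]


if __name__ == "__main__":
    print(" ".join(staging_merge_sql("sales", "sales_staging", "order_id")))
```

Because rerunning the job replaces rows with the same key, the target table ends up with one copy of each record.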

Question 39

Q

A company is using Amazon Redshift to build a data warehouse solution. The company is loading hundreds of files into a fact table that is in a Redshift cluster.

The company wants the data warehouse solution to achieve the greatest possible throughput. The solution must use cluster resources optimally when the company loads data into the fact table.

Which solution will meet these requirements?

A.

Use multiple COPY commands to load the data into the Redshift cluster.

B.

Use S3DistCp to load multiple files into Hadoop Distributed File System (HDFS). Use an HDFS connector to ingest the data into the Redshift cluster.

C.

Use a number of INSERT statements equal to the number of Redshift cluster nodes. Load the data in parallel into each node.

D.

Use a single COPY command to load the data into the Redshift cluster.

Answer

A

Use a single COPY command to load the data into the Redshift cluster.
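
A sketch of a single COPY statement that points at a common S3 prefix, so Redshift can split the files across slices by itself. The CSV format, table name, prefix, and role ARN are all assumptions for illustration.

```python
def copy_from_prefix(table, s3_prefix, iam_role_arn):
    """Build one COPY statement that loads every file under an S3 prefix."""
    return (
        f"COPY {table} "
        f"FROM '{s3_prefix}' "
        f"IAM_ROLE '{iam_role_arn}' "
        "FORMAT AS CSV;"  # assumed file format
    )


if __name__ == "__main__":
    print(copy_from_prefix(
        "fact_sales",
        "s3://example-bucket/fact_sales/",
        "arn:aws:iam::111122223333:role/ExampleRedshiftCopyRole",
    ))
```

Issuing several concurrent COPY commands against the same table makes Redshift serialize the loads, which is why one command over many files is preferred.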

Question 40

Q

A company ingests data from multiple data sources and stores the data in an Amazon S3 bucket. An AWS Glue extract, transform, and load (ETL) job transforms the data and writes the transformed data to an Amazon S3 based data lake. The company uses Amazon Athena to query the data that is in the data lake.

The company needs to identify matching records even when the records do not have a common unique identifier.

Which solution will meet this requirement?

A.

Use Amazon Macie pattern matching as part of the ETL job.

B.

Train and use the AWS Glue PySpark Filter class in the ETL job.

C.

Partition tables and use the ETL job to partition the data on a unique identifier.

D.

Train and use the AWS Lake Formation FindMatches transform in the ETL job.

Answer

A

Train and use the AWS Lake Formation FindMatches transform in the ETL job.

Question 41

Q

A data engineer is using an AWS Glue crawler to catalog data that is in an Amazon S3 bucket. The S3 bucket contains both .csv and .json files. The data engineer configured the crawler to exclude the .json files from the catalog.

When the data engineer runs queries in Amazon Athena, the queries also process the excluded .json files. The data engineer wants to resolve this issue. The data engineer needs a solution that will not affect access requirements for the .csv files in the source S3 bucket.

Which solution will meet this requirement with the SHORTEST query times?

A.

Adjust the AWS Glue crawler settings to ensure that the AWS Glue crawler also excludes .json files.

B.

Use the Athena console to ensure the Athena queries also exclude the .json files.

C.

Relocate the .json files to a different path within the S3 bucket.

D.

Use S3 bucket policies to block access to the .json files.

Answer

A

Relocate the .json files to a different path within the S3 bucket.

Question 42

Q

A data engineer set up an AWS Lambda function to read an object that is stored in an Amazon S3 bucket. The object is encrypted by an AWS KMS key.

The data engineer configured the Lambda function’s execution role to access the S3 bucket. However, the Lambda function encountered an error and failed to retrieve the content of the object.

What is the likely cause of the error?

A.

The data engineer misconfigured the permissions of the S3 bucket. The Lambda function could not access the object.

B.

The Lambda function is using an outdated SDK version, which caused the read failure.

C.

The S3 bucket is located in a different AWS Region than the Region where the data engineer works. Latency issues caused the Lambda function to encounter an error.

D.

The Lambda function’s execution role does not have the necessary permissions to access the KMS key that can decrypt the S3 object.

Answer

A

The Lambda function’s execution role does not have the necessary permissions to access the KMS key that can decrypt the S3 object.
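
To make the missing permission concrete, the sketch below builds an IAM policy document that allows `kms:Decrypt` on one key. The key ARN is a placeholder, and the key policy must also permit the role, which this example does not show.

```python
import json


def kms_decrypt_policy(key_arn):
    """Build an IAM policy that lets a role decrypt with a single KMS key."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["kms:Decrypt"],
                "Resource": key_arn,
            }
        ],
    }


if __name__ == "__main__":
    example_key = "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID"
    print(json.dumps(kms_decrypt_policy(example_key), indent=2))
```

Attaching a statement like this to the execution role, alongside the existing S3 read permissions, addresses the failure described in the question.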

Question 43

Q

A data engineer has implemented data quality rules in 1,000 AWS Glue Data Catalog tables. Because of a recent change in business requirements, the data engineer must edit the data quality rules.

How should the data engineer meet this requirement with the LEAST operational overhead?

A.

Create a pipeline in AWS Glue ETL to edit the rules for each of the 1,000 Data Catalog tables. Use an AWS Lambda function to call the corresponding AWS Glue job for each Data Catalog table.

B.

Create an AWS Lambda function that makes an API call to AWS Glue Data Quality to make the edits.

C.

Create an Amazon EMR cluster. Run a pipeline on Amazon EMR that edits the rules for each Data Catalog table. Use an AWS Lambda function to run the EMR pipeline.

D.

Use the AWS Management Console to edit the rules within the Data Catalog.

Answer

A

Create an AWS Lambda function that makes an API call to AWS Glue Data Quality to make the edits.

Question 44

Q

Two developers are working on separate application releases. The developers have created feature branches named Branch A and Branch B by using a GitHub repository’s master branch as the source.

The developer for Branch A deployed code to the production system. The code for Branch B will merge into a master branch in the following week’s scheduled application release.

Which command should the developer for Branch B run before the developer raises a pull request to the master branch?

A.

git diff branchB master git commit -m

B.

git pull master

C.

git rebase master

D.

git fetch -b master

Answer

A

git rebase master

Question 45

Q

A company stores employee data in Amazon Redshift. A table named Employee uses columns named Region ID, Department ID, and Role ID as a compound sort key.

Which queries will MOST increase query speed by using the table’s compound sort key? (Choose two.)

A.

Select * from Employee where Region ID=’North America’;

B.

Select * from Employee where Region ID=’North America’ and Department ID=20;

C.

Select * from Employee where Department ID=20 and Region ID=’North America’;

D.

Select * from Employee where Role ID=50;

E.

Select * from Employee where Region ID=’North America’ and Role ID=50;

Answer

A

Select * from Employee where Region ID=’North America’ and Department ID=20;

Select * from Employee where Department ID=20 and Region ID=’North America’;
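
A compound sort key helps most when the filters cover a leading prefix of the sort columns, and the order of predicates in the WHERE clause does not matter. The helper below is a simplified model of that rule, not Redshift's actual planner; the snake_case column names are stand-ins for the columns in the question.

```python
SORT_KEY = ("region_id", "department_id", "role_id")


def leading_sort_columns(filtered_columns, sort_key=SORT_KEY):
    """Count how many leading sort key columns the query filters on."""
    filtered = set(filtered_columns)
    count = 0
    for column in sort_key:
        if column not in filtered:
            break  # a gap ends the usable prefix
        count += 1
    return count


if __name__ == "__main__":
    queries = {
        "A": ["region_id"],
        "B": ["region_id", "department_id"],
        "C": ["department_id", "region_id"],
        "D": ["role_id"],
        "E": ["region_id", "role_id"],
    }
    for label, columns in queries.items():
        print(label, leading_sort_columns(columns))
```

Under this model, options B and C both cover the first two sort columns, while D covers none and A and E cover only the first.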

Question 46

Q

A company loads transaction data for each day into Amazon Redshift tables at the end of each day. The company wants to have the ability to track which tables have been loaded and which tables still need to be loaded.

A data engineer wants to store the load statuses of Redshift tables in an Amazon DynamoDB table. The data engineer creates an AWS Lambda function to publish the details of the load statuses to DynamoDB.

How should the data engineer invoke the Lambda function to write load statuses to the DynamoDB table?

A.

Use a second Lambda function to invoke the first Lambda function based on Amazon CloudWatch events.

B.

Use the Amazon Redshift Data API to publish an event to Amazon EventBridge. Configure an EventBridge rule to invoke the Lambda function.

C.

Use the Amazon Redshift Data API to publish a message to an Amazon Simple Queue Service (Amazon SQS) queue. Configure the SQS queue to invoke the Lambda function.

D.

Use a second Lambda function to invoke the first Lambda function based on AWS CloudTrail events.

Answer

A

Use the Amazon Redshift Data API to publish an event to Amazon EventBridge. Configure an EventBridge rule to invoke the Lambda function.

Question 47

Q

A company receives test results from testing facilities that are located around the world. The company stores the test results in millions of 1 KB JSON files in an Amazon S3 bucket. A data engineer needs to process the files, convert them into Apache Parquet format, and load them into Amazon Redshift tables. The data engineer uses AWS Glue to process the files, AWS Step Functions to orchestrate the processes, and Amazon EventBridge to schedule jobs.

The company recently added more testing facilities. The time required to process files is increasing. The data engineer must reduce the data processing time.

Which solution will MOST reduce the data processing time?

A.

Use AWS Lambda to group the raw input files into larger files. Write the larger files back to Amazon S3. Use AWS Glue to process the files. Load the files into the Amazon Redshift tables.

B.

Use the AWS Glue dynamic frame file-grouping option to ingest the raw input files. Process the files. Load the files into the Amazon Redshift tables.

C.

Use the Amazon Redshift COPY command to move the raw input files from Amazon S3 directly into the Amazon Redshift tables. Process the files in Amazon Redshift.

D.

Use Amazon EMR instead of AWS Glue to group the raw input files. Process the files in Amazon EMR. Load the files into the Amazon Redshift tables.

Answer

A

Use the AWS Glue dynamic frame file-grouping option to ingest the raw input files. Process the files. Load the files into the Amazon Redshift tables.
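
A sketch of the S3 connection options that turn on file grouping when a Glue job reads many small files. The 128 MB group size and the bucket path are assumptions; the dictionary would be passed as `connection_options` when the job creates a DynamicFrame from S3.

```python
def grouped_s3_options(path, group_size_bytes=128 * 1024 * 1024):
    """Build S3 connection options that group small files into larger reads."""
    return {
        "paths": [path],
        "recurse": True,
        # Group files that belong to the same S3 partition.
        "groupFiles": "inPartition",
        # Target group size in bytes, passed as a string.
        "groupSize": str(group_size_bytes),
    }


if __name__ == "__main__":
    print(grouped_s3_options("s3://example-bucket/test-results/"))
```

Grouping lets each Spark task read many 1 KB files at once, without a separate step that rewrites the files in Amazon S3.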

Question 48

Q

A data engineer uses Amazon Managed Workflows for Apache Airflow (Amazon MWAA) to run data pipelines in an AWS account.

A workflow recently failed to run. The data engineer needs to use Apache Airflow logs to diagnose the failure of the workflow.

Which log type should the data engineer use to diagnose the cause of the failure?

A.

YourEnvironmentName-WebServer

B.

YourEnvironmentName-Scheduler

C.

YourEnvironmentName-DAGProcessing

D.

YourEnvironmentName-Task

Answer

A

YourEnvironmentName-Task

Question 49

Q

A data engineer needs to securely transfer 5 TB of data from an on-premises data center to an Amazon S3 bucket. Approximately 5% of the data changes every day. Updates to the data need to be regularly propagated to the S3 bucket. The data includes files that are in multiple formats. The data engineer needs to automate the transfer process and must schedule the process to run periodically.

Which AWS service should the data engineer use to transfer the data in the MOST operationally efficient way?

A.

AWS DataSync

B.

AWS Glue

C.

AWS Direct Connect

D.

Amazon S3 Transfer Acceleration

Answer

A

AWS DataSync

Question 50

Q

A company uses an on-premises Microsoft SQL Server database to store financial transaction data. The company migrates the transaction data from the on-premises database to AWS at the end of each month. The company has noticed that the cost to migrate data from the on-premises database to an Amazon RDS for SQL Server database has increased recently.

The company requires a cost-effective solution to migrate the data to AWS. The solution must cause minimal downtime for the applications that access the database.

Which AWS service should the company use to meet these requirements?

A.

AWS Lambda

B.

AWS Database Migration Service (AWS DMS)

C.

AWS Direct Connect

D.

AWS DataSync

Answer

A

AWS Database Migration Service (AWS DMS)

Question 51

Q

A data engineer is building a data pipeline on AWS by using AWS Glue extract, transform, and load (ETL) jobs. The data engineer needs to process data from Amazon RDS and MongoDB, perform transformations, and load the transformed data into Amazon Redshift for analytics. The data updates must occur every hour.

Which combination of tasks will meet these requirements with the LEAST operational overhead? (Choose two.)

A.

Configure AWS Glue triggers to run the ETL jobs every hour.

B.

Use AWS Glue DataBrew to clean and prepare the data for analytics.

C.

Use AWS Lambda functions to schedule and run the ETL jobs every hour.

D.

Use AWS Glue connections to establish connectivity between the data sources and Amazon Redshift.

E.

Use the Redshift Data API to load transformed data into Amazon Redshift.

Answer

A

Configure AWS Glue triggers to run the ETL jobs every hour.

Use AWS Glue connections to establish connectivity between the data sources and Amazon Redshift.
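
For the scheduling half of the answer, the sketch below builds the arguments for a scheduled Glue trigger, in the shape that the boto3 Glue client's `create_trigger` call accepts. The trigger and job names are placeholders.

```python
def hourly_trigger_config(trigger_name, job_name):
    """Build create_trigger arguments that start a Glue job every hour."""
    return {
        "Name": trigger_name,
        "Type": "SCHEDULED",
        # Six-field cron expression: minute 0 of every hour (UTC).
        "Schedule": "cron(0 * * * ? *)",
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


if __name__ == "__main__":
    print(hourly_trigger_config("hourly-etl", "rds-mongodb-to-redshift"))
```

The trigger lives inside AWS Glue, so no Lambda function or external scheduler has to be maintained.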

Question 52

Q

A company uses an Amazon Redshift cluster that runs on RA3 nodes. The company wants to scale read and write capacity to meet demand. A data engineer needs to identify a solution that will turn on concurrency scaling.

Which solution will meet this requirement?

A.

Turn on concurrency scaling in workload management (WLM) for Redshift Serverless workgroups.

B.

Turn on concurrency scaling at the workload management (WLM) queue level in the Redshift cluster.

C.

Turn on concurrency scaling in the settings during the creation of any new Redshift cluster.

D.

Turn on concurrency scaling for the daily usage quota for the Redshift cluster.

Answer

A

Turn on concurrency scaling at the workload management (WLM) queue level in the Redshift cluster.

Question 53

Q

A data engineer must orchestrate a series of Amazon Athena queries that will run every day. Each query can run for more than 15 minutes.

Which combination of steps will meet these requirements MOST cost-effectively? (Choose two.)

A.

Use an AWS Lambda function and the Athena Boto3 client start_query_execution API call to invoke the Athena queries programmatically.

B.

Create an AWS Step Functions workflow and add two states. Add the first state before the Lambda function. Configure the second state as a Wait state to periodically check whether the Athena query has finished using the Athena Boto3 get_query_execution API call. Configure the workflow to invoke the next query when the current query has finished running.

C.

Use an AWS Glue Python shell job and the Athena Boto3 client start_query_execution API call to invoke the Athena queries programmatically.

D.

Use an AWS Glue Python shell script to run a sleep timer that checks every 5 minutes to determine whether the current Athena query has finished running successfully. Configure the Python shell script to invoke the next query when the current query has finished running.

E.

Use Amazon Managed Workflows for Apache Airflow (Amazon MWAA) to orchestrate the Athena queries in AWS Batch.

Answer

A

Use an AWS Lambda function and the Athena Boto3 client start_query_execution API call to invoke the Athena queries programmatically.

Create an AWS Step Functions workflow and add two states. Add the first state before the Lambda function. Configure the second state as a Wait state to periodically check whether the Athena query has finished using the Athena Boto3 get_query_execution API call. Configure the workflow to invoke the next query when the current query has finished running.
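
A sketch of the polling loop as a Step Functions definition, built as a Python dictionary. The two Lambda ARNs are hypothetical, and the example assumes that the status-check function returns a field named `state` that holds the Athena query state.

```python
import json


def athena_polling_workflow(start_query_arn, check_status_arn, wait_seconds=60):
    """Build a state machine that starts a query and polls until it finishes."""
    return {
        "StartAt": "StartQuery",
        "States": {
            "StartQuery": {"Type": "Task", "Resource": start_query_arn,
                           "Next": "WaitForQuery"},
            "WaitForQuery": {"Type": "Wait", "Seconds": wait_seconds,
                             "Next": "CheckStatus"},
            "CheckStatus": {"Type": "Task", "Resource": check_status_arn,
                            "Next": "IsFinished"},
            "IsFinished": {
                "Type": "Choice",
                "Choices": [
                    {"Variable": "$.state", "StringEquals": "SUCCEEDED", "Next": "Done"},
                    {"Variable": "$.state", "StringEquals": "FAILED", "Next": "QueryFailed"},
                    {"Variable": "$.state", "StringEquals": "CANCELLED", "Next": "QueryFailed"},
                ],
                # Any other state (queued or running) loops back to the wait.
                "Default": "WaitForQuery",
            },
            "QueryFailed": {"Type": "Fail", "Error": "AthenaQueryFailed"},
            "Done": {"Type": "Succeed"},
        },
    }


if __name__ == "__main__":
    definition = athena_polling_workflow(
        "arn:aws:lambda:us-east-1:111122223333:function:example-start-query",
        "arn:aws:lambda:us-east-1:111122223333:function:example-check-status",
    )
    print(json.dumps(definition, indent=2))
```

Each Lambda invocation returns quickly, so the 15-minute Lambda limit never applies to the query itself; the Wait state carries the elapsed time.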

Question 54

Q

A retail company has a customer data hub in an Amazon S3 bucket. Employees from many countries use the data hub to support company-wide analytics. A governance team must ensure that the company’s data analysts can access data only for customers who are within the same country as the analysts.

Which solution will meet these requirements with the LEAST operational effort?

A.

Create a separate table for each country’s customer data. Provide access to each analyst based on the country that the analyst serves.

B.

Register the S3 bucket as a data lake location in AWS Lake Formation. Use the Lake Formation row-level security features to enforce the company’s access policies.

C.

Move the data to AWS Regions that are close to the countries where the customers are. Provide access to each analyst based on the country that the analyst serves.

D.

Load the data into Amazon Redshift. Create a view for each country. Create separate IAM roles for each country to provide access to data from each country. Assign the appropriate roles to the analysts.

Answer

A

Register the S3 bucket as a data lake location in AWS Lake Formation. Use the Lake Formation row-level security features to enforce the company’s access policies.

Question 55

Q

A company is migrating on-premises workloads to AWS. The company wants to reduce overall operational overhead. The company also wants to explore serverless options.

The company’s current workloads use Apache Pig, Apache Oozie, Apache Spark, Apache HBase, and Apache Flink. The on-premises workloads process petabytes of data in seconds. The company must maintain similar or better performance after the migration to AWS.

Which extract, transform, and load (ETL) service will meet these requirements?

A.

AWS Glue

B.

Amazon EMR

C.

AWS Lambda

D.

Amazon Redshift

Answer

A

Amazon EMR

Question 56

Q

A data engineer must use AWS services to ingest a dataset into an Amazon S3 data lake. The data engineer profiles the dataset and discovers that the dataset contains personally identifiable information (PII). The data engineer must implement a solution to profile the dataset and obfuscate the PII.

Which solution will meet this requirement with the LEAST operational effort?

A.

Use an Amazon Kinesis Data Firehose delivery stream to process the dataset. Create an AWS Lambda transform function to identify the PII. Use an AWS SDK to obfuscate the PII. Set the S3 data lake as the target for the delivery stream.

B.

Use the Detect PII transform in AWS Glue Studio to identify the PII. Obfuscate the PII. Use an AWS Step Functions state machine to orchestrate a data pipeline to ingest the data into the S3 data lake.

C.

Use the Detect PII transform in AWS Glue Studio to identify the PII. Create a rule in AWS Glue Data Quality to obfuscate the PII. Use an AWS Step Functions state machine to orchestrate a data pipeline to ingest the data into the S3 data lake.

D.

Ingest the dataset into Amazon DynamoDB. Create an AWS Lambda function to identify and obfuscate the PII in the DynamoDB table and to transform the data. Use the same Lambda function to ingest the data into the S3 data lake.

Answer

A

Use the Detect PII transform in AWS Glue Studio to identify the PII. Obfuscate the PII. Use an AWS Step Functions state machine to orchestrate a data pipeline to ingest the data into the S3 data lake.

Question 57

Q

A company maintains multiple extract, transform, and load (ETL) workflows that ingest data from the company’s operational databases into an Amazon S3 based data lake. The ETL workflows use AWS Glue and Amazon EMR to process data.

The company wants to improve the existing architecture to provide automated orchestration and to require minimal manual effort.

Which solution will meet these requirements with the LEAST operational overhead?

A.

AWS Glue workflows

B.

AWS Step Functions tasks

C.

AWS Lambda functions

D.

Amazon Managed Workflows for Apache Airflow (Amazon MWAA) workflows

Answer

A

AWS Step Functions tasks

Question 58

Q

A company currently stores all of its data in Amazon S3 by using the S3 Standard storage class.

A data engineer examined data access patterns to identify trends. During the first 6 months, most data files are accessed several times each day. Between 6 months and 2 years, most data files are accessed once or twice each month. After 2 years, data files are accessed only once or twice each year.

The data engineer needs to use an S3 Lifecycle policy to develop new data storage rules. The new storage solution must continue to provide high availability.

Which solution will meet these requirements in the MOST cost-effective way?

A.

Transition objects to S3 One Zone-Infrequent Access (S3 One Zone-IA) after 6 months. Transfer objects to S3 Glacier Flexible Retrieval after 2 years.

B.

Transition objects to S3 Standard-Infrequent Access (S3 Standard-IA) after 6 months. Transfer objects to S3 Glacier Flexible Retrieval after 2 years.

C.

Transition objects to S3 Standard-Infrequent Access (S3 Standard-IA) after 6 months. Transfer objects to S3 Glacier Deep Archive after 2 years.

D.

Transition objects to S3 One Zone-Infrequent Access (S3 One Zone-IA) after 6 months. Transfer objects to S3 Glacier Deep Archive after 2 years.

Answer

A

Transition objects to S3 Standard-Infrequent Access (S3 Standard-IA) after 6 months. Transfer objects to S3 Glacier Flexible Retrieval after 2 years.
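
A sketch of the lifecycle rule described in the answer, in the shape that the boto3 S3 client's `put_bucket_lifecycle_configuration` call accepts. The rule ID is a placeholder, and 180 and 730 days are approximations of 6 months and 2 years.

```python
def tiering_lifecycle_configuration():
    """Build a lifecycle configuration: Standard-IA at ~6 months, Glacier at ~2 years."""
    return {
        "Rules": [
            {
                "ID": "tier-by-age",       # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # apply to every object in the bucket
                "Transitions": [
                    {"Days": 180, "StorageClass": "STANDARD_IA"},
                    # "GLACIER" is the API value for S3 Glacier Flexible Retrieval.
                    {"Days": 730, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


if __name__ == "__main__":
    print(tiering_lifecycle_configuration())
```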

Question 59

Q

A company maintains an Amazon Redshift provisioned cluster that the company uses for extract, transform, and load (ETL) operations to support critical analysis tasks. A sales team within the company maintains a Redshift cluster that the sales team uses for business intelligence (BI) tasks.

The sales team recently requested access to the data that is in the ETL Redshift cluster so the team can perform weekly summary analysis tasks. The sales team needs to join data from the ETL cluster with data that is in the sales team’s BI cluster.

The company needs a solution that will share the ETL cluster data with the sales team without interrupting the critical analysis tasks. The solution must minimize usage of the computing resources of the ETL cluster.

Which solution will meet these requirements?

A.

Set up the sales team BI cluster as a consumer of the ETL cluster by using Redshift data sharing.

B.

Create materialized views based on the sales team’s requirements. Grant the sales team direct access to the ETL cluster.

C.

Create database views based on the sales team’s requirements. Grant the sales team direct access to the ETL cluster.

D.

Unload a copy of the data from the ETL cluster to an Amazon S3 bucket every week. Create an Amazon Redshift Spectrum table based on the content of the ETL cluster.

Answer

A

Set up the sales team BI cluster as a consumer of the ETL cluster by using Redshift data sharing.

Question 60

Q

A data engineer needs to join data from multiple sources to perform a one-time analysis job. The data is stored in Amazon DynamoDB, Amazon RDS, Amazon Redshift, and Amazon S3.

Which solution will meet this requirement MOST cost-effectively?

A.

Use an Amazon EMR provisioned cluster to read from all sources. Use Apache Spark to join the data and perform the analysis.

B.

Copy the data from DynamoDB, Amazon RDS, and Amazon Redshift into Amazon S3. Run Amazon Athena queries directly on the S3 files.

C.

Use Amazon Athena Federated Query to join the data from all data sources.

D.

Use Redshift Spectrum to query data from DynamoDB, Amazon RDS, and Amazon S3 directly from Redshift.

Answer

A

Use Amazon Athena Federated Query to join the data from all data sources.

Question 61

Q

A company is planning to use a provisioned Amazon EMR cluster that runs Apache Spark jobs to perform big data analysis. The company requires high reliability. A big data team must follow best practices for running cost-optimized and long-running workloads on Amazon EMR. The team must find a solution that will maintain the company’s current level of performance.

Which combination of resources will meet these requirements MOST cost-effectively? (Choose two.)

A.

Use Hadoop Distributed File System (HDFS) as a persistent data store.

B.

Use Amazon S3 as a persistent data store.

C.

Use x86-based instances for core nodes and task nodes.

D.

Use Graviton instances for core nodes and task nodes.

E.

Use Spot Instances for all primary nodes.

Answer

A

Use Amazon S3 as a persistent data store.

Use Graviton instances for core nodes and task nodes.

Question 62

Q

A company wants to implement real-time analytics capabilities. The company wants to use Amazon Kinesis Data Streams and Amazon Redshift to ingest and process streaming data at the rate of several gigabytes per second. The company wants to derive near real-time insights by using existing business intelligence (BI) and analytics tools.

Which solution will meet these requirements with the LEAST operational overhead?

A.

Use Kinesis Data Streams to stage data in Amazon S3. Use the COPY command to load data from Amazon S3 directly into Amazon Redshift to make the data immediately available for real-time analysis.

B.

Access the data from Kinesis Data Streams by using SQL queries. Create materialized views directly on top of the stream. Refresh the materialized views regularly to query the most recent stream data.

C.

Create an external schema in Amazon Redshift to map the data from Kinesis Data Streams to an Amazon Redshift object. Create a materialized view to read data from the stream. Set the materialized view to auto refresh.

D.

Connect Kinesis Data Streams to Amazon Kinesis Data Firehose. Use Kinesis Data Firehose to stage the data in Amazon S3. Use the COPY command to load the data from Amazon S3 to a table in Amazon Redshift.

Answer

A

Connect Kinesis Data Streams to Amazon Kinesis Data Firehose. Use Kinesis Data Firehose to stage the data in Amazon S3. Use the COPY command to load the data from Amazon S3 to a table in Amazon Redshift.

Question 63

Q

A company uses an Amazon QuickSight dashboard to monitor usage of one of the company’s applications. The company uses AWS Glue jobs to process data for the dashboard. The company stores the data in a single Amazon S3 bucket. The company adds new data every day.

A data engineer discovers that dashboard queries are becoming slower over time. The data engineer determines that the root cause of the slowing queries is long-running AWS Glue jobs.

Which actions should the data engineer take to improve the performance of the AWS Glue jobs? (Choose two.)

A.

Partition the data that is in the S3 bucket. Organize the data by year, month, and day.

B.

Increase the AWS Glue instance size by scaling up the worker type.

C.

Convert the AWS Glue schema to the DynamicFrame schema class.

D.

Adjust AWS Glue job scheduling frequency so the jobs run half as many times each day.

E.

Modify the IAM role that grants access to AWS Glue to grant access to all S3 features.

Answer

A

Partition the data that is in the S3 bucket. Organize the data by year, month, and day.

Increase the AWS Glue instance size by scaling up the worker type.
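As a rough illustration of the year, month, and day layout in the first action, the sketch below builds a Hive-style partition prefix for a given date. The function name and the base prefix `usage-data` are hypothetical; the question does not describe the bucket layout.

```python
from datetime import date


def partition_prefix(base_prefix: str, day: date) -> str:
    """Return a Hive-style year/month/day prefix for objects written on `day`."""
    return (
        f"{base_prefix.rstrip('/')}/"
        f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"
    )


# Hypothetical base prefix, used only for illustration.
print(partition_prefix("usage-data", date(2024, 3, 7)))
```

With a layout like this, a job that only needs recent data can read a few prefixes instead of scanning the whole bucket.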

Question 64

Q

A data engineer needs to use AWS Step Functions to design an orchestration workflow. The workflow must process a large collection of data files in parallel and apply a specific transformation to each file.

Which Step Functions state should the data engineer use to meet these requirements?

A.

Parallel state

B.

Choice state

C.

Map state

D.

Wait state

Answer

A

Map state
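A minimal sketch of what a Map state can look like in an Amazon States Language definition, built here as a Python dictionary. The state names, the `$.files` input path, and the Lambda function name are assumptions for illustration; the question does not specify them. The sketch uses inline mode. For very large collections, a distributed Map state may be the better fit.

```python
import json


def build_map_state_machine(transform_function_name: str, max_concurrency: int = 10) -> dict:
    """Build a state machine definition that applies one task to every input file."""
    return {
        "Comment": "Apply the same transformation to each file in the input list",
        "StartAt": "TransformEachFile",
        "States": {
            "TransformEachFile": {
                "Type": "Map",
                # Assumed input shape: {"files": [{"key": "..."}, ...]}
                "ItemsPath": "$.files",
                "MaxConcurrency": max_concurrency,
                "ItemProcessor": {
                    "ProcessorConfig": {"Mode": "INLINE"},
                    "StartAt": "TransformFile",
                    "States": {
                        "TransformFile": {
                            "Type": "Task",
                            "Resource": "arn:aws:states:::lambda:invoke",
                            "Parameters": {
                                # Hypothetical Lambda function that transforms one file.
                                "FunctionName": transform_function_name,
                                "Payload.$": "$",
                            },
                            "End": True,
                        }
                    },
                },
                "End": True,
            }
        },
    }


definition = build_map_state_machine("transform-file")
print(json.dumps(definition, indent=2))
```

A Parallel state, by contrast, runs a fixed set of different branches, which is why it does not fit a per-file workload of unknown size.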

Answer 65

A

Use API calls to access and integrate third-party datasets from AWS Data Exchange.

Answer 66

A

Write an AWS Glue extract, transform, and load (ETL) job. Use the FindMatches machine learning (ML) transform to transform the data to perform data deduplication.

Answer 67

A

Use a columnar storage file format.

Partition the data based on the most common query predicates.

Answer 68

A

Configure the Lambda function to run in the same subnet that the DB instance uses.

Attach the same security group to the Lambda function and the DB instance. Include a self-referencing rule that allows access through the database port.

Answer 69

A

Create an AWS Lambda Python function with provisioned concurrency.

Answer 70

A

Create a destination data stream in the security AWS account. Create an IAM role and a trust policy to grant CloudWatch Logs the permission to put data into the stream. Create a subscription filter in the production AWS account.

Answer 71

A

Use an open source data lake format to merge the data source with the S3 data lake to insert the new data and update the existing data.

Answer 72

A

Create an AWS Glue partition index. Enable partition filtering.

Use Athena partition projection based on the S3 bucket prefix.

Answer 73

A

Use Amazon Managed Service for Apache Flink (previously known as Amazon Kinesis Data Analytics) to analyze the data by using multiple types of aggregations to perform time-based analytics over a window of up to 30 minutes.

Answer 74

A

Change the volume type of the existing gp2 volumes to gp3. Enter new values for volume size, IOPS, and throughput.

Answer 75

A

Use a SQL query to create a view in the EC2 instance-based SQL Server databases that contains the required data elements. Create and run an AWS Glue crawler to read the view. Create an AWS Glue job that retrieves the data and transfers the data in Parquet format to an S3 bucket. Schedule the AWS Glue job to run every day.

Answer 76

A

Use Amazon S3 for data storage. Use Amazon Athena for data analysis.

Use AWS Lake Formation for centralized data governance and access control.

Answer 77

A

STL_ALERT_EVENT_LOG

Answer 78

A

Create an AWS Glue extract, transform, and load (ETL) job to read from the .csv structured data source. Configure the job to write the data into the data lake in Apache Parquet format.

Answer 79

A

Register the S3 path as an AWS Lake Formation location.

Enable fine-grained access control in AWS Lake Formation. Add a data filter for each Region.

Answer 80

A

Verify that the Step Functions state machine code has all IAM permissions that are necessary to create and run the EMR jobs. Verify that the Step Functions state machine code also includes IAM permissions to access the Amazon S3 buckets that the EMR jobs use. Use Access Analyzer for S3 to check the S3 access properties.

Query the flow logs for the VPC. Determine whether the traffic that originates from the EMR cluster can successfully reach the data providers. Determine whether any security group that might be attached to the Amazon EMR cluster allows connections to the data source servers on the informed ports.

Answer 81

A

Launch new EC2 instances by using an AMI that is backed by an EC2 instance store volume. Attach an Amazon Elastic Block Store (Amazon EBS) volume to contain the application data. Apply the default settings to the EC2 instances.

Answer 82

A

Athena workgroup

Answer 83

A

Use code that writes data to Amazon S3 to invoke the Boto3 AWS Glue create_partition API call.
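One way the writer code might register a partition is sketched below. The helper only builds the request; the database, table, and bucket names are hypothetical. In real code the `StorageDescriptor` would usually be copied from the table definition (columns, formats, SerDe) with only `Location` changed.

```python
def build_create_partition_request(database: str, table: str,
                                   table_location: str, partition_values: dict) -> dict:
    """Build keyword arguments for the AWS Glue create_partition call.

    `partition_values` maps partition column names to values, in the same
    order as the table's partition keys.
    """
    suffix = "/".join(f"{name}={value}" for name, value in partition_values.items())
    return {
        "DatabaseName": database,
        "TableName": table,
        "PartitionInput": {
            "Values": list(partition_values.values()),
            # Simplified: a real request would also carry the table's
            # columns, input/output formats, and SerDe information.
            "StorageDescriptor": {
                "Location": f"{table_location.rstrip('/')}/{suffix}/",
            },
        },
    }


request = build_create_partition_request(
    "sales_db", "orders", "s3://example-bucket/orders/",
    {"year": "2024", "month": "03", "day": "07"},
)
print(request["PartitionInput"]["StorageDescriptor"]["Location"])

# With Boto3 available, the request would be sent right after the S3 write:
#   glue_client = boto3.client("glue")
#   glue_client.create_partition(**request)
```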

Answer 84

A

Amazon AppFlow

Answer 85

A

Change WHERE year = 2023 to WHERE extract(year FROM sales_data) = 2023.

Answer 86

A

Use S3 Select to write a SQL SELECT statement to retrieve the required column from the S3 objects.

Answer 87

A

Package the custom Python scripts into Lambda layers. Apply the Lambda layers to the Lambda functions.

Answer 88

A

Use the query editor v2 in Amazon Redshift to refresh the materialized views.

Answer 89

A

Use an AWS Glue workflow to run the Lambda function and then the AWS Glue job.

Answer 90

A

Use the AWS Glue Data Catalog as the central metadata repository. Use AWS Glue crawlers to connect to multiple data stores and to update the Data Catalog with metadata changes. Schedule the crawlers to run periodically to update the metadata catalog.

Answer 91

A

Use AWS Application Auto Scaling to schedule higher provisioned capacity for peak usage times. Schedule lower capacity during off-peak times.

Answer 92

A

Configure a Hive metastore in Amazon EMR. Migrate the existing on-premises Hive metastore into Amazon EMR. Use AWS Glue Data Catalog to store the company’s data catalog as an external data catalog.

Answer 93

A

Change the distribution key to the table column that has the largest dimension.

Answer 94

A

Create an AWS Glue Data Catalog. Configure an AWS Glue Schema Registry. Create a new AWS Glue workload to orchestrate the ingestion of the data that the analytics department will use into Amazon Redshift Serverless.

Answer 95

A

Create a trail of data events in AWS CloudTrail. Configure the trail to receive data from the transactions S3 bucket. Specify an empty prefix and write-only events. Specify the logs S3 bucket as the destination bucket.

Answer 96

A

Use the AWS Glue Data Catalog.

Answer 97

A

Use Amazon S3 for data lake storage. Use AWS Lake Formation to restrict data access by rows and columns. Provide data access through AWS Lake Formation.

Answer 98

A

AWS Glue workflows

Answer 99

A

Use an S3 bucket that is in the same AWS Region where the company runs Athena queries.

Preprocess the .csv data to Apache Parquet format by fetching only the data blocks that are needed for predicates.

Answer 100

A

Use the Performance Insights feature of Amazon RDS to identify queries that have high CPU utilization. Optimize the problematic queries.

Upgrade to a larger instance size.

Answer 101

A

VACUUM REINDEX Orders

Answer 102

A

Use Amazon Kinesis Data Streams to capture the sensor data. Store the data in Amazon DynamoDB for querying.

Answer 103

A

Use Amazon Athena to query the data. Set up AWS Lake Formation and create data filters to establish levels of access for the company’s IAM roles. Assign each user to the IAM role that matches the user’s PII access requirements.

Answer 104

A

Use an Amazon EventBridge rule to invoke an AWS Glue workflow job every 15 minutes. Configure the AWS Glue workflow to have an on-demand trigger that runs an AWS Glue crawler and then runs an AWS Glue job when the crawler finishes running successfully. Configure the AWS Glue job to process and load the data into the Amazon Redshift tables.

Configure an AWS Lambda function to invoke an AWS Glue workflow when a file is loaded into the S3 bucket. Configure the AWS Glue workflow to have an on-demand trigger that runs an AWS Glue crawler and then runs an AWS Glue job when the crawler finishes running successfully. Configure the AWS Glue job to process and load the data into the Amazon Redshift tables.

Answer 105

A

Use the query result reuse feature of Amazon Athena for the SQL queries.

Answer 106

A

Use the ALL distribution style for rarely updated small tables. Specify primary and foreign keys for all tables.

Answer 107

A

Use AWS Glue DataBrew to read the files. Use the NEST_TO_MAP transformation to create the new column.

Answer 108

A

Use server-side encryption with AWS KMS keys (SSE-KMS) to encrypt the objects that contain customer information. Configure an IAM policy that restricts access to the KMS keys that encrypt the objects.

Answer 109

A

Use the Amazon Redshift Data API.

Answer 110

A

Use S3 Intelligent-Tiering. Use the default access tier.

Answer 111

A

Store the credentials in AWS Secrets Manager.

Grant the AWS Glue job IAM role access to the stored credentials.
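A hedged sketch of how job code could read such credentials at run time. The secret name and the JSON fields are hypothetical. A small fake client stands in for `boto3.client("secretsmanager")` so the example runs without AWS access.

```python
import json


def load_db_credentials(secrets_client, secret_id: str) -> dict:
    """Fetch a secret and parse its JSON payload into a dictionary."""
    response = secrets_client.get_secret_value(SecretId=secret_id)
    return json.loads(response["SecretString"])


class FakeSecretsClient:
    """Offline stand-in for the Secrets Manager client (illustration only)."""

    def get_secret_value(self, SecretId):
        return {
            "SecretString": json.dumps(
                {"username": "etl_user", "password": "example-only"}
            )
        }


# Hypothetical secret name; a real job would pass boto3.client("secretsmanager").
creds = load_db_credentials(FakeSecretsClient(), "example/redshift/etl")
print(creds["username"])
```

The job's IAM role would need permission to read that secret, which is what the second part of the answer covers.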

Answer 112

A

Use Amazon Redshift Serverless to automatically process the analytics workload.

Answer 113

A

Use AWS Glue DataBrew to create a recipe that uses the COUNT_DISTINCT aggregate function to calculate the number of distinct customers.

Answer 114

A

Use the streaming ingestion feature of Amazon Redshift.
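For reference, streaming ingestion from Kinesis Data Streams is typically set up with two SQL statements: an external schema that points at Kinesis, and an auto-refreshing materialized view over a stream. The sketch below only renders those statements as text. The schema, stream, view, and role names are placeholders, and the `payload` column alias is an assumption.

```python
from typing import List


def streaming_ingestion_sql(schema: str, stream_name: str,
                            view_name: str, iam_role_arn: str) -> List[str]:
    """Render the two statements for Redshift streaming ingestion from Kinesis."""
    create_schema = (
        f"CREATE EXTERNAL SCHEMA {schema}\n"
        f"FROM KINESIS\n"
        f"IAM_ROLE '{iam_role_arn}';"
    )
    create_view = (
        f"CREATE MATERIALIZED VIEW {view_name} AUTO REFRESH YES AS\n"
        f"SELECT approximate_arrival_timestamp,\n"
        f"       JSON_PARSE(kinesis_data) AS payload\n"
        f'FROM {schema}."{stream_name}";'
    )
    return [create_schema, create_view]


# All names below are placeholders.
for statement in streaming_ingestion_sql(
    "kds", "example-stream", "stream_mv",
    "arn:aws:iam::123456789012:role/example-redshift-streaming-role",
):
    print(statement)
    print()
```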

Answer 115

A

QuickSight does not have access to the S3 bucket.

QuickSight does not have access to decrypt S3 data.

Answer 116

A

Use AWS Glue to crawl the data sources. Store metadata in the AWS Glue Data Catalog. Use Amazon Athena to query the data. Use SQL for structured data sources. Use PartiQL for data that is stored in JSON format.

Answer 117

A

Add a policy to the data engineer’s IAM user that includes the sts:AssumeRole action for the AWS Glue and SageMaker service principals in the trust policy.

Answer 118

A

Use AWS Glue to detect the schema and to extract, transform, and load the data into the S3 bucket. Create a pipeline in Apache Spark.

Answer 119

A

Create an S3 Object Lambda endpoint. Use the S3 Object Lambda endpoint to read data from the S3 bucket. Implement redaction logic within an S3 Object Lambda function to dynamically redact PII based on the needs of each application that accesses the data.

Answer 120

A

Create an Athena workgroup for each use case. Apply tags to the workgroup. Create an IAM policy that uses the tags to apply appropriate permissions to the workgroup.

Answer 121

A

Write an AWS Glue Python shell job. Use pandas to transform the data.

Answer 122

A

ALTER TABLE Orders ADD PARTITION(order_date='2023-01-01') LOCATION 's3://transactions/orders/order_date=2023-01-01'; ALTER TABLE Orders ADD PARTITION(order_date='2023-01-02') LOCATION 's3://transactions/orders/order_date=2023-01-02';
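When more than a couple of dates are involved, statements of this shape can be generated rather than typed. The helper below is a sketch; its name and signature are made up for illustration, and it only produces the SQL text.

```python
from datetime import date, timedelta


def add_partition_statements(table: str, base_location: str,
                             first_day: date, last_day: date) -> list:
    """Return one ALTER TABLE ... ADD PARTITION statement per day, inclusive."""
    statements = []
    day = first_day
    while day <= last_day:
        value = day.isoformat()
        statements.append(
            f"ALTER TABLE {table} ADD PARTITION (order_date='{value}') "
            f"LOCATION '{base_location.rstrip('/')}/order_date={value}';"
        )
        day += timedelta(days=1)
    return statements


for statement in add_partition_statements(
    "Orders", "s3://transactions/orders", date(2023, 1, 1), date(2023, 1, 2)
):
    print(statement)
```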

Answer 123

A

Apache Parquet format compressed with Snappy

Answer 124

A

Migrate the existing Airflow orchestration configuration into Amazon Managed Workflows for Apache Airflow (Amazon MWAA). Create the data quality checks during the ingestion to validate the data quality by using SQL tasks in Airflow.

Answer 125

A

AWS Glue Workflows

Answer 126

A

Create an AWS Glue crawler that includes a classifier that determines the schema of all ALB access logs and writes the partition metadata to AWS Glue Data Catalog.

Answer 127

A

Set up an Amazon EventBridge event that initiates the AWS Glue workflow after every successful S3 File Gateway file transfer event.

Answer 128

A

Configure the Amazon Redshift Federated Query feature to query live transactional data that is in the PostgreSQL database.

Schedule a monthly job to copy data that is older than 15 months to Amazon S3 by using the UNLOAD command. Delete the old data from the Redshift cluster. Configure Amazon Redshift Spectrum to access historical data in Amazon S3.

Answer 129

A

Change the partition key from facility ID to a randomly generated key.
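The sketch below suggests why a random partition key helps, assuming the underlying problem is a hot shard. Kinesis Data Streams maps each partition key to a 128-bit integer with MD5 and routes the record to the shard whose hash key range contains it. The example simplifies this by assuming the shards split the hash space evenly. The facility ID value and shard count are hypothetical.

```python
import hashlib
import uuid
from collections import Counter


def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Map a partition key to a shard index, assuming evenly split hash ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")  # 128-bit integer
    return hash_value * shard_count // 2**128


def shard_histogram(keys, shard_count: int) -> Counter:
    """Count how many records land on each shard."""
    return Counter(shard_for_key(key, shard_count) for key in keys)


# One busy facility: every record carries the same key and hits one shard.
hot = shard_histogram(["facility-42"] * 1000, shard_count=4)

# Random keys: records spread across the available shards.
spread = shard_histogram([str(uuid.uuid4()) for _ in range(1000)], shard_count=4)

print("shards used with facility ID key:", len(hot))
print("shards used with random keys:", len(spread))
```

The trade-off is that records from the same facility are no longer guaranteed to land on the same shard, so per-facility ordering is lost.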

Answer 130

A

EXPLAIN ANALYZE SELECT * FROM sales;

Answer 131

A

Choose the FLEX execution class in the Glue job properties.

Answer 132

A

Create an Amazon Kinesis Data Firehose delivery stream to use Splunk as the destination. Create a CloudWatch Logs subscription filter to send log events to the delivery stream.

Answer 133

A

Set up AWS Lake Formation. Define security policy-based rules for the users and applications by IAM role in Lake Formation.

Answer 134

A

Enable job bookmarks for the ETL jobs to update the state after a run to keep track of previously processed data.

Answer 135

A

Publish flow logs to Amazon S3 in text format. Use Amazon Athena for analytics.

Answer 136

A

Change the distribution style of the store location table from EVEN distribution to ALL distribution.

Answer 137

A

Select * from Sales where city_name ~ '^(San|El)*';

Answer 138

A

Use Amazon CloudWatch to monitor the DMS task. Examine the CDCLatencySource metric to identify delays in the CDC from the source database.

Answer 139

A

Use Amazon Kinesis Data Streams and call the Kinesis Client Library to deliver the data to the S3 bucket. Use a 5 second buffer interval from an application.

Answer 140

A

For daily incoming data, use AWS Glue crawlers to scan and identify the schema.

For daily and archived data, use Amazon EMR to perform data transformations.

Answer 141

A

Create custom AWS Glue Data Quality rulesets to define specific data quality checks.
