Review Flashcards

1
Q

A financial services company runs its flagship web application on AWS. The application serves thousands of users during peak hours. The company needs a scalable near-real-time solution to share hundreds of thousands of financial transactions with multiple internal applications. The solution should also remove sensitive details from the transactions before storing the cleansed transactions in a document database for low-latency retrieval.

Which of the following would you recommend?

A) Batch process the raw transactions data into Amazon S3 flat files. Use S3 events to trigger an AWS Lambda function to remove sensitive data from the raw transactions in the flat file and then store the cleansed transactions in Amazon DynamoDB. Leverage DynamoDB Streams to share the transaction data with the internal applications

B) Persist the raw transactions into Amazon DynamoDB. Configure a rule in Amazon DynamoDB to update the transaction by removing sensitive data whenever any new raw transaction is written. Leverage Amazon DynamoDB Streams to share the transaction data with the internal applications

C) Feed the streaming transactions into Amazon Kinesis Data Streams. Leverage AWS Lambda integration to remove sensitive data from every transaction and then store the cleansed transactions in Amazon DynamoDB. The internal applications can consume the raw transactions off the Amazon Kinesis Data Stream

D) Feed the streaming transactions into Amazon Kinesis Data Firehose. Leverage AWS Lambda integration to remove sensitive data from every transaction and then store the cleansed transactions in Amazon DynamoDB. The internal applications can consume the raw transactions off the Amazon Kinesis Data Firehose

A

C
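
For reference, a minimal sketch of the cleansing piece in answer C: a Lambda function subscribed to the Kinesis Data Stream strips sensitive fields and writes the cleansed record to DynamoDB, while the internal applications read the same stream as additional consumers. The table name and sensitive field names are hypothetical.

    import base64
    import json
    from decimal import Decimal

    import boto3

    table = boto3.resource("dynamodb").Table("transactions")    # hypothetical table
    SENSITIVE_FIELDS = {"card_number", "cvv"}                    # hypothetical fields

    def handler(event, context):
        """Triggered by Kinesis Data Streams; stores cleansed transactions."""
        for record in event["Records"]:
            payload = json.loads(
                base64.b64decode(record["kinesis"]["data"]), parse_float=Decimal
            )
            cleansed = {k: v for k, v in payload.items() if k not in SENSITIVE_FIELDS}
            table.put_item(Item=cleansed)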

2
Q

A university has partnerships with local hospitals to share anonymized health statistics. The data is stored in Amazon S3 as .csv files. Amazon Athena is used to run extensive analytics on the data to find correlations between different parameters. The university is facing high costs and performance-related issues as the volume of data grows rapidly. The data in the S3 bucket is already partitioned by date, and the university does not want to change this partition scheme.

As a data engineer, how can you further improve query performance? (Select two)

A) Transform .csv files to Parquet format by fetching only the data fields required for predicates

B) The S3 bucket should be configured in the same AWS Region where the Athena queries are being run

C) Transform .csv files to JSON format by fetching the required key-value pairs only

D) Remove partitions and perform data bucketing on the S3 bucket

E) The S3 bucket should be configured in the same Availability Zone where the Athena queries are being run

A

AB
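
One way to carry out the transformation in option A is an Athena CTAS query that writes only the columns used in predicates as Parquet while keeping the existing date partitions. The table, column, bucket, and Region names below are hypothetical.

    import boto3

    athena = boto3.client("athena", region_name="us-east-1")   # hypothetical Region

    query = """
    CREATE TABLE health_stats_parquet
    WITH (
        format = 'PARQUET',
        external_location = 's3://example-bucket/health-stats-parquet/',
        partitioned_by = ARRAY['dt']
    ) AS
    SELECT hospital_id, parameter, value, dt
    FROM health_stats_csv
    """

    athena.start_query_execution(
        QueryString=query,
        ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
    )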

3
Q

A CRM company has a software as a service (SaaS) application that feeds updates to other in-house and third-party applications. The SaaS application and the in-house applications are being migrated to use AWS services for this inter-application communication.

Which of the following would you suggest to asynchronously decouple the architecture?

A) Use Elastic Load Balancing (ELB) for effective decoupling of system architecture

B) Use Amazon Simple Notification Service (Amazon SNS) to communicate between systems and decouple the architecture

C) Use Amazon Simple Queue Service (Amazon SQS) to decouple the architecture

D) Use Amazon EventBridge to decouple the system architecture

A

D
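
A minimal sketch of how the SaaS application could publish updates to a custom Amazon EventBridge bus (bus name, source, detail-type, and payload are hypothetical); in-house and third-party consumers are then attached through EventBridge rules and targets.

    import json

    import boto3

    events = boto3.client("events")

    events.put_events(
        Entries=[
            {
                "EventBusName": "crm-updates",        # hypothetical custom event bus
                "Source": "saas.crm",                 # hypothetical event source
                "DetailType": "ContactUpdated",
                "Detail": json.dumps({"contactId": "123", "status": "updated"}),
            }
        ]
    )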

4
Q

The data engineering team at a logistics company leverages AWS Cloud to process Internet of Things (IoT) sensor data from the field devices of the company. The team stores the sensor data in Amazon DynamoDB tables. To detect anomalous behaviors and respond quickly, all changes to the items stored in the DynamoDB tables must be logged in near real-time.

As an AWS Certified Data Engineer Associate, which of the following solutions would you suggest to meet the requirements of the given use case with minimal custom development and infrastructure maintenance?

A) Set up DynamoDB Streams to capture and send updates to a Lambda function that outputs records directly to Kinesis Data Analytics (KDA). Detect and analyze anomalies in KDA and send notifications via SNS

B) Set up CloudTrail to capture all API calls that update the DynamoDB tables. Leverage CloudTrail event filtering to analyze anomalous behaviors and send SNS notifications in case anomalies are detected

C) Set up DynamoDB Streams to capture and send updates to a Lambda function that outputs records to Kinesis Data Analytics (KDA) via Kinesis Data Streams (KDS). Detect and analyze anomalies in KDA and send notifications via SNS

D) Configure event patterns in CloudWatch Events to capture DynamoDB API call events and set up Lambda function as a target to analyze anomalous behavior. Send SNS notifications when anomalous behaviors are detected

A

C
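
A sketch of the relay Lambda in answer C, forwarding DynamoDB Streams change records into a Kinesis Data Stream so that Kinesis Data Analytics can run the anomaly detection; the stream name is hypothetical.

    import json

    import boto3

    kinesis = boto3.client("kinesis")
    STREAM_NAME = "iot-item-changes"   # hypothetical Kinesis Data Stream

    def handler(event, context):
        """Triggered by DynamoDB Streams; forwards each change record to KDS."""
        for record in event["Records"]:
            kinesis.put_record(
                StreamName=STREAM_NAME,
                Data=json.dumps(record["dynamodb"]),
                PartitionKey=record["eventID"],
            )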

5
Q

A healthcare company has recently migrated to Amazon Redshift. The technology team at the company is now working on the Disaster Recovery (DR) plans for the Redshift cluster deployed in the eu-west-1 Region. The existing cluster is encrypted via AWS KMS and the team wants to copy the Redshift snapshots to another Region to meet the DR requirements.

Which of the following solutions would you recommend to meet the given requirements?

A) Create a snapshot copy grant in the destination Region for a KMS key in the destination Region. Configure Redshift cross-Region snapshots in the source Region

B) Create an IAM role in the destination Region with access to the KMS key in the source Region. Create a snapshot copy grant in the destination Region for this KMS key in the source Region. Configure Redshift cross-Region snapshots in the source Region

C) Create a snapshot copy grant in the source Region for a KMS key in the source Region. Configure Redshift cross-Region snapshots in the destination Region

D) Create a snapshot copy grant in the destination Region for a KMS key in the destination Region. Configure Redshift cross-Region replication in the source Region

A

A
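
A sketch of answer A with boto3: the snapshot copy grant is created in the destination Region against a KMS key in that Region, and cross-Region snapshot copy is then enabled on the cluster in the source Region (eu-west-1). The destination Region, grant name, key ID, and cluster identifier are hypothetical.

    import boto3

    # 1. Destination Region: let Redshift use a KMS key that lives there
    redshift_dest = boto3.client("redshift", region_name="eu-central-1")
    redshift_dest.create_snapshot_copy_grant(
        SnapshotCopyGrantName="dr-copy-grant",
        KmsKeyId="1234abcd-12ab-34cd-56ef-1234567890ab",   # key in the destination Region
    )

    # 2. Source Region: enable cross-Region snapshot copy using that grant
    redshift_src = boto3.client("redshift", region_name="eu-west-1")
    redshift_src.enable_snapshot_copy(
        ClusterIdentifier="analytics-cluster",
        DestinationRegion="eu-central-1",
        SnapshotCopyGrantName="dr-copy-grant",
    )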

6
Q

A healthcare startup's application processes real-time patient health data through an analytics workflow. With a sharp increase in the number of users, the system has become slow and sometimes even unresponsive, as it does not have a retry mechanism. The startup is looking for a scalable solution with minimal implementation overhead.

Which of the following would you recommend as a scalable alternative to the current solution?

A) Use Amazon Simple Notification Service (Amazon SNS) for data ingestion and configure AWS Lambda to trigger logic for downstream processing

B) Use Amazon API Gateway with the existing REST-based interface to create a high-performing architecture

C) Use Amazon Kinesis Data Streams to ingest the data, process it using AWS Lambda, or run analytics using Amazon Kinesis Data Analytics

D) Use Amazon Simple Queue Service (Amazon SQS) for data ingestion and configure AWS Lambda to trigger logic for downstream processing

A

C
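
For context, the ingestion side of answer C can be as simple as the application putting each health event onto the stream; Lambda and Kinesis Data Analytics then consume it downstream with built-in retry behavior. The stream name and payload fields are hypothetical.

    import json

    import boto3

    kinesis = boto3.client("kinesis")

    def publish_reading(reading: dict) -> None:
        """Write one patient health reading to the stream."""
        kinesis.put_record(
            StreamName="patient-health-data",     # hypothetical stream
            Data=json.dumps(reading),
            PartitionKey=reading["patient_id"],   # hypothetical sharding key
        )

    publish_reading({"patient_id": "p-42", "heart_rate": 88, "ts": "2024-01-01T00:00:00Z"})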

7
Q

A Silicon Valley based startup helps its users legally sign highly confidential contracts. To meet the compliance guidelines, the startup must ensure that the signed contracts are encrypted using the AES-256 algorithm via an encryption key that is generated as well as managed internally. The startup is now migrating to AWS Cloud and would like the data to be encrypted on AWS. The startup wants to continue using its existing encryption key generation as well as key management mechanism.

What do you recommend?

A) SSE-S3

B) SSE-C

C) SSE-KMS

D) Client-Side Encryption

A

B
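
A sketch of an SSE-C upload with boto3: the startup supplies its own 256-bit AES key with each request, and S3 uses it for encryption without storing it. The bucket, object key, and key source are hypothetical; retrieving the object later requires passing the same key again.

    import os

    import boto3

    s3 = boto3.client("s3")

    # Hypothetical stand-in: in practice the 32-byte key comes from the
    # startup's own key generation and management system.
    customer_key = os.urandom(32)

    s3.put_object(
        Bucket="signed-contracts",                 # hypothetical bucket
        Key="contracts/contract-001.pdf",          # hypothetical object key
        Body=b"...contract bytes...",
        SSECustomerAlgorithm="AES256",
        SSECustomerKey=customer_key,               # boto3 adds the key-MD5 header
    )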

8
Q

A data engineering team wants to orchestrate multiple Amazon ECS task types running on Amazon EC2 instances that are part of an Amazon ECS cluster. The output and state data for all tasks need to be stored. The amount of data output by each task is approximately 20 megabytes, and there could be hundreds of tasks running at a time. As old outputs are archived, the storage size is not expected to exceed 1 terabyte.

Which of the following would you recommend as an optimized solution for high-frequency reading and writing?

A) Use Amazon DynamoDB table that is accessible by all ECS cluster instances

B) Use Amazon EFS with Bursting Throughput mode

C) Use an Amazon EBS volume mounted to the Amazon ECS cluster instances

D) Use Amazon EFS with Provisioned Throughput mode

A

D
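
A sketch of creating the shared file system in answer D. Provisioned Throughput is the better fit here because Bursting Throughput scales with the amount of data stored, and this workload stays small (up to about 1 TB) while reads and writes are frequent. The creation token and throughput figure are hypothetical.

    import boto3

    efs = boto3.client("efs")

    efs.create_file_system(
        CreationToken="ecs-task-output",      # hypothetical idempotency token
        PerformanceMode="generalPurpose",
        ThroughputMode="provisioned",
        ProvisionedThroughputInMibps=256,     # hypothetical throughput target
        Encrypted=True,
    )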

9
Q

A company has created an Amazon Redshift data warehouse to analyze data from Amazon S3. From the usage patterns, the data engineering team has noticed that after 30 days the data is rarely queried in Redshift and is no longer “hot data”. The team would like to preserve the SQL querying capability on the data and have query execution start immediately. The team also wants to adopt a pricing model that minimizes the company's Redshift costs.

As an AWS Certified Data Engineer Associate, which of the following options would you recommend? (Select two)

A) Migrate the Redshift cluster’s underlying storage class to Standard-IA

B) Move the data to S3 Standard IA after 30 days

C) Move the data to S3 Glacier Deep Archive after 30 days

D) Create a smaller Redshift Cluster with the cold data

E) Analyze the cold data with Athena

A

BE
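
A sketch of the lifecycle rule behind option B (bucket name and prefix are hypothetical); once the objects have transitioned to S3 Standard-IA, Athena can still query them with queries that start immediately (option E).

    import boto3

    s3 = boto3.client("s3")

    s3.put_bucket_lifecycle_configuration(
        Bucket="warehouse-exports",                      # hypothetical bucket
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "cold-data-to-ia",
                    "Status": "Enabled",
                    "Filter": {"Prefix": "sales/"},      # hypothetical prefix
                    "Transitions": [
                        {"Days": 30, "StorageClass": "STANDARD_IA"}
                    ],
                }
            ]
        },
    )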

10
Q

A company wants to use AWS for its connected cab application that would collect sensor data from its electric cab fleet to give drivers dynamically updated map information. The company would like to build its new sensor service by leveraging fully serverless components that are provisioned and managed automatically by AWS. The development team at the company does not want an option that requires the capacity to be manually provisioned, as it does not want to respond manually to changing volumes of sensor data. The company has hired you to provide consultancy for this strategic initiative.

Given these constraints, which of the following solutions would you suggest as the BEST fit to develop this service?

A) Ingest the sensor data in Kinesis Data Firehose, which directly writes the data into an auto-scaled DynamoDB table for downstream processing

B) Ingest the sensor data in an Amazon SQS standard queue, which is polled by an application running on an EC2 instance, and the data is written into an auto-scaled DynamoDB table for downstream processing

C) Ingest the sensor data in a Kinesis Data Stream, which is polled by an application running on an EC2 instance, and the data is written into an auto-scaled DynamoDB table for downstream processing

D) Ingest the sensor data in an Amazon SQS standard queue, which is polled by a Lambda function in batches, and the data is written into an auto-scaled DynamoDB table for downstream processing

A

D
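
A minimal sketch of the Lambda consumer in answer D, reading SQS messages in batches and writing them to DynamoDB; every component scales without manual capacity provisioning. The table name and message shape are hypothetical.

    import json
    from decimal import Decimal

    import boto3

    table = boto3.resource("dynamodb").Table("sensor-readings")   # hypothetical table

    def handler(event, context):
        """Triggered by the SQS event source mapping with a batch of messages."""
        with table.batch_writer() as batch:
            for message in event["Records"]:
                reading = json.loads(message["body"], parse_float=Decimal)
                batch.put_item(Item=reading)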

11
Q

A company has noticed several provisioned throughput exceptions on its Amazon DynamoDB database due to major spikes in the writes to the database. The development team wants to decouple the application layer from the database layer and dedicate a worker process to writing the data to Amazon DynamoDB.

Which of the following options can scale infinitely and meet these requirements in the most cost-effective way?

A) Amazon DynamoDB DAX

B) Amazon Simple Notification Service (Amazon SNS)

C) Amazon Kinesis Data Streams

D) Amazon Simple Queue Service (Amazon SQS)

A

D
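
The dedicated worker in answer D could be a simple long-polling loop (queue URL, table name, and message shape are hypothetical); the queue buffers write spikes so DynamoDB no longer throws provisioned throughput exceptions.

    import json
    from decimal import Decimal

    import boto3

    sqs = boto3.client("sqs")
    table = boto3.resource("dynamodb").Table("orders")   # hypothetical table
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/111122223333/writes"   # hypothetical

    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
        )
        for msg in resp.get("Messages", []):
            table.put_item(Item=json.loads(msg["Body"], parse_float=Decimal))
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])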

12
Q

A company utilizes AWS Step Functions to manage a data pipeline that includes Amazon EMR jobs for data ingestion from various sources and subsequent storage in an Amazon S3 bucket. This pipeline also incorporates EMR jobs that transfer the data to Amazon Redshift. The cloud infrastructure team has manually configured a Step Functions state machine and initiated an EMR cluster within a VPC to facilitate the EMR jobs. However, the Step Functions state machine is currently unable to execute the EMR jobs.

What are the two steps that the company should take to determine the root cause behind the AWS Step Functions state machine’s failure to run the EMR jobs? (Select two)

A) Ensure that the AWS Step Functions state machine has the necessary IAM permissions to both create and execute the EMR jobs. Additionally, confirm that it has the required IAM permissions to interact with the Amazon S3 buckets utilized by the EMR jobs. To verify the access settings of the S3 buckets, utilize Access Analyzer for Amazon S3

B) Add a Fail state in the AWS Step Functions state machine to handle the failure of the EMR jobs. Address the failure in a Catch block to send an SNS notification to a human user for further action

C) Examine the VPC flow logs to assess whether traffic from the EMR cluster can effectively reach the data providers. Also, check if the security groups associated with the Amazon EMR cluster permit connections to the data source servers through the specified ports

D) Ensure that the AWS Step Functions state machine has the necessary IAM permissions to both create and execute the EMR jobs. Additionally, confirm that it has the required IAM permissions to interact with the Amazon S3 buckets utilized by the EMR jobs. To verify the access settings of the S3 buckets, utilize S3 Analytics storage class analysis for Amazon S3

E) Add a Fail state in the AWS Step Functions state machine to handle the failure of the EMR jobs. Address the failure in a Retry block by increasing the number of seconds in the interval between each EMR task

A

AC

13
Q

A company is looking at transferring its archived digital media assets of around 20 petabytes to AWS Cloud in the shortest possible time.

Which of the following is an optimal solution for this requirement, given that the company’s archives are located at a remote location?

A) AWS DataSync

B) AWS Storage Gateway

C) AWS Snowmobile

D) AWS Snowball

A

C

14
Q

A data engineer is working on the throughput capacity of a newly provisioned table in Amazon DynamoDB. The data engineer has provisioned 20 Read Capacity Units for the table.

Which of the following options represents the correct throughput that the table will support for the various read modes?

A) Read throughput of 80KB/sec with strong consistency, Read throughput of 160KB/sec with eventual consistency, Transactional read throughput of 40KB/sec

B) Read throughput of 80KB/sec with strong consistency, Read throughput of 160KB/sec with eventual consistency, Transactional read throughput of 320KB/sec

C) Read throughput of 40KB/sec with strong consistency, Read throughput of 80KB/sec with eventual consistency, Transactional read throughput of 60KB/sec

D) Read throughput of 40KB/sec with strong consistency, Read throughput of 80KB/sec with eventual consistency, Transactional read throughput of 120KB/sec

A

A
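
For reference, the standard DynamoDB capacity math behind answer A: one RCU supports one strongly consistent read per second of an item up to 4 KB, two eventually consistent reads per second (8 KB), or half a transactional read per second (2 KB, since transactional reads consume two RCUs). With 20 RCUs:

    strongly consistent:   20 x 4 KB = 80 KB/sec
    eventually consistent: 20 x 8 KB = 160 KB/sec
    transactional:         20 x 2 KB = 40 KB/sec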

15
Q

A logistics company is building a multi-tier application to track the location of its trucks during peak operating hours. The company wants these data points to be accessible in real-time in its analytics platform via a REST API. The company has hired you as an AWS Certified Data Engineer Associate to build a multi-tier solution to store and retrieve this location data for analysis.

Which of the following options addresses the given use case?

A) Leverage Amazon API Gateway with AWS Lambda

B) Leverage Amazon Athena with Amazon S3

C) Leverage Amazon QuickSight with Amazon Redshift

D) Leverage Amazon API Gateway with Amazon Kinesis Data Analytics

A

D

16
Q

A digital media company has hired you to improve the data backup solution for applications running on the AWS Cloud. Currently, all of the applications running on AWS use at least two Availability Zones (AZs). The updated backup policy at the company mandates that all nightly backups of its data are durably stored in at least two geographically distinct Regions for Production and Disaster Recovery (DR), and that the backup processes for both Regions are fully automated. The new backup solution must ensure that the backup is available to be restored immediately in the Production Region and within 24 hours in the DR Region.

Which of the following represents the MOST cost-effective solution that will address the given use-case?

A) Create a backup process to persist all the data to an S3 bucket A using the S3 standard storage class in the Production Region. Set up cross-Region replication of this S3 bucket A to an S3 bucket B using S3 standard-IA storage class in the DR Region and set up a lifecycle policy in the DR Region to immediately move this data to Amazon Glacier Deep Archive

B) Create a backup process to persist all the data to Amazon Glacier Deep Archive in the Production Region. Set up cross-Region replication of this data to Amazon Glacier Deep Archive in the DR Region to ensure minimum possible costs in both Regions

C) Create a backup process to persist all the data to a large Amazon EBS volume attached to the backup server in the Production Region. Run nightly cron jobs to snapshot these volumes and then copy these snapshots to the DR Region

D) Create a backup process to persist all the data to an S3 bucket A using the S3 standard storage class in the Production Region. Set up cross-Region replication of this S3 bucket A to an S3 bucket B using S3 standard storage class in the DR Region and set up a lifecycle policy in the DR Region to immediately move this data to Amazon Glacier Deep Archive

17
Q

An e-commerce company performs analytics on its data using an Amazon Redshift cluster. The Redshift cluster has two important tables: the orders table and the product table, which have millions of rows each. A few small tables with supporting data are also present. The team is looking for the right distribution styles for the tables to optimize query speed.

Which of the following are the key points to consider while planning for the best distribution style for your data? (Select two)

A) Data should be distributed in such a way that the rows that participate in joins are already collocated on the nodes with their joining rows in other tables

B) Choose a column with low cardinality in the filtered result set

C) A fact table with multiple distribution keys is useful when multiple dimension tables have to be joined to it

D) If a dimension table cannot be collocated with the fact table or other important joining tables, use ALL distribution style for such tables

E) Small Dimension tables should be marked to use KEY distribution style, which will cause them to be replicated to each physical node in the cluster

18
Q

A company wants to store all of its consumer data on Amazon S3. Before storing the data, the company must clean it by standardizing the formats of a few of the data columns. A single data record might range in size from 500 KB to 10 MB.

Which of these options represents the right solution?

A) Use Amazon Simple Queue Service (Amazon SQS) to ingest incoming data. Configure an AWS Lambda function to read events from the SQS queue and upload the events to Amazon S3

B) Use Amazon Kinesis Data Firehose to ingest data. Configure an AWS Lambda function to cleanse/transform the data written into the Firehose delivery stream which is then delivered to Amazon S3

C) Use Amazon Managed Streaming for Apache Kafka. Create a topic for the initial raw data. Use a Kafka producer to write data on this topic. Use the Apache Kafka consumer API to create a consumer application (that can be hosted on Amazon EC2 instance) that reads data from this topic, transforms the data as needed, and writes it to Amazon S3 for final storage

D) Use Amazon Kinesis Data Streams. Configure a stream for incoming raw data. Kinesis Agent can be used to write data to the stream. Configure an Amazon Kinesis Data Analytics application to read the raw data and transform it to the necessary format before writing it to Amazon S3

19
Q

A retail company uses Amazon RDS to store sales data. For the analytics workloads that require high performance, only the last six months of data (approximately 50 TB) will be frequently queried. At the end of each month, the monthly sales data will be merged with the historical sales data for the last 5 years, which should also be available for analysis. The CTO at the company is looking at a cost-optimal solution that offers the best performance for this use case.

Which of the following would you select for the given requirement?

A) Configure a read replica of the RDS database to store the last six months of data and execute more frequent queries on the read replica. Export RDS data to S3 and schedule an AWS data pipeline for an incremental copy of RDS data to S3. Configure an AWS Glue Data Catalog of the data in S3 and use Amazon Athena to query the historical data in S3

B) Use AWS data pipeline to incrementally load the last six months of data into Amazon Redshift and execute more frequent queries on Redshift. Set up a read replica of the RDS database to run queries on the historical data

C) Export RDS data to S3 and schedule an AWS data pipeline for an incremental copy of RDS data to S3. Load and store the last six months of data from S3 in Amazon Redshift. Configure an Amazon Redshift Spectrum table to connect to all the historical data in S3

D) Export RDS data to S3 and schedule an AWS data pipeline for an incremental copy of RDS data to S3. Configure an AWS Glue Data Catalog of the data in S3 and use Amazon Athena to query the entire data in S3

20
Q

A company has hired you to help with redesigning a real-time data processor. The company wants to build custom applications that process and analyze the streaming data for its specialized needs.

Which solution will you recommend to address this use-case?

A) Use Amazon Kinesis Data Streams to process the data streams as well as decouple the producers and consumers for the real-time data processor

B) Use Amazon Kinesis Data Firehose to process the data streams as well as decouple the producers and consumers for the real-time data processor

C) Use Amazon Simple Queue Service (Amazon SQS) to process the data streams as well as decouple the producers and consumers for the real-time data processor

D) Use Amazon Simple Notification Service (Amazon SNS) to process the data streams as well as decouple the producers and consumers for the real-time data processor

21
Q

A legacy application is built using a tightly-coupled monolithic architecture. Due to a sharp increase in the number of users, the application performance has degraded. The company now wants to decouple the architecture and adopt AWS microservices architecture. Some of these microservices need to handle fast-running processes whereas other microservices need to handle slower processes.

Which of these options would you identify as the right way of connecting these microservices?

A) Configure Amazon Simple Queue Service (Amazon SQS) queue to decouple microservices running faster processes from the microservices running slower ones

B) Use Amazon Simple Notification Service (Amazon SNS) to decouple microservices running faster processes from the microservices running slower ones

C) Configure Amazon Kinesis Data Streams to decouple microservices running faster processes from the microservices running slower ones

D) Add Amazon EventBridge to decouple the complex architecture

22
Q

An Internet-of-Things (IoT) company needs a solution that can collect near real-time data from all its devices/sensors and store them in nested JSON format. The solution must offer data persistence and support the capability to query the data with a maximum latency of 10 milliseconds.

As a data engineer, how will you implement an optimal solution such that it has the LEAST operational overhead?

A) Configure Amazon Simple Queue Service (Amazon SQS) to capture the real-time sensor data. Define an AWS Lambda function to poll the SQS queues and process the data. Store the data in Amazon DynamoDB for querying

B) Use Amazon Kinesis Data Streams to capture the sensor data. Define an AWS Lambda function to process the data and write to the DynamoDB table. Store the data in Amazon DynamoDB for querying

C) Use Amazon Data Firehose to capture the sensor data. Directly store the data in Amazon DynamoDB for querying

D) Use the fully managed Apache Kafka cluster to capture the sensor data in near real-time. Store the data in Amazon S3 for querying

23
Q

A digital media company needs to manage file uploads of around 1 terabyte each from an application used by a partner company.

How will you handle the upload of these files to Amazon S3?

A) Use AWS Direct Connect to provide extra bandwidth

B) Use AWS Snowball Edge Storage Optimized device

C) Use Amazon S3 Versioning

D) Use multi-part upload feature of Amazon S3

24
Q

The data engineering team at a company is working on the Disaster Recovery (DR) plans for a Redshift cluster deployed in the us-east-1 Region. The existing cluster is encrypted via AWS KMS and the team wants to copy the Redshift snapshots to another Region to meet the DR requirements.

Which of the following solutions would you suggest to address the given use-case?

A) Create an IAM role in the destination Region with access to the KMS key in the source Region. Create a snapshot copy grant in the destination Region for this KMS key in the source Region. Configure Redshift cross-Region snapshots in the source Region

B) Create a snapshot copy grant in the destination Region for a KMS key in the destination Region. Configure Redshift cross-Region snapshots in the source Region

C) Create a snapshot copy grant in the source Region for a KMS key in the source Region. Configure Redshift cross-Region snapshots in the destination Region

D) Create a snapshot copy grant in the destination Region for a KMS key in the destination Region. Configure Redshift cross-Region replication in the source Region

25
Q

A US-based healthcare startup manages an interactive diagnostic tool for COVID-19 related assessments. The users are required to capture their personal health records via this tool. As this is sensitive health information, the backup of the user data must be kept encrypted in Amazon Simple Storage Service (Amazon S3). The startup does not want to provide its own encryption keys but still wants to maintain an audit trail on the usage of the encryption key.

What do you recommend?

A) Use server-side encryption with Amazon S3 managed keys (SSE-S3) to encrypt the user data on Amazon S3

B) Use client-side encryption with client-provided keys and then upload the encrypted user data to Amazon S3

C) Use server-side encryption with AWS Key Management Service keys (SSE-KMS) to encrypt the user data on Amazon S3

D) Use server-side encryption with customer-provided keys (SSE-C) to encrypt the user data on Amazon S3

26
Q

A company is using a fleet of Amazon EC2 instances to ingest Internet-of-Things (IoT) data from various data sources. The data is in JSON format and ingestion rates can be as high as 1 MB/s. When an EC2 instance is restarted, the in-flight data is lost. The data engineering team at the company wants to store as well as query the ingested data in near-real-time.

Which of the following solutions provides near-real-time data querying that is scalable with minimal data loss?

A) Capture data in Amazon Kinesis Data Streams. Use Amazon Kinesis Data Analytics to query and analyze this streaming data in real-time

B) Capture data in Amazon Kinesis Data Firehose with Amazon Redshift as the destination. Use Amazon Redshift to query the data

C) Capture data in an Amazon EC2 instance store and then publish this data to Amazon Kinesis Data Firehose with Amazon S3 as the destination. Use Amazon Athena to query the data

D) Capture data in an Amazon EBS volume and then publish this data to Amazon ElastiCache for Redis. Subscribe to the Redis channel to query the data