AWS Certified DE 150-200 Flashcards
A data engineer needs to use Amazon Neptune to develop graph applications.
Which programming languages should the engineer use to develop the graph applications? (Choose two.)
A. Gremlin
B. SQL
C. ANSI SQL
D. SPARQL
E. Spark SQL
AD
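To make answers A and D concrete, here is a minimal Gremlin sketch using the gremlin_python driver; the endpoint and the "game" vertex label are placeholders, not part of the question.
```python
# Minimal sketch: connect to a Neptune cluster and run a Gremlin traversal.
# The endpoint and vertex label are hypothetical.
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

conn = DriverRemoteConnection("wss://your-neptune-endpoint:8182/gremlin", "g")
g = traversal().withRemote(conn)

# Fetch a few vertices with a placeholder label and print their property maps.
print(g.V().hasLabel("game").limit(5).valueMap().toList())

conn.close()
```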
A mobile gaming company wants to capture data from its gaming app.
The company wants to make the data available to three internal consumers of the data. The data records are approximately 20 KB in size.
The company wants to achieve optimal throughput from each device that runs the gaming app. Additionally, the company wants to develop an application to process data streams. The stream-processing application must have dedicated throughput for each internal consumer.
Which solution will meet these requirements?
A. Configure the mobile app to call the PutRecords API operation to send data to Amazon Kinesis Data Streams. Use the enhanced fan-out feature with a stream for each internal consumer.
B. Configure the mobile app to call the PutRecordBatch API operation to send data to Amazon Kinesis Data Firehose. Submit an AWS Support case to turn on dedicated throughput for the company’s AWS account. Allow each internal consumer to access the stream.
C. Configure the mobile app to use the Amazon Kinesis Producer Library (KPL) to send data to Amazon Kinesis Data Firehose. Use the enhanced fan-out feature with a stream for each internal consumer.
D. Configure the mobile app to call the PutRecords API operation to send data to Amazon Kinesis Data Streams. Host the stream-processing application for each internal consumer on Amazon EC2 instances. Configure auto scaling for the EC2 instances.
A
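A minimal boto3 sketch of answer A: PutRecords on the producer side and a registered enhanced fan-out consumer on the stream. The stream name, ARN, and consumer name are placeholders.
```python
# Sketch (boto3): produce records with PutRecords and register a dedicated
# (enhanced fan-out) consumer so each internal application gets its own
# 2 MB/s-per-shard read throughput. All names/ARNs are hypothetical.
import json
import boto3

kinesis = boto3.client("kinesis")

# Producer side: the mobile app batches its ~20 KB records with PutRecords.
kinesis.put_records(
    StreamName="game-events",
    Records=[
        {"Data": json.dumps({"player": "p1", "score": 100}).encode(), "PartitionKey": "p1"},
    ],
)

# Consumer side: one registered consumer per internal stream-processing application.
kinesis.register_stream_consumer(
    StreamARN="arn:aws:kinesis:us-east-1:111122223333:stream/game-events",
    ConsumerName="analytics-consumer",
)
```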
A retail company uses an Amazon Redshift data warehouse and an Amazon S3 bucket.
The company ingests retail order data into the S3 bucket every day.
The company stores all order data at a single path within the S3 bucket. The data has more than 100 columns. The company ingests the order data from a third-party application that generates more than 30 files in CSV format every day. Each CSV file is between 50 and 70 MB in size.
The company uses Amazon Redshift Spectrum to run queries that select sets of columns. Users aggregate metrics based on daily orders. Recently, users have reported that the performance of the queries has degraded. A data engineer must resolve the performance issues for the queries.
Which combination of steps will meet this requirement with the LEAST development effort? (Choose two.)
A. Configure the third-party application to create the files in a columnar format.
B. Develop an AWS Glue ETL job to convert the multiple daily CSV files to one file for each day.
C. Partition the order data in the S3 bucket based on order date.
D. Configure the third-party application to create the files in JSON format.
E. Load the JSON data into the Amazon Redshift table in a SUPER type column.
AC
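A sketch of what answers A and C lead to: a Parquet-backed Redshift Spectrum external table partitioned by order date, created here through the Redshift Data API. The cluster, external schema, columns, and S3 path are placeholders.
```python
# Sketch (boto3 Redshift Data API): define a partitioned, Parquet-backed
# Spectrum table so queries scan only the columns and dates they need.
# Assumes the external schema "spectrum" already maps to the Glue Data Catalog.
import boto3

redshift_data = boto3.client("redshift-data")

ddl = """
CREATE EXTERNAL TABLE spectrum.daily_orders (
    order_id    varchar(64),
    customer_id varchar(64),
    order_total decimal(12,2)
)
PARTITIONED BY (order_date date)
STORED AS PARQUET
LOCATION 's3://example-retail-bucket/orders/';
"""

redshift_data.execute_statement(
    ClusterIdentifier="retail-cluster",  # hypothetical cluster
    Database="dev",
    DbUser="awsuser",
    Sql=ddl,
)
```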
A company stores customer records in Amazon S3.
The company must not delete or modify the customer record data for 7 years after each record is created. The root user also must not have the ability to delete or modify the data.
A data engineer wants to use S3 Object Lock to secure the data.
Which solution will meet these requirements?
A. Enable governance mode on the S3 bucket. Use a default retention period of 7 years.
B. Enable compliance mode on the S3 bucket. Use a default retention period of 7 years.
C. Place a legal hold on individual objects in the S3 bucket. Set the retention period to 7 years.
D. Set the retention period for individual objects in the S3 bucket to 7 years.
B
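A boto3 sketch of answer B, assuming a new bucket: Object Lock must be enabled when the bucket is created, after which a 7-year compliance-mode default retention can be applied. The bucket name is a placeholder.
```python
# Sketch (boto3): create a bucket with Object Lock enabled and apply a
# 7-year COMPLIANCE-mode default retention. Bucket name is hypothetical.
import boto3

s3 = boto3.client("s3")

s3.create_bucket(
    Bucket="example-customer-records",
    ObjectLockEnabledForBucket=True,  # Object Lock must be turned on at creation
)

s3.put_object_lock_configuration(
    Bucket="example-customer-records",
    ObjectLockConfiguration={
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Years": 7}},
    },
)
```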
A data engineer needs to create a new empty table in Amazon Athena that has the same schema as an existing table named old_table.
Which SQL statement should the data engineer use to meet this requirement?
A. CREATE TABLE new_table AS SELECT * FROM old_table;
B. INSERT INTO new_table SELECT * FROM old_table;
C. CREATE TABLE new_table (LIKE old_table);
D. CREATE TABLE new_table AS (SELECT * FROM old_table) WITH NO DATA;
D
A data engineer needs to create an Amazon Athena table based on a subset of data from an existing Athena table named cities_world.
The cities_world table contains cities that are located around the world. The data engineer must create a new table named cities_usa to contain only the cities from cities_world that are located in the US.
Which SQL statement should the data engineer use to meet this requirement?
A. INSERT INTO cities_usa (city, state) SELECT city, state FROM cities_world WHERE country='usa';
B. MOVE city, state FROM cities_world TO cities_usa WHERE country='usa';
C. INSERT INTO cities_usa SELECT city, state FROM cities_world WHERE country='usa';
D. UPDATE cities_usa SET (city, state) = (SELECT city, state FROM cities_world WHERE country='usa');
A
A company implements a data mesh that has a central governance account.
The company needs to catalog all data in the governance account. The governance account uses AWS Lake Formation to centrally share data and grant access permissions.
The company has created a new data product that includes a group of Amazon Redshift Serverless tables. A data engineer needs to share the data product with a marketing team. The marketing team must have access to only a subset of columns. The data engineer needs to share the same data product with a compliance team. The compliance team must have access to a different subset of columns than the marketing team needs access to.
Which combination of steps should the data engineer take to meet these requirements? (Choose two.)
A. Create views of the tables that need to be shared. Include only the required columns.
B. Create an Amazon Redshift data share that includes the tables that need to be shared.
C. Create an Amazon Redshift managed VPC endpoint in the marketing team’s account. Grant the marketing team access to the views.
D. Share the Amazon Redshift data share to the Lake Formation catalog in the governance account.
E. Share the Amazon Redshift data share to the Amazon Redshift Serverless workgroup in the marketing team’s account.
BD
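A sketch of answers B and D using the Redshift Data API: column-limited views are added to a datashare that is then authorized to the governance account's Data Catalog. The workgroup, view, account ID, and the VIA DATA CATALOG grant reflect the Lake Formation-managed datashare workflow and are assumptions here.
```python
# Sketch (boto3 Redshift Data API): build a datashare of column-limited views
# and grant it to the governance account via the Data Catalog (Lake Formation).
# All object names and the account ID are placeholders.
import boto3

redshift_data = boto3.client("redshift-data")

statements = [
    "CREATE VIEW public.orders_marketing AS SELECT order_id, region FROM public.orders;",
    "CREATE DATASHARE product_share;",
    "ALTER DATASHARE product_share ADD SCHEMA public;",
    "ALTER DATASHARE product_share ADD TABLE public.orders_marketing;",
    "GRANT USAGE ON DATASHARE product_share TO ACCOUNT '111122223333' VIA DATA CATALOG;",
]

for sql in statements:
    redshift_data.execute_statement(
        WorkgroupName="data-product-wg",  # Redshift Serverless workgroup (hypothetical)
        Database="dev",
        Sql=sql,
    )
```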
A company has a data lake in Amazon S3.
The company uses AWS Glue to catalog data and AWS Glue Studio to implement data extract, transform, and load (ETL) pipelines.
The company needs to ensure that data quality issues are checked every time the pipelines run. A data engineer must enhance the existing pipelines to evaluate data quality rules based on predefined thresholds.
Which solution will meet these requirements with the LEAST implementation effort?
A. Add a new transform that is defined by a SQL query to each Glue ETL job. Use the SQL query to implement a ruleset that includes the data quality rules that need to be evaluated.
B. Add a new Evaluate Data Quality transform to each Glue ETL job. Use Data Quality Definition Language (DQDL) to implement a ruleset that includes the data quality rules that need to be evaluated.
C. Add a new custom transform to each Glue ETL job. Use the PyDeequ library to implement a ruleset that includes the data quality rules that need to be evaluated.
D. Add a new custom transform to each Glue ETL job. Use the Great Expectations library to implement a ruleset that includes the data quality rules that need to be evaluated.
B
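For answer B, the ruleset is written in DQDL whether it is pasted into the Evaluate Data Quality transform in Glue Studio or registered through the API. A minimal sketch with hypothetical table and column names:
```python
# Sketch (boto3): a DQDL ruleset of the kind the Evaluate Data Quality
# transform evaluates, registered here against a Data Catalog table.
# Database, table, columns, and thresholds are placeholders.
import boto3

glue = boto3.client("glue")

ruleset = """
Rules = [
    RowCount > 1000,
    IsComplete "order_id",
    ColumnValues "order_total" >= 0,
    Completeness "customer_id" > 0.95
]
"""

glue.create_data_quality_ruleset(
    Name="daily-orders-quality",
    Ruleset=ruleset,
    TargetTable={"DatabaseName": "retail_db", "TableName": "orders"},
)
```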
A company has an application that uses a microservice architecture.
The company hosts the application on an Amazon Elastic Kubernetes Service (Amazon EKS) cluster.
The company wants to set up a robust monitoring system for the application. The company needs to analyze the logs from the EKS cluster and the application. The company needs to correlate the cluster’s logs with the application’s traces to identify points of failure in the whole application request flow.
Which combination of steps will meet these requirements with the LEAST development effort? (Choose two.)
A. Use Fluent Bit to collect logs. Use OpenTelemetry to collect traces.
B. Use Amazon CloudWatch to collect logs. Use Amazon Kinesis to collect traces.
C. Use Amazon CloudWatch to collect logs. Use Amazon Managed Streaming for Apache Kafka (Amazon MSK) to collect traces.
D. Use Amazon OpenSearch Service to correlate the logs and traces.
E. Use AWS Glue to correlate the logs and traces.
AD
A company has a gaming application that stores data in Amazon DynamoDB tables.
A data engineer needs to ingest the game data into an Amazon OpenSearch Service cluster. Data updates must occur in near real time.
Which solution will meet these requirements?
A. Use AWS Step Functions to periodically export data from the Amazon DynamoDB tables to an Amazon S3 bucket. Use an AWS Lambda function to load the data into Amazon OpenSearch Service.
B. Configure an AWS Glue job to have a source of Amazon DynamoDB and a destination of Amazon OpenSearch Service to transfer data in near real time.
C. Use Amazon DynamoDB Streams to capture table changes. Use an AWS Lambda function to process and update the data in Amazon OpenSearch Service.
D. Use a custom OpenSearch plugin to sync data from the Amazon DynamoDB tables.
C
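A sketch of the Lambda function in answer C, using the opensearch-py client to upsert DynamoDB Streams changes into an OpenSearch index. The domain endpoint, index name, and key attribute are placeholders.
```python
# Sketch: Lambda handler triggered by DynamoDB Streams that mirrors table
# changes into an OpenSearch Service index. Endpoint, index, and key names
# are hypothetical.
import boto3
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

HOST = "search-game-data-xxxx.us-east-1.es.amazonaws.com"  # hypothetical domain
credentials = boto3.Session().get_credentials()
auth = AWSV4SignerAuth(credentials, "us-east-1", "es")

client = OpenSearch(
    hosts=[{"host": HOST, "port": 443}],
    http_auth=auth,
    use_ssl=True,
    connection_class=RequestsHttpConnection,
)

def handler(event, context):
    for record in event["Records"]:
        doc_id = record["dynamodb"]["Keys"]["player_id"]["S"]  # hypothetical key
        if record["eventName"] == "REMOVE":
            client.delete(index="game-scores", id=doc_id, ignore=[404])
        else:
            # NewImage holds DynamoDB-typed attributes, e.g. {"score": {"N": "42"}};
            # this naive conversion keeps every value as a string.
            new_image = record["dynamodb"]["NewImage"]
            doc = {k: list(v.values())[0] for k, v in new_image.items()}
            client.index(index="game-scores", id=doc_id, body=doc)
```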
A company uses Amazon Redshift as its data warehouse service. A data engineer needs to design a physical data model.
The data engineer encounters a de-normalized table that is growing in size. The table does not have a suitable column to use as the distribution key.
Which distribution style should the data engineer use to meet these requirements with the LEAST maintenance overhead?
A. ALL distribution
B. EVEN distribution
C. AUTO distribution
D. KEY distribution
C
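A minimal sketch of answer C through the Redshift Data API; the cluster and column names are placeholders. DISTSTYLE AUTO lets Redshift choose, and later adjust, the distribution as the table grows.
```python
# Sketch (boto3 Redshift Data API): create the de-normalized table with
# DISTSTYLE AUTO so Redshift manages distribution with no manual upkeep.
import boto3

boto3.client("redshift-data").execute_statement(
    ClusterIdentifier="analytics-cluster",  # hypothetical cluster
    Database="dev",
    DbUser="awsuser",
    Sql="""
        CREATE TABLE sales_denormalized (
            sale_id       bigint,
            customer_name varchar(100),
            product_name  varchar(100),
            sale_ts       timestamp,
            amount        decimal(12,2)
        )
        DISTSTYLE AUTO;
    """,
)
```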
A retail company is expanding its operations globally.
The company needs to use Amazon QuickSight to accurately calculate currency exchange rates for financial reports. The company has an existing dashboard that includes a visual that is based on an analysis of a dataset that contains global currency values and exchange rates.
A data engineer needs to ensure that exchange rates are calculated with a precision of four decimal places. The calculations must be precomputed. The data engineer must materialize results in QuickSight’s super-fast, parallel, in-memory calculation engine (SPICE).
Which solution will meet these requirements?
A. Define and create the calculated field in the dataset.
B. Define and create the calculated field in the analysis.
C. Define and create the calculated field in the visual.
D. Define and create the calculated field in the dashboard.
A
A company has three subsidiaries. Each subsidiary uses a different data warehousing solution.
The first subsidiary hosts its data warehouse in Amazon Redshift. The second subsidiary uses Teradata Vantage on AWS. The third subsidiary uses Google BigQuery.
The company wants to aggregate all the data into a central Amazon S3 data lake. The company wants to use Apache Iceberg as the table format.
A data engineer needs to build a new pipeline to connect to all the data sources, run transformations by using each source engine, join the data, and write the data to Iceberg.
Which solution will meet these requirements with the LEAST operational effort?
A. Use native Amazon Redshift, Teradata, and BigQuery connectors to build the pipeline in AWS Glue. Use native AWS Glue transforms to join the data. Run a Merge operation on the data lake Iceberg table.
B. Use the Amazon Athena federated query connectors for Amazon Redshift, Teradata, and BigQuery to build the pipeline in Athena. Write a SQL query to read from all the data sources, join the data, and run a Merge operation on the data lake Iceberg table.
C. Use the native Amazon Redshift connector, the Java Database Connectivity (JDBC) connector for Teradata, and the open source Apache Spark BigQuery connector to build the pipeline in Amazon EMR. Write code in PySpark to join the data and run a Merge operation on the data lake Iceberg table.
D. Use the native Amazon Redshift, Teradata, and BigQuery connectors in Amazon AppFlow to write data to Amazon S3 and AWS Glue Data Catalog. Use Amazon Athena to join the data. Run a Merge operation on the data lake Iceberg table.
B
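A sketch of answer B: one Athena statement that reads from the three federated catalogs and merges into the Iceberg table, submitted with boto3. All catalog, table, and column names are placeholders.
```python
# Sketch (boto3): submit a single Athena query that joins federated sources
# (Redshift, Teradata, BigQuery connectors) and merges into an Iceberg table.
# Catalog, database, table, column names, and S3 output are hypothetical.
import boto3

athena = boto3.client("athena")

query = """
MERGE INTO analytics.orders_iceberg AS t
USING (
    SELECT order_id, amount FROM "redshift_catalog"."sales"."orders"
    UNION ALL
    SELECT order_id, amount FROM "teradata_catalog"."sales"."orders"
    UNION ALL
    SELECT order_id, amount FROM "bigquery_catalog"."sales"."orders"
) AS s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET amount = s.amount
WHEN NOT MATCHED THEN INSERT (order_id, amount) VALUES (s.order_id, s.amount)
"""

athena.start_query_execution(
    QueryString=query,
    WorkGroup="primary",
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
```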
A company is building a data stream processing application.
The application runs in an Amazon Elastic Kubernetes Service (Amazon EKS) cluster. The application stores processed data in an Amazon DynamoDB table.
The company needs the application containers in the EKS cluster to have secure access to the DynamoDB table. The company does not want to embed AWS credentials in the containers.
Which solution will meet these requirements?
A. Store the AWS credentials in an Amazon S3 bucket. Grant the EKS containers access to the S3 bucket to retrieve the credentials.
B. Attach an IAM role to the EKS worker nodes. Grant the IAM role access to DynamoDB. Use the IAM role to set up IAM roles service accounts (IRSA) functionality.
C. Create an IAM user that has an access key to access the DynamoDB table. Use environment variables in the EKS containers to store the IAM user access key data.
D. Create an IAM user that has an access key to access the DynamoDB table. Use Kubernetes secrets that are mounted in a volume of the EKS cluster nodes to store the user access key data.
B
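A sketch of the IAM side of answer B (IRSA): a role trust policy scoped to one Kubernetes service account through the cluster's OIDC provider. The OIDC provider ID, account ID, namespace, and role name are placeholders.
```python
# Sketch (boto3): create the IAM role that a specific Kubernetes service
# account can assume via the EKS cluster's OIDC provider (IRSA).
import json
import boto3

iam = boto3.client("iam")

oidc_provider = "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE1234567890"  # hypothetical

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Federated": f"arn:aws:iam::111122223333:oidc-provider/{oidc_provider}"},
        "Action": "sts:AssumeRoleWithWebIdentity",
        "Condition": {
            "StringEquals": {
                f"{oidc_provider}:sub": "system:serviceaccount:stream-app:stream-processor",
                f"{oidc_provider}:aud": "sts.amazonaws.com",
            }
        },
    }],
}

iam.create_role(
    RoleName="eks-stream-processor-ddb",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)
# Next: attach a DynamoDB access policy to the role and annotate the Kubernetes
# service account with eks.amazonaws.com/role-arn pointing at this role.
```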
A data engineer needs to onboard a new data producer into AWS.
The data producer needs to migrate data products to AWS.
The data producer maintains many data pipelines that support a business application. Each pipeline must have service accounts and their corresponding credentials. The data engineer must establish a secure connection from the data producer’s on-premises data center to AWS. The data engineer must not use the public internet to transfer data from an on-premises data center to AWS.
Which solution will meet these requirements?
A. Instruct the new data producer to create Amazon Machine Images (AMIs) on Amazon Elastic Container Service (Amazon ECS) to store the code base of the application. Create security groups in a public subnet that allow connections only to the on-premises data center.
B. Create an AWS Direct Connect connection to the on-premises data center. Store the service account credentials in AWS Secrets Manager.
C. Create a security group in a public subnet. Configure the security group to allow only connections from the CIDR blocks that correspond to the data producer. Create Amazon S3 buckets that contain presigned URLs that have one-day expiration dates.
D. Create an AWS Direct Connect connection to the on-premises data center. Store the application keys in AWS Secrets Manager. Create Amazon S3 buckets that contain presigned URLs that have one-day expiration dates.
B
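A boto3 sketch of the Secrets Manager half of answer B; the Direct Connect link is provisioned separately. The secret name and keys are placeholders.
```python
# Sketch (boto3): store a pipeline service account's credentials in Secrets
# Manager and retrieve them at run time. Secret name and keys are hypothetical.
import json
import boto3

secrets = boto3.client("secretsmanager")

secrets.create_secret(
    Name="pipelines/orders-loader/service-account",
    SecretString=json.dumps({"username": "svc_orders_loader", "password": "REPLACE_ME"}),
)

# A pipeline running in AWS fetches the credentials when it needs them.
value = secrets.get_secret_value(SecretId="pipelines/orders-loader/service-account")
creds = json.loads(value["SecretString"])
```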
A data engineer configured an AWS Glue Data Catalog for data that is stored in Amazon S3 buckets.
The data engineer needs to configure the Data Catalog to receive incremental updates.
The data engineer sets up event notifications for the S3 bucket and creates an Amazon Simple Queue Service (Amazon SQS) queue to receive the S3 events.
Which combination of steps should the data engineer take to meet these requirements with the LEAST operational overhead? (Choose two.)
A. Create an S3 event-based AWS Glue crawler to consume events from the SQS queue.
B. Define a time-based schedule to run the AWS Glue crawler, and perform incremental updates to the Data Catalog.
C. Use an AWS Lambda function to directly update the Data Catalog based on S3 events that the SQS queue receives.
D. Manually initiate the AWS Glue crawler to perform updates to the Data Catalog when there is a change in the S3 bucket.
E. Use AWS Step Functions to orchestrate the process of updating the Data Catalog based on S3 events that the SQS queue receives.
AC
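A boto3 sketch of answer A: a crawler that consumes S3 change notifications from the SQS queue and recrawls only what changed. The role, queue ARN, database, and path are placeholders.
```python
# Sketch (boto3): create an S3 event-based Glue crawler that reads object
# change events from the SQS queue (CRAWL_EVENT_MODE) instead of rescanning
# the full bucket. All ARNs and names are hypothetical.
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="incremental-s3-crawler",
    Role="arn:aws:iam::111122223333:role/GlueCrawlerRole",
    DatabaseName="datalake_db",
    Targets={
        "S3Targets": [{
            "Path": "s3://example-data-lake/raw/",
            "EventQueueArn": "arn:aws:sqs:us-east-1:111122223333:s3-object-events",
        }]
    },
    RecrawlPolicy={"RecrawlBehavior": "CRAWL_EVENT_MODE"},
)
```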
A company uses AWS Glue Data Catalog to index data that is uploaded to an Amazon S3 bucket every day.
The company uses a daily batch process in an extract, transform, and load (ETL) pipeline to upload data from external sources into the S3 bucket.
The company runs a daily report on the S3 data. Some days, the company runs the report before all the daily data has been uploaded to the S3 bucket. A data engineer must be able to send a message that identifies any incomplete data to an existing Amazon Simple Notification Service (Amazon SNS) topic.
Which solution will meet this requirement with the LEAST operational overhead?
A. Create data quality checks for the source datasets that the daily reports use. Create a new Amazon Managed Workflows for Apache Airflow (Amazon MWAA) environment. Run the data quality checks by using Airflow tasks that run data quality queries on the column data type and the presence of null values. Configure Airflow Directed Acyclic Graphs (DAGs) to send an email notification to the SNS topic to inform the data engineer about the incomplete datasets.
B. Create data quality checks on the source datasets that the daily reports use. Create a new Amazon EMR cluster. Use Apache Spark SQL to create Apache Spark jobs in the EMR cluster that run data quality queries on the column data type and the presence of null values. Orchestrate the ETL pipeline by using an AWS Step Functions workflow. Configure the workflow to send an email notification to the SNS topic to inform the data engineer about the incomplete datasets.
C. Create data quality checks on the source datasets that the daily reports use. Create data quality actions by using AWS Glue workflows to confirm the completeness and consistency of the datasets. Configure the data quality actions to create an event in Amazon EventBridge if a dataset is incomplete. Configure EventBridge to send the event to the Amazon SNS topic to inform the data engineer about the incomplete datasets.
D. Create AWS Lambda functions that run data quality queries on the column data type and the presence of null values. Orchestrate the ETL pipeline by using an AWS Step Functions workflow that runs the Lambda functions. Configure the Step Functions workflow to send an email notification to the SNS topic to inform the data engineer about the incomplete datasets.
C
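A boto3 sketch of the EventBridge-to-SNS piece of answer C. The event source and detail-type strings are how I recall the Glue Data Quality events being published, so treat them as assumptions; the rule name and topic ARN are placeholders.
```python
# Sketch (boto3): route Glue Data Quality result events to the existing SNS
# topic. The event pattern values are assumed, and the SNS topic's access
# policy must also allow events.amazonaws.com to publish.
import json
import boto3

events = boto3.client("events")

events.put_rule(
    Name="glue-dq-incomplete-data",
    EventPattern=json.dumps({
        "source": ["aws.glue-dataquality"],  # assumed event source
        "detail-type": ["Data Quality Evaluation Results Available"],  # assumed detail type
    }),
)

events.put_targets(
    Rule="glue-dq-incomplete-data",
    Targets=[{"Id": "sns", "Arn": "arn:aws:sns:us-east-1:111122223333:data-alerts"}],
)
```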