Data Analytics Flashcards

1
Q

Abbr for ETL

A

Extract Transform Load

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

What is AWS alternative to Apache Kafka?

A

AWS Kinesis

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

How is Kinesis Streams divided

A

Shards

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

Kinesis Streams retention period

A
  • default 24H

- up to 365 Days

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

Can multiple applications consume the same stream in Kinesis?

A

YES

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

How the billing looks like in Kinesis Data Streams?

A

per shard provisioned

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

Size of Data Blob in Kinesis Streams

A

up to 1MB

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

Kinesis Producer max write

A

1MB/s or 1000 messages/s PER SHARD

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

Message received if producer go above provisioned throughput?

A

ProvisionedThroughputException

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

Two types of consumers in Kinesis Streams

A
  • Consumer Classic

- Consumer Enhanced Fan-Out

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

What is Kinesis Agent?

A

Kinesis Agent is a stand-alone Java software application that offers an easy way to collect and send data to Kinesis Data Streams

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

What is hot shard in Kinesis Streams?

A

Some shards in your Kinesis data stream might receive more records than others. This can lead to throttling errors in the stream, resulting in overworked shards, also known as hot shards.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

Potential solutions to ProvisionedThroughputExceeded

A
  • retries with backoff
  • increase shards (scaling)
  • ensure the partition key is optimal
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

What is Kinesis Producer Library?

A

Easy to use and highly configurable C++Java library that helps you write to a Kinesis data stream

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

Two types of API in KPL?

A
  • Synchronous

- Asynchronous

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

What is the purpose of Batching in Kinesis Producer Library?

A

decrease throughput and decrease cost

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
17
Q

Kinesis Producer Library two types of batching

A
  • Aggregation

- Collection

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
18
Q

What might be the effect of increasing RecordMaxBufferTime in KPL?

A
  • additional processing delay

- higher packing efficiencies and better performance

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
19
Q

Can KPL be used if the application cannot tolerate additional delay?

A

NO. SDK should be used

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
20
Q

Shard Kinesis Consumer max throughput?

A

2MB/s

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
21
Q

Shard Kinesis Producer max throughput?

A

1MB/s

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
22
Q

When to use Enhanced Kinesis Fan Out Consumers?

A
  • Multiple Consumer applications for the same stream

- Low latency requirement (70ms)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
23
Q

When to use Standard Kinesis Consumers?

A
  • low number of consuming applications (1,2,3) for the same stream
  • Can tolerate 200ms latency
  • minimize cost
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
24
Q

Default limit of consumers when using Enhanced Fan Out Kinesis Consumer

A

5

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
25
Q

Can you perform many resharding operations at the same time

A

No, only one operations is allowed at a time and it takes a few seconds

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
26
Q

AWS Kinesis Firehose destinations

A
  • S3
  • Redshift
  • Opensearch
  • HTTP Endpoint
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
27
Q

What’s the minimum latency for non-full batch in Kinesis Firehose?

A

60s

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
28
Q

Is Kinesis Firehose auto-scaled?

A

YES

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
29
Q

Embedded data transformation format in Kinesis Firehose

A

JSON -> Parquet

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
30
Q

Is compression supported by Kinesis firehose

A

Yes, when the target is S3

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
31
Q

Kinesis Firehose payment schema

A

Pay only for the amount of data going through Firehose

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
32
Q

Buffer flushing logic for the Kinesis Firehose

A

based on time and size rules

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
33
Q

Which place can CloudWatch Logs can be streamed to?

A
  • Kinesis Data Streams
  • Kinesis Data Firehose
  • AWS Lambda
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
34
Q

Kinesis Firehose minimum buffer interval

A

60 seconds

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
35
Q

Kinesis Firehose maximum buffer interval

A

900 seconds

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
36
Q

Maximum write capacity on On-demand Kinesis Data Stream

A

200MB/s and 200.000 record/s

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
37
Q

Maximum read capacity on On-demand Kinesis Data Stream

A

400MB/s per consumer (extra capacity in Enhanced fan out)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
38
Q

Command to restart kinesis agent on linux

A

sudo service aws-kinesis-agent restart

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
39
Q

Processing capacity of SQS

A

1 message/s to 10.000 messages/s

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
40
Q

How many messages can be in SQS queue?

A

No limit

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
41
Q

What’s the latency of SQS

A

<10ms on publish and receive

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
42
Q

max message size in SQS

A

256KB

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
43
Q

How to send messages over 256KB in SQS

A

use SQS Extended Client (Java Library)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
44
Q

What can be a content of SQS message

A

XML, JSON, Unformatted text

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
45
Q

Max size of Batch request in SQS

A

10 messages - max 256KB

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
46
Q

Max transactions per second in Standard SQS queue

A

unlimited

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
47
Q

Max transactions per second in SQS FIF queue

A

3000 messages/s

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
48
Q

Data retention period in SQS

A

1 minute to 14 days

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
49
Q

SQS model pricing

A
  • pay per API Request

- pay per network usage

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
50
Q

What’s encrypted in SQS when using SSE?

A

body only, metadata is NOT encrypted

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
51
Q

How many times can data be consumed on Kinesis Data Stream?

A

Many times

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
52
Q

When records are deleted from SQS?

A

After consumption

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
53
Q

When data is deleted from Kinesis Data Stream?

A

After the retention period

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
54
Q

Which AWS service allow replay of data?

A

Kinesis Data Streams

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
55
Q

What is IoT Rules Engine?

A

It evaluates inbound messages published into AWS IoT, transforms and delivers them to another thing or a cloud based on business rules you define

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
56
Q

What is IoT device shadow?

A

A Device Shadow is a persistent, virtual representation of a device that is managed by a thing resource you create in the AWS IoT registry

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
57
Q

The purpose of Device Gateway in IoT

A

Entry point for IoT devices connecting to AWS

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
58
Q

Protocols supported by IoT Device Gateway

A

MQTT, WebSockets, and HTTP 1.1

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
59
Q

What is IoT Message Broker?

A

The Message Broker is a high throughput pub/sub message broker that securely transmits messages to and from all of your IoT devices and applications with low latency.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
60
Q

How are messages published in IoT Message Broker?

A

messages are published into topics

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
61
Q

Which devices will receive Message Broker message in IoT?

A

all clients connected to the topic

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
62
Q

Purpose of IoT Thing Registry

A

Organizes the resources associated with each device in the AWS Cloud

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
63
Q

3 authentication methods for IoT

A
  • Create X.509 certificate and load them securely into the Things
  • AWS SigV4
  • Custom tokens with Custom authorizers
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
64
Q

How device shadow is represented in IoT?

A

JSON document

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
65
Q

What is IoT Greengrass?

A

AWS IoT Greengrass provides cloud-based management of application logic that runs on devices

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
66
Q

What is DMS?

A

Database Migration Service - quickly and securely migrate databases to AWS, resilient, self healing

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
67
Q

When to use SCT (Schema Conversion Tool) in database migration?

A

When migrating to different DB engine

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
68
Q

Snowball Edge Storage Optimized capacity

A

80 TB

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
69
Q

Snowball Edge Compute Optimized capacity

A

42 TB

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
70
Q

AWS Snowcone capacity

A

8 TB

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
71
Q

Which Snow service has DataSync agent pre-installed

A

Snowcone only

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
72
Q

What is AWS OpsHub?

A

A software you install on your computer/laptop to manage your Snow Family Device

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
73
Q

MSK encryption in-flight between brokers

A

TLS

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
74
Q

MSK encryption in-flight between clients

A

TLS

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
75
Q

MSK EBS encryption

A

KMS

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
76
Q

Three MKS Cloud Watch metric levels

A
  • basic
  • enhanced
  • topic-level
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
77
Q

Message size for MSK

A

1MB default, up to 10MB

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
78
Q

Kafka Topic unit

A

Topics with Partition

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
79
Q

Kafka scaling limitation

A

Can only add partition to a topic

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
80
Q

In-flight encryption options for MSK

A

PLAINTEXT or TLS In-flight

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
81
Q

what is multi-part upload in S3

A

feature to upload files larger than 5GB

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
82
Q

Max object size in S3

A

5TB

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
83
Q

Three Glacier retrieval options

A
  • expedited (1-5 mins)
  • standard (3-5 hours)
  • bulk (5-12 hours)
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
84
Q

Amazon Glacier Deep Archive retrieval options

A
  • standard (12h)

- bulk (48h)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
85
Q

Minimum storage duration for Glacier

A

90 days

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
86
Q

Glacier Deep Archive minimum storage duration

A

180 days

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
87
Q

Two types of replication in S3

A
  • CRR - Cross Region Replication

- SRR - Same Region Replication

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
88
Q

What is S3 Byte-Range Fetch

A

You can use concurrent connections to Amazon S3 to fetch different byte ranges from within the same object. This helps you achieve higher aggregate throughput versus a single whole-object request

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
89
Q

Which S3 feature can be used to retrieve partial data of file?

A

S3 Byte-Range Fetch

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
90
Q

DynamoDB maxim size of an item?

A

400KB

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
91
Q

What Write Capacity Unit represent in DynamoDB?

A

one write/s for an item up to 1KB in size

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
92
Q

The logic behind Eventually Consistent Read

A

If we read just after a write, it’s possible we’ll get unexpected response because of replication

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
93
Q

The logic behind Strongly Consistent Read

A

If we read just after a write, we will get the correct data

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
94
Q

What Read Capacity Unit represent in DynamoDB?

A

one strongly consistent read per second or

two eventually consistent reads per second, for an item up to 4KB in size

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
95
Q

Max- partition RCU/WCU in DynamoDB

A

3000RCU/1000WCU

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
96
Q

Max DynamoDB partition size

A

10GB

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
97
Q

Three ways of writing data in DynamoDB

A
  • PutItem
  • UpdateItem
  • Conditional writes
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
98
Q

Two ways of deleting data in DynamoDB

A
  • DeleteItem

- DeleteTable

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
99
Q

Max BatchWriteItem capacity in DynamoDB

A
  • up to 25 PutItem/DeleteItem in one call
  • up to 16Mb of data
  • up to 400KB of data per item
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
100
Q

Default read in DynamoDB

A

Eventually consistent

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
101
Q

Max capacity of BatchGetItem in DYnamoDB

A
  • up to 100 items

- up to 16MB of data

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
102
Q

On which fields Query operate in DynamoDB?

A

Partition key and Sort key only

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
103
Q

DynamoDB index that must be defined at the table creation time

A

LSI (Local Secondary Index)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
104
Q

Which DynamoDB index can be modified?

A

GSI (Global Secondary Index) only

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
105
Q

For which index RCU/WCI must be defined?

A

GSI

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
106
Q

What is DynamoDB DAX?

A

DynamoDB Accelerator - seamless cache, no application re-write

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
107
Q

Default DynamoDB DAX cache TTL?

A

5 minutes

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
108
Q

Max number of nodes in DynamoDB DAX cluster?

A

10 nodes

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
109
Q

Retention time of DynamoDB Streams

A

up to 24H

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
110
Q

What is DynamoDB Streams?

A

Captures a time-ordered sequence of item-level modifications in a DynamoDB table and durably stores the information for up to 24 hours

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
111
Q

How to access DynamoDB without Internet?

A

VPC Endpoints

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
112
Q

What are DynamoDB Global Tables?

A

multi-region, fully replicated, high performance tables

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
113
Q

How can you migrate DynamoDB to RDS?

A

Use DMS (Database Migration Service)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
114
Q

How to store large objects in DynamoDB?

A

Sore them in S3 and reference them in DynamoDB

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
115
Q

Will Redis cache survive reboot?

A

Yes - by default

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
116
Q

You would like to react in real-time to users de-activating their account and send them an email to try to bring them back. The best way of doing it is to…

A

Integrate Lambda with DynamoDB

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
117
Q

You would like to have DynamoDB automatically delete old data for you. What should you use?

A

TTL

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
118
Q

You are looking to improve the performance of your RDS database by caching some of the most common rows and queries. Which technology do you recommend?

A

ElastiCache

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
119
Q

How Glue Crawler extract partitions?

A

Extraction is based on how your S3 data is organized

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
120
Q

What are the targets of Glue ETL?

A
  • S3
  • JDBC (RDS, Redshift)
  • Glue Data Catalog
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
121
Q

Which platform is Glue ETL running on?

A

Serverless Spark platform

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
122
Q

Three ways of running Glue jobs

A
  • time based schedules
  • job bookmarks
  • CloudWatch Events
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
123
Q

How Glue job which prevent reprocessing of old data?

A

Job Bookmark

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
124
Q

Glue cost model

A

Billing by the minute for Crawler and ETL jobs
First million objects stored and accesses are free for the Glue Data Catalog
Development endpoint for developing ETL code charged by the minute

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
125
Q

Does Glue ETL support streaming ETL?

A

yes, runs on Apache Spark Structured Streaming (serverless)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
126
Q

What is the simplest way to make sure the metadata under Glue Data Catalog is always up-to-date and in-sync with the underlying data without your intervention each time?

A

Schedule crawlers to run periodically

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
127
Q

Which programming languages can be used to write ETL code for AWS Glue?

A

Python and Scala

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
128
Q

Can you run existing ETL jobs with AWS Glue?

A

YES
Upload the code to Amazon S3 and create one or more jobs that use that code. You can reuse the same code across multiple jobs by pointing them to the same code location on Amazon S3.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
129
Q

How can you be notified of the execution of AWS Glue jobs?

A

CloudWatch + SNS

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
130
Q

What is AWS Glue Studio

A

Visual interface for ETL workflows

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
131
Q

What is AWS Glue DataBrew?

A

A visual data preparation tool

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
132
Q

Three types of nodes in EMR

A
  • master
  • core
  • task
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
133
Q

What HDFS stand for?

A

Hadoop Distributed Files System

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
134
Q

How are files stored in HDFS?

A

files are stored as blocks (128MB default size)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
135
Q

What is EMRFS in AWS?

A

The EMR File System (EMRFS) is an implementation of HDFS that all Amazon EMR clusters use for reading and writing regular files from Amazon EMR directly to Amazon S3

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
136
Q

What happen when you manually detach an EBS volume in EMR?

A

EMR treats that as a failure and replaces it.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
137
Q

What local storage is suitable for in EMR?

A

buffers, caches, etc.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
138
Q

EMR charging schema

A

per hour plus EC2 charges

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
139
Q

What happen when the core node fail in EMR?

A

provisions a new node automatically

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
140
Q

How to increase processing capacity but not HDFS capacity in EMR?

A

Add more task nodes

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
141
Q

How to increase both processing and HDFS capacity in EMR?

A

Resize or add core nodes

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
142
Q

Scale-Up strategy in EMR

A

first add core nodes, then task nodes, up to max units specified

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
143
Q

Scale-Down strategy in EMR

A
  • first removes task nodes, then core nodes, no further than minim constraints
  • spot nodes always removed before on-demand instances
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
144
Q

What YARN stand for?

A

Yet Another Resource Navigator

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
145
Q

What is Apache Spark?

A

Open-source distributed processing framework for big data

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
146
Q

Which languages are supported by Apache Spark?

A

Java, Scala, Python and R

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
147
Q

What is Apache Tez?

A

Apache Tez is an open-source framework for big data processing based on MapReduce technology

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
148
Q

What is Apache Pig?

A

Pig is a high level scripting language that is used with Apache Hadoop. Pig enables data workers to write complex data transformations without knowing Java.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
149
Q

What is HBase?

A

non-relational, petabyte-scale database based on Google’s BigTable, on top of HDFS

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
150
Q

What Presto is used for?

A
  • it connect to many different “big data” databases and data stores at once, and query across them
  • interactive queries at petabyte scale
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
151
Q

What’s under the hood of AWS Athena?

A

Presto

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
152
Q

What Apache Zeppelin is used for?

A

Apache Zeppelin is a new and incubating multi-purposed web-based notebook which brings data ingestion, data exploration, visualization, sharing and collaboration features to Hadoop and Spark

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
153
Q

What is Hue?

A

Graphical front-end for applications on EMR cluster

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
154
Q

What’s the usage of Splunk?

A

operational tool - can be used to visualize EMR and S3 data using EMR Hadoop Cluster

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
155
Q

What’s the usage of Flume?

A

Another way to stream data into cluster. Originally made to handle log aggregation.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
156
Q

What is MXNet?

A

Like tensorflow, a library for building and accelerating neural networks.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
157
Q

What is S3DistCP?

A

Tool for copying large amounts of data (s3 HDFS)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
158
Q

Which Amazon EMR tool is used for querying multiple data stores at once?

A

Presto

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
159
Q

When you delete your EMR cluster, what happens to the EBS volumes?

A

EMR will delete the volumes once the EMR cluster is terminated

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
160
Q

What’s under the hood of Kinesis Data Analytics?

A

Apache Flink

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
161
Q

Is Kinesis Analytics serverless?

A

YES

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
162
Q

Is Kinesis Analytics scaled automatically?

A

YES

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
163
Q

What is the usage of RANDOM_CUT_FOREST in Kinesis Analytics?

A

SQL function used for anomaly detection on numeric columns in a stream

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
164
Q

As recommended by AWS, you are going to ensure you have dedicated master nodes for high performance. As a user, what can you configure for the master nodes?

A

The count and instance types of master nodes

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
165
Q

Which are supported ways to import data into your Amazon ES domain?

A
  • Kinesis
  • Logstash
  • Elasticsearch’s API’s
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
166
Q

What can you do to prevent data loss due to nodes within your ES domain failing?

A

Elasticsearch snapshots

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
167
Q

Athena cost model

A

Pay-as-you-go

  • $5 per TB scanned
  • Successful or cancelled queries count
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
168
Q

What OLAP stand for?

A

On-Line Analytical Processing

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
169
Q

Is redshift designed for OLAP or OLTP?

A

OLAP

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
170
Q

Max number of Compute Nodes in Redshift

A

128 Nodes

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
171
Q

What VACUUM command is used for in Redshift?

A

Recovers space from deleted rows

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
172
Q

What is Redshift Elastic resize?

A
  • quickly add or remove nodes of the same type
  • cluster is down for a few minutes
  • tries to keep connections open across the downtime
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
173
Q

What is Redshift Classic resize?

A
  • change node type and/or number of nodes

- cluster is read-only for hours to days

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
174
Q

Max number of read replicas in AWS Aurora

A

15 read replicas

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
175
Q

Max number of storage in Amazon Aurora

A

Up to 64TB per database instance

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
176
Q

What is GraphQL used for?

A

GraphQL is designed to make APIs fast, flexible, and developer-friendly. It can even be deployed within an integrated development environment (IDE) known as GraphiQL. As an alternative to REST.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
177
Q

What is Amazon Kendra used for?

A

Amazon Kendra is a highly accurate intelligent search service that enables your users to search unstructured data using natural language. It returns specific answers to questions, giving users an experience that’s close to interacting with a human expert.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
178
Q

You have an S3 bucket that your entire organization can read. For security reasons you would like the data sits encrypted there and you would like to define a strategy in which users can only read the data which they are allowed to decrypt, which may be a different partial set of objects within the bucket for each user. How can you achieve that?

A

Use SSE-KMS to encrypt the files

SSE-KMS will allow you to use different KMS keys to encrypt the objects, and then you can grant users access to specific sets of KMS keys to give them access to the objects in S3 they should be able to decrypt

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
179
Q

An application processes sensor data in real-time by publishing it to Kinesis Data Streams, which in turn sends data to an AWS Lambda function that processes it and feeds it to DynamoDB. During peak usage periods, it’s been observed that some data is lost. You’ve determined that you have sufficient capacity allocated for Kinesis shards and DynamoDB reads and writes. What might be TWO possible solutions to the problem?

A
  • Increase your Lambda function’s timeout value

- Process data in smaller batches to avoid hitting Lambda’s timeout

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
180
Q

As part of your application development, you would like your users to be able to get Row Level Security. The application is to be deployed on web servers and the users of the application should be able to use their amazon.com accounts. What do you recommend for the database and security?

A

Enable Web Identity federation. Use DynamoDB and reference ${www.amazon.com:user_id} in the attached IAM policy

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
181
Q

What SSE security mechanisms are supported by EMR?

A

SSE-S3 – Amazon S3 manages keys for you.

SSE-KMS – You use an AWS KMS key to set up with policies suitable for Amazon EMR

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
182
Q

Is SSE-CMK available for use in EMR?

A

NO

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
183
Q

Two EMR EBS encryption options

A
  • EBS encryption - available only when you specify AWS Key Management Service as your key provider.
  • LUKS encryption – If you choose to use LUKS encryption for Amazon EBS volumes, the LUKS encryption applies only to attached storage volumes, not to the root device volume.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
184
Q

You are working for an e-commerce website and that website uses an on-premise PostgreSQL database as its main OLTP engine. You would like to perform analytical queries on it, but the Solutions Architect recommended not doing it off of the main database. What do you recommend?

A

Use DMS to replicate the database to RDS

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
185
Q

You are processing data using a long running EMR cluster and you like to ensure that you can recover data in case an entire availability zone goes down, as well as process the data locally for the various Hive jobs you plan on running. What do you recommend to do this at a minimal cost?

A

Store the data in S3 and keep a warm copy in HDFS

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
186
Q

A financial services company wishes to back up its encrypted data warehouse in Amazon Redshift daily to a different region. What is the simplest solution that preserves encryption in transit and at rest?

A

Configure Redshift to automatically copy snapshots to another region, using an AWS KMS customer master key in the destination region.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
187
Q

Does Redshift have cross region snapshots?

A

YES

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
188
Q

A company wishes to copy 500GB of data from their Amazon Redshift cluster into an Amazon RDS PostgreSQL database, in order to have both columnar and row-based data stores available. The Redshift cluster will continue to receive large amounts of new data every day that must be kept in sync with the RDS database. What strategy would be most efficient?

A

Copy data using the dblink function into PostgreSQL tables

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
189
Q

What is Ganglia?

A

Ganglia is the operational dashboard provided with EMR

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
190
Q

Your company has data from a variety of sources, including Microsoft Excel spreadsheets stored in S3, log data stored in a S3 data lake, and structured data stored in Redshift. Which is the simplest solution for providing interactive dashboards that span this data?

A

Use Amazon Quicksight directly on top of the Excel, S3, and Redshift data.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
191
Q

As part of an effort to limit cost and maintain under control the size of your DynamoDB table, your AWS account manager would like to ensure old data is deleted in DynamoDB after 1 month. How can you do so with as little maintenance as possible and without impacting the current read and write operations?

A

Enable DynamoDB TTL and add a TTL column

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
192
Q

You are dealing with PII datasets and would like to leverage Kinesis Data Streams for your pub-sub solution. Regulators imposed the constraint that the data must be encrypted end-to-end using an internal key management system. What do you recommend?

A

Implement a custom encryption code in the Kinesis Producer Library (KPL)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
193
Q

A manager wishes to make a case for hiring more people in her department, by showing that the number of incoming tasks for her department have grown at a faster rate than other departments over the past year. Which type of graph in Amazon Quicksight would be best suited to illustrate this data?

A

Area line chart

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
194
Q

You are looking to reduce the latency down from your Big Data processing job that operate in Singapore but source data from Virginia. The Big Data job must always operate against the latest version of the data. What do you recommend?

A

Enable S3 Cross Region Replication

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
195
Q

You have an ETL process that collects data from different sources and 3rd party providers and would like to ensure that data is loaded into Redshift once all the parts from all the providers related to one specific jobs have been gathered, which is the process that can happen over the course of one hour to one day. What the least costly way of doing that?

A

Create an AWS Lambda that responds to S3 upload events and will check if all the parts are there before uploading to Redshift

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
196
Q

A financial services company has a large, secure data lake stored in Amazon S3. They wish to analyze this data using a variety of tools, including Apache Hive, Amazon Athena, Amazon Redshift, and Amazon QuickSight.

How should they connect their data and analysis tools in a way that minimizes costs and development work?

A

Run an AWS Glue Crawler on the data lake to populate a AWS Glue Data Catalog. Share the glue data catalog as a metadata repository between Athena, Redshift, Hive, and QuickSight

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
197
Q

You are working for a data warehouse company that uses Amazon RedShift cluster. For security reasons, it is required that VPC flow logs should be analyzed by Athena to monitor all COPY and UNLOAD traffic of the cluster that moves in and out of the VPC. Which of the following helps you in this regard ?

A

Use Enhanced VPC Routing

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
198
Q

A hospital monitoring sensor data from heart monitors wishes to raise immediate alarms if an anomaly in any individual’s heart rate is detected.

Which architecture meets these requirements in a scalable manner?

A

Publish sensor data into a Kinesis data stream, and create a Kinesis Data Analytics application using RANDOM_CUT_FOREST to detect anomalies. When an anomaly is detected, use a Lambda function to route an alarm to Amazon SNS

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
199
Q

A produce export company has multi-dimensional data for all of its shipments, such as the date, price, category, and destination of every shipment. A data analyst wishes to interactively explore this data, applying statistical functions to different rows and columns and sorting them in different ways.

Which QuickSight visualization would be best suited for this?

A

Pivot table

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
200
Q

You are an online retailer and your website is a storefront for millions of products. You have recently run a big sale on one specific electronic and you have encountered Provisioned Throughput Exceptions. You would like to ensure you can properly survive an upcoming sale that will be three times as big. What do you recommend?

A

DynamoDB DAX

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
201
Q

Your daily Spark jobs runs against files created by a Kinesis Firehose pipeline in S3. Due to a low throughput, you observe that each of the many files created by Kinesis Firehose is about 100KB. You would like to optimise your Spark job as best as possible to query the data efficiently. What do you recommend?

A

Consolidate files on a daily basis using DataPipeline

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
202
Q

A data scientist wishes to develop a machine learning model to predict stock prices using Python in a Jupyter Notebook, and use a cluster on AWS to train and tune this model, and to vend predictions from it at large scale.

Which system allows you to do this?

A

Amazon SageMaker

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
203
Q

You are tasked with using Hive on Elastic MapReduce to analyze data that is currently stored in a large relational database.

Which approach could meet this requirement?

A

Use Apache Sqoop on the EMR cluster to copy the data into HDFS

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
204
Q

What is Sqoop?

A

Sqoop is an open-source system for transferring data between Hadoop and relational databases.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
205
Q

Your exports application hosted on AWS need to process game results immediately in real time and later perform analytics on the same game results in the order they came at the end of business hours. Which of the AWS service will be the best fit for your needs?

A

Kinesis Data Streams

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
206
Q

You wish to use Amazon Redshift Spectrum to analyze data in an Amazon S3 bucket that is in a different account than Redshift Spectrum.

How would you authorize access between Spectrum and S3 across accounts?

A

Add a policy to the S3 bucket allowing S3 GET and LIST operations for an IAM role for Spectrum on the Redshift account

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
207
Q

You need to ETL streaming data from web server logs as it is streamed in, for analysis in Athena. Upon talking to the stakeholders, you’ve determined that the ETL does not strictly need to happen in real-time, but transforming the data within a minute is desirable.

What is a viable solution to this requirement?

A

Perform any initial ETL you can using Amazon Kinesis, store the data in S3, and trigger a Glue ETL job to complete the transformations needed.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
208
Q

An organization has a large body of web server logs stored on Amazon S3, and wishes to quickly analyze their data using Amazon Athena. Most queries are operational in nature, and are limited to a single day’s logs.

How should the log data be prepared to provide the most performant queries in Athena, and to minimize costs?

A

Convert the data into Apache Parquet format, compressed with Snappy, stored in a directory structure of year=XXXX/month=XX/day=XX/

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
209
Q

You are creating an EMR cluster that will process the data in several MapReduce steps. Currently you are working against the data in S3 using EMRFS, but the network costs are extremely high as the processes write back temporary data to S3 before reading it. You are tasked with optimizing the process and bringing the cost down, what should you do?

A

Add a preliminary step that will use a S3DistCp command

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
210
Q

You are required to maintain a real-time replica of your Amazon Redshift data warehouse across multiple availability zones.

What is one approach toward accomplishing this?

A

Spin up separate redshift clusters in multiple availability zones, using Amazon Kinesis to simultaneously write data into each cluster. Use Route 53 to direct your analytics tools to the nearest cluster when querying your data.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
211
Q

You work for a gaming company and each game’s data is stored in DynamoDB tables. In order to provide a game search functionality to your users, you need to move that data over to ElasticSearch. How can you achieve it efficiently and as close to real time as possible?

A

Enable DynamoDB Streams and write a Lambda function

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
212
Q

A large news website needs to produce personalized recommendations for articles to its readers, by training a machine learning model on a daily basis using historical click data. The influx of this data is fairly constant, except during major elections when traffic to the site spikes considerably.

Which system would provide the most cost-effective and reliable solution?

A

Publish click data into Amazon S3 using Kinesis Firehose, and process the data nightly using Apache Spark and MLLib using spot instances in an EMR cluster. Publish the model’s results to DynamoDB for producing recommendations in real-time.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
213
Q

You are working for a bank and your company is regularly uploading 100 MB files to Amazon S3 and analyzed by Athena. It has come to light that recently some of the uploads have been corrupted and made a critical big data job fails. Your company would like a stronger guarantee that uploads are done successfully and that the files have the same content on premise and on S3. It looks to do so at minimal cost. What do you recommend?

A

Use the S3 ETag and compare to the local MD5 hash

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
214
Q

Three modules of SageMaker

A
  • Build
  • Train
  • Deploy
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
215
Q

What limit, if any, is there to the size of your training dataset in Amazon Machine Learning by default?

A

100GB

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
216
Q

Is there a limit to the size of the dataset that you can use for training models with Amazon SageMaker? If so, what is the limit?

A

No fixed limit

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
217
Q

Does Kinesis Stream preserve client ordering?

A

YES

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
218
Q

Can Kinesis Streams data be consumed in parallel?

A

YES

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
219
Q

EMR deployment options

A
  • EC2
  • Amazon EKS
  • AWS Outposts
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
220
Q

Description of Task node in EMR?

A

A mode with software components that only runs tasks and DOES NOT store data in HDFS.
Task node is optional

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
221
Q

What is EMRFS file system in EMR?

A

Implementation of the Hadoop file system used for reading and writing regular files from Amazon EMR directly to Amazon S3.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
222
Q

What is HDFS in EMR?

A

Instance store and Amazon Elastic Block Store (Amazon EBS) volume storage is used for HDFS data and for buffers, caches, scratch data, and other temporary content that some applications may “spill” to the local file system

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
223
Q

What is Apache Presto?

A

Presto, also known as PrestoDB, is an open source, distributed SQL query engine that enables fast analytic queries against data of any size.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
224
Q

What is EMR notebook?

A

Amazon EMR notebooks provide a managed analysis environment based on open-source Jupyter notebooks so that data scientists, analysts, and developers can prepare and visualize data, collaborate with peers, build applications, and perform interactive analysis.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
225
Q

How Redshift traffic is routed when Enhanced VPC routing is not enabled?

A

Amazon Redshift routes traffic through the internet, including traffic to other services within the AWS network.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
226
Q

What is Apache Airflow?

A

Apache Airflow is an open-source task scheduler that can be installed on EC2 instances or bootstrapped on primary nodes

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
227
Q

What is Amazon MWAA?

A

The Amazon MWAA is a managed service that reduces the burden of provisioning and ongoing maintenance of Airflow and offers seamless integration with CloudWatch for system metrics and logs. It offers a rich UI and troubleshooting tools and can be used to orchestrate jobs across hybrid environments.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
228
Q

Two types of cluster types used by Amazon EMR

A
  • long-running

- transient

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
229
Q

Use cases for long-running EMR cluster

A
  • Spark Streaming or Flink

- online transaction processing (OLTP) workload like Apache HBase

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
230
Q

What is EMR transient cluster?

A

Cluster to be automatically shut down, it shuts down after all the steps complete.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
231
Q

What is Hive Metastore in AWS?

A

Apache Hive is an open-source data warehouse and analytics package that runs on top of an Apache Hadoop cluster. A Hive metastore contains a description of the table and the underlying data making up its foundation, including the partition names and data types.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
232
Q

Where is Hive Metastore information recorded by default?

A

In a MySQL database on the master node’s file system.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
233
Q

Patterns to deploy a Hive Metastore on Amazon EMR:

A
  • AWS Glue Data Catalog

- external data store such as Amazon Relational Database Service (Amazon RDS) or Amazon Aurora

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
234
Q

What is Apache Ranger

A

Apache Ranger is an open-source project that provides authorization and audit capabilities for Hadoop and related big data applications like Apache Hive, Apache HBase, and Apache Kafka.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
235
Q

What is S3DistCp in Amazon EMR?

A

Primary data transfer utility used in Amazon EMR and is an extension of the open-source Apache DistCp and is optimized to work with Amazon S3.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
236
Q

Can extra EBS volumes be added to EMR cluster?

A

YES

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
237
Q

Is AWS Glue using servers?

A

NO It’s a serverless service

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
238
Q

What is AWS Redshift spectrum?

A

Redshift Spectrum is a feature of Amazon Redshift that allows you to query data stored on Amazon S3 directly and supports nested data types.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
239
Q

What is AWS QuickSight?

A

Amazon QuickSight allows everyone in your organization to understand your data by asking questions in natural language, exploring through interactive dashboards, or automatically looking for patterns and outliers powered by machine learning.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
240
Q

What is AWS Glue Data Catalog?

A

Persistent metadata store, you can use this managed service to store, annotate, and share metadata in the AWS Cloud in the same way you would in an Apache Hive metastore.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
241
Q

What is AWS Glue DataBrew?

A

Visual data preparation tool, you can clean, enrich, format, and normalize your datasets with over 250 built-in transformations. You can create a “recipe” for a dataset using the transformations of your choice, and then reuse that recipe repeatedly as your business continues to collect new data.

242
Q

What is AWS Glue Streaming ETL?

A

Consume real-time data from either an Amazon Kinesis data stream or an Amazon Managed Streaming for Apache Kafka stream.

243
Q

What is AWS Glue Crawler?

A

A program that connects to a data store (source or target), progresses through a prioritized list of classifiers to determine the schema for your data, and then creates metadata tables in the Data Catalog.

244
Q

What are AWS Glue Data store, data source, data target

A

A data store is a repository for persistently storing your data. Examples include Amazon S3 buckets and relational databases. A data source is a data store that is used as input to a process or transform. A data target is a data store that a process or transform writes to.

245
Q

What is AWS Glue Development endpoint?

A

An environment that you can use to develop and test AWS Glue ETL scripts.

246
Q

What is AWS Glue Job

A

The business logic that is required to perform ETL work. It is composed of a transformation script, data sources, and data targets. Job runs are initiated by triggers that can be scheduled or activated by events.

247
Q

What is Parquet format?

A

Parquet format refers to a type of file format that structures data in a columnar format rather than row-based format like a CSV or Microsoft Excel file. Parquet format is optimal for analytical engines like Athena or Redshift Spectrum to query over.

248
Q

What is AWS Glue table?

A

The metadata definition that represents your data. Whether your data is in an Amazon S3 file, an Amazon Relational Database Service (Amazon RDS) table, or another set of data, a table defines the schema of your data. A table in the Data Catalog consists of the names of columns, data type definitions, partition information, and other metadata about a base dataset.

249
Q

The roles in AWS Quicksight

A
  • Author
  • Admin
  • Reader
250
Q

Two authentication options of AWS Quicksight

A
  • Role Based Federation (SSO)

- Active Directory

251
Q

Entry point of AWS IoT Analytics

A

Channel

252
Q

What Velocity mean in Data Analytics?

A

The speed of data entering a solution.

253
Q

What Variety mean in Data Analytics?

A

The number of different sources - and the types of sources - the solution will use.

254
Q

What Variety mean in Data Analytics?

A

The number of different sources - and the types of sources - that the solution will use

255
Q

What Veracity mean in Data Analytics?

A

The degree to which data is accurate, precise, and trusted. It is contingent on the integrity and trustworthiness of the data.

256
Q

What is a data mart?

A

A subset of data warehouse. Data mart focus on one subject of functional area.

257
Q

What is Hadoop YARN?

A

Resource management framework responsible for scheduling and executing processing jobs.

258
Q

What is Hadoop MapReduce?

A

YARN-based system that allows for parallel processing of large data sets on the cluster.

259
Q

What is Curation in BigData?

A

The action or process of selecting, organizing, and looking after the items in a collection

260
Q

What is Data integrity in BigData?

A

the maintenance and assurance of the accuracy and consistency of data over its entire lifecycle

261
Q

What is Data veracity in Big Data?

A

The degree to which data is accurate, precise, and trusted.

262
Q

What is Data cleansing in Big Data?

A

the process of detecting and correcting corruptions within data

263
Q

What is Referential integrity in Big Data?

A

process of ensuring that the constraints of table relationships are enforced

264
Q

What is Domain integrity in Big Data?

A

process of ensuring that the data being entered into a field matches the data type defined for that field

265
Q

What is Entity integrity in Big Data?

A

process of ensuring that the values stored within a field match the constraints defined for that field

266
Q

At which stage of the data lifecycle will consumers test the veracity of data?

A

Share

267
Q

What ACID stand for?

A
  • Atomicity
  • Consistency
  • Isolation
  • Durability
268
Q

Meaning of Atomicity

A

When executing a transaction in a database, atomicity ensures that your transactions either completely succeed or completely fail.

269
Q

Meaning of Consistency in DB

A

Consistency ensures that all transactions provide valid data to the database. This data must adhere to all defined rules and constraints

270
Q

Meaning of Isolation in DB

A

Isolation ensures that one transaction cannot interfere with another concurrent transaction.

271
Q

Meaning of Durability

A

Data durability is all about making sure your changes actually stick. Once a transaction has successfully completed, durability ensures that the result of the transaction is permanent even in the event of a system failure.

272
Q

What BASE stand for?

A

Basically Available Soft state Eventually consistent

273
Q

What BASE is used for?

A

Method for maintaining consistency and integrity in a structured or semistructured database.

274
Q

Meaning of Basically Available in BASE

A

BA allows for one instance to receive a change request and make that change available immediately.

275
Q

Meaning of Soft State in BASE

A

In a BASE system, there are allowances for partial consistency across distributed instances. For this reason, BASE systems are considered to be in a soft state, also known as a changeable state.

276
Q

Meaning of Eventual Consistency in BASE

A

The data will be eventually consistent. In other words, a change will eventually be made to every copy. However, the data will be available in whatever state it is during propagation of the change.

277
Q

What is Amazon DynamoDB transactions?

A

Feature that implements ACID compliance across one or more tables within a single AWS account and region

278
Q

What is Information analytics?

A

The process of analyzing information to find the value contained within it. This term is often synonymous with data analytics.

279
Q

What is Hadoop Common?

A

Hadoop Common refers to the collection of common utilities and libraries that support other Hadoop modules.

280
Q

Kinesis Data Analytics charging model

A

Hourly rate based on the average number of Kinesis Processing Units (or KPUs) used to run your stream processing application. A single KPU is a unit of stream processing capacity comprised of 1 vCPU compute and 4 GB memory.

281
Q

Kinesis Video Streams pricing

A

Pay only for the volume of data you ingest, store, and consume through the service.

282
Q

Amazon SageMaker pricing model

A

Pay only for what you use. The process of building, training, and deploying ML models is billed by the second, with no minimum

283
Q

How Athena performance can be improved?

A

Compressing, partitioning, and converting your data into columnar formats

284
Q

How Amazon EMR block public work?

A

Amazon EMR block public access prevents a cluster in a public subnet from launching when any security group associated with the cluster has a rule that allows inbound traffic from IPv4 0.0.0.0/0 or IPv6 ::/0 (public access) on a port, unless the port has been specified as an exception.

285
Q

What is Redshift audit logging

A

Amazon Redshift logs information about connections and user activities in your database. The logs are stored in Amazon S3 buckets.
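
Audit logging is enabled per cluster; a small boto3 sketch, where the cluster identifier, bucket, and prefix are placeholders.

import boto3

redshift = boto3.client("redshift")

# Connection and user-activity logs will be delivered to this bucket/prefix.
redshift.enable_logging(
    ClusterIdentifier="analytics-cluster",
    BucketName="my-redshift-audit-logs",
    S3KeyPrefix="audit/",
)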

286
Q

In how many AZs can an EMR cluster reside?

A

The cluster can reside only in one Availability Zone or subnet

287
Q

What is EMRFS consistent view?

A

EMRFS consistent view is an optional feature available when using Amazon EMR release version 3.2.1 or later. Consistent view allows EMR clusters to check for list and read-after-write consistency for Amazon S3 objects written by or synced with EMRFS.

288
Q

Every ten seconds, a streaming application reads data from Amazon Kinesis Data Streams and promptly writes it to an Amazon S3 bucket. Data is being read from hundreds of shards by the application. Due to a different need, the batch interval cannot be modified. Amazon Athena has access to the data. As time passes, users notice a deterioration in query performance.

Which step may aid in query performance optimization?

A

Merge the files in Amazon S3 to form larger files.

289
Q

A utility firm is installing thousands of smart meters in order to get real-time data on energy use. The firm is collecting data streams from the smart meters using Amazon Kinesis Data Streams. The consumer application retrieves the stream data using the Kinesis Client Library (KCL). The company has just one consumer application.
Between the time a record is written to the stream and the time it is read by a consumer application, the business notices an average delay of one second. This delay must be reduced below 500 milliseconds.
Which solution satisfies these criteria?

A

Reduce the propagation delay by overriding the KCL default settings.

290
Q

Two Kinesis Data Streams Capacity Modes

A
  • provisioned
  • on-demand

291
Q

SQS latency

A

<10 ms on publish and receive

292
Q

Max message size in Standard SQS

A

256KB

293
Q

Can a message be processed by many consumers in SQS?

A

No

That’s different from Kinesis
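
A minimal boto3 sketch of that behavior, with a hypothetical queue name and message body: one consumer receives the message, processes it, and deletes it, so no other consumer sees it again (unlike a Kinesis stream, which several applications can read).

import boto3

sqs = boto3.client("sqs")
queue_url = sqs.create_queue(QueueName="votes")["QueueUrl"]

sqs.send_message(QueueUrl=queue_url, MessageBody='{"candidate": "A"}')

# The message is hidden from other consumers while it is being processed,
# and deleting it removes it from the queue for good.
for msg in sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1).get("Messages", []):
    vote = msg["Body"]  # process the message here
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])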

294
Q

SQS use cases

A
  • decouple applications
  • buffer writes to a database (voting application)
  • handle large loads of messages coming in
295
Q

Message size in MSK

A

configurable (default 1 MB)

296
Q

What is broker in Kafka?

A

A Broker is a Kafka server that runs in a Kafka Cluster. Kafka Brokers form a cluster. The Kafka Cluster consists of many Kafka Brokers on many servers.

297
Q

What is MSK ZooKeeper?

A

Apache ZooKeeper is an open-source server that enables highly reliable distributed coordination. Amazon MSK manages the ZooKeeper nodes that coordinate the Kafka brokers, while producers, consumers, and topic creators use Apache Kafka data-plane operations to create topics and to produce and consume data.

298
Q

MSK encryption between brokers

A

in-flight using TLS

299
Q

What is Kafka Connect?

A

Kafka Connect is an open-source component of Apache Kafka that provides a framework for connecting with external systems such as databases, key-value stores, search indexes, and file systems.

300
Q

You are accumulating data from IoT devices and you must send data within 10 seconds to Amazon ElasticSearch service. That data should also be consumed by other services when needed. Which service do you recommend using?

A

Kinesis Data Streams

301
Q

You need a managed service that can deliver data to Amazon S3 and scale automatically for you. You want to be billed only for the actual usage of the service and be able to handle peak loads. Which service do you recommend?

A

Kinesis Data Firehose

302
Q

You are sending many 100-byte data records and would like to ensure you can use Kinesis to receive your data. What should you use to ensure optimal throughput that also offers asynchronous features?

A

Kinesis Producer Library

303
Q

You would like to collect log files in bulk from your Linux servers running on premises. You need a built-in retry mechanism and monitoring through CloudWatch. Logs should end up in Kinesis. What will help you accomplish this?

A

Kinesis Agent

304
Q

You would like to perform batch compression before sending data to Kinesis, in order to maximize the throughput. What should you use?

A

Kinesis Producer Library + Implement Compression Yourself
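
The KPL itself is a Java/C++ library, so as a stand-in here is a minimal boto3 sketch of the batch-and-compress idea, with a hypothetical stream name and record shape: many small records are packed into one blob and gzipped before the put, and the consumer is expected to gunzip and split them back out.

import gzip
import json
import boto3

kinesis = boto3.client("kinesis")

records = [{"sensor_id": i, "value": i * 0.1} for i in range(500)]  # many small records

# Pack and compress so a single Kinesis record (up to 1 MB) carries far more payload.
blob = gzip.compress(json.dumps(records).encode("utf-8"))
kinesis.put_record(StreamName="telemetry", Data=blob, PartitionKey="sensor-batch-1")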

305
Q

You have 10 consumer applications consuming concurrently from one shard, in classic mode, by issuing GetRecords() commands. What is the average latency for consuming these records for each application?

A

2 sec

306
Q

You have 10 consumer applications consuming concurrently from one shard, in enhanced fan-out mode. What is the average latency for consuming these records for each application?

A

70 ms

307
Q

You would like to have data delivered in near real time to Amazon ElasticSearch, and the data delivery to be managed by AWS. What should you use?

A

Kinesis Firehose

308
Q

You are consuming from a Kinesis stream with 10 shards that receives on average 8 MB/s of data from various producers using the KPL. You are using the KCL to consume these records and observe through the CloudWatch metrics that the throughput is 2 MB/s, so your application is lagging. What's the most likely root cause for this issue?

A

Your DynamoDB table is under-provisioned

309
Q

You would like to increase the capacity of your Kinesis streams. What is the best approach?

A

Use Kinesis Data Streams on-demand mode

310
Q

Can Spark Streaming read from Kinesis Data Firehose

A

NO

311
Q

Can Kinesis Firehose write to DynamoDB?

A

NO

312
Q

You are looking to decouple jobs and ensure data is deleted after being processed. Which technology would you choose?

A

SQS

313
Q

You are collecting data from IoT devices at scale and would like to forward that data into Kinesis Data Firehose. How should you proceed?

A

Send that data into an IoT topic and define a rule action
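
A hedged sketch of defining such a rule with boto3, where the topic filter, delivery stream, and role ARN are placeholders: the rule selects messages published to the device topic and hands them to a Kinesis Data Firehose action.

import boto3

iot = boto3.client("iot")

iot.create_topic_rule(
    ruleName="ForwardTelemetryToFirehose",
    topicRulePayload={
        "sql": "SELECT * FROM 'devices/telemetry'",
        "actions": [{
            "firehose": {
                "deliveryStreamName": "iot-telemetry-stream",
                "roleArn": "arn:aws:iam::123456789012:role/IotToFirehoseRole",
            },
        }],
        "ruleDisabled": False,
    },
)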

314
Q

You would like to control the target temperature of your room using an IoT thing thermostat. How can you change its state for target temperature even in the case it’s temporarily offline?

A

Change the state of device shadow

315
Q

You are looking to continuously replicate a MySQL database that’s on premise to Aurora. Which service will allow you to do so securely?

A

DMS (Database Migration Service)

316
Q

You are looking to continuously replicate a MySQL database that’s on premise to Aurora. Which service will allow you to do so securely?

A

DMS

317
Q

You are gathering various files from providers and plan on analyzing them once every month using Athena, which must return the query results immediately. You do not want to run a high risk of losing files and want to minimise costs. Which storage type do you recommend?

A

S3 Infrequent Access

318
Q

As part of your compliance as a bank, you must archive all logs created by all applications and ensure they cannot be modified or deleted for at least 7 years. Which solution should you use?

A

Glacier with Vault Lock Policy

319
Q

In order to perform fast big data analytics, it has been recommended by your analysts in Japan to continuously copy data from your S3 bucket in us-east-1. How do you recommend doing this at a minimal cost?

A

Enable Cross Region Replication

320
Q

Your big data application is taking a lot of files from your local on-premise NFS storage and inserting them into S3. As part of the data integrity verification process, you would like to ensure the files have been properly uploaded at minimal cost. How do you proceed?

A

Compute the local ETag for each file and compare them with AWS S3’s ETag
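
A small sketch of that check in Python, with placeholder bucket, key, and file path. One caveat: for multipart uploads the S3 ETag is not a plain MD5, so this simple comparison only holds for single-part uploads.

import hashlib
import boto3

def local_etag(path):
    # For single-part uploads, the S3 ETag is the hex MD5 of the object body.
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

s3 = boto3.client("s3")
remote_etag = s3.head_object(Bucket="my-bucket", Key="data/file1.csv")["ETag"].strip('"')
if remote_etag != local_etag("file1.csv"):
    raise ValueError("upload verification failed (or the object was a multipart upload)")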

321
Q

Your application plans to have 15,000 reads and writes per second to S3 from thousands of device ids. Which naming convention do you recommend?

A

/yyyy-mm-dd/…

322
Q

You are looking to have your files encrypted in S3 and do not want to manage the encryption yourself. You would like to have control over the encryption keys and ensure they’re securely stored in AWS. What encryption do you recommend?

A

SSE-KMS

323
Q

What’s the maximum number of fields that can make a primary key in DynamoDB?

A

2

324
Q

What’s the maximum size of a row in DynamoDB?

A

400KB

325
Q

We are getting a ProvisionedThroughputExceededExceptions but after checking the metrics, we see we haven’t exceeded the total RCU we had provisioned. What happened?

A

We have hot partition/hot key

326
Q

You are about to enter the Christmas sale and you know a few items in your website are very popular and will be read often. Last year you had a ProvisionedThroughputExceededException. What should you do this year?

A

Create DAX cluster

327
Q

You would like to react in real-time to users de-activating their account and send them an email to try to bring them back. The best way of doing it is to…

A

Integrate Lambda with DynamoDB stream
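
A minimal sketch of such a Lambda handler; the attribute names, the sending address, and the use of Amazon SES are assumptions. It watches the table's stream for MODIFY events where the account flips to a deactivated state and sends a win-back email.

import boto3

ses = boto3.client("ses")

def handler(event, context):
    for record in event["Records"]:
        if record["eventName"] != "MODIFY":
            continue
        old = record["dynamodb"]["OldImage"]
        new = record["dynamodb"]["NewImage"]
        # Hypothetical attributes: "status" and "email".
        if old["status"]["S"] == "active" and new["status"]["S"] == "deactivated":
            ses.send_email(
                Source="winback@example.com",
                Destination={"ToAddresses": [new["email"]["S"]]},
                Message={
                    "Subject": {"Data": "We'd love to have you back"},
                    "Body": {"Text": {"Data": "Here's a discount if you reactivate."}},
                },
            )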

328
Q

You would like to have DynamoDB automatically delete old data for you. What should you use?

A

Use TTL
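
Enabling TTL is a one-call change; a boto3 sketch with placeholder table and attribute names, where the attribute is expected to hold an epoch-seconds expiry timestamp that DynamoDB uses to delete items after they expire.

import boto3

dynamodb = boto3.client("dynamodb")

dynamodb.update_time_to_live(
    TableName="sessions",
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "expires_at"},
)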

329
Q

What are two columnar data formats supported by Athena?

A

Parquet and ORC

330
Q

Your organization is querying JSON data stored in S3 using Athena, and wishes to reduce costs and improve performance with Athena. What steps might you take?

A

Convert the data from JSON to ORC format and analyze the data with Athena

331
Q

When using Athena, you are charged separately for using the AWS Glue Data Catalog. True or False ?

A

True

332
Q

Redshift command to copy table into S3

A

UNLOAD
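
A hedged sketch of an UNLOAD issued through the Redshift Data API, with placeholder cluster, database, table, bucket, and IAM role names: the query result is written to S3, here as Parquet.

import boto3

redshift_data = boto3.client("redshift-data")

redshift_data.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="dev",
    DbUser="analyst",
    Sql="""
        UNLOAD ('SELECT * FROM sales')
        TO 's3://my-bucket/unload/sales_'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3Role'
        FORMAT AS PARQUET;
    """,
)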

333
Q

Redshift command to copy data already in another table

A
  • INSERT INTO … SELECT
  • CREATE TABLE AS

334
Q

What is Short Query Acceleration in Redshift?

A

Prioritize short-running queries over long-running queries

335
Q

Two types of resizing Redshift clusters?

A
  • elastic resize
  • classic resize

336
Q

What is RA3 node type in Redshift?

A

RA3 is a node type with a completely separate storage layer called Redshift Managed Storage (RMS).

337
Q

What is Redshift AQUA?

A

AQUA (Advanced Query Accelerator) is a new distributed and hardware-accelerated cache that enables Amazon Redshift to run up to 10x faster than other enterprise cloud data warehouses by automatically boosting certain types of queries.

338
Q

How is Redshift Serverless measured?

A

In Redshift Processing Units (RPU’s)

339
Q

How much memory each user gets in SPICE?

A

10GB

340
Q

Two major advantages of Quicksight Enterprise over Standard edition

A
  • encryption at rest
  • MS Active Directory Integration

341
Q

Your manager has asked you to prepare a visual on QuickSight to see trends in how money was spent on entertainment in the office in the past 12 months. What visual will you use?

A

Line chart

342
Q

You wish to publish a visual you’ve created illustrating trends of work coming into your company, for your employees to view. Which tool in QuickSight would be appropriate?

A

Dashboard

343
Q

You want to build a visualization from the data-set you have imported, but you are unsure what visual to select for the best view. What can you do?

A

Select auto-graph

344
Q

The source data you wish to analyze is not in a clean format. What can you do to ensure your visual in QuickSight looks good, with a minimum of effort?

A

Select edit/preview data before loading it into analysis and edit it as needed

345
Q

How can you point QuickSight to fetch from your database stored on the EC2 instance in your VPC ?

A

Add Quicksight IP range to the allowed IPs of the hosted DB

346
Q

SSE-S3 header

A

x-amz-server-side-encryption: AES256

347
Q

SSE-KMS advantages over SSE-S3

A

user control + audit trail

348
Q

SSE-KMS header

A

x-amz-server-side-encryption: aws:kms
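
A small boto3 sketch of both options, with placeholder bucket, keys, and KMS key alias; the SDK sets the corresponding x-amz-server-side-encryption header for you.

import boto3

s3 = boto3.client("s3")

# SSE-S3: S3-managed keys (header value AES256).
s3.put_object(Bucket="my-bucket", Key="report-sse-s3.csv", Body=b"...",
              ServerSideEncryption="AES256")

# SSE-KMS: a KMS-managed key, adding key-level control and a CloudTrail audit trail.
s3.put_object(Bucket="my-bucket", Key="report-sse-kms.csv", Body=b"...",
              ServerSideEncryption="aws:kms",
              SSEKMSKeyId="alias/my-data-key")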

349
Q

What is Apache Ranger

A

Apache Ranger is a framework to enable, monitor, and manage comprehensive data security across the Hadoop platform. It provides centralized security administration, so all security-related tasks can be managed in a central UI or through REST APIs.

350
Q

Is SSE-C supported on EMR?

A

SSE with customer-provided keys (SSE-C) is not available for use with Amazon EMR.

351
Q

EMR at-rest data encryption for local disks

A
  • Open-source HDFS encryption
  • Instance store: NVMe or LUKS
  • EBS: KMS, LUKS (doesn’t work with root volume)
352
Q

Can Apache Spark read and write to Kinesis Data Streams

A

YES

353
Q

An e-commerce company wishes to assign product categories, such as sporting goods or books, to new products that have no category assigned to them. The company has a large corpus of existing product data with manually assigned categories in place. They wish to use their existing data to predict categories on new products, based on other attributes of the products such as its keywords and seller ID, using Amazon Machine Learning.

Which type of machine learning model would they use?

A

Multi-class classification model

354
Q

You are looking to query data storage in DynamoDB from your EMR cluster. Which technology will allow you to do so?

A

Hive

355
Q

What are THREE ways in which EMR integrates Pig with Amazon S3?

A
  • Directly writing to HCatalog tables in S3
  • Submitting work from the EMR console using Pig scripts stored in S3
  • Loading custom JAR files from S3 with the REGISTER command
356
Q

A real estate company wishes to display interactive charts on their public-facing website summarizing their prior month’s sales activity.

Which TWO solutions would provide this capability in a scalable and inexpensive manner?

A
  • Publish data in csv format to Amazon Cloudfront via S3, and use d3.js to visualize the data on the web
  • Publish data in csv format to Amazon Cloudfront via S3, and use Highcharts to visualize the data on the web.
357
Q

Your team has developed a Spark Streaming application that performs real-time transformations against an on-premises Apache Kafka cluster and delivers the data in real time to S3. As part of a migration to the cloud and a switch to Kinesis as the underlying streaming store, what do you recommend?

A

Produce data using Spark Streaming to Kinesis Data Streams, and read the data with Spark Streaming from Kinesis Data Streams to write it to S3

358
Q

You are required to maintain a real-time replica of your Amazon Redshift data warehouse across multiple availability zones.

What is the approach toward accomplishing this?

A

Spin up separate redshift clusters in multiple availability zones, using Amazon Kinesis to simultaneously write data into each cluster. Use Route 53 to direct your analytics tools to the nearest cluster when querying your data

359
Q

Your daily Spark job runs against files created by a Kinesis Firehose pipeline in S3. Due to low throughput, you observe that each of the many files created by Kinesis Firehose is about 100KB. You would like to optimise your Spark job as best as possible to query the data efficiently. What do you recommend?

A

Consolidate files on a daily basis using DataPipeline

360
Q

Which authentication method is commonly used by mobile AWS IoT applications?

A

Cognito

361
Q

Your management wants a dashboard to monitor current revenue against their annual revenue goal. Which QuickSight visualization would be most appropriate?

A

KPI

362
Q

An Amazon Elasticsearch domain has been installed within a VPC.

What are TWO methods which could be employed to securely allow access to Kibana from outside the VPC?

A
  • Set up an SSH tunnel with port forwarding to allow access on port 5601
  • Set up a reverse proxy server between your browser and Amazon Elasticsearch Service.
363
Q

Kibana default port number

A

5601

364
Q

You are dealing with PII datasets and would like to leverage Kinesis Data Streams for your pub-sub solution. Regulators imposed the constraint that the data must be encrypted end-to-end using an internal key management system. What do you recommend?

A

Implement a custom encryption code in the Kinesis Producer Library (KPL)

365
Q

What is Sqoop?

A

Sqoop is an open-source system for transferring data between Hadoop and relational databases.

366
Q

You are tasked with using Hive on Elastic MapReduce to analyze data that is currently stored in a large relational database.

Which approach could meet this requirement?

A

Use Apache Sqoop on the EMR cluster to copy the data into HDFS

367
Q

Your financial organization has hundreds of Terabytes of data stored within its on premise data centers, and data is being produced at the rate of Gigabytes per second, and could be consumed within 3 days. As part of their AWS cloud migration, what solution do you recommend for them?

A

Transfer your historical data using Snowball and use Kinesis Data Streams for ongoing data collection

368
Q

A MapReduce job on an EMR cluster needs to process data that is currently stored in very large, compressed files in HDFS, which limits the cluster’s ability to distribute its processing.

Which TWO solutions would best help the MapReduce job to operate more efficiently?

A
  • Uncompress the data and split it into 64MB chunks
  • Convert the file to AVRO format

369
Q

Which tool on Amazon Elastic MapReduce allows you to monitor your cluster’s performance as a whole, and at individual nodes?

A

Ganglia

370
Q

You are processing data using a long-running EMR cluster and you would like to ensure that you can recover data in case an entire Availability Zone goes down, as well as process the data locally for the various Hive jobs you plan on running. What do you recommend to do this at a minimal cost?

A

Store the data in S3 and keep a warm copy in HDFS

371
Q

You would like to process data coming from IoT devices, and processing that data takes approximately 2 minutes per data point. You would also like to be able to scale the number of processes that will consume that data based on the load you are receiving, and no ordering constraints are required. What do you recommend?

A

Define an IoT rules actions to send data to SQS and consume the data with EC2 instances in an Auto Scaling group

372
Q

You have created a system that recommends items similar to other items on an e-commerce website, by training a recommender system using Mahout on an EMR cluster.

Which would be a performant means to vend the resulting table of similar items for any given item to the website at high transaction rates?

A

This is an OLTP use case for which a “NoSQL” database is a good fit. HBase is the only option presented designed for OLTP and not OLAP, plus it has the advantage of already being present in EMR. DynamoDB would also be an appropriate technology to use.

373
Q

What are THREE ways in which EMR optionally integrates HBase with S3?

A
  • Snapshots of HBase data to S3
  • Storage of HBase StoreFiles and metadata on S3
  • HBase read-replicas on S3
374
Q

You wish to analyze an S3 data lake using standard SQL. Which solution results in the least amount of ongoing administration from you, as your data grows?

A

Amazon Athena

375
Q

You have programmed a Lambda function that will be automating the creation of an EMR cluster, which in turn should perform some transformations in S3 through EMRFS. Your Lambda function will be triggered by CloudWatch Events. How can you ensure your Lambda function can properly perform its actions?

A

Attach an IAM role

376
Q

Which are the two technologies that support VPC Endpoint Gateway?

A
  • S3
  • DynamoDB

377
Q

You work for a gaming company and each game’s data is stored in DynamoDB tables. In order to provide a game search functionality to your users, you need to move that data over to ElasticSearch. How can you achieve it efficiently and as close to real time as possible?

A

Enable DynamoDB Streams and write a Lambda function

378
Q

You have an ETL process that collects data from different sources and 3rd party providers and would like to ensure that data is loaded into Redshift once all the parts from all the providers related to one specific job have been gathered, a process that can take anywhere from one hour to one day. What is the least costly way of doing that?

A

Create an AWS Lambda that responds to S3 upload events and will check if all the parts are there before uploading to Redshift

379
Q

You need to ETL streaming data from web server logs as it is streamed in, for analysis in Athena. Upon talking to the stakeholders, you’ve determined that the ETL does not strictly need to happen in real-time, but transforming the data within a minute is desirable.

What is a viable solution to this requirement?

A

Perform any initial ETL you can using Amazon Kinesis, store the data in S3, and trigger a Glue ETL job to complete the transformations needed.

380
Q

An application processes sensor data in real-time by publishing it to Kinesis Data Streams, which in turn sends data to an AWS Lambda function that processes it and feeds it to DynamoDB. During peak usage periods, it’s been observed that some data is lost. You’ve determined that you have sufficient capacity allocated for Kinesis shards and DynamoDB reads and writes.

What might be TWO possible solutions to the problem?

A
  • Process data in smaller batches to avoid hitting Lambda’s timeout
  • Increase your Lambda function’s timeout value
381
Q

Your esports application hosted on AWS need to process game results immediately in real time and later perform analytics on the same game results in the order they came at the end of business hours. Which of the AWS service will be the best fit for your needs?

A

Here Kinesis Data Streams is the best fit as the data can be replayed in the same order

382
Q

A financial services company wishes to back up its encrypted data warehouse in Amazon Redshift daily to a different region.

What is the simplest solution that preserves encryption in transit and at rest?

A

Configure Redshift to automatically copy snapshots to another region, using an AWS KMS customer master key in the destination region.

383
Q

As an e-commerce retailer, you would like to onboard clickstream data onto Kinesis from your web servers Java applications. You want to ensure that a retry mechanism is in place, as well as good batching capability and asynchronous mode. You also want to collect server logs with the same constraints. What do you recommend?

A

Use the Kinesis Producer Library to send the clickstream and the Kinesis agent to collect the Server Logs

384
Q

You would like to design an application that will be able to sustain storing 100s of TB of data in a database that will get low latency on reads and won’t require you to manage scaling. What do you recommend?

A

DynamoDB

385
Q

You are launching an EMR cluster and plan on running custom python scripts that will end up invoking custom Lambda functions deployed within your VPC. How can you ensure the EMR cluster has the right to invoke the functions?

A

Create an IAM role and attach it to the EMR instances

386
Q

You are creating an EMR cluster that will process the data in several MapReduce steps. Currently you are working against the data in S3 using EMRFS, but the network costs are extremely high as the processes write back temporary data to S3 before reading it. You are tasked with optimizing the process and bringing the cost down, what should you do?

A

Here, using an S3DistCp command is the right thing to do to copy data from S3 into HDFS and then make sure the data is processed locally by the EMR cluster MapReduce job. Upon completion, you will use S3DistCp again to push back the final result data to S3.
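
A hedged sketch of adding an S3DistCp step to a running cluster with boto3, where the cluster ID, bucket, and HDFS path are placeholders; a second step at the end of the job would push the final results back to S3 the same way.

import boto3

emr = boto3.client("emr")

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster ID
    Steps=[{
        "Name": "Copy input from S3 to HDFS",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["s3-dist-cp", "--src", "s3://my-bucket/input/", "--dest", "hdfs:///input/"],
        },
    }],
)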

387
Q

You need to add more nodes to your Redshift cluster and change the node type in the process. Which process allows you to do this while minimizing downtime for both reads and writes?

A

Classic resize

388
Q

A data scientist wishes to develop a machine learning model to predict stock prices using Python in a Jupyter Notebook, and use a cluster on AWS to train and tune this model, and to vend predictions from it at large scale.

Which system allows you to do this?

A

SageMaker enables developers and data scientists to build, train, and deploy machine learning models at any scale, using hosted Jupyter notebooks

389
Q

You need to load several hundred GB of data every day from Amazon S3 into Amazon Redshift, and the data is stored in a single file. You've found that loading this data is prohibitively slow.

Which approach would optimize this loading best?

A

Split the data into files between 1MB and 125MB (after compression) and specify GZIP compression with a single COPY command.
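
A hedged sketch of that single COPY issued through the Redshift Data API, with placeholder cluster, table, bucket prefix, and role names; the common key prefix picks up all of the split, gzipped files so the slices can load them in parallel.

import boto3

redshift_data = boto3.client("redshift-data")

redshift_data.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="dev",
    DbUser="loader",
    Sql="""
        COPY sales
        FROM 's3://my-bucket/staging/sales_part_'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3Role'
        GZIP
        CSV;
    """,
)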

390
Q

Amazon Athena is used by a business to do ad-hoc searches on data stored in Amazon S3. To comply with internal security regulations, the organization wishes to incorporate additional restrictions to isolate query execution and history among individuals, teams, or apps operating in the same AWS account.

Which solution satisfies these criteria?

A

Create an Athena workgroup for each given use case, apply tags to the workgroup, and create an IAM policy using the tags to apply appropriate permissions to the workgroup.

391
Q

A real estate business uses Apache HBase on Amazon EMR to power a mission-critical application. Amazon EMR is set up with a single master node. The company keeps more than 5 TB of data on the Hadoop Distributed File System (HDFS). The organization is looking for a cost-effective way to increase the availability of its HBase data.

Which architectural design best fulfills the needs of the business?

A

Store the data on an EMR File System (EMRFS) instead of HDFS and enable EMRFS consistent view. Run two separate EMR clusters in two different Availability Zones. Point both clusters to the same HBase root directory in the same Amazon S3 bucket.

392
Q

A human resources organization runs analytics queries on the company's data using a 10-node Amazon Redshift cluster. The Amazon Redshift cluster comprises two tables: one for products and one for transactions, both of which have a product_sku field. The tables span more than 100 GB. Both tables are used in the majority of queries. Which distribution style should be chosen for the two tables?

A

A KEY distribution style for both tables

393
Q

A corporation receives a 100 MB .csv file compressed with gzip once a month. The file is hosted in Amazon S3 Glacier and comprises 50,000 property listing records.
The company’s data analyst is required to query a portion of the company’s data for a certain vendor.

Which approach is the most cost-effective?

A

Load the data into Amazon S3 and query it with Amazon S3 Select.
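
A boto3 sketch of that query, with placeholder bucket, key, and column names: S3 Select filters the gzipped CSV object server-side and returns only the matching rows.

import boto3

s3 = boto3.client("s3")

resp = s3.select_object_content(
    Bucket="my-bucket",
    Key="listings/2024-01.csv.gz",
    ExpressionType="SQL",
    Expression="SELECT * FROM s3object s WHERE s.vendor_id = 'V-100'",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}, "CompressionType": "GZIP"},
    OutputSerialization={"CSV": {}},
)
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode())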

394
Q

Amazon S3 is being used by a marketing organization to store campaign response data. Each campaign's data was compiled from a consistent set of sources. The data is uploaded to Amazon S3 in the form of .csv files. A business analyst will examine the data from each campaign using Amazon Athena. The organization requires a reduction in the cost of continuing data analysis using Athena.

Which steps should a data analytics professional perform in combination to achieve these requirements?

A

Convert the .csv files to Apache Parquet.

Partition the data by campaign.

395
Q

Utilizing Amazon Kinesis Data Streams, an online store is redesigning its inventory management and inventory reordering systems to automate product reordering. The Kinesis Producer Library (KPL) is used to publish data to a stream by the inventory management system. The Kinesis Client Library (KCL) is used to ingest data from the stream by the inventory reordering mechanism. The stream is set to scale up or down as necessary. Just before production deployment, the merchant realizes that the inventory reordering system is receiving duplicate data.

What causes may be responsible for the duplicated data?

A

The producer has a network-related timeout.

There was a change in the number of shards, record processors, or both.

396
Q

A business has an application that reads records from a Kinesis data stream using the Amazon Kinesis Client Library (KCL).
The application saw a considerable rise in use after a successful marketing effort. As a consequence, a data analyst had to split certain data shards. When the shards were split, the application began intermittently throwing ExpiredIteratorExceptions.

What should the data analyst do to resolve this?

A

Increase the provisioned write capacity units assigned to the stream’s Amazon DynamoDB table.

397
Q

A mortgage firm maintains a microservice for payment acceptance. This microservice encrypts sensitive data before it is written to DynamoDB using the Amazon DynamoDB encryption client and AWS KMS controlled keys. Finance should be able to import this data into Amazon Redshift and aggregate the information contained inside the sensitive fields. Other data analysts from other business divisions share the Amazon Redshift cluster.

Which actions should a data analyst take to effectively and safely do this task?

A

Create an AWS Lambda function to process the DynamoDB stream. Save the output to a restricted S3 bucket for the finance team. Create a finance table in Amazon Redshift that is accessible to the finance team only. Use the COPY command with the IAM role that has access to the KMS key to load the data from S3 to the finance table.

398
Q

A retail organization is using Amazon Redshift to construct its data warehouse solution. The organization is now adding hundreds of files into the fact table established in its Amazon Redshift cluster as part of that endeavor. When loading data into the firm’s fact table, the company needs the solution to achieve the best possible throughput and to make optimum use of cluster resources.

How should the business go about meeting these requirements?

A

Use a single COPY command to load the data into the Amazon Redshift cluster.

399
Q

A corporation maintains a PostgreSQL database on-premises that includes historical data. The database is used by an internal legacy application for read-only operations. The business team wishes to migrate the data as quickly as possible to a data lake on Amazon S3 and enhance it for analytics.
Between its VPC and its on-premises network, the organization established an AWS Direct Connect link. A data analytics expert must provide a solution that accomplishes the business team’s objectives while incurring the fewest operating costs.

A

Configure an AWS Glue crawler to use a JDBC connection to catalog the data in the on-premises database. Use an AWS Glue job to enrich the data and save the result to Amazon S3 in Apache Parquet format. Use Amazon Athena to query the data.

400
Q

A data engineer is processing data at periodic intervals using AWS Glue ETL processes. Following that, the processed data is transferred to Amazon S3. The ETL operations are scheduled to run every 15 minutes. After each operation is completed, the AWS Glue Data Catalog partitions must be updated automatically.

Which approach will be the most cost-effective in meeting these requirements?

A

Use the AWS Glue Data Catalog to manage the data catalog. Update the AWS Glue ETL code to include the enableUpdateCatalog and partitionKeys arguments.
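
A hedged sketch of that pattern inside a Glue ETL script, where the database, table, S3 path, and partition keys are placeholders: with enableUpdateCatalog and partitionKeys set on the sink, new partitions are registered in the Data Catalog as the job writes.

from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the source data registered in the Data Catalog (placeholder names).
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="analytics", table_name="events_raw"
)

sink = glue_context.getSink(
    connection_type="s3",
    path="s3://my-bucket/processed/",
    enableUpdateCatalog=True,
    updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=["year", "month", "day"],
)
sink.setFormat("glueparquet")
sink.setCatalogInfo(catalogDatabase="analytics", catalogTableName="events_processed")
sink.writeFrame(dyf)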

401
Q

Amazon Redshift is used to store revenue data for a business. A data analyst must develop a dashboard that enables the company's sales staff to view previous revenue and accurately forecast revenue for the coming months.

Which solution will satisfy these needs the MOST EFFECTIVELY?

A

Create an Amazon QuickSight analysis by using the data in Amazon Redshift. Add a forecasting widget. Publish the analysis as a dashboard.

402
Q

In the previous 24 months, a corporation has accumulated more than 100 TB of log data. The files are saved in an Amazon S3 bucket as raw text. Each object is identified by a key of the form year-month-day_log_HHmmss.txt, where HHmmss denotes the time the log file was first created. In Amazon Athena, a table was constructed that links to the S3 bucket. Several times every hour, one-time queries are conducted against a subset of the table's columns.
A data analyst must make adjustments to the way these queries are conducted in order to lower the cost. Management wants a solution that requires less upkeep.

Which actions should the data analyst perform in combination to achieve these requirements?

A

Add a key prefix of the form date=year-month-day/ to the S3 objects to partition the data.

Convert the log files to Apache Parquet format.

Drop and recreate the table with the PARTITIONED BY clause. Run the MSCK REPAIR TABLE statement.

403
Q

A business owns an Amazon Redshift cluster that is encrypted. The organization just enabled audit logs in Amazon Redshift and wants to guarantee that audit logs are likewise encrypted at rest. The logs are kept for one year. The auditor conducts a monthly audit of the logs.

How might these needs be met in the MOST cost-effective manner possible?

A

Enable default encryption on the Amazon S3 bucket where the logs are stored by using AES-256 encryption. Use Amazon Redshift Spectrum to query the data as required.

404
Q

A marketing business is running its workloads on Amazon EMR clusters. The corporation manually installs third-party libraries on the clusters by logging onto the master nodes. A data analyst must develop an automated solution to replace the manual process.

Which choices meet these criteria?

A

Place the required installation scripts in Amazon S3 and execute them using custom bootstrap actions.

Launch an Amazon EC2 instance with Amazon Linux and install the required third-party libraries on the instance. Create an AMI and use that AMI to create the EMR cluster.

405
Q

A business stores its data on Amazon Redshift. The reporting team generates reports from the Amazon Redshift database using ad-hoc queries. Recently, the reporting team began to notice discrepancies in report creation. Ad-hoc queries that commonly return results in minutes might take hours to complete. A data analytics professional troubleshooting the problem discovers that ad-hoc queries are getting stuck in the queue behind long-running queries.

How should the data analyst address the situation?

A

Configure automatic workload management (WLM) from the Amazon Redshift console.

406
Q

A major government entity is utilizing Amazon Managed Streaming for Apache Kafka (Amazon MSK) to gather events from multiple internal applications.
To keep data distinct, the business has set up a separate Kafka topic for each application. To ensure data security, the Kafka cluster is set to accept only TLS-encrypted data and to encrypt data in transit.
A recent application upgrade revealed that one of the apps had been configured improperly, resulting in data being written to another application's Kafka topic. As data from numerous apps surfaced on the same topic, this resulted in many failures in the analytics pipeline. Following this occurrence, the organization wants to prevent applications from writing to any topic other than the one to which they are supposed to write.

Which option satisfies these criteria with the least effort?

A

Use Kafka ACLs and configure read and write permissions for each topic. Use the distinguished name of the clients’ TLS certificates as the principal of the ACL.

407
Q

A business wishes to optimize the data loading time for a sales data display. The data was gathered using .csv files and saved in a date-partitioned Amazon S3 bucket. The data is subsequently placed into an Amazon Redshift data warehouse for in-depth analysis on a regular basis. Daily data consumption is limited to 500 GB.

A

Split large .csv files, then use a COPY command to load data into Amazon Redshift.

408
Q

Amazon Redshift is used by an online retailer to store past sales transactions. To comply with the Payment Card Industry Data Security Standard (PCI DSS), the organization is obliged to encrypt data at rest inside the clusters. A corporate governance policy requires encryption keys to be managed through an on-premises hardware security module (HSM).

Which solution satisfies these criteria?

A

Create a VPC and establish a VPN connection between the VPC and the on-premises network. Create an HSM connection and client certificate for the on- premises HSM. Launch a cluster in the VPC with the option to use the on-premises HSM to store keys.

409
Q

A business uses Amazon Redshift to manage a data warehouse that is around 500 TB in size. Every few hours, new data is imported, and read-only queries are executed throughout the day and night. On business days, there is a very high load with no writes for many hours each morning. Certain queries are queued and take a long time to run during those hours. The business must optimize query execution and minimize downtime.

Which approach is the MOST cost-effective?

A

Enable concurrency scaling in the workload management (WLM) queue.

410
Q

A manufacturing business wishes to construct an operational analytics dashboard for the purpose of seeing near-real-time equipment parameters. The firm streams the data to other apps through Amazon Kinesis Data Streams. The dashboard must be refreshed automatically every five seconds. A data analytics professional must build a solution that is as simple to deploy as feasible.

Which solution satisfies these criteria?

A

Use Amazon Kinesis Data Firehose to push the data into an Amazon OpenSearch Service (Amazon Elasticsearch Service) cluster. Visualize the data by using an OpenSearch Dashboards (Kibana).

411
Q

A firm established a new election reporting website that uses Amazon Kinesis Data Firehose to transfer complete AWS WAF logs to an Amazon S3 bucket.
The organization is now looking for a low-cost solution for doing this infrequent data analysis with little development work, using log visualizations.

Which solution satisfies these criteria?

A

Use an AWS Glue crawler to create and update a table in the Glue data catalog from the logs. Use Athena to perform ad-hoc analyses and use Amazon QuickSight to develop data visualizations.

412
Q

A corporation that specializes in media analytics consumes a stream of social media updates. The postings are sent via an Amazon Kinesis data stream segmented by user id. Before putting the articles into an Amazon OpenSearch Service (Amazon Elasticsearch Service) cluster, an AWS Lambda function gets the records and checks the content. The validation procedure must receive the postings for a specific user in the sequence in which the Kinesis data stream received them.
During peak hours, it takes more than an hour for social media postings to show in the Amazon OpenSearch Service (Amazon ES) cluster. A data analytics professional must create a system that minimizes operational overhead while reducing latency.

Which solution satisfies these criteria?

A

Increase the number of shards in the Kinesis data stream.

413
Q

A business requires the collection of streaming data from several sources and storage on the AWS Cloud. Although the dataset is well organized, analysts must execute multiple sophisticated SQL queries with consistent performance. Certain types of data are searched more often than others. The organization is looking for a cost-effective solution that satisfies its performance criteria.

Which solution satisfies these criteria?

A

Use Amazon Kinesis Data Firehose to ingest the data to save it to Amazon S3. Load frequently queried data to Amazon Redshift using the COPY command. Use Amazon Redshift Spectrum for less frequently queried data.

414
Q

A media corporation has developed a streaming media player. The organization must gather and analyze data in order to deliver near-real-time feedback within 30 seconds on playback difficulties. The business seeks a consumer application that can detect playback difficulties, such as reduced quality over a certain time period. JSON-formatted data will be transmitted. The schema is subject to modification throughout time.

Which solution will satisfy these criteria?

A

Send the data to Amazon Kinesis Data Streams and configure an Amazon Kinesis Analytics for Java application as the consumer. The application will consume the data and process it to identify potential playback issues. Persist the raw data to Amazon S3.

415
Q

A business is looking to reduce the cost of its data and analytics platform. The organization is importing a variety of.csv and JSON files from a variety of data sources into Amazon S3. Each day, 50 GB of data is projected to be received. The firm is directly querying the raw data in Amazon S3 using Amazon Athena. The majority of searches aggregate data from the last 12 months, whereas data older than five years is accessed seldom. A typical query would search around 500 MB of data and should provide results in less than one minute. For compliance purposes, raw data must be stored permanently.

Which option best fulfills the needs of the business?

A

Use an AWS Glue ETL job to compress, partition, and convert the data into a columnar data format. Use Athena to query the processed dataset. Configure a lifecycle policy to move the processed data into the Amazon S3 Standard-Infrequent Access (S3 Standard-IA) storage class 5 years after object creation. Configure a second lifecycle policy to move the raw data into Amazon S3 Glacier for long-term archival 7 days after object creation.

416
Q

An ecommerce organization is transferring its on-premises business intelligence system to the AWS Cloud. Amazon Redshift on a public subnet and Amazon QuickSight will be used by the firm. The tables have already been imported into Amazon Redshift and are accessible using a SQL tool.
QuickSight is launched for the first time by the corporation. A data analytics professional inputs the necessary information and attempts to confirm the connection throughout the data source creation process. The process fails with the following message: "Connecting to your data source has expired."

What should the data analytics professional do to rectify this situation?

A

Add the QuickSight IP address range into the Amazon Redshift security group.
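
A hedged sketch of adding that inbound rule with boto3; the security group ID and CIDR are placeholders, since the actual QuickSight IP range depends on the region and is published by AWS.

import boto3

ec2 = boto3.client("ec2")

ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",  # the Redshift cluster's security group (placeholder)
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 5439,  # default Redshift port
        "ToPort": 5439,
        "IpRanges": [{
            "CidrIp": "198.51.100.0/27",  # placeholder; use your region's QuickSight range
            "Description": "Amazon QuickSight",
        }],
    }],
)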

417
Q

A manufacturer of network equipment has millions of customers. On an hourly basis, data is gathered from the devices and saved in an Amazon S3 data lake.
The organization conducts analysis of the previous 24 hours’ worth of data flow records to find anomalies and diagnose and resolve user difficulties. Additionally, the organization examines historical records stretching back two years to uncover trends and identify areas for development.
Numerous parameters are included in the data flow logs, including the date, timestamp, source IP, and destination IP. Around ten billion events occur each day.

How should this data be saved in order to get the best performance?

A

In Apache ORC partitioned by date and sorted by source IP

418
Q

A manufacturing business stores its data on Amazon S3. The organization intends to employ AWS Lake Formation to secure such data assets at the granular level. Apache Parquet is used to store the data. The corporation has assigned a consultant a deadline for constructing a data lake.

How should the consultant approach developing the MOST COST-EFFECTIVE solution that satisfies these requirements?

A

To create the data catalog, run an AWS Glue crawler on the existing Parquet data. Register the Amazon S3 path and then apply permissions through Lake Formation to provide granular-level security.

419
Q

Amazon QuickSight is being used by a data analyst to see data across numerous datasets created by apps. Each application has its own Amazon S3 bucket. AWS Glue Data Catalog is used to manage all application data stored in Amazon S3. A new application’s data is stored in its own S3 bucket. The data analyst built a new Amazon QuickSight data source from an Amazon Athena table after revising the catalog to incorporate the new application data source, however the import into SPICE failed.

How is the data analyst to fix the situation?

A

Edit the permissions for the new S3 bucket from within the Amazon QuickSight console.

420
Q

Salesforce, MySQL, and Amazon S3 are all used by a marketing organization to store data. The organization wishes to use data from these three sites in order to provide mobile dashboards for its consumers. The organization is unclear how to develop the dashboards and need a solution that requires as little modification and code as feasible.

Which solution satisfies these criteria?

A

Use Amazon QuickSight to connect to the data sources and generate the mobile dashboards.

421
Q

A marketing organization uses Amazon S3 to host a data lake. The firm manages the metadata via AWS Glue Data Catalog. The data lake is many years old, and its total size has grown dramatically as new data sources and metadata have been added. The data lake administrator wishes to build a system for synchronizing permissions between Amazon S3 and the Data Catalog.

Which option will make permissions management the simplest, with the least amount of development effort?

A

Use AWS Lake Formation permissions

422
Q

Amazon Redshift serves as the data repository for a business. A new table has columns containing sensitive data as well as columns containing non-sensitive data. Eventually, the data in the database will be accessed by various existing queries that are executed several times each day.
A data analytics professional must verify that columns containing sensitive data are viewed only by members of the company’s auditing team. All other users must have read-only access to non-sensitive data columns.

Which method satisfies these criteria with the LEAST amount of operational overhead?

A

Grant all users read-only permissions to the columns that contain non-sensitive data. Use the GRANT SELECT command to allow the auditing team to access the columns that contain sensitive data.

423
Q

A university wishes to gather JSON-formatted batches of water quality values in Amazon S3 using Amazon Kinesis Data Firehose. The data comes from 50 sensors spread over a nearby lake. Students will use Amazon Athena to query the stored data to track changes in a recorded parameter over time, such as water temperature or acidity. The project has generated increased interest, pushing the institution to rethink how data would be maintained.

Which data format and partitioning scheme will result in the MOST substantial cost savings?

A

Partition the data by year, month, and day.

Store the data in Apache Parquet format using Snappy compression.

424
Q

On Amazon S3, a financial services business is developing a data lake solution. The firm intends to use AWS's analytics capabilities to address consumer requests for one-time querying and business intelligence reports. A percentage of the columns will include personally identifiable information (PII). Plaintext PII data should be seen only by authorized users.

What is the MOST OPTIMAL option that satisfies these requirements?

A

Register the S3 locations with AWS Lake Formation. Create two IAM roles. Use Lake Formation data permissions to grant Select permissions to all of the columns for one role. Grant Select permissions to only columns that contain non-PII data for the other role.
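
A hedged sketch of the column-level grant for the non-PII role with boto3, where the role ARN, database, table, and column names are placeholders; the authorized-users role would receive a similar grant covering all columns.

import boto3

lakeformation = boto3.client("lakeformation")

lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/AnalystsNonPII"},
    Resource={"TableWithColumns": {
        "DatabaseName": "customer_db",
        "Name": "customers",
        "ColumnNames": ["customer_id", "signup_date", "plan"],  # non-PII columns only
    }},
    Permissions=["SELECT"],
)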

425
Q

A data analyst uses Amazon Athena and the JDBC driver to execute a significant number of data manipulation language (DML) queries. Recently, a query failed after 30 minutes of execution. The query returned the following message: java.sql.SQLException: Query execution timed out
The data analyst does not need the query results immediately. However, the data analyst needs a long-term solution to this problem.

Which solution will satisfy these criteria?

A

In the Service Quotas console, request an increase for the DML query timeout

426
Q

A significant financial institution is currently conducting its ETL process. This method includes transferring data from Amazon S3 to an Amazon Redshift cluster. The organization wants to load the dataset into Amazon Redshift in the most cost-effective manner possible.

Which sequence of steps would satisfy these conditions?

A

Use the COPY command with the manifest file to load data into Amazon Redshift.

Use temporary staging tables during the loading process.

427
Q

Amazon QuickSight dashboards are being used by a media business to display its nationwide sales statistics. The dashboard makes use of the following fields: ID, date, time zone, city, state, country, longitude, latitude, sales volume, and item count.
To make adjustments to current initiatives, the organization needs an interactive and comprehensible display of which states achieved considerably lower sales volumes than the national average.

Which enhancement to the QuickSight dashboard will satisfy this requirement?

A

A geospatial color-coded chart of sales volume data across the country.

428
Q

A technology business has an application that serves millions of daily users. The organization uses Amazon Athena to query daily usage statistics in order to learn how users engage with the application. The data includes the date and time stamps, the location ID, and the services used. The firm wishes to use Athena to conduct queries for data analysis with the lowest feasible latency.

Which solution satisfies these criteria?

A

Store the data in Apache Parquet format with the date and time as the partition, with the data sorted by the location ID.

429
Q

A social networking corporation is analyzing its data for forecasting purposes using business intelligence tools. The organization is using Apache Kafka to rapidly absorb low-velocity data. The organization wishes to develop dynamic dashboards that use machine learning (ML) insights in order to anticipate important business trends. The dashboards must be updated hourly using data from Amazon S3. Various teams inside the organization wish to access the dashboards using Amazon QuickSight with machine learning insights. Additionally, the solution must address the scalability issues that the organization currently has while ingesting data through its present architecture.

Which solution will satisfy these needs the MOST EFFECTIVELY?

A

Replace Kafka with an Amazon Kinesis data stream. Use an Amazon Kinesis Data Firehose delivery stream to consume the data and store the data in Amazon S3. Use QuickSight Enterprise edition to refresh the data in SPICE from Amazon S3 hourly and create a dynamic dashboard with forecasting and ML insights.

430
Q

A business wants to conduct a user churn analysis by reviewing the previous three months' worth of user activity. Each day, 1.5 TB of uncompressed data is created by millions of users. To reach the query performance targets, a 30-node Amazon Redshift cluster with 2.56 TB of solid state drive (SSD) storage per node is needed.
The business intends to do an additional analysis of a year's worth of historical data in order to determine which features are the most popular. This analysis will occur on a weekly basis.

Which approach is the MOST cost-effective?

A

Keep the data from the last 90 days in Amazon Redshift. Move data older than 90 days to Amazon S3 and store it in Apache Parquet format partitioned by date. Then use Amazon Redshift Spectrum for the additional analysis.

431
Q

A prominent institution has set a strategic aim of boosting student diversity. The data analytics team is now working on developing a dashboard with data visualizations that will allow stakeholders to see historical patterns. Microsoft Active Directory must be used to authenticate all access. Encryption of data in transit and at rest is required.

Which solution satisfies these criteria?

A

Amazon QuickSight Enterprise edition configured to perform identity federation using SAML 2.0 and the default encryption settings.

432
Q

A financial institution uses Amazon S3 to host a data lake and an Amazon Redshift cluster to host a data warehouse. The firm utilizes Amazon QuickSight to create dashboards and wishes to safeguard access to Amazon QuickSight from its on-premises Active Directory.

A

Use an Active Directory connector and single sign-on (SSO) in a corporate network environment.

433
Q

Amazon Redshift is being used by an online retailer to execute queries and do analytics on consumer buying activity. When numerous queries are executing concurrently on the cluster, the runtime for small queries rapidly rises. The company's data analytics team wants to reduce the runtime of these small queries by prioritizing them over large queries.

Which solution will satisfy these criteria?

A

Configure short query acceleration in workload management (WLM)

434
Q

A business wants to gather and handle near-real-time event data from several departments. Prior to storing the data on Amazon S3, the firm must clean it by standardizing the formats of the address and timestamp columns. The volume of data fluctuates according to the total load at any given moment in time. A single data record might range in size from 100 KB to 10 MB.

How should a data analytics professional design the data intake solution?

A

Use Amazon Managed Streaming for Apache Kafka. Configure a topic for the raw data. Use a Kafka producer to write data to the topic. Create an application on Amazon EC2 that reads data from the topic by using the Apache Kafka consumer API, cleanses the data, and writes to Amazon S3.

435
Q

A manufacturer of equipment wants to gather data from sensors. A data analytics professional must build a system that gathers and stores data in near-real time. The data must be saved in nested JSON format and queried from the data store with single-digit millisecond latency.

Which solution will satisfy these criteria?

A

Use Amazon Kinesis Data Streams to receive the data from the sensors. Use Amazon Kinesis Data Analytics to read the stream, aggregate the data, and send the data to an AWS Lambda function. Configure the Lambda function to store the data in Amazon DynamoDB.

436
Q

A marketing firm wishes to enhance its business intelligence and reporting capabilities. The organization conducted interviews with key stakeholders throughout the planning process and learned the following:

  • Hourly reports for the current month's data are generated by the operations team.
  • The sales team wants to use numerous Amazon QuickSight dashboards to provide a rolling view of the last 30 days per category. Additionally, the sales team wants immediate access to the data as it reaches the reporting backend.
  • The finance team runs reports daily for the previous month's data and once a month for the previous 24 months' data.
Currently, the system has 400 TB of data, with a projected monthly addition of 100 TB. The organization is seeking the most cost-effective option.

Which option best fulfills the needs of the business?

A

Store the last 2 months of data in Amazon Redshift and the rest of the months in Amazon S3. Set up an external schema and table for Amazon Redshift Spectrum. Configure Amazon QuickSight with Amazon Redshift as the data source.

437
Q

Amazon Redshift serves as the data repository for a business. Columns in a new table contain sensitive data. Eventually, the data in the database will be referred to by various existing queries that execute several times each day.
A data analyst is tasked with the responsibility of loading 100 billion rows of data into a new database. Prior to doing so, the data analyst must verify that the columns holding sensitive data are only accessible to members of the auditing group.

How can the data analyst achieve these criteria while incurring the fewest possible maintenance costs?

A

Load all the data into the new table and grant the auditing group permission to read from the table. Use the GRANT SQL command to allow read-only access to a subset of columns to the appropriate users.

438
Q

A corporation is transferring its current on-premises data integration and transformation operations to Amazon EMR. The code is composed of a succession of Java tasks. The organization needs to decrease system administrator overhead without modifying the underlying programming. Due to the sensitivity of the material, compliance requires the organization to encrypt the root device volume on all cluster nodes. When feasible, corporate requirements dictate that environments be deployed through AWS CloudFormation.

Which solution meets these criteria?

A

Create a custom AMI with encrypted root device volumes. Configure Amazon EMR to use the custom AMI using the CustomAmiId property in the CloudFormation template.

439
Q

A business wants to ensure that its data analysts have continuous access to the data stored in its Amazon Redshift cluster. Amazon Kinesis Data Firehose is used to transmit all data to an Amazon S3 bucket. Every five minutes, an AWS Glue job executes a COPY command to transfer the data into Amazon Redshift.
The volume of data sent varies throughout the day, and cluster usage peaks at various times. COPY commands typically finish within a few seconds. When a load surge happens, however, locks may exist and data may be missing. At the moment, the AWS Glue job is set to execute without retries, with a 5-minute timeout and a single concurrency.

How should a data analytics professional design the AWS Glue task to increase data availability and fault tolerance in an Amazon Redshift cluster?

A

Keep the number of retries at 0. Decrease the timeout value. Keep the job concurrency at 1.

440
Q

A corporation wants to visualize complex real-world scenarios using an automatic machine learning (ML) Random Cut Forest (RCF) algorithm, including recognizing seasonality and trends, excluding outliers, and imputing missing values.
The team working on this project is non-technical and is looking for the LEAST amount of management overhead possible.

Which solution will satisfy these criteria?

A

Use Amazon QuickSight to visualize the data and then use ML-powered forecasting to forecast the key business metrics.

441
Q

For the last year, a manufacturing business has been gathering data from IoT sensor devices on its factory floor and storing it in Amazon Redshift for daily analysis. A data analyst has determined that, at the estimated intake rate of around 2 TB each day, the cluster will be undersized in less than four months. A long-term strategy is needed. The data analyst noted that the majority of queries reference only the most recent 13 months of data, yet there are also quarterly reports that require querying all data created over the previous seven years. The chief technology officer (CTO) is worried about the long-term solution's cost, administrative burden, and performance.

Which data analyst solution should be used to satisfy these requirements?

A

Create a daily job in AWS Glue to UNLOAD records older than 13 months to Amazon S3 and delete those records from Amazon Redshift. Create an external table in Amazon Redshift to point to the S3 location. Use Amazon Redshift Spectrum to join to data that is older than 13 months.

442
Q

A multinational pharmaceutical corporation receives test results for new medications from a variety of testing centers located across the globe. The results are uploaded as millions of 1 KB JSON objects to the company's Amazon S3 bucket. The data engineering team must process those files, convert them to Apache Parquet, and load them into Amazon Redshift for dashboard reporting by data analysts. The engineering team processes the files using AWS Glue, orchestrates the workflow using AWS Step Functions, and schedules jobs using Amazon CloudWatch.
Additional testing facilities have been added recently, and the time required to process the files is growing.

What will most effectively reduce the time required to process data?

A

Use the AWS Glue dynamic frame file grouping option while ingesting the raw input files. Process the files and load them into Amazon Redshift tables.
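A minimal PySpark sketch of the dynamic frame file-grouping option named in the answer; the S3 path and group size are assumptions.

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Group many small JSON objects into roughly 64 MB read tasks while ingesting.
raw = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://example-test-results/raw/"],  # placeholder bucket/prefix
        "groupFiles": "inPartition",
        "groupSize": "67108864",
    },
    format="json",
)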

443
Q

A financial institution is presently storing sensitive data on an Amazon Redshift cluster with dense storage (DS) nodes. The cluster was discovered to be unencrypted during an audit. According to compliance standards, a database containing sensitive data must be secured using a hardware security module (HSM) that supports automatic key rotation.

A

Set up a trusted connection with HSM using a client and server certificate with automatic key rotation.

Create a new HSM-encrypted Amazon Redshift cluster and migrate the data to the new cluster.

444
Q

A hotel inventory management firm wants to develop a near-real-time messaging system. The messages are collected from 1,000 data sources and contain hotel inventory information. The data is then processed and delivered to 20 HTTP endpoint destinations. Message sizes range from 2 KB to 500 KB.
The messages must be delivered in order to each destination. The performance of a single HTTP endpoint should have no effect on the performance of the other HTTP endpoints.

Which solution satisfies these objectives with the LEAST latency between message ingestion and delivery?

A

B. Create an Amazon Kinesis data stream, and ingest the data for each source into the stream. Create a single enhanced fan-out AWS Lambda function to read these messages and send the messages to each destination endpoint. Register the function as an enhanced fan-out consumer.

445
Q

An airline stores data in .csv format in Amazon S3 using an AWS Glue Data Catalog. As part of a daily batch process, data analysts want to combine this data with call center data housed in Amazon Redshift. Amazon Redshift is already under significant load. The solution must be managed, serverless, and performant, with the goal of minimizing load on the existing Amazon Redshift cluster. Additionally, the solution should require minimal effort and development work.

Which solution satisfies these criteria?

A

Create an external table using Amazon Redshift Spectrum for the call center data and perform the join with Amazon Redshift.

446
Q

Amazon RDS is used by an ecommerce business to store client purchase data. The business needs a system for storing and analyzing historical data. For analytics workloads, the most recent six months of data will be queried regularly. This dataset is many terabytes in size. Once a month, historical data from the last five years must be available and combined with more recent data. The organization wishes to optimize both performance and cost.

Which storage option will satisfy these criteria?

A

Incrementally copy data from Amazon RDS to Amazon S3. Load and store the most recent 6 months of data in Amazon Redshift. Configure an Amazon Redshift Spectrum table to connect to all historical data.

447
Q

Using Amazon QuickSight, a retail company's data analytics team has produced various product sales analysis dashboards for the average selling price per product. The dashboards were produced by uploading .csv data to Amazon S3. The team now intends to share the dashboards with the corresponding external product owners through individual Amazon QuickSight users. Access restriction is a critical requirement for compliance and governance purposes. The dashboard reports should be limited to the analysis of each owner's own products.

Which method should the data analytics team employ so that product owners can see only their own products in the dashboard?

A

Create dataset rules with row-level security.

448
Q

A retailer wants to use Amazon QuickSight to create dashboards for both online and in-store sales. The dashboards will be developed and used by a group of 50 business intelligence specialists. When the dashboards are complete, they will be shared with a group of 1,000 users.
Sales data is gathered from various retailers and uploaded to Amazon S3 every 24 hours. The data is partitioned by year and month and saved in the Apache Parquet format. The company's primary data catalog is AWS Glue Data Catalog, while querying is handled via Amazon Athena. At any moment, the dashboards query from a total of 200 GB of uncompressed data.

Which configuration will deliver the MOST COST-EFFECTIVE solution that satisfies these criteria?

A

Use QuickSight Enterprise edition. Configure 50 author users and 1,000 reader users. Configure an Athena data source and import the data into SPICE. Automatically refresh every 24 hours.

449
Q

Amazon Redshift is now being used by a financial institution to store sensitive data. The existing cluster is unencrypted, according to an audit. A database containing sensitive data must be encrypted using a hardware security module (HSM) with customer-managed keys to ensure compliance.

Which adjustments to the cluster are necessary to guarantee compliance?

A

Create a new HSM-encrypted Amazon Redshift cluster and migrate the data to the new cluster.

450
Q

A firm that monitors weather conditions at remote construction sites is implementing a system that will collect temperature data from the following two weather stations:

✑ Station A, which is equipped with ten sensors
✑ Station B, which is equipped with five sensors

On-site subject-matter specialists installed these weather stations.
Each sensor is identified by a unique ID. Amazon Kinesis Data Streams will be used to gather data from each sensor.
A single Amazon Kinesis data stream with two shards is created based on the total incoming and outgoing data throughput. Two partition keys are created from the station names. During testing, data from Station A experiences a bottleneck, while data from Station B does not. The overall stream throughput is verified to be less than the provisioned Kinesis Data Streams throughput.

How can this bottleneck be addressed without increasing the overall cost and complexity of the system, while still meeting the requirements for data collection?

A

Modify the partition key to use the sensor ID instead of the station name.
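A small sketch of a producer that uses the sensor ID as the partition key, as the answer suggests; the stream name and payload fields are assumptions.

import json
import boto3

kinesis = boto3.client("kinesis")

def publish_reading(sensor_id: str, temperature: float) -> None:
    # Partitioning on the sensor ID spreads Station A's ten sensors across both shards
    # instead of routing every Station A record to the same shard.
    kinesis.put_record(
        StreamName="weather-telemetry",  # placeholder stream name
        Data=json.dumps({"sensor_id": sensor_id, "temperature": temperature}).encode("utf-8"),
        PartitionKey=sensor_id,
    )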

451
Q

For ad-hoc queries, a financial institution uses Apache Hive on Amazon EMR. Users have expressed dissatisfaction with the performance.
According to a data analyst, the following is true:

✑ Around 90% of queries are submitted during the first hour after the market opens.

✑ Hadoop Distributed File System (HDFS) utilization never exceeds 10%.

Which approach would be most effective in resolving the performance issues?

A

Create instance group configurations for core and task nodes.

Create an automatic scaling policy to scale out the instance groups based on the Amazon CloudWatch YARNMemoryAvailablePercentage metric.

452
Q

An insurance firm has raw data in JSON format that is delivered on an ad hoc basis to an Amazon S3 bucket through an Amazon Kinesis Data Firehose delivery stream. Every eight hours, an AWS Glue crawler is scheduled to update the schema in the data catalog for the tables stored in the S3 bucket. Data analysts analyze the data using Apache Spark SQL on Amazon EMR, which is configured with the AWS Glue Data Catalog as the metastore. Data analysts report that they occasionally receive stale data. A data engineer must ensure that users have access to the most current data.

Which solution satisfies these criteria?

A

Run the AWS Glue crawler from an AWS Lambda function triggered by an S3:ObjectCreated:* event notification on the S3 bucket.
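A minimal sketch of the Lambda handler this answer implies, assuming a crawler name and that the function is wired to an s3:ObjectCreated:* notification; both are placeholders.

import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    # Invoked by the S3 ObjectCreated notification; refresh the Data Catalog right away.
    try:
        glue.start_crawler(Name="raw-json-crawler")  # placeholder crawler name
    except glue.exceptions.CrawlerRunningException:
        # A crawl is already in progress; the new objects will be picked up by it.
        pass
    return {"status": "ok"}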

453
Q

A big retailer has successfully transitioned to a data lake architecture based on Amazon S3. The company's marketing team uses Amazon Redshift and Amazon QuickSight to analyze data and generate and present insights. To guarantee that the marketing team gets the most up-to-date actionable data, a data analyst performs nightly Amazon Redshift refreshes using terabytes of the previous day's changes.
Users report that after the first nightly refresh, half of the most popular dashboards that were performing well before the refresh are now much slower. Amazon CloudWatch does not show any alerts.

Which of the following is the MOST LIKELY cause of performance degradation?

A

The nightly data refreshes left the dashboard tables in need of a vacuum operation that could not be automatically performed by Amazon Redshift due to ongoing user workloads.

454
Q

A financial services organization must collect daily stock trading data from exchanges and store it in a data warehouse. The organization needs the data to be streamed directly into the data warehouse, but also allows occasional SQL-based data modification. The solution should support complex, analytic queries that run with as little latency as possible. The solution must include a business intelligence dashboard that enables identification of the primary causes of stock price anomalies.

Which option best fulfills the needs of the business?

A

Use Amazon Kinesis Data Firehose to stream data to Amazon Redshift. Use Amazon Redshift as a data source for Amazon QuickSight to create a business intelligence dashboard.

455
Q

A shoe retailer located in the United States has launched a global website. All transaction data is stored in Amazon RDS in the us-east-1 Region, and curated historical transaction data is stored in Amazon Redshift. The business intelligence (BI) team wants to improve the user experience by providing a dashboard of shoe trends.
The BI team decides to display the website dashboards using Amazon QuickSight. During development, a team in Japan provisioned Amazon QuickSight in ap-northeast-1. The team is experiencing connectivity issues between Amazon QuickSight in ap-northeast-1 and Amazon Redshift in us-east-1.

Which option will address this problem while also satisfying the requirements?

A

Create a new security group for Amazon Redshift in us-east-1 with an inbound rule authorizing access from the appropriate IP address range for the Amazon QuickSight servers in ap-northeast-1.

456
Q

What is the Athena per-query control limit?

A

The per-query control limit specifies the total amount of data scanned per query. If any query that runs in the workgroup exceeds the limit, it is canceled

457
Q

A company's data analyst must ensure that Amazon Athena queries do not scan more than a prescribed amount of data, in order to control costs.
Queries that surpass the specified threshold must be promptly canceled.

What actions should the data analyst take to accomplish this?

A

For each workgroup, set the control limit for each query to the prescribed threshold.
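A hedged boto3 sketch of setting the per-query data-scanned limit on a workgroup; the workgroup name and the 1 TB threshold are assumptions.

import boto3

athena = boto3.client("athena")

# Cancel any query in this workgroup that scans more than 1 TB (value is in bytes).
athena.update_work_group(
    WorkGroup="analyst-workgroup",  # placeholder workgroup name
    ConfigurationUpdates={
        "BytesScannedCutoffPerQuery": 1_099_511_627_776,
        "EnforceWorkGroupConfiguration": True,
    },
)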

458
Q

A telecom business is looking for an anomaly detection system to identify fraudulent calls. Currently, the firm uses Amazon Kinesis to stream JSON-formatted phone call records from its on-premises database to Amazon S3. The current dataset comprises 200-column records for voice calls. To identify fraudulent calls, the solution needs to examine only five of these columns.
The organization is looking for a low-cost solution that leverages AWS and requires minimal effort and minimal familiarity with anomaly detection algorithms.

Which solution satisfies these criteria?

A

Use an AWS Glue job to transform the data from JSON to Apache Parquet. Use AWS Glue crawlers to discover the schema and build the AWS Glue Data Catalog. Use Amazon Athena to create a table with a subset of columns. Use Amazon QuickSight to visualize the data and then use Amazon QuickSight machine learning-powered anomaly detection.

459
Q

A business uses Amazon Redshift to host an enterprise reporting system. The application provides reporting capabilities to three distinct user groups: executives who run financial reports, data analysts who run long-running ad-hoc queries, and data engineers who run stored procedures and ETL operations. The executive team expects queries to run with excellent performance. The data engineering team expects queries to take a few minutes.

Which Amazon Redshift capability satisfies the task’s requirements?

A

Workload management (WLM)

460
Q

An Internet of Things business is developing a new device that will collect data on sleep patterns from people sleeping on an intelligent mattress. Sensors will transmit data to an Amazon S3 bucket. Each night, around 2 MB of data is created for each bed. Each user's data must be analyzed and summarized, and the results must be made available as quickly as possible. Time windowing and other operations are included as part of the process. Based on testing with a Python script, each run will need around 1 GB of memory and will take a few minutes to finish.

Which option is the MOST cost-effective approach to execute the script?

A

AWS Glue with a PySpark job

461
Q

A real estate business uses Apache HBase on Amazon EMR to power a mission-critical application. The Amazon EMR cluster is configured with a single master node. The company stores more than 5 TB of data in the Hadoop Distributed File System (HDFS). The organization is looking for a cost-effective way to increase the availability of its HBase data.

Which architectural design best fulfills the needs of the business?

A

Store the data on an EMR File System (EMRFS) instead of HDFS and enable EMRFS consistent view. Create a primary EMR HBase cluster with multiple master nodes. Create a secondary EMR HBase read-replica cluster in a separate Availability Zone. Point both clusters to the same HBase root directory in the same Amazon S3 bucket.

462
Q

A company ingests a large set of clickstream data in nested JSON format from different sources and stores it in Amazon S3. Data analysts need to analyze this data in combination with data stored in an Amazon Redshift cluster. Data analysts want to build a cost-effective and automated solution for this need.

A

Use the Relationalize class in an AWS Glue ETL job to transform the data and write the data back to Amazon S3. Use Amazon Redshift Spectrum to create external tables and join with the internal tables.

463
Q

A publisher website captures user activity and sends clickstream data to Amazon Kinesis Data Streams. The publisher wants to design a cost-effective solution to process the data to create a timeline of user activity within a session. The solution must be able to scale depending on the number of active sessions.

A

Include a session identifier in the clickstream data from the publisher website and use as the partition key for the stream. Use the Kinesis Client Library (KCL) in the consumer application to retrieve the data from the stream and perform the processing. Deploy the consumer application on Amazon EC2 instances in an
EC2 Auto Scaling group. Use an AWS Lambda function to reshard the stream based upon Amazon CloudWatch alarms

464
Q

A company is currently using Amazon DynamoDB as the database for a user support application. The company is developing a new version of the application that will store a PDF file for each support case ranging in size from 1–10 MB. The file should be retrievable whenever the case is accessed in the application.

A

Store the file in Amazon S3 and the object key as an attribute in the DynamoDB table
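A short sketch of this pattern, assuming a bucket name, table name, and key schema that are not part of the question.

import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("SupportCases")  # placeholder table name

def attach_case_pdf(case_id: str, pdf_bytes: bytes) -> None:
    key = f"case-attachments/{case_id}.pdf"
    s3.put_object(Bucket="support-case-files", Key=key, Body=pdf_bytes)  # placeholder bucket
    # Only the S3 object key is stored in DynamoDB; the 1-10 MB PDF itself stays in S3.
    table.update_item(
        Key={"case_id": case_id},
        UpdateExpression="SET attachment_key = :k",
        ExpressionAttributeValues={":k": key},
    )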

465
Q

A company needs to implement a near-real-time fraud prevention feature for its ecommerce site. User and order details need to be delivered to an Amazon SageMaker endpoint to flag suspected fraud. The amount of input data needed for the inference could be as much as 1.5 MB.

Which solution meets the requirements with the LOWEST overall latency?

A

Create an Amazon Managed Streaming for Kafka cluster and ingest the data for each order into a topic. Use a Kafka consumer running on Amazon EC2 instances to read these messages and invoke the Amazon SageMaker endpoint.

466
Q

A media company is migrating its on-premises legacy Hadoop cluster with its associated data processing scripts and workflow to an Amazon EMR environment running the latest Hadoop release. The developers want to reuse the Java code that was written for data processing jobs for the on-premises cluster.

A

Compile the Java program for the desired Hadoop version and run it using a CUSTOM_JAR step on the EMR cluster.

467
Q

An online retail company wants to perform analytics on data in large Amazon S3 objects using Amazon EMR. An Apache Spark job repeatedly queries the same data to populate an analytics dashboard. The analytics team wants to minimize the time to load the data and create the dashboard.

Propose TWO approaches.

A

Load the data into Spark DataFrames.

Use Amazon S3 Select to retrieve the data necessary for the dashboards from the S3 objects.

468
Q

A data engineer needs to create a dashboard to display social media trends during the last hour of a large company event. The dashboard needs to display the associated metrics with a consistent latency of less than 2 minutes.

A

Publish the raw social media data to an Amazon Kinesis Data Firehose delivery stream. Use Kinesis Data Analytics for SQL Applications to perform a sliding window analysis to compute the metrics and output the results to a Kinesis Data Streams data stream. Configure an AWS Lambda function to save the stream data to an Amazon DynamoDB table. Deploy a real-time dashboard hosted in an Amazon S3 bucket to read and display the metrics data stored in the DynamoDB table.

469
Q

A real estate company is receiving new property listing data from its agents through .csv files every day and storing these files in Amazon S3. The data analytics team created an Amazon QuickSight visualization report that uses a dataset imported from the S3 files. The data analytics team wants the visualization report to reflect the current data up to the previous day.

How can a data analyst meet these requirements?

A

Schedule the dataset to refresh daily.

470
Q

A financial company uses Amazon EMR for its analytics workloads. During the company’s annual security audit, the security team determined that none of the EMR clusters’ root volumes are encrypted. The security team recommends the company encrypt its EMR clusters’ root volume as soon as possible.

Which solution would meet these requirements?

A

Specify local disk encryption in a security configuration. Re-create the cluster using the newly created security configuration.
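A hedged sketch of the security configuration the answer refers to, created with boto3; the configuration name and KMS key ARN are placeholders. Local disk encryption with EBS encryption enabled covers the root device volume on recent EMR releases.

import json
import boto3

emr = boto3.client("emr")

security_config = {
    "EncryptionConfiguration": {
        "EnableInTransitEncryption": False,
        "EnableAtRestEncryption": True,
        "AtRestEncryptionConfiguration": {
            "LocalDiskEncryptionConfiguration": {
                "EncryptionKeyProviderType": "AwsKms",
                "AwsKmsKey": "arn:aws:kms:us-east-1:111122223333:key/example",  # placeholder
                "EnableEbsEncryption": True,
            }
        },
    }
}

emr.create_security_configuration(
    Name="emr-local-disk-encryption",
    SecurityConfiguration=json.dumps(security_config),
)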

471
Q

A company is providing analytics services to its marketing and human resources (HR) departments. The departments can only access the data through their business intelligence (BI) tools, which run Presto queries on an Amazon EMR cluster that uses the EMR File System (EMRFS). The marketing data analyst must be granted access to the advertising table only. The HR data analyst must be granted access to the personnel table only.

Which approach will satisfy these requirements?

A

Create separate IAM roles for the marketing and HR users. Assign the roles with AWS Glue resource based policies to access their corresponding tables in the AWS Glue Data Catalog. Configure Presto to use the AWS Glue Data Catalog as the Apache Hive metastore.

472
Q

A software company hosts an application on AWS, and new features are released weekly. As part of the application testing process, a solution must be developed that analyzes logs from each Amazon EC2 instance to ensure that the application is working as expected after each deployment. The collection and analysis solution should be highly available with the ability to display new information with minimal delays.
Which method should the company use to collect and analyze the logs?

A

Use the Amazon Kinesis Producer Library (KPL) agent on Amazon EC2 to collect and send data to Kinesis Data Firehose to further push the data to Amazon Elasticsearch Service and Kibana.

473
Q

A data analyst is using AWS Glue to organize, cleanse, validate, and format a 200 GB dataset. The data analyst triggered the job to run with the Standard worker type. After 3 hours, the AWS Glue job status is still RUNNING. Logs from the job run show no error codes. The data analyst wants to improve the job execution time without overprovisioning.

Which actions should the data analyst take?

A

Enable job metrics in AWS Glue to estimate the number of data processing units (DPUs). Based on the profiled metrics, increase the value of the maximum capacity job parameter.

474
Q

A company has a business unit uploading .csv files to an Amazon S3 bucket. The company’s data platform team has set up an AWS Glue crawler to do discovery, and create tables and schemas. An AWS Glue job writes processed data from the created tables to an Amazon Redshift database. The AWS Glue job handles column mapping and creating the Amazon Redshift table appropriately. When the AWS Glue job is rerun for any reason in a day, duplicate records are introduced into the Amazon Redshift table.
Which solution will update the Redshift table without duplicates when jobs are rerun?

A

Modify the AWS Glue job to copy the rows into a staging table. Add SQL commands to replace the existing rows in the main table as postactions in the DynamicFrameWriter class.
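A PySpark sketch of the staging-table pattern inside the Glue job, assuming glue_context and a transformed dynamic frame already exist earlier in the script; table, connection, and bucket names are placeholders.

# 'glue_context' and 'transformed_frame' are assumed to exist earlier in the Glue script.
post_actions = """
    BEGIN;
    DELETE FROM public.sales USING public.sales_staging
        WHERE public.sales.id = public.sales_staging.id;
    INSERT INTO public.sales SELECT * FROM public.sales_staging;
    DROP TABLE public.sales_staging;
    END;
"""

glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=transformed_frame,
    catalog_connection="redshift-connection",  # placeholder Glue connection
    connection_options={
        "dbtable": "public.sales_staging",     # load into the staging table first
        "database": "dev",
        "postactions": post_actions,           # then replace the matching main-table rows
    },
    redshift_tmp_dir="s3://example-temp-bucket/glue/",
)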

475
Q

A streaming application is reading data from Amazon Kinesis Data Streams and immediately writing the data to an Amazon S3 bucket every 10 seconds. The application is reading data from hundreds of shards. The batch interval cannot be changed due to a separate requirement. The data is being accessed by Amazon
Athena. Users are seeing degradation in query performance as time progresses.
Which action can help improve query performance?

A

Merge the files in Amazon S3 to form larger files.

476
Q

A company uses Amazon Elasticsearch Service (Amazon ES) to store and analyze its website clickstream data. The company ingests 1 TB of data daily using
Amazon Kinesis Data Firehose and stores one day’s worth of data in an Amazon ES cluster.
The company has very slow query performance on the Amazon ES index and occasionally sees errors from Kinesis Data Firehose when attempting to write to the index. The Amazon ES cluster has 10 nodes running a single index and 3 dedicated master nodes. Each data node has 1.5 TB of Amazon EBS storage attached and the cluster is configured with 1,000 shards. Occasionally, JVMMemoryPressure errors are found in the cluster logs.
Which solution will improve the performance of Amazon ES?

A

Decrease the number of Amazon ES shards for the index.

477
Q

A manufacturing company has been collecting IoT sensor data from devices on its factory floor for a year and is storing the data in Amazon Redshift for daily analysis. A data analyst has determined that, at an expected ingestion rate of about 2 TB per day, the cluster will be undersized in less than 4 months. A long-term solution is needed. The data analyst has indicated that most queries only reference the most recent 13 months of data, yet there are also quarterly reports that need to query all the data generated from the past 7 years. The chief technology officer (CTO) is concerned about the costs, administrative effort, and performance of a long-term solution.
Which solution should the data analyst use to meet these requirements?

A

Create a daily job in AWS Glue to UNLOAD records older than 13 months to Amazon S3 and delete those records from Amazon Redshift. Create an external table in Amazon Redshift to point to the S3 location. Use Amazon Redshift Spectrum to join to data that is older than 13 months.

478
Q

An insurance company has raw data in JSON format that is sent without a predefined schedule through an Amazon Kinesis Data Firehose delivery stream to an Amazon S3 bucket. An AWS Glue crawler is scheduled to run every 8 hours to update the schema in the data catalog of the tables stored in the S3 bucket. Data analysts analyze the data using Apache Spark SQL on Amazon EMR set up with AWS Glue Data Catalog as the metastore. Data analysts say that, occasionally, the data they receive is stale. A data engineer needs to provide access to the most up-to-date data.

Which solution meets these requirements?

A

Run the AWS Glue crawler from an AWS Lambda function triggered by an S3:ObjectCreated:* event notification on the S3 bucket.

479
Q

A company is planning to do a proof of concept for a machine learning (ML) project using Amazon SageMaker with a subset of existing on-premises data hosted in the company's 3 TB data warehouse. For part of the project, AWS Direct Connect is established and tested. To prepare the data for ML, data analysts are performing data curation. The data analysts want to perform multiple steps, including mapping, dropping null fields, resolving choice, and splitting fields. The company needs the fastest solution to curate the data for this project.
Which solution meets these requirements?

A

Ingest data into Amazon S3 using AWS DMS. Use AWS Glue to perform data curation and store the data in Amazon S3 for ML processing.

480
Q

A team of data scientists plans to analyze market trend data for their company's new investment strategy. The trend data comes from five different data sources in large volumes. The team wants to utilize Amazon Kinesis to support their use case. The team uses SQL-like queries to analyze trends and wants to send notifications based on certain significant patterns in the trends. Additionally, the data scientists want to save the data to Amazon S3 for archival and historical re-processing, and use AWS managed services wherever possible. The team wants to implement the lowest-cost solution.
Which solution meets these requirements?

A

Publish data to one Kinesis data stream. Deploy a Kinesis Data Analytics application to the stream for analyzing trends, and configure an AWS Lambda function as an output to send notifications using Amazon SNS. Configure Kinesis Data Firehose on the Kinesis data stream to persist data to an S3 bucket.

481
Q

A company currently uses Amazon Athena to query its global datasets. The regional data is stored in Amazon S3 in the us-east-1 and us-west-2 Regions. The data is not encrypted. To simplify the query process and manage it centrally, the company wants to use Athena in us-west-2 to query data from Amazon S3 in both
Regions. The solution should be as low-cost as possible.
What should the company do to achieve this goal?

A

Run the AWS Glue crawler in us-west-2 to catalog datasets in all Regions. Once the data is crawled, run Athena queries in us-west-2.

482
Q

A large company receives files from external parties in Amazon EC2 throughout the day. At the end of the day, the files are combined into a single file, compressed into a gzip file, and uploaded to Amazon S3. The total size of all the files is close to 100 GB daily. Once the files are uploaded to Amazon S3, an
AWS Batch program executes a COPY command to load the files into an Amazon Redshift cluster.
Which program modification will accelerate the COPY process?

A

Split the number of files so they are equal to a multiple of the number of slices in the Amazon Redshift cluster. Gzip and upload the files to Amazon S3. Run the COPY command on the files.

483
Q

A large ride-sharing company has thousands of drivers globally serving millions of unique customers every day. The company has decided to migrate an existing data mart to Amazon Redshift. The existing schema includes the following tables.
✑ A trips fact table for information on completed rides.
✑ A drivers dimension table for driver profiles.
✑ A customers fact table holding customer profile information.
The company analyzes trip details by date and destination to examine profitability by region. The drivers data rarely changes. The customers data frequently changes.
What table design provides optimal query performance?

A

Use DISTSTYLE KEY (destination) for the trips table and sort by date. Use DISTSTYLE ALL for the drivers table. Use DISTSTYLE EVEN for the customers table.

484
Q

Three teams of data analysts use Apache Hive on an Amazon EMR cluster with the EMR File System (EMRFS) to query data stored within each team's Amazon S3 bucket. The EMR cluster has Kerberos enabled and is configured to authenticate users from the corporate Active Directory. The data is highly sensitive, so access must be limited to the members of each team.
Which steps will satisfy the security requirements?

A

For the EMR cluster Amazon EC2 instances, create a service role that grants no access to Amazon S3. Create three additional IAM roles, each granting access to each team’s specific bucket. Add the service role for the EMR cluster EC2 instances to the trust policies for the additional IAM roles. Create a security configuration mapping for the additional IAM roles to Active Directory user groups for each team.

485
Q

A company is planning to create a data lake in Amazon S3. The company wants to create tiered storage based on access patterns and cost objectives. The solution must include support for JDBC connections from legacy clients, metadata management that allows federation for access control, and batch-based ETL using PySpark and Scala. Operational management should be limited.
Which combination of components can meet these requirements?

A

AWS Glue Data Catalog for metadata management

AWS Glue for Scala-based ETL

Amazon Athena for querying data in Amazon S3 using JDBC drivers

486
Q

A regional energy company collects voltage data from sensors attached to buildings. To address any known dangerous conditions, the company wants to be alerted when a sequence of two voltage drops is detected within 10 minutes of a voltage spike at the same building. It is important to ensure that all messages are delivered as quickly as possible. The system must be fully managed and highly available. The company also needs a solution that will automatically scale up as it covers additional cities with this monitoring feature. The alerting system is subscribed to an Amazon SNS topic for remediation.
Which solution meets these requirements?

A

Create an Amazon Kinesis data stream to capture the incoming sensor data and create another stream for alert messages. Set up AWS Application Auto Scaling on both. Create a Kinesis Data Analytics for Java application to detect the known event sequence, and add a message to the message stream. Configure an AWS Lambda function to poll the message stream and publish to the SNS topic.

487
Q

A company has developed an Apache Hive script to batch process data stored in Amazon S3. The script needs to run once every day and store the output in
Amazon S3. The company tested the script, and it completes within 30 minutes on a small local three-node cluster.
Which solution is the MOST cost-effective for scheduling and executing the script?

A

Create an AWS Lambda function to spin up an Amazon EMR cluster with a Hive execution step. Set KeepJobFlowAliveWhenNoSteps to false and disable the termination protection flag. Use Amazon CloudWatch Events to schedule the Lambda function to run daily.

488
Q

A company stores its sales and marketing data that includes personally identifiable information (PII) in Amazon S3. The company allows its analysts to launch their own Amazon EMR cluster and run analytics reports with the data. To meet compliance requirements, the company must ensure the data is not publicly accessible throughout this process. A data engineer has secured Amazon S3 but must ensure the individual EMR clusters created by the analysts are not exposed to the public internet.
Which solution should the data engineer use to meet this compliance requirement with the LEAST amount of effort?

A

Enable the block public access setting for Amazon EMR at the account level before any EMR cluster is created.

489
Q

A financial company uses Amazon S3 as its data lake and has set up a data warehouse using a multi-node Amazon Redshift cluster. The data files in the data lake are organized in folders based on the data source of each data file. All the data files are loaded to one table in the Amazon Redshift cluster using a separate
COPY command for each data file location. With this approach, loading all the data files into Amazon Redshift takes a long time to complete. Users want a faster solution with little or no increase in cost while maintaining the segregation of the data files in the S3 data lake.
Which solution meets these requirements?

A

Create a manifest file that contains the data file locations and issue a COPY command to load the data into Amazon Redshift.
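A hedged sketch of the manifest-based COPY; the manifest contents, bucket paths, role ARN, and cluster details are placeholders.

import boto3

# The manifest (uploaded to S3 beforehand) lists every data file location, for example:
# {"entries": [
#     {"url": "s3://example-data-lake/source-a/part-0001", "mandatory": true},
#     {"url": "s3://example-data-lake/source-b/part-0001", "mandatory": true}
# ]}
copy_sql = """
    COPY analytics.transactions
    FROM 's3://example-data-lake/manifests/daily-load.manifest'
    IAM_ROLE 'arn:aws:iam::111122223333:role/RedshiftCopyRole'
    MANIFEST;
"""

boto3.client("redshift-data").execute_statement(
    ClusterIdentifier="analytics-cluster",  # placeholder
    Database="dev",
    DbUser="loader",
    Sql=copy_sql,
)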

490
Q

A company’s marketing team has asked for help in identifying a high performing long-term storage service for their data based on the following requirements:
✑ The data size is approximately 32 TB uncompressed.
✑ There is a low volume of single-row inserts each day.
✑ There is a high volume of aggregation queries each day.
✑ Multiple complex joins are performed.
✑ The queries typically involve a small subset of the columns in a table.
Which storage service will provide the MOST performant solution?

A

Amazon Redshift

491
Q

A technology company is creating a dashboard that will visualize and analyze time-sensitive data. The data will come in through Amazon Kinesis Data Firehose with the buffer interval set to 60 seconds. The dashboard must support near-real-time data.
Which visualization solution will meet these requirements?

A

Select Amazon Elasticsearch Service (Amazon ES) as the endpoint for Kinesis Data Firehose. Set up a Kibana dashboard using the data in Amazon ES with the desired analyses and visualizations.

492
Q

A media company has been performing analytics on log data generated by its applications. There has been a recent increase in the number of concurrent analytics jobs running, and the overall performance of existing jobs is decreasing as the number of new jobs is increasing. The partitioned data is stored in
Amazon S3 One Zone-Infrequent Access (S3 One Zone-IA) and the analytic processing is performed on Amazon EMR clusters using the EMR File System
(EMRFS) with consistent view enabled. A data analyst has determined that it is taking longer for the EMR task nodes to list objects in Amazon S3.
Which action would MOST likely increase the performance of accessing log data in Amazon S3?

A

Increase the read capacity units (RCUs) for the shared Amazon DynamoDB table.

493
Q

A company has developed several AWS Glue jobs to validate and transform its data from Amazon S3 and load it into Amazon RDS for MySQL in batches once every day. The ETL jobs read the S3 data using a DynamicFrame. Currently, the ETL developers are experiencing challenges in processing only the incremental data on every run, as the AWS Glue job processes all the S3 input data on each run.
Which approach would allow the developers to solve the issue with minimal coding effort?

A

Enable job bookmarks on the AWS Glue jobs.
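Job bookmarks are switched on through the --job-bookmark-option job argument. A hedged boto3 sketch; the job name, role, and script location are placeholders, and the ETL script itself still needs to call Job.init()/commit() for bookmarks to advance.

import boto3

glue = boto3.client("glue")

glue.update_job(
    JobName="s3-to-rds-etl",  # placeholder job name
    JobUpdate={
        "Role": "arn:aws:iam::111122223333:role/GlueServiceRole",  # placeholder
        "Command": {"Name": "glueetl", "ScriptLocation": "s3://example-scripts/etl.py"},
        "DefaultArguments": {"--job-bookmark-option": "job-bookmark-enable"},
    },
)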

494
Q

A company is building a data lake and needs to ingest data from a relational database that has time-series data. The company wants to use managed services to accomplish this. The process needs to be scheduled daily and bring incremental data only from the source into Amazon S3.
What is the MOST cost-effective approach to meet these requirements?

A

Use AWS Glue to connect to the data source using JDBC Drivers. Ingest incremental records only using job bookmarks.

495
Q

An Amazon Redshift database contains sensitive user data. Logging is necessary to meet compliance requirements. The logs must contain database authentication attempts, connections, and disconnections. The logs must also contain each query run against the database and record which database user ran each query.
Which steps will create the required logs?

A

Enable audit logging for Amazon Redshift using the AWS Management Console or the AWS CLI.
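A hedged sketch of enabling both log destinations with boto3; the cluster, bucket, and parameter group names are placeholders. Per-query user activity logging additionally requires the enable_user_activity_logging parameter in a custom parameter group.

import boto3

redshift = boto3.client("redshift")

# Connection, authentication, and disconnection logs go to S3.
redshift.enable_logging(
    ClusterIdentifier="reporting-cluster",     # placeholder
    BucketName="example-redshift-audit-logs",  # placeholder
    S3KeyPrefix="audit/",
)

# Per-query user activity logging is controlled by the cluster parameter group.
redshift.modify_cluster_parameter_group(
    ParameterGroupName="custom-parameter-group",  # placeholder, must not be the default group
    Parameters=[
        {"ParameterName": "enable_user_activity_logging", "ParameterValue": "true"}
    ],
)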

496
Q

A data analyst is designing a solution to interactively query datasets with SQL using a JDBC connection. Users will join data stored in Amazon S3 in Apache ORC format with data stored in Amazon Elasticsearch Service (Amazon ES) and Amazon Aurora MySQL.
Which solution will provide the MOST up-to-date results?

A

Query all the datasets in place with Apache Presto running on Amazon EMR.

497
Q

A large company has a central data lake to run analytics across different departments. Each department uses a separate AWS account and stores its data in an
Amazon S3 bucket in that account. Each AWS account uses the AWS Glue Data Catalog as its data catalog. There are different data lake access requirements based on roles. Associate analysts should only have read access to their departmental data. Senior data analysts can have access in multiple departments including theirs, but for a subset of columns only.
Which solution achieves these required access patterns to minimize costs and administrative tasks?

A

Set up an individual AWS account for the central data lake. Use AWS Lake Formation to catalog the cross-account locations. On each individual S3 bucket, modify the bucket policy to grant S3 permissions to the Lake Formation service-linked role. Use Lake Formation permissions to add fine-grained access controls to allow senior analysts to view specific tables and columns.

498
Q

A company wants to improve user satisfaction for its smart home system by adding more features to its recommendation engine. Each sensor asynchronously pushes its nested JSON data into Amazon Kinesis Data Streams using the Kinesis Producer Library (KPL) in Java. Statistics from a set of failed sensors showed that, when a sensor is malfunctioning, its recorded data is not always sent to the cloud.
The company needs a solution that offers near-real-time analytics on the data from the most updated sensors.
Which solution enables the company to meet these requirements?

A

Update the sensors' code to use the PutRecord/PutRecords call from the Kinesis Data Streams API with the AWS SDK for Java. Use Kinesis Data Analytics to enrich the data based on a company-developed anomaly detection SQL script. Direct the output of the Kinesis Data Analytics application to a Kinesis Data Firehose delivery stream, enable the data transformation feature to flatten the JSON file, and set the Kinesis Data Firehose destination to an Amazon Elasticsearch Service cluster.

499
Q

A global company has different sub-organizations, and each sub-organization sells its products and services in various countries. The company’s senior leadership wants to quickly identify which sub-organization is the strongest performer in each country. All sales data is stored in Amazon S3 in Parquet format.
Which approach can provide the visuals that senior leadership requested with the least amount of effort?

A

Use Amazon QuickSight with Amazon Athena as the data source. Use heat maps as the visual type.

500
Q

A company has 1 million scanned documents stored as image files in Amazon S3. The documents contain typewritten application forms with information including the applicant first name, applicant last name, application date, application type, and application text. The company has developed a machine learning algorithm to extract the metadata values from the scanned documents. The company wants to allow internal data analysts to analyze and find applications using the applicant name, application date, or application text. The original images should also be downloadable. Cost control is secondary to query performance.
Which solution organizes the images and metadata to drive insights while meeting the requirements?

A

Index the metadata and the Amazon S3 location of the image file in Amazon Elasticsearch Service. Allow the data analysts to use Kibana to submit queries to the Elasticsearch cluster.

501
Q

A mobile gaming company wants to capture data from its gaming app and make the data available for analysis immediately. The data record size will be approximately 20 KB. The company is concerned about achieving optimal throughput from each device. Additionally, the company wants to develop a data stream processing application with dedicated throughput for each consumer.

Which solution would achieve this goal?

A

Have the app call the PutRecords API to send data to Amazon Kinesis Data Streams. Use the enhanced fan-out feature while consuming the data.

502
Q

A marketing company wants to improve its reporting and business intelligence capabilities. During the planning phase, the company interviewed the relevant stakeholders and discovered that:

✑ The operations team reports are run hourly for the current month's data.
✑ The sales team wants to use multiple Amazon QuickSight dashboards to show a rolling view of the last 30 days based on several categories. The sales team also wants to view the data as soon as it reaches the reporting backend.
✑ The finance team's reports are run daily for last month's data and once a month for the last 24 months of data.

Currently, there is 400 TB of data in the system with an expected additional 100 TB added every month. The company is looking for a solution that is as cost-effective as possible.

Which solution meets the company's requirements?

A

Store the last 2 months of data in Amazon Redshift and the rest of the months in Amazon S3. Set up an external schema and table for Amazon Redshift Spectrum. Configure Amazon QuickSight with Amazon Redshift as the data source.

503
Q

A media company wants to perform machine learning and analytics on the data residing in its Amazon S3 data lake. There are two data transformation requirements that will enable the consumers within the company to create reports:
✑ Daily transformations of 300 GB of data with different file formats landing in Amazon S3 at a scheduled time.
✑ One-time transformations of terabytes of archived data residing in the S3 data lake.

Which combination of solutions cost-effectively meets the company's requirements for transforming the data?

A

For daily incoming data, use AWS Glue crawlers to scan and identify the schema.

For daily incoming data, use AWS Glue workflows with AWS Glue jobs to perform transformations.

For archived data, use Amazon EMR to perform data transformations.

504
Q

A hospital uses wearable medical sensor devices to collect data from patients. The hospital is architecting a near-real-time solution that can ingest the data securely at scale. The solution should also be able to remove the patient’s protected health information (PHI) from the streaming data and store the data in durable storage.

Which solution meets these requirements with the least operational overhead?

A

Ingest the data using Amazon Kinesis Data Firehose to write the data to Amazon S3. Implement a transformation AWS Lambda function that parses the sensor data to remove all PHI.

505
Q

A transportation company uses IoT sensors attached to trucks to collect vehicle data for its global delivery fleet. The company currently sends the sensor data in small .csv files to Amazon S3. The files are then loaded into a 10-node Amazon Redshift cluster with two slices per node and queried using both Amazon Athena and Amazon Redshift. The company wants to optimize the files to reduce the cost of querying and also improve the speed of data loading into the Amazon Redshift cluster.

Which solution meets these requirements?

A

Use AWS Glue to convert the files from .csv to Apache Parquet to create 20 Parquet files. COPY the files into Amazon Redshift and query the files with Athena from Amazon S3.

506
Q

An online retail company with millions of users around the globe wants to improve its ecommerce analytics capabilities. Currently, clickstream data is uploaded directly to Amazon S3 as compressed files. Several times each day, an application running on Amazon EC2 processes the data and makes search options and reports available for visualization by editors and marketers. The company wants to make website clicks and aggregated data available to editors and marketers in minutes to enable them to connect with users more effectively.

Which options will help meet these requirements in the MOST efficient way?

A

Use Amazon Kinesis Data Firehose to upload compressed and batched clickstream records to Amazon Elasticsearch Service.

Use Kibana to aggregate, filter, and visualize the data stored in Amazon Elasticsearch Service. Refresh content performance dashboards in near-real time.

507
Q

A company is streaming its high-volume billing data (100 MBps) to Amazon Kinesis Data Streams. A data analyst partitioned the data on account_id to ensure that all records belonging to an account go to the same Kinesis shard and order is maintained. While building a custom consumer using the Kinesis Java SDK, the data analyst notices that, sometimes, the messages arrive out of order for account_id. Upon further investigation, the data analyst discovers the messages that are out of order seem to be arriving from different shards for the same account_id and are seen when a stream resize runs.

What is an explanation for this behavior and what is the solution?

A

The consumer is not processing the parent shard completely before processing the child shards after a stream resize. The data analyst should process the parent shard completely first before processing the child shards

508
Q

A media organization wants to do machine learning and analytics on the data stored in its Amazon S3 data lake. There are two data transformation criteria that must be met in order for the company's consumers to develop reports:

✑ Daily transformations of 300 GB of data with different file formats landing in Amazon S3 at a scheduled time.
✑ One-time transformations of terabytes of archived data residing in the S3 data lake.

Which combination of technologies is most cost-effective in meeting the company’s data transformation requirements?

A

For daily incoming data, use AWS Glue crawlers to scan and identify the schema.

For daily incoming data, use AWS Glue workflows with AWS Glue jobs to perform transformations.

For archived data, use Amazon EMR to perform data transformations.

509
Q

A manufacturing business manages its contact center using Amazon Connect and its customer relationship management (CRM) data with Salesforce. The data engineering team must create a pipeline that will ingest data from the contact center and CRM system into an Amazon S3-based data lake.

What is the MOST EFFICIENT method for ingesting the data into the data lake with the LEAST amount of operational overhead?

A

Use Amazon Kinesis Data Firehose to ingest Amazon Connect data and Amazon AppFlow to ingest Salesforce data.

510
Q

A business wants to do analytics on Elastic Load Balancing logs stored in Amazon S3. A data analyst must be able to query all the data for a certain year, month, or day, and must also be able to query a subset of the columns. The firm requires minimal operational overhead and the most cost-effective solution.

A

Use an AWS Glue job nightly to transform new log files into Apache Parquet format and partition by year, month, and day. Use AWS Glue crawlers to detect new partitions. Use Amazon Athena to query data.
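A PySpark sketch of the nightly conversion step described above, assuming the dynamic frame has already been read and given year/month/day columns earlier in the job; the S3 path is a placeholder.

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# 'elb_logs' is assumed to be a dynamic frame produced earlier in the job,
# already carrying year, month, and day columns derived from the log timestamps.
glue_context.write_dynamic_frame.from_options(
    frame=elb_logs,
    connection_type="s3",
    connection_options={
        "path": "s3://example-elb-log-lake/parquet/",  # placeholder destination
        "partitionKeys": ["year", "month", "day"],
    },
    format="parquet",
)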

511
Q

A marketing firm gets data from third-party suppliers and processes it using transient Amazon EMR clusters. The organization wants to host a persistent, reliable Apache Hive metastore that can be accessed concurrently by EMR clusters and numerous AWS services and accounts. Additionally, the metastore must be accessible at all times.

A

Use AWS Glue Data Catalog as the metastore

512
Q

A business intends to establish a data lake on Amazon S3. The organization wants to implement tiered storage depending on usage patterns and cost constraints. Support for JDBC connections from older clients, metadata management that enables federation for access control, and batch-based ETL utilizing PySpark and Scala are all required components of the solution. Management of operations should be minimal.

Which component combination satisfies these requirements?

A

AWS Glue Data Catalog for metadata management
AWS Glue for Scala-based ETL
Amazon Athena for querying data in Amazon S3 using JDBC drivers

513
Q

A corporation uses AWS to host a data lake that ingests data from several business units and supports queries using Amazon Athena. Amazon S3 is used as the storage layer, along with the AWS Glue Data Catalog. The organization wants to make the data accessible to its data scientists and business analysts. However, the organization must first regulate Athena's data access according to user roles and responsibilities.

What should the business do in order to implement these access restrictions with the LEAST amount of operational overhead possible?

A

Define security policy-based rules for the users and applications by role in AWS Lake Formation.

514
Q

For analytics purposes, an airline has been gathering metrics on flight activity. A recently completed proof of concept demonstrates how the firm provides data analysts with insights to help them improve on-time departures. The proof of concept used Amazon S3 objects that contained the metrics in .csv format, and Amazon Athena for data querying. As data volumes grow, the data analyst wants to optimize the storage solution to maximize query performance.

Which choices should the data analyst pursue in order to optimize performance as the data lake expands in size?

A

Compress the objects to reduce the data transfer I/O.
Use an S3 bucket in the same Region as Athena.
Preprocess the .csv data to Apache Parquet to reduce I/O by fetching only the data blocks needed for predicates.

515
Q

A firm that specializes in smart home automation must efficiently ingest and analyze messages from a variety of connected devices and sensors. The vast majority of these messages consist of many small files. These messages are ingested by Amazon Kinesis Data Streams and published to Amazon S3 using a Kinesis data stream consumer application. The Amazon S3 message data is then processed through a pipeline based on Amazon EMR and powered by scheduled PySpark jobs.
The data platform team controls data processing and is concerned with downstream data processing efficiency and cost. They want to maintain their usage of PySpark.

Which solution optimizes data processing efficiency and is well-architected?

A

Set up AWS Glue Python jobs to merge the small data files in Amazon S3 into larger files and transform them to Apache Parquet format. Migrate the downstream PySpark jobs from Amazon EMR to AWS Glue

516
Q

The learning management system (LMS) of an education provider is housed in a 100 TB data lake built on Amazon S3. The provider's LMS supports hundreds of schools. The provider wants to build an advanced analytics reporting platform with Amazon Redshift in order to handle complex queries efficiently.
95 percent of the time, system users will query the most recent four months of data, whereas 5% of queries will use data from the past twelve months.

A

Store the most recent 4 months of data in the Amazon Redshift cluster. Use Amazon Redshift Spectrum to query data in the data lake. Ensure the S3 Standard storage class is in use with objects in the data lake.

517
Q

The Amazon Kinesis SDK is used by a business to write data to Kinesis Data Streams. According to compliance regulations, data must be encrypted at rest using a key that can be rotated. The organization wants to meet this encryption requirement with the least amount of coding effort possible.

How can these requirements be met?

A

Create a customer master key (CMK) in AWS KMS. Assign the CMK an alias. Enable server-side encryption on the Kinesis data stream using the CMK alias as the KMS master key.
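A small boto3 sketch of turning on server-side encryption with the CMK alias, as the answer describes; the stream name and alias are placeholders.

import boto3

kinesis = boto3.client("kinesis")

kinesis.start_stream_encryption(
    StreamName="trading-events",       # placeholder stream name
    EncryptionType="KMS",
    KeyId="alias/kinesis-stream-key",  # alias of the rotatable CMK created in AWS KMS
)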

518
Q

A corporation wants to enrich application logs in near-real time and then analyze the enriched dataset. The application is deployed across various Availability Zones on Amazon EC2 instances and records its activity using Amazon CloudWatch Logs. The source of the enrichment is stored in an Amazon DynamoDB table.

Which solution satisfies the event collection and enrichment requirements?

A

Use a CloudWatch Logs subscription to send the data to Amazon Kinesis Data Firehose. Use AWS Lambda to transform the data in the Kinesis Data Firehose delivery stream and enrich it with the data in the DynamoDB table. Configure Amazon S3 as the Kinesis Data Firehose delivery destination.

519
Q

A business is developing a service for monitoring vehicle fleets. The startup gathers IoT data from a device installed in each car and loads it into Amazon Redshift in near-real time. At various intervals during the day, fleet owners upload .csv files containing vehicle reference data to Amazon S3. A nightly routine loads the vehicle reference data from Amazon S3 into Amazon Redshift. To enable reporting and dashboards, the business joins the IoT data from the device with the vehicle reference data. Fleet owners are frustrated when their dashboards do not update for a day.

Which method would result in the SHORTEST time interval between uploading reference data to Amazon S3 and the update being seen in the owners’ dashboards?

A

Use S3 event notifications to trigger an AWS Lambda function to copy the vehicle reference data into Amazon Redshift immediately when the reference data is uploaded to Amazon S3.

520
Q

A business rewards consumers who are physically active. The organization wants to determine consumers' level of activity by having them use a mobile application to track the number of steps they take each day. The organization must ingest and analyze live data in near-real time. The processed data must be retained and made accessible for analytics purposes for a period of one year.

Which method satisfies these criteria with the LEAST amount of operational overhead?

A

Ingest the data into Amazon Kinesis Data Streams by using an Amazon API Gateway API as a Kinesis proxy. Run Amazon Kinesis Data Analytics on the stream data. Output the processed data into Amazon S3 by using Amazon Kinesis Data Firehose. Use Amazon Athena to run analytics calculations. Use S3 Lifecycle rules to transition objects to S3 Glacier after 1 year.

521
Q

A manufacturing business uses Amazon S3 to store data from its operating systems. The business analysts of the organization need one-time queries against the data in Amazon S3 using Amazon Athena. The organization needs to connect to the Athena network through a JDBC connection from the on-premises network. The corporation has established a virtual private cloud (VPC) security policy that prohibits requests to AWS services from traversing the Internet.

Which measures should a data analytics professional take in combination to achieve these requirements?

A

Establish an AWS Direct Connect connection between the on-premises network and the VPC.

Configure the JDBC connection to use an interface VPC endpoint for Athena.

522
Q

A business requires the storage of JSON objects containing log data. The objects are created by eight AWS-hosted apps. Six programs create a total of 500 KiB of data each second, while two applications generate up to 2 MiB per second.
A data engineer is tasked with developing a scalable solution for collecting and storing the usage data in an Amazon S3 bucket. Before the usage data objects are stored in Amazon S3, they must be reformatted, converted to .csv format, and compressed. The corporation requires that the solution include as little bespoke code as feasible and has allowed the data engineer to request a service limit increase if necessary.

Which solution satisfies these criteria?

A

Configure an Amazon Kinesis Data Firehose delivery stream for each application. Write AWS Lambda functions to read log data objects from the stream for each application. Have the function perform reformatting and .csv conversion. Enable compression on all the delivery streams.

523
Q

A business is transitioning from an on-premises Apache Hadoop cluster to an Amazon EMR cluster. The cluster is operational only during normal business hours. Because of a corporate requirement to avoid intraday cluster failures, the EMR cluster must be highly available. The data must survive when the cluster is terminated at the end of each business day.

Which configurations of the EMR cluster would allow it to achieve these requirements?

A

EMR File System (EMRFS) for storage
AWS Glue Data Catalog as the metastore for Apache Hive
Multiple master nodes in a single Availability Zone

524
Q

An operations team observes that a few AWS Glue tasks are failing for a certain ETL application. The AWS Glue tasks read a large number of tiny JSON files from an Amazon S3 bucket and write them in Apache Parquet format to a separate S3 bucket. After an initial examination, a data engineer sees the following error message in the AWS Glue console’s History tab: “Command failed with exit code 1.”
Further examination reveals that the driver memory profile for the unsuccessful tasks rapidly exceeds the safe threshold of 50% utilization and quickly hits 90-95 percent. The average memory utilization across all executors remains less than 4%.
Additionally, when investigating the associated Amazon CloudWatch Logs, the data engineer detects the following issue.

What actions should the data engineer take to resolve the issue in the MOST cost-effective manner possible?

A

Modify the AWS Glue ETL code to use the ‘groupFiles’: ‘inPartition’ feature.
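
A sketch of what the grouped read might look like in the Glue script; the S3 path and group size are hypothetical:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Group many small JSON files into larger read units so the driver does not track each file.
dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://source-bucket/json/"],  # hypothetical path
        "recurse": True,
        "groupFiles": "inPartition",
        "groupSize": "134217728",               # example: ~128 MB per group
    },
    format="json",
)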

525
Q

A gaming corporation is in the process of establishing a serverless data lake. The firm streams data into Amazon Kinesis Data Streams and writes it to Amazon S3 using Amazon Kinesis Data Firehose. The organization uses a 10 MB S3 buffer size and a 90-second buffer interval. The organization uses AWS Glue to integrate and change the data before publishing it back to Amazon S3.
Recently, the company’s data volume has increased significantly. AWS Glue ETL tasks commonly fail due to an OutOfMemoryError.

Which methods are cost-effective in resolving this issue?

A

Use the groupFiles setting in the AWS Glue ETL job to merge small S3 files and rerun AWS Glue ETL jobs.

Update the Kinesis Data Firehose S3 buffer size to 128 MB. Update the buffer interval to 900 seconds.

526
Q

A retail enterprise operates 15 locations in six cities throughout the United States. Once a month, the sales team requires an Amazon QuickSight visual that enables easy identification of revenue patterns across cities and locations. Additionally, the visualization aids in identifying outliers that need additional examination.

Which QuickSight visual type best fits the expectations of the sales team?

A

Heat map

527
Q

A business examines its data in an Amazon Redshift data warehouse that is presently comprised of three dense storage nodes. The firm needs to load an extra 4 TB of user data into Amazon Redshift as a result of a recent business acquisition. The technical team will integrate all user data and perform intricate computations that will use a significant amount of I/O resources. The organization must adapt the capacity of the cluster to accommodate changing analytical and storage requirements.

Which solution satisfies these criteria?

A

Resize the cluster using elastic resize with dense compute nodes.

528
Q

A corporation wants to increase customer satisfaction with its smart home system by expanding its recommendation engine’s capabilities. Each sensor uses the Java Kinesis Producer Library (KPL) to asynchronously submit its nested JSON data to Amazon Kinesis Data Streams. According to statistics gathered from a collection of failed sensors, when a sensor fails, its recorded data is not always delivered to the cloud.
The organization needs a system that enables near-real-time analytics on the most recent sensor data.

Which solution allows the business to satisfy these standards?

A

Update the sensors’ code to use the PutRecord/PutRecords call from the Kinesis Data Streams API with the AWS SDK for Java. Use Kinesis Data Analytics to enrich the data based on a company-developed anomaly detection SQL script. Direct the output of the Kinesis Data Analytics application to a Kinesis Data Firehose delivery stream, enable the data transformation feature to flatten the JSON file, and set the Kinesis Data Firehose destination to an Amazon OpenSearch Service (Amazon Elasticsearch Service) cluster.

529
Q

A global online retailer with millions of consumers seeks to enhance its ecommerce analytics capabilities. At the moment, clickstream data is immediately transferred to Amazon S3 as compressed files. Each day, an application running on Amazon EC2 analyses the data and provides editors and marketers with search options and reports for visualization. The company’s goal is to make website clicks and aggregated data instantly accessible to editors and marketers, enabling them to communicate more effectively with people.

Which combination of options will meet these requirements in the most effective manner possible?

A

Use Amazon Kinesis Data Firehose to upload compressed and batched clickstream records to Amazon OpenSearch Service (Amazon Elasticsearch Service).

Use OpenSearch Dashboards (Kibana) to aggregate, filter, and visualize the data stored in Amazon OpenSearch Service (Amazon Elasticsearch Service). Refresh content performance dashboards in near-real time.

530
Q

An online retailer wishes to build a system for near-real-time clickstream analytics. The firm wants to perform the data analysis using an open-source application.
While the analytics application will only analyze the raw data once, other applications will need quick access to the raw data for a period of up to one year.

Which solution satisfies these conditions with the LEAST operational overhead?

A

Use Amazon Kinesis Data Streams to collect the data. Use Amazon Kinesis Data Analytics with Apache Flink to process the data in real time. Set the retention period of the Kinesis data stream to 8,760 hours.
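
Setting the one-year retention is a single API call; a boto3 sketch with a hypothetical stream name:

import boto3

kinesis = boto3.client("kinesis")

# 8,760 hours = 365 days, so other applications can re-read the raw data for up to a year.
kinesis.increase_stream_retention_period(
    StreamName="clickstream",    # hypothetical stream name
    RetentionPeriodHours=8760,
)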

531
Q

A corporation is running Amazon Athena queries against a cross-account AWS Glue Data Catalog using an AWS Lambda function. The following error is returned by a query:

The error message indicates that the size of the response payload exceeds the maximum permitted. The table being queried is already partitioned, and the data is stored in an Amazon S3 bucket using the Apache Hive partition format.

Which solution will fix this error?

A

Run the MSCK REPAIR TABLE command on the queried table.
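
A minimal sketch of issuing the repair command through the Athena API; the database, table, and results bucket are hypothetical:

import boto3

athena = boto3.client("athena")

# Registers all Hive-style partitions found in S3 with the table's metadata.
athena.start_query_execution(
    QueryString="MSCK REPAIR TABLE sales_data;",                            # hypothetical table
    QueryExecutionContext={"Database": "analytics"},                        # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://athena-results-bucket/"},  # hypothetical bucket
)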

532
Q

Recently, a corporation built a test AWS account with the purpose of creating a development environment. Additionally, the organization opened a production AWS account in a different AWS Region. The organization wishes to transmit log data from Amazon CloudWatch Logs in its production account to an Amazon Kinesis data stream in its test account as part of its security testing.

Which solution will enable the business to attain this objective?

A

Create a destination data stream in Kinesis Data Streams in the test account with an IAM role and a trust policy that allow CloudWatch Logs in the production account to write to the test account. Create a subscription filter in the production account’s CloudWatch Logs to target the Kinesis data stream in the test account as its destination.
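
In practice this wiring also involves a CloudWatch Logs destination (and a destination access policy) in the receiving account; a hedged boto3 sketch with hypothetical names, Regions, account IDs, and ARNs:

import boto3

# Test account: a CloudWatch Logs destination that fronts the Kinesis data stream.
logs_test = boto3.client("logs", region_name="us-west-2")                   # hypothetical Region
logs_test.put_destination(
    destinationName="security-testing-destination",
    targetArn="arn:aws:kinesis:us-west-2:222222222222:stream/test-stream",  # hypothetical stream
    roleArn="arn:aws:iam::222222222222:role/CWLtoKinesisRole",              # role CloudWatch Logs assumes
)

# Production account: subscribe the log group to the destination in the test account.
logs_prod = boto3.client("logs", region_name="us-east-1")                   # hypothetical Region
logs_prod.put_subscription_filter(
    logGroupName="/app/production-logs",                                    # hypothetical log group
    filterName="to-test-account",
    filterPattern="",                                                       # forward all events
    destinationArn="arn:aws:logs:us-west-2:222222222222:destination:security-testing-destination",
)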

533
Q

A company hosts an Apache Flink application on premises. The application processes data from several Apache Kafka clusters. The data comes from a range of sources, including web applications, mobile applications, and operational databases. The organization has already migrated some of these sources to AWS and now wants to relocate the Flink application as well. The organization must guarantee that data stored in databases inside the VPC is not sent over the internet. The application must be able to process data from the company’s AWS solution, on-premises resources, and the public internet.

Which method satisfies these criteria with the LEAST amount of operational overhead?

A

Create an Amazon Kinesis Data Analytics application by uploading the compiled Flink .jar file. Use Amazon Kinesis Data Streams to collect data that comes from applications and databases within the VPC and the public internet. Configure the Kinesis Data Analytics application to have sources from Kinesis Data Streams and any on-premises Kafka clusters by using AWS Client VPN or AWS Direct Connect.

534
Q

A business has a marketing and a finance department. The departments store data in Amazon S3 in their own AWS Organizations accounts. Both departments catalog and safeguard their data using AWS Lake Formation. The departments share various databases and tables.
The marketing department requires safe access to a number of tables maintained by the finance department.

Which two steps are necessary to complete this process?

A

The finance department grants Lake Formation permissions for the tables to the external account for the marketing department.

The marketing department creates an IAM role that has permissions to the Lake Formation tables.

535
Q

A market data organization compiles data from several sources in order to produce a complete picture of product consumption in various nations. The firm intends to sell this data to other parties on a subscription basis. To do this, the business must make its data securely accessible to third parties that are also AWS customers.

What should the business do to ensure that these needs are met with the LEAST amount of operational overhead possible?

A

Upload the data to AWS Data Exchange for storage. Share the data by using the AWS Data Exchange sharing wizard.

536
Q

A corporation provides toll services on roadways around the nation and gathers data to better understand traffic patterns. Analysts have requested the ability to run near-real-time traffic analytics. The organization is interested in developing an ingestion pipeline that feeds all data into an Amazon Redshift cluster and notifies operations employees when toll traffic at a certain toll station falls below a predefined threshold. Amazon S3 is used to store station data and associated threshold values.

Which strategy is the MOST EFFECTIVE in meeting these requirements?

A

Use Amazon Kinesis Data Firehose to collect data and deliver it to Amazon Redshift and Amazon Kinesis Data Analytics simultaneously. Create a reference data source in Kinesis Data Analytics to temporarily store the threshold values from Amazon S3 and compare the count of vehicles for a particular toll station against its corresponding threshold value. Use AWS Lambda to publish an Amazon Simple Notification Service (Amazon SNS) notification if the threshold is not met.

537
Q

A business examines historical data and requires access to data stored in Amazon S3. Each day, new data is created as .csv files and saved in Amazon S3.
Amazon Athena is being used by the company’s analysts to run SQL queries on a recent subset of the company’s entire data. The volume of data fed into Amazon S3 has risen significantly over time, as has the query latency.

Which options should the business consider using to boost query performance?

A

Run a daily AWS Glue ETL job to convert the data files to Apache Parquet and to partition the converted files. Create a periodic AWS Glue crawler to automatically crawl the partitioned data on a daily basis. (See the sketch after the next option.)

Run a daily AWS Glue ETL job to compress the data files by using the .lzo format. Query the compressed data.
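
A rough sketch of the Parquet conversion with partitioning; all paths and partition columns are hypothetical:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read the day's raw .csv data.
raw = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://raw-bucket/daily/"]},  # hypothetical path
    format="csv",
    format_options={"withHeader": True},
)

# Write it back as partitioned Parquet so Athena scans only the partitions a query needs.
glueContext.write_dynamic_frame.from_options(
    frame=raw,
    connection_type="s3",
    connection_options={
        "path": "s3://curated-bucket/data/",        # hypothetical path
        "partitionKeys": ["year", "month", "day"],  # assumed partition columns
    },
    format="parquet",
)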

538
Q

A bank works within the confines of a regulated environment. According to the country’s compliance standards, client data for each state should be available exclusively to bank staff based in that state. Bank workers in one state should not have access to data on clients who have supplied a residence address in another state.
The bank’s marketing department has employed a data analyst to mine client data in preparation for the launch of a new campaign in certain states. At the moment, data relating to each customer account and its home state is maintained in a tabular .csv file contained inside a single Amazon S3 folder within a private S3 bucket. The S3 folder is 2 GB in size uncompressed. The marketing team is unable to access this folder due to the country’s compliance rules.
The data analyst is accountable for ensuring that the marketing team has one-time access to consumer data for campaign analytics projects while adhering to all applicable compliance rules and regulations.

Which option should the data analyst use in order to satisfy the necessary objectives with the LEAST amount of setup work possible?

A

Load tabular data from Amazon S3 to Amazon QuickSight Enterprise edition by directly importing it as a data source. Use the built-in row-level security feature in Amazon QuickSight to provide marketing employees with appropriate data access under compliance controls. Delete Amazon QuickSight data sources after the project is complete.

539
Q

A reseller with hundreds of AWS accounts receives AWS Cost and Usage Reports in an Amazon S3 bucket. The reports are delivered to the S3 bucket in the following format:
//yyyymmdd-yyyymmdd/.parquet
An AWS Glue crawler crawls the S3 bucket and creates a table in an AWS Glue Data Catalog. Business analysts query the table and generate monthly summary reports for the AWS accounts using Amazon Athena. Due to the buildup of reports over the previous five years, business analysts are experiencing poor query response times. The business analysts have asked the operations team to make adjustments to improve query performance.

Which course of action should the operations team adopt in order to comply with these requirements?

A

Partition the data by account ID, year, and month

540
Q

A data analyst is developing a system for querying datasets interactively using SQL and a JDBC connection. Users will be able to join data stored in Amazon S3 in the Apache ORC format with data stored in Amazon OpenSearch Service (Amazon Elasticsearch Service) and Amazon Aurora MySQL.

Which option will provide the MOST CURRENT information?

A

Query all the datasets in place with Apache Presto running on Amazon EMR.

541
Q

A data engineering team at a shared workspace firm is tasked with developing a consolidated logging system for all weblogs created by the space reservation system. The firm operates a fleet of Amazon EC2 instances that handle web-based requests for shared space bookings. The data engineering team’s goal is to aggregate all weblogs into a service that will allow near-real-time search. The team is not interested in managing the logging system’s maintenance and operation.

Which option enables the data engineering team to configure the web logging system on AWS efficiently?

A

Set up the Amazon CloudWatch agent to stream weblogs to CloudWatch logs and subscribe the Amazon Kinesis Data Firehose delivery stream to CloudWatch. Choose Amazon OpenSearch Service (Amazon Elasticsearch Service) as the end destination of the weblogs.

542
Q

A media organization has been analyzing log data produced by its apps. The number of concurrent analytics jobs has recently increased, and the overall performance of old tasks has decreased as the number of new jobs has increased. The partitioned data is stored in Amazon S3 One Zone-Infrequent Access (S3 One Zone-IA) buckets, and the analytic processing occurs on Amazon EMR clusters through the EMR File System (EMRFS) with consistent view enabled. A data analyst discovered that the EMR task nodes are taking longer to list items in Amazon S3.

Which step is most likely to improve the performance of log data access in Amazon S3?

A

Increase the read capacity units (RCUs) for the shared Amazon DynamoDB table.

543
Q

What is EMRFS consistent view?

A

EMRFS consistent view uses a DynamoDB table to track objects in Amazon S3 that have been synced with or created by EMRFS.

544
Q

A corporation owns facilities across the globe that are equipped with IoT devices. Amazon Kinesis Data Streams is used to send data from the devices to Amazon S3. The operations team of the organization wishes to get insights from the IoT data in order to check data quality during ingestion. The insights must be extracted in near-real time, and the output must be recorded to Amazon DynamoDB for further analysis.

Which solution satisfies these criteria?

A

Connect Amazon Kinesis Data Analytics to analyze the stream data. Save the output to DynamoDB by using an AWS Lambda function.

545
Q

A healthcare organization collects, ingests, and stores electronic health record (EHR) data on its patients using AWS data and analytics technologies. The raw EHR data is stored in Amazon S3 in JSON format and is updated hourly. The organization wishes to retain the data catalog and associated metadata in an AWS Glue Data Catalog in order to provide analytics utilizing Amazon Athena or Amazon Redshift Spectrum.
The following conditions apply when defining tables in the Data Catalog:

✑ Choose the catalog table name and do not rely on the catalog table naming algorithm.
✑ Keep the table updated with new partitions loaded in the respective S3 bucket prefixes.

Which option satisfies these criteria with the least effort?

A

Use the AWS Glue API CreateTable operation to create a table in the Data Catalog. Create an AWS Glue crawler and specify the table as the source.

546
Q

Each day, a huge ride-sharing firm employs thousands of drivers worldwide to serve millions of unique consumers. The organization has chosen Amazon Redshift as the platform for migrating an existing data mart. The following tables are included in the current schema.

✑ A trips fact table for information on completed rides.
✑ A drivers dimension table for driver profiles.
✑ A customers fact table holding customer profile information.

The firm analyzes travel information by date and destination in order to determine regional profitability. The driver’s information is seldom updated. Customers’ information is regularly updated.
Which table architecture optimizes query performance?

A

Use DISTSTYLE KEY (destination) for the trips table and sort by date. Use DISTSTYLE ALL for the drivers table. Use DISTSTYLE EVEN for the customers table.

547
Q

A business is streaming its high-volume billing data to Amazon Kinesis Data Streams at a rate of 100 MBps. A data analyst partitioned the data by account_id to guarantee that all records associated with a particular account are stored in the same Kinesis shard and that order is preserved. While developing a custom consumer using the Kinesis Java SDK, the data analyst discovers that messages for an account_id sometimes arrive out of order. Further analysis reveals that the out-of-order messages appear to come from different shards for the same account_id and are observed when a stream resize is conducted.

What is the rationale for this conduct, and what is the remedy?

A

The consumer is not processing the parent shard completely before processing the child shards after a stream resize. The data analyst should process the parent shard completely first before processing the child shards.

548
Q

An online store must implement a mechanism for tracking product sales. For reporting purposes, source data is exported from an external online transaction processing (OLTP) system. Each day, roll-up data is calculated for the previous day’s actions. The reporting system must meet the following criteria:

✑ Maintain daily roll-up data for a period of one year.
✑ After one year, store the daily roll-up data for easy access on an as-needed basis.
✑ Keep source data exports for a period of five years. Query access will be required only for re-evaluation, which may occur during the first 90 days.

Which combination of operations will achieve these criteria while minimizing storage costs?

A

Store the source data initially in the Amazon S3 Standard-Infrequent Access (S3 Standard-IA) storage class. Apply a lifecycle configuration that changes the storage class to Amazon S3 Glacier Deep Archive 90 days after creation, and then deletes the data 5 years after creation.

Store the daily roll-up data initially in the Amazon S3 Standard storage class. Apply a lifecycle configuration that changes the storage class to Amazon S3 Standard-Infrequent Access (S3 Standard-IA) 1 year after data creation.
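
The two rules could be expressed roughly like this (bucket name and prefixes are hypothetical; the source exports are assumed to be uploaded directly to S3 Standard-IA):

import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="reporting-data",    # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {   # Source exports: Deep Archive after 90 days, delete after 5 years.
                "ID": "source-exports",
                "Filter": {"Prefix": "source/"},    # hypothetical prefix
                "Status": "Enabled",
                "Transitions": [{"Days": 90, "StorageClass": "DEEP_ARCHIVE"}],
                "Expiration": {"Days": 1825},
            },
            {   # Daily roll-ups: keep in S3 Standard for a year, then move to Standard-IA.
                "ID": "daily-rollups",
                "Filter": {"Prefix": "rollup/"},    # hypothetical prefix
                "Status": "Enabled",
                "Transitions": [{"Days": 365, "StorageClass": "STANDARD_IA"}],
            },
        ]
    },
)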

549
Q

An ecommerce firm ingests a vast amount of JSON-formatted clickstream data and stores it in Amazon S3. Business analysts from several product divisions must examine the data using Amazon Athena. The company’s analytics team must develop a method to track each product division’s daily Athena data usage. Additionally, the solution must provide a warning when a division exceeds its quota.
Which method satisfies these criteria with the LEAST amount of operational overhead?

A

Create an Athena workgroup for each division. Configure a data usage control for each workgroup and a time period of 1 day. Configure an action to send notifications to an Amazon Simple Notification Service (Amazon SNS) topic.
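
A sketch of creating one such workgroup with a per-query scan limit; the workgroup name, results bucket, and limit value are hypothetical, and the daily aggregate data usage control with its SNS action is configured on the workgroup afterward:

import boto3

athena = boto3.client("athena")

athena.create_work_group(
    Name="division-a",    # hypothetical division
    Configuration={
        "ResultConfiguration": {"OutputLocation": "s3://athena-results-division-a/"},  # hypothetical
        "BytesScannedCutoffPerQuery": 10 * 1024 * 1024 * 1024,  # example: 10 GB per query
        "PublishCloudWatchMetricsEnabled": True,                # needed for usage alerts
    },
)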

550
Q

A business is constructing a data lake and needs to ingest data from a relational database that contains time series data. The corporation wishes to do this by using managed services. The process must run on a daily schedule and move incremental data from the source into Amazon S3.

Which strategy is the MOST cost-effective in meeting these requirements?

A

Use AWS Glue to connect to the data source using JDBC Drivers. Ingest incremental records only using job bookmarks.

551
Q

A mobile gaming firm wants to collect data from its gaming application and make it instantly accessible for analysis. Each data record will be roughly 20 KB in size. The firm is focused on obtaining the highest possible throughput from each device. Additionally, the business intends to build a data stream processing application capable of providing dedicated throughput to each consumer.

Which solution would accomplish this objective?

A

Have the app call the PutRecords API to send data to Amazon Kinesis Data Streams. Use the enhanced fan-out feature while consuming the data.

552
Q

A business unit of a firm uploads .csv files to an Amazon S3 bucket. The company’s data platform team has configured an AWS Glue crawler to perform discovery and create tables and schemas. An AWS Glue job loads processed data from the newly generated tables into an Amazon Redshift database. The AWS Glue job correctly maps columns and creates the Amazon Redshift table. Whenever the AWS Glue job is restarted within a day for any reason, duplicate entries are added to the Amazon Redshift database.

Which solution ensures that when tasks are restarted, the Redshift database is updated without duplication?

A

Modify the AWS Glue job to copy the rows into a staging table. Add SQL commands to replace the existing rows in the main table as postactions in the DynamicFrameWriter class.
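
A sketch of the staging pattern in the Glue job; the connection, tables, key column, and temp path are hypothetical:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# SQL that Redshift runs after the staging load: replace matching rows, then drop the staging table.
post_actions = (
    "BEGIN;"
    "DELETE FROM sales USING sales_staging WHERE sales.order_id = sales_staging.order_id;"
    "INSERT INTO sales SELECT * FROM sales_staging;"
    "DROP TABLE sales_staging;"
    "END;"
)

def load_without_duplicates(output_dyf):
    # output_dyf is the DynamicFrame produced by the earlier transform steps of the job.
    glueContext.write_dynamic_frame.from_jdbc_conf(
        frame=output_dyf,
        catalog_connection="redshift-connection",               # hypothetical Glue connection
        connection_options={
            "dbtable": "sales_staging",                         # hypothetical staging table
            "database": "analytics",                            # hypothetical database
            "preactions": "DROP TABLE IF EXISTS sales_staging;",
            "postactions": post_actions,
        },
        redshift_tmp_dir="s3://glue-temp-bucket/redshift/",     # hypothetical temp location
    )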

553
Q

A data analyst is creating an Amazon QuickSight dashboard that uses consolidated sales data stored in Amazon Redshift. Access must be controlled so that a salesperson in Sydney, Australia, can see only Australian data, while a salesperson in New York can see only data from the United States (US).

What steps should the data analyst take to guarantee adequate data security?

A

Deploy QuickSight Enterprise edition to implement row-level security (RLS) to the sales table.

554
Q

A big telecoms business intends to establish a data catalog and metadata management system for several AWS data sources. The catalog will be used to manage the metadata of all the objects recorded in the data stores. Structured sources such as Amazon RDS and Amazon Redshift are combined with semistructured sources such as JSON and XML files stored in Amazon S3. The catalog must be updated regularly, must be capable of detecting changes to object metadata, and must require as little management as feasible.

Which solution satisfies these criteria?

A

Use the AWS Glue Data Catalog as the central metadata repository. Use AWS Glue crawlers to connect to multiple data stores and update the Data Catalog with metadata changes. Schedule the crawlers periodically to update the metadata catalog.

555
Q

A group of data scientists wants to examine market trend data in order to develop a new investment strategy for their organization. The trend data is derived in enormous quantities from five distinct data sources. The team wants to use Amazon Kinesis for this use case. The team analyzes trends using SQL-like queries and wants to send alerts in response to certain noteworthy patterns in the trends. Additionally, the data scientists wish to store the data in Amazon S3 for preservation and historical reprocessing and, where practical, to employ AWS managed services. The team wants to adopt the least expensive option possible.

Which solution satisfies these criteria?

A

Publish data to one Kinesis data stream. Deploy a Kinesis Data Analytics application to the stream for analyzing trends, and configure an AWS Lambda function as an output to send notifications using Amazon SNS. Configure Kinesis Data Firehose on the Kinesis data stream to persist data to an S3 bucket.

556
Q

A business has created many AWS Glue jobs to validate and transform data from Amazon S3 and load it in batches into Amazon RDS for MySQL once a day. The ETL jobs use a DynamicFrame to read the S3 data. At the moment, ETL developers are having difficulty processing only incremental data on each run, because the AWS Glue job processes the entire S3 input dataset on each run.

Which option would enable developers to resolve the problem with the least amount of coding work possible?

A

Enable job bookmarks on the AWS Glue jobs.
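
Bookmarks are enabled with the job argument --job-bookmark-option job-bookmark-enable, and the script must initialize and commit the job so bookmark state is saved. A minimal skeleton (database and table names are hypothetical):

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)    # bookmark state is loaded here when bookmarks are enabled

# With bookmarks on, this read only picks up data that earlier runs have not processed.
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="raw",                     # hypothetical database
    table_name="input_files",           # hypothetical table
    transformation_ctx="read_input",    # required so the bookmark can track this source
)

# ... validate, transform, and load into Amazon RDS for MySQL here ...

job.commit()    # persists the bookmark for the next run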

557
Q

A firm built a service that generates millions of messages each day and streams them through Amazon Kinesis Data Streams.
The firm writes data to Kinesis Data Streams using the Kinesis SDK. A few months after launch, a data analyst discovered a dramatic decrease in write performance. The data analyst examined the metrics and discovered that Kinesis is throttling write requests. The data analyst wants to address this problem without significantly modifying the architecture.

What steps should the data analyst take to rectify this situation?

A

Increase the number of shards in the stream using the UpdateShardCount API.

Choose partition keys in a way that results in a uniform record distribution across shards.
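
A boto3 sketch of the resharding call; the stream name and target count are hypothetical and should be sized from the observed write throughput:

import boto3

kinesis = boto3.client("kinesis")

kinesis.update_shard_count(
    StreamName="billing-stream",     # hypothetical stream name
    TargetShardCount=64,             # example target
    ScalingType="UNIFORM_SCALING",   # splits or merges shards evenly
)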

558
Q

A supplier of analytics software as a service (SaaS) wishes to give self-service business intelligence (BI) reporting capabilities to its clients. The provider creates these reports using Amazon QuickSight. Although the reports’ data is stored in a multi-tenant database, each client should have access to just their own data.
The company wants to offer customers two distinct user roles:

✑ Read-only users, who only need to view dashboards.
✑ Power users, who can create and share new dashboards with other users.

Which QuickSight feature enables the supplier to satisfy these criteria?

A

Isolated namespaces

559
Q

A business offers analytics to its sales and marketing divisions. The departments can access the data only through their business intelligence (BI) tools, which connect to Amazon Redshift via an internal Amazon Redshift user. Each department is allocated a user with the appropriate rights in the Amazon Redshift database. The marketing data analysts must be granted direct access to the advertising table. The advertising table is kept in Apache Parquet format in the marketing S3 bucket of the business data lake. AWS Lake Formation is used to manage the company’s data lake. Finally, access must be restricted to the table’s three promotion columns.

Which combination of actions will satisfy these criteria?

A

Create an Amazon Redshift Spectrum IAM role with permissions for Lake Formation. Attach it to the Amazon Redshift cluster.

Create an external schema in Amazon Redshift by using the Amazon Redshift Spectrum IAM role. Grant usage to the marketing Amazon Redshift user.

Grant permissions in Lake Formation to allow the Amazon Redshift Spectrum role to access the three promotion columns of the advertising table.

560
Q

A business uses Apache Spark on an Amazon EMR cluster. The Spark job writes to a bucket on Amazon S3. The job fails with an HTTP 503 ‘Slow Down’ AmazonS3Exception.

Which steps will rectify this situation?

A

Add additional prefixes to the S3 bucket

Increase the EMR File System (EMRFS) retry limit

561
Q

A software business runs an application on AWS that receives weekly updates. As part of the application testing process, a solution must be built that examines log files from each Amazon EC2 instance to ensure that the application continues to function properly after each deployment. The collection and analysis solution should be highly available, with little latency in displaying fresh data.

Which technique should be used by the firm to gather and analyze logs?

A

Use the Amazon Kinesis Producer Library (KPL) agent on Amazon EC2 to collect and send data to Kinesis Data Firehose to further push the data to Amazon Elasticsearch Service and Kibana.

562
Q

A company’s Amazon QuickSight Enterprise edition users have access to hundreds of dashboards, analyses, and datasets. The organization is having difficulty managing and assigning permissions for providing users access to numerous QuickSight elements. The organization wants to simplify the process of sharing and permissions management.

Which solution should the organization employ to streamline the administration of permissions?

A

Use QuickSight folders to organize dashboards, analyses, and datasets. Assign group permissions by using these folders.

563
Q

A transportation business wishes to track vehicle movements by collecting geolocation records. The records are 10 B in size, and up to 10,000 records are captured per second. Data transmission delays of a few minutes are tolerable when network conditions are unstable. The transport firm chose to ingest the data using Amazon Kinesis Data Streams. The organization is seeking a dependable technique for sending data to Kinesis Data Streams while optimizing the throughput efficiency of the Kinesis shards.

Which option will best fulfill the needs of the business?

A

Kinesis Producer Library (KPL)

564
Q

A business has about 10-15 TB of uncompressed .csv files stored in Amazon S3. The corporation is evaluating Amazon Athena as a one-time query engine. The organization wants to transform the data in order to reduce query execution time and storage expenses.

Which data format and compression algorithm best fits these requirements?

A

Apache Parquet compressed with Snappy

565
Q

A retailer saves order invoices in an Amazon OpenSearch Service (Amazon Elasticsearch Service) cluster. Monthly indices are built in the cluster.
Once a new month starts, no new writes are made to any of the prior months’ indices. The firm has been adding storage to the Amazon OpenSearch Service (Amazon Elasticsearch Service) cluster to avoid running out of space, but the company wants to minimize costs. Most searches on the cluster are performed on the most recent three months of data, while the audit team needs infrequent access to older data in order to compile periodic reports. The most recent three months of data must be immediately accessible for queries, but the audit team can tolerate slower queries if the solution saves money on cluster costs.

Which of the following is the MOST EFFECTIVE approach in terms of operational efficiency to achieve these requirements?

A

Archive indices that are older than 3 months by using Index State Management (ISM) to create a policy to migrate the indices to Amazon OpenSearch Service (Amazon Elasticsearch Service) UltraWarm storage.

566
Q

Multiple data pipelines are used by a business to ingest data from operational databases into an Amazon S3 data lake. AWS Glue and Amazon EMR are used for data processing and ETL in the processes. The firm wishes to improve its architecture in order to facilitate automated orchestration and reduce human involvement.

Which solution should the business employ to manage data processes in order to comply with these requirements?

A

AWS Step Functions

567
Q

A technology business is developing a dashboard for visualizing and analyzing time-sensitive data. The data will be ingested through Amazon Kinesis Data Firehose with a 60-second buffer interval. The dashboard must display the data in near-real time.

Which visualization solution satisfies these criteria?

A

Select Amazon OpenSearch Service (Amazon Elasticsearch Service) as the endpoint for Kinesis Data Firehose. Set up an OpenSearch Dashboards (Kibana) using the data in Amazon OpenSearch Service (Amazon ES) with the desired analyses and visualizations.

568
Q

A healthcare provider ingests patient data from several sources and stores it in an Amazon S3 staging bucket. AWS Glue transforms the data before it is written to an S3-based data lake for querying with Amazon Athena. The firm wishes to match patient records even when they do not share a common unique identifier.

Which solution satisfies this criterion?

A

Train and use the AWS Glue FindMatches ML transform in the ETL job.

569
Q

A business utilizes Amazon Redshift to meet its data warehousing requirements. Every night, ETL processes are executed to import data, apply business rules, and build aggregate tables for reporting purposes. The data warehouse is used by the company’s data analysis, data science, and business intelligence departments during normal business hours. The workload management is set to AUTO, and each team has its own queue with a priority of NORMAL.
Recently, sharp spikes in read queries from the data analysis team have occurred at least twice daily, causing queries to queue for cluster resources. The organization requires a solution that allows the data analysis team to avoid query queuing while minimizing the effect on other teams’ latency and query times.

Which solution satisfies these criteria?

A

Configure the data analysis queue to enable concurrency scaling.

570
Q

Amazon OpenSearch Service (Amazon Elasticsearch Service) is used by a business to store and analyze website clickstream data. The organization uses Amazon Kinesis Data Firehose to ingest 1 TB of data daily and stores one day’s worth of data in an Amazon ES cluster.
The organization sees very sluggish query performance on the Amazon ES index and sometimes encounters errors when trying to write to the index using Kinesis Data Firehose. The Amazon ES cluster comprises ten data nodes, each running a single index, and three dedicated master nodes. Each data node is configured with 1.5 TB of Amazon EBS storage, and the cluster contains 1,000 shards. Occasionally, the cluster logs include JVMMemoryPressure errors.

Which option will optimize Amazon ES’s performance?

A

Decrease the number of Amazon ES shards for the index

571
Q

A data analytics professional is using AWS Glue to automate the ingestion of compressed files submitted to an Amazon S3 bucket. The data intake pipeline should be capable of incremental processing.

Which AWS Glue feature should the data analytics professional use to accomplish this task?

A

Job bookmarks

572
Q

A medical business has a system of sensor devices that continuously read parameters and feed them to an Amazon Kinesis data stream. The Kinesis data stream has multiple shards. The company’s objective is to calculate the average value of a numeric measure every second and to raise an alert when the value exceeds or falls below a specified threshold. The alert must be delivered to Amazon Simple Notification Service (Amazon SNS) in less than 30 seconds.

Which architecture satisfies these criteria?

A

Use an Amazon Kinesis Data Analytics application to read from the Kinesis data stream and calculate the average per second. Send the results to an AWS Lambda function that sends the alarm to Amazon SNS.

573
Q

A retailer’s online reporting system is being migrated to AWS. The legacy system of the organization processes data from online transactions via a sophisticated set of layered Apache Hive queries. Multiple times each day, transactional data is exported from the online system to the reporting system. Between updates, the schemas in the files remain stable.
A data analyst wants to migrate the data processing to AWS quickly, so any code modifications should be kept to a minimum. To keep storage costs down, the data analyst opts for Amazon S3. It is critical that the data in the reports and accompanying analytics be completely current, since they rely on the data in Amazon S3.

Which solution satisfies these criteria?

A

Create an AWS Glue Data Catalog to manage the Hive metadata. Create an AWS Glue crawler over Amazon S3 that runs when data is refreshed to ensure that data changes are updated. Create an Amazon EMR cluster and use the metadata in the AWS Glue Data Catalog to run Hive processing queries in Amazon EMR.

574
Q

A business receives data in JSON format, with a timestamp in the file name, from a vendor. The vendor uploads the data to an Amazon S3 bucket, which the firm then registers in its data lake for analysis and reporting. The organization created an S3 Lifecycle policy that archives all data to S3 Glacier after five days.
The organization wants to verify that its AWS Glue crawler catalogs data from S3 Standard storage only and disregards the older files. A data analytics professional must build a system that accomplishes this objective without modifying the S3 bucket settings.

Which solution meets these requirements?

A

Use the excludeStorageClasses property in the AWS Glue Data Catalog table to exclude files on S3 Glacier storage.

575
Q

An online gaming firm utilizes an Amazon Kinesis Data Analytics SQL application that is sourced from a Kinesis data stream. The source gives the application three non-null fields: player_id, score, and us_5_digit_zip_code.
A data analyst has created a .csv mapping file that converts a limited number of us_5_digit_zip_code values to territory codes. If a territory code exists, the data analyst must include it as an additional output of the Kinesis Data Analytics application.

How should the data analyst fulfill this objective while keeping expenditures to a minimum?

A

Store the mapping file in an Amazon S3 bucket and configure it as a reference data source for the Kinesis Data Analytics application. Change the SQL query in the application to include a join to the reference table and add the territory code field to the SELECT columns.

576
Q

A data analytics professional is manually configuring workload management in an Amazon Redshift system. The data analytics specialist is building query monitoring rules to control an Amazon Redshift cluster’s system performance and user experience.

Which components must be included in each query monitoring rule?

A

A unique rule name, one to three predicates, and an action

577
Q

A hospital is developing a research data lake to house data from the electronic health record (EHR) systems of numerous hospitals and clinics. The EHR systems are self-contained and do not share a single patient identifier. The data engineering team lacks machine learning (ML) expertise and has been tasked with creating a unique patient identifier for the ingested records.

Which option will be most effective in completing this task?

A

An AWS Glue ETL job with the FindMatches transform

578
Q

A business is storing historical datasets in Amazon S3. The company’s data engineer wants to make these datasets accessible for analysis through Amazon Athena. Additionally, the engineer wants to secure the Athena query results in an S3 results location by using AWS encryption technologies. The following conditions apply to encrypting query results:

✑ Use custom keys for encryption of the primary dataset query results.
✑ Use generic encryption for all other query results.
✑ Provide an audit trail for the primary dataset queries that shows when the keys were used and by whom.

Which solution satisfies these criteria?

A

Use server-side encryption with AWS KMS managed customer master keys (SSE-KMS CMKs) for the primary dataset. Use server-side encryption with S3 managed encryption keys (SSE-S3) for the other datasets.

579
Q

Throughout the day, a huge corporation receives files from external parties through Amazon EC2. The files are concatenated into a single file, compressed using gzip, and sent to Amazon S3. Daily, the combined size of all the files is close to 100 GB. After the files are uploaded to Amazon S3, an AWS Batch job uses the COPY command to load them into an Amazon Redshift cluster.

Which update to the application will speed up the COPY process?

A

Split the number of files so they are equal to a multiple of the number of slices in the Amazon Redshift cluster. Gzip and upload the files to Amazon S3. Run the COPY command on the files.

580
Q

A business has developed an application that consumes streaming data. The organization wants to examine this stream over a five-minute window to detect anomalies using Random Cut Forest (RCF) and to summarize the current count of status codes. The original and summary datasets should be retained for future reference.

Which strategy would achieve the required result while maintaining a minimal cost of data persistence?

A

Ingest the data stream with Amazon Kinesis Data Streams. Have a Kinesis Data Analytics application evaluate the stream over a 5-minute window using the RCF function and summarize the count of status codes. Persist the source data and results to Amazon S3 through output delivery to Kinesis Data Firehose.

581
Q

A data analyst is organizing, cleansing, validating, and formatting a 200 GB dataset using AWS Glue. The data analyst started the job using the Standard worker type. After three hours, the status of the AWS Glue job remains RUNNING. The job run logs include no error codes. The data analyst wishes to shorten the job runtime without overprovisioning.

Which actions should the data analyst take?

A

Enable job metrics in AWS Glue to estimate the number of data processing units (DPUs). Based on the profiled metrics, increase the value of the maximum capacity job parameter.

582
Q

A company uses Amazon EMR clusters to analyze its workload on a regular basis. Data engineers often forget to terminate these EMR clusters after their work is complete. The company incurs unnecessary costs because the idle clusters continue to run. The company needs an automated solution to terminate idle clusters.

Which solution meets these requirements?

A

Configure an Amazon CloudWatch metrics alarm on the IsIdle metric from the EMR clusters to publish a notification to an Amazon Simple Notification Service topic. Subscribe an AWS Lambda function to the topic to terminate the clusters.
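
A sketch of the alarm definition; the cluster ID, alarm name, topic ARN, and idle duration are hypothetical:

import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when the cluster reports itself idle for 30 minutes; the SNS topic's Lambda
# subscriber then terminates the cluster.
cloudwatch.put_metric_alarm(
    AlarmName="emr-idle-j-ABC123",
    Namespace="AWS/ElasticMapReduce",
    MetricName="IsIdle",
    Dimensions=[{"Name": "JobFlowId", "Value": "j-ABC123"}],    # hypothetical cluster ID
    Statistic="Average",
    Period=300,
    EvaluationPeriods=6,                                        # 6 x 5 minutes
    Threshold=1.0,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:emr-idle-topic"],  # hypothetical topic
)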

583
Q

An IoT device company is storing sensor data in a data lake that is stored on Amazon S3 Standard. The company queries the data directly from Amazon S3 to power rolling 30-day dashboards. The company also uses a subset of the data to retrain a set of machine learning (ML) models on a weekly basis.

The company stores the data in Apache Parquet format in 1 GB objects. The ETL process uploads the data to an S3 bucket and uses the S3 Standard storage class. The company needs to add an S3 Lifecycle configuration on the S3 bucket to minimize cost.

Which solution will meet these requirements?

A

Add an S3 Lifecycle configuration on the S3 bucket to immediately transition the data from S3 Standard to S3 Intelligent-Tiering

584
Q

A healthcare company must encrypt all data at rest and in transit. The company’s data analytics team has set up a pipeline that runs Apache Hive scripts on an Amazon EMR cluster to process the data. After processing, the output is transferred to an Amazon S3 bucket as encrypted objects. The company needs a solution that encrypts all the data that Amazon EMR Hive reads and writes.

What should a data analytics specialist do to meet these requirements?

A

Configure the EMR cluster to use EMRFS. Use Amazon EBS encryption for local and root volumes. Encrypt the data by using server-side SSE-S3. Specify the encryption artifacts used for in-transit encryption by uploading a .zip file that contains the certificates to S3

585
Q

An online retail company wants to collect millions of product registration records from customers every week. The company wants to process the records with customized business logic to validate the customer information and activate product warranties.

The company will process the raw registration data into compressed, unpartitioned, columnar data each month. The company can take up to a week to process the data. The company will scan and aggregate the processed data into hourly, daily, weekly, and monthly reports. These reports will be accessed frequently throughout the day. The company must avoid data loss during processing.

Which solutions meet these requirements MOST cost-effectively?

A
  • Use Amazon EMR with Spot Instances and EMRFS to process the records. Load the processed data into Amazon Redshift. Generate the reports from Amazon Redshift.
  • Use AWS Glue to process the records. Load the processed data into Amazon Redshift. Generate the reports from Amazon Redshift.

586
Q

A media company is using social interactions, clickstreams, ticket sales, ratings, and other data-gathering methods to build its campaign engine. The company wants an on-demand, event-driven solution to process and transform the growing amount of data. The solution must initiate when partners add data in an Amazon S3 bucket. The solution must process the data, transform the data, and save the data in a different Amazon S3 bucket. A few data analysts occasionally will query the transformed data.

Which solution will meet these requirements MOST cost-effectively?

A

Crawl the data with an AWS Glue crawler and update the AWS Glue Data Catalog to reflect the metadata. Use an AWS Glue ETL job to process and transform the data. Use Amazon Athena to query the transformed data.

587
Q

A healthcare company uses Amazon S3 to store all its data. The company is planning to use Amazon EMR, backed with the EMR File System (EMRFS), to process and transform the data. The company stores the data in multiple S3 buckets and encrypts the data by using different encryption keys for each S3 bucket.

A data analytics specialist must configure the EMR cluster to access the encrypted data. The solution must comply with security best practices.

Which solution meets these requirements?

A

Use per-bucket encryption overrides in the EMR security configuration.

588
Q

A company maintains years of campaign data and user segmentation data in its on-premises data warehouse. The company frequently uses the current year’s data, which includes thousands of tables. The company has migrated the current year’s data to an Amazon Redshift cluster. About 150 TB of infrequently used data from previous years is stored in Amazon S3. The Amazon Redshift cluster has heavy utilization.

The company has a new request for a monthly business intelligence (BI) report that is hosted in Amazon QuickSight. The report needs to combine data from the previous years with data from the current year. The company needs to determine the most cost-effective way to deliver the new monthly BI report without significantly increasing the load on the existing Amazon Redshift cluster.

Which solution will meet these requirements?

A

Use the Amazon Redshift Spectrum CREATE EXTERNAL SCHEMA and CREATE EXTERNAL TABLE commands to make the data in S3 accessible from Amazon Redshift. Generate the report from Amazon Redshift.

589
Q

An online retail company is planning to capture clickstream data from its ecommerce website. The company will use the data to drive a new custom-built recommendation engine that provides product recommendations to online users. The company will use Amazon Kinesis Data Streams to ingest the streaming data. The company will use Amazon Kinesis Data Analytics to perform SQL queries on the stream, using windowed queries to process the data that arrives at inconsistent intervals.

A data analytics specialist must choose a windowed query that aggregates the data and uses time-based windows that open as data arrives.

Which type of query meets these requirements?

A

A stagger window. A stagger window query uses a windowing method that is suited for analyzing groups of data that arrive at inconsistent times, and the window opens when the first event that matches the partition key arrives.

590
Q

A company recently migrated a batch job to an Amazon EMR cluster that uses On-Demand Instances. The job is critical to operations. The job has a service level agreement (SLA) of 3 hours and takes an average of 2 hours to complete. The company wants to reduce costs while minimizing impact on availability.

Which combination of steps should a data analytics specialist take to meet these requirements?

A
  • Configure the EMR cluster to use an instance fleet with a provisioning timeout for the core nodes.
  • Use Spot Instances for the task nodes.

591
Q

What is an EMR instance fleet?

A

For each fleet, you can define a provisioning timeout. The timeout applies when the cluster is provisioning capacity and does not have enough Spot Instances to fulfill the target capacity according to the provided specifications. With the provisioning timeout, you can specify the timeout period and choose to switch to On-Demand capacity to fulfill the remaining Spot capacity and comply with the SLA.

592
Q

A company is creating a sales dashboard for a new product line by using data in Amazon Redshift. During a test run of the dashboard, a data analytics specialist notices sluggish query response times. Upon closer inspection, the data analytics specialist discovers the following facts about the data model and query history:

The orders table includes an order_date column and is often searched by date.
The product_details table includes a product_id column and is often joined with the orders table.
What should the data analytics specialist do to optimize the data model?

A

Ensure that the orders table and the product_details table use the KEY diststyle with identical key columns that minimize data processing and skew. In addition, ensure that the orders table has a compound sort key that includes the order_date column and is ordered from the lowest cardinality to the highest cardinality.

593
Q

A company operates an ecommerce website with an online transaction processing (OLTP) database. Every hour, the company exports the database to a .csv file that it stores in Amazon S3. The company queries the data through Amazon Athena to extract key business indicators.

The files are currently tens of gigabytes in size, but they are increasing in size. As the size of the data increases, the same query takes more time. The most common query obtains sales and user statuses by date and by region.

Which solutions will improve the Athena query performance? (Select TWO.)

A
  • Partition the files by date and region.

- Transform the .csv files into Apache Parquet files.

594
Q

A manufacturing company uses Amazon Kinesis Data Firehose to receive periodic data from millions of devices and persist those records in its data lake in Amazon S3. The company then uses AWS Glue to transform the data and load the refined dataset into an Amazon Redshift database. The database consists of one fact table and three dimension tables. The company periodically deletes old periods of data and uses Amazon Redshift Spectrum to access that data from Amazon S3.

The company uses Amazon QuickSight dashboards to query the data directly from Redshift Spectrum and to visualize the results. Most of the queries involve only the newest data.

The number of sensors has remained static, but users have started to notice a degradation in the load time of the dashboards. After the deletion of older data from Amazon Redshift, maintenance jobs take longer to run as the data grows.

Which solution will resolve these issues?

A

Organize the data into multiple time-series tables. Drop old tables

595
Q

A hospital has medical sensor devices that transmit data about patients’ treatments. The sensor data is written to local on-premises storage. Data scientists who perform medical research access the data later.

The hospital plans to begin using AWS for long-term storage of sensor data. The hospital also plans to use Amazon Athena, Amazon QuickSight, and Amazon EMR for new analytics capabilities. The hospital is evaluating solutions for achieving the ongoing replication of the sensor data to AWS for storage. The hospital wants a solution that provides encryption of data in transit and data integrity validation.

Which solution will meet these requirements?

A

Use AWS DataSync: deploy a DataSync agent on premises and replicate the data to a specified S3 bucket.

596
Q

A manufacturing company is expecting a surge in orders. The company needs to visualize performance data from its manufacturing devices to ensure that it can keep up with demand. The company stores data from the devices in a data lake on Amazon S3 in Apache Parquet format. Data arrives every 15 minutes from the company’s factories around the world and is added to 10 TB of historical data that already resides in the S3 data lake.

Data scientists want to use the S3 data lake to develop new machine learning (ML) models. The data scientists want to use the new models to generate updated visualizations for predicted order volume. The company must have preliminary visualizations available within 12 hours of the data’s arrival in Amazon S3.

Which solution will meet these requirements?

A

Launch an AWS Glue ETL job to transform the data and save it in Amazon S3. Read the data by using Amazon Forecast and develop the new ML models to predict the order volume. Save the results to Amazon S3. Create visualizations in Amazon QuickSight.

597
Q

A company processes real-time streaming data by using Apache ZooKeeper and Apache Kafka. The company is facing challenges with managing, setting up, and scaling brokers during production. The company wants to optimize the deployment of the Kafka brokers. The solution must be managed by AWS, must be secure, and must minimize changes to the current client code.

Which solution meets these requirements?

A

Use Amazon MSK to scale the brokers.

598
Q

A credit rating company is based in the United States. The company is using Amazon S3 buckets in the us-east-1 Region and the us-west-2 Region as semistructured data storage. To comply with regulations, the company has applied server-side encryption (SSE) on all the S3 buckets. An S3 Lifecycle policy on all the S3 buckets across both Regions moves objects to S3 Standard-Infrequent Access (S3 Standard-IA) after 7 days and moves objects to S3 Glacier after 40 days.

The company is using Amazon Athena in us-east-1 with appropriate IAM permissions to query the data from Amazon S3. However, the company cannot access some of the data by using Athena.

Which objects are inaccessible to the company when the company uses Athena? (Select TWO.)

A
  • Objects in S3 buckets in us-west-2

- Objects in S3 Glacier

599
Q

A company is implementing a new data lake on Amazon S3. The company has a set of web services that send data to Amazon Kinesis Data Streams. The company uses AWS Lambda to process and store the data in Amazon S3.

The company needs a solution to validate and control the evolution of streaming data. The solution must be able to keep track of any schema changes.

Which solution will meet these requirements?

A

Integrate the Kinesis Producer Library (KPL) and Kinesis Client Library (KCL) with the AWS Glue Schema Registry.

600
Q

An online retail company is capturing clickstream data to populate analytics dashboards. Multiple departments throughout the company want to enrich their dashboards by using the same data stream in Amazon Kinesis Data Streams. However, the company faces performance bottlenecks when it horizontally scales its consumer applications and adds new dashboards.

Which solution will produce the MOST improvement in performance?

A

Use Kinesis Data Streams enhanced fan-out and HTTP/2 data retrieval.

601
Q

A transportation company recently implemented a data lake on Amazon S3. The company uses AWS Glue ETL jobs to run a set of transformations on the data lake’s data and to store the data in a consumable format so that users can run one-time queries on the datasets. A data analytics specialist must be able to express the dependencies between the transformations and must run multiple transformations concurrently to reduce processing time.

How can the data analytics specialist meet these requirements with the LEAST operational overhead?

A

Use a scheduled trigger to start an AWS Glue workflow that launches the required AWS Glue jobs.

602
Q

A gaming company uses Amazon Athena to run one-time queries on gaming data to find trends and usage metrics. Multiple teams are using Athena, and the company wants to enforce cost controls for each team. The company also must encrypt all Athena query results data.

Which solution will meet these requirements?

A

Create a unique Athena workgroup for each team. Within the workgroup, enforce the encryption for the query results and create tags. Use the tags to calculate the cost for each team. Use resource-based policies to assign workgroups to teams.