Data Analytics Flashcards
Abbr for ETL
Extract Transform Load
What is AWS alternative to Apache Kafka?
AWS Kinesis
How is a Kinesis data stream divided?
Shards
Kinesis Streams retention period
- default 24H
- up to 365 Days
Can multiple applications consume the same stream in Kinesis?
YES
How does billing work in Kinesis Data Streams?
per shard provisioned
Size of Data Blob in Kinesis Streams
up to 1MB
Kinesis Producer max write
1MB/s or 1000 messages/s PER SHARD
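Since the write limit applies per shard, a stream's shard count can be estimated from its target throughput. A minimal sketch of that math (the function name is mine; the 1MB/s and 1,000 records/s limits are the ones above):

```python
import math

def shards_needed(mb_per_sec: float, records_per_sec: float) -> int:
    """Estimate shard count from the per-shard write limits:
    1 MB/s and 1,000 records/s."""
    by_bytes = math.ceil(mb_per_sec / 1.0)          # 1 MB/s per shard
    by_records = math.ceil(records_per_sec / 1000)  # 1,000 records/s per shard
    return max(by_bytes, by_records, 1)

print(shards_needed(5, 2000))    # bytes dominate -> 5
print(shards_needed(0.5, 3500))  # record count dominates -> 4
```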
Which exception is returned if a producer exceeds the provisioned throughput?
ProvisionedThroughputExceededException
Two types of consumers in Kinesis Streams
- Consumer Classic
- Consumer Enhanced Fan-Out
What is Kinesis Agent?
Kinesis Agent is a stand-alone Java software application that offers an easy way to collect and send data to Kinesis Data Streams
What is hot shard in Kinesis Streams?
Some shards in your Kinesis data stream might receive more records than others. This can lead to throttling errors in the stream, resulting in overworked shards, also known as hot shards.
Potential solutions to ProvisionedThroughputExceededException
- retries with backoff
- increase shards (scaling)
- ensure the partition key is optimal
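The first option, retries with backoff, can be sketched as exponential backoff with jitter. This is a hand-rolled illustration, not the SDK's built-in retry logic; `RuntimeError` stands in for the real throttling exception:

```python
import random
import time

def put_with_backoff(put_record, max_retries=5, base_delay=0.1):
    """Call put_record, retrying with exponential backoff and full
    jitter when a throttling error is raised."""
    for attempt in range(max_retries):
        try:
            return put_record()
        except RuntimeError:  # stand-in for ProvisionedThroughputExceededException
            if attempt == max_retries - 1:
                raise
            # full jitter: sleep somewhere in [0, base_delay * 2^attempt]
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))

# demo: a producer that is throttled twice, then succeeds
calls = {"n": 0}
def flaky_put():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("throttled")
    return "ok"

print(put_with_backoff(flaky_put, base_delay=0.001))  # -> ok
```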
What is Kinesis Producer Library?
An easy-to-use, highly configurable C++/Java library that helps you write to a Kinesis data stream
Two types of API in KPL?
- Synchronous
- Asynchronous
What is the purpose of Batching in Kinesis Producer Library?
increase throughput and decrease cost
Kinesis Producer Library two types of batching
- Aggregation
- Collection
What might be the effect of increasing RecordMaxBufferTime in KPL?
- additional processing delay
- higher packing efficiencies and better performance
Can KPL be used if the application cannot tolerate additional delay?
No. The AWS SDK should be used instead
Shard Kinesis Consumer max throughput?
2MB/s
Shard Kinesis Producer max throughput?
1MB/s
When to use Enhanced Kinesis Fan Out Consumers?
- Multiple Consumer applications for the same stream
- Low latency requirement (70ms)
When to use Standard Kinesis Consumers?
- low number of consuming applications (1,2,3) for the same stream
- Can tolerate 200ms latency
- minimize cost
Default limit of consumers when using Enhanced Fan Out Kinesis Consumer
5
Can you perform multiple resharding operations at the same time?
No, only one operation is allowed at a time, and it takes a few seconds
AWS Kinesis Firehose destinations
- S3
- Redshift
- Opensearch
- HTTP Endpoint
What’s the minimum latency for non-full batch in Kinesis Firehose?
60s
Is Kinesis Firehose auto-scaled?
YES
Embedded data transformation format in Kinesis Firehose
JSON -> Parquet or ORC
Is compression supported by Kinesis firehose
Yes, when the target is S3
Kinesis Firehose payment schema
Pay only for the amount of data going through Firehose
Buffer flushing logic for the Kinesis Firehose
based on time and size rules
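The rule amounts to "deliver when either the buffer size hint or the buffer interval is hit, whichever comes first". A sketch with hypothetical parameter names (the defaults mirror a 5MB / 60s configuration; this is not the Firehose API itself):

```python
def should_flush(buffered_bytes, seconds_since_flush,
                 size_hint_mb=5, interval_s=60):
    """Firehose-style buffering: flush when the buffer reaches the size
    hint OR the buffer interval elapses, whichever comes first."""
    return (buffered_bytes >= size_hint_mb * 1024 * 1024
            or seconds_since_flush >= interval_s)

print(should_flush(6 * 1024 * 1024, 10))  # size hit -> True
print(should_flush(1024, 60))             # interval hit -> True
print(should_flush(1024, 10))             # neither -> False
```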
Where can CloudWatch Logs be streamed to?
- Kinesis Data Streams
- Kinesis Data Firehose
- AWS Lambda
Kinesis Firehose minimum buffer interval
60 seconds
Kinesis Firehose maximum buffer interval
900 seconds
Maximum write capacity on On-demand Kinesis Data Stream
200MB/s and 200,000 records/s
Maximum read capacity on On-demand Kinesis Data Stream
400MB/s per consumer (extra capacity in Enhanced fan out)
Command to restart the Kinesis Agent on Linux
sudo service aws-kinesis-agent restart
Processing capacity of SQS
1 message/s to 10,000 messages/s
How many messages can be in SQS queue?
No limit
What’s the latency of SQS
<10ms on publish and receive
max message size in SQS
256KB
How to send messages over 256KB in SQS
use SQS Extended Client (Java Library)
What can the content of an SQS message be?
XML, JSON, Unformatted text
Max size of Batch request in SQS
10 messages - max 256KB
Max transactions per second in Standard SQS queue
unlimited
Max transactions per second in SQS FIFO queue
300 messages/s (3,000 messages/s with batching)
Data retention period in SQS
1 minute to 14 days
SQS pricing model
- pay per API Request
- pay per network usage
What’s encrypted in SQS when using SSE?
body only, metadata is NOT encrypted
How many times can data be consumed on Kinesis Data Stream?
Many times
When are records deleted from SQS?
After consumption
When is data deleted from a Kinesis Data Stream?
After the retention period
Which AWS service allows replay of data?
Kinesis Data Streams
What is IoT Rules Engine?
It evaluates inbound messages published to AWS IoT, then transforms and delivers them to another device or a cloud service, based on business rules you define
What is IoT device shadow?
A Device Shadow is a persistent, virtual representation of a device that is managed by a thing resource you create in the AWS IoT registry
The purpose of Device Gateway in IoT
Entry point for IoT devices connecting to AWS
Protocols supported by IoT Device Gateway
MQTT, WebSockets, and HTTP 1.1
What is IoT Message Broker?
The Message Broker is a high throughput pub/sub message broker that securely transmits messages to and from all of your IoT devices and applications with low latency.
How are messages published in IoT Message Broker?
messages are published into topics
Which devices will receive Message Broker message in IoT?
all clients connected to the topic
Purpose of IoT Thing Registry
Organizes the resources associated with each device in the AWS Cloud
3 authentication methods for IoT
- Create X.509 certificate and load them securely into the Things
- AWS SigV4
- Custom tokens with Custom authorizers
How is a device shadow represented in IoT?
JSON document
What is IoT Greengrass?
AWS IoT Greengrass provides cloud-based management of application logic that runs on devices
What is DMS?
Database Migration Service - quickly and securely migrate databases to AWS, resilient, self healing
When to use SCT (Schema Conversion Tool) in database migration?
When migrating to different DB engine
Snowball Edge Storage Optimized capacity
80 TB
Snowball Edge Compute Optimized capacity
42 TB
AWS Snowcone capacity
8 TB
Which Snow service has DataSync agent pre-installed
Snowcone only
What is AWS OpsHub?
A software you install on your computer/laptop to manage your Snow Family Device
MSK encryption in-flight between brokers
TLS
MSK encryption in-flight between clients
TLS
MSK EBS encryption
KMS
Three MSK CloudWatch metric levels
- basic
- enhanced
- topic-level
Message size for MSK
1MB default, up to 10MB
Kafka topic unit
Partitions (each topic is divided into partitions)
Kafka scaling limitation
Partitions can only be added to a topic, never removed
In-flight encryption options for MSK
PLAINTEXT or TLS In-flight
What is multipart upload in S3?
Feature for uploading objects in parts; required for objects larger than 5GB
Max object size in S3
5TB
Three Glacier retrieval options
- expedited (1-5 mins)
- standard (3-5 hours)
- bulk (5-12 hours)
Amazon Glacier Deep Archive retrieval options
- standard (12h)
- bulk (48h)
Minimum storage duration for Glacier
90 days
Glacier Deep Archive minimum storage duration
180 days
Two types of replication in S3
- CRR - Cross Region Replication
- SRR - Same Region Replication
What is S3 Byte-Range Fetch
You can use concurrent connections to Amazon S3 to fetch different byte ranges from within the same object. This helps you achieve higher aggregate throughput versus a single whole-object request
Which S3 feature can be used to retrieve partial data of file?
S3 Byte-Range Fetch
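Splitting an object into ranges for parallel GETs looks like this sketch (the helper is mine; each returned string is a valid HTTP `Range` header value and can be passed as the `Range` parameter of `get_object` in boto3):

```python
def byte_ranges(object_size: int, chunk_size: int):
    """Split an object into inclusive HTTP Range values for
    parallel S3 Byte-Range Fetch requests."""
    return [f"bytes={start}-{min(start + chunk_size, object_size) - 1}"
            for start in range(0, object_size, chunk_size)]

print(byte_ranges(10, 4))  # -> ['bytes=0-3', 'bytes=4-7', 'bytes=8-9']
```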
DynamoDB maximum size of an item?
400KB
What does a Write Capacity Unit represent in DynamoDB?
one write/s for an item up to 1KB in size
The logic behind Eventually Consistent Read
If we read just after a write, it’s possible we’ll get unexpected response because of replication
The logic behind Strongly Consistent Read
If we read just after a write, we will get the correct data
What does a Read Capacity Unit represent in DynamoDB?
one strongly consistent read per second or
two eventually consistent reads per second, for an item up to 4KB in size
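Those two definitions turn into simple ceiling math when sizing a table. A sketch (the helper names are mine):

```python
import math

def wcu(item_kb: float, writes_per_sec: int) -> int:
    """One WCU = one write/s for an item up to 1KB."""
    return math.ceil(item_kb / 1) * writes_per_sec

def rcu(item_kb: float, reads_per_sec: int, strongly_consistent: bool = True) -> int:
    """One RCU = one strongly consistent read/s (or two eventually
    consistent reads/s) for an item up to 4KB."""
    units = math.ceil(item_kb / 4) * reads_per_sec
    return units if strongly_consistent else math.ceil(units / 2)

print(wcu(3, 5))          # ceil(3/1) * 5 -> 15
print(rcu(6, 10))         # ceil(6/4) * 10 -> 20
print(rcu(6, 10, False))  # eventually consistent is half -> 10
```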
Max per-partition RCU/WCU in DynamoDB
3,000 RCU / 1,000 WCU
Max DynamoDB partition size
10GB
Three ways of writing data in DynamoDB
- PutItem
- UpdateItem
- Conditional writes
Two ways of deleting data in DynamoDB
- DeleteItem
- DeleteTable
Max BatchWriteItem capacity in DynamoDB
- up to 25 PutItem/DeleteItem in one call
- up to 16MB of data
- up to 400KB of data per item
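Those limits mean a large write has to be chunked client-side. A sketch of the 25-item chunking (the helper name is mine):

```python
def write_batches(items, batch_size=25):
    """Chunk items into BatchWriteItem-sized groups (max 25 per call)."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

batches = write_batches(list(range(60)))
print([len(b) for b in batches])  # -> [25, 25, 10]
```

In a real boto3 `batch_write_item` call, the response may also contain `UnprocessedItems`, which should be retried with backoff.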
Default read in DynamoDB
Eventually consistent
Max capacity of BatchGetItem in DynamoDB
- up to 100 items
- up to 16MB of data
On which fields does Query operate in DynamoDB?
Partition key and Sort key only
DynamoDB index that must be defined at the table creation time
LSI (Local Secondary Index)
Which DynamoDB index can be modified?
GSI (Global Secondary Index) only
For which index must RCU/WCU be defined?
GSI
What is DynamoDB DAX?
DynamoDB Accelerator - seamless cache, no application re-write
Default DynamoDB DAX cache TTL?
5 minutes
Max number of nodes in DynamoDB DAX cluster?
10 nodes
Retention time of DynamoDB Streams
up to 24H
What is DynamoDB Streams?
Captures a time-ordered sequence of item-level modifications in a DynamoDB table and durably stores the information for up to 24 hours
How to access DynamoDB without Internet?
VPC Endpoints
What are DynamoDB Global Tables?
multi-region, fully replicated, high performance tables
How can you migrate DynamoDB to RDS?
Use DMS (Database Migration Service)
How to store large objects in DynamoDB?
Store them in S3 and reference them in DynamoDB
Will Redis cache survive reboot?
Yes - by default
You would like to react in real-time to users de-activating their account and send them an email to try to bring them back. The best way of doing it is to…
Integrate a Lambda function with DynamoDB Streams
You would like to have DynamoDB automatically delete old data for you. What should you use?
TTL
You are looking to improve the performance of your RDS database by caching some of the most common rows and queries. Which technology do you recommend?
ElastiCache
How does the Glue Crawler extract partitions?
Extraction is based on how your S3 data is organized
What are the targets of Glue ETL?
- S3
- JDBC (RDS, Redshift)
- Glue Data Catalog
Which platform is Glue ETL running on?
Serverless Spark platform
Three ways of running Glue jobs
- time based schedules
- job bookmarks
- CloudWatch Events
Which Glue feature prevents reprocessing of old data?
Job Bookmark
Glue cost model
Billing by the minute for Crawler and ETL jobs
First million objects stored and accessed are free for the Glue Data Catalog
Development endpoint for developing ETL code charged by the minute
Does Glue ETL support streaming ETL?
yes, runs on Apache Spark Structured Streaming (serverless)
What is the simplest way to make sure the metadata under Glue Data Catalog is always up-to-date and in-sync with the underlying data without your intervention each time?
Schedule crawlers to run periodically
Which programming languages can be used to write ETL code for AWS Glue?
Python and Scala
Can you run existing ETL jobs with AWS Glue?
YES
Upload the code to Amazon S3 and create one or more jobs that use that code. You can reuse the same code across multiple jobs by pointing them to the same code location on Amazon S3.
How can you be notified of the execution of AWS Glue jobs?
CloudWatch + SNS
What is AWS Glue Studio?
Visual interface for ETL workflows
What is AWS Glue DataBrew?
A visual data preparation tool
Three types of nodes in EMR
- master
- core
- task
What does HDFS stand for?
Hadoop Distributed File System
How are files stored in HDFS?
files are stored as blocks (128MB default size)
What is EMRFS in AWS?
The EMR File System (EMRFS) is an implementation of the Hadoop file system that all Amazon EMR clusters use for reading and writing regular files from Amazon EMR directly to Amazon S3
What happens when you manually detach an EBS volume in EMR?
EMR treats that as a failure and replaces it.
What local storage is suitable for in EMR?
buffers, caches, etc.
EMR charging schema
per hour plus EC2 charges
What happens when a core node fails in EMR?
EMR provisions a new node automatically
How to increase processing capacity but not HDFS capacity in EMR?
Add more task nodes
How to increase both processing and HDFS capacity in EMR?
Resize or add core nodes
Scale-Up strategy in EMR
first add core nodes, then task nodes, up to max units specified
Scale-Down strategy in EMR
- first removes task nodes, then core nodes, no further than minimum constraints
- spot nodes always removed before on-demand instances
What does YARN stand for?
Yet Another Resource Negotiator
What is Apache Spark?
Open-source distributed processing framework for big data
Which languages are supported by Apache Spark?
Java, Scala, Python and R
What is Apache Tez?
Apache Tez is an open-source framework for big data processing based on MapReduce technology
What is Apache Pig?
Pig is a high level scripting language that is used with Apache Hadoop. Pig enables data workers to write complex data transformations without knowing Java.
What is HBase?
non-relational, petabyte-scale database based on Google’s BigTable, on top of HDFS
What is Presto used for?
- it connects to many different “big data” databases and data stores at once, and queries across them
- interactive queries at petabyte scale
What’s under the hood of AWS Athena?
Presto
What is Apache Zeppelin used for?
Apache Zeppelin is a multi-purpose web-based notebook that brings data ingestion, data exploration, visualization, sharing and collaboration features to Hadoop and Spark
What is Hue?
Graphical front-end for applications on EMR cluster
What’s the usage of Splunk?
operational tool - can be used to visualize EMR and S3 data using EMR Hadoop Cluster
What’s the usage of Flume?
Another way to stream data into cluster. Originally made to handle log aggregation.
What is MXNet?
Like TensorFlow, a library for building and accelerating neural networks.
What is S3DistCP?
Tool for copying large amounts of data between S3 and HDFS
Which Amazon EMR tool is used for querying multiple data stores at once?
Presto
When you delete your EMR cluster, what happens to the EBS volumes?
EMR will delete the volumes once the EMR cluster is terminated
What’s under the hood of Kinesis Data Analytics?
Apache Flink
Is Kinesis Analytics serverless?
YES
Is Kinesis Analytics scaled automatically?
YES
What is the usage of RANDOM_CUT_FOREST in Kinesis Analytics?
SQL function used for anomaly detection on numeric columns in a stream
As recommended by AWS, you are going to ensure you have dedicated master nodes for high performance. As a user, what can you configure for the master nodes?
The count and instance types of master nodes
Which are supported ways to import data into your Amazon ES domain?
- Kinesis
- Logstash
- Elasticsearch’s APIs
What can you do to prevent data loss due to nodes within your ES domain failing?
Elasticsearch snapshots
Athena cost model
Pay-as-you-go
- $5 per TB scanned
- Successful or cancelled queries count
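A back-of-the-envelope helper for that cost model (assuming the documented 10MB per-query minimum and binary TB; the function is mine, not an AWS API):

```python
def athena_query_cost(scanned_bytes: int, price_per_tb: float = 5.0) -> float:
    """$5 per TB scanned, with a 10MB minimum billed per query."""
    billed = max(scanned_bytes, 10 * 1024 * 1024)  # 10MB minimum
    return billed / 1024 ** 4 * price_per_tb       # binary TB assumed

print(athena_query_cost(1024 ** 4))  # 1 TB scanned -> 5.0
```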
What does OLAP stand for?
On-Line Analytical Processing
Is Redshift designed for OLAP or OLTP?
OLAP
Max number of Compute Nodes in Redshift
128 Nodes
What VACUUM command is used for in Redshift?
Recovers space from deleted rows and re-sorts rows
What is Redshift Elastic resize?
- quickly add or remove nodes of the same type
- cluster is down for a few minutes
- tries to keep connections open across the downtime
What is Redshift Classic resize?
- change node type and/or number of nodes
- cluster is read-only for hours to days
Max number of read replicas in AWS Aurora
15 read replicas
Max storage in Amazon Aurora
Up to 64TB per database instance
What is GraphQL used for?
GraphQL is designed to make APIs fast, flexible, and developer-friendly. It can even be deployed within an integrated development environment (IDE) known as GraphiQL. As an alternative to REST.
What is Amazon Kendra used for?
Amazon Kendra is a highly accurate intelligent search service that enables your users to search unstructured data using natural language. It returns specific answers to questions, giving users an experience that’s close to interacting with a human expert.
You have an S3 bucket that your entire organization can read. For security reasons you would like the data to sit encrypted there, and you would like to define a strategy in which users can only read the data they are allowed to decrypt, which may be a different subset of objects within the bucket for each user. How can you achieve that?
Use SSE-KMS to encrypt the files
SSE-KMS will allow you to use different KMS keys to encrypt the objects, and then you can grant users access to specific sets of KMS keys to give them access to the objects in S3 they should be able to decrypt
An application processes sensor data in real-time by publishing it to Kinesis Data Streams, which in turn sends data to an AWS Lambda function that processes it and feeds it to DynamoDB. During peak usage periods, it’s been observed that some data is lost. You’ve determined that you have sufficient capacity allocated for Kinesis shards and DynamoDB reads and writes. What might be TWO possible solutions to the problem?
- Increase your Lambda function’s timeout value
- Process data in smaller batches to avoid hitting Lambda’s timeout
As part of your application development, you would like your users to be able to get Row Level Security. The application is to be deployed on web servers and the users of the application should be able to use their amazon.com accounts. What do you recommend for the database and security?
Enable Web Identity federation. Use DynamoDB and reference ${www.amazon.com:user_id} in the attached IAM policy
What SSE security mechanisms are supported by EMR?
SSE-S3 – Amazon S3 manages keys for you.
SSE-KMS – You use an AWS KMS key to set up with policies suitable for Amazon EMR
Is SSE-C (customer-provided keys) available for use in EMR?
NO
Two EMR EBS encryption options
- EBS encryption - available only when you specify AWS Key Management Service as your key provider.
- LUKS encryption – If you choose to use LUKS encryption for Amazon EBS volumes, the LUKS encryption applies only to attached storage volumes, not to the root device volume.
You are working for an e-commerce website and that website uses an on-premise PostgreSQL database as its main OLTP engine. You would like to perform analytical queries on it, but the Solutions Architect recommended not doing it off of the main database. What do you recommend?
Use DMS to replicate the database to RDS
You are processing data using a long running EMR cluster and you would like to ensure that you can recover data in case an entire availability zone goes down, as well as process the data locally for the various Hive jobs you plan on running. What do you recommend to do this at a minimal cost?
Store the data in S3 and keep a warm copy in HDFS
A financial services company wishes to back up its encrypted data warehouse in Amazon Redshift daily to a different region. What is the simplest solution that preserves encryption in transit and at rest?
Configure Redshift to automatically copy snapshots to another region, using an AWS KMS customer master key in the destination region.
Does Redshift have cross region snapshots?
YES
A company wishes to copy 500GB of data from their Amazon Redshift cluster into an Amazon RDS PostgreSQL database, in order to have both columnar and row-based data stores available. The Redshift cluster will continue to receive large amounts of new data every day that must be kept in sync with the RDS database. What strategy would be most efficient?
Copy data using the dblink function into PostgreSQL tables
What is Ganglia?
Ganglia is the operational dashboard provided with EMR
Your company has data from a variety of sources, including Microsoft Excel spreadsheets stored in S3, log data stored in a S3 data lake, and structured data stored in Redshift. Which is the simplest solution for providing interactive dashboards that span this data?
Use Amazon Quicksight directly on top of the Excel, S3, and Redshift data.
As part of an effort to limit cost and keep the size of your DynamoDB table under control, your AWS account manager would like to ensure old data is deleted in DynamoDB after 1 month. How can you do so with as little maintenance as possible and without impacting the current read and write operations?
Enable DynamoDB TTL and add a TTL column
You are dealing with PII datasets and would like to leverage Kinesis Data Streams for your pub-sub solution. Regulators imposed the constraint that the data must be encrypted end-to-end using an internal key management system. What do you recommend?
Implement a custom encryption code in the Kinesis Producer Library (KPL)
A manager wishes to make a case for hiring more people in her department, by showing that the number of incoming tasks for her department have grown at a faster rate than other departments over the past year. Which type of graph in Amazon Quicksight would be best suited to illustrate this data?
Area line chart
You are looking to reduce the latency of your Big Data processing job that operates in Singapore but sources data from Virginia. The Big Data job must always operate against the latest version of the data. What do you recommend?
Enable S3 Cross Region Replication
You have an ETL process that collects data from different sources and 3rd-party providers and would like to ensure that data is loaded into Redshift once all the parts from all the providers related to one specific job have been gathered, a process that can take from one hour to one day. What is the least costly way of doing that?
Create an AWS Lambda function that responds to S3 upload events and checks whether all the parts are there before loading into Redshift
A financial services company has a large, secure data lake stored in Amazon S3. They wish to analyze this data using a variety of tools, including Apache Hive, Amazon Athena, Amazon Redshift, and Amazon QuickSight.
How should they connect their data and analysis tools in a way that minimizes costs and development work?
Run an AWS Glue Crawler on the data lake to populate a AWS Glue Data Catalog. Share the glue data catalog as a metadata repository between Athena, Redshift, Hive, and QuickSight
You are working for a data warehouse company that uses an Amazon Redshift cluster. For security reasons, it is required that VPC flow logs be analyzed by Athena to monitor all COPY and UNLOAD traffic of the cluster that moves in and out of the VPC. Which of the following helps you in this regard?
Use Enhanced VPC Routing
A hospital monitoring sensor data from heart monitors wishes to raise immediate alarms if an anomaly in any individual’s heart rate is detected.
Which architecture meets these requirements in a scalable manner?
Publish sensor data into a Kinesis data stream, and create a Kinesis Data Analytics application using RANDOM_CUT_FOREST to detect anomalies. When an anomaly is detected, use a Lambda function to route an alarm to Amazon SNS
A produce export company has multi-dimensional data for all of its shipments, such as the date, price, category, and destination of every shipment. A data analyst wishes to interactively explore this data, applying statistical functions to different rows and columns and sorting them in different ways.
Which QuickSight visualization would be best suited for this?
Pivot table
You are an online retailer and your website is a storefront for millions of products. You have recently run a big sale on one specific electronic item and you have encountered Provisioned Throughput Exceptions. You would like to ensure you can properly survive an upcoming sale that will be three times as big. What do you recommend?
DynamoDB DAX
Your daily Spark job runs against files created by a Kinesis Firehose pipeline in S3. Due to low throughput, you observe that each of the many files created by Kinesis Firehose is about 100KB. You would like to optimize your Spark job as best as possible to query the data efficiently. What do you recommend?
Consolidate files on a daily basis using DataPipeline
A data scientist wishes to develop a machine learning model to predict stock prices using Python in a Jupyter Notebook, and use a cluster on AWS to train and tune this model, and to vend predictions from it at large scale.
Which system allows you to do this?
Amazon SageMaker
You are tasked with using Hive on Elastic MapReduce to analyze data that is currently stored in a large relational database.
Which approach could meet this requirement?
Use Apache Sqoop on the EMR cluster to copy the data into HDFS
What is Sqoop?
Sqoop is an open-source system for transferring data between Hadoop and relational databases.
Your esports application hosted on AWS needs to process game results immediately in real time and later perform analytics on the same game results, in the order they arrived, at the end of business hours. Which AWS service will be the best fit for your needs?
Kinesis Data Streams
You wish to use Amazon Redshift Spectrum to analyze data in an Amazon S3 bucket that is in a different account than Redshift Spectrum.
How would you authorize access between Spectrum and S3 across accounts?
Add a policy to the S3 bucket allowing S3 GET and LIST operations for an IAM role for Spectrum on the Redshift account
You need to ETL streaming data from web server logs as it is streamed in, for analysis in Athena. Upon talking to the stakeholders, you’ve determined that the ETL does not strictly need to happen in real-time, but transforming the data within a minute is desirable.
What is a viable solution to this requirement?
Perform any initial ETL you can using Amazon Kinesis, store the data in S3, and trigger a Glue ETL job to complete the transformations needed.
An organization has a large body of web server logs stored on Amazon S3, and wishes to quickly analyze their data using Amazon Athena. Most queries are operational in nature, and are limited to a single day’s logs.
How should the log data be prepared to provide the most performant queries in Athena, and to minimize costs?
Convert the data into Apache Parquet format, compressed with Snappy, stored in a directory structure of year=XXXX/month=XX/day=XX/
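Building that directory structure programmatically is a one-liner. A sketch (the helper is mine) of the Hive-style prefix Athena can use for partition pruning:

```python
from datetime import date

def partition_prefix(d: date) -> str:
    """Hive-style partition prefix so Athena can prune to a single day."""
    return f"year={d.year:04d}/month={d.month:02d}/day={d.day:02d}/"

print(partition_prefix(date(2023, 4, 7)))  # -> year=2023/month=04/day=07/
```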
You are creating an EMR cluster that will process the data in several MapReduce steps. Currently you are working against the data in S3 using EMRFS, but the network costs are extremely high as the processes write back temporary data to S3 before reading it. You are tasked with optimizing the process and bringing the cost down, what should you do?
Add a preliminary step that will use a S3DistCp command
You are required to maintain a real-time replica of your Amazon Redshift data warehouse across multiple availability zones.
What is one approach toward accomplishing this?
Spin up separate redshift clusters in multiple availability zones, using Amazon Kinesis to simultaneously write data into each cluster. Use Route 53 to direct your analytics tools to the nearest cluster when querying your data.
You work for a gaming company and each game’s data is stored in DynamoDB tables. In order to provide a game search functionality to your users, you need to move that data over to ElasticSearch. How can you achieve it efficiently and as close to real time as possible?
Enable DynamoDB Streams and write a Lambda function
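The Lambda side of that pattern receives DynamoDB Streams events in the shape below. A sketch that extracts indexable documents (the `Records`/`eventName`/`NewImage` layout is the real stream event format; the Elasticsearch indexing call itself is left out):

```python
def extract_documents(event):
    """Pull new and updated items out of a DynamoDB Streams event,
    ready to be indexed into Elasticsearch."""
    docs = []
    for record in event["Records"]:
        if record["eventName"] in ("INSERT", "MODIFY"):
            image = record["dynamodb"]["NewImage"]
            # unwrap simple DynamoDB attribute values like {"S": "g1"}
            docs.append({k: next(iter(v.values())) for k, v in image.items()})
    return docs

event = {"Records": [
    {"eventName": "INSERT",
     "dynamodb": {"NewImage": {"game_id": {"S": "g1"}, "score": {"N": "42"}}}},
    {"eventName": "REMOVE", "dynamodb": {}},
]}
print(extract_documents(event))  # -> [{'game_id': 'g1', 'score': '42'}]
```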
A large news website needs to produce personalized recommendations for articles to its readers, by training a machine learning model on a daily basis using historical click data. The influx of this data is fairly constant, except during major elections when traffic to the site spikes considerably.
Which system would provide the most cost-effective and reliable solution?
Publish click data into Amazon S3 using Kinesis Firehose, and process the data nightly using Apache Spark and MLLib using spot instances in an EMR cluster. Publish the model’s results to DynamoDB for producing recommendations in real-time.
You are working for a bank and your company regularly uploads 100 MB files to Amazon S3, where they are analyzed by Athena. It has come to light that recently some of the uploads have been corrupted, making a critical big data job fail. Your company would like a stronger guarantee that uploads are done successfully and that the files have the same content on premises and on S3, at minimal cost. What do you recommend?
Use the S3 ETag and compare to the local MD5 hash
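For single-part uploads the S3 ETag is simply the hex MD5 of the object (multipart ETags differ: they include a part-count suffix), so the check can be done locally. A sketch:

```python
import hashlib

def upload_is_intact(local_content: bytes, s3_etag: str) -> bool:
    """Compare a local MD5 with the S3 ETag (valid for single-part
    uploads only; S3 returns the ETag wrapped in double quotes)."""
    return hashlib.md5(local_content).hexdigest() == s3_etag.strip('"')

print(upload_is_intact(b"hello", '"5d41402abc4b2a76b9719d911017c592"'))  # -> True
```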
Three modules of SageMaker
- Build
- Train
- Deploy
What limit, if any, is there to the size of your training dataset in Amazon Machine Learning by default?
100GB
Is there a limit to the size of the dataset that you can use for training models with Amazon SageMaker? If so, what is the limit?
No fixed limit
Does Kinesis Stream preserve client ordering?
YES
Can Kinesis Streams data be consumed in parallel?
YES
EMR deployment options
- EC2
- Amazon EKS
- AWS Outposts
Description of Task node in EMR?
A node with software components that only runs tasks and DOES NOT store data in HDFS.
Task node is optional
What is EMRFS file system in EMR?
Implementation of the Hadoop file system used for reading and writing regular files from Amazon EMR directly to Amazon S3.
What is HDFS in EMR?
Instance store and Amazon Elastic Block Store (Amazon EBS) volume storage is used for HDFS data and for buffers, caches, scratch data, and other temporary content that some applications may “spill” to the local file system
What is Apache Presto?
Presto, also known as PrestoDB, is an open source, distributed SQL query engine that enables fast analytic queries against data of any size.
What is EMR notebook?
Amazon EMR notebooks provide a managed analysis environment based on open-source Jupyter notebooks so that data scientists, analysts, and developers can prepare and visualize data, collaborate with peers, build applications, and perform interactive analysis.
How is Redshift traffic routed when Enhanced VPC Routing is not enabled?
Amazon Redshift routes traffic through the internet, including traffic to other services within the AWS network.
What is Apache Airflow?
Apache Airflow is an open-source task scheduler that can be installed on EC2 instances or bootstrapped on primary nodes
What is Amazon MWAA?
Amazon MWAA (Managed Workflows for Apache Airflow) is a managed service that reduces the burden of provisioning and ongoing maintenance of Airflow and offers seamless integration with CloudWatch for system metrics and logs. It offers a rich UI and troubleshooting tools and can be used to orchestrate jobs across hybrid environments.
Two cluster types used by Amazon EMR
- long-running
- transient
Use cases for long-running EMR cluster
- Spark Streaming or Flink
- online transaction processing (OLTP) workload like Apache HBase
What is an EMR transient cluster?
A cluster that shuts down automatically after all of its steps complete
What is Hive Metastore in AWS?
Apache Hive is an open-source data warehouse and analytics package that runs on top of an Apache Hadoop cluster. A Hive metastore contains a description of the table and the underlying data making up its foundation, including the partition names and data types.
Where is Hive Metastore information recorded by default?
In a MySQL database on the master node’s file system.
Patterns to deploy a Hive Metastore on Amazon EMR:
- AWS Glue Data Catalog
- external data store such as Amazon Relational Database Service (Amazon RDS) or Amazon Aurora
What is Apache Ranger?
Apache Ranger is an open-source project that provides authorization and audit capabilities for Hadoop and related big data applications like Apache Hive, Apache HBase, and Apache Kafka.
What is S3DistCp in Amazon EMR?
The primary data transfer utility used in Amazon EMR; an extension of the open-source Apache DistCp, optimized to work with Amazon S3.
Can extra EBS volumes be added to EMR cluster?
YES
Is AWS Glue using servers?
No. It’s a serverless service
What is AWS Redshift spectrum?
Redshift Spectrum is a feature of Amazon Redshift that allows you to query data stored on Amazon S3 directly and supports nested data types.
What is AWS QuickSight?
Amazon QuickSight allows everyone in your organization to understand your data by asking questions in natural language, exploring through interactive dashboards, or automatically looking for patterns and outliers powered by machine learning.
What is AWS Glue Data Catalog?
Persistent metadata store, you can use this managed service to store, annotate, and share metadata in the AWS Cloud in the same way you would in an Apache Hive metastore.