AWS Data Analytics Flashcards

1
Q

In a single data dashboard, Amazon ___________ can include AWS data, third-party data, big data, spreadsheet data, SaaS data, B2B data, and more.

A

QuickSight

2
Q

CloudWatch detailed monitoring sends data from your EC2 instance to CloudWatch in ______ intervals.

A

1-minute

3
Q

____________ is an ETL service that captures, transforms, and delivers streaming data to data lakes, data stores, and analytics services.

A

Kinesis Data Firehose

4
Q

When Kinesis Data Firehose is configured to send data to Redshift, behind the scenes it has to load the streaming data to _______ first and then issue a ______ command to move the data to Redshift.

A

S3… COPY…

5
Q

Within Kinesis Data Analytics, using _________ __________ is a windowing method for analyzing time-based, overlapping groups of data that arrive at inconsistent times by aggregating the data.

A

stagger windows

6
Q

What are the three windows you can use to process data in Kinesis Data Analytics?

A
  1. Stagger Windows
  2. Tumbling Windows
  3. Sliding Windows
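The difference between these window types is easiest to see with a concrete aggregation. Below is a minimal local Python sketch of a tumbling (fixed, non-overlapping) window count; it illustrates the windowing concept only and does not use the Kinesis Data Analytics SQL/Flink API — the function name and sample events are hypothetical.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds):
    """Count events per fixed, non-overlapping (tumbling) window.

    `events` is an iterable of (epoch_seconds, payload) pairs. Local
    illustration of the windowing concept only, not the KDA API.
    """
    counts = defaultdict(int)
    for ts, _payload in events:
        # Each timestamp falls into exactly one window
        window_start = (ts // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(counts)

# Six events spread over two minutes, grouped into 60-second windows
events = [(0, "a"), (15, "b"), (59, "c"), (60, "d"), (119, "e"), (120, "f")]
print(tumbling_window_counts(events, 60))  # {0: 3, 60: 2, 120: 1}
```

A sliding window would instead evaluate a window ending at every event, and a stagger window opens when the first event matching a partition key arrives, which suits data that arrives at inconsistent times.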
7
Q

___________ includes a built-in ML algorithm that can easily provide reliable forecasts for your data.

A

Amazon QuickSight

8
Q

_______ is a fast, open-source, distributed SQL query engine designed for interactive analytic queries over large datasets from multiple sources (built by Facebook).

A

Presto

9
Q

AWS Glue ETL scripts can be coded in _________ or _________ .

A

Python… Scala…

10
Q

Amazon Redshift automatically integrates with ________ but not with an ________ (for encryption keys).

A

AWS KMS… HSM…

11
Q

With Amazon Redshift, you can’t migrate to an _______-encrypted cluster by modifying the cluster. This is only possible if you want to enable _______ encryption.

A

HSM… KMS…

12
Q

To load data from S3 to Redshift, you can use a __________ _________ that lists out the specific S3 paths you want to be copied over.

A

manifest file

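A COPY manifest is a small JSON document with an `entries` list, where each entry names an S3 object URL and whether it is mandatory. A minimal Python sketch that builds one (the bucket and key names are placeholders):

```python
import json

# Sketch of a Redshift COPY manifest; bucket and key names are placeholders.
manifest = {
    "entries": [
        {"url": "s3://my-bucket/2024/01/part-0000.csv", "mandatory": True},
        {"url": "s3://my-bucket/2024/01/part-0001.csv", "mandatory": True},
    ]
}
manifest_json = json.dumps(manifest, indent=2)
print(manifest_json)
```

The manifest itself is uploaded to S3, and the COPY command references it with the MANIFEST option so that only the listed objects are loaded.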
13
Q

Using the AWS Glue crawler for compressed files will cause the run time to ____________.

A

increase… It will take longer because the crawler has to download and decompress the file before reading it.

14
Q

AWS Glue ___________ crawls only crawl folders that were added since the last crawler run, which can save significant time and cost.

A

incremental

15
Q

To enable permissions between S3 and QuickSight, you would need to configure the permissions from the _________ console.

A

QuickSight

16
Q

The _________ process re-sorts rows and reclaims space in either a specified table or all tables in the current database in Amazon Redshift.

A

VACUUM

17
Q

If QuickSight connects to the data store by using a ________ ________, the data automatically refreshes when you open an associated dataset, analysis, or dashboard.

A

direct query

18
Q

________ ______ is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data.

A

Amazon EMR

19
Q

Can you use AWS Glue triggers to execute a job to run directly after a crawler completes?

A

No, but you can create an AWS Glue workflow with two triggers: one for the crawler and one for the job. This will achieve the same effect.

20
Q

The capacity limits of an Amazon Kinesis data stream are defined by the ________ _____ ________ within the data stream.

A

number of shards

21
Q

When creating an EMR cluster, if you want the log files archived to Amazon S3, you must enable this feature __________ (while / after) launching the cluster.

A

while

22
Q

Does Amazon SQS support real time streaming of data?

A

No.

23
Q

What are the two Amazon EMR cluster types (regarding the time it takes for each to initialize) ?

A

(1) persistent / long-running
(2) transient

24
Q

In Kinesis Data Streams, you can create up to _____ registered consumers per stream.

A

20

25
Q

The two Kinesis Data Streams capacity modes are _________ and _________. These refer to whether the data stream shards are automatically or manually created.

A

on-demand… provisioned

26
Q

To detect anomalies in your Kinesis Data Stream, you can use the ________________ function.

A

RANDOM_CUT_FOREST

27
Q

Kinesis Data Analytics (KDA) supports _____________, _____________, and _____________ as destinations.

A

Kinesis Data Streams… Kinesis Data Firehose… Lambda

28
Q

A common architecture using Kinesis Data Analytics (KDA) might look like this: ___________ –> Kinesis Data Analytics –> ___________ –> S3

A

Kinesis Data Stream –> Kinesis Data Analytics –> Kinesis Data Firehose –> S3

29
Q

Apache _______ is a data warehousing system that uses SQL-like queries to analyze structured data stored in Hadoop Distributed File System (HDFS).

A

Hive

30
Q

When creating an EMR cluster, what two configuration options can you choose from? The selected option is applied to each node type (primary, core, task) of the cluster.

A
  1. Instance Fleets
  2. Uniform Instance Groups (simpler, provides autoscaling)
31
Q

Can Glue Data Catalog be used to store data, in a similar way to S3?

A

No, it is only used to store schema information on data gathered from the Glue crawler.

32
Q

By default, Amazon Redshift clusters are created and situated in _______ AZ(s) within an AWS Region.

A

1… However, a multi-AZ deployment is also an option

33
Q

If you have customized networking requirements for using Amazon Redshift, you will need to enable _________________ _______ _________________.

A

Enhanced VPC Routing

34
Q

S3 Transfer Acceleration is enabled at the ________ level.

A

bucket

35
Q

What are three common CLI commands for moving data to and from S3?

A

cp (copy)
mv (move)
sync (sync)

36
Q

What are the 3 API calls for an S3 multipart upload?

A

CreateMultipartUpload
UploadPart
CompleteMultipartUpload

37
Q

What is the max number of parts for an S3 multipart upload?

A

10,000

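Because of the 10,000-part cap, a client has to pick a part size based on the object size. A small Python sketch of that calculation (the helper name is hypothetical; 5 MiB is the S3 minimum size for every part except the last):

```python
import math

MAX_PARTS = 10_000               # S3 multipart upload part-count limit
MIN_PART_SIZE = 5 * 1024 * 1024  # 5 MiB minimum for every part but the last

def choose_part_size(object_size_bytes):
    """Pick the smallest part size that stays within the 10,000-part cap."""
    return max(MIN_PART_SIZE, math.ceil(object_size_bytes / MAX_PARTS))

# A 1 TiB object needs parts of roughly 105 MiB each
print(choose_part_size(1024 ** 4))  # 109951163
```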
38
Q

Does ElastiCache for Memcached support snapshots and replication?

A

No. Snapshots and replication are not supported for Memcached, only for Redis.

39
Q

Which AWS database stores data as nodes connected with edges?

A

Neptune

40
Q

In relational databases, row-based storage is ideal for OL__P and columnar storage is ideal for OL__P.

A

OLTP… OLAP

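The distinction can be illustrated locally: the same small table laid out row-wise and column-wise, with an OLAP-style aggregate that only needs to touch one column. The sample data is illustrative only.

```python
# Row-based layout: each record stored together (suits OLTP point
# lookups and single-row writes).
rows = [
    {"id": 1, "amount": 10, "region": "us-east-1"},
    {"id": 2, "amount": 25, "region": "eu-west-1"},
    {"id": 3, "amount": 40, "region": "us-east-1"},
]

# Columnar re-shaping: each column's values stored contiguously
columns = {key: [row[key] for row in rows] for key in rows[0]}

# OLTP-style lookup reads one whole row; OLAP-style aggregate scans a
# single column without touching the others.
print(rows[1])                 # {'id': 2, 'amount': 25, 'region': 'eu-west-1'}
print(sum(columns["amount"]))  # 75
```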
41
Q

Apache _______ is an analytics framework for processing large datasets. (hint: Databricks is built on top of this)

A

Spark

42
Q

What are the 3 data storage options for Amazon EMR?

A
  1. HDFS
  2. EMRFS (uses S3)
  3. Local Storage (Instance Store / EBS)
43
Q

An Amazon EMR cluster can have either ____ or ____ primary (aka master) nodes.

A

1 or 3

44
Q

How many AZ’s are used for Amazon EMR clusters?

A

Only 1 AZ

45
Q

What are the three node types in an Amazon EMR cluster?

A
  1. Primary/Master Node
  2. Core Node
  3. Task Node (optional)
46
Q

What is the name of the API you can use to launch an Amazon EMR cluster?

A

RunJobFlow API

47
Q

What is the name of the API you can use to terminate an Amazon EMR cluster?

A

TerminateJobFlows API

48
Q

The default limit for Amazon EMR instances is _____. This can be increased upon request.

A

20 instances (across all your clusters)

49
Q

When using Amazon EMR, can you SSH directly into a task node?

A

No, you must first SSH into the master node, and then SSH into the desired node.

50
Q

Which Amazon EMR node type (primary/master, core, task) hosts data using Hadoop Distributed File System (HDFS) and also runs Hadoop tasks?

A

Core Node

51
Q

What are 5 implementations of how you can run Amazon EMR applications? (i.e. Amazon EMR on ______, Amazon EMR _____)

A
  1. Amazon EMR Serverless
  2. Amazon EMR on EC2
  3. Amazon EMR on AWS Outposts
  4. Amazon EMR on EKS
  5. Amazon EMR on Local Zones
52
Q

For Amazon EMR billing, __________ rounds up the runtime duration to the nearest minute, whereas __________ tracks runtime duration to the nearest second.

A

BilledResourceUtilization… TotalResourceUtilization

53
Q

Amazon EMR supports what two types of Hive clusters?

A
  1. interactive (customer can run Hive scripts directly on master node)
  2. batch (Hive script stored in S3 and referenced)
54
Q

Amazon Redshift can automatically generate recommendations for managing your warehouse with the feature called _________ __________

A

Redshift Advisor

55
Q

Does Redshift support native integration with Amazon SageMaker?

A

Yes

56
Q

____________ is a feature of Amazon Redshift that lets you run queries against your data lake in Amazon S3, with no data loading or ETL required.

A

Redshift Spectrum

57
Q

Using Amazon Redshift ______ nodes with managed storage allows you to pay separately for storage and compute.

A

RA3

58
Q

For Amazon Redshift instances using Dense Compute (DC) and Dense Storage (DS2) clusters, where is the data stored?

A

On the compute nodes (as opposed to S3 for RA3 clusters and Redshift Serverless)

59
Q

How is the data stored when using an Amazon Redshift RA3 instance?

A

Frequently processed data (hot data) is stored on high performance SSDs, and cold data stored in S3.

60
Q

What service would a customer use to integrate (and/or aggregate) Amazon Redshift with their own on-premises data warehouse?

A

AWS Data Exchange

61
Q

How are Redshift Multi-AZ and Redshift Relocation different, regarding RTO?

A

Redshift Relocation is free and has a 10-60 minute recovery time.
Redshift Multi-AZ is more expensive, but has an RTO measured in seconds.

62
Q

_____________ allows SQL users to create, train, and deploy machine learning models using familiar SQL commands.

A

Redshift ML

63
Q

The Amazon Redshift _______ _______ simplifies access to Amazon Redshift because you don’t need to configure drivers and manage database connections. Instead, you can run SQL commands against an Amazon Redshift cluster by simply calling a secured API endpoint.

A

Data API

64
Q

How long are Amazon Redshift automatic backups retained vs manual backups?

A

Automatic: 24 hours
Manual: Indefinitely

65
Q

How would you monitor the performance of your Amazon Redshift data warehouse cluster?

A

AWS Management Console, or
CloudWatch APIs

66
Q

Is there a charge for using the Amazon Redshift Data API?

A

No

67
Q

When you launch an Amazon Redshift cluster, what option determines the CPU, RAM, storage capacity, and storage drive type for each node?

A

The node type

68
Q

For datasets under 1 TB (compressed), what is the recommended Redshift node type?

A

DC2 (Dense Compute node)

69
Q

What are the two EC2 platforms used for launching an Amazon Redshift cluster?

A
  1. EC2-Classic
  2. EC2-VPC
70
Q

In Amazon Redshift, you need to associate a __________ _________ with each cluster that you create in order to configure database settings such as query timeout and date style.

A

parameter group

71
Q

The charges that you accrue for using Amazon Redshift are based on _______ nodes and billed at an _________ rate.

A

compute… hourly…

72
Q

Which EC2 instance categories does Amazon EMR support (i.e. on-demand, etc.) ?

A

on-demand
spot
reserved

73
Q

An Amazon Redshift cluster is a set of nodes, which consists of a ________ node and one or more ________ nodes.

A

leader… compute…

74
Q

A QuickSight _________ is a user who can create and publish dashboards.

A

Author

75
Q

A QuickSight _________ is a user who consumes interactive dashboards.

A

Reader

76
Q

Amazon QuickSight __________ Edition offers enhanced functionality such as QuickSight Readers, Private VPC connectivity, and AD connectivity.

A

Enterprise

77
Q

_________ _________ __________ __________ are of 30-minute duration each. Each session is charged at $0.30 with maximum charges of $5 per Reader in a month.

A

Amazon QuickSight Reader sessions

78
Q

Will an Amazon QuickSight Reader be charged if QuickSight is open in a browser in a background tab?

A

No, only charged when user interacts with page via a page refresh, filtering, clicking, etc.

79
Q

Can Amazon QuickSight “Authors” or “Readers” invite more users?

A

No. This can only be done with a QuickSight “Admin” account.

80
Q

Does Amazon QuickSight connect to both Amazon EC2 and on-premises databases?

A

Yes

81
Q

The “Augment with SageMaker” option for Amazon __________ allows your SageMaker ML models to run inferences on your data.

A

QuickSight

82
Q

Does QuickSight leverage SageMaker models to perform inference on incremental data or the full data every time it runs?

A

Inference runs on the full data every time it refreshes.

83
Q

Amazon QuickSight has an innovative technology called ________ that allows it to select the most appropriate visualizations based on the properties of the data.

A

AutoGraph

84
Q

You can use AWS Glue _________ to visually clean up and normalize data without writing code.

A

DataBrew

85
Q

How does AWS Glue relate to AWS Lake Formation?

A

AWS Lake Formation encompasses AWS Glue PLUS additional features.

86
Q

With AWS Glue _______, data engineers can visually create, run, and monitor ETL workflows.

A

Studio

87
Q

The metadata stored in the AWS Glue Data Catalog can be readily accessed from _________________, ______________, _____________, _________________, and third-party services.

A

AWS Glue ETL
Amazon Athena
Amazon EMR
Amazon Redshift Spectrum

88
Q

The AWS Glue ________ ____________ is a new feature that allows you to centrally discover, control (i.e. enforce), and evolve data stream schemas.

A

Schema Registry

89
Q

The AWS Glue ________ ____________ supports Apache Avro and JSON Schema data formats and Java client applications.

A

Schema Registry

90
Q

Does the AWS Glue Schema Registry provide encryption at rest and in transit?

A

Yes

91
Q

After you define the flow of your data sources, transformations, and targets in the visual (no-code) interface, AWS Glue Studio will generate __________ __________ code on your behalf.

A

Apache Spark

92
Q

Which programming languages does AWS Glue ETL support?

A

Python and Scala

93
Q

When building an AWS Glue workflow, what are the two ways to trigger AWS Glue ETL jobs within your workflow?

A

AWS Glue ETL jobs can either be triggered on a schedule or on a job completion event.

94
Q

AWS Glue provides default retry behavior that will retry all failures _____ times before sending an error notification to CloudWatch.

A

three

95
Q

AWS Glue supports ETL on streams from _______________, _____________, and _____________.

A

Amazon KDS
Apache Kafka
Amazon MSK

96
Q

Do you have to use both the Data Catalog and AWS Glue ETL together for the service to work?

A

No, they can be used independently.

97
Q

Both AWS Glue and Kinesis Data Analytics can be used to process streaming data.
____________ is recommended when your use cases are primarily ETL and you want to run jobs on a serverless Apache Spark-based platform.
____________ is recommended when your use cases are primarily analytics and you want to run jobs on a serverless Apache Flink-based platform.

A

AWS Glue
Kinesis Data Analytics

98
Q

Apache Spark is primarily used for ______ processing, whereas Apache Flink is primarily used for ______ processing.

A

batch… stream…

99
Q

Both AWS Glue and Kinesis Data Firehose can be used for streaming ETL.
___________ is recommended for complex ETL, including joining streams, and partitioning the output in Amazon S3 based on the data content.
___________ is recommended when your use cases focus on data delivery and preparing data to be processed after it is delivered.

A

AWS Glue
Kinesis Data Firehose

100
Q

The AWS Glue __________ ML Transform can solve record linkage and data deduplication problems.

A

FindMatches

101
Q

AWS Glue _______ __________ is a feature of AWS Glue that automatically measures and monitors the quality of data in data lakes and pipelines.

A

Data Quality

102
Q

For the following AWS Glue features:
Data __________ use DataBrew to transform data without writing any code.
Data __________ use the Data Catalog to manage metadata.
Data __________ use AWS Glue Studio to author scalable data integration pipelines.

A

analysts
engineers
engineers

103
Q

Can Amazon Athena process unstructured, semi-structured, and structured datasets?

A

Yes, it can process all three

104
Q

AWS strongly recommends using the ______ command to load large amounts of data into Redshift, as opposed to the _______ command.

A

COPY… INSERT…

105
Q

To grant or revoke privilege to load data into a table using a Redshift COPY command, grant or revoke the __________ privilege.

A

INSERT

106
Q

To load data from Amazon S3, the Redshift COPY command must have _______ access to the bucket and _______ access for the bucket objects.

A

LIST… GET…

107
Q

For Redshift to obtain authorization to access a resource, your cluster must be authenticated using either __________ access control or __________ access control.
(________ access control is recommended by AWS)

A

role-based… key-based…
(role-based)

108
Q

With ___________ access control, your Redshift cluster temporarily assumes an AWS Identity and Access Management (IAM) role on your behalf.

A

role-based

109
Q

When loading data into Redshift, you can use a ___________ file to ensure that your COPY command loads only your specified files from Amazon S3.

A

manifest

110
Q

When you load data into Redshift from S3 using a COPY command, what do you need to do differently when S3 server-side encryption is enabled?

A

Nothing. The process is the same whether S3 is encrypted or not.

111
Q

When using the COPY command to load a table into Amazon Redshift, does the table to be loaded need to already exist in the Redshift database?

A

Yes

112
Q

By default, when loading data from DynamoDB into Redshift, do these two services need to be in the same AWS Region?

A

Yes, but you can also specify a different region using the REGION parameter

113
Q

When loading data from DynamoDB into Redshift, what happens when DynamoDB attributes do not match a column in the Amazon Redshift table?

A

These attributes are discarded. Additionally, they consume part of DynamoDB’s provisioned throughput since the attributes still have to be read.

114
Q

After a Redshift load operation is complete, you can query the ______________ system table to verify that the expected files were loaded.

A

STL_LOAD_COMMITS

115
Q

To validate the data in the Amazon S3 input files or Amazon DynamoDB table before you actually load the data into Redshift, you can use the __________ option with the COPY command.

A

NOLOAD

116
Q

To apply automatic compression when loading data to Redshift, run the COPY command with the __________ option set to ON.

A

COMPUPDATE

117
Q

When loading data files from Amazon S3 into Redshift, does the order of the columns matter?

A

Yes, the columns must be in the same order as the Redshift table

118
Q

The category of SQL commands that manipulate data in a database (INSERT, UPDATE, DELETE) are referred to as _______ _____________ ____________ commands.

A

Data Manipulation Language (DML)

119
Q

Does Amazon Redshift support a single merge (or upsert) command to update a table from a single data source?

A

No, but you can essentially do the same thing with a combination of updates and inserts.

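The usual workaround is the staging-table merge pattern: load the change set into a staging table, delete matching rows from the target, then insert every staging row, all inside one transaction. A plain-Python sketch of that logic (the function name and sample rows are hypothetical, not a Redshift API):

```python
def merge_upsert(target, staging, key="id"):
    """Emulate the staging-table merge pattern in plain Python: drop
    target rows whose key appears in the staging data, then append all
    staging rows. In Redshift this corresponds to a DELETE from the
    target using the staging table, followed by an INSERT ... SELECT.
    """
    staged_keys = {row[key] for row in staging}
    merged = [row for row in target if row[key] not in staged_keys]
    merged.extend(staging)
    return merged

target = [{"id": 1, "v": "old"}, {"id": 2, "v": "keep"}]
staging = [{"id": 1, "v": "new"}, {"id": 3, "v": "insert"}]
print(merge_upsert(target, staging))
```

Row 1 is replaced by its staged version, row 2 is untouched, and row 3 is newly inserted.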
120
Q

The category of SQL commands that can be used to define the database schema, such as CREATE, DROP, ALTER, are referred to as _______ _____________ ____________.

A

Data Definition Language (DDL)

121
Q

In a Redshift cluster, each node is further broken down into ___________, which have their own compute and storage associated with each.

A

slices

122
Q

AWS recommends creating your Redshift tables with __________ ______, which uses automatic table optimization to choose the sort key.

A

SORTKEY AUTO

123
Q

When you create a Redshift table, you can optionally specify one column as the ____________ ______. When the table is loaded with data, the rows are distributed to the node slices according to this key.

A

distribution key

124
Q

What are the two types of Redshift table sort keys, and which is preferred?

A

COMPOUND (preferred)
INTERLEAVED

125
Q

With compression in Redshift, can the sort key column be compressed?

A

No, it must always be in its raw form so it is always available for Redshift to use.

126
Q

Which type of Redshift sort key performs better when using lots of WHERE clauses?

A

INTERLEAVED

127
Q

Which type of Redshift sort key performs better when using lots of ORDER BY clauses?

A

COMPOUND

128
Q

AWS recommends which distribution style for your Redshift tables?

A

DISTSTYLE AUTO

129
Q

When you create a Redshift table, you can designate one of four distribution styles. What are they?

A

AUTO
EVEN
KEY
ALL

130
Q

When creating a Redshift table with a NOT NULL constraint on a column, does Redshift enforce this?

A

Yes, NOT NULL is enforced. (By contrast, uniqueness, primary key, and foreign key constraints are informational only and are not enforced by Redshift.)

131
Q

Redshift Spectrum supports ________ and ________ operations.
It does NOT support ________ and ________ operations.

A

SELECT… INSERT…
UPDATE… DELETE…

132
Q

When resizing a Redshift cluster, the source cluster goes into ____________ mode while the resized cluster is being created.

A

read-only

133
Q

The two types of resize operations you can choose for resizing a Redshift cluster are __________ and __________.

A

classic resize… elastic resize.

134
Q

The ______ resize operation for a Redshift cluster takes minutes, while a ______ resize operation can take hours to days.

A

elastic… classic…

135
Q

When performing an elastic resize of a Redshift cluster, what are the two main constraints?

A
  1. Can’t be used from or to a single-node cluster
  2. Only available for clusters that use the EC2-VPC platform
136
Q

For classic resize and elastic resize operations for Redshift clusters, can you cancel the resize operation after it has been started?

A

For classic resize, yes.
For elastic resize, no.

137
Q

Are the Redshift pause/resume options supported for EC2-Classic clusters?

A

No, you can only pause/resume EC2-VPC clusters

138
Q

Which type of Redshift cluster resize uses a snapshot for the operation?

A

elastic resize

139
Q

What Redshift operation can sort rows and will only sort tables that are less than 95% sorted?

A

VACUUM SORT ONLY

140
Q

What Redshift operation can reclaim disk space and will only run on tables that have more than 5% of the rows marked for deletion?

A

VACUUM DELETE ONLY

141
Q

What Redshift VACUUM option will ensure that the operation is not interrupted by (i.e. resources are not diverted to) incoming queries?

A

BOOST

142
Q

A faster alternative to performing a full vacuum operation on a Redshift cluster table could be to do a _______ _______. This can be beneficial when you have an extremely unsorted table.

A

Deep Copy

143
Q

What AWS service can transfer data to and from AWS at a huge scale (i.e. 10 Gbps per agent, which is approximately 100 TB/day)?

A

AWS DataSync

144
Q

What is an Amazon EMR cluster composed of?

A

A collection of EC2 instances (referred to as “nodes”)

145
Q

Each EC2 instance in an Amazon EMR cluster is called a _______.

A

node

146
Q

Every Amazon EMR cluster has a ___________ node, and it’s possible to create a single-node cluster with only this node.

A

primary

147
Q

The following is an example process using four steps for which AWS service?
1. Submit an input dataset for processing.
2. Process the output of the first step by using a Pig program.
3. Process a second input dataset by using a Hive program.
4. Write an output dataset.

A

Amazon EMR

148
Q

When you set up an Amazon EMR cluster in a private subnet, AWS recommends that you also set up _____________________. Otherwise, you will incur additional charges for NAT gateway as the traffic flow will not be contained within your VPC.

A

VPC endpoints for Amazon S3

149
Q

Amazon EMR integrates with ___________ to log information about requests made by or on behalf of your AWS account. With this information, you can track who is accessing your cluster when, and the IP address from which they made the request.

A

CloudTrail

150
Q

___________ ______ _________ is a web-based integrated development environment (IDE) for fully managed Jupyter notebooks that run on Amazon EMR clusters.

A

Amazon EMR Studio

151
Q

What feature of Amazon EMR allows you to browse your data catalog, run SQL queries, and download results before you work with the data in a Studio notebook?

A

Amazon EMR Studio SQL Explorer

152
Q

An Amazon EMR Studio is composed of one or more ___________.

A

Workspaces

153
Q

___________ ______ _________ does not support EMR clusters with multiple primary nodes.

A

Amazon EMR Studio

154
Q

The maximum number of Amazon EMR Studios you can have is _____ per AWS account.

A

10

155
Q

To use SSH to log on to the master/primary node of an Amazon EMR cluster, you will need to associate an __________ ______ ______ ______ with the cluster.

A

Amazon EC2 key pair

156
Q

What are two limitations of launching an Amazon EMR cluster with multiple primary nodes?

A
  1. Cannot use instance fleets configuration for the nodes
  2. If two of the three primary nodes fail simultaneously, then the cluster will fail
157
Q

When launching an Amazon EMR cluster with multiple primary nodes, how many core nodes does AWS recommend launching?

A

At least 4

158
Q

Amazon EMR on ____________ is ideal for low latency workloads that need to be run in close proximity to on-premises data and applications.

A

AWS Outposts

159
Q

Are spot instances or reserved instances supported for Amazon EMR on AWS Outposts?

A

No, only on-demand instances are supported

160
Q

By default, when you create an Amazon EMR cluster, what AMI is used?

A

Amazon Linux AMI

161
Q

When launching an Amazon EMR cluster and choosing between instance fleets or uniform instance groups, which category of nodes does this decision apply to (primary, core, task) ?

A

All of them

162
Q

When launching an Amazon EMR cluster with the uniform instance groups configuration, your cluster can include up to _____ instance groups:
_____ primary instance group(s)
_____ core instance group(s), and
up to _____ optional task instance groups.

A

50
1
1
48

163
Q

Which EMR node type does not store data?

A

Task nodes

164
Q

The DataNode daemons run on which Amazon EMR node type?

A

Core nodes

165
Q

Which Amazon EMR storage option is ephemeral, distributed, and best suited for caching results between intermediate job flow steps?

A

HDFS

166
Q

Which Amazon EMR storage option would you use to separate your compute and storage and persist data outside of the lifecycle of your cluster?

A

EMRFS (because it stores data to S3)

167
Q

Is Kinesis Data Streams a fully managed and serverless AWS service?

A

Yes

168
Q

Is Amazon EMR a fully managed and serverless AWS service?

A

No.

However, there is a new option you can use called Amazon EMR Serverless.

169
Q

The __________ is a Java library that acts as an intermediary between your record processing logic and Kinesis Data Streams.

A

Kinesis Client Library (KCL)

170
Q

Can multiple Kinesis Data Streams applications consume data from the same stream?

A

Yes

171
Q

A __________ ______ ________ is a set of shards.

A

Kinesis data stream

172
Q

A Kinesis data stream _________ contains a sequence of data records.
Each data record has a __________ __________, which is the unique identifier of each data record within a shard, but this number may overlap for a data record in a different shard.

A

shard
sequence number

173
Q

Data records within a Kinesis data stream shard are composed of what three attributes?

A
  1. sequence number
  2. partition key
  3. data blob
174
Q

A data blob is one of three attributes within a __________ within a ________ within a Kinesis data stream. The data blob can be up to ____ MB in size.

A

data record… shard… 1 MB

175
Q

By default, the retention period of the data records within a Kinesis data stream is __________, and the max retention period is _________.

A

24 hours… 365 days

176
Q

Each data record within a Kinesis data stream shard gets assigned a unique ___________ ____________.

A

sequence number

177
Q

In Kinesis Data Streams, a ___________ ______ is used to logically separate sets of data. This is generally not a 1:1 ratio to the shards. Often, one shard will have 100+ of these.

A

partition key
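
The partition-key-to-shard mapping can be sketched in a few lines: Kinesis takes the MD5 hash of the partition key to produce a 128-bit hash key, and the record lands on the shard whose hash-key range contains it. The shard ranges below are illustrative:

```python
import hashlib

MAX_HASH = 2**128 - 1

def hash_key_for(partition_key):
    """Kinesis derives a 128-bit hash key from the MD5 of the partition key."""
    return int.from_bytes(hashlib.md5(partition_key.encode("utf-8")).digest(), "big")

def shard_for(partition_key, shard_ranges):
    """Index of the shard whose inclusive hash-key range holds the key."""
    hk = hash_key_for(partition_key)
    for i, (start, end) in enumerate(shard_ranges):
        if start <= hk <= end:
            return i
    raise ValueError("hash key outside all shard ranges")

# Two shards splitting the hash-key space evenly; many distinct
# partition keys will map onto each one.
shards = [(0, MAX_HASH // 2), (MAX_HASH // 2 + 1, MAX_HASH)]
print(shard_for("sensor-17", shards))
```

Because many hash keys fall in each range, one shard routinely serves hundreds of partition keys.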

178
Q

Kinesis Data Streams uses ______________ for encryption.

A

AWS KMS master keys

179
Q

In Kinesis Data Streams, to read from or write to an encrypted stream, producer and consumer applications must have
permission to access the __________________.

A

KMS master key

180
Q

In Kinesis Data Streams, does using server-side encryption incur AWS KMS costs?

A

Yes

181
Q

In Kinesis Data Streams, by default, you can create up to _____ data streams with the on-demand capacity mode. This can be increased with a support ticket.

A

50

182
Q

In Kinesis Data Streams, what is the limit for the number of streams per account, using KDS provisioned mode?

A

No limit

183
Q

The Kinesis Data Streams GetRecords command can retrieve up to _____ MB of data per call from a single shard, and up to _________ records per call.

A

10… 10,000…

184
Q

In Kinesis Data Streams, one read transaction is also referred to as one ________________ call; the two terms are interchangeable.

A

GetRecords

185
Q

Each Kinesis Data Stream shard can support up to a maximum total data read rate of ____ MB per _________ via GetRecords.

A

2 MB… second

186
Q

In Kinesis Data Streams, can you switch the capacity mode of your stream? How often?

A

Yes.
You can switch 2x within 24 hours.

187
Q

A Kinesis data stream in the on-demand mode accommodates up to ________ the peak write throughput observed
in the previous 30 days.

A

double

188
Q

The Kinesis Data Streams ___________ capacity mode is suited for predictable traffic with capacity requirements that are easy to forecast.

A

provisioned

189
Q

In Kinesis Data Streams, can you enable server-side encryption after the stream has been created?

A

Yes

190
Q

AWS recommends (for better Kinesis Data Streams scalability) that you migrate all of your producers and consumers that call the _____________ API to instead call the ______________ and _____________ APIs.

A

DescribeStream… DescribeStreamSummary… ListShards

191
Q

What API would you use to reshard a KDS stream?

A

UpdateShardCount

192
Q

In Kinesis Data Streams, what are the two types of resharding operations?

A

shard split
shard merge
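
A shard split divides the parent shard's hash-key range in two. A sketch of an even split, where the midpoint becomes the `NewStartingHashKey` you would pass to the SplitShard API:

```python
def split_shard_ranges(start, end):
    """Even split of a parent shard's inclusive hash-key range [start, end];
    the second child begins at the midpoint (NewStartingHashKey)."""
    new_starting_hash_key = (start + end) // 2
    return (start, new_starting_hash_key - 1), (new_starting_hash_key, end)

MAX_HASH = 2**128 - 1
child_a, child_b = split_shard_ranges(0, MAX_HASH)
# The two children abut with no gap and cover the parent's full range
print(child_a[1] + 1 == child_b[0])  # → True
```

A shard merge is the inverse: two adjacent child ranges collapse back into one.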

193
Q

When changing the data retention period for your KDS stream, how quickly does the change take effect?

A

Within minutes

194
Q

You can assign your own metadata to streams you create in Amazon Kinesis Data Streams by using _______.

A

tags

195
Q

In Kinesis Data Streams, the ___________________ provides a layer of abstraction specifically for ingesting data.

A

Kinesis Producer Library (KPL)

196
Q

In Kinesis Data Streams, what is the preferred method for developing producers to add (put) data into a data stream?

A

The preferred method is to use the Kinesis Producer Library (KPL)

197
Q

_______ _____________ is a complete solution that lets frontend web and mobile developers easily build, ship, and host full-stack web/mobile applications on AWS.

A

AWS Amplify

198
Q

___________ ___________ offers marketers and developers one customizable tool to deliver customer communications across channels, segments, and campaigns at scale.

A

Amazon Pinpoint

199
Q

In Kinesis Data Streams, when you add KPL user records using the KPL addUserRecord() operation, a record is given a time stamp and added to a buffer with a deadline set by the ________________ configuration parameter.

A

RecordMaxBufferedTime

This determines how long it takes for the record to be put into the data stream.

200
Q

By default, KDS shards in a stream provide ____ MB/sec of read throughput per shard.

A

2 MB/sec

201
Q

In Kinesis Data Streams, does the default 2 MB/sec of read throughput per shard get shared across consumers?

A

Yes, this limit is fixed.

In other words, you cannot have 5 consumers all reading 1 MB/sec each from a shard, because this sums up to 5 MB/sec, which exceeds the limit.

202
Q

You can use an Amazon Kinesis Data Analytics application to process and analyze data in a KDS stream using _________, _________, or _________ (languages).

A

SQL, Java, or Scala

203
Q

In Kinesis Data Streams, the _________________ consumer applications are typically distributed, with one or
more application instances running simultaneously, for failover and load-balancing.

A

Kinesis Client Library (KCL)

204
Q

In Kinesis Data Streams, what is the term used in a KCL consumer application to describe how a consumer instance binds to (takes ownership of) processing a particular shard?

A

lease

205
Q

In Kinesis Data Streams, each KCL consumer application stores its lease information in a DynamoDB _______ _________.

What are the implications of this for the KCL consumer application names?

A

lease table

Each KCL consumer application must have a unique name, because this name is used for the DynamoDB lease table.

206
Q

Amazon Athena also allows you to run _________ __________ applications on Athena to query your data.

A

Apache Spark

207
Q

AWS Glue Studio can use datasets that are defined in the _______ ________ ________ ___________.

A

AWS Glue Data Catalog

208
Q

In Amazon Athena, when partitioning your S3 data, a common practice is to partition the data based on _________ or __________.

A

date or time

209
Q

How can you speed up the time it takes to load large CSV files into Redshift?

A

Split the CSV files into smaller chunks. The COPY command takes advantage of Redshift's massively parallel processing architecture, so splitting the files allows more distributed processing to take place.

210
Q

What does the Redshift UNLOAD command do?

A

moves the result of your Redshift query to Amazon S3

211
Q

To ensure that only certain people can view certain dashboards in Amazon Quicksight, you can use a __________ _________ file to specify the permissions to the dataset.

A

dataset rules

212
Q

Can you deliver streaming data directly from Kinesis Data Firehose to DynamoDB?

A

No, you would have to use a Lambda function as an intermediary step

213
Q

What 3 AWS services can Kinesis Data Firehose deliver data to directly? Also, can KDF deliver to other third-party destinations?

A
  1. S3
  2. Redshift
  3. OpenSearch Service

Yes, KDF can also deliver data to third parties like Splunk, Logz.io, New Relic, MongoDB, etc.

214
Q

_________________ is an AWS-managed deployment of the open source, distributed search and analytics suite derived from Elasticsearch.

A

Amazon OpenSearch Service

215
Q

The AWS ________ command is an extension of the Apache DistCp tool that you can use to copy large amounts of data.

A

S3DistCp

216
Q

A Kinesis Data Analytics application is composed of a ___________ source and an optional _______________ ________ source.

A

streaming… reference data…

217
Q

In Kinesis Data Analytics, you have the option to use a reference data source. If used, where must this reference data be stored?

A

In Amazon S3, as a CSV or JSON file.

218
Q

Kinesis Data Analytics can use __________ or ____________ as the input data stream.

A

Kinesis Data Stream… Kinesis Data Firehose…

219
Q

__________ ________ ___________ automatically provides an in-application error stream for each application. If your application has issues while processing certain records (for example, because of a type mismatch or late arrival), that record is written to the error stream.

A

Kinesis Data Analytics

220
Q

_________ __________ is a distributed streaming platform that was originally developed by LinkedIn and was made open source in 2011.

A

Apache Kafka

221
Q

In Amazon MSK (Managed Streaming for Apache Kafka), the ________ nodes receive messages from producers (publishers) and store them for the consumers (subscribers) to view.

The ________ nodes coordinate and track the broker nodes and also manage the Kafka topics (categories).

A

broker… ZooKeeper…

222
Q

After you have created an Amazon MSK cluster in your VPC, can you change which VPC your cluster is in?

A

No.

223
Q

For Amazon MSK clusters, the _____ broker type has higher throughput than the ____ broker type, and it is recommended for production workloads.

A

M5… T3…

224
Q

When using Amazon MSK, what is the maximum retention period for the data?

A

Unlimited

225
Q

In a Kinesis Data Stream using provisioned mode, each shard can support up to ___ MB/sec of write throughput and ___ MB/sec of read throughput.

A

1MB/sec… 2MB/sec…
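
Those per-shard limits determine how many shards a provisioned stream needs. A small worked sketch (note: a per-shard write limit of 1,000 records/sec also applies, though it isn't mentioned on this card):

```python
import math

def shards_needed(write_mb_s, records_s, read_mb_s):
    """Minimum provisioned shard count: 1 MB/s and 1,000 records/s of
    writes per shard, 2 MB/s of reads per shard."""
    return max(
        math.ceil(write_mb_s / 1),
        math.ceil(records_s / 1000),
        math.ceil(read_mb_s / 2),
    )

# 5 MB/s of writes, 3,000 records/s, 12 MB/s of reads:
# the read requirement dominates (12 / 2 = 6 shards).
print(shards_needed(write_mb_s=5, records_s=3000, read_mb_s=12))  # → 6
```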

226
Q

For Kinesis Data Streams in provisioned mode, the default shard quota is _____ shards per AWS account for us-west-1, us-east-1, and eu-west-1. For all other regions, the default shard quota is _____ shards per AWS account.

A

500… 200…

227
Q

The Kinesis __________ is a stand-alone Java software application that offers an easy way to collect and send data to Kinesis Data Streams.

A

Agent

228
Q

What are 3 ways you can add data to a Kinesis Data Stream?

A
  1. PutRecord API
  2. Kinesis Producer Library (KPL)
  3. Kinesis Agent
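
The lowest-level of the three, the PutRecord API, takes just a stream name, a binary payload, and a partition key. A hedged sketch that builds the request parameters (stream and key names are made up; in real use you would pass them to a boto3 `kinesis` client's `put_record`):

```python
import json

def build_put_record(stream_name, payload, partition_key):
    """PutRecord parameters: Data must be bytes (up to 1 MB), and records
    sharing a PartitionKey land on the same shard."""
    return {
        "StreamName": stream_name,
        "Data": json.dumps(payload).encode("utf-8"),
        "PartitionKey": partition_key,
    }

params = build_put_record("sensor-stream", {"temp_c": 21.4}, "sensor-17")
# boto3.client("kinesis").put_record(**params)
```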
229
Q

What 3 compression formats can you use with Kinesis Data Firehose?

A
  1. GZIP
  2. ZIP
  3. SNAPPY
230
Q

If you need to deliver data using Kinesis Data Firehose to Redshift as the final destination, which compression format do you need to use?

A

GZIP

231
Q

For Kinesis Data Firehose delivery streams, a data record can be up to _____ in size.

A

1,000 KB

232
Q

Kinesis Data Firehose uses either a _________ _______ or a _________ _______, whichever is reached first, to determine how long it takes before KDF delivers the data to the destination.

A

buffer size (in MB)… buffer interval (in seconds)…

233
Q

When delivering data to S3 using Kinesis Data Firehose, you can choose a buffer size
of __-___ MiBs and a buffer interval of __-___ seconds. The condition that is satisfied first triggers data delivery to Amazon S3.

A

1–128… 60–900…
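
Whichever buffering condition is met first triggers the flush, which a little arithmetic makes concrete (using the common S3 defaults of a 5 MiB buffer and a 300-second interval as assumptions):

```python
def seconds_until_s3_delivery(ingest_mb_per_s, buffer_mb=5, interval_s=300):
    """Firehose flushes when the buffer fills OR the interval elapses,
    whichever happens first."""
    time_to_fill = buffer_mb / ingest_mb_per_s
    return min(time_to_fill, interval_s)

print(seconds_until_s3_delivery(1.0))    # → 5.0 (the buffer fills first)
print(seconds_until_s3_delivery(0.001))  # → 300 (the interval fires first)
```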

234
Q

Which AWS service would you use for REAL-TIME analytics on IoT sensor data?

A

Kinesis Data Analytics — Remember: If the question doesn’t specifically state that you must analyze or transform the data in real-time, then KDA is probably not the correct answer.

235
Q

For the consumer applications in Kinesis Data Streams, you can choose between ______________ and ______________ consumer types to read data from a stream.

A

shared fan-out… enhanced fan-out…

236
Q

_____ ___________ is an ideal compute service for serverless application scenarios that need to scale up rapidly, and scale down to zero when not in demand.

A

AWS Lambda

237
Q

When using AWS Lambda, can you log in to compute instances or customize the operating system?

A

No, because Lambda is serverless. AWS manages these things.

238
Q

The AWS service called _____ _____ _________ builds and deploys containerized web applications automatically.

A

AWS App Runner

239
Q

AWS Lambda functions will need to assume an ___________ role when the function is invoked, in order to access other specified AWS services.

A

execution

240
Q

In AWS Lambda, you deploy your function code to Lambda using a deployment package. There are two types:
A ______ file that contains your function code and its dependencies, OR
a ___________ _________.

A

.zip file
container image

241
Q

An _______ _________ __________ is a Lambda resource that reads from an event source and invokes a Lambda function. You can use these to process items from a stream or queue in services that don’t invoke Lambda functions directly.

Two examples of AWS services that could be used for this are __________.

A

event source mapping

DynamoDB
Amazon SQS

242
Q

An AWS Lambda ___________ provides a language-specific environment that runs in an execution environment.

Examples: python3.10, nodejs18.x

A

runtime

243
Q

An AWS Lambda _________ provides a convenient way to package libraries and other dependencies that you can use with your Lambda functions. It is a .zip file that contains additional code. You can use ____ of these per function.

A

layer… 5…

244
Q

AWS Lambda layers only apply to the _____ _________ Lambda deployment package type. Functions deployed as a __________ ________ do not use layers because everything is already bundled together.

A

.zip file… container image…

245
Q

In AWS Lambda, __________ is the term used to define the number of requests that your function is serving at any given time.

A

concurrency

246
Q

In AWS Lambda, are incoming requests processed in order?

A

No, they are often processed out of order.

247
Q

An AWS Lambda ________ refers to the incoming data (in JSON format) that a Lambda function will process.

A

event

248
Q

What is the function timeout limit for AWS Lambda?

A

15 minutes

249
Q

For AWS Lambda, what are the request and response payload size limits for synchronous and asynchronous invocations?

A

synchronous: 6 MB
asynchronous: 256 KB

250
Q

For AWS Lambda functions, the default concurrent execution limit is _______ and the default storage limit for your functions is ____ GB.

A

1,000… 75 GB…

251
Q

What AWS service can schedule automated data movement and data processing throughout AWS? (Note: This AWS service is actively being phased out for better alternatives)

It allows you to define a chain of activities using data sources, destinations, and processing activities referred to as a _____________.

A

AWS Data Pipeline

pipeline

252
Q

Which AWS (serverless) service provides a visual drag-and-drop editor to build workflows, orchestrate data processing pipelines, and integrates directly with over 250 AWS services?

A

AWS Step Functions

253
Q

The two types of Amazon Redshift snapshots are _________ and _________.

________ snapshots are enabled by default. They take a snapshot every ____ hours or following every ___ GB per node of data changes.

A

automated… manual…

automated… 8… 5 GB…

254
Q

Amazon Redshift Serverless measures data warehouse capacity in _________ __________ ________.

A

Redshift Processing Units (RPUs)

255
Q

The default base capacity for Amazon Redshift Serverless is ____ RPUs and if you only want to run simple workloads, the minimum capacity is ____ RPUs.

Note: 1 RPU provides ___ GB of memory.

A

128 RPUs… 8 RPUs…

16 GB

256
Q

AWS Lake Formation uses __________ ___________ functionality to provide temporary credentials for granting short-term access to S3 data.

A

credential vending

257
Q

How would you allow your S3 data to be used in AWS Lake Formation?

A

You would have to “register” the S3 data with AWS Lake Formation.

258
Q

Does AWS Lake Formation only use IAM policies for its permissions?

A

No, it has its own internal permissions system that augments IAM policies.

259
Q

An AWS Lake Formation __________ is a container for a set of related AWS Glue jobs, crawlers, and triggers. You create this in AWS Lake Formation, and it executes in the AWS Glue service.

A

workflow

260
Q

AWS Lake Formation workflows are created based on _________, which are predefined templates for ingesting data from a particular source (RDS, CloudTrail, etc.) into a data lake.

A

blueprints

261
Q

AWS Lake Formation uses the _____ _______ _______ ___________ to store metadata about its data sources.

A

AWS Glue Data Catalog

262
Q

What 5 AWS services can you layer on top of Lake Formation that all honor Lake Formation’s granular permissions?

A
  1. AWS Glue
  2. Amazon Athena
  3. Amazon Redshift Spectrum
  4. Amazon QuickSight (Enterprise Edition)
  5. Amazon EMR
263
Q

How would you connect Amazon QuickSight to Amazon Redshift?

A

Create a security group for the Redshift cluster. Allow inbound access from the IP address range of the QuickSight servers.

264
Q

Does Amazon QuickSight support Amazon Athena as a data source?

A

Yes

265
Q

How many AWS Glue Data Catalogs does each AWS account have per Region?

A

1 Data Catalog per Region

266
Q

An AWS Glue job is composed of a _________, ______ _________, and ______ _________.

A

script… data source… data target

267
Q

With AWS Glue, you are charged an hourly rate based on the number of ____________________ used to run your ETL job.

A

Data Processing Units (DPUs)

268
Q

In AWS Glue, a Data Processing Unit (DPU) is also referred to as a ___________.

A

worker

269
Q

AWS Glue ___________ are used to organize metadata tables in AWS Glue. When you define a table in the AWS Glue Data Catalog, you add it to a ___________.

A

databases… database

270
Q

An AWS Glue ____________ is a Data Catalog object that stores login credentials, URI strings, virtual private cloud (VPC) information, and more for a particular data store.

A

connection

271
Q

In AWS Glue, you set up your crawler with an ordered set of ___________ that will read the data in a data store and return a certainty number between 0.0 and 1.0.

A

classifiers

272
Q

In AWS Glue, you can collect metrics about your jobs to view in CloudWatch by enabling the _____ ___________ option within AWS Glue.

A

job metrics

273
Q

An AWS Glue ________ policy can be used to control access to AWS Glue Data Catalog resources.

A

resource

274
Q

What is one reason you would create an AWS Glue Data Catalog table manually rather than using the crawler?

A

If you want custom naming conventions for your tables.

Using the crawler will automatically name the tables.

275
Q

In AWS Glue, you can use ______ ____________ to keep track of previously processed data. When a job runs, only new incremental data is processed since the last checkpoint.

A

job bookmarks
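
Bookmarks are toggled through a job argument. A minimal sketch of the default-arguments fragment you might pass when creating or starting a Glue job:

```python
def bookmark_args(enabled=True):
    """Glue job-bookmark argument: 'job-bookmark-enable' processes only
    new data since the last checkpoint; 'job-bookmark-disable' always
    processes the whole dataset."""
    option = "job-bookmark-enable" if enabled else "job-bookmark-disable"
    return {"--job-bookmark-option": option}

print(bookmark_args())  # → {'--job-bookmark-option': 'job-bookmark-enable'}
```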

276
Q

MQTT is a messaging protocol used for IoT devices. AWS supports this protocol with its ______ ______ ______ service.

A

AWS IoT Core

277
Q

______ ______ __________ is an AWS service that filters, transforms, and enriches IoT data before storing it in a time-series data store for analysis.

A

AWS IoT Analytics

278
Q

In Amazon Athena, when you partition your S3 data, sometimes the new partitions don't load into Athena. What command can you use to solve this? (this command only works with Hive-style partitions)

A

MSCK REPAIR TABLE
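
The command can be run through Athena's StartQueryExecution API. A sketch that builds the request (database, table, and result-bucket names are placeholders):

```python
def build_repair_query(database, table, output_s3):
    """StartQueryExecution request that loads any new Hive-style
    partitions for the given table into Athena."""
    return {
        "QueryString": f"MSCK REPAIR TABLE {table}",
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_s3},
    }

params = build_repair_query("sales_db", "daily_events", "s3://my-athena-results/")
# boto3.client("athena").start_query_execution(**params)
```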

279
Q

Can Amazon QuickSight connect to a Redshift cluster that is in a different region?

A

Yes

280
Q

In Amazon Redshift, the term “predicate” simply refers to a ___________.

A

condition

281
Q

For Amazon Redshift tables that are not frequently updated, what distribution style is most appropriate?

A

DISTSTYLE ALL

282
Q

In Amazon Redshift, a best practice for choosing the right distribution style is to use a column with a ______ cardinality.

A

high

283
Q

_____ ___________ _____ ___________ _______ is a feature that lets you send a stream of log events from CloudWatch Logs to other AWS services for custom processing.

A

Amazon CloudWatch Logs Subscription Filters

284
Q

With ML Insights, Amazon QuickSight provides what three major features?

A
  1. anomaly detection
  2. forecasting
  3. autonarratives
285
Q

When using AWS Lambda to process data from a Kinesis Data Stream or a DynamoDB data stream, what setting can you configure in AWS Lambda to process each shard with more than one simultaneous Lambda invocation?

A

ParallelizationFactor (can set this between 1 and 10)
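
A sketch of the CreateEventSourceMapping request where this setting lives (the ARN and function name are placeholders):

```python
def build_esm_request(stream_arn, function_name, factor):
    """CreateEventSourceMapping request; ParallelizationFactor (1-10) sets
    how many Lambda invocations can process one shard concurrently."""
    if not 1 <= factor <= 10:
        raise ValueError("ParallelizationFactor must be between 1 and 10")
    return {
        "EventSourceArn": stream_arn,
        "FunctionName": function_name,
        "StartingPosition": "LATEST",
        "ParallelizationFactor": factor,
    }

req = build_esm_request(
    "arn:aws:kinesis:us-east-1:123456789012:stream/example-stream",
    "process-records",
    factor=4,
)
```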

286
Q

When launching an EMR cluster using the RunJobFlow API, you can optionally set the __________ parameter to TRUE so that the cluster will transition to the WAITING state rather than shutting down after the steps have completed.

A

KeepJobFlowAliveWhenNoSteps
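
A trimmed sketch of where that parameter sits in a RunJobFlow request (instance types, release label, and names are illustrative, not prescriptive):

```python
def build_run_job_flow(name, log_uri, keep_alive=True):
    """Minimal RunJobFlow request; KeepJobFlowAliveWhenNoSteps=True leaves
    the cluster WAITING instead of terminating after its steps finish."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",
        "LogUri": log_uri,
        "Instances": {
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": keep_alive,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

req = build_run_job_flow("ad-hoc-analysis", "s3://my-emr-logs/")
```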

287
Q

Amazon S3 Select works on objects stored in ____, _____, or Apache _______ format.

It also works with objects that are compressed with ______ or _______, and server-side encrypted objects.

A

CSV… JSON… Parquet

GZIP… BZIP2…

288
Q

An Amazon OpenSearch Service _______ is the terminology used to refer to an OpenSearch cluster.

A

domain

289
Q

An Amazon OpenSearch Serverless __________ refers to an auto-scaling OpenSearch cluster.

A

collection

290
Q

Amazon OpenSearch Serverless collections are always ___________.

A

encrypted

291
Q

What are the two primary use cases for Amazon OpenSearch Serverless?

A
  1. log analytics
  2. full-text search
292
Q

OpenSearch serverless has a decoupled architecture that separates the _________ (ingest) components from the ________ (query) components.

A

indexing… search…

293
Q

What are the two primary collection types in OpenSearch Serverless?

A
  1. time-series
  2. search
294
Q

OpenSearch Serverless compute capacity is measured in ___________ ____________ ______.

When you create your first collection, OpenSearch Serverless instantiates a total of _____ (____ each for indexing and search).

A

OpenSearch Capacity Units (OCUs)

4… 2…

295
Q

In Amazon OpenSearch Serverless:

_________ collections use a combination of hot and warm caches.
_________ collections store all data in the hot cache.

A

time-series
search

296
Q

In terms of encryption, what is the difference between OpenSearch Service (i.e. manual provisioned clusters) and OpenSearch Serverless?

A

Encryption is optional for OpenSearch Service.

Encryption is required for OpenSearch Serverless.

297
Q

In OpenSearch Service, if the JVM memory pressure metric is too high, you can solve this by reducing the traffic to your cluster by ___________ (increasing/decreasing) the number of shards.

A

decreasing

298
Q

When determining the DISTKEY in Amazon Redshift, does the DISTKEY column have to be the same between the dimension table and fact table?

A

Yes.

Use the dimension table’s primary key and the fact table’s corresponding foreign key.

299
Q

In Amazon Redshift, can the distribution style be different between a dimension table and a fact table?

A

Yes.

You can have different distribution styles for each of your tables.

300
Q

What feature of S3 is ideal when you need to transfer gigabytes to terabytes of data on a regular basis across continents?

A

S3 Transfer Acceleration

301
Q

______ ___________ __________ is a secure transfer service that enables you to transfer files into and out of AWS storage services.

The two supported AWS storage services are _______ and _______.

A

AWS Transfer Family

S3… EFS…

302
Q

Which AWS service allows you to quickly move large amounts of data between on-premises and AWS infrastructure and provides end-to-end security, including encryption and integrity validation?

A

AWS DataSync

303
Q

In Amazon Redshift, you can use ____________ _____________ to create query queues to efficiently configure query traffic so short, fast-running queries won’t get stuck in queues behind long-running queries.

A

workload management (WLM)

304
Q

Are AWS Glue streaming ETL jobs executed in real-time?

A

No, the data is processed in 100-second windows.

305
Q

Can Amazon QuickSight use Amazon EMR as a data source?

A

No.

306
Q

If you have 12 months of data stored in Amazon Redshift but your business needs only require querying the last 2 months worth of data, what can you do to reduce costs?

A

You can unload the first 10 months of data into S3, because otherwise the additional Redshift data will slow down the queries.

307
Q

When querying S3 with Amazon Athena, what are two things you can do to reduce the cost and also increase query performance?

A
  1. store your S3 data in Avro or Parquet format
  2. compress your S3 data
308
Q

To improve query performance and reduce costs when using Amazon Athena, you can partition your data which restricts the amount of data scanned by each query.

A customer who has data coming in on a ______ basis would partition by year, month, day, and hour.
A customer who has data coming in on a ______ basis would partition by a data source identifier and date.

A

hourly

daily
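
For the hourly case, the Hive-style S3 layout can be generated like this (the bucket and prefix are made up for illustration):

```python
from datetime import datetime, timezone

def hourly_partition_prefix(base, ts):
    """Hive-style prefix for data partitioned by year, month, day, and hour."""
    return (f"{base}/year={ts.year}/month={ts.month:02d}"
            f"/day={ts.day:02d}/hour={ts.hour:02d}/")

prefix = hourly_partition_prefix(
    "s3://my-data-lake/events",
    datetime(2023, 7, 4, 15, tzinfo=timezone.utc),
)
print(prefix)  # → s3://my-data-lake/events/year=2023/month=07/day=04/hour=15/
```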

309
Q

In Amazon Athena, you can use ____________ to isolate workloads, users, and teams into groups. This can help to control costs by tracking each user’s queries that they run and sending the query metrics to CloudWatch.

A

workgroups

310
Q

When creating a data lake with AWS Lake Formation, would you register the S3 bucket name or the S3 bucket path with Lake Formation?

A

S3 bucket path

311
Q

In Amazon EMR, if you attach an EBS volume to your cluster for increased storage and you want to encrypt it, what type of encryption is used?

A

LUKS

Note: The recommended way to encrypt EBS volumes on EMR is with “EBS encryption” … This applies to both the root volume and any attached volumes. In contrast, LUKS encryption only applies to attached volumes.

312
Q

To query data in S3, you can use S3 Select, Athena, or Redshift Spectrum. S3 Select is different because it only allows you to query a _________ of the data.

A

subset

313
Q

With Amazon EMR, can you run analyses on data stored in DynamoDB ?

A

Yes, they are natively integrated.

314
Q

In Amazon CloudWatch Events, a ________ is the terminology used that matches incoming events and routes them to targets for processing.

A

rule

315
Q

In Amazon CloudWatch Events, a ________ is the terminology used for the thing that processes events.

A

target

316
Q

__________ __________ _________delivers a near real-time stream of system events that describe changes in AWS resources. Using simple rules that you can quickly set up, you can match events and route them to one or more target functions or streams.

A

Amazon CloudWatch Events

317
Q

In Amazon EMR, the _______ metric tracks whether a cluster is live,
but not currently running tasks.

A

IsIdle

318
Q

When data is moved or transitioned to the Amazon S3 _________ storage class, it is no longer readable or queryable by Athena.

A

GLACIER

319
Q

Can Amazon Athena query S3 data in a different region?

A

Yes

320
Q

The S3DistCp command is primarily used in copying data from Amazon S3 to ___________ and not to Redshift.

A

Amazon EMR

321
Q

The DynamoDB object size limit is _______ per item.

A

400 KB

322
Q

Would you use IAM policies for column-based permissions in Amazon Redshift? If not, how would you handle this?

A

No, column-level permissions are managed within Redshift itself by using the GRANT command

323
Q

The _______________ configuration for an EMR cluster (which applies to all node categories: primary/core/task) offers the widest variety of provisioning options for EC2 instances.

A

instance fleets

324
Q

When launching an Amazon EMR cluster, what configuration type will typically result in better price performance?

A

instance fleets

325
Q

When using KMS server-side encryption in Kinesis Data Streams, how frequently does Kinesis make an API call to KMS to rotate the key?

A

approximately every 5 minutes

326
Q

An ______ is a friendly name for an AWS KMS key (e.g. “test-key-1”)

A

alias

327
Q

How quickly can you access your data when using S3 Glacier with expedited retrieval?

A

1-5 minutes

328
Q

Before using the COPY command to move S3 data to Redshift, what can you do with your files to make sure you are taking maximum advantage of the MPP architecture of Redshift?

A

Make sure your quantity of S3 files is a multiple of the number of slices in your Redshift cluster.
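
The slice-alignment rule above can be sketched as a small helper that rounds a file count up to the next multiple of the cluster's slice count (the function name is illustrative, not an AWS API):

```python
def split_count_for_slices(num_files, num_slices):
    """Return the smallest file count >= num_files that is a multiple of
    the cluster's slice count, so every slice loads an equal share."""
    if num_files % num_slices == 0:
        return num_files
    return ((num_files // num_slices) + 1) * num_slices
```

For example, with a 4-slice cluster and 10 input files, you would split the data into 12 files so no slice sits idle during the COPY.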

329
Q

In Kinesis Data Analytics, do tumbling windows overlap?

A

No

330
Q

In Kinesis Data Analytics, _______ windows have a fixed starting and ending time, whereas ________ windows don’t begin until the first event matching the partition key arrives.

A

tumbling… stagger…

331
Q

Amazon QuickSight can create visualizations based on S3 data stored in _____ or _____ format, but not in _____ format.

A

csv or json… not in parquet

332
Q

Does Amazon Athena support column names that include special characters?

A

No, only alphanumeric characters and underscores are supported.

333
Q

Storing passwords in your AWS Glue ETL job script is not recommended. What does AWS recommend instead for using passwords in your script?

A

Use boto3 to retrieve your passwords from AWS Secrets Manager or AWS Glue Data Catalog.
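
As a minimal sketch of the Secrets Manager approach: the secret payload is typically a JSON string, which the ETL script parses into connection options. The field names (host/port/dbname/username/password) are assumptions about how the secret was stored:

```python
import json

def glue_jdbc_options(secret_string):
    """Build JDBC connection options from a Secrets Manager payload.

    The secret's field names are assumptions about how it was created;
    adjust them to match your own secret."""
    secret = json.loads(secret_string)
    return {
        "url": f"jdbc:postgresql://{secret['host']}:{secret['port']}/{secret['dbname']}",
        "user": secret["username"],
        "password": secret["password"],
    }
```

Inside the job, you would obtain `secret_string` with `boto3.client("secretsmanager").get_secret_value(SecretId=...)["SecretString"]` instead of hard-coding credentials in the script.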

334
Q

Can you use the COPY command to copy data directly from an RDS database into Amazon Redshift?

A

No

335
Q

In Amazon Redshift, you can enable _________ ____________ to track information about authentication attempts, connections, disconnections, changes to database user definitions, and queries run in the database.

A

audit logging

336
Q

By default, Amazon EMR uses _________ as a cluster resource manager.

A

YARN (Yet Another Resource Negotiator)

337
Q

For Amazon EMR, you can use automatic scaling with a custom policy for which EMR configuration type?

A

instance groups

338
Q

Each EMR instance group in a cluster, except the instance group of the _________ node, can have its own auto-scaling policy, which consists of scale-out and scale-in rules.

Remember: EMR auto-scaling only applies to the instance groups configuration.

A

primary

339
Q

Regarding S3 data storage formats, ____________ is better for saving storage space, whereas __________ is better for efficient read-heavy operations.

A

compressed csv… Apache ORC…

340
Q

AWS Database Migration Service (AWS DMS) is a web service that you can use to migrate data from a source data store to a target data store. However, what is the one requirement for using AWS DMS?

A

You can’t use AWS DMS to migrate from an on-premises database to another on-premises database. One of your endpoints (source or target) must be on an AWS service.

341
Q

In AWS DMS, you can use the ________ ____________ feature to collect data from your on-premises database and analytic servers, and build an inventory of servers, databases, and schemas that you can migrate to AWS.

A

Fleet Advisor

342
Q

For “ad-hoc”, “infrequent”, “cost-effective” analysis requirements, would EMR be a good solution?

A

No, it would be overkill and too expensive. Use Athena instead.

343
Q

If your application is having trouble reading all of the required S3 data in an efficient manner, how can you scale the S3 read performance?

A

Since S3 supports 5,500 GET/HEAD requests per second per prefix in a bucket, the best solution would be to add more prefixes (e.g. 10 prefixes = 55,000 GET requests per second).
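
The arithmetic behind this scaling rule is simply the per-prefix limit multiplied by the number of distinct prefixes (the helper name is illustrative):

```python
def max_get_rps(num_prefixes, per_prefix_limit=5500):
    """Aggregate S3 GET/HEAD request capacity across distinct key prefixes."""
    return num_prefixes * per_prefix_limit
```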

344
Q

If your application needs to scan millions or billions of objects in S3, what does AWS recommend doing to help parallelize read capacity and performance?

A

It is recommended to create a random string and add that to the beginning of the object prefixes.

In effect, this creates more unique prefixes, which dramatically increases read capacity, since each bucket prefix supports 5,500 GET/HEAD requests per second.
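
One common way to implement this (a sketch, not an AWS-prescribed function) is to derive a short, deterministic hash from the key and prepend it, so related objects still map to stable but well-distributed prefixes:

```python
import hashlib

def salted_key(key, width=4):
    """Prepend a short hash-derived prefix to the key so objects spread
    across many distinct S3 prefixes (width and hash choice are illustrative)."""
    salt = hashlib.md5(key.encode("utf-8")).hexdigest()[:width]
    return f"{salt}/{key}"
```

Because the salt is deterministic, readers can recompute the full key from the original name rather than having to list the bucket.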

345
Q

In Kinesis Data Analytics, what query method would you use to monitor a chosen stock in real time, identify when a threshold is reached, and then create a real-time notification for the customer?

A

continuous query

346
Q

When you terminate an EMR cluster, Amazon EMR retains metadata about the cluster for ______ months at no charge.

A

two

347
Q

In Amazon OpenSearch Service, __________ storage provides a cost-effective way to store large amounts of read-only data for your OpenSearch Service cluster.

These nodes use Amazon S3 and a sophisticated caching solution to improve performance.

A

UltraWarm

348
Q

In Amazon OpenSearch Service, _________ storage takes the form of instance stores or Amazon EBS volumes attached to each node and provides the fastest possible performance for indexing and searching new data.

A

hot

349
Q

In Amazon OpenSearch Service, _________ storage lets you store any amount of infrequently accessed or historical data on your Amazon OpenSearch Service domain and analyze it on demand, at a lower cost than other storage tiers.

A

cold

350
Q

In Amazon OpenSearch Service, _________ ________ ____________ lets you define custom management policies that automate routine tasks. For example, you can define a policy that moves your index into a read_only state after 30 days and then ultimately deletes it after 90 days.

A

Index State Management (ISM)

351
Q

In Amazon OpenSearch Service, you can use ________ ___________ to reduce storage costs by periodically rolling up old data into summarized indices. This allows you to store months or years of historical data at a fraction of the cost with the same query performance.

A

index rollups

352
Q

In Amazon OpenSearch Service, you can create ________ ___________ jobs to visualize, analyze, and/or summarize your data in different ways.

A

index transform

353
Q

When using the AWS Glue Schema Registry feature, a ___________ is a logical container of schemas.

A

registry

354
Q

AWS Glue comes with ________ worker types to help you select the configuration that meets your job latency and cost requirements.

List them.

A

three

  1. standard
  2. G.1X
  3. G.2X
355
Q

What are 4 data sources that Redshift can use with its COPY command?

A
  1. S3
  2. EMR cluster files
  3. EC2 Instance files
  4. DynamoDB
356
Q

To meet compliance requirements and ensure Amazon EMR cluster data is not publicly accessible, make sure the ________ _________ _________ option is enabled in the console.

A

block public access

357
Q

How quickly can the following visualization tools pull in the data? Which tool supports “near-real-time” and “time-sensitive” dashboards?
1. QuickSight
2. OpenSearch Dashboards (Kibana)

A
  1. minimum of 15 minutes
  2. near-real-time
358
Q

To query multiple distributed datasets in-place with SQL, you could use ________ ___________ running on Amazon EMR.

A

Apache Presto

359
Q

If the Kinesis Producer Library (KPL) “RecordMaxBufferedTime” property results in a delay that the application cannot tolerate, you can optionally use the AWS SDK directly. For example, you can update your producer’s code to use the __________ API call.

A

PutRecord(s)
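
A minimal sketch of the SDK path: shape the records into the Entries list that the Kinesis PutRecords API expects (the field names and partition-key choice are illustrative):

```python
import json

def build_entries(records, partition_key_field):
    """Shape raw dict records into the Records list for Kinesis PutRecords.

    partition_key_field names which record field to use as the partition
    key; this is an assumption about the record layout."""
    return [
        {
            "Data": json.dumps(r).encode("utf-8"),
            "PartitionKey": str(r[partition_key_field]),
        }
        for r in records
    ]
```

The batch would then be sent with `boto3.client("kinesis").put_records(StreamName="my-stream", Records=build_entries(batch, "ticker"))`, which bypasses the KPL's buffering delay at the cost of doing your own batching and retry handling.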

360
Q

Can you create an Amazon QuickSight dashboard directly from S3 data stored in Apache Parquet format?

A

No, you would need to use Athena to query the S3 data in Apache Parquet format, and then use Athena as the data source for QuickSight.

361
Q

In Amazon QuickSight, which query mode (direct query or SPICE) is most cost-effective when you have 1000 readers?

A

SPICE, because the data stored in SPICE can be reused without incurring additional costs. Direct Query mode would incur costs every time someone views the dashboard because the data is refreshed constantly.

362
Q

Amazon Athena has a feature called ___________ __________ that lets you run SQL queries across relational, non-relational, object, and custom data sources.

A

Federated Query

363
Q

What are the two types of cost constraints when using Amazon Athena workgroups?

A
  1. per-query limit
  2. per-workgroup limit
364
Q

By default, each AWS account has a _________ workgroup within Amazon Athena. This ________ (can / cannot) be deleted.

A

primary… cannot…

365
Q

In Amazon Athena, you can set up _____________ settings that enforce constraints for all queries that run in a workgroup.

A

workgroup-wide

366
Q

In Amazon Athena workgroups, how many “per-query” limits and “per-workgroup” limits can you create?

A

Only 1 “per-query” limit per workgroup.

Multiple “per-workgroup” limits.
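
The per-query limit is set in the workgroup configuration itself. A sketch of the arguments for Athena's CreateWorkGroup API (the workgroup name and threshold below are illustrative):

```python
def workgroup_request(name, query_limit_bytes):
    """Build the keyword arguments for Athena CreateWorkGroup with a
    per-query data scan limit (name and limit are example values)."""
    return {
        "Name": name,
        "Configuration": {
            "EnforceWorkGroupConfiguration": True,
            "BytesScannedCutoffPerQuery": query_limit_bytes,
        },
    }
```

These arguments would be passed as `boto3.client("athena").create_work_group(**workgroup_request("analytics", 10 * 1024**3))`; queries in that workgroup are then cancelled once they scan more than the cutoff.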

367
Q

In Amazon Athena, _________ _________ write new data to a specified location in Amazon S3, whereas “views” do not write any data.

A

CTAS queries
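
The shape of a CTAS statement can be sketched as a small template helper (the table names, S3 location, and helper function itself are illustrative; the SQL syntax follows Athena's CTAS form):

```python
def ctas_statement(new_table, select_sql, s3_location, fmt="PARQUET"):
    """Compose an Athena CTAS statement that writes its result set to
    the given S3 location in the given format."""
    return (
        f"CREATE TABLE {new_table} "
        f"WITH (external_location = '{s3_location}', format = '{fmt}') "
        f"AS {select_sql}"
    )
```

A view, by contrast, is only a stored SELECT: it writes nothing to S3 and re-scans the underlying data on every query.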

368
Q

_____________ and _____________ are two complementary ways to reduce the amount of data Amazon Athena must scan when you run a query.

A

Partitioning and bucketing

369
Q

In Amazon Athena, good candidates for partition keys are columns that have ______ cardinality.

A

low

370
Q

What are two rules of thumb for deciding what data columns to use for Amazon Athena bucketing?

A
  1. high-cardinality columns
  2. evenly distributed values (i.e. you want every bucket to have approximately the same amount of data)
371
Q

____ ________ ________ _______ helps you to easily deploy and enforce compliance controls for individual S3 Glacier vaults.

A _______ ________ policy can be locked to prevent future changes, which provides strong enforcement for your compliance controls.

A

S3 Glacier Vault Lock

Vault Lock

372
Q

When using job bookmarks in AWS Glue, always have the _________ command at the beginning of the script and the __________ command at the end of the script.

A

job.init() … job.commit() …
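
The bookmark pattern can be sketched as a minimal Glue job script. This fragment only runs inside an AWS Glue job environment (the `JOB_NAME` argument and the ETL work in the middle are placeholders):

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())

# job.init() first, so Glue knows which bookmark state to resume from
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# ... read sources, transform, and write targets here ...

# job.commit() last, so the bookmark only advances after a successful run
job.commit()
```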

373
Q

The Kinesis Client Library (KCL) is only able to use __________ as the checkpointing table.

A

DynamoDB

374
Q

AWS recommends using a Snowball when the data to be transferred is less than _____ and a Snowmobile when the data is >= _________

A

10PB… 10PB…

375
Q

Can the Kinesis Agent send data to both Kinesis Data Streams (KDS) and Kinesis Data Firehose (KDF)?

A

Yes

376
Q

When using the AWS Glue crawler, you can use _________ ___________ to ignore the filepaths that you do not need or that have already been crawled.

A

exclude patterns
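
The effect of exclude patterns can be approximated with Python's `fnmatch` globbing (a sketch only; Glue's own glob syntax is richer, supporting `**`, braces, and more, and the patterns below are examples):

```python
from fnmatch import fnmatch

def should_crawl(path, exclude_patterns):
    """Approximate the crawler's exclude behavior: skip any path that
    matches one of the exclude globs."""
    return not any(fnmatch(path, pat) for pat in exclude_patterns)
```

For example, with patterns `["*.tmp", "archive/*"]`, temporary files and anything under `archive/` would be skipped by the crawl.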

377
Q

What are the two primary reasons why you might see duplicate records when using Kinesis Data Streams?

A
  1. producer retries
  2. consumer retries
378
Q

An _________ schema in Amazon Redshift is a logical grouping of tables that are not stored in Redshift, but are accessible to Redshift through a data catalog.

A

external

379
Q

What is the message size limit in Amazon SQS?

A

256 KB

380
Q

In Amazon SQS, a FIFO queue can support up to _______ messages per second with batching, or up to _______ messages per second without.

A

3,000… 300…

381
Q

When given the choice between different partitioning methods in S3, always lean towards partitioning by ________ unless you have a VERY strong reason not to.

A

date
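
Date partitioning is usually written as Hive-style `key=value` path segments, which Athena, Glue, and EMR all recognize. A sketch of building such a key (the prefix and filename are illustrative):

```python
from datetime import date

def partitioned_key(prefix, d, filename):
    """Build a Hive-style, date-partitioned S3 key for a given date."""
    return f"{prefix}/year={d.year}/month={d.month:02d}/day={d.day:02d}/{filename}"
```

Queries filtered on year/month/day can then prune entire partitions instead of scanning the whole dataset.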

382
Q

If your AWS Glue job uses a JDBC connection and it is running slow and timing out, how can you solve this problem?

A

Read the JDBC dataset with parallelization by using multiple JDBC connections instead of the default (which is only one connection).

383
Q

A company that needs the most cost-effective solution for a machine learning application should deploy the application using ____________ as opposed to ______________.

A

EC2 instances… SageMaker…

384
Q

When setting up the federated queries feature of Amazon Athena to join together different data sources, is this easy to set up?

A

No, it requires a lot of development effort. If the end goal is to create visualizations, a better solution would be to connect QuickSight directly to the relevant data sources.

385
Q

In Amazon Redshift, AWS recommends using _____________ access control, instead of _________, to manage access to sensitive columns within a table.

A

column-level… views…