Domain 4: Analysis Flashcards

Question 1

Q

RANDOM_CUT_FOREST

Answer

A

Kinesis Data Analytics SQL (or Flink) Function for anomaly detection in numeric columns

Question 2

Q

Kinesis Firehose Buffer Limits

Answer

A

1 to 128 MB

60 to 900 seconds
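
As a quick way to internalize the two ranges above, here is a minimal sketch of a range check for Firehose buffer settings. The function name and parameter names are hypothetical and not part of any AWS SDK; only the numeric limits come from the card.

```python
# Limits taken from the card above; everything else is illustrative.
MIN_SIZE_MB, MAX_SIZE_MB = 1, 128
MIN_INTERVAL_S, MAX_INTERVAL_S = 60, 900


def validate_buffer_hints(size_mb: int, interval_s: int) -> list:
    """Return a list of problems; an empty list means both hints are in range."""
    problems = []
    if not MIN_SIZE_MB <= size_mb <= MAX_SIZE_MB:
        problems.append(
            f"buffer size {size_mb} MB is outside {MIN_SIZE_MB}-{MAX_SIZE_MB} MB"
        )
    if not MIN_INTERVAL_S <= interval_s <= MAX_INTERVAL_S:
        problems.append(
            f"buffer interval {interval_s} s is outside {MIN_INTERVAL_S}-{MAX_INTERVAL_S} s"
        )
    return problems
```

Firehose flushes when either limit is reached first, so both values matter in practice.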

Question 3

Q

Kinesis Data Analytics Supported Sources

Answer

A

Kinesis Streams and Kinesis Firehose

Question 4

Q

Kinesis Data Analytics Supported Destinations

Answer

A

Kinesis Streams, Kinesis Firehose, Lambda

Question 5

Q

What happens if a record arrives late to a Kinesis Data Analytics application?

Answer

A

Record is written to the error stream

Question 6

Q

In what form does Kinesis Data Analytics provision capacity?

Answer

A

Kinesis Processing Units

Question 7

Q

How much memory is provided per KPU?

Question 8

Q

What is the default number of KPU per Kinesis Data Analytics application?

Question 9

Q

What is the name of the visualization tool in the Elastic Stack?

Question 10

Q

Is ElasticSearch Serverless?

Answer

A

No, you still have to scale servers

Question 11

Q

What should ElasticSearch NOT be used for?

Answer

A

OLTP (RDS or DynamoDB instead)

Ad-Hoc Querying (Athena instead)

Question 12

Q

How can data be imported to ElasticSearch?

Answer

A

Kinesis, DynamoDB, Logstash, Beats, ElasticSearch API

Question 13

Q

What query engine does Athena use?

Question 14

Q

What data formats does Athena support?

Answer

A

CSV, JSON, Parquet, ORC, Avro

Question 15

Q

Is Athena serverless?

Question 16

Q

Does Athena support unstructured data?

Question 17

Q

Which data formats are columnar?

Answer

A

ORC and Parquet

Question 18

Q

Which data formats are splittable?

Answer

A

ORC, Parquet, Avro

Question 19

Q

Which notebooks can Athena integrate with?

Answer

A

Jupyter, Zeppelin, RStudio

Question 20

Q

What is the cost rate for Athena?

Answer

A

$5 per TB scanned
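
The rate above turns into a one-line estimate. This is a rough sketch only: the function name is hypothetical, treating 1 TB as 1024^4 bytes is an assumption, and any per-query minimums or rounding rules are deliberately ignored.

```python
PRICE_PER_TB_USD = 5.0      # rate from the card above
BYTES_PER_TB = 1024 ** 4    # assumption: binary terabyte


def athena_query_cost(bytes_scanned: int) -> float:
    """Estimate the charge for one query from the bytes it scanned."""
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD
```

Because the charge depends only on bytes scanned, anything that reduces scanning (columnar formats, compression, partitioning) lowers the bill directly.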

Question 21

Q

Do cancelled queries count toward Athena charges?

Question 22

Q

Do failed queries count toward Athena charges?

Question 23

Q

What data format will be the most cost effective in Athena?

Answer

A

Columnar (ORC, Parquet)

Question 24

Q

Does Athena charge for DDL processing?

Question 25

Q

How can Athena results be encrypted?

Answer

A

Encrypt at rest in S3 using SSE-S3, SSE-KMS, CSE-KMS

Question 26

Q

Can Athena access S3 in another account?

Question 27

Q

How are Athena results encrypted in transit?

Answer

A

Transport Layer Security (TLS)

Question 28

Q

Is Redshift Serverless or Fully Managed?

Answer

A

Fully Managed

Question 29

Q

What is the maximum number of compute nodes in a Redshift cluster?

Question 30

Q

What are the two types of compute nodes that can be selected for a Redshift cluster?

Answer

A

Dense Storage (DS) - uses HDDs for large size at low cost
Dense Compute (DC) - uses SSD and lots of memory for faster performance at a higher cost

Question 31

Q

How many HDDs on a ds2.xlarge Redshift compute node?

Answer

A

3 for a total of 2TB storage

Question 32

Q

How many HDDs on a ds2.8xlarge Redshift compute node?

Answer

A

24 for a total of 16TB storage

Question 33

Q

How many SSDs on a dc2.large Redshift compute node?

Answer

A

160GB SSD storage, 15GB RAM

Question 34

Q

How many SSDs on a dc2.8xlarge Redshift compute node?

Answer

A

2.6TB SSD, 244GB RAM

Question 35

Q

What determines the number of Node Slices on a Compute Node?

Answer

A

The size of the Compute Node

Question 36

Q

What kind of data storage does Redshift use for high performance?

Question 37

Q

Can you change the compression encoding for a column after a table is created in Redshift?

Question 38

Q

How many copies of your data are stored within Redshift?

Answer

A

Three - one main on cluster, one backup on cluster, one snapshot in S3

Question 39

Q

Can Redshift data be backed up to another region?

Answer

A

Yes - asynchronously in S3

Question 40

Q

How many AZs is Redshift limited to?

Question 41

Q

What is the default Redshift distribution style?

Question 42

Q

What is the EVEN Redshift distribution style?

Answer

A

Steps through each slice and assigns data in round-robin fashion

Question 43

Q

What is the KEY Redshift distribution style?

Answer

A

Assigns data to each slice based on a selected key column. Ideal if you plan to query data on a specific column.

Question 44

Q

What is the ALL Redshift distribution style?

Answer

A

All data is replicated on every node in the cluster. Multiplies storage by the number of nodes in the cluster.
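
The three distribution styles on the preceding cards (EVEN, KEY, ALL) can be contrasted with a toy model. This is an illustrative sketch, not how Redshift is implemented: the function names are hypothetical, rows are plain dicts, and `zlib.crc32` stands in for whatever hash Redshift actually uses.

```python
import zlib


def distribute_even(rows, num_slices):
    """EVEN: step through the slices and assign rows round-robin."""
    slices = [[] for _ in range(num_slices)]
    for i, row in enumerate(rows):
        slices[i % num_slices].append(row)
    return slices


def distribute_key(rows, num_slices, key):
    """KEY: rows with the same value in the key column land on the same slice."""
    slices = [[] for _ in range(num_slices)]
    for row in rows:
        # crc32 is a stand-in hash chosen for this sketch (assumption).
        h = zlib.crc32(str(row[key]).encode("utf-8"))
        slices[h % num_slices].append(row)
    return slices


def distribute_all(rows, num_nodes):
    """ALL: every node holds a full copy, so storage grows with node count."""
    return [list(rows) for _ in range(num_nodes)]
```

In this model, EVEN balances row counts, KEY co-locates matching values (useful when joining or filtering on that column), and ALL trades storage for locality.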

Question 45

Q

What are Redshift Sort Keys?

Answer

A

Similar to an index, makes for fast range queries

Question 46

Q

What are the three types of Redshift Sort Keys?

Answer

A

Single, Compound, Interleaved

Question 47

Q

What is the default type of Redshift Sort Key?

Question 48

Q

Does the order of Compound Sort Keys matter in Redshift?

Answer

A

Yes - first will be primary
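
To see why column order matters, here is a small sketch that mimics a compound sort with Python tuples. The sample rows and the helper name `compound_sort` are made up for illustration; the first column listed dominates the ordering, and later columns only break ties.

```python
rows = [
    {"region": "us", "day": 2},
    {"region": "eu", "day": 3},
    {"region": "us", "day": 1},
    {"region": "eu", "day": 1},
]


def compound_sort(rows, sort_key_columns):
    """Sort by the listed columns in order; the first column is primary."""
    return sorted(rows, key=lambda r: tuple(r[c] for c in sort_key_columns))


# Same data, different key order, different physical ordering.
by_region_then_day = compound_sort(rows, ["region", "day"])
by_day_then_region = compound_sort(rows, ["day", "region"])
```

With `["region", "day"]`, all rows for one region are contiguous, so a range filter on region touches one run of rows; with `["day", "region"]`, the same holds for day instead.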

Question 49

Q

What is required when performing COPY from S3 to Redshift?

Answer

A

An IAM role (or access keys) for authorization; a manifest file is optional, but lets you list exactly which files to load
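
For reference, a sketch of what a manifest file and a COPY statement that uses one with an IAM role might look like. The helper names are hypothetical, and the table name, bucket paths, and role ARN are placeholders supplied by the caller; no SQL escaping is done, so treat this as illustration only.

```python
import json


def build_manifest(s3_urls):
    """Build the JSON body of a COPY manifest listing the files to load."""
    entries = [{"url": url, "mandatory": True} for url in s3_urls]
    return json.dumps({"entries": entries}, indent=2)


def build_copy_statement(table, manifest_url, iam_role_arn):
    """Build a COPY statement that reads the file list from a manifest."""
    return (
        f"COPY {table}\n"
        f"FROM '{manifest_url}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "MANIFEST;"
    )
```

The manifest is uploaded to S3 first, and the COPY statement then points at the manifest object rather than at a data prefix.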

Question 50

Q

What is the command to copy Redshift data into S3?

Question 51

Q

How can you configure S3 to Redshift connections without going over public internet?

Answer

A

Enhanced VPC routing

Question 52

Q

Can COPY decrypt S3 data as it is loaded into Redshift?

Answer

A

Yes, using hardware accelerated SSL

Question 53

Q

If loading a tall but narrow table to Redshift, what should you attempt to do for efficiency?

Answer

A

Try to use only one COPY command (metadata is added for each COPY command)

Question 54

Q

How do you copy a Redshift snapshot to another region?

Answer

A

Create KMS Key in destination region
Specify unique name for your snapshot copy grant
Specify the KMS Key for which you’re creating the copy grant
In the source region, enable copying of snapshots using the copy grant you created
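
The steps above might be scripted roughly as follows. This sketch assumes the KMS key already exists in the destination region and that the two arguments are boto3 Redshift clients (for example `boto3.client("redshift", region_name=...)`) created by the caller; the wrapper function and its parameter names are hypothetical, and the 7-day retention default is an arbitrary choice.

```python
def enable_cross_region_snapshot_copy(
    dest_redshift,
    source_redshift,
    *,
    cluster_id,
    dest_region,
    grant_name,
    kms_key_id,
    retention_days=7,  # arbitrary default for this sketch
):
    """Create a snapshot copy grant, then enable cross-region snapshot copy."""
    # In the destination region: bind a uniquely named grant to the KMS key.
    dest_redshift.create_snapshot_copy_grant(
        SnapshotCopyGrantName=grant_name,
        KmsKeyId=kms_key_id,
    )
    # In the source region: enable snapshot copy using that grant.
    source_redshift.enable_snapshot_copy(
        ClusterIdentifier=cluster_id,
        DestinationRegion=dest_region,
        RetentionPeriod=retention_days,
        SnapshotCopyGrantName=grant_name,
    )
```

Passing the clients in keeps the region split explicit: the grant lives in the destination region, while the copy is switched on against the source cluster.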

Question 55

Q

What is Redshift DBLINK?

Answer

A

Connects Redshift to PostgreSQL (which could be on RDS)

MUST be in the same Availability Zone

Question 56

Q

Can data be imported from DynamoDB to Redshift?

Question 57

Q

What is Redshift Workload Management (WLM)?

Answer

A

Prioritizes short, fast queries vs long, slow queries

Question 58

Q

How can you configure Redshift WLM?

Answer

A

Redshift Console, CLI, or API

Question 59

Q

What is Redshift Concurrency Scaling?

Answer

A

Automatically adds cluster capacity to handle increases in concurrent read queries

Question 60

Q

How do Redshift WLM and Concurrency Scaling interact?

Answer

A

WLM queues can manage which queries are sent to concurrency scaling clusters

Question 61

Q

How many queues can be created with Redshift Automatic WLM?

Answer

A

8 (default of 5)

Question 62

Q

Is concurrency raised or lowered on large queries in Automatic WLM?

Question 63

Q

How many queues can be created with Redshift Manual WLM?

Answer

A

8 (default 1)

Question 64

Q

What is the default concurrency of the default queue in Redshift Manual WLM?

Answer 45

A

Timed out queries automatically hop to another queue and retry

Answer 46

A

Prioritizes short queries. Alternative to WLM.

Answer 47

A

CREATE TABLE AS, and SELECT statements

Answer 48

A

Machine Learning

Answer 49

A

Recovers space from deleted rows

Answer 50

A

FULL, DELETE ONLY, SORT ONLY, REINDEX (reanalyzes interleaved sort keys)
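
The question for this card is missing, but the four modes listed match the options of Redshift's VACUUM command. Under that assumption, a small sketch of building the statement for each mode (the helper name and the choice of FULL as the default argument are illustrative):

```python
VACUUM_MODES = ("FULL", "DELETE ONLY", "SORT ONLY", "REINDEX")


def build_vacuum(table, mode="FULL"):
    """Build a VACUUM statement for one of the four modes on the card."""
    if mode not in VACUUM_MODES:
        raise ValueError(f"unknown VACUUM mode: {mode!r}")
    return f"VACUUM {mode} {table};"
```

For example, `build_vacuum("sales", "REINDEX")` would target a table with an interleaved sort key.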

Answer 51

A

Quickly add or remove nodes of the same type. Low downtime. For some types, you can only double or halve the nodes.

Answer 52

A

Change node type and/or number of nodes. Can lead to hours or days of read-only.

Answer 53

A

Used to keep cluster available during a Classic resize. Minimizes downtime.

Answer 54

A

Allows you to scale compute and storage capacity independently

Answer 55

A

Unloads Redshift to S3 in Parquet format

Answer 56

A

2x faster, 6x smaller, automatically partitioned, compatible with many services (Spectrum, Athena, EMR, Sagemaker)

Answer 57

A

Atomicity, Consistency, Isolation, Durability

Answer 58

A

No, you can use standard Athena Data Catalogs

Answer 59

A

Many different sources, highly structured, single source of truth, stored for long periods of time, performant on large sizes of data

Answer 60

A

Don’t want to worry about formatting or infrastructure, quick queries for troubleshooting, ad-hoc

Answer 61

A

Need a wide variety of custom processing tasks, fine grained control over your clusters, custom code

Answer 62

A

Allows you to run SQL queries across a variety of relational, non-relational, and custom data sources. A unified way to run SQL queries across various data stores.

Answer 63

A

No (only Presto is supported)

Answer 64

A

Serializer/Deserializer; libraries that tell Hive how to interpret data formats; also used by Athena

Answer 65

A

ALTER TABLE ADD PARTITION
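
A sketch of what a full ALTER TABLE ADD PARTITION statement could look like for a table partitioned over S3 prefixes. The helper name, table name, partition columns, and S3 location are all placeholders; values are quoted as strings and no escaping is performed, so this is for illustration only.

```python
def build_add_partition(table, partition, location):
    """Build an ALTER TABLE ADD PARTITION statement.

    `partition` maps partition column names to values, in declaration order.
    """
    spec = ", ".join(f"{col} = '{val}'" for col, val in partition.items())
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({spec}) LOCATION '{location}'"
    )
```

For instance, `build_add_partition("logs", {"year": "2020", "month": "01"}, "s3://my-bucket/logs/2020/01/")` registers one new month of data without rescanning the whole table location.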