Domain 4: Analysis Flashcards

1
Q

RANDOM_CUT_FOREST

A

Kinesis Data Analytics SQL (or Flink) Function for anomaly detection in numeric columns

2
Q

Kinesis Firehose Buffer Limits

A

1 to 128 MB

60 to 900 seconds
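
The two limits work together: a buffered batch is delivered as soon as either the size threshold or the time threshold is reached, whichever comes first. A minimal Python sketch of that rule follows. The function name, parameter names, and default values are hypothetical and chosen for illustration; this is not the Firehose API.

```python
def should_flush(buffered_mb, elapsed_s, size_limit_mb=5, interval_limit_s=300):
    """Return True when a buffer would be flushed under a size-or-time rule.

    Hypothetical helper for illustration only. The allowed ranges come from
    the card above (1 to 128 MB, 60 to 900 seconds); the default values are
    arbitrary picks inside those ranges.
    """
    if not 1 <= size_limit_mb <= 128:
        raise ValueError("size limit must be between 1 and 128 MB")
    if not 60 <= interval_limit_s <= 900:
        raise ValueError("interval limit must be between 60 and 900 seconds")
    # Flush on whichever threshold is hit first.
    return buffered_mb >= size_limit_mb or elapsed_s >= interval_limit_s
```

With these defaults, `should_flush(1, 300)` is true because the interval has elapsed even though the buffer is nearly empty.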

3
Q

Kinesis Data Analytics Supported Sources

A

Kinesis Streams and Kinesis Firehose

4
Q

Kinesis Data Analytics Supported Destinations

A

Kinesis Streams, Kinesis Firehose, Lambda

5
Q

What happens if a record arrives late to a Kinesis Data Analytics application?

A

Record is written to the error stream

6
Q

In what form does Kinesis Data Analytics provision capacity?

A

Kinesis Processing Units

7
Q

How much memory is provided per KPU?

A

4GB

8
Q

What is the default number of KPU per Kinesis Data Analytics application?

A

8

9
Q

What is the name of the visualization tool in the Elastic Stack?

A

Kibana

10
Q

Is ElasticSearch Serverless?

A

No, you still have to scale servers

11
Q

What should ElasticSearch NOT be used for?

A
- OLTP (RDS or DynamoDB instead)
- Ad-Hoc Querying (Athena instead)

12
Q

How can data be imported to ElasticSearch?

A

Kinesis, DynamoDB, Logstash, Beats, ElasticSearch API

13
Q

What query engine does Athena use?

A

Presto

14
Q

What data formats does Athena support?

A

CSV, JSON, Parquet, ORC, Avro

15
Q

Is Athena serverless?

A

Yes

16
Q

Does Athena support unstructured data?

A

Yes

17
Q

Which data formats are columnar?

A

ORC and Parquet

18
Q

Which data formats are splittable?

A

ORC, Parquet, Avro

19
Q

Which notebooks can Athena integrate with?

A

Jupyter, Zeppelin, RStudio

20
Q

What is the cost rate for Athena?

A

$5 per TB scanned
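
Because the charge is based on data scanned, a rough estimate is simple arithmetic. The sketch below uses the rate from this card. The helper name is hypothetical, decimal terabytes are an assumption, and any per-query minimums or rounding the service may apply are ignored.

```python
PRICE_PER_TB_USD = 5.0   # rate taken from the card above
BYTES_PER_TB = 10 ** 12  # assumption: decimal terabytes


def athena_scan_cost(bytes_scanned):
    """Estimate query cost in USD from bytes scanned (hypothetical helper).

    Ignores any per-query minimums or rounding the service may apply.
    """
    if bytes_scanned < 0:
        raise ValueError("bytes_scanned must be non-negative")
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD
```

Under these assumptions a query that scans 500 GB costs about $2.50, so anything that reduces the bytes scanned reduces the bill.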

21
Q

Do cancelled queries count toward Athena charges?

A

Yes

22
Q

Do failed queries count toward Athena charges?

A

No

23
Q

What data format will be the most cost effective in Athena?

A

Columnar (ORC, Parquet)

24
Q

Does Athena charge for DDL processing?

A

No

25
Q

How can Athena results be encrypted?

A

Encrypt at rest in S3 using SSE-S3, SSE-KMS, CSE-KMS

26
Q

Can Athena access S3 in another account?

A

Yes

27
Q

How are Athena results encrypted in transit?

A

Transport Layer Security (TLS)

28
Q

Is Redshift Serverless or Fully Managed?

A

Fully Managed

29
Q

What is the maximum number of compute nodes in a Redshift cluster?

A

128

30
Q

What are the two types of compute nodes that can be selected for a Redshift cluster?

A
Dense Storage (DS) - uses HDDs for large size at low cost
Dense Compute (DC) - uses SSD and lots of memory for faster performance at a higher cost
31
Q

How many HDDs on a ds2.xlarge Redshift compute node?

A

3 for a total of 2TB storage

32
Q

How many HDDs on a ds2.8xlarge Redshift compute node?

A

24 for a total of 16TB storage

33
Q

How much SSD storage and RAM does a dc2.large Redshift compute node have?

A

160GB SSD storage, 15GB RAM

34
Q

How much SSD storage and RAM does a dc2.8xlarge Redshift compute node have?

A

2.6TB SSD, 244GB RAM

35
Q

What determines the number of Node Slices on a Compute Node?

A

The size of the Compute Node

36
Q

What kind of data storage does Redshift use for high performance?

A

Columnar

37
Q

Can you change the compression encoding for a column after a table is created in Redshift?

A

No

38
Q

How many copies of your data are stored within Redshift?

A

Three - one main on cluster, one backup on cluster, one snapshot in S3

39
Q

Can Redshift data be backed up to another region?

A

Yes - asynchronously in S3

40
Q

How many AZs is Redshift limited to?

A

One

41
Q

What is the default Redshift distribution style?

A

AUTO

42
Q

What is the EVEN Redshift distribution style?

A

Steps through each slice and assigns data in round-robin fashion
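
Round-robin assignment can be pictured with a few lines of Python. This is an illustrative sketch only: the function name is hypothetical and it models the idea on plain lists, not Redshift's internal storage.

```python
from itertools import cycle


def distribute_even(rows, num_slices):
    """Assign rows to slices in round-robin order (illustrative only)."""
    if num_slices < 1:
        raise ValueError("num_slices must be at least 1")
    slices = [[] for _ in range(num_slices)]
    # cycle() walks the slice indexes 0, 1, ..., n-1 and then starts over.
    for row, target in zip(rows, cycle(range(num_slices))):
        slices[target].append(row)
    return slices
```

Seven rows spread over three slices end up as groups of three, two, and two rows, regardless of what the rows contain.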

43
Q

What is the KEY Redshift distribution style?

A

Assigns data to each slice based on a selected key column. Ideal if you plan to query data on a specific column.
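
The effect of KEY distribution is that rows sharing a key value land on the same slice. A hedged Python sketch of that idea follows. The function name is hypothetical, and `zlib.crc32` is a stand-in hash chosen for the example; it is an assumption, not the hash function Redshift uses.

```python
import zlib


def distribute_by_key(rows, key, num_slices):
    """Place each row on a slice chosen by hashing its key column.

    zlib.crc32 is a stand-in hash for illustration; it is an assumption,
    not the hash function Redshift uses.
    """
    slices = [[] for _ in range(num_slices)]
    for row in rows:
        digest = zlib.crc32(str(row[key]).encode("utf-8"))
        # Equal key values always produce the same digest, hence the same slice.
        slices[digest % num_slices].append(row)
    return slices
```

A skewed key column (one value far more common than the others) would overload a single slice in this model, which is the usual caution when choosing a distribution key.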

44
Q

What is the ALL Redshift distribution style?

A

All data is replicated on every node in the cluster. Multiplies storage by the number of nodes in the cluster.
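
The storage multiplication is plain arithmetic, sketched here in Python with a hypothetical helper name.

```python
def all_distribution_storage_gb(table_size_gb, num_nodes):
    """Total storage used when a full copy of the table sits on every node."""
    if num_nodes < 1:
        raise ValueError("a cluster needs at least one node")
    return table_size_gb * num_nodes
```

A 2 GB table on an 8-node cluster would occupy 16 GB in total under this model, which is why ALL tends to suit small tables.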

45
Q

What are Redshift Sort Keys?

A

Similar to an index, makes for fast range queries

46
Q

What are the three types of Redshift Sort Keys?

A

Single, Compound, Interleaved

47
Q

What is the default type of Redshift Sort Key?

A

Compound

48
Q

Does the order of Compound Sort Keys matter in Redshift?

A

Yes - first will be primary
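
Column order in a compound sort key behaves much like sorting by a tuple: the first column decides the order, and later columns only break ties. The Python sketch below is an analogy using in-memory dictionaries, with hypothetical names and sample data; it does not reflect how Redshift stores data.

```python
def compound_sort(rows, sort_key_columns):
    """Sort rows by several columns; earlier columns take precedence."""
    return sorted(rows, key=lambda row: tuple(row[c] for c in sort_key_columns))


# Made-up sample rows for the comparison.
rows = [
    {"region": "us", "day": 2},
    {"region": "eu", "day": 3},
    {"region": "us", "day": 1},
]
by_region_then_day = compound_sort(rows, ["region", "day"])
by_day_then_region = compound_sort(rows, ["day", "region"])
```

The two results order the same rows differently, so the column queried most often is normally listed first.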

49
Q

What is required when performing COPY from S3 to Redshift?

A

Manifest File and IAM role
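
As a rough illustration of how the manifest and the IAM role appear in a load, the Python helper below assembles a COPY statement as a string. The helper name is hypothetical, and the table name, bucket, and role ARN in the usage note are placeholders.

```python
def build_copy_statement(table, manifest_s3_uri, iam_role_arn):
    """Return a Redshift COPY statement that loads the files listed in a manifest.

    All argument values are placeholders supplied by the caller. No quoting or
    escaping is done, so this is not safe for untrusted input.
    """
    return (
        f"COPY {table}\n"
        f"FROM '{manifest_s3_uri}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "MANIFEST;"
    )
```

For example, `build_copy_statement("sales", "s3://example-bucket/load/sales.manifest", "arn:aws:iam::123456789012:role/ExampleRedshiftRole")` yields a four-line statement ending in `MANIFEST;`.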

50
Q

What is the command to copy Redshift data into S3?

A

UNLOAD

51
Q

How can you configure S3 to Redshift connections without going over public internet?

A

Enhanced VPC routing

52
Q

Can COPY decrypt S3 data as it is loaded into Redshift?

A

Yes, using hardware accelerated SSL

53
Q

If loading a tall but narrow table to Redshift, what should you attempt to do for efficiency?

A

Try to use only one COPY command (metadata is added for each COPY command)

54
Q

How do you copy a Redshift snapshot to another region?

A
  1. Create KMS Key in destination region
  2. Specify unique name for your snapshot copy grant
  3. Specify the KMS Key for which you’re creating the copy grant
  4. In the source region, Enable copying of snapshots to the copy grant you created
55
Q

What is Redshift DBLINK?

A

Connects Redshift to PostgreSQL (which could be on RDS)

MUST be in the same Availability Zone

56
Q

Can data be imported from DynamoDB to Redshift?

A

Yes

57
Q

What is Redshift Workload Management (WLM)?

A

Prioritizes short, fast queries vs long, slow queries

58
Q

How can you configure Redshift WLM?

A

Redshift Console, CLI, or API

59
Q

What is Redshift Concurrency Scaling?

A

Automatically adds cluster capacity to handle increases in concurrent read queries

60
Q

How do Redshift WLM and Concurrency Scaling interact?

A

WLM queues can manage which queries are sent to concurrency scaling clusters

61
Q

How many queues can be created with Redshift Automatic WLM?

A

8 (default of 5)

62
Q

Is concurrency raised or lowered on large queries in Automatic WLM?

A

Lowered

63
Q

How many queues can be created with Redshift Manual WLM?

A

8 (default 1)

64
Q

What is the default concurrency of the default queue in Redshift Manual WLM?

A

5

65
Q

What is the maximum concurrency level in Redshift Manual WLM?

A

50

66
Q

What is query queue hopping?

A

Timed out queries automatically hop to another queue and retry

67
Q

What is Redshift Short Query Acceleration (SQA)?

A

Prioritizes short queries. Alternative to WLM.

68
Q

What statements does Redshift SQA support?

A

CREATE TABLE AS, and SELECT statements

69
Q

How does Redshift SQA predict query execution time?

A

Machine Learning

70
Q

What is the Redshift VACUUM command?

A

Recovers space from deleted rows and re-sorts the remaining rows

71
Q

What are the four types of Redshift VACUUM commands?

A

FULL, DELETE ONLY, SORT ONLY, REINDEX (reanalyzes interleaved sort keys)

72
Q

What is Elastic Resize in Redshift?

A

Quickly add or remove nodes of the same type. Low downtime. For some types, you can only double or halve the nodes.

73
Q

What is Classic Resize in Redshift?

A

Change node type and/or number of nodes. Can lead to hours or days of read-only.

74
Q

What is Redshift Snapshot, restore, resize?

A

Used to keep cluster available during a Classic resize. Minimizes downtime.

75
Q

What are Redshift RA3 nodes?

A

Allow you to scale compute and storage capacity independently

76
Q

What is Redshift Data Lake Export?

A

Unloads Redshift to S3 in Parquet format

77
Q

What are some advantages of Parquet?

A

2x faster, 6x smaller, automatically partitioned, compatible with many services (Spectrum, Athena, EMR, Sagemaker)

78
Q

What does ACID stand for?

A

Atomicity, Consistency, Isolation, Durability

79
Q

What port does Kibana run on?

A

5601

80
Q

Do you have to use Glue Data Catalogs when using Athena?

A

No, you can use standard Athena Data Catalogs

81
Q

What language does Glue’s ETL engine use?

A

Python

82
Q

Can Athena invoke SageMaker models?

A

Yes

83
Q

Is Athena’s Data Catalog Hive metastore compatible?

A

Yes

84
Q

When should you use Redshift?

A

Many different sources, highly structured, single source of truth, stored for long periods of time, performant on large sizes of data

85
Q

When should you use Athena?

A

Don’t want to worry about formatting or infrastructure, quick queries for troubleshooting, ad-hoc

86
Q

When should you use EMR?

A

Need a wide variety of custom processing tasks, fine grained control over your clusters, custom code

87
Q

What is Federated query in Athena?

A

Allows you to run SQL queries across a variety of relational, non-relational, and custom data sources. A unified way to run SQL queries across various data stores.

88
Q

Can Athena read from compressed files?

A

Yes

89
Q

Can Hive queries be run on Athena?

A

No (only Presto is supported)

90
Q

What is SerDe?

A

Serializer/Deserializer; libraries that tell Hive how to interpret data formats; also used by Athena

91
Q

What needs to be done to add data to a partitioned table in Athena?

A

ALTER TABLE ADD PARTITION
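
A short Python sketch of what such a statement can look like when assembled for one new partition. The helper name is hypothetical, and the table, partition column, and S3 location in the usage note are placeholders.

```python
def build_add_partition(table, partition, location):
    """Return an ALTER TABLE ADD PARTITION statement for one partition.

    `partition` maps partition column names to values, e.g. {"dt": "2024-01-01"}.
    Values are rendered as string literals without escaping (illustration only).
    """
    spec = ", ".join(f"{col} = '{val}'" for col, val in partition.items())
    return f"ALTER TABLE {table} ADD PARTITION ({spec}) LOCATION '{location}';"
```

Calling `build_add_partition("logs", {"dt": "2024-01-01"}, "s3://example-bucket/logs/dt=2024-01-01/")` produces a single statement that registers the new S3 prefix as a partition of the table.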

92
Q

Can Athena access an S3 bucket in another account?

A

Yes