Analytics Flashcards

1
Q

What is AWS Glue?

A

It performs discovery on the underlying schema of your data.

It also performs custom ETL jobs.

2
Q

What is stored in the Glue Data Catalog?

A

Your table definitions or schemas. All the original data is still in S3.

3
Q

What does Glue really allow you to do?

A

Query your unstructured data in S3 like it is structured data.

4
Q

What is Hive?

A

It runs on EMR and lets you run SQL-like queries (HiveQL).

5
Q

Can a Hive metastore be used in Glue?

A

Yes

6
Q

Can a Glue Data Catalog be used in Hive?

A

Yes

7
Q

What does enabling Job Metrics do in AWS Glue?

A

It helps you understand the maximum DPU that you need for your Glue job.

8
Q

Where can you plot the Glue Job Metrics maximum needed executors versus maximum allocated?

A

In the Glue console. You do not need CloudWatch for this.

9
Q

What is a dynamic frame in AWS Glue?

A

A collection of DynamicRecords. Each record is self-describing, so no schema is required up front.

10
Q

How can I remove outliers in my data in AWS Glue ETL?

A

Use the Filter transformation.
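
The idea behind filtering out outliers can be sketched in plain Python. This is only an illustration on a list of dicts, not Glue code; the sample records, the MAX_PRICE threshold, and the filter_records helper are all hypothetical. In a Glue job, a predicate like this one would typically be passed to the Filter transform instead.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]


def filter_records(records: Iterable[Record], keep: Callable[[Record], bool]) -> List[Record]:
    """Keep only the records for which the predicate returns True."""
    return [r for r in records if keep(r)]


# Hypothetical sample data and threshold, chosen only for illustration.
orders = [
    {"order_id": 1, "price": 19.99},
    {"order_id": 2, "price": 25.00},
    {"order_id": 3, "price": 999999.0},  # an obvious outlier
]
MAX_PRICE = 10_000.0

# Drop any record whose price falls outside the plausible range.
cleaned = filter_records(orders, lambda r: 0 < r["price"] <= MAX_PRICE)
```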

11
Q

Can you join data in AWS Glue ETL?

A

Yes

12
Q

How do you find matches or duplicates in your data in AWS Glue when there is no common unique identifier?

A

Use the FindMatchesML transformation.

13
Q

Can you convert formats in Glue?

A

Yes

14
Q

What does ResolveChoice do in AWS Glue?

A

It resolves ambiguities in a DynamicFrame, e.g., a price column whose values appear with more than one type.

15
Q

What options does ResolveChoice in AWS Glue have?

A

make_cols: Creates a new column for each type

cast: Casts all values to a specific type

make_struct: Creates a structure that contains each data type

project: Projects every type to a given type
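
To make the first two options concrete, here is a plain-Python sketch of what make_cols and cast do to a mixed-type column. It only mimics the behavior on a list of dicts and is not the Glue API: the helper names are hypothetical, and the new column names use Python type names (price_float, price_str) rather than Glue type names. In a Glue job, the equivalent would be a call such as resolveChoice(specs=[("price", "cast:double")]) on the DynamicFrame.

```python
def make_cols(records, column):
    """Split one mixed-type column into one column per observed type."""
    out = []
    for rec in records:
        new = {k: v for k, v in rec.items() if k != column}
        if column in rec:
            # Suffix the column name with the value's type, e.g. price_float.
            new[f"{column}_{type(rec[column]).__name__}"] = rec[column]
        out.append(new)
    return out


def cast(records, column, to_type):
    """Convert every value in the column to a single type."""
    return [
        {**rec, column: to_type(rec[column])} if column in rec else dict(rec)
        for rec in records
    ]


# Hypothetical rows where price arrives as both a float and a string.
rows = [{"id": 1, "price": 9.99}, {"id": 2, "price": "12.50"}]

split_rows = make_cols(rows, "price")
cast_rows = cast(rows, "price", float)
```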

16
Q

How do you modify your Glue Data Catalog when you added a new partition to your data?

A

Set the enableUpdateCatalog and partitionKeys options in your ETL script.
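
As a rough sketch, the two options can be collected in a dict that is handed to the write call. The catalog_update_options helper, the partition key names, and the database and table names in the comment are hypothetical placeholders.

```python
def catalog_update_options(partition_keys):
    """Build the options dict for a write that also updates the Data Catalog."""
    return {
        "enableUpdateCatalog": True,
        "partitionKeys": list(partition_keys),
    }


# Hypothetical partition layout, chosen only for illustration.
options = catalog_update_options(["year", "month", "day"])

# Inside a Glue job this dict would be passed as additional_options, e.g.:
# glueContext.write_dynamic_frame_from_catalog(
#     frame=frame, database="my_db", table_name="my_table",
#     additional_options=options)
```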

17
Q

How do you modify your Glue Data Catalog when you added a new Schema or table to your data?

A

Set the enableUpdateCatalog and updateBehavior options.

18
Q

How are you billed in AWS Glue?

A

By the second.

19
Q

How are you billed for development endpoints in AWS Glue?

A

By the minute.

20
Q

When you want to use Hive or Pig, what ETL engine should you use as a matter of best practice?

A

EMR

21
Q

Can Glue ingest streaming data?

A

Yes, from Kinesis or Kafka.

22
Q

What is Glue Data Quality?

A

Rules that define your data quality. If a threshold is exceeded, the job can stop or a CloudWatch alarm can be triggered.

23
Q

What language does Glue Data Quality support?

A

DQDL (Data Quality Definition Language)

24
Q

What are recipes in Glue DataBrew?

A

They are transformations that can be saved and applied to other jobs.

25
Q

Can you create a dataset from Redshift or Snowflake in Glue DataBrew?

A

Yes

26
Q

What is AWS Lake Formation?

A

AWS Managed Data Lake

27
Q

Do Glue and Lake Formation have overlapping functionality?

A

Yes. Anything that can be done in Glue can also be done in Lake Formation.

28
Q

What AWS services can talk to Lake Formation?

A

Athena, Redshift, EMR

29
Q

If I want to add someone from another account to my data lake, how would I do this?

A

The recipient must be set up as a data lake administrator.

30
Q

What is required to access encrypted data catalogs in Lake Formation?

A

IAM permissions

31
Q

What are governed tables in Lake Formation?

A

They support ACID Transactions with your data lake.

32
Q

Can Lake Formation support Streaming data?

A

Yes, using governed tables. It can accept streams from Kinesis.

33
Q

Does Lake Formation have row and cell level security?

A

Yes

34
Q

Can Lake Formation support SAML?

A

Yes

35
Q

What are policy tags in Lake Formation?

A

They are attached to databases, tables, or columns and can be used for security. For example, only admins can see the users table.

36
Q

What is a Data Filter in Lake Formation?

A

It provides column-, row-, or cell-level security.

37
Q

When are Data Filters applied in Lake Formation?

A

When granting SELECT permission on a table.

38
Q

What is Athena?

A

A SQL interface for your data in Amazon S3.

39
Q

When you need to add partitions after the fact in Athena, what command needs to be run?

A

MSCK REPAIR TABLE
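
A minimal sketch of building that statement in Python before submitting it to Athena. The msck_repair_statement helper and the table name are hypothetical, and the name check is a deliberately strict assumption to keep arbitrary text out of the statement. Note that the command only discovers Hive-style partitions (key=value folder names).

```python
import re


def msck_repair_statement(table: str) -> str:
    """Return the Athena statement that loads newly added partitions for a table."""
    # Allow only simple identifiers so arbitrary text cannot be injected.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", table):
        raise ValueError(f"unexpected table name: {table!r}")
    return f"MSCK REPAIR TABLE {table}"


# Hypothetical table name, chosen only for illustration.
statement = msck_repair_statement("access_logs")
```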

40
Q

How do you optimize your Athena table after using ACID transactions?

A

Run OPTIMIZE table REWRITE DATA USING BIN_PACK to compact the table's data files.

41
Q

How are Athena fine grained controls managed?

A

AWS IAM

42
Q

Does Spark support streaming?

A

Yes. Kinesis, Kafka, EMR

43
Q

Can you convert data into another format using Athena?

A

Yes, using CREATE TABLE AS SELECT (CTAS). It can write Parquet or ORC, and it can compress with GZIP or Snappy.

44
Q

Can Spark Streaming support Kinesis?

A

Yes

45
Q

Can Spark stream into Redshift?

A

Yes

46
Q

Can Athena use Spark?

A

Yes. You can run a notebook within the Athena console and select Spark.

47
Q

What are Athena Federated Queries?

A

They allow you to query sources other than S3, such as:

CloudWatch
DynamoDB
OpenSearch
RDS

48
Q

Do Athena Federated Queries support Views?

A

Yes. They are stored in Glue.

49
Q

Do Athena Federated Queries support cross-account data sources?

A

Yes.

50
Q

What is EMR?

A

A managed Hadoop framework on EC2

51
Q

What is a benefit of EMR over Glue?

A

More granular server access

52
Q

What are the nodes in an EMR cluster?

A

Master Node

Core Node

Task Node

53
Q

What is the EMR Master Node?

A

Manages the cluster

54
Q

What is an EMR Core Node?

A

Hosts HDFS data and runs tasks.

55
Q

What is a Task Node?

A

Only runs tasks, but does not store data.

56
Q

Can you start your EMR clusters as part of a data pipeline?

A

Yes, using AWS Data Pipeline.

57
Q

What is the block size of HDFS?

58
Q

What is EMR Managed Scaling?

A

Scales all your instances regardless of type.

59
Q

How does EMR Managed Scaling scale?

A

It adds core nodes first and then task nodes to the maximum specified.

60
Q

How does EMR Managed Scaling scale down?

A

It removes task nodes first and then core nodes.

61
Q

In EMR Managed scaling, are On-Demand or Spot Instances scaled down first?

62
Q

Can EMR run as serverless?

63
Q

How large can a Kinesis data record be?

64
Q

What is the throughput from the Kinesis Producer to the Stream?

A

1 MB/s or 1,000 records per second, per shard.
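
These per-shard limits translate directly into a shard count for provisioned mode. A small sketch of the arithmetic, assuming the two limits above are the only constraints; the shards_needed helper and the workload numbers are hypothetical.

```python
import math

MB_PER_SEC_PER_SHARD = 1.0        # write limit per shard, by data volume
RECORDS_PER_SEC_PER_SHARD = 1000  # write limit per shard, by record count


def shards_needed(mb_per_sec: float, records_per_sec: int) -> int:
    """Return the smallest shard count that satisfies both write limits."""
    by_bytes = math.ceil(mb_per_sec / MB_PER_SEC_PER_SHARD)
    by_count = math.ceil(records_per_sec / RECORDS_PER_SEC_PER_SHARD)
    # A stream always has at least one shard.
    return max(1, by_bytes, by_count)


# Hypothetical workloads: the first is bound by volume, the second by record count.
volume_bound = shards_needed(3.5, 2000)
count_bound = shards_needed(0.5, 4500)
```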

65
Q

What is the max retention in Kinesis Data Streams?

66
Q

Does Kinesis allow replay?

67
Q

What are the capacity modes for Kinesis Data Streams?

A

Provisioned

On-Demand

68
Q

Do you have to manage capacity for Kinesis using On-Demand mode?

69
Q

What is the default capacity for Kinesis on-demand mode?

A

4 MB/s or 4,000 records per second.

70
Q

What is the best use case for the Kinesis Producer SDK?

A

Low throughput, higher latency.

71
Q

If I want to use an asynchronous call to put data into Kinesis which producer would I use?

72
Q

What is batching in Kinesis Producer Library?

A

Aggregation: records are buffered and sent when a threshold is met. This lets you exceed the 1,000 records per second per shard limit.

73
Q

How do you adjust the buffer time in Kinesis Data Streams?

A

Set RecordMaxBufferedTime in the Kinesis Producer Library (KPL) configuration.

74
Q

If an application cannot tolerate latency, what Kinesis producer should be used?

A

The SDK. Batching would be problematic because it adds latency.

75
Q

Can Spark be a consumer of Kinesis?

76
Q

How many Kinesis GetRecords API calls can be made per shard per second?

77
Q

How much data can the Kinesis GetRecords API return?

A

Up to 10 MB of data.

78
Q

How do you handle Checkpointing in the KCL?

79
Q

What does it mean when you get an ExpiredIteratorException in the Kinesis Client Library?

A

The DynamoDB table used for checkpointing was throttled. It needs more WCU.

80
Q

Can Lambda perform light ETL for Kinesis Data Streams?

81
Q

What is the latency for Kinesis enhanced fan-out consumers?

82
Q

What is the latency for Kinesis standard consumers?

83
Q

Can Kinesis resharding be done in parallel?

A

No. It takes a few seconds per shard.

84
Q

How can duplicates end up in the Kinesis Producer?

A

Network timeouts

85
Q

How can duplicates end up in the Kinesis Consumers?

A

Resharding events

Starting the application

Worker instances are added or removed

The application is deployed

86
Q

What is the max record size for Kinesis Data Firehose?

87
Q

How do you perform data transformation in Kinesis Data Streams?

88
Q

What are the main destinations for Kinesis Data Firehose?

A

S3

Redshift (via an S3 COPY; there is no direct integration)

OpenSearch

Custom destinations that use HTTP endpoints

89
Q

Can you archive data coming into Kinesis Data Firehose?

A

Yes. This can be stored to S3. All or just failed records.

90
Q

Does Kinesis Data Firehose automatically scale?

91
Q

Does Kinesis Data Firehose perform data conversions?

A

Yes, to Parquet.

92
Q

Does Kinesis Data Firehose support compression?

93
Q

Can Spark or the KCL read from Kinesis Data Firehose?

A

No. Not possible

94
Q

How is the buffer in Kinesis Data Firehose configured?

A

By time or size.
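
The "time or size" rule can be illustrated with a small buffer that flushes as soon as either threshold is reached. This is a simplified sketch, not Firehose's actual implementation: the SizeOrTimeBuffer class is hypothetical, timestamps are passed in explicitly, and the time check only runs when a record arrives (a real service would also flush on a timer).

```python
class SizeOrTimeBuffer:
    """Collect records and flush when a size or age threshold is reached."""

    def __init__(self, max_bytes: int, max_seconds: float):
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self._records = []
        self._size = 0
        self._opened_at = None  # time the first record of the batch arrived

    def add(self, record: bytes, now: float):
        """Add a record; return the flushed batch if a threshold was hit, else None."""
        if self._opened_at is None:
            self._opened_at = now
        self._records.append(record)
        self._size += len(record)
        too_big = self._size >= self.max_bytes
        too_old = now - self._opened_at >= self.max_seconds
        return self.flush() if too_big or too_old else None

    def flush(self):
        """Return the current batch and start a new, empty one."""
        batch = self._records
        self._records, self._size, self._opened_at = [], 0, None
        return batch


# Hypothetical thresholds: flush at 10 bytes or after 60 seconds.
buf = SizeOrTimeBuffer(max_bytes=10, max_seconds=60)
```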

95
Q

What are the default buffer settings in Kinesis Data Firehose?

A

32 MB or 2 minutes

96
Q

How do you send data in real-time to OpenSearch?

A

Kinesis Data Streams

97
Q

What is Managed Service for Apache Flink?

A

Managed instance of Apache Flink to process data streams.

98
Q

What are common sources for Apache Flink in AWS?

A

Kinesis

Kafka

99
Q

Can Flink send data to Kinesis?

A

Yes, using Data Streams or Firehose.

100
Q

What is Managed Service for Kafka Connect?

A

A plugin service for other services. Works with Redshift, S3, OpenSearch, etc.

101
Q

Can MSK be run as serverless?

102
Q

What is OpenSearch?

A

Petabyte-scale analysis and reporting. Fundamentally, it is a search engine.

103
Q

Does OpenSearch support visualizations?

A

Yes, but QuickSight is more robust.

104
Q

In OpenSearch, Indexes are split into ____________

105
Q

How many shards does an OpenSearch index have?

A

2 primary

2 replica

106
Q

In OpenSearch, can you scale up or down without downtime?

107
Q

Do Master Nodes in OpenSearch hold or process data?

A

No. They only manage the cluster

108
Q

How do you perform backups in OpenSearch?

A

Snapshot to S3

109
Q

What is UltraWarm storage in OpenSearch?

A

Uses S3 caching.

Good for log Data

Best for indices with few writes.

110
Q

In OpenSearch, What is Index State Management?

A

Deletes old indexes after a period of time

Moves indexes from one storage type to another

Automates snapshots

Reduces replica count

Performs index rollups

111
Q

Can you have cross-cluster replication in OpenSearch?

112
Q

What is the leader index in OpenSearch?

A

The master copy of your index. The follower is the index following it for replication.

113
Q

How many OpenSearch master nodes should you have?

114
Q

Can OpenSearch run as serverless?

115
Q

What are some popular QuickSight data sources?

A

Redshift

Aurora

Athena

116
Q

What is SPICE in QuickSight?

A

A parallel, in-memory calculation engine that accelerates queries on large datasets.

117
Q

How much SPICE does each user get?

118
Q

In QuickSight, what edition will I need for column-level security?

A

Enterprise

119
Q

Does QuickSight only access data from within the same region?

120
Q

How does QuickSight get around the same-region data access limitation?

A

Create a security group with an inbound rule that allows the IP range of the QuickSight servers.

121
Q

If I want to use an elastic network interface to put QuickSight in the same VPC as Redshift, what edition of QuickSight will I need?

A

Enterprise

122
Q

Can Active Directory be used with QuickSight?

A

Yes, but only with Enterprise

123
Q

Do you get encryption at rest with QuickSight Standard?

A

No, only Enterprise.