Data Warehousing Flashcards

1
Q

Is Redshift good for ELT?

A

Yes

2
Q

Can a Lambda function be triggered by AWS IoT?

A

Yes

3
Q

Can a Lambda function be triggered by Kinesis?

A

Yes

4
Q

Can Apache Spark notebooks run on EMR?

A

Yes

5
Q

Can Apache Spark read from S3?

A

Yes

6
Q

Can Apache Zeppelin be used to visualize data in Amazon Redshift?

A

Yes

7
Q

Is Redshift a columnar database?

A

Yes

8
Q

Is Redshift MPP?

A

Yes

9
Q

Is Redshift ANSI SQL Compliant?

A

Yes

10
Q

In addition to data compression and columnar storage, how is I/O reduced in Redshift?

A

Zone maps. Each 1 MB block has a zone map: in-memory metadata that tracks the minimum and maximum values within the block. If the table is sorted on a column (for example, a date column), Redshift can use the zone maps to skip blocks whose range cannot contain the requested values. Unlike conventional databases, Amazon Redshift does not use indexes.
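
The block-skipping idea can be sketched in a few lines of Python. This is only an illustration of min/max pruning, not Redshift's implementation; the `Block` class, the `scan_equal` helper, and the three-value blocks are made up for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """One storage block; its zone map is the min and max of its values."""
    values: List[int]

    @property
    def zone_min(self) -> int:
        return min(self.values)

    @property
    def zone_max(self) -> int:
        return max(self.values)


def scan_equal(blocks: List[Block], target: int):
    """Return (matching values, number of blocks actually read)."""
    hits, blocks_read = [], 0
    for block in blocks:
        # Zone-map check: skip the block if target cannot be inside it.
        if target < block.zone_min or target > block.zone_max:
            continue
        blocks_read += 1
        hits.extend(v for v in block.values if v == target)
    return hits, blocks_read


# Sorted column: each block covers a narrow, non-overlapping range.
sorted_blocks = [Block([1, 2, 3]), Block([4, 5, 6]), Block([7, 8, 9])]
# Unsorted column: every block spans almost the whole range.
unsorted_blocks = [Block([1, 5, 9]), Block([2, 6, 7]), Block([3, 4, 8])]

print(scan_equal(sorted_blocks, 5))    # reads 1 of 3 blocks
print(scan_equal(unsorted_blocks, 5))  # reads all 3 blocks
```

With sorted data the min/max ranges do not overlap, so most blocks are skipped; with unsorted data every range contains the target and nothing is skipped.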

11
Q

Can Redshift Clusters be managed via API?

A

Yes

12
Q

Does Redshift support ODBC and JDBC?

A

Yes

13
Q

Describe the Redshift architecture.

A

One leader node, which communicates with multiple compute nodes that store the data.

14
Q

Does Redshift encrypt data at rest?

A

Yes, with AES-256

15
Q

Does Amazon Redshift take care of key management?

A

Yes

16
Q

Anti-Patterns for Redshift

A

Small datasets, OLTP, Unstructured data, BLOB data

17
Q

What are the 2 API operations used to send data to Kinesis Firehose?

A

PutRecord and PutRecordBatch
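
A hedged sketch of how a producer might use PutRecordBatch through boto3's `put_record_batch` call. The 500-records-per-call cap is taken from the Firehose documentation as I understand it and should be checked against current limits. `RecordingClient` is a hypothetical stand-in so the example runs without AWS credentials; in real use you would pass `boto3.client("firehose")`. The stream name is made up, and failed records (reported via `FailedPutCount`) are not retried here.

```python
from typing import Iterator, List


def chunk(records: List[bytes], size: int = 500) -> Iterator[List[bytes]]:
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def send_all(client, stream_name: str, records: List[bytes]) -> int:
    """Send records with PutRecordBatch; return the number of API calls made."""
    calls = 0
    for batch in chunk(records):
        client.put_record_batch(
            DeliveryStreamName=stream_name,
            Records=[{"Data": data} for data in batch],
        )
        calls += 1
    return calls


class RecordingClient:
    """Hypothetical stand-in for a Firehose client; records batch sizes only."""

    def __init__(self):
        self.batch_sizes = []

    def put_record_batch(self, DeliveryStreamName, Records):
        self.batch_sizes.append(len(Records))
        return {"FailedPutCount": 0, "RequestResponses": []}


fake = RecordingClient()
calls = send_all(fake, "example-stream", [b"row\n"] * 1200)
print(calls, fake.batch_sizes)
```

PutRecord sends one record per call; batching as above reduces the number of round trips when the producer has many records ready at once.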

18
Q

What is the max size for a Firehose PutRecord?

A

1,000 KB

19
Q

Kinesis Agent

A

The Kinesis Agent is a stand-alone Java application that sends data to Kinesis Streams and Kinesis Firehose. It can be installed on Linux servers.

20
Q

Can the Kinesis Agent monitor multiple files and write to multiple streams?

A

Yes

21
Q

What is the max buffer size for Kinesis Firehose?

A

3 MB

22
Q

Can Kinesis Firehose invoke a Lambda function?

A

Yes

23
Q

Why should a record separator be added to Kinesis Stream data?

A

Kinesis stream bundles records together. If you don’t add a record separator, you can’t split the records later.
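
A small Python illustration of the problem, assuming JSON records and a newline as the separator (both are choices made for this example; the event contents are invented).

```python
import json

events = [{"id": 1, "msg": "start"}, {"id": 2, "msg": "stop"}]

# Without a separator the delivered data is one unbroken run of JSON text.
joined = "".join(json.dumps(e) for e in events)

# With a trailing newline per record, each line is one parseable record.
delimited = "".join(json.dumps(e) + "\n" for e in events)
recovered = [json.loads(line) for line in delimited.splitlines()]


def parses_as_one_document(text: str) -> bool:
    """True if the whole text is a single valid JSON document."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(parses_as_one_document(joined))  # the concatenation is not valid JSON
print(recovered == events)             # the delimited form splits cleanly
```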

24
Q

What are the Firehose buffer sizes for S3?

A

1 MB to 128 MB

25
Q

What are the Firehose buffer intervals for S3?

A

60 to 900 seconds
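
A rough simulation of how a size limit and a time interval interact, with delivery triggered by whichever is reached first. This is a simplified model written for this deck, not Firehose's actual scheduling logic. The 5 MB and 300 second settings are illustrative values inside the ranges quoted on these cards, and `simulate_flushes` is a made-up helper.

```python
from typing import List, Tuple


def simulate_flushes(arrivals: List[Tuple[float, int]],
                     size_limit_mb: int = 5,
                     interval_s: int = 300) -> List[Tuple[float, int, str]]:
    """arrivals: (time in seconds, size in MB), sorted by time.

    Returns one (flush time, MB delivered, reason) tuple per flush.
    Data still buffered after the last arrival is not reported.
    """
    flushes = []
    buffered, window_start = 0, 0.0
    for t, size in arrivals:
        # Time-based flushes that fall due before this record arrives.
        while t - window_start >= interval_s:
            window_start += interval_s
            if buffered:
                flushes.append((window_start, buffered, "interval"))
                buffered = 0
        buffered += size
        # Size-based flush: the buffer is full, so deliver immediately.
        if buffered >= size_limit_mb:
            flushes.append((t, buffered, "size"))
            buffered, window_start = 0, t
    return flushes


burst_then_trickle = [(10, 2), (20, 2), (30, 2), (400, 1), (700, 1)]
print(simulate_flushes(burst_then_trickle))
```

In this run the early burst fills the buffer and flushes on size, while the later trickle is delivered only when the interval expires.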

26
Q

Can Kinesis Firehose dynamically raise the buffer size?

A

Yes

27
Q

What does the Redshift COPY command do?

A

Loads data from DynamoDB or S3 into an existing Redshift table
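
For reference, a minimal sketch of what a COPY statement for JSON files in S3 can look like, built as a string in Python. The table name, bucket, and IAM role ARN are placeholders, and the helper does no SQL escaping, so it is only suitable for trusted, hard-coded values.

```python
def build_copy_sql(table: str, s3_uri: str, iam_role_arn: str) -> str:
    """Return a COPY statement that loads JSON files from S3 into `table`."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_uri}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS JSON 'auto';"
    )


# All three arguments below are hypothetical example values.
sql = build_copy_sql(
    "events",
    "s3://example-bucket/events/",
    "arn:aws:iam::123456789012:role/ExampleRedshiftRole",
)
print(sql)
```

The target table must already exist; COPY appends rows to it rather than creating it.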

28
Q

Before you send a record to Kinesis Firehose, what do you need to do?

A

Flatten the record into a single JSON object and make sure it is UTF-8 encoded
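
One way to do this in Python, as a sketch. Joining nested keys with underscores is a convention chosen for this example, and `flatten` and `to_firehose_payload` are hypothetical helper names.

```python
import json
from typing import Any, Dict


def flatten(obj: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Collapse nested dicts into one level, joining keys with underscores."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def to_firehose_payload(record: Dict[str, Any]) -> bytes:
    """Serialize the flattened record as one line of UTF-8 encoded JSON."""
    line = json.dumps(flatten(record), ensure_ascii=False)
    return line.encode("utf-8")


payload = to_firehose_payload({"user": {"id": 7, "name": "Zoë"}, "ok": True})
print(payload)
```

`json.dumps` emits no line breaks by default, so the result is a single-line object; `ensure_ascii=False` keeps non-ASCII characters as real UTF-8 bytes instead of `\u` escapes.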

29
Q

What is the Firehose buffer size range for Elasticsearch?

A

1 MB to 100 MB

30
Q

What is the Firehose buffer interval for Elasticsearch?

A

60 to 900 seconds

31
Q

Describe Kinesis Analytics

A

A SQL-based query service that can aggregate data in a stream and send the output to a Kinesis stream or a Lambda function

32
Q

What is the maximum time a Lambda Function can run?

A

15 minutes (900 seconds). The limit was 5 minutes until October 2018.

33
Q

How do Kinesis Stream and Kinesis Firehose differ?

A

Kinesis Streams. The more customizable option, Streams is best suited for developers building custom applications or streaming data for specialized needs. That customizability, however, comes at the cost of manual scaling and provisioning. Data is typically available in a stream for 24 hours, but for an additional cost, users can extend availability to up to seven days.
Kinesis Firehose. The simpler approach, Firehose loads data streams directly into AWS products for processing. Scaling is handled automatically, up to gigabytes per second, and Firehose supports batching, encrypting, and compressing. Firehose can deliver to S3, Elasticsearch Service, or Redshift, where data can be copied for processing through additional services.

34
Q

What are some destinations for Kinesis Analytics?

A

Firehose, Streams, S3, Redshift, Elasticsearch

35
Q

Can data be enriched via Kinesis Stream?

A

Yes, but the reference data must be stored in S3, and an in-application reference table is then created from it

36
Q

What is a common use case for Kinesis Stream?

A

Read streaming data, analyze and aggregate it, and deliver the results to EMR or Redshift

37
Q

Why would one use the KPL and KCL?

A

The KPL (Kinesis Producer Library) and KCL (Kinesis Client Library) are the Kinesis libraries that take care of load balancing, multi-threading, aggregation and de-aggregation, retries, scaling, and other functionality not in the Kinesis API. They sit between the producer and consumer programs and the streams.

38
Q

How else is data placed into a Kinesis Stream?

A

Via the API or via an agent that is installed on each client. The agent monitors for file changes (e.g. log files)

39
Q

What are the two modes of operation for the KPL?

A

Synchronous and asynchronous

40
Q

Which mode is preferred practice?

A

Asynchronous

41
Q

If you had to reduce end-to-end latency would you use KPL, Kinesis Agent, or the Kinesis API?

A

The Kinesis API, because the KPL and the agent buffer records before sending them

42
Q

What languages does Lambda support?

A

AWS Lambda supports code written in Node.js (JavaScript), Python, Java (Java 8 compatible), C# (.NET Core), and Go. Your code can include existing libraries, even native ones.

43
Q

What are the 3 ways to provision your I/O in a Kinesis stream?

A

It can be provisioned in 1 MB increments via the API, the console, or the SDK

44
Q

What can you tell me about data in a Kinesis stream?

A

It is stored for 24 hours by default, and replicated across 3 AZs.

45
Q

Ideal Patterns for Kinesis Stream?

A

Real-time data analytics, log and data intake and processing, Real-time metrics and reporting

46
Q

Is a Kinesis stream made up of shards?

A

Yes

47
Q

How many read transactions per second does each shard give you?

A

5

48
Q

How many MB can 5 read transactions give you?

A

2 MB per second in total

49
Q

How many writes per second can a shard support?

A

1,000 records

50
Q

A shard can support how much per second?

A

1 MB data written per second

51
Q

What determines the data capacity of your stream?

A

The number of shards
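
A back-of-the-envelope sizing sketch, assuming the per-shard limits quoted in this deck: 1 MB or 1,000 records written per second, and 2 MB read per second. `shards_needed` is a hypothetical helper and the workload numbers are invented.

```python
import math


def shards_needed(write_mb_s: float, write_records_s: float, read_mb_s: float) -> int:
    """Smallest shard count that satisfies all three per-shard limits."""
    by_write_bytes = math.ceil(write_mb_s / 1.0)        # 1 MB/s written per shard
    by_write_count = math.ceil(write_records_s / 1000)  # 1,000 records/s per shard
    by_read_bytes = math.ceil(read_mb_s / 2.0)          # 2 MB/s read per shard
    return max(1, by_write_bytes, by_write_count, by_read_bytes)


print(shards_needed(write_mb_s=3.5, write_records_s=2500, read_mb_s=10.5))
```

Here the read side is the bottleneck: 10.5 MB/s of reads needs 6 shards, even though writes alone would fit in 4.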

52
Q

Each shard can capture how many MB per second?

A

1 MB

53
Q

Each shard can be read at how many MB per second?

A

2 MB

54
Q

In case of failure, where can you store the cursor for Kinesis?

A

DynamoDB

55
Q

What is the Kinesis Storm Spout?

A

The Amazon Kinesis Storm Spout helps developers use Amazon Kinesis with Storm, an open source, distributed real-time computation system. This version of the Amazon Kinesis Storm Spout fetches data from the Amazon Kinesis stream and emits it as tuples that Storm topologies can process. Developers can add the Spout to their existing Storm topologies, and leverage Amazon Kinesis as a reliable, scalable, stream capture, storage, and replay service that powers their Storm processing applications.

56
Q

Name two anti-patterns for Kinesis?

A

Long term storage and small scale consistent throughput

57
Q

Name 5 ideal patterns for lambda?

A

real-time processing, real-time file processing, cron, AWS events, ETL

58
Q

In what two modes can Lambda functions be invoked?

A

Synchronously and Asynchronously

59
Q

What happens when a synchronously called Lambda function fails?

A

It throws an exception

60
Q

What happens when an asynchronously invoked Lambda function fails?

A

It is retried automatically, for up to 3 attempts in total.

61
Q

How many Lambda functions can run concurrently per account?

A

1,000 per Region by default (older material quotes 100); the limit can be raised on request

62
Q

What are the 3 anti-patterns for Lambda?

A

Long running apps. Dynamic websites. Stateful apps.

63
Q

Ideal usage patterns for EMR?

A

log processing, ETL, Big Data, data mining

64
Q

Is EMR fault-tolerant for core node failure?

A

Yes

65
Q

Does EMR provision for failed slave nodes?

A

No

66
Q

Amazon EMR with MapR distribution has what advantage?

A

A no-NameNode architecture that can tolerate failure

67
Q

Does EMR integrate with S3 and DynamoDB?

A

Yes

68
Q

What is Spark?

A

An open-source, in-memory analytics engine

69
Q

What is Impala?

A

A SQL query engine for Hadoop

70
Q

What is HBase?

A

An open-source distributed database running on top of Hadoop

71
Q

What is S3DistCp?

A

Apache DistCp is an open-source tool you can use to copy large amounts of data. S3DistCp is an extension of DistCp that is optimized to work with AWS, particularly Amazon S3. During a copy operation, S3DistCp stages a temporary copy of the output in HDFS on the cluster.

72
Q

What is EMRFS?

A

An implementation of HDFS on S3. You can enable client-side and server-side encryption. Metadata is stored in DynamoDB.

73
Q

Name 2 anti-patterns for EMR?

A

small data sets and ACID transactions

74
Q

Name 2 anti-patterns for ML?

A

Very large datasets and unsupported learning tasks

75
Q

What is DynamoDB Streams?

A

DynamoDB Streams captures a time-ordered sequence of item-level modifications in any DynamoDB table, and stores this information in a log for up to 24 hours. … A DynamoDB stream is an ordered flow of information about changes to items in an Amazon DynamoDB table.

76
Q

What is the limit of data storage for DynamoDB?

A

None

77
Q

What are the anti-patterns for DynamoDB?

A

Joins, ad-hoc queries, BLOBs, and large data with a low I/O rate

78
Q

What service would you use for OLAP/BI?

A

Redshift, because it has columnar storage. It is scalable and works with BI tools.

79
Q

Where do Redshift clusters reside?

A

Within an AZ

80
Q

Can Redshift clusters reside across multiple AZs?

A

If you set it up for replication manually, yes.

81
Q

Name 4 Redshift anti-patterns.

A

ACID, BLOB, Unstructured and small datasets

82
Q

What types of searches are done with Elasticsearch?

A

Text, structured data, analytics

83
Q

Is Elasticsearch self-healing?

A

Yes. Failed nodes are detected and replaced automatically.

84
Q

What does ES integrate with?

A

Logstash (log pipeline) and Kibana (analytics and visualization)

85
Q

What is Elasticsearch suited for?

A

Log analysis and streaming data

86
Q

Elasticsearch Anti-Patterns

A

OLTP and Petabyte Storage

87
Q

QuickSight

A

Cloud-powered BI service for visualization and ad-hoc queries

88
Q

AWS Shield

A

Managed DDoS protection

89
Q

What is Cost Explorer?

A

A service that gives you insight into where your money is being spent.

90
Q

Spark Streaming

A

Extends the Spark API to process streaming data. It can be installed on EMR.

91
Q

SparkSQL

A

Extends the Spark API to allow SQL queries alongside complex calculations