Big Data Flashcards

Question 1

Q

Redshift

Answer

A

a fully managed petabyte-scale data warehouse

- a very large relational database

Question 2

Q

Redshift - size

Answer

A

up to 16 PB of data per cluster. (You don’t have to split up large data sets)

Question 3

Q

Redshift - relational

Answer

A

use your standard SQL and BI tools to interact with it
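
To make "standard SQL" concrete, here is a minimal sketch of the kind of reporting query a BI tool might send. It runs against Python's built-in sqlite3 purely as a local stand-in (it is not Redshift), and the table `sales` and its columns are hypothetical.

```python
import sqlite3

# Local stand-in only: sqlite3 is NOT Redshift. The point is that the query
# itself is ordinary SQL of the kind BI tools generate.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")  # hypothetical table
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100.0), ("west", 250.0), ("east", 50.0)],
)

# A typical reporting query: total sales per region, largest first.
rows = conn.execute(
    "SELECT region, SUM(amount) AS total FROM sales "
    "GROUP BY region ORDER BY total DESC"
).fetchall()
print(rows)
```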

Question 4

Q

Redshift use cases

Answer

A

BI applications

- not a replacement for a standard RDS database (it is built for analytics, not transactional workloads)

Question 5

Q

Limitations of Redshift

Answer

A

not highly available by default

- a cluster runs in a single AZ unless Multi-AZ (offered for RA3 node types) is enabled

Question 6

Q

EMR

Answer

A

Elastic MapReduce

ETL (Extract, Transform, Load)
an AWS-managed big data platform that lets you process vast amounts of data using open-source tools such as Spark, Hive, HBase, Flink, Hudi, and Presto
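
The "MapReduce" in the name refers to a map, shuffle, reduce processing pattern. The sketch below shows that pattern as a single-process word count; it is only an illustration. On EMR, frameworks such as Spark or Hive would spread this work across the cluster, and the function names here are hypothetical.

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in one input line.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all emitted values by their key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: collapse each key's values into a single count.
    return {key: sum(values) for key, values in grouped.items()}

lines = ["big data big clusters", "data moves"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts)
```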

Question 7

Q

EMR exam tips

Answer

A

open-source cluster
a managed fleet of EC2 instances running open-source tools
EC2 rules apply - use Spot Instances and Reserved Instances (RIs) to reduce your costs
it processes and moves data

Question 8

Q

Kinesis

Answer

A

a big highway to transport stuff

- allows you to ingest, process and analyze real-time streaming data

Question 9

Q

Kinesis Data Streams

Answer

A

real time streaming for ingesting data
you’re responsible for creating the consumer & scaling the stream
older than Firehose
a lot of overhead to configure
does not automatically scale in provisioned mode (you add or remove shards yourself; the newer on-demand mode scales for you)
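
Because scaling a provisioned stream is your job, it helps to estimate how many shards a workload needs. The sketch below is a rough sizing helper. The per-shard limits (1 MB/s or 1,000 records/s in, 2 MB/s out) are the commonly documented figures and should be treated as assumptions to check against current AWS quotas; `shards_needed` is a hypothetical helper, not an AWS API.

```python
import math

# Commonly documented per-shard limits for provisioned streams.
# Assumptions: verify against current AWS quotas before relying on them.
WRITE_MB_PER_SHARD = 1.0
WRITE_RECORDS_PER_SHARD = 1000
READ_MB_PER_SHARD = 2.0

def shards_needed(write_mb_s, records_s, read_mb_s):
    # The stream needs enough shards to satisfy the tightest of the three limits.
    return max(
        1,
        math.ceil(write_mb_s / WRITE_MB_PER_SHARD),
        math.ceil(records_s / WRITE_RECORDS_PER_SHARD),
        math.ceil(read_mb_s / READ_MB_PER_SHARD),
    )

print(shards_needed(write_mb_s=4.5, records_s=3000, read_mb_s=9.0))
```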

Question 10

Q

Kinesis Data Firehose

Answer

A

Data transfer tool to get info to S3, Redshift, Elasticsearch, Splunk
Speed: within 60 seconds (near real time)
plug and play with AWS architecture
automatically scales
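
"Near real time" comes from buffering: records are collected and delivered in batches rather than one by one. The toy model below sketches that idea, flushing a batch when either a size limit or a time limit is reached. It is not the Firehose API; the class, its parameters, and the limits are hypothetical, and unlike a real service it only checks the clock when a record arrives.

```python
class BufferedDelivery:
    """Toy model of buffer-then-deliver; not the Firehose API."""

    def __init__(self, max_bytes, max_seconds):
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.buffer = []
        self.buffered_bytes = 0
        self.window_start = None
        self.delivered = []  # each entry is one batch "written" to the destination

    def put(self, record, now):
        # Start the interval timer when the first record of a batch arrives.
        if self.window_start is None:
            self.window_start = now
        self.buffer.append(record)
        self.buffered_bytes += len(record)
        # Flush on whichever limit is hit first: size or elapsed time.
        if (self.buffered_bytes >= self.max_bytes
                or now - self.window_start >= self.max_seconds):
            self.flush()

    def flush(self):
        if self.buffer:
            self.delivered.append(self.buffer)
        self.buffer = []
        self.buffered_bytes = 0
        self.window_start = None

stream = BufferedDelivery(max_bytes=10, max_seconds=60)
stream.put(b"aaaa", now=0)      # 4 bytes buffered
stream.put(b"bbbbbbb", now=5)   # 11 bytes >= 10, flushed by size
stream.put(b"cc", now=10)       # new batch starts at t=10
stream.put(b"dd", now=75)       # 65 s elapsed >= 60, flushed by time
print(stream.delivered)
```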

Question 11

Q

Kinesis Data Analytics

Answer

A

paired with Kinesis Data Firehose or Kinesis Data Streams
lets you analyze data using SQL
easy, simple
no servers (fully managed)
pay per use
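
Analyzing a stream with SQL usually means aggregating over time windows, roughly a GROUP BY on a key plus a window. The sketch below mimics a tumbling (fixed, non-overlapping) window count in plain Python to show the idea; it is not Kinesis Data Analytics syntax, and the event shape and function name are hypothetical.

```python
from collections import Counter

def tumbling_counts(events, window_seconds):
    # events: iterable of (timestamp_seconds, key) pairs.
    windows = {}
    for ts, key in events:
        # Every event belongs to exactly one fixed-size window.
        window_start = (ts // window_seconds) * window_seconds
        windows.setdefault(window_start, Counter())[key] += 1
    return {start: dict(counts) for start, counts in sorted(windows.items())}

events = [(1, "click"), (12, "view"), (59, "click"), (61, "click"), (119, "view")]
result = tumbling_counts(events, window_seconds=60)
print(result)
```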

Question 12

Q

How long can Kinesis store data?

Answer

A

up to one year (365 days) in Kinesis Data Streams; the default retention is 24 hours

Question 13

Q

When to use SQS over Kinesis?

Answer

A

slightly delayed message delivery is acceptable
not much configuration needed
simple to use

Question 14

Q

When to use Kinesis over SQS?

Answer

A

real time message delivery
complicated to configure
mostly used for big data applications

Question 15

Q

What is the easiest way to process streaming data going through Kinesis using SQL?

Answer

A

Kinesis Data Analytics

Question 16

Q

Amazon Athena

Answer

A

makes it easy to analyze data in S3 using SQL
Serverless SQL
fully managed

Question 17

Q

Glue

Answer

A

a serverless data integration service that makes it easy to discover, prepare, and combine data
- ETL
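
As a reminder of what ETL means in practice, here is a minimal extract, transform, load sketch in plain Python. It is not Glue code; the sample data, field names, and cleaning rules are hypothetical and only illustrate the three stages.

```python
import csv
import io
import json

# Hypothetical raw input: one row has stray whitespace, one is missing a date.
raw_csv = "id,name,signup\n1, Ada ,2021-03-01\n2,Grace,\n"

def extract(text):
    # Extract: parse the raw CSV text into dictionaries.
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    # Transform: drop incomplete rows, trim names, convert ids to integers.
    cleaned = []
    for row in rows:
        if not row["signup"]:
            continue
        cleaned.append(
            {"id": int(row["id"]), "name": row["name"].strip(), "signup": row["signup"]}
        )
    return cleaned

def load(rows):
    # Load: serialize as JSON lines (a stand-in for writing to a target store).
    return "\n".join(json.dumps(row) for row in rows)

output = load(transform(extract(raw_csv)))
print(output)
```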

Question 18

Q

Amazon QuickSight

Answer

A

visualizing data using dashboards

- fully managed BI data visualization service

Question 19

Q

Elasticsearch

Answer

A

a fully managed version of open-source Elasticsearch
allows you to quickly search over stored data and analyze the data you get back
primarily used in the ELK (Elasticsearch, Logstash, Kibana) stack

Question 20

Q

Elasticsearch exam tip

Answer

A

If an exam scenario asks for a third-party logging solution, Elasticsearch can be part of the answer

Question 21

Q

When exam wants to store GBs of data

Answer

A

RDS or Aurora

Question 22

Q

When exam wants to store PBs/TBs of data

Answer

A

Redshift, S3