Big Data Flashcards
Redshift
- a fully managed petabyte-scale data warehouse
- a very large relational database
Redshift - size
- up to 16 PB of data per cluster. (You don’t have to split up large data sets)
Redshift - relational
- use your standard SQL and BI tools to interact with it
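As a rough illustration of what "standard SQL" means here, the sketch below runs a BI-style aggregate query. It uses Python's built-in sqlite3 module as a local stand-in (an assumption made so the example runs anywhere), and the `sales` table and its columns are hypothetical. Against Redshift, a query of this shape would be sent over a regular SQL connection instead.

```python
import sqlite3

# In-memory database used only as a stand-in for a warehouse connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("east", 100.0), ("west", 250.0), ("east", 50.0)],
)

# A typical BI-style aggregate written in plain SQL.
QUERY = """
SELECT region, SUM(amount) AS total
FROM sales
GROUP BY region
ORDER BY region
"""
rows = conn.execute(QUERY).fetchall()
print(rows)
```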
Redshift use cases
- BI applications
- not a replacement for standard RDS
Limitations of Redshift
- not highly available by default
- a cluster runs in a single AZ unless Multi-AZ is enabled (RA3 node types only)
EMR
Elastic MapReduce
- ETL (Extract Transform Load)
- an AWS managed big data platform that allows you to process vast amounts of data using open source tools such as Spark, Hive, HBase, Flink, Hudi and Presto
EMR exam tips
- open-source cluster
- a managed fleet of EC2 instances running open source tools
- EC2 rules apply - use spot instances and RIs to reduce your costs
- it processes and moves data
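The cost tip can be made concrete with a little arithmetic. This is a minimal sketch: the hourly rates are hypothetical numbers chosen only for illustration, not real AWS prices, and `cluster_cost` is a helper invented for this example.

```python
def cluster_cost(node_count: int, hours: float, hourly_rate: float) -> float:
    """Total EC2 cost for a fleet billed at a flat hourly rate."""
    return node_count * hours * hourly_rate

# Hypothetical rates in USD per instance-hour (assumptions, not real prices).
ON_DEMAND_RATE = 0.20
SPOT_RATE = 0.06

# A 10-node cluster running a 5-hour job.
on_demand = cluster_cost(10, 5, ON_DEMAND_RATE)
spot = cluster_cost(10, 5, SPOT_RATE)
savings_pct = round((on_demand - spot) / on_demand * 100)
print(f"on-demand ${on_demand:.2f}, spot ${spot:.2f}, savings {savings_pct}%")
```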
Kinesis
- a big highway to transport stuff
- allows you to ingest, process and analyze real-time streaming data
Kinesis Data Streams
- real time streaming for ingesting data
- you’re responsible for creating the consumer & scaling the stream
- older than Firehose
- a lot of overhead to configure
- does not automatically scale in provisioned mode (you manage the shard count)
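Scaling a stream yourself mostly means choosing a shard count. The sketch below estimates that count from the per-shard write limits of a provisioned stream (1 MB/s or 1,000 records/s). The function name `shards_needed` is hypothetical, and treating 1 MB as 1,000 KB is a simplifying assumption.

```python
import math

# Per-shard write limits for a provisioned stream.
MAX_MB_PER_SEC = 1.0
MAX_RECORDS_PER_SEC = 1000

def shards_needed(records_per_sec: float, avg_record_kb: float) -> int:
    """Estimate the shard count needed to absorb a given write load."""
    # Assumption: 1 MB is treated as 1000 KB to keep the arithmetic simple.
    mb_per_sec = records_per_sec * avg_record_kb / 1000
    by_throughput = math.ceil(mb_per_sec / MAX_MB_PER_SEC)
    by_record_count = math.ceil(records_per_sec / MAX_RECORDS_PER_SEC)
    # Whichever limit is hit first determines the shard count.
    return max(1, by_throughput, by_record_count)

print(shards_needed(5000, 2))    # large records: throughput is the bottleneck
print(shards_needed(3000, 0.1))  # tiny records: record count is the bottleneck
```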
Kinesis Data Firehose
- Data transfer tool to get info to S3, Redshift, Elasticsearch, Splunk
- Speed: within 60 seconds (near real time)
- plug & play with AWS architecture
- automatically scales
Kinesis Data Analytics
- paired with Data Firehose or Data Streams
- lets you analyze data using SQL
- easy, simple
- no servers (fully managed)
- pay per use
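A common kind of streaming analysis is counting events per fixed time window. The sketch below shows that idea in plain Python rather than SQL, purely to illustrate the logic. The function `tumbling_counts` and the sample events are hypothetical, and this is not how the service itself is implemented.

```python
from collections import Counter

def tumbling_counts(events, window_seconds):
    """Count events per (window_start, key) for fixed, non-overlapping windows."""
    counts = Counter()
    for timestamp, key in events:
        # Round the timestamp down to the start of its window.
        window_start = int(timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)

# Hypothetical (timestamp_in_seconds, event_type) pairs.
events = [(0, "click"), (12, "click"), (59, "view"), (61, "click")]
result = tumbling_counts(events, 60)
print(result)
```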
How long can Kinesis store data?
up to one year (365 days); the default is 24 hours
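Retention on a data stream is configured in hours, so a small conversion helps. This is a minimal sketch: `retention_hours` is a hypothetical helper, and the resulting value is the kind of number one would pass as `RetentionPeriodHours` to the `IncreaseStreamRetentionPeriod` API.

```python
DEFAULT_RETENTION_HOURS = 24       # default retention for a data stream
MAX_RETENTION_HOURS = 365 * 24     # one year, expressed in hours

def retention_hours(days: int) -> int:
    """Convert a retention period in days to hours, checking the allowed range."""
    hours = days * 24
    if not DEFAULT_RETENTION_HOURS <= hours <= MAX_RETENTION_HOURS:
        raise ValueError(f"retention must be between 1 and 365 days, got {days}")
    return hours

print(retention_hours(7))
print(retention_hours(365))
```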
When to use SQS over Kinesis?
- slightly delayed message delivery
- not much configuration needed
- simple to use
When to use Kinesis over SQS?
- real time message delivery
- complicated to configure
- mostly used for big data applications
What is the easiest way to process streaming data going through Kinesis using SQL?
Kinesis Data Analytics
Amazon Athena
- makes it easy to analyze data in S3 using SQL
- Serverless SQL
- fully-managed
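As a hedged sketch of what "SQL over S3" looks like in practice, the code below builds the parameters for Athena's `StartQueryExecution` call. The helper `athena_query_request`, the table `access_logs`, the database `weblogs`, and the bucket name are all hypothetical. The block only builds the request, so it runs without AWS access.

```python
def athena_query_request(sql: str, database: str, output_location: str) -> dict:
    """Build keyword arguments for Athena's StartQueryExecution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results to an S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }

request = athena_query_request(
    "SELECT status, COUNT(*) AS hits FROM access_logs GROUP BY status",
    database="weblogs",
    output_location="s3://example-bucket/athena-results/",
)
print(request)

# With boto3 installed and AWS credentials configured, the request could be sent as:
#   boto3.client("athena").start_query_execution(**request)
```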
Glue
a serverless data integration service that makes it easy to discover, prepare, and combine data
- ETL
Amazon QuickSight
- visualizing data using dashboards
- fully managed BI data visualization service
Elasticsearch
- a fully managed version of open source Elasticsearch
- allows you to quickly search over stored data and analyze the data you get back
- primarily used in the ELK (Elasticsearch, Logstash, Kibana) stack
Elasticsearch exam tip
If an exam scenario calls for a third-party logging solution, you can use Elasticsearch as part of the solution
When the exam wants to store GBs of data
RDS or Aurora
When the exam wants to store TBs/PBs of data
Redshift, S3