Big Data & Serverless Flashcards
What does ETL stand for?
Extract, Transform, Load
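As a rough illustration of the three stages, here is a minimal in-memory sketch in Python. The CSV text, the field names, and the plain list standing in for a warehouse are all invented for this example; a real pipeline would read from and write to actual data stores.

```python
import csv
import io

# Hypothetical raw input; in practice this would come from a file, API, or database.
RAW_CSV = "name,amount\nalice, 10\nbob, 32\n"

def extract(text):
    """Extract: parse raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: tidy the names and convert the amounts to integers."""
    return [
        {"name": row["name"].strip().title(), "amount": int(row["amount"])}
        for row in rows
    ]

def load(rows, target):
    """Load: append the cleaned rows to the target store (a plain list here)."""
    target.extend(rows)
    return len(rows)

warehouse = []
loaded = load(transform(extract(RAW_CSV)), warehouse)
```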
Big data ETL tool that can run open-source software (such as Spark, HBase, etc.) natively on AWS
Amazon EMR
What Amazon services can you run an EMR cluster on?
EC2
EKS
Outposts
(EMR processes the data and writes the results to an S3 bucket)
Open source tools that can be run on EMR
Spark
HBase
Hadoop
Presto
How to reduce the overall cost of EMR clusters running on EC2?
Use Spot Instances or Reserved Instances (RIs)
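To make the saving concrete, the arithmetic can be sketched as below. The hourly rate and the Spot discount are made-up placeholders, not real AWS prices, and the EMR service charge is left out. The split between On-Demand core nodes and Spot task nodes is one possible layout, assumed here because Spot capacity can be interrupted.

```python
def cluster_hourly_cost(on_demand_rate, core_nodes, task_nodes, spot_discount=0.0):
    """Hourly EC2 cost with core nodes On-Demand and task nodes on Spot.

    on_demand_rate: hypothetical On-Demand price per instance-hour
    spot_discount:  hypothetical fraction saved on Spot (0.0 means no saving)
    """
    core_cost = core_nodes * on_demand_rate
    task_cost = task_nodes * on_demand_rate * (1.0 - spot_discount)
    return core_cost + task_cost

# Illustrative numbers only, not real AWS prices.
all_on_demand = cluster_hourly_cost(1.0, core_nodes=2, task_nodes=8)
with_spot = cluster_hourly_cost(1.0, core_nodes=2, task_nodes=8, spot_discount=0.5)
```

With these placeholder inputs, moving the eight task nodes to Spot takes the hourly figure from 10.0 to 6.0.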
Real-time streaming service for AWS
Kinesis
Kinesis that provides real-time speed where you have to manage producers and consumers (you must scale shards)
Kinesis Data Streams
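Because you scale shards yourself, it helps to see how shard count and partition keys interact. The sketch below assumes the commonly documented per-shard write limits for provisioned streams (about 1 MB/s or 1,000 records/s) and assumes the 128-bit hash key space is split evenly across shards; check the current AWS quotas before relying on the numbers. The function names are hypothetical.

```python
import hashlib
import math

# Assumed per-shard write limits; verify against the current AWS quotas.
SHARD_WRITE_MB_PER_SEC = 1.0
SHARD_WRITE_RECORDS_PER_SEC = 1000

def shards_needed(write_mb_per_sec, records_per_sec):
    """Smallest shard count that satisfies both write limits."""
    by_bytes = math.ceil(write_mb_per_sec / SHARD_WRITE_MB_PER_SEC)
    by_records = math.ceil(records_per_sec / SHARD_WRITE_RECORDS_PER_SEC)
    return max(1, by_bytes, by_records)

def shard_for_key(partition_key, shard_count):
    """Index of the shard a partition key lands on.

    Kinesis hashes the partition key with MD5 into a 128-bit integer;
    this sketch assumes the hash key ranges are divided evenly.
    """
    hash_key = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    range_size = 2**128 // shard_count
    # Clamp so the few keys past the last even boundary fall in the final shard.
    return min(hash_key // range_size, shard_count - 1)
```

For example, `shards_needed(3.5, 2000)` is bound by bytes and gives 4, while `shards_needed(0.2, 4500)` is bound by record count and gives 5.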
Kinesis that provides near-real-time speed where you don’t have to worry as much about scaling, because AWS manages it
Kinesis Data Firehose
Destinations available for Kinesis Data Firehose
Elasticsearch (now Amazon OpenSearch Service)
S3
Redshift
Analyze Kinesis data using standard SQL
Kinesis Data Analytics
An application requires real-time message delivery. Which service should you use?
Kinesis
TRUE or FALSE, Kinesis Data Analytics is serverless
TRUE
Interactive query service that makes it easy to analyze data in S3 using SQL
Athena
Serverless data integration service to perform ETL without having to manage servers
AWS Glue
How to use Athena with Glue?
Set up an S3 bucket with data
Set up a Glue crawler to scan the data in the bucket
The crawler stores the inferred table metadata in the Glue Data Catalog
Amazon Athena can run queries against the tables defined in the Data Catalog
Use Amazon QuickSight to visualize the results in a dashboard
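The crawler and query steps above can be sketched with boto3 as follows. The bucket, role ARN, database, and table names are placeholders invented for the example. The two builder functions only assemble request parameters, so they run without AWS access; `run_pipeline` needs boto3 and valid credentials, and skips the wait for the crawler to finish that a real script would need.

```python
def crawler_params(name, role_arn, database, s3_path):
    """Keyword arguments for the Glue CreateCrawler call."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }

def athena_query_params(sql, database, output_location):
    """Keyword arguments for the Athena StartQueryExecution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }

def run_pipeline(crawler, query):
    """Create and start the crawler, then submit the query to Athena."""
    import boto3  # imported lazily so the builders above work without it

    glue = boto3.client("glue")
    athena = boto3.client("athena")
    glue.create_crawler(**crawler)
    glue.start_crawler(Name=crawler["Name"])
    # A real script would poll until the crawler finishes before querying.
    return athena.start_query_execution(**query)["QueryExecutionId"]

# Placeholder values, not real resources.
crawler = crawler_params(
    "orders-crawler",
    "arn:aws:iam::123456789012:role/ExampleGlueRole",
    "sales_db",
    "s3://example-bucket/raw/",
)
query = athena_query_params(
    "SELECT COUNT(*) FROM orders",
    "sales_db",
    "s3://example-bucket/athena-results/",
)
```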
TRUE or FALSE, Athena is serverless
TRUE
Fully managed data visualization service for BI similar to Tableau
Amazon QuickSight
Managed ETL service for automating the movement and transformation of your data. Lets you create data-driven workflows and enforces the logic you define.
AWS Data Pipeline
How to configure notifications (for example, on failure) in AWS Data Pipeline?
Via Amazon SNS
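One way this wiring might look in a pipeline definition is sketched below: an `SnsAlarm` object that an activity references through `onFail`. This is a fragment showing only the notification part. The ids, topic ARN, and message text are placeholders, and fields a working activity needs (input, output, schedule, compute resource) are deliberately left out.

```python
import json

# Placeholder ARN and ids, not real resources.
failure_alarm = {
    "id": "FailureAlarm",
    "type": "SnsAlarm",
    "topicArn": "arn:aws:sns:us-east-1:123456789012:example-topic",
    "subject": "Pipeline activity failed",
    "message": "The copy activity did not complete.",
    "role": "DataPipelineDefaultRole",
}

copy_activity = {
    "id": "CopyData",
    "type": "CopyActivity",
    # Reference the alarm by id so a failure publishes to the SNS topic.
    "onFail": {"ref": failure_alarm["id"]},
}

definition_json = json.dumps({"objects": [failure_alarm, copy_activity]}, indent=2)
```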
AWS storage and compute services that AWS Data Pipeline can be integrated with
Storage:
DynamoDB
RDS
Redshift
S3
Compute:
EC2
EMR
TRUE or FALSE, with AWS Data Pipeline you cannot use Reserved Instances (RIs)
FALSE, you can use previously existing instances