Big Data Flashcards
What is Redshift?
A fully managed, petabyte-scale data warehouse service in the cloud.
How much information can Redshift hold?
16 petabytes
Is Redshift relational?
Yes
What is a typical use case for Redshift?
Business Intelligence
Is Redshift a better RDS?
No. Redshift is not meant to replace RDS; it is built for analytics (OLAP), not transactional (OLTP) workloads.
What is EMR?
A managed big data platform that allows you to process vast amounts of data (AWS's ETL tool)
What is Kinesis?
Allows you to ingest, process, and analyze real-time streaming data (think of it as a huge data highway)
What is Kinesis Data Streams for?
Real-time streaming for ingesting data
What is Kinesis Data Firehose for?
A data transfer tool to get information to S3, Redshift, Elasticsearch, or Splunk
What is the downside to Kinesis Data Streams?
A lot of work to set up (you must specify shards and data consumers)
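A rough sketch of the shard sizing you do by hand with Kinesis Data Streams. It assumes the commonly cited per-shard write limits of 1 MB/s and 1,000 records/s, with 1 MB taken as 10^6 bytes. The function name shards_needed is hypothetical, not part of any AWS SDK.

```python
import math

# Assumed per-shard write limits for a provisioned stream.
MAX_WRITE_BYTES_PER_SEC = 1_000_000   # 1 MB/s, taken here as 10^6 bytes
MAX_WRITE_RECORDS_PER_SEC = 1_000     # 1,000 records/s


def shards_needed(records_per_sec, avg_record_bytes):
    """Estimate how many shards the expected write load requires."""
    # Shards required to stay under the records-per-second limit.
    by_count = math.ceil(records_per_sec / MAX_WRITE_RECORDS_PER_SEC)
    # Shards required to stay under the bytes-per-second limit.
    by_bytes = math.ceil(records_per_sec * avg_record_bytes / MAX_WRITE_BYTES_PER_SEC)
    # A stream always has at least one shard.
    return max(1, by_count, by_bytes)


if __name__ == "__main__":
    print(shards_needed(5000, 100))   # many small records: record count dominates
    print(shards_needed(500, 4000))   # fewer large records: byte volume dominates
```

Whichever limit is hit first decides the shard count, so both are computed and the larger one wins.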
What can Kinesis Data Firehose be thought of as?
A simpler data stream
What is Kinesis Data Analytics?
Allows us to analyze data in the pipeline using standard SQL
When would you choose Kinesis over SQS for messages?
If messages need real-time delivery
Does Kinesis Data Streams or Kinesis Data Firehose automatically scale?
Kinesis Data Firehose (with Data Streams you manage the shards yourself)
What is AWS Athena?
An interactive query service that makes it easy to analyze data in S3 using SQL. This allows you to query data directly in S3 without loading it into a database.
What is AWS Glue?
A serverless service that allows you to perform ETL workloads without managing underlying servers
If you ever need serverless SQL, what should you think of?
Athena
What is QuickSight?
A fully managed business intelligence data visualization service
What is AWS Data Pipeline?
A managed ETL service for automating the movement and transformation of your data
What is a pipeline definition with regard to AWS Data Pipeline?
Where you specify the business logic of your data management
How do you create dependencies between tasks and activities?
Data-driven workflows
What service can you use with AWS Data Pipeline to alert you of any failures?
Amazon SNS
Does AWS Data Pipeline have automatic retries for data-driven workflows?
Yes
What does Amazon MSK stand for?
Amazon Managed Streaming for Apache Kafka
What is Amazon MSK?
A fully managed service for running data streaming applications that leverage Apache Kafka
Does Amazon MSK have automatic detection and recovery?
Yes
What is Amazon MSK Serverless?
A cluster type within Amazon MSK offering serverless cluster management with automatic provisioning and scaling
What is MSK Connect?
Allows developers to easily stream data to and from Apache Kafka clusters
What is Amazon OpenSearch Service?
A managed service allowing you to run search and analytics engines for various use cases
What is the successor to Amazon Elasticsearch Service?
Amazon OpenSearch Service
What is typically the best tool for visualizing log file analytics or BI reports?
Amazon OpenSearch Service
What type of database is Redshift?
Relational
Why can’t you do multi-AZ deployments with Redshift?
You can; Redshift supports Multi-AZ deployments
What service offers real-time streaming of data?
Kinesis Data Streams