Section 22: Data and Analytics Flashcards
A serverless, interactive analytics service built on open-source frameworks, supporting open-table and file formats
Amazon Athena
Amazon Athena is commonly paired with _____ in order to create reports and dashboards
Amazon QuickSight
This service is the best tool available when you need to analyze data in S3 using serverless SQL
Amazon Athena
What are four ways you can enhance the performance of Amazon Athena?
Use columnar data formats (such as Apache Parquet or ORC) to save cost by scanning less data
Compress data for smaller retrievals
Partition your datasets in S3
Use fewer, larger files to reduce per-file overhead
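As a small illustration of the partitioning tip above, the sketch below builds a Hive-style partitioned S3 key (year=/month=/day=), a layout Athena can use to skip partitions that a query does not need. The prefix `logs`, the file name, and the helper `partitioned_key` are hypothetical names chosen for this example only.

```python
from datetime import date


def partitioned_key(prefix: str, day: date, filename: str) -> str:
    """Build a Hive-style partitioned S3 object key for the given day."""
    return (
        f"{prefix}/year={day.year:04d}/month={day.month:02d}/"
        f"day={day.day:02d}/{filename}"
    )


# Hypothetical prefix and file name, for illustration only.
key = partitioned_key("logs", date(2024, 1, 15), "part-0000.parquet")
print(key)
```

A query that filters on year, month, and day would then only need to read objects under the matching prefixes.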
If you have data in sources other than Amazon S3, you can use this Athena feature to query the data in place or build pipelines that extract data from multiple data sources and store them in Amazon S3
Amazon Athena Federated Query
How much do Athena queries cost to run?
$5.00 per TB of data scanned
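The arithmetic behind this price can be sketched as follows. This is a rough estimate only: it assumes the $5.00 per TB figure from the card, treats 1 TB as 10**12 bytes (an assumption made for this example), and ignores any regional price differences or per-query minimums. The function name `athena_query_cost` is hypothetical.

```python
PRICE_PER_TB_USD = 5.00   # figure taken from the flashcard above
BYTES_PER_TB = 10**12     # assumption: decimal terabyte


def athena_query_cost(bytes_scanned: int) -> float:
    """Estimate the cost in USD of a query that scans the given bytes."""
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD


# Scanning 3 TB of raw data versus 300 GB after converting to a
# compressed columnar format (sizes are made up for illustration).
full_scan = athena_query_cost(3 * 10**12)
reduced_scan = athena_query_cost(3 * 10**11)
print(f"full scan: ${full_scan:.2f}, reduced scan: ${reduced_scan:.2f}")
```

Because the price depends on data scanned, columnar formats, compression, and partitioning lower cost as well as latency.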
This service is a fully managed, petabyte-scale data warehouse service in the cloud
Amazon Redshift
What types of nodes comprise a Redshift Cluster?
Leader Node
Compute Nodes
What are three ways you can insert data into Redshift?
Kinesis Data Firehose
COPY command from Amazon S3
Insert in batches from EC2 instance using JDBC driver
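For the COPY option, a minimal sketch of what such a statement might look like is shown below. The table name, S3 URI, and IAM role ARN are hypothetical placeholders, the helper `build_copy_statement` is not part of any AWS library, and the example assumes the source files are Parquet. It does no input escaping, so it is for illustration only.

```python
def build_copy_statement(table: str, s3_uri: str, iam_role_arn: str) -> str:
    """Return a Redshift COPY statement that loads Parquet files from S3."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_uri}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"FORMAT AS PARQUET;"
    )


# All identifiers below are placeholders, not real resources.
statement = build_copy_statement(
    table="sales",
    s3_uri="s3://example-bucket/sales/",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftRole",
)
print(statement)
```

The resulting statement would be run through any SQL client connected to the cluster.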
Feature that allows you to efficiently query and retrieve structured and semi-structured data from files in Amazon S3 without having to load the data into Amazon Redshift tables
Redshift Spectrum
An open source, distributed search and analytics suite derived from Elasticsearch that makes it easy for you to perform interactive log analytics, real-time application monitoring, website search, and more
Amazon OpenSearch Service
AWS service that is a cloud big data solution for petabyte-scale data processing, interactive analytics, and machine learning using open-source frameworks such as Apache Spark, Apache Hive, and Presto
Amazon EMR (Elastic MapReduce)
EMR node type that coordinates and manages the health of all your other nodes
Master Node
EMR node type that runs tasks and stores data
Core Node
EMR node type that only runs tasks - typically it is a good practice to use Spot instances for these nodes
Task Node
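The three node types can be pictured as instance groups in a cluster definition. The sketch below builds a list in the shape that boto3's EMR `run_job_flow` call accepts under `Instances["InstanceGroups"]`, with the task group on Spot. The instance types, counts, and the helper name `emr_instance_groups` are assumptions for illustration; nothing here calls AWS.

```python
def emr_instance_groups(core_count: int, task_count: int) -> list:
    """Describe master, core, and task groups; only task nodes use Spot."""
    return [
        {
            "Name": "Master",
            "InstanceRole": "MASTER",   # coordinates the cluster
            "Market": "ON_DEMAND",
            "InstanceType": "m5.xlarge",
            "InstanceCount": 1,
        },
        {
            "Name": "Core",
            "InstanceRole": "CORE",     # runs tasks and stores data
            "Market": "ON_DEMAND",
            "InstanceType": "m5.xlarge",
            "InstanceCount": core_count,
        },
        {
            "Name": "Task",
            "InstanceRole": "TASK",     # runs tasks only, holds no data
            "Market": "SPOT",
            "InstanceType": "m5.xlarge",
            "InstanceCount": task_count,
        },
    ]


groups = emr_instance_groups(core_count=2, task_count=4)
for group in groups:
    print(group["InstanceRole"], group["Market"], group["InstanceCount"])
```

Task nodes hold no data, so losing a Spot instance there costs only the in-flight work.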
A cloud-native, serverless, business intelligence service with native ML integrations and usage-based pricing, used to create interactive dashboards
Amazon QuickSight
Engine that performs in-memory computations if you import data directly into QuickSight
SPICE Engine
A serverless data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning (ML), and application development
AWS Glue
AWS Glue feature that prevents the re-processing of old data
Glue Job Bookmarks
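Job bookmarks are controlled through the `--job-bookmark-option` job argument. The sketch below builds the kind of default-arguments dictionary a Glue job definition might carry; the helper name `bookmark_arguments` is hypothetical, and the snippet only constructs a dictionary without calling AWS.

```python
VALID_OPTIONS = {
    "job-bookmark-enable",   # track processed data between runs
    "job-bookmark-disable",  # always process the full dataset
    "job-bookmark-pause",    # process new data without updating state
}


def bookmark_arguments(option: str = "job-bookmark-enable") -> dict:
    """Return Glue job arguments that set the bookmark behaviour."""
    if option not in VALID_OPTIONS:
        raise ValueError(f"unknown bookmark option: {option}")
    return {"--job-bookmark-option": option}


print(bookmark_arguments())
```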
AWS Glue feature that allows you to combine and replicate data across multiple data stores using SQL
Glue Elastic Views
AWS Glue feature that allows you to clean and normalize data using pre-built transformations
AWS Glue DataBrew
AWS Glue feature that provides you with a GUI to create, run, and monitor ETL jobs
Glue Studio
AWS Glue feature that allows you to run streaming ETL jobs that can be integrated with Kinesis Data Streams, Apache Kafka, Amazon MSK, etc.
Glue Streaming ETL
AWS service that easily creates secure data lakes, making data available for wide-ranging analytics
AWS Lake Formation
This service enables you to quickly author and run powerful SQL code against streaming sources to perform time series analytics, feed real-time dashboards, and create real-time metrics
Kinesis Data Analytics for SQL Applications
How long can you retain data in Amazon MSK?
For as long as you want, provided you pay for the underlying EBS storage
A fully managed service for Apache Kafka that makes it easier for developers to build and run highly available, secure, and scalable applications based on Apache Kafka
Amazon Managed Streaming for Apache Kafka (Amazon MSK)
You would like to have a database that is efficient at performing analytical queries on large sets of columnar data. You would like to connect to this Data Warehouse using a reporting and dashboard tool such as Amazon QuickSight. Which AWS technology do you recommend?
Amazon Redshift
You have a lot of log files stored in an S3 bucket. You want to run a quick analysis on them, serverless if possible, to filter the logs and find users who attempted an unauthorized action. Which AWS service allows you to do so?
Amazon Athena
As a Solutions Architect, you have been instructed to prepare a disaster recovery plan for a Redshift cluster. What should you do?
Enable Automated Snapshots, then configure the Redshift cluster to automatically copy snapshots to another AWS region
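The cross-region copy step can be sketched with boto3's Redshift client, which provides an `enable_snapshot_copy` operation. In the sketch below the client is passed in, so the function itself needs no AWS access to be defined. The cluster identifier, regions, retention period, and the wrapper name `enable_cross_region_copy` are assumptions for illustration.

```python
def enable_cross_region_copy(client, cluster_id: str, dest_region: str,
                             retention_days: int = 7):
    """Ask Redshift to copy the cluster's snapshots to another region.

    `client` is expected to be a boto3 Redshift client, for example
    boto3.client("redshift", region_name="us-east-1").
    """
    return client.enable_snapshot_copy(
        ClusterIdentifier=cluster_id,
        DestinationRegion=dest_region,
        RetentionPeriod=retention_days,  # days to keep copied snapshots
    )


# Example usage (requires boto3 and AWS credentials; names are placeholders):
#   import boto3
#   redshift = boto3.client("redshift", region_name="us-east-1")
#   enable_cross_region_copy(redshift, "example-cluster", "us-west-2")
```

With the copy enabled, a new cluster can be restored from the copied snapshots in the destination region if the source region becomes unavailable.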
Which feature in Redshift forces all COPY and UNLOAD traffic between your cluster and your data repositories through your VPC?
Enhanced VPC Routing
You are running a gaming website that is using DynamoDB as its data store. Users have been asking for a search feature to find other gamers by name, with partial matches if possible. Which AWS technology do you recommend to implement this feature?
Amazon OpenSearch Service
An AWS service that allows you to create, run, and monitor ETL (extract, transform, and load) jobs in a few clicks
AWS Glue
A company is using AWS to host its public websites and internal applications, which generate a lot of logs and traces. There is a requirement to store those logs centrally and to search and analyze them efficiently in real time to detect errors and potential threats. Which AWS service can help them efficiently store and analyze logs?
Amazon OpenSearch Service
This service makes it easy and cost-effective for data engineers and analysts to run applications built using open source big data frameworks such as Apache Spark, Hive, or Presto without having to operate or manage clusters
Amazon EMR (Elastic MapReduce)
An e-commerce company has all its historical data such as orders, customers, revenues, and sales for the previous years hosted on a Redshift cluster. There is a requirement to generate some dashboards and reports indicating the revenues from the previous years and the total sales, so it will be easy to define the requirements for the next year. The DevOps team is assigned to find an AWS service that can help define those dashboards and have native integration with Redshift. Which AWS service is best suited?
Amazon QuickSight
Which AWS Glue feature allows you to save and track the data that has already been processed during a previous run of a Glue ETL job?
Glue Job Bookmarks
You are a DevOps engineer in a machine learning company which has 3 TB of JSON files stored in an S3 bucket. There’s a requirement to do some analytics on those files using Amazon Athena, and you have been tasked to find a way to convert those files’ format from JSON to Apache Parquet. Which AWS service is best suited?
AWS Glue
You have an on-premises application that is used together with an on-premises Apache Kafka to receive a stream of clickstream events from multiple websites. You have been tasked to migrate this application as soon as possible without any code changes. You decided to host the application on an EC2 instance. What is the best option you recommend to migrate Apache Kafka?
Amazon MSK
You have data stored in RDS and S3 buckets, and you are using AWS Lake Formation as a data lake to collect, move, and catalog data so you can do some analytics. You have a lot of big data and ML engineers in the company, and you want to control access to part of the data as it might contain sensitive information. What can you use?
Lake Formation Fine-grained Access Control
Which AWS service is most appropriate when you want to perform real-time analytics on streams of data?
Amazon Kinesis Data Analytics