AWS Analytics Flashcards
A web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data.
Runs highly distributed processing frameworks such as Hadoop, Spark, and Presto for a wide variety of scale-out data processing tasks, including machine learning, graph analytics, data transformation, and streaming data.
Amazon Elastic MapReduce (EMR)
Amazon Elastic MapReduce (EMR) Features
Utilizes a hosted Hadoop framework running on Amazon EC2 and Amazon S3
Managed Hadoop framework for processing huge amounts of data.
Also supports Apache Spark, HBase, Presto, and Flink
Common Amazon Elastic MapReduce (EMR) uses
Most commonly used for log analysis, financial analysis, or extract, transform, and load (ETL) activities
A programmatic task in Amazon Elastic MapReduce (EMR) that performs some processing on data (e.g., counting words)
Step
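The word-count task used as the example of a Step can be sketched in plain Python. This is only a single-process illustration of the map and reduce phases, not code that runs on EMR; the function names `map_phase`, `reduce_phase`, and `word_count` are hypothetical.

```python
from collections import Counter


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_phase(pairs):
    """Reduce: sum the counts emitted for each distinct word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def word_count(lines):
    """Run the map phase over every line, then reduce all emitted pairs."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return reduce_phase(pairs)


if __name__ == "__main__":
    print(word_count(["the quick fox", "The lazy dog"]))
```

On a real cluster the map and reduce phases would run in parallel across many nodes; here they run sequentially to keep the idea visible.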
A collection of EC2 instances provisioned by EMR to run your Steps
Cluster
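As a rough sketch of how a Step is described when it is submitted to a Cluster, the helper below builds a step definition in the dictionary shape accepted by the `Steps` parameter of boto3's EMR `add_job_flow_steps` call. The helper name, the step name, and the S3 path are hypothetical, and the block only builds the dictionary; it does not contact AWS.

```python
def build_spark_step(name, script_s3_uri, args=()):
    """Build one EMR step definition that runs a Spark script via command-runner.jar."""
    return {
        "Name": name,
        # Keep the cluster running if this step fails (one of several allowed values).
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_s3_uri, *args],
        },
    }


if __name__ == "__main__":
    # Hypothetical bucket and script path, for illustration only.
    step = build_spark_step("count-words", "s3://example-bucket/jobs/word_count.py")
    print(step)
    # With boto3 installed and credentials configured, this could be submitted as:
    #   boto3.client("emr").add_job_flow_steps(JobFlowId="<cluster id>", Steps=[step])
```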
Other EMR Facts
EMR uses Apache Hadoop as its distributed data processing engine. Hadoop is an open-source, Java-based software framework that supports data-intensive distributed applications running on large clusters of commodity hardware.
EMR is a good place to deploy Apache Spark, an open-source distributed processing system for big data workloads that uses in-memory caching and optimized query execution.
You can also launch Presto clusters. Presto is an open-source distributed SQL query engine designed for fast analytic queries against large datasets.
EMR launches all nodes for a given cluster in the same Amazon EC2 Availability Zone.
You can access Amazon EMR by using the AWS Management Console, command line tools, SDKs, or the EMR API.
With EMR you have access to the underlying operating system (you can SSH in)
Simple and cost effective to run highly distributed processing frameworks such as Hadoop, Spark, and Presto when compared to on-premises deployments
Flexible – you can run custom applications and code, and define specific compute, memory, storage, and application parameters to optimize your analytic requirements
An interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL
Run interactive queries against data directly in Amazon S3 without worrying about formatting data or managing infrastructure. Can be used with other services such as Amazon Redshift
Amazon Athena
Amazon Athena Features
Serverless, so there is no infrastructure to manage, and you pay only for the queries that you run
Easy to use – simply point to your data in Amazon S3, define the schema, and start querying using standard SQL
Uses Presto with full standard SQL support and works with a variety of standard data formats, including CSV, JSON, ORC, Apache Parquet and Avro
While Amazon Athena is ideal for quick, ad-hoc querying and integrates with Amazon QuickSight for easy visualization, it can also handle complex analysis, including large joins, window functions, and arrays.
Amazon Athena uses a managed Data Catalog to store information and schemas about the databases and tables that you create for your data stored in Amazon S3
Easiest way to run ad-hoc queries for data in S3 without the need to set up or manage any servers
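To make the "point to your data, define the schema, and query with SQL" workflow concrete, here is a minimal sketch that builds the request parameters for boto3's Athena `start_query_execution` call, using a window function of the kind mentioned above. The table `access_logs`, its columns, the database name, and the S3 output location are all hypothetical; the block only builds the request and does not contact AWS.

```python
def build_athena_query_request(database, output_s3_uri):
    """Build keyword arguments for an Athena query that uses a window function."""
    query = (
        "SELECT status, request_time, "
        "COUNT(*) OVER (PARTITION BY status) AS requests_with_status "
        "FROM access_logs "  # hypothetical table defined over files in S3
        "LIMIT 10"
    )
    return {
        "QueryString": query,
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results to an S3 location.
        "ResultConfiguration": {"OutputLocation": output_s3_uri},
    }


if __name__ == "__main__":
    request = build_athena_query_request("weblogs", "s3://example-bucket/athena-results/")
    print(request["QueryString"])
    # With boto3 installed and credentials configured, this could be run as:
    #   boto3.client("athena").start_query_execution(**request)
```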
Fully managed, pay-as-you-go, extract, transform, and load (ETL) service that automates the time-consuming steps of data preparation for analytics
Transform and move data to various destinations. Used to prepare and load data for analytics. The data source can be S3, Redshift, or another database. The Glue Data Catalog can be queried by Athena, EMR, and Redshift Spectrum
AWS Glue
AWS Glue Features
Automatically discovers and profiles data via the Glue Data Catalog, recommends and generates ETL code to transform your source data into target schemas.
Runs the ETL jobs on a fully managed, scale-out Apache Spark environment to load your data into its destination
Allows you to set up, orchestrate, and monitor complex data flows
Discovers properties of data, transforms it, and prepares it for analytics.
Glue can automatically discover both structured and semi-structured data stored in data lakes on Amazon S3, data warehouses in Amazon Redshift, and various databases running on AWS.
It provides a unified view of data via the Glue Data Catalog that is available for ETL, querying and reporting using services like Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum.
Glue automatically generates Scala or Python code for ETL jobs that you can further customize using tools you are already familiar with.
AWS Glue is serverless, so there are no compute resources to configure and manage.
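The idea of automatically discovering a schema from semi-structured data can be illustrated with a toy example. This is not how Glue works internally and does not use any Glue API; it is only a sketch, with the hypothetical function `infer_schema`, of collecting field names and observed types from JSON-like records.

```python
def infer_schema(records):
    """Return {field: sorted list of Python type names seen} across all records."""
    schema = {}
    for record in records:
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return {field: sorted(types) for field, types in schema.items()}


if __name__ == "__main__":
    # Hypothetical semi-structured records: fields vary from record to record.
    records = [
        {"id": 1, "name": "widget"},
        {"id": 2, "price": 1.5},
        {"id": "3"},
    ]
    print(infer_schema(records))
```

A field that shows more than one type (such as `id` above) is the kind of inconsistency an ETL job would typically need to resolve before loading the data into a target schema.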
AWS data warehouse service designed for fast query performance on enterprise reporting and business intelligence workloads, particularly those involving complex SQL with multiple joins and subqueries.
Pull data from many sources, format and organize it, store it, and support complex, high speed queries that produce business reports.
Amazon Redshift
AWS Service that makes it easy to collect, process, and analyze real-time, streaming data so you can get timely insights and react quickly to new information.
Collection of services for processing streams of various data.
Data is processed in “shards”.
Amazon Kinesis
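How records end up in a particular shard can be sketched as follows. In Kinesis Data Streams, each record carries a partition key (a concept not named in the cards above); the key is hashed with MD5 to a 128-bit integer, and each shard owns a contiguous range of that hash space. The sketch assumes the hash space is split evenly across shards, and the helper names are hypothetical.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces 128-bit values


def shard_ranges(shard_count):
    """Split the hash key space into equal contiguous ranges, one per shard."""
    size = HASH_SPACE // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * size
        # The last shard absorbs any remainder so the whole space is covered.
        end = HASH_SPACE - 1 if i == shard_count - 1 else (i + 1) * size - 1
        ranges.append((start, end))
    return ranges


def shard_for_key(partition_key, ranges):
    """Return the index of the shard whose hash range contains the key's MD5 hash."""
    hash_value = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    for index, (start, end) in enumerate(ranges):
        if start <= hash_value <= end:
            return index
    raise ValueError("hash value outside all shard ranges")


if __name__ == "__main__":
    ranges = shard_ranges(4)
    for key in ["device-1", "device-2", "device-3"]:  # hypothetical partition keys
        print(key, "-> shard", shard_for_key(key, ranges))
```

Because the mapping depends only on the partition key, all records with the same key land in the same shard, which preserves their relative order for consumers.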
Securely stream video from connected devices to AWS for analytics, machine learning (ML), and other processing.
Durably stores, encrypts, and indexes video data streams, and allows access to data through easy-to-use APIs.
Producers provide data streams.
Stores data for 24 hours by default, up to 7 days.
Consumers receive and process data.
Can have multiple shards in a stream.
Supports encryption at rest with server-side encryption (KMS) with a customer master key.
Amazon Kinesis Video Streams
Enables you to build custom applications that process or analyze streaming data for specialized needs.
Enables real-time processing of streaming big data.
Useful for rapidly moving data off data producers and then continuously processing the data.
Stores data for later processing by applications (key difference with Firehose which delivers data directly to AWS services)
Amazon Kinesis Data Streams
Accelerated log and data feed intake.
Real-time metrics and reporting.
Real-time data analytics.
Complex stream processing.
Amazon Kinesis Data Streams Common Use Cases