AWS Analytics Flashcards
What is Amazon Elastic MapReduce (EMR)?
Amazon EMR is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data.
How does Amazon EMR work?
EMR uses a managed, hosted Hadoop framework running on Amazon EC2 and Amazon S3 to process huge amounts of data.
It also supports Apache Spark, HBase, Presto, and Flink.
What is EMR normally used for?
It is most commonly used for log analysis, financial analysis, or extract, transform, and load (ETL) activities.
What is a Step?
A Step is a programmatic task for performing some process on the data (e.g. count words).
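As a rough illustration of the word-count example, the sketch below walks through the map and reduce phases in plain Python. It is not EMR or Hadoop code: the names `map_words` and `count_words` are hypothetical, and on a real cluster each phase would run in parallel across many nodes rather than in a single loop.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map phase: emit a (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def count_words(lines: Iterable[str]) -> Counter:
    """Reduce phase: sum the emitted counts for each word."""
    totals: Counter = Counter()
    for line in lines:
        for word, one in map_words(line):
            totals[word] += one
    return totals


if __name__ == "__main__":
    # Hypothetical input lines standing in for files stored in Amazon S3.
    print(count_words(["the quick brown fox", "The lazy dog"]))
```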
What is a Cluster?
A cluster is a collection of EC2 instances provisioned by EMR to run your Steps.
What does EMR use as its distributed data processing engine?
EMR uses Apache Hadoop.
Apache Hadoop is an open-source Java software framework that supports data-intensive distributed applications running on large clusters of commodity hardware.
What software is EMR a good place to deploy?
EMR is a good place to deploy Apache Spark.
Apache Spark is an open-source distributed processing system for big data workloads that uses in-memory caching and optimized query execution.
What clusters can EMR launch?
EMR can launch Presto clusters.
Presto is an open-source distributed SQL query engine designed for fast analytic queries against large datasets.
Where does EMR launch all its nodes?
EMR launches all nodes for a given cluster in the same Amazon EC2 Availability Zone.
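To make the cluster, Step, and Availability Zone cards concrete, here is a minimal sketch that assembles a request shaped like the keyword arguments of boto3's EMR `run_job_flow` call. It only builds a dictionary and never contacts AWS. The helper `build_cluster_request`, the bucket name, the release label, the instance types, and the role names are all assumptions chosen for illustration.

```python
from typing import Any, Dict, List


def build_cluster_request(name: str, availability_zone: str,
                          steps: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Assemble keyword arguments shaped like an EMR run_job_flow request."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",  # assumed release label
        "Applications": [{"Name": "Hadoop"}, {"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "Primary", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            # A single Availability Zone applies to every node in the cluster.
            "Placement": {"AvailabilityZone": availability_zone},
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": steps,
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed instance profile name
        "ServiceRole": "EMR_DefaultRole",      # assumed service role name
    }


# A hypothetical Step that submits a Spark word-count script stored in S3.
word_count_step = {
    "Name": "count-words",
    "ActionOnFailure": "TERMINATE_CLUSTER",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": ["spark-submit", "s3://example-bucket/wordcount.py"],
    },
}

request = build_cluster_request("flashcard-demo", "us-east-1a", [word_count_step])
```

With boto3 installed and credentials configured, a request like this could be passed as `boto3.client("emr").run_job_flow(**request)`; that call is left out here so the sketch runs offline.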
What is Amazon Athena?
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL.
What are Amazon Athena’s characteristics?
Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.
Athena is easy to use – simply point to your data in Amazon S3, define the schema, and start querying using standard SQL.
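The "point to your data, define the schema" step can be sketched as a small helper that renders a `CREATE EXTERNAL TABLE` statement for Parquet data. The function name `build_create_table` and the database, table, column, and bucket names are hypothetical, and the sketch assumes the files under the S3 prefix really are Parquet.

```python
from typing import Dict


def build_create_table(database: str, table: str,
                       columns: Dict[str, str], s3_location: str) -> str:
    """Return a CREATE EXTERNAL TABLE statement for Parquet data in S3."""
    column_sql = ",\n  ".join(
        f"{name} {sql_type}" for name, sql_type in columns.items()
    )
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        f"  {column_sql}\n"
        f")\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{s3_location}'"
    )


if __name__ == "__main__":
    # Hypothetical schema for web request logs.
    print(build_create_table(
        "logs_db", "requests",
        {"path": "string", "status": "int"},
        "s3://example-bucket/requests/",
    ))
```

Once a table like this exists, ordinary `SELECT` statements can be run against it, and the data itself stays in S3.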
What does Amazon Athena use?
Amazon Athena uses Presto with full standard SQL support and works with a variety of standard data formats, including CSV, JSON, ORC, Apache Parquet and Avro.
Amazon Athena uses a managed Data Catalog to store information and schemas about the databases and tables that you create for your data stored in Amazon S3.
What is Amazon Athena’s use case?
Amazon Athena is ideal for quick, ad-hoc querying and integrates with Amazon QuickSight for easy visualization.
It can also handle complex analysis, including large joins, window functions, and arrays.
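For readers unfamiliar with window functions, the sketch below computes a running total per service. It uses Python's built-in `sqlite3` module purely as a local stand-in, not Athena itself, and assumes a SQLite build of 3.25 or later (the first with window functions). The table, column names, and data are invented for illustration.

```python
import sqlite3
from typing import List, Tuple

# Running total of hits per service, ordered by day.
QUERY = """
SELECT service, day,
       SUM(hits) OVER (PARTITION BY service ORDER BY day) AS running_hits
FROM requests
ORDER BY service, day
"""


def running_totals() -> List[Tuple[str, int, int]]:
    """Load sample rows into an in-memory database and run the window query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE requests (service TEXT, day INTEGER, hits INTEGER)"
        )
        conn.executemany(
            "INSERT INTO requests VALUES (?, ?, ?)",
            [("api", 1, 10), ("api", 2, 5), ("web", 1, 7), ("web", 2, 3)],
        )
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in running_totals():
        print(row)
```

The `SUM(...) OVER (PARTITION BY ... ORDER BY ...)` form is standard SQL, so the same shape of query should carry over to Athena with the table name adjusted.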
What is AWS Glue?
AWS Glue is a fully managed, pay-as-you-go, extract, transform, and load (ETL) service that automates the time-consuming steps of data preparation for analytics.
How does AWS Glue work?
AWS Glue automatically discovers and profiles data via the Glue Data Catalog, then recommends and generates ETL code to transform your source data into target schemas.
AWS Glue runs the ETL jobs on a fully managed, scale-out Apache Spark environment to load your data into its destination.
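The idea of transforming source data into a target schema can be sketched in plain Python as a field-by-field mapping. This is not Glue-generated code and does not use the Glue API. The `Mapping` tuple format, the `transform` function, and the field names are hypothetical, and a real Glue job would run the equivalent logic on Spark.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

# (source field, target field, converter): a hypothetical mapping format.
Mapping = Tuple[str, str, Callable[[Any], Any]]


def transform(records: Iterable[Dict[str, Any]],
              mappings: List[Mapping]) -> List[Dict[str, Any]]:
    """Rename and cast fields so each record matches the target schema."""
    transformed = []
    for record in records:
        transformed.append({
            target: convert(record[source])
            for source, target, convert in mappings
        })
    return transformed


if __name__ == "__main__":
    # Extract: hypothetical source rows with string-typed fields.
    source_rows = [{"id": "7", "amt": "12.50"}, {"id": "8", "amt": "3.00"}]
    # Transform: rename fields and cast them to the target types.
    target_rows = transform(
        source_rows,
        [("id", "user_id", int), ("amt", "amount", float)],
    )
    # Load: printing stands in for writing to the destination.
    print(target_rows)
```

Fields that are not listed in the mappings are dropped, which keeps the output limited to the target schema.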