Data and Analytics Flashcards
What is Amazon Athena?
A serverless query service used to analyse data stored in Amazon S3 using standard SQL
What are federated queries?
Queries that can be run across data sources other than S3, such as relational, non-relational, object and custom data sources, via data source connectors that run on Lambda
What are some methods that can be used to increase the performance of Athena?
Partitioning
Using columnar data formats (e.g. Parquet or ORC), so only the columns a query needs are scanned
Using fewer, larger files, as every file adds overhead for Athena to open and read
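As a rough illustration of why partitioning helps, the sketch below mimics partition pruning over Hive-style S3 key prefixes (year=.../month=...). The key layout and the `partition_values` and `prune` helpers are hypothetical. Athena does this internally from partition metadata, not with code like this.

```python
# Hypothetical S3 object keys laid out with Hive-style partitions.
keys = [
    "logs/year=2023/month=11/part-0000.parquet",
    "logs/year=2023/month=12/part-0000.parquet",
    "logs/year=2024/month=01/part-0000.parquet",
    "logs/year=2024/month=01/part-0001.parquet",
    "logs/year=2024/month=02/part-0000.parquet",
]


def partition_values(key):
    """Extract partition columns (name=value path segments) from a key."""
    parts = {}
    for segment in key.split("/"):
        if "=" in segment:
            name, value = segment.split("=", 1)
            parts[name] = value
    return parts


def prune(keys, **wanted):
    """Keep only keys whose partition values match every filter."""
    return [
        k for k in keys
        if all(partition_values(k).get(col) == val for col, val in wanted.items())
    ]


# A filter such as WHERE year = '2024' AND month = '01' only needs these files.
to_scan = prune(keys, year="2024", month="01")
print(f"scanning {len(to_scan)} of {len(keys)} files")
```

Files outside the matching partitions are never read, which reduces the amount of data scanned.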
What is Amazon Redshift used for?
Data warehousing and analytics
Is Redshift columnar or row-based?
Columnar
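To make the row-based versus columnar distinction concrete, here is a toy in-memory comparison. The table, column names and the "values read" counts are illustrative assumptions and do not describe Redshift's actual storage format.

```python
# Toy table of three rows; the column names and values are made up.
rows = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 80.0},
    {"order_id": 3, "region": "EU", "amount": 50.0},
]

# Row-based layout: all values of a row are stored together.
row_store = [tuple(r.values()) for r in rows]

# Columnar layout: all values of a column are stored together.
column_store = {name: [r[name] for r in rows] for name in rows[0]}

# In this toy model, SUM(amount) over the row layout touches every stored value ...
values_read_row = sum(len(r) for r in row_store)
total_row = sum(r[2] for r in row_store)

# ... whereas the columnar layout only reads the one column it needs.
values_read_col = len(column_store["amount"])
total_col = sum(column_store["amount"])

print(total_row, total_col, values_read_row, values_read_col)
```

Both layouts give the same answer, but the columnar one reads a third of the values, which is why columnar storage suits analytical queries that aggregate a few columns over many rows.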
What engine is Redshift based on?
PostgreSQL
What are the two snapshot modes of Redshift and what are the differences?
Automated and manual.
With automated, the snapshot is retained for a period that the user sets, whereas with manual the snapshot is kept until it is deleted.
What are the two node types within a Redshift cluster?
Leader and compute
What is Redshift Spectrum?
A feature of Redshift that allows the user to query data in place in S3 without having to load it into Redshift tables
What is the principal benefit of Redshift Spectrum?
It lets the user draw on far more computing power than their cluster has provisioned, and avoids having to load the S3 data into Redshift
What is OpenSearch?
A managed search and analytics service that allows the user to search any field of the indexed data, including partial matches
What is EMR?
Elastic MapReduce - a service that allows the user to create Hadoop clusters for big data analytics
How does EMR scale?
Automatically, through adding or removing instances (nodes) within the cluster
What are the node types within an EMR cluster?
Master, core and task.
The master node manages the cluster and co-ordinates the other nodes. There is normally only 1 in a cluster (3 in a high-availability setup).
Core nodes run tasks and store data.
Task nodes are optional and just run tasks but don’t store data.
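The three node types correspond to the `InstanceRole` values `MASTER`, `CORE` and `TASK` in the instance group settings of EMR's RunJobFlow API. The sketch below only builds that list as plain data. The `instance_groups` helper, group names, counts and instance type are placeholders chosen for illustration.

```python
def instance_groups(core_count, task_count=0, instance_type="m5.xlarge"):
    """Build an InstanceGroups list; names, sizes and type are placeholders."""
    groups = [
        {"Name": "master", "InstanceRole": "MASTER",
         "InstanceType": instance_type, "InstanceCount": 1},
        {"Name": "core", "InstanceRole": "CORE",
         "InstanceType": instance_type, "InstanceCount": core_count},
    ]
    # Task nodes are optional, so only add the group when some are requested.
    if task_count > 0:
        groups.append(
            {"Name": "task", "InstanceRole": "TASK",
             "InstanceType": instance_type, "InstanceCount": task_count}
        )
    return groups


groups = instance_groups(core_count=2, task_count=4)
print([g["InstanceRole"] for g in groups])
```

A list like this would typically be passed as part of the `Instances` argument to boto3's `run_job_flow`, along with other required settings not shown here.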
What service would be used to make ML-powered interactive dashboards?
QuickSight
When can QuickSight not use SPICE in-memory computation?
When it queries another database directly (a direct query) rather than importing the data into QuickSight
What granularity of security can you set in QuickSight?
Column-level security (row-level security is also available)
What does ETL stand for?
Extract, transform and load
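A minimal sketch of the three stages using only the Python standard library. The in-memory CSV, the `sales` table and the cleaning rules are invented for illustration and have nothing to do with any particular AWS service.

```python
import csv
import io
import sqlite3

# Extract: read raw records (an in-memory CSV stands in for a real source).
raw = io.StringIO("name,amount\nalice, 10 \nbob,not-a-number\ncarol,5\n")
records = list(csv.DictReader(raw))

# Transform: clean whitespace, convert types, drop rows that fail to parse.
clean = []
for rec in records:
    try:
        clean.append((rec["name"].strip().title(), int(rec["amount"].strip())))
    except ValueError:
        continue  # skip rows with malformed amounts

# Load: write the cleaned rows into a target store (in-memory SQLite here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (name TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", clean)
conn.commit()

total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(clean, total)
```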
What is Glue?
A managed, serverless ETL service used to prepare and transform data for analytics
What are Glue Crawlers?
Programs that scan databases or other data stores, infer the schema and write metadata to the Glue Data Catalog, e.g. the type of data and its format
What are Glue Job Bookmarks?
State that Glue persists between job runs to record how far a job got, preventing the re-processing of data that has already been handled
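The idea can be sketched as a toy model in which the bookmark is simply a set of file names that earlier runs have already handled. The `run_job` function and file names are hypothetical. Glue tracks its own state per job and does not work from code like this.

```python
def run_job(available_files, bookmark):
    """Process only files not recorded in the bookmark; return updated state.

    `bookmark` is a set of already-processed file names (a toy stand-in
    for the state a real job bookmark would persist between runs).
    """
    new_files = [f for f in available_files if f not in bookmark]
    for f in new_files:
        pass  # the real transformation work would happen here
    return new_files, bookmark | set(new_files)


bookmark = set()
first, bookmark = run_job(["a.json", "b.json"], bookmark)
second, bookmark = run_job(["a.json", "b.json", "c.json"], bookmark)
print(first, second)
```

The second run only picks up `c.json`, because the first two files are already recorded in the bookmark.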
What is Glue DataBrew?
A service that cleans and normalises data for analytics and ML without the user having to write code, using many pre-built transformations
What are data lakes and Lake Formation?
A data lake is a central place to keep all data of different types for analytics purposes.
Lake Formation is an AWS service that simplifies the process of creating a data lake through the automation of many complex processes.
What level of granularity does Lake Formation have in terms of security?
Row/column level
What service would be used for real-time analytics using SQL?
Kinesis Data Analytics for SQL
Where can Kinesis Data Analytics for SQL read from?
Kinesis Data Streams and Kinesis Data Firehose
What is a benefit of using Kinesis Data Analytics for Apache Flink over for SQL?
Flink is more powerful, allowing more advanced querying than just using SQL
What is Amazon MSK?
Managed Streaming for Apache Kafka (a data streaming alternative to Kinesis) - fully managed Kafka on AWS
Are Kinesis Data Streams’ streams encrypted?
Yes, in flight using TLS (HTTPS endpoints); encryption at rest using KMS is also available
When would Kinesis Data Analytics (KDA) be used over Athena?
For scenarios where the analysis needs to happen on streaming data before it is written to storage
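To illustrate analysing data before it lands in storage, the sketch below counts events per fixed (tumbling) time window as they arrive. The event data, the 10-second window and the `tumbling_counts` helper are assumptions for illustration, and the code assumes events arrive in timestamp order. It is not how Kinesis Data Analytics is implemented.

```python
from collections import Counter

# Hypothetical (timestamp_seconds, event_type) records arriving on a stream.
events = [(1, "click"), (4, "view"), (9, "click"),
          (12, "click"), (19, "view"), (21, "click")]

WINDOW_SECONDS = 10


def tumbling_counts(stream, window=WINDOW_SECONDS):
    """Yield (window_start, Counter) each time a fixed-size window closes."""
    current_start = None
    counts = Counter()
    for ts, kind in stream:
        start = (ts // window) * window
        if current_start is None:
            current_start = start
        if start != current_start:
            yield current_start, counts  # window closed: emit its result
            current_start, counts = start, Counter()
        counts[kind] += 1
    if current_start is not None:
        yield current_start, counts  # flush the final window


results = list(tumbling_counts(events))
for window_start, counts in results:
    print(window_start, dict(counts))
```

Each result is available as soon as its window closes, without first storing the raw events and querying them afterwards.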