Services - Analytics Flashcards
1
Q
Athena - Characteristics
A
- A serverless, interactive query service for analyzing data in S3 with standard SQL
- Pay only for the queries you run, billed by the amount of data each query scans
- No ETL process needed: it queries the data in place in S3. Supports many input formats, such as CSV, TSV, JSON, Parquet, and ORC
- Queries are executed in parallel automatically
2
Q
Athena - Creation steps
A
- Create S3 bucket
- Create a metadata database
- Create a table schema that describes the data in the bucket
- (Optional) Fine-tune the serializer/deserializer (SerDe)
- Run the query
- Access the history (to rerun previous queries or save them)
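
The creation steps above can be sketched as the SQL statements Athena would receive. This is a minimal sketch that only builds the statements as strings and does not submit them. The bucket, prefix, database, table, and column names are hypothetical placeholders, and the CSV SerDe settings are one possible choice for comma-separated files with a header row.

```python
def build_athena_statements(bucket, prefix, database, table, columns):
    """Return DDL plus a sample query for CSV data stored in S3.

    Every name passed in is a hypothetical placeholder.
    """
    column_defs = ",\n  ".join(f"`{name}` {sql_type}" for name, sql_type in columns)
    # Step: create the metadata database.
    create_db = f"CREATE DATABASE IF NOT EXISTS {database}"
    # Steps: create the schema and tune the SerDe for quoted CSV.
    create_table = (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        f"  {column_defs}\n"
        ")\n"
        "ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'\n"
        "WITH SERDEPROPERTIES ('separatorChar' = ',', 'quoteChar' = '\"')\n"
        f"LOCATION 's3://{bucket}/{prefix}/'\n"
        "TBLPROPERTIES ('skip.header.line.count' = '1')"
    )
    # Step: run a query against the new table.
    query = f"SELECT * FROM {database}.{table} LIMIT 10"
    return [create_db, create_table, query]


statements = build_athena_statements(
    bucket="my-log-bucket",
    prefix="exports",
    database="logs_db",
    table="access_logs",
    columns=[("request_time", "string"), ("status", "string")],
)
```

The table is declared `EXTERNAL` because Athena only stores the schema; the data stays in the S3 location and is read at query time.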
3
Q
Athena - Use cases
A
- Extract info from auto-generated log files
- Query exported spreadsheets
- Query data exported from non-AWS databases
4
Q
Elasticsearch Service - Characteristics
A
- Now named Amazon OpenSearch Service. OpenSearch is a distributed, open-source search and analytics suite
- Scales to large volumes of data with fast search and response times, and includes an integrated visualization tool, OpenSearch Dashboards
- Pay based on three dimensions: instance hours (hours the instances are available), storage needed, and data transferred in and out of OpenSearch Service
- Can load streaming data from Kinesis Data Firehose and CloudWatch Logs directly
- Can load streaming data from S3, Kinesis Data Streams and DynamoDB by using Lambda functions as event handlers
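
The Lambda event-handler approach could look like the following for a Kinesis Data Streams source. This is a minimal sketch under stated assumptions: the record payloads are JSON, the index name `clickstream` is hypothetical, and the actual signed HTTP request to the domain's `_bulk` endpoint is left out.

```python
import base64
import json

INDEX_NAME = "clickstream"  # hypothetical index name


def build_bulk_body(event, index_name=INDEX_NAME):
    """Turn a Kinesis Data Streams Lambda event into a _bulk request body."""
    lines = []
    for record in event["Records"]:
        kinesis = record["kinesis"]
        # Lambda receives each Kinesis payload base64-encoded.
        document = json.loads(base64.b64decode(kinesis["data"]))
        # Using the sequence number as the document ID keeps retries idempotent.
        action = {"index": {"_index": index_name, "_id": kinesis["sequenceNumber"]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(document))
    # The bulk API expects newline-delimited JSON that ends with a newline.
    return ("\n".join(lines) + "\n") if lines else ""


def handler(event, context):
    body = build_bulk_body(event)
    # Assumption: sending `body` to the OpenSearch domain is omitted in this sketch.
    return {"documents": len(event["Records"]), "bulk_bytes": len(body.encode("utf-8"))}
```

Each document takes two lines in the body: an action line that names the target index, followed by the document itself.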
5
Q
Elasticsearch Service - Features
A
- Encryption, authentication, authorization, and auditing features
- Offers SQL query syntax
- Reporting, notifications, and asynchronous search
- Anomaly detection on ingested data
- Identify performance problems with OpenTelemetry data
6
Q
EMR - Characteristics
A
- Stands for Elastic MapReduce. Lets you easily run and scale Hadoop clusters
- Integrates with Kinesis, DynamoDB, Redshift, CloudFormation, CloudWatch, Data Pipeline, S3
7
Q
EMR - Hadoop definition
A
- It’s an open source, highly scalable distributed system that processes massive amounts of data
- Uses Hadoop Distributed File System (HDFS)
- Processes structured, unstructured, or semi-structured data
- Supports tools like MapReduce, Spark, and others
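
The MapReduce model mentioned above can be illustrated with a word count. This is a single-process sketch of the map, shuffle, and reduce phases only; in Hadoop these phases run distributed across the cluster's nodes, which this example does not attempt to show. The function names are illustrative.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data", "Big clusters"])
# counts == {"big": 2, "data": 1, "clusters": 1}
```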
8
Q
EMR - Other characteristics 1
A
- Creates and scales managed Hadoop clusters of EC2 instances
- Provides EMRFS (EMR File System) to read and write data directly in S3, plus connectors for DynamoDB, Kinesis, and Redshift
- The architecture consists of a master node, core nodes (where data is stored), and task nodes (compute only)
- No need to manually provision, configure or tune Hadoop clusters
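
The master, core, and task layout could be described as instance groups, in the shape boto3's `run_job_flow` accepts for its `Instances` argument. This is a sketch that only builds the dictionary and makes no AWS call. The instance type, counts, and the choice of Spot for task nodes are illustrative assumptions.

```python
def build_instance_groups(core_count, task_count, instance_type="m5.xlarge"):
    """Describe a cluster layout: one master, N core nodes, optional task nodes."""
    groups = [
        {"Name": "Master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": 1},
        # Core nodes store HDFS data, so they stay on On-Demand capacity here.
        {"Name": "Core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": core_count},
    ]
    if task_count > 0:
        # Task nodes are compute only, so an interrupted Spot instance loses no HDFS data.
        groups.append(
            {"Name": "Task", "InstanceRole": "TASK", "Market": "SPOT",
             "InstanceType": instance_type, "InstanceCount": task_count}
        )
    # Terminating the cluster once its steps finish avoids paying for idle instances.
    return {"InstanceGroups": groups, "KeepJobFlowAliveWhenNoSteps": False}


layout = build_instance_groups(core_count=2, task_count=3)
```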
9
Q
EMR - Other characteristics 2
A
- Pay as you go: avoid paying for idle EC2 instances, and reduce costs further with EC2 Spot and Reserved Instances
- Can use security groups, isolated VPCs, and encryption to restrict access
- Monitors, identifies and replaces poorly performing instances
10
Q
Kinesis - Characteristics
A
- It’s a managed and scalable service that helps to collect, process, and analyze real-time streaming data
- Can ingest real-time data such as video, audio, application logs, website clickstreams, and IoT telemetry data
11
Q
Kinesis - Video and data capabilities
A
- Kinesis Video Streams: streams video from connected devices to AWS for analytics, ML, and other processing
- Kinesis Data Streams: real-time data streaming service that can capture gigabytes of data per second from hundreds of thousands of sources
- Kinesis Data Firehose: captures, transforms, and loads data streams into AWS data stores for near real-time analytics with existing business intelligence tools
- Data Firehose can transform records with Lambda functions and then deliver them to S3, Redshift, and OpenSearch Service (formerly ES)
- Kinesis Data Analytics: processes data streams in real time with SQL or Apache Flink
- Data Analytics can ingest data from Kinesis Data Streams and Kinesis Data Firehose
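
A Lambda transformation for Data Firehose could be sketched as follows. It assumes the incoming records are JSON objects; the `source` field it adds is a made-up transformation for illustration. Firehose expects every incoming `recordId` back with a `result` of `Ok`, `Dropped`, or `ProcessingFailed` and base64-encoded `data`.

```python
import base64
import json


def handler(event, context):
    """Firehose transformation: tag JSON records, flag anything else as failed."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            if not isinstance(payload, dict):
                raise ValueError("expected a JSON object")
        except ValueError:
            # Not a JSON object: hand the record back untouched and mark it failed.
            output.append({"recordId": record["recordId"],
                           "result": "ProcessingFailed",
                           "data": record["data"]})
            continue
        # Hypothetical transformation: add a field to every record.
        payload["source"] = "firehose"
        transformed = (json.dumps(payload) + "\n").encode("utf-8")
        output.append({"recordId": record["recordId"],
                       "result": "Ok",
                       "data": base64.b64encode(transformed).decode("utf-8")})
    return {"records": output}
```

The trailing newline on each transformed record keeps the objects separated once Firehose concatenates them into a single file at the destination.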
12
Q
Kinesis - Pricing
A
- Video Streams: pay for the volume of data ingested, stored, and consumed through the service. WebRTC capabilities are also charged
- Data Streams: pay as you go. Based on two core dimensions (Shard Hour and PUT Payload Unit) and other optional dimensions
- Data Firehose: pay for the volume of data ingested into the service
- Data Analytics: pay for what you use. Based on the number of Kinesis Processing Units (KPUs) used to run your application
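
The two core Data Streams dimensions can be turned into a rough estimate. This is a sketch, not a pricing reference: the rates are parameters the caller must supply, the 730-hour month is a convention, and a PUT Payload Unit is taken here as a 25 KB chunk interpreted as 25 × 1024 bytes, which is an assumption about the unit.

```python
import math

# Assumption: one PUT Payload Unit covers up to 25 KB, read here as 25 * 1024 bytes.
PUT_PAYLOAD_UNIT_BYTES = 25 * 1024


def put_payload_units(record_size_bytes):
    """Number of PUT Payload Units counted for a single record."""
    return max(1, math.ceil(record_size_bytes / PUT_PAYLOAD_UNIT_BYTES))


def monthly_cost(shards, records_per_second, record_size_bytes,
                 shard_hour_rate, rate_per_million_units, hours=730):
    """Estimate cost from Shard Hours plus PUT Payload Units (rates are placeholders)."""
    shard_cost = shards * hours * shard_hour_rate
    units = put_payload_units(record_size_bytes) * records_per_second * 3600 * hours
    return shard_cost + units / 1_000_000 * rate_per_million_units
```

Because a record is rounded up to whole units, a record just over 25 KB counts as two units, so record size affects the bill as well as record count.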
13
Q
Kinesis - Streams Characteristics 1
A
- A Kinesis stream is a set of shards that receive data records from producers and make them available to consumers
- A shard is a uniquely identified sequence of data records in a stream
- A partition key is used to group data by shard within a stream
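
How a partition key selects a shard can be sketched as follows. Kinesis maps each partition key to a 128-bit integer with MD5 and routes the record to the shard whose hash key range contains that value. The even split of the key space across shards is an assumption of this sketch; real streams can have uneven ranges after shard splits and merges.

```python
import hashlib

HASH_KEY_SPACE = 2 ** 128  # MD5 yields a 128-bit integer hash key


def shard_for_key(partition_key, shard_count):
    """Return the index of the shard whose hash key range contains the key."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    # Assumption: the hash key space is divided evenly across the shards.
    range_size = HASH_KEY_SPACE // shard_count
    # The last shard absorbs any remainder at the top of the key space.
    return min(hash_key // range_size, shard_count - 1)
```

Records with the same partition key always land on the same shard, which preserves their relative order; a key shared by too many records concentrates traffic on one shard.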
14
Q
Kinesis - Streams Characteristics 2
A
- Consists of producers, the stream itself, and consumers (known as Kinesis Data Streams applications)
- Consumers can store their results in S3, Redshift, and DynamoDB
- The default retention period is 24 hours. It can be extended up to 8,760 hours (365 days); the earlier limit was 168 hours (7 days)
15
Q
QuickSight - Characteristics
A
- Lets you create and publish interactive BI dashboards, and receive answers in seconds through natural language queries
- Can embed BI dashboards in applications
- Can scale to tens of thousands of users without setup