Analytics Flashcards
Kinesis
- A real-time data processing service that continuously captures (and stores) large amounts of data, which can power real-time streaming dashboards.
- Using the AWS-provided SDKs, you can create real-time dashboards, integrate dynamic pricing strategies, and export data from Kinesis to other AWS services.
Kinesis export data to other services
Kinesis Components
Producers (data creators)
Consumers (data consumers)
Shards (processing power)
Kinesis Benefits
- Real-time processing – continuously collect data and build applications that analyze it as it is generated.
- Parallel processing – multiple Kinesis applications can process the same incoming data stream concurrently.
- Durable – Kinesis synchronously replicates the streaming data across three data centers (Availability Zones) within a single AWS Region and preserves the data for 24 hours by default.
- Scales – can stream from as little as a few megabytes to several terabytes per hour.
Kinesis When to use
Gaming; Real-time analytics; Application alerts; Log/Event data collection; Mobile data capture
Kinesis Producer
- Devices that collect data for Kinesis processing.
- Continuously input data into a Kinesis stream.
- Include, but are not limited to: IoT sensors; mobile devices
- the more data you want to process, the more “Shards” you add to your Kinesis Stream.
- each shard can process 2MB of read data per second, and 1MB of write data per second.
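The per-shard limits above translate into a simple sizing calculation. The following is a minimal sketch, assuming only the two figures quoted in these notes (1 MB/s write, 2 MB/s read per shard). The function name `shards_needed` and the example workload are hypothetical, and real streams have further limits that this ignores.

```python
import math

# Per-shard limits quoted in the notes above.
WRITE_MB_PER_SHARD = 1.0  # MB/s a shard can ingest
READ_MB_PER_SHARD = 2.0   # MB/s a shard can serve to consumers


def shards_needed(write_mb_per_s: float, read_mb_per_s: float) -> int:
    """Return the smallest shard count that covers both write and read throughput."""
    for_writes = math.ceil(write_mb_per_s / WRITE_MB_PER_SHARD)
    for_reads = math.ceil(read_mb_per_s / READ_MB_PER_SHARD)
    # A stream always has at least one shard.
    return max(1, for_writes, for_reads)


if __name__ == "__main__":
    # Hypothetical workload: producers write 5 MB/s, consumers read 12 MB/s in total.
    print(shards_needed(5, 12))
```

In this hypothetical workload the read side is the bottleneck: writes need 5 shards, reads need 6, so the stream needs 6.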
Kinesis Consumer
- Consume the data in a stream, and do so concurrently.
- Multiple consumers can consume the same data at the same time.
- Include: real-time dashboards; S3; Redshift; EMR
- Any application you create can consume the stream's data.
- Kinesis keeps 24 hours of streaming data by default; retention can be extended (up to 7 days originally, and up to 365 days in current AWS documentation).
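The idea that several consumers read the same records independently, and that records expire after the retention period, can be illustrated with a toy in-memory model. This is a sketch only: `ToyStream` and its methods are hypothetical names invented for illustration and are not the Kinesis API. The timestamps are passed in by hand so the behaviour is deterministic.

```python
import time
from dataclasses import dataclass, field


@dataclass
class ToyStream:
    """In-memory stand-in for a stream; not the Kinesis API."""
    retention_seconds: float = 24 * 3600  # default retention from the notes
    records: list = field(default_factory=list)  # (arrival_time, data) pairs

    def put(self, data, now=None):
        """Append a record, stamped with its arrival time."""
        self.records.append((time.time() if now is None else now, data))

    def read_from(self, position, now=None):
        """Return (unexpired records at or after position, next position)."""
        now = time.time() if now is None else now
        fresh = [
            data
            for arrival, data in self.records[position:]
            if now - arrival <= self.retention_seconds
        ]
        return fresh, len(self.records)


stream = ToyStream()
for reading in ("t=20.1", "t=20.4", "t=20.9"):
    stream.put(reading, now=0)

# Two independent consumers, each tracking its own position in the stream.
dashboard_seen, dashboard_pos = stream.read_from(0, now=10)
archiver_seen, archiver_pos = stream.read_from(0, now=10)

# A consumer that first reads 25 hours later finds the records expired.
late_seen, _ = stream.read_from(0, now=25 * 3600)
```

Reading does not remove records, so both consumers receive all three readings. Only the retention window decides when data disappears.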
EMR (Elastic MapReduce)
A service that deploys EC2 instances based on the Hadoop big data framework.
- used to analyze and process vast amounts of data.
- Supports other distributed frameworks like Apache Spark, HBase, Presto, Flink
EMR Workflow
- Data stored in S3, DynamoDB, or Redshift is sent to EMR.
- The data is mapped to a “cluster” of Hadoop master/slave nodes for processing.
- Computations (coded/created by the developer) are used to process the data.
- The processed data is then reduced to a single output set of return information.
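The workflow above (split the input, process the pieces in parallel, reduce to one result) can be sketched in plain Python without Hadoop. This is an illustration of the map/reduce pattern only, not how EMR is programmed. The word-count task, the function names, and the use of threads to stand in for cluster nodes are all assumptions made for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_chunk(chunk: str) -> Counter:
    """Map step: count words in one chunk of the input."""
    return Counter(chunk.lower().split())


def reduce_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge two partial results into one."""
    return left + right


def run_job(chunks: list, workers: int = 2) -> Counter:
    """Map every chunk in parallel, then reduce the partial results."""
    # Each worker thread plays the role of a node processing its share of chunks.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce(reduce_counts, partials, Counter())


# Hypothetical input, already split into three chunks.
chunks = ["error warn error", "info error", "warn info info"]
totals = run_job(chunks)
```

Here `totals` holds a single combined count per word, which corresponds to the "single output set" in the last step of the workflow.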
Other EMR Facts
- The admin has the ability to access the underlying operating system.
- You can add user data to EC2 instances launched into the cluster via bootstrapping.
- EMR takes advantage of parallel processing for faster processing of data.
- you can resize a running cluster at any time, and you can deploy multiple clusters.
EMR Master Node
- A node that manages the cluster by running software components which coordinate the distribution of data and tasks among other (slave) nodes for processing.
- tracks the status of tasks and monitors the health of the cluster.
EMR Slave Node
There are two types: Core Node and Task Node.
Core Node
A slave node that has software components which run tasks AND store data in the Hadoop Distributed File System (HDFS) on your cluster.
- Does the heavy lifting with the data.
Task Node
A slave node that has software components which only run tasks.
- Optional.
EMR Map Phase
- Mapping is a function that defines the process which splits the large data file for processing.
- During the mapping phase, the data is split into 128 MB “chunks”.
- The larger the instance size used in our EMR cluster, the more chunks you can map and process at the same time.
- If there are more chunks than nodes/mappers, the chunks will queue for processing.
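The chunking and queueing described above come down to two ceiling divisions. The sketch below assumes the 128 MB chunk size from these notes and a simplified model in which each mapper handles one chunk at a time. The function names and the example file size are hypothetical.

```python
import math

CHUNK_MB = 128  # chunk size from the notes


def chunk_count(file_mb: float) -> int:
    """Number of chunks a file of the given size is split into."""
    return math.ceil(file_mb / CHUNK_MB)


def processing_waves(file_mb: float, mappers: int) -> int:
    """Rounds of mapping needed when each mapper handles one chunk at a time."""
    if mappers < 1:
        raise ValueError("at least one mapper is required")
    return math.ceil(chunk_count(file_mb) / mappers)


if __name__ == "__main__":
    # Hypothetical 1000 MB file processed by 3 mappers.
    print(chunk_count(1000), processing_waves(1000, 3))
```

A 1000 MB file yields 8 chunks. With 3 mappers the chunks are processed in 3 rounds, with the later chunks queueing until a mapper is free. With 8 or more mappers everything runs in a single round.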
EMR Reduce Phase
Reducing is a function that aggregates the split data back into one data source.
Reduced data needs to be stored, because data processed by the EMR cluster is not persistent.