Analytics Flashcards
Kinesis
- real-time data processing service that continuously captures (and stores) large amounts of data that can power real-time streaming dashboards.
- Using the AWS-provided SDKs, you can create real-time dashboards, integrate dynamic pricing strategies, and export data from Kinesis to other AWS services.
Kinesis exports data to other services
EMR;
S3;
Redshift;
Lambda
Kinesis Components
Stream
Producers (data creators)
Consumers (data consumers)
Shards (processing power)
Kinesis Benefits
- Real-time processing: continuously collect data and build applications that analyze it as it is generated.
- Parallel processing: multiple Kinesis applications can process the same incoming data stream concurrently.
- Durable: Kinesis synchronously replicates the streaming data across three Availability Zones within a single AWS region and preserves the data for 24 hours by default.
- Scales: can stream from as little as a few megabytes to several terabytes per hour.
Kinesis When to use
Gaming; Real-time analytics; Application alerts; Log/Event data collection; Mobile data capture
Kinesis Producer
- devices that collect data for Kinesis processing.
- continuously input data into a Kinesis stream.
- include, but are not limited to: IoT sensors; Mobile devices
- the more data you want to process, the more “Shards” you add to your Kinesis stream.
- each shard can process 2 MB of read data per second and 1 MB of write data per second.
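The shard limits above lend themselves to a quick sizing estimate. The following is a minimal sketch, not an AWS API call: the function name `shards_needed` and the example throughput figures are hypothetical, and only the 1 MB/s write and 2 MB/s read limits from these notes are considered (real streams may also have per-record limits).

```python
import math

# Per-shard capacity as stated in the notes above.
WRITE_MB_PER_SHARD = 1.0
READ_MB_PER_SHARD = 2.0

def shards_needed(write_mb_per_sec, read_mb_per_sec):
    """Estimate the shard count that covers both write and read throughput."""
    by_write = math.ceil(write_mb_per_sec / WRITE_MB_PER_SHARD)
    by_read = math.ceil(read_mb_per_sec / READ_MB_PER_SHARD)
    # A stream always has at least one shard; the larger demand wins.
    return max(1, by_write, by_read)

# Producers write 5 MB/s, consumers read 8 MB/s in total:
# writes need 5 shards, reads need 4, so the stream needs 5.
print(shards_needed(5, 8))  # 5
```

In this example the write side is the bottleneck, so adding consumers would not require more shards until total reads exceed 10 MB/s.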
Kinesis Consumer
- consume data from the stream; consumption is done concurrently.
- multiple consumers can consume the same data at the same time.
- include: real-time dashboards; S3; Redshift; EMR
- any application you created can consume the stream's data.
- keeps 24 hours of streaming data stored by default, but can be configured to store up to 7 days (current AWS versions allow retention of up to 365 days).
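The consumer behaviour above (several readers over the same records, with a retention window) can be pictured with a small in-memory model. This is only a toy sketch: `ToyStream` and its methods are hypothetical names and do not correspond to the Kinesis API.

```python
class ToyStream:
    """In-memory stand-in for a stream: records are kept, not removed on read."""

    def __init__(self, retention_hours=24):
        self.retention_hours = retention_hours
        self.records = []  # list of (arrival_hour, data)

    def put(self, arrival_hour, data):
        self.records.append((arrival_hour, data))

    def read(self, position, now_hour):
        """Return unexpired records from `position` onward, plus the new position."""
        live = [data for hour, data in self.records[position:]
                if now_hour - hour < self.retention_hours]
        return live, len(self.records)

stream = ToyStream()
stream.put(0, "a")
stream.put(10, "b")

# Two independent consumers each track their own position,
# so both see the same data.
dashboard, _ = stream.read(0, now_hour=12)
archiver, _ = stream.read(0, now_hour=12)
print(dashboard, archiver)  # ['a', 'b'] ['a', 'b']

# At hour 30 the record from hour 0 is past the 24-hour window.
late, _ = stream.read(0, now_hour=30)
print(late)  # ['b']
```

Because reading does not delete anything, adding a consumer never takes data away from another one; only the retention window removes records.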
EMR (Elastic MapReduce)
- a service that deploys EC2 instances running the Hadoop big data framework.
- used to analyze and process vast amounts of data.
- Supports other distributed frameworks like Apache Spark, HBase, Presto, Flink
EMR Workflow
- Data stored in S3, DynamoDB, or Redshift is sent to EMR.
- the data is mapped to a “cluster” of Hadoop Master/Slave nodes for processing.
- computations (coded/created by the developer) are used to process the data.
- the processed data is then reduced to a single output set of return information.
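The workflow above (split the data, process each piece, reduce to one result) can be sketched in plain Python. This is an illustration of the map/reduce idea only, not Hadoop or EMR code; the sample chunks and the function names `map_chunk` and `reduce_counts` are made up for the example.

```python
from collections import Counter
from functools import reduce

def map_chunk(chunk):
    """Map step: turn one chunk of text into per-word counts."""
    return Counter(chunk.split())

def reduce_counts(left, right):
    """Reduce step: merge two partial results into one."""
    return left + right

# Pretend each string is a chunk handed to a different node.
chunks = ["error warn error", "info error", "warn info info"]

mapped = [map_chunk(c) for c in chunks]           # runs per chunk
total = reduce(reduce_counts, mapped, Counter())  # single output set

print(total["error"], total["warn"], total["info"])  # 3 2 3
```

On a real cluster the map calls would run in parallel on different nodes; here they run one after another, but the result is the same.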
Other EMR Facts
- admin has the ability to access the underlying operating system.
- you can add user data to EC2 instances launched into the cluster via bootstrapping.
- EMR takes advantage of parallel processing across the cluster to process data.
- you can resize a running cluster at any time, and you can deploy multiple clusters.
EMR Master Node
- a node that manages the cluster by running software components which coordinate the distribution of data and tasks among the other (slave) nodes for processing.
- tracks the status of tasks and monitors the health of the cluster.
EMR Slave Node
- comes in two types: Core Node and Task Node.
Core Node
- a slave node with software components that run tasks AND store data in the Hadoop Distributed File System (HDFS) on your cluster.
- does the heavy lifting with the data.
Task Node
- a slave node that has software components which only run tasks.
- optional
EMR Map Phase
- Mapping is a function that defines how the large data file is split for processing.
- during the mapping phase, the data is split into 128 MB “chunks”.
- the larger the instance size used in our EMR cluster, the more chunks you can map and process at the same time.
- if there are more chunks than nodes/mappers, the chunks will queue for processing.
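The chunk and queue arithmetic above can be worked through with a short calculation. This is a rough sketch assuming the 128 MB chunk size from these notes and one chunk per mapper at a time; `plan_map_phase` and the example numbers are hypothetical.

```python
import math

CHUNK_MB = 128  # chunk size from the notes above

def plan_map_phase(file_mb, mappers):
    """Return (chunks, running at once, queued at start, processing rounds)."""
    chunks = math.ceil(file_mb / CHUNK_MB)
    running = min(chunks, mappers)      # chunks that start immediately
    queued = chunks - running           # chunks that wait for a free mapper
    rounds = math.ceil(chunks / mappers)
    return chunks, running, queued, rounds

# A 1024 MB file on a cluster with 3 mappers:
# 8 chunks, 3 start at once, 5 queue, finished in 3 rounds.
print(plan_map_phase(1024, 3))  # (8, 3, 5, 3)
```

With 8 or more mappers the same file would have no queued chunks and finish in a single round, which is the benefit of a larger cluster described above.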