Analytics Flashcards

Question 1

Q

Kinesis

Answer

A

A real-time data processing service that continuously captures (and stores) large amounts of data, which can power real-time streaming dashboards.
Using the AWS-provided SDKs, you can create real-time dashboards, integrate dynamic pricing strategies, and export data from Kinesis to other AWS services.

Question 2

Q

Kinesis export data to other services

Answer

A

EMR;
S3;
Redshift;
Lambda

Question 3

Q

Kinesis Components

Answer

A

Stream
Producers (data creators)
Consumers (data consumers)
Shards (processing power)

Question 4

Q

Kinesis Benefits

Answer

A

- Real-time processing - continuously collect data and build applications that analyze it as it is generated.
- Parallel processing - multiple Kinesis applications can process the same incoming data stream concurrently.
- Durable - Kinesis synchronously replicates the streaming data across three data centers within a single AWS region and preserves it for 24 hours by default.
- Scales - can stream from as little as a few megabytes to several terabytes per hour.

Question 5

Q

Kinesis When to use

Answer

A

Gaming;
Real-time analytics;
Application alerts;
Log/Event data collection
Mobile data capture

Question 6

Q

Kinesis Producer

Answer

A

Devices that collect data for Kinesis processing.
- continuously put data into a Kinesis stream
- include, but are not limited to: IoT sensors; mobile devices
- the more data you want to process, the more “shards” you add to your Kinesis stream.
- each shard supports up to 2 MB of read data per second and 1 MB of write data per second.
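
The per-shard limits above can be turned into a rough sizing rule. This is a minimal sketch, assuming the 1 MB/s write and 2 MB/s read figures quoted on this card; the function name `shards_needed` is made up for illustration and is not part of any AWS SDK.

```python
import math

# Per-shard limits as quoted on the card (assumed to apply to your stream).
WRITE_MB_PER_SHARD = 1.0
READ_MB_PER_SHARD = 2.0


def shards_needed(write_mb_per_sec: float, read_mb_per_sec: float) -> int:
    """Return the smallest shard count that covers both write and read throughput."""
    for_writes = math.ceil(write_mb_per_sec / WRITE_MB_PER_SHARD)
    for_reads = math.ceil(read_mb_per_sec / READ_MB_PER_SHARD)
    # A stream always has at least one shard.
    return max(1, for_writes, for_reads)
```

For example, producers writing 5 MB/s with consumers reading 8 MB/s would need 5 shards under these assumptions, because the write side is the tighter limit.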

Question 7

Q

Kinesis Consumer

Answer

A

Consume data from the stream, concurrently.
- multiple consumers can consume the same data at the same time.
- include: real-time dashboards; S3; Redshift; EMR
- any application you created can consume the stream's data
- the stream keeps 24 hours of streaming data by default, but can be configured to store up to 7 days.
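
To illustrate why several consumers can read the same data, here is a toy in-memory model, not the Kinesis API. The class names `Stream` and `Consumer` are hypothetical. The point is that reading does not remove records, and each consumer keeps its own position.

```python
class Stream:
    """In-memory stand-in for a stream: records are appended and never removed by reads."""

    def __init__(self):
        self.records = []

    def put(self, record):
        self.records.append(record)


class Consumer:
    """Each consumer tracks its own read position, so consumers do not interfere."""

    def __init__(self, stream):
        self.stream = stream
        self.position = 0

    def poll(self):
        # Return everything added since this consumer's last poll.
        new_records = self.stream.records[self.position:]
        self.position = len(self.stream.records)
        return new_records
```

With this model, a dashboard consumer and an archiving consumer polling the same stream both receive every record, each at its own pace.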

Question 8

Q

EMR (Elastic MapReduce)

Answer

A

A service that deploys EC2 instances running the Hadoop big data framework.

- used to analyze and process vast amounts of data.
- Supports other distributed frameworks like Apache Spark, HBase, Presto, Flink

Question 9

Q

EMR Workflow

Answer

A

- Data stored in S3, DynamoDB, or Redshift is sent to EMR.
- The data is mapped to a “cluster” of Hadoop master/slave nodes for processing.
- Computations (coded/created by the developer) are used to process the data.
- The processed data is then reduced to a single output set of return information.
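
The map-then-reduce flow above can be sketched on a single machine with a word count. This is only an illustration in plain Python, not Hadoop or EMR code; `map_chunk`, `reduce_counts`, and `run_job` are hypothetical names, and on a real cluster the map step would run on many nodes in parallel.

```python
from collections import Counter
from functools import reduce


def map_chunk(chunk: str) -> Counter:
    """Map step: turn one chunk of text into partial word counts."""
    return Counter(chunk.split())


def reduce_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge two partial results into one."""
    return left + right


def run_job(chunks):
    """Map every chunk, then reduce the partial results to a single output."""
    mapped = [map_chunk(chunk) for chunk in chunks]
    return reduce(reduce_counts, mapped, Counter())
```

Calling `run_job(["a b a", "b c"])` maps each string to its own counts and then merges them into one result covering all chunks.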

Question 10

Q

Other EMR Facts

Answer

A

- the admin has the ability to access the underlying operating system.
- you can add user data to EC2 instances launched into the cluster via bootstrapping.
- EMR takes advantage of parallel processing for faster processing of data.
- you can resize a running cluster at any time, and you can deploy multiple clusters.

Question 11

Q

EMR Master Node

Answer

A

- a node that manages the cluster by running software components which coordinate the distribution of data and tasks among the other (slave) nodes for processing.
- tracks the status of tasks and monitors the health of the cluster.

Question 12

Q

EMR Slave Node

Answer

A

Core node and Task Node.

Question 13

Q

Core Node

Answer

A

A slave node that has software components which run tasks AND store data in the Hadoop Distributed File System (HDFS) on your cluster.
- does the heavy lifting with the data.

Question 14

Q

Task Node

Answer

A

a slave node that has software components which only run tasks.
– optional

Question 15

Q

EMR Map Phase

Answer

A

Mapping is a function that defines the process which splits the large data file for processing.
During the mapping phase, the data is split into 128 MB “chunks”.
The larger the instance size used in the EMR cluster, the more chunks you can map and process at the same time.
If there are more chunks than nodes/mappers, the chunks will queue for processing.
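
The chunking and queueing described above comes down to simple arithmetic. A minimal sketch, assuming the 128 MB chunk size quoted on this card and a fixed number of mappers running in parallel; both function names are hypothetical.

```python
import math

CHUNK_MB = 128  # chunk size as quoted on the card


def chunk_count(file_mb: float) -> int:
    """Number of chunks a file of the given size is split into."""
    return math.ceil(file_mb / CHUNK_MB)


def processing_waves(file_mb: float, parallel_mappers: int) -> int:
    """Rounds of mapping needed when extra chunks must queue for a free mapper."""
    return math.ceil(chunk_count(file_mb) / parallel_mappers)
```

Under these assumptions a 1024 MB file yields 8 chunks. With 8 mappers they are all processed at once, while with 3 mappers the chunks queue and take 3 rounds.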

Question 16

Q

EMR reduce phase

Answer

A

Reducing is a function that aggregates the split data back into one data source.
Reduced data needs to be stored, because data processed by the EMR cluster is not persistent.

(16 cards)