Section - Big Data Flashcards

Question 1

Q

The 3 V’s of Big Data?

Answer

A

Volume
- Ranges from terabytes to petabytes of data
Variety
- Includes data from a wide range of sources and formats
Velocity
- Businesses require speed.
- Data needs to be collected, stored, processed and analyzed within a short period of time.

Question 2

Q

What is Redshift?

Answer

A

Redshift is a fully managed, petabyte-scale data warehouse service in the cloud.
It’s a very large relational database traditionally used in big data applications.
Redshift is incredibly big: it can hold up to 16 petabytes of data.
Redshift is not a high-availability service; by default a cluster runs in a single Availability Zone.
Automatic backups are retained for 1 day by default, but retention can be extended to 35 days.

Question 3

Q

What is an ETL?

Answer

A

Extract
Transform
Load
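The three stages can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not how any AWS service implements ETL; the sample CSV, the `payments` table, and the function names are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical raw input: one row has an invalid amount.
RAW_CSV = "customer,amount\nalice,10.50\nbob,not_a_number\ncarol,4.25\n"


def extract(text):
    """Extract: read raw rows from a CSV source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with invalid amounts, convert amounts to cents."""
    cleaned = []
    for row in rows:
        try:
            cents = round(float(row["amount"]) * 100)
        except ValueError:
            continue  # skip rows that cannot be parsed
        cleaned.append((row["customer"], cents))
    return cleaned


def load(records, conn):
    """Load: write the cleaned records into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payments (customer TEXT, cents INTEGER)"
    )
    conn.executemany("INSERT INTO payments VALUES (?, ?)", records)
    conn.commit()


def run_etl(text, conn):
    """Run all three stages, then return what landed in the target table."""
    load(transform(extract(text)), conn)
    return conn.execute(
        "SELECT customer, cents FROM payments ORDER BY customer"
    ).fetchall()


if __name__ == "__main__":
    print(run_etl(RAW_CSV, sqlite3.connect(":memory:")))
```

Here SQLite stands in for the data warehouse; in practice the load target would be something like Redshift.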

Question 4

Q

What is AWS Elastic MapReduce (EMR)?

Answer

A

EMR is a managed big data platform that allows you to process vast amounts of data using open-source tools, such as Spark, Hive, HBase, Flink, Hudi and Presto.
It’s AWS’s ETL tool.
It runs open-source frameworks on a cluster (a fleet of EC2 instances).
EC2 Rules Apply
- You can use Reserved Instances and Spot instances to reduce your cost.
The architecture lives inside a VPC.

Question 5

Q

What is AWS Kinesis?

Answer

A

Kinesis is originally a Greek word meaning movement or motion. Amazon Kinesis deals with data that is in motion, or streaming data.

Streaming Data?

Data generated continuously by thousands of data sources, which typically send data records simultaneously and in small sizes (kilobytes). Examples:

Financial Transactions
Stock prices
Game data (as the gamer plays)
Social media feeds
Location tracking data (Uber)
IoT sensors
Clickstream
Log files

Question 6

Q

What are the 4 core services of AWS Kinesis?

Answer

A

Kinesis Video Streams
- Amazon Kinesis Video Streams makes it easy to securely stream video from connected devices to AWS for analytics, machine learning (ML), and other processing.
Kinesis Data Streams
- Amazon Kinesis Data Streams is a scalable and durable real-time data streaming service that can continuously capture gigabytes of data per second from hundreds of thousands of sources.
Kinesis Data Firehose
- Capture, transform, and load data streams into AWS data stores to enable near-real-time analytics with BI tools.
Kinesis Data Analytics
- Analyze, query and transform streamed data in real-time using standard SQL. Store the results in an AWS data store.

Question 7

Q

Kinesis Data Streams?

Answer

A

Producers
- Devices which produce data for streaming
- e.g. IoT device
Kinesis Streams
- Data is stored in shards
- Data is retained for 24 hours by default; the course material gives a maximum of 7 days, though AWS has since raised the limit to 365 days
Consumers
- Consume stored data and apply business logic
- e.g. EC2 instances, Lambda functions …

Question 8

Q

AWS Kinesis Shards?

Answer

A

Kinesis streams are made up of shards. Each shard is a sequence of one or more data records and provides a fixed unit of capacity:

Reads: 5 transactions per second per shard
The max total read rate is 2 MB per second per shard
Writes: 1,000 records per second per shard
The max total write rate is 1 MB per second per shard

NB: The data capacity of the stream is determined by the number of shards. If the data rate increases, you can increase the capacity of your stream by increasing the number of shards.
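As a worked example of the arithmetic, the sketch below estimates how many shards a stream needs from the per-shard limits on this card. The function name and the sample workloads are hypothetical, and it is only a back-of-the-envelope calculation that ignores uneven distribution of records across shards.

```python
import math

# Per-shard limits taken from the card above.
WRITE_MB_PER_SHARD = 1.0        # max total write rate, MB per second
WRITE_RECORDS_PER_SHARD = 1000  # max records written per second
READ_MB_PER_SHARD = 2.0         # max total read rate, MB per second


def shards_needed(write_mb_per_sec, records_per_sec, read_mb_per_sec):
    """Return the smallest shard count that satisfies every limit."""
    by_write_bytes = math.ceil(write_mb_per_sec / WRITE_MB_PER_SHARD)
    by_write_count = math.ceil(records_per_sec / WRITE_RECORDS_PER_SHARD)
    by_read_bytes = math.ceil(read_mb_per_sec / READ_MB_PER_SHARD)
    # A stream always has at least one shard.
    return max(1, by_write_bytes, by_write_count, by_read_bytes)


if __name__ == "__main__":
    # 4.5 MB/s in, 2,500 records/s, 9 MB/s out: byte rates dominate.
    print(shards_needed(4.5, 2500, 9.0))
    # Many tiny records: the record-count limit dominates.
    print(shards_needed(0.2, 6000, 0.4))
```

Whichever limit is hit first determines the shard count, so a stream of many tiny records can need more shards than its byte rate alone suggests.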

Question 9

Q

AWS Kinesis Data Firehose?

Answer

A

Producers
- Devices produce data
- e.g. IoT
Kinesis Firehose
- No shards
- No data retention

Question 10

Q

AWS Kinesis Data Analytics?

Answer

A

Producers
- Devices produce data
- e.g. IoT
Data is pushed to Firehose
You can run SQL queries against incoming data and store the results.
Real-time analytics
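
The "query incoming data with SQL, then store the results" idea can be illustrated locally. The sketch below is not Kinesis Data Analytics: it uses Python's built-in SQLite as a stand-in, processes one small batch rather than a continuous stream, and the table names, column names, and sensor readings are all hypothetical.

```python
import sqlite3


def analyze(records):
    """Run a SQL aggregate over a batch of records and store the results."""
    conn = sqlite3.connect(":memory:")
    # Stand-in for the incoming stream: a table of (device, temperature).
    conn.execute("CREATE TABLE incoming (device TEXT, temperature REAL)")
    conn.executemany("INSERT INTO incoming VALUES (?, ?)", records)
    # The query output is stored in its own table, mirroring
    # "store the results in a data store".
    conn.execute(
        "CREATE TABLE results AS "
        "SELECT device, AVG(temperature) AS avg_temp, COUNT(*) AS readings "
        "FROM incoming GROUP BY device"
    )
    return conn.execute(
        "SELECT device, avg_temp, readings FROM results ORDER BY device"
    ).fetchall()


if __name__ == "__main__":
    batch = [("sensor-1", 20.0), ("sensor-1", 22.0), ("sensor-2", 30.0)]
    print(analyze(batch))
```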

Question 11

Q

AWS Kinesis Exam Tips?

Question 12

Q

AWS Kinesis Video Streams?

Answer

A

Securely stream video from connected devices to AWS

Videos can be used for analytics and machine learning.

Question 13

Q

AWS Kinesis Shards and Consumers?

Answer

A

The Kinesis Client Library (KCL) running on your consumers creates a record processor for each shard that is being consumed by your instance.
If you increase the number of shards, the KCL will add more record processors on your consumers.
CPU utilisation is what should drive the quantity of consumer instances you have, NOT the number of shards in your Kinesis stream.
Use an Auto Scaling group, and base the scaling decisions on CPU load on your consumers.
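
The relationship between shards, record processors, and consumer instances can be sketched as follows. This is a simplified illustration: round-robin assignment is an assumption standing in for the KCL's own balancing mechanism, and the function name, shard IDs, and instance IDs are hypothetical.

```python
def assign_record_processors(shard_ids, consumer_ids):
    """Give every shard exactly one record processor, spread over consumers.

    Round-robin is a stand-in here; the real KCL balances shards
    across workers with its own mechanism.
    """
    assignment = {consumer: [] for consumer in consumer_ids}
    for index, shard in enumerate(shard_ids):
        consumer = consumer_ids[index % len(consumer_ids)]
        assignment[consumer].append(shard)
    return assignment


if __name__ == "__main__":
    consumers = ["i-a", "i-b"]
    # Four shards on two instances: two record processors each.
    print(assign_record_processors(
        ["shard-0", "shard-1", "shard-2", "shard-3"], consumers))
    # Resharding to six shards adds record processors, not instances.
    print(assign_record_processors(
        [f"shard-{n}" for n in range(6)], consumers))
```

The number of record processors always equals the number of shards, while the number of instances stays whatever the Auto Scaling group decides from CPU load.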

Question 14

Q

What is AWS Athena?

Answer

A

Athena is an interactive query service that makes it easy to analyze data in S3 using SQL.
This allows you to directly query data in your S3 Bucket without loading it into a database.

Question 15

Q

What is AWS Glue?

Answer

A

Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data.
It allows you to perform ETL workloads without managing underlying servers.
It replaces EMR … serverless architecture
Glue structures the data

Question 16

Q

Exam Tips: Glue and Athena?

Answer

A

Serverless SQL
- Athena allows you to directly query data that’s stored on S3 using SQL, with no servers to manage.
Both Athena and Glue are fully managed by AWS
Glue can help design a schema for your data

Question 17

Q

What is AWS QuickSight (like Power BI)?

Answer

A

Amazon QuickSight is a fully managed business intelligence (BI) data visualization service.
It allows you to easily create dashboards and share them within your company.

Question 18

Q

What is Elasticsearch?

Answer

A

Amazon Elasticsearch Service is a fully managed version of the open-source application Elasticsearch.
It allows you to quickly search over your stored data and analyze the data you get back.
It’s commonly used as part of the Elasticsearch, Logstash, Kibana (ELK) stack.

(18 cards)