Section - Big Data Flashcards
The 3 V’s of Big Data?
-
Volume
- Ranges from terabytes to petabytes of data
-
Variety
- Includes data from a wide range of sources and formats
-
Velocity
- Businesses require speed.
- Data needs to be collected, stored, processed and analyzed within a short period of time.
What is Redshift?
- Redshift is a fully managed, petabyte-scale data warehouse service in the cloud
- It’s a very large relational database traditionally used in big data applications.
- Redshift is incredibly big: it can hold up to 16 petabytes of data.
- Redshift is not a high-availability service by default; a cluster runs in a single Availability Zone.
- Automatic backups are retained for 1 day by default, but retention can be extended to 35 days
What is an ETL?
- Extract
- Transform
- Load
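The three steps can be sketched in a few lines of Python. This is only an illustrative toy, not how any AWS service implements ETL: the CSV text, the field names (`user_id`, `amount`), and the in-memory `warehouse` list are all hypothetical stand-ins.

```python
import csv
import io

# Hypothetical raw input: CSV text standing in for a source-system export.
RAW_CSV = "user_id,amount\n1,10.50\n2,not_a_number\n3,4.25\n"


def extract(raw_text):
    """Extract: read rows out of the source format."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows):
    """Transform: cast types and drop rows that fail validation."""
    clean = []
    for row in rows:
        try:
            clean.append(
                {"user_id": int(row["user_id"]), "amount": float(row["amount"])}
            )
        except ValueError:
            continue  # skip malformed rows
    return clean


def load(rows, target):
    """Load: append the cleaned rows to the target store (a list here)."""
    target.extend(rows)
    return len(rows)


warehouse = []  # hypothetical stand-in for a data warehouse table
loaded = load(transform(extract(RAW_CSV)), warehouse)
print(loaded, warehouse)
```

The malformed second row is dropped during the transform step, so only two rows reach the target.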
What is AWS Elastic MapReduce (EMR)?
- EMR is a managed big data platform that allows you to process vast amounts of data using open-source tools, such as Spark, Hive, HBase, Flink, Hudi and Presto.
- It’s AWS’s ETL tool.
- It runs open-source frameworks on a cluster (a fleet of EC2 instances)
- EC2 Rules Apply
- You can use Reserved Instances and Spot instances to reduce your cost.
- The architecture lives inside a VPC.
What is AWS Kinesis?
Kinesis is originally a Greek word meaning movement or motion. Amazon Kinesis deals with data that is in motion, or streaming data.
Streaming Data?
Data generated continuously by thousands of data sources, which typically send data records simultaneously and in small sizes (kilobytes)
- Financial Transactions
- Stock prices
- Game data (as the gamer plays)
- Social media feeds
- Location tracking data (Uber)
- IoT sensors
- Clickstream
- Log files
What are the 4 core services of AWS Kinesis?
-
Kinesis Video Streams
- Amazon Kinesis Video Streams makes it easy to securely stream video from connected devices to AWS for analytics, machine learning (ML), and other processing.
-
Kinesis Data Streams
- Amazon Kinesis Data Streams is a scalable and durable real-time data streaming service that can continuously capture gigabytes of data per second from hundreds of thousands of sources.
-
Kinesis Data Firehose
- Capture, transform, and load data streams into AWS data stores to enable near-real-time analytics with BI tools
-
Kinesis Data Analytics
- Analyze, query and transform streamed data in real-time using standard SQL. Store the results in an AWS data store.
Kinesis Data Streams?
-
Producers
- Devices which produce data for streaming
- e.g. IoT device
-
Kinesis Streams
- Data is stored in shards
- Data is retained for 24 hours by default; retention can be extended to 7 days (AWS now allows up to 365 days)
-
Consumers
- Consume stored data and apply business logic
- e.g. EC2 instances, Lambda functions …
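The producer, stream, and consumer roles can be pictured with a toy in-memory model. This is a minimal sketch that assumes the 24-hour default retention; `ToyStream` and its methods are hypothetical names and are not part of the Kinesis API.

```python
from collections import deque

RETENTION_SECONDS = 24 * 60 * 60  # 24-hour default retention


class ToyStream:
    """In-memory stand-in for a stream; NOT the Kinesis API."""

    def __init__(self, retention_seconds=RETENTION_SECONDS):
        self.retention_seconds = retention_seconds
        self.records = deque()  # (arrival_time_seconds, payload), oldest first

    def put(self, payload, now):
        """Producer side: append a record with its arrival time."""
        self.records.append((now, payload))

    def expire(self, now):
        """Drop records that have aged past the retention window."""
        while self.records and now - self.records[0][0] >= self.retention_seconds:
            self.records.popleft()

    def read_all(self, now):
        """Consumer side: read every record still inside the window."""
        self.expire(now)
        return [payload for _, payload in self.records]


stream = ToyStream()
stream.put("reading-a", now=0)           # producer writes at t = 0
stream.put("reading-b", now=23 * 3600)   # producer writes 23 hours later
still_available = stream.read_all(now=24 * 3600 + 1)  # consumer reads after 24 h
print(still_available)
```

The first record has aged out by the time the consumer reads, so only the second one is returned.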
AWS Kinesis Shards?
Kinesis streams are made up of shards. Each shard is a sequence of one or more data records and provides a fixed unit of capacity:
- Up to five read transactions per second
- The max total read rate is 2 MB per second
- Up to 1,000 record writes per second
- The max total write rate is 1 MB per second
NB: The data capacity of the stream is determined by the number of shards. If the data rate increases, you can increase capacity on your stream by increasing the number of shards.
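The per-shard limits make shard sizing a simple piece of arithmetic: take whichever limit needs the most shards. The sketch below assumes the per-shard figures listed in this card; the function name and the example workloads are hypothetical.

```python
import math

# Per-shard limits taken from the card above.
WRITE_MB_PER_SHARD = 1.0
WRITE_RECORDS_PER_SHARD = 1000
READ_MB_PER_SHARD = 2.0


def shards_needed(write_mb_per_s, records_per_s, read_mb_per_s):
    """Return the smallest shard count that satisfies every per-shard limit."""
    return max(
        1,  # a stream always has at least one shard
        math.ceil(write_mb_per_s / WRITE_MB_PER_SHARD),
        math.ceil(records_per_s / WRITE_RECORDS_PER_SHARD),
        math.ceil(read_mb_per_s / READ_MB_PER_SHARD),
    )


# Hypothetical workload limited by throughput: 4.5 MB/s in, 9 MB/s out.
throughput_bound = shards_needed(4.5, 3000, 9.0)

# Hypothetical workload limited by record count: many tiny records.
record_bound = shards_needed(0.5, 2500, 1.0)

print(throughput_bound, record_bound)
```

In the first workload the write and read throughput each call for five shards; in the second, the 1,000 records-per-second limit is the binding one and calls for three.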
AWS Kinesis Data Firehose?
- Producers
- Devices produce data
- e.g. IoT
- Kinesis Firehose
- No shards
- No data retention
AWS Kinesis Data Analytics?
- Producers
- Devices produce data
- e.g. IoT
- Data is pushed to Firehose
- You can run SQL queries against incoming data and store the results.
- Real-time analytics
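The idea of running SQL against incoming records and storing the results can be imitated locally. In this sketch Python's built-in `sqlite3` module stands in for the analytics service; Kinesis Data Analytics has its own streaming SQL, which this does not reproduce. The table names and sensor readings are hypothetical.

```python
import sqlite3

# Hypothetical batch of incoming records: (device, temperature).
incoming = [("sensor-a", 21.5), ("sensor-b", 19.0), ("sensor-a", 22.5)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (device TEXT, temperature REAL)")
conn.execute("CREATE TABLE results (device TEXT, avg_temperature REAL)")

# Records arrive from the producers.
conn.executemany("INSERT INTO readings VALUES (?, ?)", incoming)

# Run a SQL query against the incoming data and store the results.
conn.execute(
    "INSERT INTO results "
    "SELECT device, AVG(temperature) FROM readings GROUP BY device"
)

results = conn.execute(
    "SELECT device, avg_temperature FROM results ORDER BY device"
).fetchall()
print(results)
conn.close()
```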
AWS Kinesis Exam Tips?
AWS Kinesis Video Streams?
Securely stream video from connected devices to AWS
- Videos can be used for analytics and machine learning.
AWS Kinesis Shards and Consumers?
- The Kinesis Client Library (KCL) running on your consumers creates a record processor for each shard that is being consumed by your instance
- If you increase the number of shards, the KCL will add more record processors on your consumers
- CPU utilisation is what should drive the quantity of consumer instances you have, NOT the number of shards in your Kinesis stream.
- Use an Auto Scaling group, and base the scaling decisions on CPU load on your consumers.
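The two points above (one record processor per shard, instance count driven by CPU) can be sketched as two small functions. This is a simplified illustration only: the CPU thresholds, the instance limits, and the even spread of processors are assumptions, and neither function calls the KCL or Auto Scaling APIs.

```python
def desired_instances(current_instances, avg_cpu_percent,
                      scale_out_at=70.0, scale_in_at=30.0,
                      min_instances=1, max_instances=10):
    """Pick the next instance count from CPU load alone (thresholds are hypothetical)."""
    if avg_cpu_percent > scale_out_at:
        target = current_instances + 1
    elif avg_cpu_percent < scale_in_at:
        target = current_instances - 1
    else:
        target = current_instances
    return max(min_instances, min(max_instances, target))


def processors_per_instance(shard_count, instance_count):
    """Spread one record processor per shard across instances as evenly as possible."""
    base, extra = divmod(shard_count, instance_count)
    return [base + (1 if i < extra else 0) for i in range(instance_count)]


# Eight shards on three consumers: every shard still gets a processor.
layout = processors_per_instance(8, 3)

# The shard count does not appear in the scaling decision; only CPU does.
busy = desired_instances(current_instances=3, avg_cpu_percent=85.0)
idle = desired_instances(current_instances=3, avg_cpu_percent=20.0)
print(layout, busy, idle)
```

Note that `desired_instances` takes no shard count at all, which mirrors the tip: adding shards adds record processors to the existing consumers, and new instances are only warranted when CPU load rises.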
What is AWS Athena?
- Athena is an interactive query service that makes it easy to analyze data in S3 using SQL.
- This allows you to directly query data in your S3 Bucket without loading it into a database.
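Querying S3 data in place might look like the sketch below, which uses the `start_query_execution` call of the boto3 Athena client. The table name `access_logs`, the database `example_db`, and the bucket `example-bucket` are hypothetical; `run_query` needs boto3 and valid AWS credentials, so only the request-building part is exercised here.

```python
def build_athena_request(sql, database, output_location):
    """Build the keyword arguments for Athena's StartQueryExecution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results to an S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(sql, database, output_location):
    """Submit the query; requires boto3 and AWS credentials (not run here)."""
    import boto3  # imported lazily so the builder works without AWS access

    client = boto3.client("athena")
    response = client.start_query_execution(
        **build_athena_request(sql, database, output_location)
    )
    return response["QueryExecutionId"]


# Hypothetical query over log files that already sit in S3.
request = build_athena_request(
    "SELECT status, COUNT(*) AS hits FROM access_logs GROUP BY status",
    database="example_db",
    output_location="s3://example-bucket/athena-results/",
)
print(request)
```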
What is AWS Glue?
- Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data.
- It allows you to perform ETL workloads without managing underlying servers.
- It replaces EMR … serverless architecture
- Glue structures the data