Big Data Flashcards
Volume
Variety
Velocity
The 3 Vs of Big Data
Fully managed petabyte-scale data warehouse service in the cloud
Redshift
Very large relational database traditionally used in big data applications
Redshift
Based on the PostgreSQL database engine but not used for OLTP workloads
Redshift
Column-based data storage instead of row-based
Redshift
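To make the column-based versus row-based card concrete, here is a minimal sketch in plain Python. It is a toy model only: the sample rows and the helper `to_columns` are hypothetical, and real Redshift storage uses compressed on-disk blocks rather than Python lists.

```python
# Toy illustration of row vs. column layout (hypothetical sample data).
rows = [
    {"order_id": 1, "region": "east", "amount": 20.0},
    {"order_id": 2, "region": "west", "amount": 35.5},
    {"order_id": 3, "region": "east", "amount": 12.5},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: every whole row is visited just to sum one field.
row_total = sum(record["amount"] for record in rows)

# Column layout: only the "amount" column is read.
column_total = sum(columns["amount"])
```

Both totals are the same; the difference is how much data has to be read to compute an aggregate over one column, which is why column storage suits analytic queries.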
Does Redshift support Multi-AZ?
Yes, but a cluster spans only 2 AZs
Does Redshift use Snapshots?
Yes, stored in S3
Can you control the S3 bucket containing snapshots?
No
Query and retrieve data from S3 without having to load the data into Redshift tables
Redshift Spectrum
All copy and unload traffic between your cluster and your data repositories is forced through your VPC
Enhanced VPC Routing
Extract
Transform
Load
ETL
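The three ETL steps can be sketched with the Python standard library alone. This is an illustrative example, not how EMR or Glue implement ETL: the CSV sample, the `payments` table, and the function names are all assumptions, and an in-memory SQLite database stands in for a real warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw input with one malformed row.
RAW_CSV = """name,amount
alice,10.50
bob,not_a_number
carol,4.25
"""


def extract(text):
    """Extract: parse raw CSV text into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(records):
    """Transform: drop rows with a non-numeric amount, tidy the names."""
    cleaned = []
    for record in records:
        try:
            amount = float(record["amount"])
        except ValueError:
            continue  # skip rows that cannot be converted
        cleaned.append((record["name"].title(), amount))
    return cleaned


def load(rows, connection):
    """Load: write the cleaned rows into the target table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS payments (name TEXT, amount REAL)"
    )
    connection.executemany("INSERT INTO payments VALUES (?, ?)", rows)
    connection.commit()


connection = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), connection)
loaded = connection.execute(
    "SELECT name, amount FROM payments ORDER BY name"
).fetchall()
```

Keeping each step as its own function mirrors the card: the source is read once, cleaning rules live in one place, and only valid rows reach the target table.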
AWS service used to help with ETL processing
Elastic MapReduce (EMR)
Scalable file system for Hadoop that distributes stored data across instances.
Hadoop Distributed File System (HDFS)
Used for caching intermediate results during processing
HDFS
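The idea of distributing stored data across instances can be pictured with a small sketch. It is a simplified model under stated assumptions: the round-robin placement, the node names, and the tiny block size are hypothetical, and HDFS's actual block placement policy is more involved.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cut a byte string into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(blocks, nodes, replicas=2):
    """Assign each block index to `replicas` nodes, round-robin.

    Assumption: simple round-robin, not HDFS's real placement policy.
    """
    placement = {node: [] for node in nodes}
    for index in range(len(blocks)):
        for offset in range(replicas):
            node = nodes[(index + offset) % len(nodes)]
            placement[node].append(index)
    return placement


blocks = split_into_blocks(b"abcdefghij", block_size=4)
placement = place_blocks(blocks, ["node-a", "node-b", "node-c"])
```

Each block ends up on two different nodes, so losing one node still leaves a copy of every block elsewhere in the cluster.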
Extends Hadoop to add the ability to directly access data stored in Amazon S3
EMR File System (EMRFS)
Locally connected disk created with each EC2 instance; the volume persists only for the lifecycle of the EC2 instance.
Local File System
Group of EC2 instances (nodes) within Amazon EMR
Cluster
Manages the cluster, coordinates the distribution of data and tasks
Primary Node
Node that runs tasks and stores data in the Hadoop Distributed File System
Core Node
Optional node that only runs tasks and stores no data in HDFS
Task Node
Interactive query service that makes it easy to analyze data in S3 using SQL.
Serverless SQL solution
Athena
Service that directly queries data in an S3 bucket without loading it into a database
Athena
Serverless data integration service that makes it easy to discover, prepare, and combine data.
Glue
Service that allows you to run ETL workloads without managing the underlying servers.
Glue