Big Data Flashcards
Volume
Variety
Velocity
3 Vs of Big Data
Fully managed petabyte-scale data warehouse service in the cloud
Redshift
Very large relational database traditionally used in big data applications
Redshift
Based on the PostgreSQL database engine but not used for OLTP workloads
Redshift
Column-based data storage instead of row-based
Redshift
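Example (column-based vs row-based storage): a toy sketch in plain Python, not Redshift's actual storage engine. The records, the field names, and the helper `to_columns` are made up for illustration. It shows why an aggregate over one field only needs to read that field's values when data is laid out by column.

```python
# Toy illustration only: the same three records, first row-wise, then column-wise.
rows = [
    {"id": 1, "region": "us", "sales": 100},
    {"id": 2, "region": "eu", "sales": 250},
    {"id": 3, "region": "us", "sales": 75},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists (hypothetical helper)."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# An aggregate over one column reads only that column's values,
# instead of touching every field of every row.
total_sales = sum(columns["sales"])
```

In the row layout, summing sales means walking every record; in the column layout, the values for one field sit together, which is the general idea behind analytic (non-OLTP) workloads favoring columnar storage.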
Does Redshift support Multi-AZ?
Yes, but a deployment spans only 2 AZs
Does Redshift use Snapshots?
Yes, stored in S3
Can you control the S3 bucket containing snapshots?
No
Query and retrieve data from S3 without having to load the data into Redshift tables
Redshift Spectrum
All COPY and UNLOAD traffic between your cluster and your data repositories is forced through your VPC
Enhanced VPC Routing
Extract
Transform
Load
ETL
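Example (ETL): a minimal sketch of the three stages using only the Python standard library. The sample CSV, the `payments` table, and the function names are assumptions for illustration; a real pipeline would read from a source system and load into a warehouse rather than an in-memory SQLite database.

```python
import csv
import io
import sqlite3

# Hypothetical input standing in for a real source system.
RAW_CSV = "name,amount\nalice, 10\nbob,not_a_number\ncarol,32\n"


def extract(text):
    """Extract: read raw records from the source (an in-memory CSV here)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(records):
    """Transform: clean values and drop rows whose amount cannot be parsed."""
    cleaned = []
    for record in records:
        try:
            amount = int(record["amount"].strip())
        except ValueError:
            continue  # skip malformed rows
        cleaned.append((record["name"].strip().title(), amount))
    return cleaned


def load(rows, connection):
    """Load: write the cleaned rows into the target table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS payments (name TEXT, amount INTEGER)"
    )
    connection.executemany("INSERT INTO payments VALUES (?, ?)", rows)
    connection.commit()


connection = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), connection)
loaded = connection.execute(
    "SELECT name, amount FROM payments ORDER BY name"
).fetchall()
```

Keeping each stage as its own function makes it easy to swap the source or the target without touching the cleaning rules.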
AWS service used to help with ETL processing
Elastic MapReduce (EMR)
Scalable file system for Hadoop that distributes stored data across instances
Hadoop Distributed File System (HDFS)
Used for caching intermediate results during processing
HDFS
Extends Hadoop to add the ability to directly access data stored in Amazon S3
EMR File System (EMRFS)