Big Data Flashcards
Volume
Variety
Velocity
The 3 Vs of Big Data
Fully managed petabyte-scale data warehouse service in the cloud
Redshift
Very large relational database traditionally used in big data applications
Redshift
Based on the PostgreSQL database engine but not used for OLTP workloads
Redshift
Column-based data storage instead of row-based
Redshift
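To illustrate the column-based card above, here is a minimal pure-Python sketch of the difference between row and column layouts. It is not Redshift code; the sample records and the helper name `to_columns` are assumptions made up for this example.

```python
# Hypothetical sample records; field names and values are invented for illustration.
rows = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 75.5},
    {"order_id": 3, "region": "EU", "amount": 30.0},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: every whole record is visited just to read one field.
row_total = sum(record["amount"] for record in rows)

# Column layout: only the "amount" values are read, which is why
# columnar stores suit analytic queries that aggregate a few columns.
column_total = sum(columns["amount"])
```

Both totals are the same; the point is how much data each layout has to touch to compute them.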
Does Redshift support Multi-AZ?
Yes, but a Multi-AZ deployment spans only 2 Availability Zones
Does Redshift use Snapshots?
Yes, snapshots are stored in Amazon S3
Can you control the S3 bucket containing snapshots?
No
Query and retrieve data from S3 without having to load the data into Redshift tables
Redshift Spectrum
All COPY and UNLOAD traffic between your cluster and your data repositories is forced through your VPC
Enhanced VPC Routing
Extract
Transform
Load
ETL
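The three ETL steps can be sketched end to end with the Python standard library. This is only an illustration under stated assumptions: the inline CSV, the `sales` table, and the function names are hypothetical, and an in-memory SQLite database stands in for a real warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; one row has a missing amount on purpose.
RAW_CSV = "name,amount\nalice, 10\nbob,\ncarol,32\n"


def extract(text):
    """Extract: parse raw CSV text into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(records):
    """Transform: drop rows without an amount, normalize names, cast amounts."""
    cleaned = []
    for record in records:
        amount = record["amount"].strip()
        if not amount:
            continue  # skip incomplete rows
        cleaned.append((record["name"].strip().title(), int(amount)))
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (name TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()


def run_etl(text, conn):
    """Run the three steps in order."""
    load(transform(extract(text)), conn)


conn = sqlite3.connect(":memory:")
run_etl(RAW_CSV, conn)
```

Keeping each step in its own function makes it possible to test or replace one stage without touching the others.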
AWS service used to help with ETL processing
Amazon EMR (Elastic MapReduce)
Scalable file system for Hadoop that distributes stored data across the instances in a cluster
Hadoop Distributed File System (HDFS)
Used for caching intermediate results during processing
HDFS
Extends Hadoop to add the ability to directly access data stored in Amazon S3
EMR File System (EMRFS)
Locally connected disk created with each EC2 instance; its data persists only for the lifecycle of that EC2 instance
Local File System
Groups of EC2 instances (nodes) within Amazon EMR
Cluster
Manages the cluster, coordinates the distribution of data and tasks
Primary Node
Node that runs tasks and stores data in the Hadoop Distributed File System
Core Node
Optional node that only runs tasks and does not store data in HDFS
Task Node
Interactive query service that makes it easy to analyze data in S3 using SQL
Serverless SQL solution
Athena
Service that directly queries data in an S3 bucket without loading it into a database
Athena
Serverless data integration service that makes it easy to discover, prepare, and combine data.
Glue
Service that allows you to perform ETL workloads without managing the underlying servers
Glue
Fully managed serverless business intelligence (BI) data visualization service
Amazon QuickSight
Useful for business data visualizations, ad-hoc data analytics, and obtaining important data-based business insights
QuickSight
In-memory engine used to perform advanced calculations within QuickSight
SPICE
Managed ETL service for automating movement and transformation of your data
Data Pipeline
Data-driven workflows. Steps are dependent on previous tasks completing successfully
Data Driven
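The idea that a step runs only when the steps it depends on have succeeded can be sketched in plain Python. This is not the AWS Data Pipeline API; `run_pipeline`, the step names, and the status strings are assumptions chosen for illustration (requires Python 3.9+ for `graphlib`).

```python
from graphlib import TopologicalSorter


def run_pipeline(steps, dependencies):
    """Run steps in dependency order.

    steps: maps a step name to a callable returning True on success.
    dependencies: maps a step name to the set of steps it depends on.
    A step is skipped unless all of its dependencies succeeded.
    """
    status = {}
    for name in TopologicalSorter(dependencies).static_order():
        upstream = dependencies.get(name, ())
        if all(status.get(dep) == "succeeded" for dep in upstream):
            status[name] = "succeeded" if steps[name]() else "failed"
        else:
            status[name] = "skipped"
    return status


# Hypothetical three-step workflow in which the middle step fails.
steps = {
    "extract": lambda: True,
    "transform": lambda: False,
    "load": lambda: True,
}
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}
result = run_pipeline(steps, dependencies)
```

Here `load` never runs because `transform` failed, which is the behavior the card describes.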
Define parameters for data transformations. AWS Data Pipeline enforces chosen logic
Parameters
Specify the business logic of your data management needs
Pipeline Definition
Service will create EC2 instances to perform your activities
Managed Compute
Poll for different tasks and perform them when found
Task Runners
Define the locations and types of data that will be input and output
Data Nodes
Pipeline components that define the work to perform
Activities
Processing data in EMR using Hadoop streaming
Importing or exporting DynamoDB data
Copying CSV files or data between S3 buckets
Exporting RDS data to S3
Copying data to Redshift
Data Pipeline
Fully managed service for running data streaming applications that leverage Apache Kafka
Amazon MSK
Used as a managed analytics and visualization service. It is suitable for creating a logging solution involving visualization of log file analytics or BI reports.
Amazon OpenSearch Service
Managed service allowing you to run search and analytics engines for various use cases.
Amazon OpenSearch Service