Redshift Flashcards
What is Amazon Redshift?
A fully managed, petabyte-scale data warehouse service in the cloud.
What are some key benefits of using Amazon Redshift?
- High performance: Columnar storage, compression, and massively parallel processing deliver fast analytical queries (AWS cites up to 10x faster than traditional data warehouses).
- Cost-effective: Pay only for what you use.
- Scalable: Easily scale your cluster up or down to meet changing needs.
- Secure: Data is encrypted at rest and in transit.
- Durable: Data is replicated within the cluster and backed up to S3.
What is Redshift Spectrum?
Allows you to query exabytes of data stored in S3 without loading it into Redshift.
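Example (a hedged sketch; the Glue database, table, bucket contents, and role ARN are hypothetical):
  CREATE EXTERNAL SCHEMA spectrum_demo
  FROM DATA CATALOG
  DATABASE 'my_glue_db'
  IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
  CREATE EXTERNAL DATABASE IF NOT EXISTS;

  -- Query files in S3 as if they were a local table
  SELECT event_date, COUNT(*) AS events
  FROM spectrum_demo.clickstream
  GROUP BY event_date;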
How does Redshift achieve high performance?
Utilizes Massively Parallel Processing (MPP) to distribute queries across multiple nodes.
What are the different Redshift distribution styles?
- AUTO: Redshift automatically chooses the best distribution style.
- EVEN: Data is distributed evenly across all nodes.
- KEY: Data is distributed based on a chosen key column.
- ALL: The entire table is copied to all nodes.
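Example (a hedged sketch of the KEY and ALL styles at table creation; table and column names are hypothetical):
  -- Large fact table: co-locate rows that join on customer_id
  CREATE TABLE sales (
    sale_id     BIGINT,
    customer_id BIGINT,
    sale_date   DATE,
    amount      DECIMAL(12,2)
  )
  DISTSTYLE KEY
  DISTKEY (customer_id)
  SORTKEY (sale_date);

  -- Small, frequently joined dimension table: copy to every node
  CREATE TABLE calendar (
    cal_date   DATE,
    is_holiday BOOLEAN
  )
  DISTSTYLE ALL;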
What is the purpose of the COPY command?
Efficiently load large amounts of data from external sources into Redshift.
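Example (a minimal sketch loading CSV files from S3; the bucket, table, role ARN, and region are hypothetical):
  COPY sales
  FROM 's3://my-bucket/sales/2024/'
  IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
  FORMAT AS CSV
  IGNOREHEADER 1
  REGION 'us-east-1';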
What is the function of the UNLOAD command?
Exports the results of a query from Redshift to files in S3.
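Example (a hedged sketch; the query, bucket, and role are hypothetical, and single quotes inside the query text are doubled):
  UNLOAD ('SELECT * FROM sales WHERE sale_date >= ''2024-01-01''')
  TO 's3://my-bucket/exports/sales_'
  IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
  GZIP
  PARALLEL ON;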
What is Redshift Workload Management (WLM)?
Prioritizes different types of queries and manages resources to optimize performance.
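One way queries land in a specific WLM queue is by setting a query group that matches a queue's configuration. A hedged sketch, assuming a queue has been configured for the query group 'dashboard':
  SET query_group TO 'dashboard';  -- route subsequent queries to the matching WLM queue
  SELECT region, SUM(amount) FROM sales GROUP BY region;
  RESET query_group;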
What is Concurrency Scaling?
Automatically adds cluster capacity to handle spikes in concurrent read queries.
What are the two types of cluster resizing in Redshift?
- Elastic Resize: Quick resizing with minimal downtime.
- Classic Resize: More time-consuming, but allows changing node types.
What is the purpose of the VACUUM command?
Reclaims disk space by removing deleted rows and sorting data.
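Typical usage, with a hypothetical table name:
  VACUUM FULL sales;        -- reclaim space from deleted rows and re-sort
  VACUUM DELETE ONLY sales; -- reclaim space without re-sorting
  VACUUM SORT ONLY sales;   -- re-sort without reclaiming space
  ANALYZE sales;            -- refresh planner statistics afterwards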
What are RA3 nodes?
A generation of Redshift nodes built on Redshift Managed Storage (RMS), which lets compute and storage scale independently.
What is Redshift Data Lake Export?
Enables unloading data from Redshift to S3 in Parquet format for efficient data lake integration.
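Example (a hedged sketch exporting a table to partitioned Parquet files; names and ARN are hypothetical):
  UNLOAD ('SELECT * FROM sales')
  TO 's3://my-bucket/datalake/sales/'
  IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
  FORMAT AS PARQUET
  PARTITION BY (sale_date);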
What are Materialized Views?
Pre-compute and store query results for faster performance on complex queries.
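Example (a minimal sketch; table and view names are hypothetical):
  CREATE MATERIALIZED VIEW mv_daily_sales AS
  SELECT sale_date, SUM(amount) AS total_amount
  FROM sales
  GROUP BY sale_date;

  -- Bring the pre-computed results up to date after base-table changes
  REFRESH MATERIALIZED VIEW mv_daily_sales;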
What is Redshift Data Sharing?
Allows you to securely share live data across Redshift clusters without copying or moving data.
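Example (a hedged sketch of the producer/consumer flow; the share, schema, table, and namespace IDs are placeholders):
  -- On the producer cluster
  CREATE DATASHARE sales_share;
  ALTER DATASHARE sales_share ADD SCHEMA public;
  ALTER DATASHARE sales_share ADD TABLE public.sales;
  GRANT USAGE ON DATASHARE sales_share TO NAMESPACE 'consumer-namespace-guid';

  -- On the consumer cluster
  CREATE DATABASE sales_from_share FROM DATASHARE sales_share OF NAMESPACE 'producer-namespace-guid';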
What are Redshift Lambda UDFs?
Scalar user-defined functions that invoke AWS Lambda functions from SQL, letting you integrate custom code written in any language Lambda supports.
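Example (a hedged sketch of registering and calling a Lambda-backed function; the function name, Lambda name, and role ARN are hypothetical):
  CREATE EXTERNAL FUNCTION f_mask_email(VARCHAR)
  RETURNS VARCHAR
  VOLATILE
  LAMBDA 'my-email-masking-fn'
  IAM_ROLE 'arn:aws:iam::123456789012:role/MyLambdaInvokeRole';

  SELECT f_mask_email(email) FROM customers LIMIT 10;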
What are Redshift Federated Queries?
Query and analyze data across databases, data warehouses, and data lakes without ETL.
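Example (a hedged sketch attaching an RDS PostgreSQL database as an external schema; the endpoint, names, and ARNs are hypothetical):
  CREATE EXTERNAL SCHEMA postgres_fed
  FROM POSTGRES
  DATABASE 'appdb' SCHEMA 'public'
  URI 'my-rds-instance.abc123.us-east-1.rds.amazonaws.com' PORT 5432
  IAM_ROLE 'arn:aws:iam::123456789012:role/MyFederatedRole'
  SECRET_ARN 'arn:aws:secretsmanager:us-east-1:123456789012:secret:rds-credentials';

  -- Query live operational data without ETL
  SELECT COUNT(*) FROM postgres_fed.orders WHERE order_date = CURRENT_DATE;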
What is the Redshift Data API?
Lets you run SQL statements against Redshift over HTTPS without managing database connections or drivers.
What is Redshift Serverless?
Automatically provisions and scales compute capacity based on workload demands, eliminating the need to manage infrastructure.
What are Redshift Processing Units (RPUs)?
A measure of compute capacity in Redshift Serverless.
What are some of the Redshift system tables and views?
- SVV_EXTERNAL_SCHEMAS: Information about external schemas.
- STL_QUERYTEXT: Text of queries executed on the system.
- SYS_QUERY_HISTORY: History of queries run on the system.
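Example (a hedged query against SYS_QUERY_HISTORY to inspect recent activity; column names may vary slightly by version):
  SELECT query_id, status, elapsed_time, query_text
  FROM sys_query_history
  ORDER BY start_time DESC
  LIMIT 10;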
What is DBLINK?
A PostgreSQL extension that connects a PostgreSQL database (e.g., in RDS) to Redshift so data can be copied and kept in sync between the two.
What is the benefit of Amazon Aurora zero-ETL integration with Redshift?
Enables automatic data replication from Aurora to Redshift without needing ETL processes.
What are some Redshift anti-patterns?
- Using Redshift for small datasets (better suited for RDS).
- Using Redshift for OLTP workloads (consider RDS or DynamoDB).
- Storing unstructured data directly in Redshift (use ETL and/or Redshift Spectrum).
- Storing BLOB data directly in Redshift (store references to files in S3 instead).
How can you enhance security in Redshift?
- Use a Hardware Security Module (HSM) for encryption key management.
- Define granular access privileges for users and groups using GRANT and REVOKE commands.
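Example (a minimal sketch of granular privileges with GRANT and REVOKE; the group, schema, and table names are hypothetical):
  CREATE GROUP analysts;
  GRANT USAGE ON SCHEMA sales_schema TO GROUP analysts;
  GRANT SELECT ON ALL TABLES IN SCHEMA sales_schema TO GROUP analysts;
  REVOKE ALL ON TABLE sales_schema.salaries FROM GROUP analysts;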
What are the different types of data shares in Redshift?
- Standard data shares: For sharing data across Redshift clusters.
- AWS Data Exchange data shares: For sharing data with subscribers.
- AWS Lake Formation-managed data shares: For sharing data with fine-grained access control.
What are some use cases for the Redshift Data API?
- Application integration: Build applications that interact with Redshift data.
- ETL orchestration: Use with AWS Step Functions to create serverless data processing workflows.
- Event-driven ETL: Trigger ETL jobs based on events using Amazon EventBridge.
- Access from SageMaker notebooks: Run queries and analyze data in Jupyter notebooks.
What are some limitations of Redshift Serverless?
- No support for parameter groups.
- No manual workload management (WLM) configuration.
- Limited AWS Partner integration.
- No public endpoints (must be accessed within a VPC).
How can you monitor Redshift Serverless?
- Monitoring views: SYS_QUERY_HISTORY, SYS_LOAD_HISTORY, SYS_SERVERLESS_USAGE.
- CloudWatch logs: Connection and user logs, optional user activity logs.
- CloudWatch metrics: QueriesCompletedPerSecond, QueryDuration, QueriesRunning.
What are the different ways to load data into Redshift?
- COPY command: For loading large datasets from external sources (S3, EMR, DynamoDB, etc.)
- INSERT INTO … SELECT: For loading data from existing tables within Redshift.
- CREATE TABLE AS: For creating a new table based on the results of a query.
- Streaming ingestion: For continuously loading data from Kinesis Data Streams or Amazon MSK.
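Besides COPY (shown earlier), the SQL-based options look roughly like this (table names are hypothetical):
  -- INSERT INTO ... SELECT: load from an existing table
  INSERT INTO sales_archive
  SELECT * FROM sales WHERE sale_date < '2023-01-01';

  -- CREATE TABLE AS: create a new table from a query result
  CREATE TABLE customer_totals AS
  SELECT customer_id, SUM(amount) AS total_spent
  FROM sales
  GROUP BY customer_id;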
What are some best practices for using the COPY command?
- Use a manifest file when loading from S3.
- Leverage IAM roles for secure access to data sources.
- Consider compressing input files (gzip, lzop, or bzip2) to speed up data loading.
- Use automatic compression (the COMPUPDATE option) to let Redshift determine the optimal column encodings.
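Example (a hedged sketch combining a manifest file, gzip-compressed input, and automatic compression analysis; bucket, table, and role are hypothetical):
  COPY sales
  FROM 's3://my-bucket/manifests/sales.manifest'
  IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
  MANIFEST
  GZIP
  FORMAT AS CSV
  COMPUPDATE ON;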
What is a Redshift snapshot?
A point-in-time backup of your Redshift cluster that can be used for disaster recovery or to create a new cluster.
How can you improve the performance of queries in Redshift?
- Use the appropriate distribution style for your tables.
- Choose the right sort keys for your tables.
- Use materialized views for frequently accessed data.
- Optimize your queries by avoiding unnecessary joins and scans.
- Utilize Redshift Workload Management (WLM) to prioritize queries.
What are some common Redshift performance metrics to monitor?
- CPU utilization
- Disk space usage
- Query runtime
- Throughput (queries per second)
- Concurrency (number of active queries)
What are some tools and services that can be integrated with Redshift?
- AWS S3: For data storage and loading.
- Amazon EMR: For data preprocessing and ETL.
- AWS Data Pipeline: For orchestrating data movement and transformation.
- AWS Database Migration Service (DMS): For migrating data to Redshift.
- Amazon QuickSight: For data visualization and business intelligence.
What is the difference between Redshift and Amazon Aurora?
- Redshift is a data warehouse optimized for analytical workloads.
- Aurora is a relational database optimized for transactional workloads.
What are some security best practices for Redshift?
- Encrypt your cluster using AWS KMS.
- Control network access using security groups and VPCs.
- Implement least privilege access control using IAM roles and policies.
- Regularly audit user activity and monitor for suspicious behavior.
What are some cost optimization strategies for Redshift?
- Choose the right node type and cluster size for your workload.
- Utilize Redshift Spectrum to query data directly in S3.
- Leverage reserved instances for predictable workloads.
- Monitor your cluster usage and identify opportunities for optimization.
- Consider using Redshift Serverless for variable workloads.
What are some resources for learning more about Redshift?
- AWS documentation
- AWS Redshift blogs and whitepapers
- AWS training courses and certifications
- AWS community forums and support