Redshift Flashcards

1
Q
A
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

What is Amazon Redshift?

A

A fully managed, petabyte-scale data warehouse service in the cloud.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

What are some key benefits of using Amazon Redshift?

A
  • High performance: Up to 10x faster than other data warehouse solutions.
  • Cost-effective: Pay only for what you use.
  • Scalable: Easily scale your cluster up or down to meet changing needs.
  • Secure: Data is encrypted at rest and in transit.
  • Durable: Data is replicated within the cluster and backed up to S3.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

What is Redshift Spectrum?

A

Allows you to query exabytes of data stored in S3 without loading it into Redshift.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

How does Redshift achieve high performance?

A

Utilizes Massively Parallel Processing (MPP) to distribute queries across multiple nodes.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

What are the different Redshift distribution styles?

A
  • AUTO: Redshift automatically chooses the best distribution style.
  • EVEN: Data is distributed evenly across all nodes.
  • KEY: Data is distributed based on a chosen key column.
  • ALL: The entire table is copied to all nodes.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

What is the purpose of the COPY command?

A

Efficiently load large amounts of data from external sources into Redshift.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

What is the function of the UNLOAD command?

A

Unload data from Redshift tables to files in S3.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

What is Redshift Workload Management (WLM)?

A

Prioritizes different types of queries and manages resources to optimize performance.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

What is Concurrency Scaling?

A

Automatically adds cluster capacity to handle spikes in concurrent read queries.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

Automatically adds cluster capacity to handle spikes in concurrent read queries.

A

Automatically adds cluster capacity to handle spikes in concurrent read queries.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

What are the two types of cluster resizing in Redshift?

A

Elastic Resize: Quick resizing with minimal downtime.
Classic Resize: More time-consuming but allows for changing node types.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

What is the purpose of the VACUUM command?

A

Reclaims disk space by removing deleted rows and sorting data.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

What are RA3 nodes?

A

A new generation of Redshift nodes that allow for independent scaling of compute and storage.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

What is Redshift Data Lake Export?

A

Enables unloading data from Redshift to S3 in Parquet format for efficient data lake integration.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

What are Materialized Views?

A

Pre-compute and store query results for faster performance on complex queries.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
17
Q

What is Redshift Data Sharing?

A

Allows you to securely share live data across Redshift clusters without copying or moving data.

18
Q

What are Redshift Lambda UDFs?

A

Integrate custom code written in any language with your SQL queries.

19
Q

What are Redshift Federated Queries?

A

Query and analyze data across databases, data warehouses, and data lakes without ETL.

20
Q

What is the Redshift Data API?

A

Execute SQL statements against your Redshift cluster via HTTPS.

21
Q

What is Redshift Serverless?

A

Automatically provisions and scales compute capacity based on workload demands, eliminating the need to manage infrastructure.

22
Q

What are Redshift Processing Units (RPUs)?

A

A measure of compute capacity in Redshift Serverless.

23
Q

What are some of the Redshift system tables and views?

A
  • SVV_EXTERNAL_SCHEMAS: Information about external schemas.
  • STL_QUERYTEXT: Text of queries executed on the system.
  • SYS_QUERY_HISTORY: History of queries run on the system.
24
Q

What is DBLINK?

A

Allows you to connect Redshift to a PostgreSQL database (e.g., in RDS) for data copying and synchronization.

25
Q

What is the benefit of Amazon Aurora zero-ETL integration with Redshift?

A

Enables automatic data replication from Aurora to Redshift without needing ETL processes.

26
Q

What are some Redshift anti-patterns?

A
  • Using Redshift for small datasets (better suited for RDS).
  • Using Redshift for OLTP workloads (consider RDS or DynamoDB).
  • Storing unstructured data directly in Redshift (use ETL and/or Redshift Spectrum).
  • Storing BLOB data directly in Redshift (store references to files in S3 instead).
27
Q

How can you enhance security in Redshift?

A
  • Use a Hardware Security Module (HSM) for encryption key management.
  • Define granular access privileges for users and groups using GRANT and REVOKE commands.
28
Q

What are the different types of data shares in Redshift?

A
  • Standard data shares: For sharing data across Redshift clusters.
  • AWS Data Exchange data shares: For sharing data with subscribers.
  • AWS Lake Formation-managed data shares: For sharing data with fine-grained access control.
29
Q

What are some use cases for the Redshift Data API?

A
  • Application integration: Build applications that interact with Redshift data.
  • ETL orchestration: Use with AWS Step Functions to create serverless data processing workflows.
  • Event-driven ETL: Trigger ETL jobs based on events using Amazon EventBridge.
  • Access from SageMaker notebooks: Run queries and analyze data in Jupyter notebooks.
30
Q

What are some limitations of Redshift Serverless?

A
  • No support for parameter groups.
  • No workload management capabilities.
  • Limited AWS Partner integration.
  • No public endpoints (must be accessed within a VPC).
31
Q

How can you monitor Redshift Serverless?

A
  • Monitoring views: SYS_QUERY_HISTORY, SYS_LOAD_HISTORY, SYS_SERVERLESS_USAGE.
  • CloudWatch logs: Connection and user logs, optional user activity logs.
  • CloudWatch metrics: QueriesCompletedPerSecond, QueryDuration, QueriesRunning.
32
Q

What are the different ways to load data into Redshift?

A
  • COPY command: For loading large datasets from external sources (S3, EMR, DynamoDB, etc.)
  • INSERT INTO … SELECT: For loading data from existing tables within Redshift.
  • CREATE TABLE AS: For creating a new table based on the results of a query.
  • Streaming ingestion: For continuously loading data from Kinesis Data Streams or Amazon MSK.
33
Q

What are some best practices for using the COPY command?

A
  • Use a manifest file when loading from S3.
  • Leverage IAM roles for secure access to data sources.
  • Consider using compression (Gzip, Lzop, bzip2) to speed up data loading.
  • Use the AUTOMATIC COMPRESSION option to let Redshift determine the optimal compression scheme.
34
Q

What is a Redshift snapshot?

A

A point-in-time backup of your Redshift cluster that can be used for disaster recovery or to create a new cluster.

35
Q

How can you improve the performance of queries in Redshift?

A
  • Use the appropriate distribution style for your tables.
  • Choose the right sort keys for your tables.
  • Use materialized views for frequently accessed data.
  • Optimize your queries by avoiding unnecessary joins and scans.
  • Utilize Redshift Workload Management (WLM) to prioritize queries.
36
Q

What are some common Redshift performance metrics to monitor?

A
  • CPU utilization
  • Disk space usage
  • Query runtime
  • Throughput (queries per second)
  • Concurrency (number of active queries)
37
Q

What are some tools and services that can be integrated with Redshift?

A
  • AWS S3: For data storage and loading.
  • Amazon EMR: For data preprocessing and ETL.
  • AWS Data Pipeline: For orchestrating data movement and transformation.
  • AWS Database Migration Service (DMS): For migrating data to Redshift.
  • Amazon QuickSight: For data visualization and business intelligence.
38
Q

What is the difference between Redshift and Amazon Aurora?

A
  • Redshift is a data warehouse optimized for analytical workloads.
  • Aurora is a relational database optimized for transactional workloads.
39
Q

What are some security best practices for Redshift?

A
  • Encrypt your cluster using AWS KMS.
  • Control network access using security groups and VPCs.
  • Implement least privilege access control using IAM roles and policies.
  • Regularly audit user activity and monitor for suspicious behavior.
40
Q

What are some cost optimization strategies for Redshift?

A
  • Choose the right node type and cluster size for your workload.
  • Utilize Redshift Spectrum to query data directly in S3.
  • Leverage reserved instances for predictable workloads.
  • Monitor your cluster usage and identify opportunities for optimization.
  • Consider using Redshift Serverless for variable workloads.
41
Q

What are some resources for learning more about Redshift?

A
  • AWS documentation
  • AWS Redshift blogs and whitepapers
  • AWS training courses and certifications
  • AWS community forums and support