Database Specialty - Redshift Architecture Flashcards
1
Q
Redshift Architecture
A
- Massively parallel columnar database, runs within a VPC
- Single leader node and multiple compute nodes
- You can connect to Redshift from any application that supports the JDBC or ODBC driver for PostgreSQL
- Clients query the leader node via its SQL endpoint
- The leader node distributes query execution across the compute nodes
- Each compute node is divided into slices, which process their portion of the work in parallel
- Leader node then aggregates the results and returns them to the client
2
Q
Redshift node types
A
- Dense compute nodes (DC2)
- For compute-intensive DW workloads with local SSD storage
- Dense storage nodes (DS2)
- For large DWs, uses hard disk drives (HDDs)
- RA3 nodes with managed storage
- For large DWs; uses large local SSDs as a cache, backed by S3-based managed storage
- Recommended over DS2
- Automatically offloads data to S3 if it grows beyond the local SSD capacity
- Compute and managed storage are billed independently
3
Q
Loading data into Redshift
A
- Typically, data from OLTP systems is loaded into Redshift for analytics and BI purposes
- Data from OLTP systems can be loaded into S3 and data from S3 can then be loaded into Redshift
- Data from Kinesis Firehose can also be loaded in the same way
- COPY command
- Loads data from files stored in S3 into Redshift
- Data is stored locally in the Redshift cluster (persistent storage = cost)
- DynamoDB table data and EMR data can also be loaded using COPY command
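- For example, a minimal COPY sketch loading a DynamoDB table (table and role names are hypothetical; READRATIO caps the share of the table's provisioned throughput the load may use):
copy favoritemovies from 'dynamodb://ProductCatalog'
iam_role 'arn:aws:iam::0123456789:role/MyRedshiftRole'
readratio 50;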
4
Q
Loading data from S3 with COPY command
copy users from 's3://my_bucket/tickit/allusers_pipe.txt'
credentials 'aws_iam_role=arn:aws:iam::0123456789:role/MyRedshiftRole'
delimiter '|' region 'us-west-2';
A
- Create an IAM Role
- Create your Redshift cluster
- Attach the IAM role to the cluster
- The cluster can then temporarily assume the IAM role on your behalf
- Load data from S3 using COPY command
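- Note: the iam_role parameter can be used in place of the credentials string; the same load rewritten (same sample role):
copy users from 's3://my_bucket/tickit/allusers_pipe.txt'
iam_role 'arn:aws:iam::0123456789:role/MyRedshiftRole'
delimiter '|' region 'us-west-2';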
5
Q
More ways to load data into Redshift
A
- Use AWS Glue – fully managed ETL service
- ETL = Extract, Transform, and Load
- Use ETL tools from APN partners
- Use Data Pipeline
- For migration from on-premises, use:
- AWS Import/Export service (AWS Snowball)
- AWS Direct Connect (private connection between your datacenter and AWS)
6
Q
Querying external data with Redshift
A
- Two ways
- Redshift Spectrum
- Redshift Federated Query
7
Q
Redshift Spectrum
A
- Query exabytes of data from S3 without loading it into Redshift
- Must have a Redshift cluster available to start the query
- The query is then submitted to thousands of Redshift Spectrum nodes
- External table structure/schemas can be created in external data catalog like Athena / Glue / Apache Hive metastore (EMR)
- These external tables are read-only (insert / update / delete operations not possible)
- Redshift cluster and S3 bucket must be in the same region
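- A minimal Spectrum sketch (database, role, and bucket names are hypothetical): define an external schema against the Glue Data Catalog, then an external table over files in S3:
create external schema spectrum_schema
from data catalog
database 'spectrum_db'
iam_role 'arn:aws:iam::0123456789:role/MySpectrumRole'
create external database if not exists;

create external table spectrum_schema.sales (
  sale_id integer,
  amount decimal(8,2)
)
row format delimited fields terminated by '|'
stored as textfile
location 's3://my_bucket/tickit/spectrum/sales/';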
8
Q
Redshift Federated Query
A
- Query and analyze data across different DBs, DWs, and data lakes
- Currently works with Redshift, PostgreSQL on RDS, Aurora PostgreSQL, and S3
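- A minimal federated sketch (endpoint, role, and secret ARNs are hypothetical): map a schema of an Aurora PostgreSQL database into Redshift:
create external schema apg
from postgres
database 'orders_db'
uri 'my-aurora-cluster.cluster-abc123.us-west-2.rds.amazonaws.com'
iam_role 'arn:aws:iam::0123456789:role/MyFederatedRole'
secret_arn 'arn:aws:secretsmanager:us-west-2:0123456789:secret:apg-creds';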
9
Q
Star Schema
A
- Typical way of organizing data in a data warehouse
- Two types of tables – fact tables and dimension tables
- A star pattern consists of a fact table and multiple dimension tables
- Fact table has foreign key relationships with multiple dimension tables
- Dimension tables are generally small (fewer records, but often with many fields)
- Recommended to store:
- smaller dimension tables as local Redshift tables
- larger fact tables as external Spectrum tables
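- A minimal star-pattern sketch (table and column names are hypothetical): a small local dimension table and a fact table that references it:
create table dim_date (
  date_id integer primary key,
  calendar_date date
);

create table fact_sales (
  sale_id integer,
  date_id integer references dim_date (date_id),
  amount decimal(8,2)
);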
10
Q
Redshift Key Constraints
A
- Redshift does not support indexes (clustered or non-clustered)
- Has a concept of sort keys
- data is stored on disk in a sorted order by sort key
- Primary Key, Unique key, and Foreign key constraints are not enforced (informational only)
- Query optimizer uses these constraints to generate more efficient query plans
- You can enforce PK/FK relationships through your application
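- A quick sketch showing the unenforced PK (hypothetical table): inserting a duplicate key succeeds:
create table dim_dept (
  dept_id integer primary key,
  dept_name varchar(50)
);
insert into dim_dept values (1, 'Sales');
insert into dim_dept values (1, 'Sales');  -- accepted: the PK is informational only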
11
Q
Redshift Table Design
A
- Key factors for efficient table design
- Data distribution – how data is distributed across nodes
- Sort strategies – how data is sorted in the tables
- Compression – to reduce storage and I/O needs
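- A sketch combining all three choices in one DDL (names and the encoding/distribution choices are illustrative, not prescriptive):
create table sales (
  sale_id integer encode az64,
  customer_id integer,
  sale_date date
)
diststyle key
distkey (customer_id)
sortkey (sale_date);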
12
Q
Redshift Sort Styles
A
- Single-column sort key (e.g. Dept)
- Compound sort key
- more than one column as the sort key
- e.g. Dept + Location
- Hierarchical (the order of the columns in the sort key matters)
- Interleaved sort key
- gives equal weight to each column (or subset of columns) in the sort key
- In effect, you can have multiple sort key combinations
- e.g. Dept + Location, Location + Dept
- Must be defined at table creation time
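- A sketch contrasting the two styles (hypothetical tables):
create table events_compound (dept varchar(20), location varchar(20))
compound sortkey (dept, location);

create table events_interleaved (dept varchar(20), location varchar(20))
interleaved sortkey (dept, location);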
13
Q
VACUUM operation on tables
A
- As you delete rows from or add rows to a sorted table, performance might deteriorate over time
- VACUUM operation re-sorts rows in one or all DB tables
- And reclaims space from table rows marked for deletion
- Redshift automatically sorts data and runs VACUUM DELETE in the background
- You need not run it manually, but you can if required
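- For reference, the manual forms (table name is hypothetical):
vacuum full sales;         -- re-sorts rows and reclaims deleted space (default)
vacuum sort only sales;    -- re-sorts rows without reclaiming space
vacuum delete only sales;  -- reclaims space without re-sorting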
14
Q
Redshift Compression
A
- Column-level operation to reduce size of stored data
- Reduces disk I/O and improves query performance
- Compression type = encoding
- Each column can be separately compressed with different encodings (manually / automatically)
- COPY command applies compression automatically by default (when loading into an empty table)
- RAW = No compression (default for sort keys)
- LZO = Very high compression with good performance (default encoding)
- Encoding cannot be changed after table creation
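- A sketch of per-column encodings, plus asking Redshift for recommendations (table is hypothetical):
create table product (
  product_id integer encode raw,
  product_name varchar(100) encode lzo
);
analyze compression product;  -- reports a suggested encoding per column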
15
Q
Redshift Workload Management (WLM)
A
- Helps you prioritize workloads
- Can prevent long-running queries from impacting short-running ones
- Two modes – Automatic WLM and Manual WLM
- Automatic WLM supports queue priorities
- SQA (Short query acceleration)
- prioritizes selected short-running queries without creating a dedicated queue
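- With manual WLM, a session can route queries to a specific queue via its query group (the group name is hypothetical and must match a query group defined on a manual WLM queue):
set query_group to 'short_queries';
-- subsequent queries in this session run in the matching manual queue
reset query_group;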