Domain - 4 - Analytics Flashcards
Redshift Spectrum
Redshift Spectrum is a feature of Amazon Redshift that enables you to run queries against exabytes of unstructured data in Amazon S3, with no loading or ETL required.
Redshift Distribution Style
Redshift has 3 distribution styles:
1) EVEN (default) - rows are distributed across the slices in round-robin fashion, regardless of the values in any particular column
When to use?
- No joins; reduced parallelism is acceptable; KEY and ALL are not a clear choice
2) KEY - rows are distributed according to the values in one designated column; matching rows are colocated on the same slice, which makes table joins fast (no cross-node traffic)
When to use?
- Tables used in joins
- Large fact tables in a star schema
3) ALL - a full copy of the entire table is distributed to every node; more storage needed
When to use?
- Table data that does not change
- Reasonably sized tables (a few million rows)
- No common distribution key
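All three styles are declared in the table DDL. A minimal sketch, using hypothetical table and column names:

```sql
-- EVEN (the default): round-robin across slices
CREATE TABLE click_events (event_id BIGINT, url VARCHAR(256))
DISTSTYLE EVEN;

-- KEY: rows with the same customer_id land on the same slice
CREATE TABLE sales_fact (sale_id BIGINT, customer_id INT, amount DECIMAL(10,2))
DISTSTYLE KEY DISTKEY (customer_id);

-- ALL: a full copy of the table on every node
CREATE TABLE region_dim (region_id INT, region_name VARCHAR(64))
DISTSTYLE ALL;
```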
Redshift Sort Key
Amazon Redshift stores your data on disk in sorted order according to the sort key. The Amazon Redshift query optimizer uses the sort order when it determines optimal query plans.
Redshift - Different Sort Keys
Compound sort key - the table is sorted by the columns listed in the sort key definition, weighted by their order in the definition
Interleaved sort key - equal weight is given to each column in the sort key
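Both variants are declared in the table DDL. A sketch with hypothetical table and column names:

```sql
-- Compound: sorted by event_time first, then user_id within it
CREATE TABLE events_c (event_time TIMESTAMP, user_id INT)
COMPOUND SORTKEY (event_time, user_id);

-- Interleaved: equal weight given to both columns
CREATE TABLE events_i (event_time TIMESTAMP, user_id INT)
INTERLEAVED SORTKEY (event_time, user_id);
```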
Redshift - Data Types
There are 12 data types, e.g.
- CHAR, CHARACTER, NCHAR: max 4096 bytes
- VARCHAR: max 65535 bytes (64K-1)
Redshift - WLM
Workload Management (WLM) is an ordered set of query queues that define how resources are allocated and how queries are routed for processing. The WLM definition is part of the Parameter Group for a Redshift cluster.
Redshift - Dynamic and Static WLM Settings
Dynamic WLM settings can be applied immediately to any cluster associated with the Parameter Group; e.g. Memory
Static WLM settings require reboot for changes to be applied; e.g. User Group, Query Group
Redshift - Query Queue Settings
With each queue, you can define:
- Concurrency - number of queries that can run concurrently; defaults to 5
- User Group - queries run by Redshift users in this group are routed to this queue by default; wildcards supported
- Query Group - when you run a query from a SQL client, you can set its query group to route it to this queue
- Timeout - maximum time (in milliseconds) a query can run in the queue before WLM acts on it
- Memory (%) - percentage of cluster memory allocated to this queue
Redshift - WLM Queue Assignment Rule
Queries are assigned to the first matching queue, in this order: Superuser queue; User group queue; Query group queue; Default queue
WLM query queue hopping - if a query times out, it is automatically moved to the next matching queue and restarts execution
Redshift - which 4 AWS services can copy data directly to Redshift?
S3, EC2, DynamoDB, and EMR
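For the common S3 case, a hedged sketch of the COPY command (the bucket, key prefix, and role ARN below are placeholders):

```sql
COPY sales_fact
FROM 's3://my-bucket/sales/part-'
IAM_ROLE 'arn:aws:iam::111122223333:role/MyRedshiftCopyRole'
FORMAT AS CSV
IGNOREHEADER 1;
```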
Redshift - when to use a data load manifest?
- Load required files only;
- Load files with different prefix;
- Load files from different buckets
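A manifest is a JSON file stored in S3 that lists the exact objects to load; the COPY command then references the manifest object with the MANIFEST keyword instead of a key prefix. A sketch with placeholder bucket names and keys:

```json
{
  "entries": [
    {"url": "s3://bucket-a/2024/01/part-001.csv", "mandatory": true},
    {"url": "s3://bucket-b/other-prefix/part-002.csv", "mandatory": true}
  ]
}
```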
Redshift - is UPSERT supported?
No, Redshift does not support a single UPSERT statement.
Redshift - what are 2 methods to update or merge data?
Using staging tables to 1) replace existing rows in target table; 2) update certain columns in the target.
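The replace-existing-rows variant can be sketched like this (table and column names are hypothetical; the whole sequence runs in one transaction):

```sql
BEGIN;
CREATE TEMP TABLE stage (LIKE target);
COPY stage FROM 's3://my-bucket/updates/'
IAM_ROLE 'arn:aws:iam::111122223333:role/MyRedshiftCopyRole' FORMAT AS CSV;
-- delete the rows being replaced, then insert everything from the stage
DELETE FROM target USING stage WHERE target.id = stage.id;
INSERT INTO target SELECT * FROM stage;
DROP TABLE stage;
COMMIT;
```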
Redshift - which S3 encryption types are supported by COPY?
SSE-S3, SSE-KMS, CSE-Symmetric
Redshift - which S3 encryption types are supported by UNLOAD?
SSE-S3 (default), SSE-KMS, CSE-CMK
Redshift - what are the node types, their storage capacities, and how many node slices does each support?
RA3 - high performance with scalable managed storage (up to 64TB/node)
RA3.4Xlarge: 2-32 nodes and up to 2PB;
12 vCPU / 4 slices
RA3.16Xlarge: 2-128 nodes and up to 8.2PB;
32vCPU / 16 slices
DC2 - High performance with fixed local SSD storage (up to 2.6TB/node)
DC2.large: 1-32 nodes and up to 20.5TB total;
2 vCPU / 2 slices
DC2.8Xlarge: 2-128 nodes and up to 327.7TB total;
32 vCPU / 16 slices
DS2 (legacy) - Large workloads with fixed local HDD storage (16TB/node)
Redshift - what is the VACUUM command and why is it needed?
VACUUM reclaims disk space and re-sorts the table.
It is needed because Redshift blocks are immutable:
- Updates write new blocks rather than modifying existing ones
- Deleted rows are not removed from disk
- Old rows consume disk space and are still scanned when a query is executed
The result is wasted storage space and performance degradation
Redshift - what are different vacuum commands?
VACUUM FULL;
VACUUM SORT ONLY;
VACUUM DELETE ONLY;
VACUUM REINDEX;
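These can also target a single table, optionally with a sort threshold; a sketch with a hypothetical table name:

```sql
VACUUM FULL sales_fact TO 75 PERCENT;
VACUUM DELETE ONLY sales_fact;
```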
Redshift - what is a deep copy? What is the best way to perform a deep copy?
A deep copy is faster than VACUUM for large tables (700GB+).
It can be done by:
1) Using the original DDL: recreate the table, INSERT ... SELECT the data from the original, drop the original, and rename the copy to the original name
2) CREATE TABLE LIKE
3) CREATE TEMP TABLE (copy the data out, truncate the original, and re-insert)
Using the original DDL is the fastest
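The original-DDL variant can be sketched as follows (the table name, columns, and keys are hypothetical):

```sql
BEGIN;
-- re-run the original CREATE TABLE DDL under a new name
CREATE TABLE events_copy (event_id BIGINT, event_time TIMESTAMP)
DISTSTYLE KEY DISTKEY (event_id) SORTKEY (event_time);
INSERT INTO events_copy (SELECT * FROM events);
DROP TABLE events;
ALTER TABLE events_copy RENAME TO events;
COMMIT;
```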
Redshift - cross-region snapshots
For a KMS-encrypted cluster, you need to create a snapshot copy grant in the destination region so Redshift can use a KMS master key in that region
ML - what are 3 ML models supported by Amazon Machine Learning?
- Binary classification;
- Multiclass classification;
- Linear regression;
S3 - what is S3 Select?
S3 Select is an Amazon S3 capability designed to pull out only the data you need from an object, which can dramatically improve the performance and reduce the cost of applications that need to access data in S3
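S3 Select expressions use a restricted SQL dialect in which the object is aliased as S3Object; a sketch assuming a CSV object whose header row is used for column names:

```sql
SELECT s.city, s.temp
FROM S3Object s
WHERE CAST(s.temp AS FLOAT) > 30
```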
S3 - what is Glacier Select?
Glacier Select allows you to query Glacier data using SQL, with results within minutes. The results are written to S3, and there is no need to restore the data from Glacier to S3 first
Quiz - You have a table in your Redshift cluster, and the data in this table changes infrequently. The table has fewer than 15 million rows and does not JOIN any other tables. Which distribution style would you select for this table?
1) EVEN 2) KEY 3) DEFAULT 4) ALL
4) ALL - The ALL distribution type is appropriate for tables that change infrequently (tables that are not updated frequently or extensively). With this distribution style, the entire table is distributed to every node.