Domain - 4 - Analytics Flashcards
Redshift Spectrum
Redshift Spectrum is a feature of Amazon Redshift that enables you to run queries against exabytes of unstructured data in Amazon S3, with no loading or ETL required.
Redshift Distribution Style
Redshift has 3 distribution Styles
1) Even (default) - rows distributed across the slices regardless of values in a particular column in a round-robin fashion
When to use?
No joins, reduced parallelism is ok, where KEY and ALl are not a clear choice
Key - distribute data evenly among slices; Collocate matching rows on the same slice, good for table joins performance - no cross node traffic
When to use?
- Tables used in joins
- Large fact tables in a star schema
ALL - entire copies of table is distributed to each node; More storage needed
When to use?
- Table data that does not change
- Reasonably sized tables ( a few million rows)
- No common distribution key
Redshift Sort Key
Amazon Redshift stores your data on disk in sorted order according to the sort key. The Amazon Redshift query optimizer uses sort order when it determines optimal query plans
Redshift - Different Sort Keys
Compound sort key - table sorted by columns listed in the sort key definition - weight based on the sort key order in the definition
Interleave sort key - Equal weight to each column in the sort key
Redshift - Data Types
There are 12 data types, e.g.
- CHAR, CHARACTER, NCHAR: max 4096 bytes
- VARCHAR: max 65535 bytes (64K-1)
Redshift - WLM
Workload Management is an ordered set of query queues that define how resources are allocated and how queries are routed for processing. WLM definition is part of the Parameter Group for a Redshift cluster.
Redshift - Dynamic and Static WLM Settings
Dynamic WLM settings can be applied immediately to any cluster associated with the Parameter Group; e.g. Memory
Static WLM settings require reboot for changes to be applied; e.g. User Group, Query Group
Redshift - Query Queue Settings
With each queue, you can define
- Concurrency - default to 5
- User Group - queries run by Redshift users in this group will be default to this queue, wildcard supported
- Query Group - when you run a query from a SQL client, you can set it to run in a query group
- Timeout -
- Memory (%)
Redshift - WLM Queue Assignment Rule
Superuser queue; User group queue; Query group queue; Default queue
WLM query queue hopping - if a query is timeout, it is automatically moved to the next queue and restart execution
Redshift - which 4 AWS services can copy data directly to Redshift?
S3, EC2, DynamoDB, and EMR
Redshift - when to data load manifest?
- Load required files only;
- Load files with different prefix;
- Load files from different buckets
Redshift - is UPSERT supported?
No.
Redshift - what are 2 methods to update or merge data?
Using staging tables to 1) replace existing rows in target table; 2) update certain columns in the target.
Redshift - which S3 encryption supported by COPY?
SSE-S3, SSE-KMS, CSE-Symmetric
Redshift
which S3 encryption supported by UNLOAD?
SSE-S3 (default), SSE-KMS, CSE-CMK