Domain - 4 - Analytics Flashcards
Redshift Spectrum
Redshift Spectrum is a feature of Amazon Redshift that enables you to run queries against exabytes of unstructured data in Amazon S3, with no loading or ETL required.
Redshift Distribution Style
Redshift has 3 distribution styles (see the DDL sketch after this list):
1) EVEN (default) - rows are distributed across the slices in a round-robin fashion, regardless of the values in any particular column
When to use?
- Table does not participate in joins
- No clear choice between KEY and ALL
2) KEY - rows are distributed according to the values in one column; matching rows are collocated on the same slice, which makes table joins fast because no cross-node traffic is needed
When to use?
- Tables used in joins
- Large fact tables in a star schema
3) ALL - a full copy of the table is distributed to every node; more storage is needed
When to use?
- Table data that does not change
- Reasonably sized tables (a few million rows)
- No common distribution key
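A minimal DDL sketch of the three styles (table and column names are hypothetical):
-- EVEN: no joins, no clear choice between KEY and ALL
CREATE TABLE clickstream_raw (event_ts TIMESTAMP, payload VARCHAR(65535)) DISTSTYLE EVEN;
-- KEY: large fact table joined on customer_id; matching rows are collocated on the same slice
CREATE TABLE sales (sale_id BIGINT, customer_id INT, amount DECIMAL(10,2)) DISTSTYLE KEY DISTKEY (customer_id);
-- ALL: small, rarely changing dimension table copied in full to every node
CREATE TABLE date_dim (date_key DATE, fiscal_quarter SMALLINT) DISTSTYLE ALL;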
Redshift Sort Key
Amazon Redshift stores your data on disk in sorted order according to the sort key. The Amazon Redshift query optimizer uses sort order when it determines optimal query plans
Redshift - Different Sort Keys
Compound sort key - the table is sorted by the columns listed in the sort key definition; columns are weighted in the order they appear, so the leading column matters most
Interleaved sort key - gives equal weight to each column in the sort key
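Illustrative DDL for the two sort key types (table and column names are hypothetical):
-- Compound: the leading column (event_date) dominates the sort order
CREATE TABLE events_c (event_date DATE, customer_id INT) COMPOUND SORTKEY (event_date, customer_id);
-- Interleaved: event_date and customer_id carry equal weight
CREATE TABLE events_i (event_date DATE, customer_id INT) INTERLEAVED SORTKEY (event_date, customer_id);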
Redshift - Data Types
There are 12 data types, e.g.
- CHAR, CHARACTER, NCHAR: max 4096 bytes
- VARCHAR: max 65535 bytes (64K-1)
Redshift - WLM
Workload Management is an ordered set of query queues that define how resources are allocated and how queries are routed for processing. WLM definition is part of the Parameter Group for a Redshift cluster.
Redshift - Dynamic and Static WLM Settings
Dynamic WLM settings can be applied immediately to any cluster associated with the Parameter Group; e.g. Memory
Static WLM settings require reboot for changes to be applied; e.g. User Group, Query Group
Redshift - Query Queue Settings
With each queue, you can define
- Concurrency - defaults to 5
- User Group - queries run by Redshift users in this group are routed to this queue by default; wildcards are supported
- Query Group - when you run a query from a SQL client, you can set it to run in a specific query group (see the example after this list)
- Timeout - maximum time a query may run in the queue before it is cancelled or hops to another queue
- Memory (%) - percentage of cluster memory allocated to the queue
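Example of routing a session to a queue by query group; the label 'adhoc' is hypothetical and must match a query group defined in the WLM configuration:
SET query_group TO 'adhoc';    -- subsequent queries in this session run in the matching queue
SELECT COUNT(*) FROM sales;
RESET query_group;             -- return to default queue assignment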
Redshift - WLM Queue Assignment Rule
Superuser queue; User group queue; Query group queue; Default queue
WLM query queue hopping - if a query times out, it is automatically moved to the next matching queue and restarted
Redshift - which 4 AWS services can copy data directly to Redshift?
S3, EC2, DynamoDB, and EMR
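Sketch of COPY from S3 and from DynamoDB (bucket, table, and IAM role ARN are placeholders):
COPY sales FROM 's3://my-bucket/sales/part-'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole' DELIMITER '|' GZIP;
COPY sales FROM 'dynamodb://SalesTable'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole' READRATIO 50;   -- use at most 50% of the table's provisioned read capacity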
Redshift - when should you use a data load manifest?
- To load only the required files
- To load files with different prefixes
- To load files from different buckets
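Sketch of a manifest-driven COPY; the manifest is a JSON file whose "entries" array lists the exact S3 object URLs to load (optionally marked mandatory), which may span prefixes and buckets:
COPY sales FROM 's3://my-bucket/manifests/sales.manifest'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole' MANIFEST;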
Redshift - is UPSERT supported?
No.
Redshift - what are 2 methods to update or merge data?
Using staging tables to 1) replace existing rows in target table; 2) update certain columns in the target.
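A minimal staging-table merge sketch (replace-existing-rows variant; table and column names are hypothetical):
BEGIN;
CREATE TEMP TABLE stage (LIKE sales);   -- staging table inherits the target's DDL
COPY stage FROM 's3://my-bucket/sales-updates/' IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole';
DELETE FROM sales USING stage WHERE sales.sale_id = stage.sale_id;   -- drop rows that will be replaced
INSERT INTO sales SELECT * FROM stage;
DROP TABLE stage;
END;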
Redshift - which S3 encryption types does COPY support?
SSE-S3, SSE-KMS, CSE-Symmetric
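COPY reads SSE-S3 and SSE-KMS objects transparently with no extra options; for client-side symmetric encryption you supply the root key (a sketch, all values are placeholders):
COPY customer FROM 's3://my-bucket/encrypted/customer'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
MASTER_SYMMETRIC_KEY '<base64-root-key>' ENCRYPTED DELIMITER '|';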
Redshift - which S3 encryption types does UNLOAD support?
SSE-S3 (default), SSE-KMS, CSE-CMK
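UNLOAD writes SSE-S3 encrypted files by default; a sketch of unloading with a KMS key instead (role ARN and key ID are placeholders):
UNLOAD ('SELECT * FROM sales WHERE sale_date >= ''2020-01-01''')
TO 's3://my-bucket/unload/sales_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
KMS_KEY_ID '1234abcd-12ab-34cd-56ef-1234567890ab' ENCRYPTED;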
Redshift - what are the node types, what are their storage capacities, and how many node slices does each support?
RA3 - high performance with scalable managed storage (up to 64TB/node)
- ra3.4xlarge: 2-32 nodes, up to 2PB total; 12 vCPU / 4 slices per node
- ra3.16xlarge: 2-128 nodes, up to 8.2PB total; 48 vCPU / 16 slices per node
DC2 - high performance with fixed local SSD storage (up to 2.6TB/node)
- dc2.large: 1-32 nodes, up to 5.1TB total; 2 vCPU / 2 slices per node
- dc2.8xlarge: 2-128 nodes, up to 327.7TB total; 32 vCPU / 16 slices per node
DS2 (legacy) - large workloads with fixed local HDD storage (up to 16TB/node)
Redshift - what is VACUUM command and why we need it?
VACUUM reclaims space and re-sorts the table.
It is needed because Redshift blocks are immutable:
- Updates result in new blocks
- Deleted rows are not removed from disk
- Old rows consume disk space and still get scanned when a query is executed
The result is wasted storage space and degraded performance.
Redshift - what are different vacuum commands?
VACUUM FULL;
VACUUM SORT ONLY;
VACUUM DELETE ONLY;
VACUUM REINDEX;
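Examples against hypothetical tables (TO ... PERCENT sets the minimum sort threshold):
VACUUM FULL sales TO 99 PERCENT;   -- reclaim space and re-sort unless the table is already 99% sorted
VACUUM DELETE ONLY sales;          -- reclaim space from deleted rows without re-sorting
VACUUM REINDEX events_i;           -- re-analyze interleaved sort key distribution, then vacuum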
Redshift - what is deep copy? what is the best way to perform deep copy?
Deep copy is faster than VACUUM for large tables (700GB+).
It can be done by:
1) Using the original DDL - recreate the table, INSERT ... SELECT the data from the original, drop the original, and rename the copy to the original name
2) CREATE TABLE ... LIKE - then populate, drop, and rename as above
3) CREATE TEMP TABLE - copy the rows to a temporary table, truncate the original, and reload it
Using the original DDL is the fastest.
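Deep-copy sketch using the original DDL (table definition is hypothetical):
CREATE TABLE sales_copy (sale_id BIGINT, customer_id INT, amount DECIMAL(10,2))
DISTSTYLE KEY DISTKEY (customer_id) SORTKEY (sale_id);
INSERT INTO sales_copy SELECT * FROM sales;   -- rows are written fully sorted into new blocks
DROP TABLE sales;
ALTER TABLE sales_copy RENAME TO sales;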
Redshift - cross-region snapshots
For a KMS-encrypted cluster, you need to create a snapshot copy grant in the destination region so that Redshift can use a KMS key in the destination region to encrypt the copied snapshots
ML - what are 3 ML models supported by Amazon Machine Learning?
- Binary classification;
- Multiclass classification;
- Linear regression;
S3 - what is S3 Select?
S3 Select is a new Amazon S3 capability designed to pull out only the data you need from an object, which can dramatically improve the performance and reduce the cost of applications that need to access data in S3
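The query passed to the S3 SelectObjectContent API is a limited SQL SELECT over the object; here the object is assumed to be CSV with a header row, and the column names are hypothetical:
SELECT s.customer_id, s.amount FROM S3Object s WHERE CAST(s.amount AS FLOAT) > 100.0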
S3 - what is Glacier Select?
Glacier Select allows you to query data in Glacier using SQL, with results available within minutes. The results are written to S3, and there is no need to restore the archive from Glacier to S3 first
Quiz - You have a table in your Redshift cluster, and the data in this table changes infrequently. The table has fewer than 15 million rows and does not JOIN any other tables. Which distribution style would you select for this table?
1) EVEN 2) KEY 3) DEFAULT 4) ALL
4) ALL - The ALL distribution type is appropriate for tables that change infrequently (tables that are not updated frequently or extensively). With this distribution style, the entire table is distributed to every node.
Quiz - How many concurrent queries can you run on a Redshift cluster?
1) 500 2) 200 3) 150 4) 50
4) 50 - The total concurrency level for queries in Redshift is 50.
Quiz - Your analytics team runs large, long-running queries in an automated fashion throughout the day. The results of these large queries are then used to make business decisions. However, the analytics team also runs small queries manually on ad-hoc basis. How can you ensure that the large queries do not take up all the resources, preventing the smaller ad-hoc queries from running?
1) Assign each query a priority number.
2) Create a query user group for small queries based on the analysts’ Redshift user IDs, and create a second query group for the large, long-running queries.
3) Do nothing, because Redshift handles this automatically.
4) Setup node affinity and assign large queries and small queries to run-specific nodes.
2) - Redshift supports workload management (WLM), which enables users to create query groups assigned to query queues so that short, fast-running queries won’t get stuck in queues behind long-running queries.
Quiz - True or False: Defining primary keys and foreign keys is an important part of Redshift design because it helps maintain data integrity
False - Redshift does not enforce primary key and foreign key constraints. Even though they are informational only, the query optimizer uses those constraints to generate more efficient query plans.
Quiz - In your current data warehouse, BI analysts consistently join two tables: the customer table and the orders table. The column they JOIN on (and common to both tables) is called customer_id. Both tables are very large, over 1 billion rows. Besides being in charge of migrating the data, you are also responsible for designing the tables in Redshift. Which distribution style would you choose to achieve the best performance when the BI analysts run queries that JOIN the customer table and orders table using customer_id?
1) DEFAULT 2) EVEN 3) KEY 4) ALL
3) KEY - The KEY distribution style will achieve the best performance in this case. In Redshift, rows are distributed according to the values in one column. The leader node will attempt to place matching values on the same node slice. If you distribute a pair of tables on the joining keys, the leader node collocates the rows on the slices according to the values in the joining columns so that matching values from the common columns are physically stored together.
Quiz - Which of the following AWS services directly integrate with Redshift using the COPY command. Choose 3:
1) DynamoDB
2) EMR/EC2 instances
3) Kinesis Streams
4) Data Pipeline
5) Machine Learning
6) S3
1) 2) 6)
Quiz - True or False: When you use the UNLOAD command in Redshift to write data to S3, it automatically creates files using Amazon S3 server-side encryption with AWS-managed encryption keys (SSE-S3).
True - If you want to ensure files are automatically encrypted on S3 with server-side encryption, no special action is needed. The unload command automatically creates files using Amazon S3 server-side encryption with AWS-managed encryption keys (SSE-S3).
Quiz - A company has set revenue goals for each of their regional businesses across the world. The CEO needs a report showing revenue goals by region. She also needs a visual that uses rectangle sizes and colors to show which regions have the highest revenue goals. Which visual type would best suit her needs?
1) HeatMap 2) Tree Map
3) Scatter Plot 4) Pivot Table
2) Tree Map - The tree map visual type would be ideal for this scenario. Use tree maps to visualize one or two measures for a dimension. Each rectangle on the tree map represents one item in the dimension. Rectangle size represents the proportion of the value for the selected measure that the item represents compared to the whole for the dimension. You can optionally use rectangle color to represent another measure for the item. Rectangle color represents where the value for the item falls in the range for the measure, with darker colors indicating higher values and lighter colors indicating lower ones.
EMR - What is Tez?
Apache™ Tez is an extensible framework for building high performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
Both Tez and MapReduce are execution engines in Hive
EMR - What is Hadoop Encrypted Shuffle?
The shuffle phase is the process of transferring data from the mappers to the reducers. This process involves transferring data from node to node within the cluster, and if you want that data to be encrypted in transit between nodes, then Hadoop Encrypted Shuffle has to be set up. The Encrypted Shuffle capability allows encryption of the MapReduce shuffle using HTTPS. When you select the in-transit encryption checkbox in the EMR security configuration, Hadoop Encrypted Shuffle is automatically set up for you upon cluster launch.
Quiz - True or False, Encryption on a Redshift cluster is immutable
True - you cannot change an encrypted Redshift cluster to an unencrypted cluster
Redshift - what are the four tiers of encryption at REST?
Master key, Cluster encryption key, Database encryption key, Data encryption keys.
Redshift - what are the options for managing encryption keys? What are the differences between these options? (name 3 of them)
You have two options: KMS and HSM (CloudHSM or on-premises HSM)
Differences: 1) managed vs. non-managed; 2) symmetric only vs. symmetric & asymmetric encryption; 3) multi-tenant vs. single-tenant; 4) usage-based pricing vs. hourly rate that varies by region
Quiz - Which of the following mechanisms work together to encrypt data at rest on instance store volumes and EBS volumes?
1) DriveCrypt
2) LUKS
3) Open-source HDFS Encryption
4) TrueCrypt
2) LUKS and 3) Open-source HDFS Encryption
Quiz - Your organization is going to use EMR with EMRFS. However, your security team requires that you both encrypt all data before sending it to S3 and that you maintain the keys. Which encryption option should be used?
1) CSE-KMS 2) CSE-Custom 3) SSE-S3 4) SSE-KMS
2) CSE-Custom
In this case, you would use CSE-Custom, where you would encrypt the data before sending it to S3 and also manage the client-side master key. The other encryption options available are: S3 Server-Side Encryption (SSE-S3), S3 manages keys for you; Server-Side Encryption with KMS–Managed Keys (SSE-KMS), S3 uses a customer master key that is managed in the Key Management Service to encrypt and decrypt the data before saving it to an S3 bucket; Client-Side Encryption with KMS-Managed Keys (CSE-KMS), the EMR cluster uses a customer master key to encrypt data before sending it to Amazon S3 for storage and to decrypt the data after it is downloaded.
Quiz - You have enabled encryption for your Redshift cluster and loaded 7 TBs of data into the cluster. It turns out that encryption was not required on this cluster after all. You want to make a change and ensure that the cluster is not encrypted. What are your options?
1) Restore from the latest snapshot.
2) Unload the data into S3 and reload it into a cluster that is not encrypted.
3) Check the decrypt checkbox in the Redshift modification options.
4) Delete the KMS managed keys.
5) Create a new cluster that is not encrypted and reload the data.
2) and 5)
Once you enable encryption for a Redshift cluster upon launch, you cannot then change it to an unencrypted cluster. You’ll have to unload the data and reload it into a new cluster with your new encryption setting. Alternatively, if you still have the data in S3 or another supported source, load the data into a new cluster with your new encryption setting. When you encrypt a Redshift cluster, data blocks and system metadata are encrypted for the cluster and its snapshots.