Domain 4: Analysis Flashcards
RANDOM_CUT_FOREST
Kinesis Data Analytics SQL (or Flink) Function for anomaly detection in numeric columns
Kinesis Firehose Buffer Limits
1 to 128 MB
60 to 900 seconds
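As a rough illustration of the two limits above, the sketch below checks a pair of buffer hints against them and models the "whichever threshold is reached first" flush behavior. The function and constant names are hypothetical and are not part of any AWS SDK.

```python
# Limits taken from the card above: 1 to 128 MB and 60 to 900 seconds.
MIN_MB, MAX_MB = 1, 128
MIN_SECONDS, MAX_SECONDS = 60, 900


def buffer_hints_valid(size_mb: int, interval_seconds: int) -> bool:
    """Return True when both hints fall inside the limits on the card."""
    return (MIN_MB <= size_mb <= MAX_MB
            and MIN_SECONDS <= interval_seconds <= MAX_SECONDS)


def should_flush(buffered_mb: float, elapsed_seconds: float,
                 size_mb: int, interval_seconds: int) -> bool:
    """Model a flush that fires when either threshold is reached first."""
    return buffered_mb >= size_mb or elapsed_seconds >= interval_seconds
```

For example, with hints of 5 MB and 300 seconds, a buffer holding 5 MB after only 10 seconds would flush on size rather than on time.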
Kinesis Data Analytics Supported Sources
Kinesis Streams and Kinesis Firehose
Kinesis Data Analytics Supported Destinations
Kinesis Streams, Kinesis Firehose, Lambda
What happens if a record arrives late to a Kinesis Data Analytics application
Record is written to the error stream
In what form does Kinesis Data Analytics provision capacity?
Kinesis Processing Units
How much memory is provided per KPU?
4GB
What is the default number of KPU per Kinesis Data Analytics application?
8
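The two KPU cards above combine into simple arithmetic. A minimal sketch, using only the figures from the cards (the function name is hypothetical):

```python
GB_PER_KPU = 4      # memory per KPU, from the card above
DEFAULT_KPUS = 8    # default KPU count, from the card above


def kpu_memory_gb(kpus: int = DEFAULT_KPUS) -> int:
    """Total memory in GB available to an application with this many KPUs."""
    if kpus < 1:
        raise ValueError("an application needs at least one KPU")
    return kpus * GB_PER_KPU
```

With the default of 8 KPUs this works out to 32 GB.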
What is the name of the visualization tool in the Elastic Stack?
Kibana
Is ElasticSearch Serverless?
No, you still have to scale servers
What should ElasticSearch NOT be used for?
- OLTP (RDS or DynamoDB instead)
- Ad-Hoc Querying (Athena instead)
How can data be imported to ElasticSearch?
Kinesis, DynamoDB, Logstash, Beats, ElasticSearch API
What query engine does Athena use?
Presto
What data formats does Athena support?
CSV, JSON, Parquet, ORC, Avro
Is Athena serverless?
Yes
Does Athena support unstructured data?
Yes
Which data formats are columnar?
ORC and Parquet
Which data formats are splittable?
ORC, Parquet, Avro
Which notebooks can Athena integrate with?
Jupyter, Zeppelin, RStudio
What is the cost rate for Athena?
$5 per TB scanned
Do cancelled queries count toward Athena charges?
Yes
Do failed queries count toward Athena charges?
No
What data format will be the most cost effective in Athena?
Columnar (ORC, Parquet)
Does Athena charge for DDL processing?
No
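The Athena pricing cards above can be folded into one small estimator. This is a sketch under stated assumptions: it treats 1 TB as 10**12 bytes and ignores any per-query minimums or rounding, and the function name is hypothetical.

```python
PRICE_PER_TB_USD = 5.0     # from the card above
BYTES_PER_TB = 10 ** 12    # assumption: decimal terabyte


def athena_query_cost(bytes_scanned: int, *, is_ddl: bool = False,
                      failed: bool = False) -> float:
    """Estimated charge in USD for one query.

    DDL statements and failed queries are free per the cards above.
    A cancelled query is charged for whatever it scanned before it was
    cancelled, so pass that amount as bytes_scanned.
    """
    if is_ddl or failed:
        return 0.0
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD
```

Scanning 2 TB comes to about $10. Because columnar formats let a query read only the columns it needs, bytes_scanned drops and so does the cost.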
How can Athena results be encrypted?
Encrypt at rest in S3 using SSE-S3, SSE-KMS, CSE-KMS
Can Athena access S3 in another account?
Yes
How are Athena results encrypted in transit?
Transport Layer Security (TLS)
Is Redshift Serverless or Fully Managed?
Fully Managed
What is the maximum number of compute nodes in a Redshift cluster?
128
What are the two types of compute nodes that can be selected for a Redshift cluster?
- Dense Storage (DS) - uses HDDs for large size at low cost
- Dense Compute (DC) - uses SSD and lots of memory for faster performance at a higher cost
How many HDDs on a ds2.xlarge Redshift compute node?
3 for a total of 2TB storage
How many HDDs on a ds2.8xlarge Redshift compute node?
24 for a total of 16TB storage
How much SSD storage and RAM on a dc2.large Redshift compute node?
160GB SSD storage, 15GB RAM
How much SSD storage and RAM on a dc2.8xlarge Redshift compute node?
2.6TB SSD, 244GB RAM
What determines the number of Node Slices on a Compute Node?
The size of the Compute Node
What kind of data storage does Redshift use for high performance?
Columnar
Can you change the compression encoding for a column after a table is created in Redshift?
No
How many copies of your data are stored within Redshift?
Three - one main on cluster, one backup on cluster, one snapshot in S3
Can Redshift data be backed up to another region?
Yes - asynchronously in S3
How many AZs is Redshift limited to?
One
What is the default Redshift distribution style?
AUTO
What is the EVEN Redshift distribution style?
Steps through each slice and assigns data in round-robin fashion
What is the KEY Redshift distribution style?
Assigns data to each slice based on a selected key column. Ideal if you plan to query data on a specific column.
What is the ALL Redshift distribution style?
All data is replicated on every node in the cluster. Multiplies storage by the number of nodes in the cluster.
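The EVEN, KEY, and ALL cards above can be compared with a toy model. In this sketch a "bucket" stands in for a slice (EVEN, KEY) or a node (ALL), and the CRC32 hash is an assumption chosen for illustration, not Redshift's actual hashing.

```python
from collections import defaultdict
from zlib import crc32


def distribute(rows, num_buckets, style="EVEN", key=None):
    """Return {bucket_index: [rows]} for a toy cluster."""
    buckets = defaultdict(list)
    for i, row in enumerate(rows):
        if style == "EVEN":
            # Round-robin: step through the buckets in turn.
            buckets[i % num_buckets].append(row)
        elif style == "KEY":
            # Equal key values always hash to the same bucket.
            digest = crc32(str(row[key]).encode("utf-8"))
            buckets[digest % num_buckets].append(row)
        elif style == "ALL":
            # Every bucket receives a full copy of the table.
            for b in range(num_buckets):
                buckets[b].append(row)
        else:
            raise ValueError(f"unknown distribution style: {style}")
    return dict(buckets)
```

With KEY, rows that share a key value land together, which keeps joins on that column local. With ALL, the stored row count is the table size multiplied by the number of buckets.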
What are Redshift Sort Keys?
Similar to an index, makes for fast range queries
What are the three types of Redshift Sort Keys?
Single, Compound, Interleaved
What is the default type of Redshift Sort Key?
Compound
Does the order of Compound Sort Keys matter in Redshift?
Yes - first will be primary
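To see why column order matters in a compound sort key, the sketch below sorts the same rows with two different column orders. It uses plain Python tuple ordering as an analogy, with made-up sample data; it does not model how Redshift stores sorted blocks.

```python
rows = [
    {"region": "west", "day": 2},
    {"region": "east", "day": 3},
    {"region": "west", "day": 1},
    {"region": "east", "day": 1},
]


def compound_sort(rows, sort_key_columns):
    """Order rows by the listed columns; the first column is the primary key."""
    return sorted(rows, key=lambda r: tuple(r[c] for c in sort_key_columns))


# Rows for one region end up adjacent, so a filter on region reads one range.
by_region_day = compound_sort(rows, ["region", "day"])

# Rows for one day end up adjacent instead; each region is now scattered.
by_day_region = compound_sort(rows, ["day", "region"])
```

Range queries benefit most when they filter on the leading column, since only that column is guaranteed to be contiguous.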
What is required when performing COPY from S3 to Redshift?
Manifest File and IAM role
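As a sketch of how the manifest file and IAM role from the card above fit together, the helpers below build a manifest document and a matching COPY statement as strings. The helper names, bucket, table, and role ARN are all hypothetical placeholders, and nothing here connects to AWS.

```python
import json


def build_manifest(urls, mandatory=True):
    """Return manifest JSON listing each S3 object that COPY should load."""
    entries = [{"url": url, "mandatory": mandatory} for url in urls]
    return json.dumps({"entries": entries}, indent=2)


def build_copy_statement(table, manifest_url, iam_role_arn):
    """Return a COPY statement that reads the file list from a manifest."""
    return (f"COPY {table}\n"
            f"FROM '{manifest_url}'\n"
            f"IAM_ROLE '{iam_role_arn}'\n"
            "MANIFEST;")


manifest = build_manifest([
    "s3://example-bucket/sales/part-0000.csv",
    "s3://example-bucket/sales/part-0001.csv",
])
statement = build_copy_statement(
    "sales",
    "s3://example-bucket/sales/load.manifest",
    "arn:aws:iam::123456789012:role/ExampleRedshiftRole",
)
```

The manifest would be uploaded to S3 first, and the statement would then be run against the cluster with a SQL client.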
What is the command to copy Redshift data into S3?
UNLOAD
How can you configure S3 to Redshift connections without going over public internet?
Enhanced VPC routing
Can COPY decrypt S3 data as it is loaded into Redshift?
Yes, using hardware accelerated SSL
If loading a tall but narrow table to Redshift, what should you attempt to do for efficiency?
Try to use only one COPY command (metadata is added for each COPY command)
How do you copy a Redshift snapshot to another region?
- Create KMS Key in destination region
- Specify unique name for your snapshot copy grant
- Specify the KMS Key for which you’re creating the copy grant
- In the source region, Enable copying of snapshots to the copy grant you created
What is Redshift DBLINK?
Connects Redshift to PostgreSQL (which could be on RDS)
MUST be in the same Availability Zone
Can data be imported from DynamoDB to Redshift?
Yes
What is Redshift Workload Management (WLM)?
Prioritizes short, fast queries vs long, slow queries
How can you configure Redshift WLM?
Redshift Console, CLI, or API
What is Redshift Concurrency Scaling?
Automatically adds cluster capacity to handle increases in concurrent read queries
How do Redshift WLM and Concurrency Scaling interact?
WLM queues can manage which queries are sent to concurrency scaling clusters
How many queues can be created with Redshift Automatic WLM?
8 (default of 5)
Is concurrency raised or lowered on large queries in Automatic WLM?
Lowered
How many queues can be created with Redshift Manual WLM?
8 (default 1)
What is the default concurrency of the default queue in Redshift Manual WLM?
5
What is the maximum concurrency level in Redshift Manual WLM?
50
What is query queue hopping?
Timed out queries automatically hop to another queue and retry
What is Redshift Short Query Acceleration (SQA)?
Prioritizes short queries. Alternative to WLM.
What statements does Redshift SQA support?
CREATE TABLE AS, and SELECT statements
How does Redshift SQA predict query execution time?
Machine Learning
What is the Redshift VACUUM command?
Recovers space from deleted rows
What are the four types of Redshift VACUUM commands?
FULL, DELETE ONLY, SORT ONLY, REINDEX (reanalyzes interleaved sort keys)
What is Elastic Resize in Redshift?
Quickly add or remove nodes of the same type. Low downtime. For some types, you can only double or halve the nodes.
What is Classic Resize in Redshift?
Change node type and/or number of nodes. Can lead to hours or days of read-only.
What is Redshift Snapshot, restore, resize?
Used to keep cluster available during a Classic resize. Minimizes downtime.
What are Redshift RA3 nodes?
Allow you to scale compute and storage capacity independently
What is Redshift Data Lake Export?
Unloads Redshift to S3 in Parquet format
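One way to picture a Data Lake Export is as an UNLOAD statement that requests Parquet output. The sketch below only assembles the statement text; the helper name, bucket, table, and role ARN are hypothetical placeholders.

```python
def build_parquet_unload(select_sql, s3_prefix, iam_role_arn, partition_by=()):
    """Return an UNLOAD statement that writes query results as Parquet."""
    # The query is passed as a quoted string, so inner single quotes
    # are doubled.
    escaped = select_sql.replace("'", "''")
    lines = [
        f"UNLOAD ('{escaped}')",
        f"TO '{s3_prefix}'",
        f"IAM_ROLE '{iam_role_arn}'",
        "FORMAT AS PARQUET",
    ]
    if partition_by:
        lines.append(f"PARTITION BY ({', '.join(partition_by)})")
    return "\n".join(lines) + ";"


statement = build_parquet_unload(
    "SELECT * FROM sales WHERE region = 'west'",
    "s3://example-bucket/exports/sales/",
    "arn:aws:iam::123456789012:role/ExampleRedshiftRole",
    partition_by=["sale_date"],
)
```

The optional PARTITION BY clause writes the output into per-value folders, which is what lets services such as Athena or Spectrum prune partitions later.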
What are some advantages of Parquet?
2x faster, 6x smaller, automatically partitioned, compatible with many services (Spectrum, Athena, EMR, Sagemaker)
What does ACID stand for?
Atomicity, Consistency, Isolation, Durability
What port does Kibana run on?
5601
Do you have to use Glue Data Catalogs when using Athena?
No, you can use standard Athena Data Catalogs
What language does Glue’s ETL engine use?
Python
Can Athena invoke SageMaker models?
Yes
Is Athena’s Data Catalog Hive metastore compatible?
Yes
When should you use Redshift?
Many different sources, highly structured, single source of truth, stored for long periods of time, performant on large sizes of data
When should you use Athena?
Don’t want to worry about formatting or infrastructure, quick queries for troubleshooting, ad-hoc
When should you use EMR?
Need a wide variety of custom processing tasks, fine grained control over your clusters, custom code
What is Federated query in Athena?
Allows you to run SQL queries across a variety of relational, non-relational, and custom data sources. A unified way to run SQL queries across various data stores.
Can Athena read from compressed files?
Yes
Can Hive Query be run on Athena?
No (only Presto is supported)
What is SerDe?
Serializer/Deserializer; libraries that tell Hive how to interpret data formats; also used by Athena
What needs to be done to add data to a partitioned table in Athena?
ALTER TABLE ADD PARTITION
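The statement named on the card above takes a partition spec and an S3 location. A minimal sketch that assembles it as text follows; the helper name, table, partition columns, and bucket are hypothetical placeholders.

```python
def build_add_partition(table, partition, location):
    """Return an ALTER TABLE ADD PARTITION statement for one partition.

    partition maps each partition column to its value, in declaration order.
    """
    spec = ", ".join(f"{col} = '{val}'" for col, val in partition.items())
    return (f"ALTER TABLE {table} ADD IF NOT EXISTS\n"
            f"PARTITION ({spec})\n"
            f"LOCATION '{location}'")


statement = build_add_partition(
    "logs",
    {"year": "2024", "month": "01"},
    "s3://example-bucket/logs/2024/01/",
)
```

Until a partition is registered this way, data sitting under that S3 prefix is not visible to queries against the table.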
Can Athena access an S3 bucket in another account?
Yes