Analytics Flashcards
What is AWS Glue?
It performs discovery on the underlying schema of your data.
It also performs custom ETL jobs.
What is stored in the Glue Data Catalog?
Your table definitions or schemas. All the original data is still in S3.
What does Glue really allow you to do?
Query your unstructured data in S3 like it is structured data.
What is Hive?
It runs on EMR and lets you run SQL-like queries (HiveQL).
Can a Hive metastore be used in Glue?
Yes
Can a Glue Data Catalog be used in Hive?
Yes
What does enabling Job Metrics do in AWS Glue?
It helps you understand the maximum DPU that you need for your Glue job.
Where can you plot the Glue Job Metrics maximum needed executors versus maximum allocated?
In the Glue console. You do not need CloudWatch for this.
What is a dynamic frame in AWS Glue?
A collection of dynamic records
How can I remove outliers in my data in AWS Glue ETL?
Use the Filter transformation.
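A minimal sketch of the kind of predicate such a filter could use, shown on plain Python dicts rather than a real DynamicFrame. The field name `price`, the bounds, and the sample records are made up for illustration; in a Glue job a function like this would typically be passed as the `f` argument of `Filter.apply`.

```python
# Hypothetical sample records; a DynamicFrame would hold DynamicRecords instead.
records = [
    {"item": "a", "price": 9.5},
    {"item": "b", "price": 10.2},
    {"item": "c", "price": 950.0},  # obvious outlier
]

def within_bounds(record, field="price", low=0.0, high=100.0):
    """Return True when the field value lies inside [low, high]."""
    return low <= record[field] <= high

# Plain-Python equivalent of filtering a frame with the predicate.
cleaned = [r for r in records if within_bounds(r)]
print([r["item"] for r in cleaned])
```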
Can you join data in AWS Glue ETL?
Yes
How do you find matches or duplicates in your data in AWS Glue when there is no common unique identifier?
Use the FindMatchesML transformation.
Can you convert formats in Glue?
Yes
What does ResolveChoice do in AWS Glue?
It deals with ambiguities in your data, e.g., two columns named price.
What options does ResolveChoice in AWS Glue have?
make_cols: Makes a separate column for each type
cast: Casts to a specific type
make_struct: Creates a structure that contains each data type
project: Projects every type to a given type
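A rough pure-Python illustration of two of these strategies on a column that holds both integers and strings. This is a toy model, not Glue's implementation: the helper names, the type-name mapping, and the sample rows are assumptions made for the example.

```python
rows = [{"price": 10}, {"price": "12.50"}, {"price": 7}]

# Assumed mapping from Python types to Glue-style type names.
TYPE_NAMES = {int: "int", str: "string", float: "double"}

def make_cols(rows, col):
    """Toy make_cols: move each value into a column named <col>_<type>."""
    out = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != col}
        new_row[f"{col}_{TYPE_NAMES[type(row[col])]}"] = row[col]
        out.append(new_row)
    return out

def cast(rows, col, to_type):
    """Toy cast: convert every value in the column to one type."""
    return [{**row, col: to_type(row[col])} for row in rows]

print(make_cols(rows, "price"))
print(cast(rows, "price", float))
```

In a job, the corresponding ResolveChoice specs would look something like `("price", "make_cols")` or `("price", "cast:double")`.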
How do you modify your Glue Data Catalog when you added a new partition to your data?
Set the enableUpdateCatalog and partitionKeys options in the job script.
How do you modify your Glue Data Catalog when you added a new Schema or table to your data?
Set the enableUpdateCatalog and updateBehavior options in the job script.
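A sketch of how these options might be assembled for a Glue sink. The option names come from the cards above; the helper function, the `UPDATE_IN_DATABASE` value, and the partition column names are assumptions for illustration, so check them against the Glue documentation before relying on them.

```python
def catalog_update_options(partition_keys=None, update_schema=False):
    """Build an options dict for a write that also updates the Data Catalog."""
    options = {"enableUpdateCatalog": True}
    if partition_keys:
        # New partitions are registered under these keys.
        options["partitionKeys"] = list(partition_keys)
    if update_schema:
        # Assumed value: lets schema changes be written to the catalog.
        options["updateBehavior"] = "UPDATE_IN_DATABASE"
    return options

print(catalog_update_options(partition_keys=["year", "month"]))
print(catalog_update_options(update_schema=True))
```

A dict like this would be handed to the sink as its additional options when the DynamicFrame is written.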
How are you billed in AWS Glue?
By the second.
How are you billed for development endpoints in AWS Glue?
By the minute.
When you want to use Hive or Pig, what ETL engine should you use as a matter of best practice?
EMR
Can Glue ingest streaming data?
Yes. From Kinesis or Kafka
What is Glue Data Quality?
Rules for your data quality. If a threshold is exceeded, the job can stop or a CloudWatch alarm can be triggered.
What language does Glue Data Quality support?
DQDL (Data Quality Definition Language)
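To make the idea of a rule with a threshold concrete, here is a toy completeness check in plain Python. It only mimics what a DQDL rule along the lines of `Completeness "order_id" > 0.9` expresses; the column name, threshold, and sample rows are assumptions, and real rules are evaluated by Glue, not by code like this.

```python
rows = [
    {"order_id": 1, "price": 5.0},
    {"order_id": None, "price": 7.5},
    {"order_id": 3, "price": 2.0},
    {"order_id": 4, "price": 9.0},
]

def completeness(rows, column):
    """Fraction of rows whose column value is not None."""
    return sum(r[column] is not None for r in rows) / len(rows)

def rule_passes(rows, column, threshold):
    """True when the column's completeness is above the threshold."""
    return completeness(rows, column) > threshold

score = completeness(rows, "order_id")
passed = rule_passes(rows, "order_id", 0.9)
print(score, "pass" if passed else "fail: stop the job or raise an alarm")
```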
What are recipes in Glue DataBrew?
They are transformations that can be saved and applied to other jobs.
Can you create a dataset from Redshift or Snowflake in Glue DataBrew?
Yes
What is AWS Lake Formation?
AWS Managed Data Lake
Do Glue and Lake Formation have overlapping functionality?
Yes. Lake Formation is built on top of Glue, so anything that can be done in Glue can also be done in Lake Formation.
What AWS services can talk to Lake Formation?
Athena, Redshift, EMR
If I want to add someone from another account to my data lake, how would I do this?
The recipient must be set up as a data lake administrator.
What is required to access encrypted data catalogs in Lake Formation?
IAM permissions on the KMS encryption key.
What are governed tables in Lake Formation?
They support ACID Transactions with your data lake.
Can Lake Formation support Streaming data?
Yes, using governed tables. It can accept streams from Kinesis.
Does Lake Formation have row and cell level security?
Yes
Can Lake Formation support SAML?
Yes
What are policy tags in Lake Formation?
They are attached to databases, tables, or columns and can be used for security, e.g., only admins can see the users table.
What is a Data Filter in Lake Formation?
They provide column, row, or cell-level security.
When are Data Filters applied in Lake Formation?
When granting SELECT permission on a table.
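A toy model of what row, column, and cell-level filtering means, using plain dicts. The table contents, column names, and the `apply_data_filter` helper are hypothetical; Lake Formation enforces real data filters itself at query time.

```python
table = [
    {"name": "Ann", "country": "US", "salary": 100},
    {"name": "Bob", "country": "DE", "salary": 90},
    {"name": "Cy", "country": "US", "salary": 80},
]

def apply_data_filter(rows, columns=None, row_predicate=None):
    """Keep rows matching row_predicate, then keep only the listed columns.

    Column-level: pass columns only. Row-level: pass row_predicate only.
    Cell-level: pass both.
    """
    kept = [r for r in rows if row_predicate is None or row_predicate(r)]
    if columns is None:
        return kept
    return [{c: r[c] for c in columns} for r in kept]

# Cell-level example: only US rows, and only the name and country columns.
us_names = apply_data_filter(
    table,
    columns=["name", "country"],
    row_predicate=lambda r: r["country"] == "US",
)
print(us_names)
```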
What is Athena?
A SQL interface for your data in Amazon S3
When you need to add partitions after the fact in Athena, what command needs to be run?
MSCK REPAIR TABLE
How do you optimize your Athena table after using ACID transactions?
Run OPTIMIZE table_name REWRITE DATA USING BIN_PACK to compact the data files.
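The two Athena maintenance statements above, written out as strings a client could submit. The table name `sales_logs` and the helper functions are hypothetical; the sketch only builds the SQL text and does not call any AWS API.

```python
def repair_partitions_sql(table):
    """Statement that registers partitions found in storage but missing from the catalog."""
    return f"MSCK REPAIR TABLE {table}"

def compact_sql(table):
    """Statement that compacts the data files of a transactional table."""
    return f"OPTIMIZE {table} REWRITE DATA USING BIN_PACK"

for statement in (repair_partitions_sql("sales_logs"), compact_sql("sales_logs")):
    print(statement)
```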
How are Athena fine grained controls managed?
AWS IAM
Does Spark support streaming?
Yes. Kinesis, Kafka, EMR
Can you convert data into another format using Athena?
Yes, with CREATE TABLE AS SELECT (CTAS). It can write Parquet or ORC, compressed with GZIP or Snappy.
Can Spark Streaming support Kinesis?
Yes
Can Spark stream into Redshift?
Yes
Can Athena use Spark?
Yes. You can run a notebook within the Athena console and select Spark.
What are Athena Federated Queries?
They allow you to query sources other than S3, for example:
CloudWatch
DynamoDB
OpenSearch
RDS
Do Athena Federated Queries support Views?
Yes. They are stored in Glue.
Do Athena Federated Queries support cross-account data sources?
Yes.
What is EMR?
A managed Hadoop framework on EC2
What is a benefit of EMR over Glue?
More granular server access
What are the nodes in an EMR cluster?
Master Node
Core Node
Task Node
What is the EMR Master Node?
Manages the cluster.
What is an EMR Core Node?
Hosts HDFS data and runs tasks.
What is a Task Node?
Only runs tasks, but does not store data.
Can you start your EMR clusters as part of a data pipeline?
Yes, using AWS Data Pipeline.
What is the block size of HDFS?
128 MB by default
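A quick worked example of what the block size implies: a file is split into blocks of at most that size. The file sizes below are arbitrary, and the 128 MB figure is the value from the card.

```python
import math

BLOCK_SIZE_MB = 128  # block size from the card above

def hdfs_block_count(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of blocks needed to hold a file of the given size."""
    return math.ceil(file_size_mb / block_size_mb)

print(hdfs_block_count(1000))  # 8 blocks for a 1000 MB file
print(hdfs_block_count(128))   # exactly 1 block
```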
What is EMR Managed Scaling?
Scales all your instances regardless of type.
How does EMR Managed Scaling scale?
It adds core nodes first and then task nodes to the maximum specified.
How does EMR Managed Scaling scale down?
It removes task nodes first and then core nodes.
In EMR Managed scaling, are On-Demand or Spot Instances scaled down first?
Spot
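A simplified model of the scale-down order described in the last two cards: task nodes before core nodes, and Spot before On-Demand within each group. The node list and the `removal_order` helper are hypothetical, and the real service weighs more factors than this sort does.

```python
nodes = [
    {"id": "core-od", "role": "core", "market": "on-demand"},
    {"id": "task-od", "role": "task", "market": "on-demand"},
    {"id": "core-spot", "role": "core", "market": "spot"},
    {"id": "task-spot", "role": "task", "market": "spot"},
]

# Lower rank means removed earlier.
ROLE_RANK = {"task": 0, "core": 1}
MARKET_RANK = {"spot": 0, "on-demand": 1}

def removal_order(nodes):
    """Sort nodes into the order in which the toy model would remove them."""
    return sorted(nodes, key=lambda n: (ROLE_RANK[n["role"]], MARKET_RANK[n["market"]]))

print([n["id"] for n in removal_order(nodes)])
```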
Can EMR run as serverless?
Yes
How large can a Kinesis data record be?
1 MB
What is the throughput from the Kinesis producer to the stream?
1 MB/s or 1,000 records per second per shard
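The per-shard write limits translate directly into a shard count for provisioned mode. A small worked calculation, where the workload numbers are invented and the limits are the ones on the card:

```python
import math

MB_PER_SEC_PER_SHARD = 1
RECORDS_PER_SEC_PER_SHARD = 1000

def shards_needed(mb_per_sec, records_per_sec):
    """Shards required so that neither the byte limit nor the record limit is exceeded."""
    by_bytes = math.ceil(mb_per_sec / MB_PER_SEC_PER_SHARD)
    by_records = math.ceil(records_per_sec / RECORDS_PER_SEC_PER_SHARD)
    return max(by_bytes, by_records, 1)

print(shards_needed(mb_per_sec=3.5, records_per_sec=2000))  # bytes dominate
print(shards_needed(mb_per_sec=0.5, records_per_sec=6500))  # record count dominates
```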
What is the max retention in Kinesis Data Streams?
365 days
Does Kinesis allow replay?
Yes
What are the capacity modes for Kinesis Data Streams?
Provisioned
On-Demand
Do you have to manage capacity for Kinesis using On-Demand mode?
No.
What is the default capacity for Kinesis on-demand mode?
4 MB/s or 4,000 records per second.
What is the best use case for the Kinesis Producer SDK?
low throughput, higher latency
If I want to use an asynchronous call to put data into Kinesis which producer would I use?
The KPL.
What is batching in Kinesis Producer Library?
Aggregation. Records are buffered and sent together when a threshold is met, which allows you to go over the 1,000 records per second per shard limit.
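A toy illustration of why aggregation helps with the records-per-second limit: many small user records are packed into fewer, larger stream records. The `aggregate` helper, the size limit, and the payloads are assumptions; the real KPL uses its own binary format and settings.

```python
def aggregate(user_records, max_bytes):
    """Greedily pack byte payloads into batches of at most max_bytes each."""
    batches, current, size = [], [], 0
    for payload in user_records:
        # Start a new batch when adding this payload would exceed the limit.
        if current and size + len(payload) > max_bytes:
            batches.append(current)
            current, size = [], 0
        current.append(payload)
        size += len(payload)
    if current:
        batches.append(current)
    return batches

user_records = [b"x" * 100] * 5000  # 5,000 small records of 100 bytes
batches = aggregate(user_records, max_bytes=50_000)
print(len(user_records), "user records ->", len(batches), "stream records")
```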
How do you adjust the buffer time in the Kinesis Producer Library?
Set RecordMaxBufferedTime (default 100 ms).
If an application cannot tolerate latency, what Kinesis producer should be used?
The SDK. Batching in the KPL would be problematic because it adds latency.
Can Spark be a consumer of Kinesis?
Yes
How many Kinesis GetRecords API calls can be made per shard per second?
5
How much data can a Kinesis GetRecords API call return?
Up to 10 MB of data or 10,000 records.
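Because the 5 calls per second are per shard, standard consumers that share a shard also share that budget. A small calculation of the tightest polling interval each consumer can sustain; the consumer counts are arbitrary examples.

```python
GET_RECORDS_CALLS_PER_SEC_PER_SHARD = 5

def min_poll_interval_ms(consumers_on_shard):
    """Shortest interval at which each consumer can poll without exceeding the limit."""
    calls_per_consumer = GET_RECORDS_CALLS_PER_SEC_PER_SHARD / consumers_on_shard
    return 1000 / calls_per_consumer

for n in (1, 2, 5):
    print(n, "consumer(s):", min_poll_interval_ms(n), "ms")
```

With a single consumer this works out to 200 ms between calls, and the interval grows as more consumers poll the same shard.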
How do you handle Checkpointing in the KCL?
DynamoDB.
What does it mean when you get an ExpiredIteratorException in the Kinesis Client Library?
Checkpointing to DynamoDB was throttled. The table needs more WCU.
Can Lambda perform light ETL for Kinesis Data Streams?
Yes!
What is the latency for Kinesis enhanced fan-out?
About 70 ms
What is the latency for a Kinesis standard consumer?
About 200 ms
Can Kinesis resharding be done in parallel?
No. It takes a few seconds per shard.
How can duplicates end up in the Kinesis Producer?
Network timeouts
How can duplicates end up in the Kinesis Consumers?
Resharding events
Starting the application
Worker instances are added or removed
Application is deployed
What is the max record size for Kinesis Data Firehose?
1MB
How do you perform data transformation in Kinesis Data Firehose?
Lambda
What are the main destinations for Kinesis Data Firehose?
S3
Redshift (via an S3 COPY; there is no direct integration)
OpenSearch
Custom destinations that use HTTP endpoints
Can you archive data coming into Kinesis Data Firehose?
Yes. All records, or just the failed ones, can be stored in S3.
Does Kinesis Data Firehose scale automatically?
Yes
Does Kinesis Data Firehose perform data conversions?
Yes, to Parquet or ORC.
Does Kinesis Data Firehose support compression?
Yes
Can Spark or the KCL read from Kinesis Data Firehose?
No. Not possible
How is the buffer in Kinesis Data Firehose configured?
By time or size.
What are the default buffer settings?
5 MB or 300 seconds for an S3 destination.
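A toy buffer that flushes on whichever limit is hit first, size or time, to show how the two settings interact. The class, its tiny limits, and the timestamps are hypothetical; Firehose manages its real buffer internally and does not need a new record to trigger a time-based flush.

```python
class ToyBuffer:
    def __init__(self, max_bytes, max_seconds):
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.items, self.size, self.started = [], 0, None
        self.flushed = []  # list of (reason, item_count)

    def put(self, payload, now):
        """Add a payload at time `now` (seconds); flush if a limit is reached."""
        if self.started is None:
            self.started = now
        self.items.append(payload)
        self.size += len(payload)
        if self.size >= self.max_bytes:
            self._flush("size")
        elif now - self.started >= self.max_seconds:
            self._flush("time")

    def _flush(self, reason):
        self.flushed.append((reason, len(self.items)))
        self.items, self.size, self.started = [], 0, None

buf = ToyBuffer(max_bytes=10, max_seconds=60)
buf.put(b"12345", now=0)
buf.put(b"67890", now=1)   # buffer reaches 10 bytes: size flush
buf.put(b"abc", now=2)
buf.put(b"d", now=65)      # 63 s since this buffer started: time flush
print(buf.flushed)
```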
How do you send data in real-time to OpenSearch?
Kinesis Data Streams
What is Managed Service for Apache Flink?
Managed instance of Apache Flink to process data streams.
What are common sources for Apache Flink in AWS?
Kinesis
Kafka
Can Flink send data to Kinesis?
Yes, using Data Streams or Firehose.
What is MSK Connect?
A managed Kafka Connect service. Its connector plugins move data between MSK and other services such as Redshift, S3, and OpenSearch.
Can MSK be run as serverless?
Yes
What is OpenSearch?
Fundamentally a search engine, used for petabyte-scale analysis and reporting.
Does OpenSearch support visualizations?
Yes, through OpenSearch Dashboards, but QuickSight is more robust.
In OpenSearch, Indexes are split into ____________
Shards
How many shards does an OpenSearch index have?
2 primary
2 replica
In OpenSearch, can you scale up or down without downtime?
Yes
Do Master Nodes in OpenSearch hold or process data?
No. They only manage the cluster
How do you perform backups in OpenSearch?
Snapshot to S3
What is UltraWarm storage in OpenSearch?
Uses S3 plus caching.
Good for log data
Best for indices with few writes.
In OpenSearch, what is Index State Management?
It deletes old indexes after a period of time.
Moves indexes from one storage type to another
Automates snapshots
Reduces replica count
Index Rollups
Can you have cross-cluster replication in OpenSearch?
Yes
What is the leader index in OpenSearch?
The master copy of your index. The follower is the index following it for replication.
How many OpenSearch master nodes should you have?
Three
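One way to see why three is the usual recommendation is quorum arithmetic: a majority of master-eligible nodes must agree before the cluster can act. A short calculation, assuming simple majority voting:

```python
def quorum(master_nodes):
    """Smallest majority of the given number of nodes."""
    return master_nodes // 2 + 1

def failures_tolerated(master_nodes):
    """How many nodes can be lost while a majority remains."""
    return master_nodes - quorum(master_nodes)

for n in (1, 2, 3):
    print(n, "master(s): quorum", quorum(n), "tolerates", failures_tolerated(n), "failure(s)")
```

Under this assumption, two master nodes tolerate no more failures than one, while three can lose a node and still form a majority.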
Can OpenSearch run as serverless?
Yes
What are some popular QuickSight data sources?
Redshift
Aurora
Athena
What is SPICE in QuickSight?
A parallel, in-memory calculation engine that accelerates queries on large datasets.
How much SPICE does each user get?
10 GB
In QuickSight, which edition do I need for column-level security?
Enterprise
Does QuickSight only access data from within the same region?
Yes
How does QuickSight get around the same-region limitation?
Create a security group with an inbound rule that allows the IP range of the QuickSight servers.
If I want to use an elastic network interface to put QuickSight in the same VPC as Redshift, which edition of QuickSight do I need?
Enterprise
Can Active Directory be used with QuickSight?
Yes, but only with Enterprise.
Do you get encryption at rest with QuickSight Standard?
No, only Enterprise.