Analytics Flashcards
What does Glue create when it scans your unstructured data in S3?
It creates metadata which can be used to query the data.
What is Hive?
It lets you run SQL-like queries (HiveQL) on EMR.
Can you import a Hive metastore into AWS Glue?
Yes. You can also import AWS Glue metadata into Hive.
How can you increase the performance of a Spark Job?
Provision additional DPUs (Data processing units).
How can you determine how many DPUs you will need for your job?
Enable job metrics to understand the maximum capacity in DPUs that you will need.
Where are Glue errors reported?
CloudWatch
How can you schedule Glue jobs?
The Glue Scheduler. This is the most straightforward approach.
What is a DynamicFrame in AWS Glue?
A collection of DynamicRecords.
What is a DynamicRecord in AWS Glue?
They are records that are self-describing and have a schema.
Using native AWS Glue functionality, how can you drop fields or null fields?
DropFields or DropNullFields transformation
Using AWS Glue, how can you select a subset of records during your ETL process?
Using filter transformation
How can you enrich your data from another source in AWS Glue?
Use the join transformation
What does the Map transformation in AWS Glue do?
It allows you to add fields, delete fields, and perform external lookups.
What does the FindMatches ML transformation do in AWS Glue?
It identifies duplicate or matching records in your dataset. Even when the records do not have a common identifier.
What format conversions can AWS Glue support?
CSV, JSON, Avro, Parquet, ORC, XML
What does AWS Glue ResolveChoice do?
It resolves ambiguities in your DynamicFrame and returns a new one. Examples are a field whose values arrive with more than one type, or two fields with the same name.
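A rough way to picture this is a field whose values arrive with more than one type. The plain-Python sketch below is not the Glue API. `resolve_cast` and `resolve_make_cols` are hypothetical helpers that only imitate the idea of forcing one type versus splitting the field into one column per type.

```python
def resolve_cast(records, field, caster):
    """Sketch of a cast-style resolution: force one type onto an ambiguous field."""
    return [{**r, field: caster(r[field])} for r in records]


def resolve_make_cols(records, field):
    """Sketch of a split-style resolution: one column per observed Python type."""
    out = []
    for r in records:
        rest = {k: v for k, v in r.items() if k != field}
        rest[f"{field}_{type(r[field]).__name__}"] = r[field]
        out.append(rest)
    return out


# "price" is ambiguous: an int in one record, a string in the other.
rows = [{"id": 1, "price": 10}, {"id": 2, "price": "12"}]
as_int = resolve_cast(rows, "price", int)
split = resolve_make_cols(rows, "price")
```

Either way the result is a new collection and the input records are left untouched, which mirrors the "returns a new one" behavior described in the answer.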
How do you update your Glue Data Catalog?
You can re-run the crawler or have a script use enableUpdateCatalog / updateBehavior
What are AWS Glue Development endpoints?
They allow you to use a notebook to develop your ETL script. They are launched in a VPC and can be used with SageMaker notebook or Zeppelin.
What do AWS Glue job bookmarks do?
They keep track of where the job left off so you are not reprocessing old data. They work with S3 sources and relational databases. For databases they only pick up new rows, not updated ones, and the primary key needs to be sequential.
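The bookmark behavior for databases can be sketched in plain Python. This is not the Glue API. `Bookmark` and `process_new_rows` are hypothetical names, and the sketch assumes each row has a sequential integer primary key called `id`.

```python
from dataclasses import dataclass


@dataclass
class Bookmark:
    """Hypothetical stand-in for the state persisted between job runs."""
    last_key: int = 0  # highest primary key processed so far


def process_new_rows(rows, bookmark):
    """Return only rows whose key is past the bookmark, then advance it."""
    new_rows = [r for r in rows if r["id"] > bookmark.last_key]
    if new_rows:
        bookmark.last_key = max(r["id"] for r in new_rows)
    return new_rows


table = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
bm = Bookmark()
first_run = process_new_rows(table, bm)    # both rows are new

table[0]["v"] = "a-updated"                # an update to an already-seen row
table.append({"id": 3, "v": "c"})          # a genuinely new row
second_run = process_new_rows(table, bm)   # only id 3 is picked up
```

The second run shows why updated rows are missed: the updated row still has a key at or below the bookmark.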
Can you start a step function from an AWS Glue event?
Yes
How are you billed using AWS Glue?
You are billed by the second.
How are AWS Glue development endpoints billed?
By the minute
If you want to use engines like Hive or Pig, what AWS service is the best fit?
EMR. Glue is based on Spark
Can AWS Glue process streaming data?
Yes. It can do this from Kinesis or Kafka.
Can AWS Glue clean and transform streaming data in-flight?
Yes
What is AWS Glue Studio?
It is a visual interface for ETL workflows
Where can you view the status of AWS Glue Jobs running?
In the Glue Studio Monitoring console
What is AWS Glue Data Quality?
It evaluates the data based on rules that you set. It uses DQDL for custom development.
What is AWS Glue Data Brew?
A visual preparation tool for transforming data.
What are Glue Data Brew sources?
S3, data warehouse, database
Where does Glue Data Brew output data?
S3
What is a recipe in Data Brew?
It is a saved set of transformations that can be applied to any dataset
Can you define data quality rules in Data Brew?
Yes
How does Redshift and Snowflake integrate with Data Brew?
You can use custom SQL to create datasets
Does Data brew integrate with KMS?
Yes, but only with customer master keys (CMKs) in KMS.
Can you schedule a job in Data Brew?
Yes
How can you remove PII in Data Brew?
Substitution
Shuffling
Deterministic encryption
Probabilistic encryption
Null out or delete
Mask out
Hash
What are AWS EventBridge batch conditions?
They make a trigger fire only after a specified number of events has been received or a specified number of seconds has passed.
What is AWS Lake Formation?
It makes it easy to set up a secure data lake in days.
What can you do in Lake Formation?
Anything that you can do in Glue. It is built on Glue.
What AWS services can query Lake Formation?
Athena, Redshift, and EMR
Can you have multiple accounts accessing Lake Formation?
Yes. The recipient must be a data lake administrator. You can leverage AWS RAM for this as well.
Does Lake Formation support manifests?
No
What are AWS Lake Formation Governed Tables?
They support ACID transactions across multiple tables. This cannot be changed once enabled. Also works with Kinesis streaming data.
How does Lake Formation optimize storage performance?
Automatic compaction
How can you control access to Lake Formation data?
Granular row and column level access
Other than IAM, what else can Lake Formation tie into for permissions?
SAML or external AWS accounts
What are Lake Formation policy tags attached to?
Databases, tables, or columns
What are AWS Lake Formation Filters?
They provide column, row, or cell level security. This is done when granting SELECT permissions on tables.
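The effect of such a filter can be sketched in plain Python. This is only an illustration of the concept, not how Lake Formation is implemented or configured. `apply_data_filter` and the sample table are hypothetical.

```python
def apply_data_filter(rows, allowed_columns, row_predicate):
    """Keep rows that pass the predicate, then keep only the allowed columns."""
    return [
        {col: row[col] for col in allowed_columns}
        for row in rows
        if row_predicate(row)
    ]


customers = [
    {"id": 1, "country": "US", "email": "a@example.com"},
    {"id": 2, "country": "DE", "email": "b@example.com"},
    {"id": 3, "country": "US", "email": "c@example.com"},
]

# Hypothetical filter: US rows only, with the email column hidden.
visible = apply_data_filter(
    customers,
    allowed_columns=["id", "country"],
    row_predicate=lambda row: row["country"] == "US",
)
```

Combining a row predicate with a column list is what gives cell-level security: a given cell is visible only if both its row and its column are allowed.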
What is AWS Athena?
A query service for your data in S3. All of the data stays in S3.
Is Athena serverless?
Yes
What data formats are splittable for parallel processing in Athena?
ORC, Parquet, and Avro
What are Athena Workgroups?
They organize users, teams, and applications into groups. You can control access and track costs by workgroup. They integrate with IAM, CloudWatch, and SNS.
Can you set query limits in Athena by using workgroups?
Yes. You can limit how much data a query or a workgroup is allowed to scan.
Are Athena canceled queries billable?
Yes. Only failed queries are not billable.
Are CREATE / ALTER / DROP queries billable in Athena?
No
What can you do to save money querying data with Athena?
Use a columnar format such as ORC or Parquet. You will scan less data.
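Because Athena bills by data scanned, the saving can be estimated with simple arithmetic. The sketch below assumes a price of 5 USD per TB scanned, which is an assumption to check against current pricing, and a hypothetical table of 20 equally sized columns where a query reads 3 of them.

```python
TB = 1024 ** 4
ASSUMED_PRICE_PER_TB_USD = 5.0  # assumption; check current Athena pricing


def query_cost_usd(bytes_scanned, price_per_tb=ASSUMED_PRICE_PER_TB_USD):
    """Estimated query cost from the number of bytes scanned."""
    return bytes_scanned / TB * price_per_tb


# Row-oriented file: the whole 1 TB is scanned whatever columns are selected.
full_scan = query_cost_usd(1 * TB)

# Columnar file: only the 3 selected columns out of 20 are scanned.
columnar_scan = query_cost_usd(1 * TB * 3 // 20)
```

Under these assumptions the columnar query costs roughly 15% of the full scan, before any extra saving from compression.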
Do large files perform better in Athena?
Yes. A small number of large files performs better than a large number of small files.
What should you run when you partition after the fact in Athena?
Run MSCK REPAIR TABLE
If you want to ensure your table is ACID compliant in Athena, what table type should you use?
ICEBERG
What are Athena time travel operations?
You can recover data recently deleted with a SELECT statement
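What should you do if your ACID transactions in Athena are getting slower over time?
Run the OPTIMIZE table REWRITE DATA USING BIN_PACK command, optionally with a WHERE clause (for example, WHERE catalog = N), to compact the table's files.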
How granular does Athena get with permissions?
Database and table level.
What can you use to query Spark directly?
Spark SQL
Does Spark have machine learning capabilities?
Yes, using MLlib
Can you process streaming data with Spark?
Yes. It integrates with Kinesis and Kafka
Can you change data formats in Athena?
Yes, using CTAS and the format attribute.
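A CTAS statement of this kind can be sketched as a string built in Python. The table names are hypothetical, and the helper does no identifier escaping, so it is only suitable for trusted input.

```python
def ctas_statement(new_table, source_table, data_format="PARQUET"):
    """Build a CTAS query that rewrites source_table in another format."""
    return (
        f"CREATE TABLE {new_table} "
        f"WITH (format = '{data_format}') "
        f"AS SELECT * FROM {source_table}"
    )


# Hypothetical table names: convert a CSV-backed table to Parquet.
statement = ctas_statement("logs_parquet", "logs_csv")
```

The resulting string would then be submitted to Athena like any other query.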
What is Spark Structured Streaming?
It just keeps appending to a table and you query by using windows of time.
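Querying by windows of time can be pictured with a plain-Python sketch of tumbling windows. This does not use Spark. `tumbling_window_counts` is a hypothetical helper, and the events are made-up `(timestamp_seconds, value)` pairs.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds):
    """Count events per fixed, non-overlapping time window."""
    counts = Counter()
    for timestamp, _value in events:
        window_start = timestamp - (timestamp % window_seconds)
        counts[window_start] += 1
    return dict(counts)


# Events keep being appended; each one falls into exactly one 10-second window.
events = [(0, "a"), (4, "b"), (10, "c"), (19, "d"), (20, "e")]
counts = tumbling_window_counts(events, window_seconds=10)
```

In Spark the same grouping would be expressed over a streaming DataFrame rather than a Python list, but the windowing arithmetic is the same idea.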
Can Spark support Redshift?
Yes
Can you run a Jupyter notebook with Spark within the Athena console?
Yes
What is AWS EMR?
A managed Hadoop framework that runs on EC2. Includes Spark, HBase, Presto, Flink, Hive, and more.
What are EMR notebooks?
Browser based development in a notebook.
What are the EMR node types?
Master, Core, and Task node
Where does data persist in EMR?
On the core nodes in HDFS
Do EMR task nodes store data?
No
What is a good strategy to reduce EMR costs?
Use spot instances for task nodes since they do not persist data.
What is a transient EMR cluster?
One that terminates once all the steps are complete. Good for cost savings.
What is a long running cluster?
One that must be manually terminated. A good use of reserved instances for cost savings.
When do you configure frameworks and applications in EMR?
When the cluster is launched.
How can you run EMR jobs directly?
By connecting to the master node or submitting jobs via ordered steps in the console.
When using EMR, where can you store data for it to persist?
S3
How can you schedule the start of your EMR cluster?
The AWS Data Pipeline
Is HDFS persistent?
No.
What is the default block size in HDFS?
128MB
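The block size determines how many blocks a file is split into. A minimal sketch of that arithmetic, with `hdfs_block_count` as a hypothetical helper and 128 MB taken as 128 * 1024 * 1024 bytes:

```python
DEFAULT_BLOCK_SIZE = 128 * 1024 ** 2  # 128 MB in bytes


def hdfs_block_count(file_size_bytes, block_size=DEFAULT_BLOCK_SIZE):
    """Number of blocks needed to hold a file (integer ceiling division)."""
    return -(-file_size_bytes // block_size)


one_gib = 1024 ** 3
exact = hdfs_block_count(one_gib)          # divides evenly into full blocks
spill = hdfs_block_count(one_gib + 1)      # one extra byte needs one more block
```

The last block of a file is usually only partly filled, which is one reason many tiny files are wasteful.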
What is EMRFS?
It allows you to access S3 as if it were HDFS.
What is EMRFS Consistent View?
An option that was used for consistency and relies on DynamoDB to track it. It matters less now that S3 itself is strongly consistent (as of 2021).
Can you use EBS for HDFS?
Yes. It will be ephemeral though. Can only be attached when launching the cluster.
How is EMR billed?
Per second, with a one-minute minimum. Prices are quoted as hourly rates.
How do you increase processing capacity in EMR?
You can add task nodes on the fly as long as you do not also need to increase storage capacity.
How do you increase processing and storage capacity in EMR?
Resize the cluster core nodes.
What is EMR Managed Scaling?
Adds core nodes and then task nodes up to the max units specified. It also scales down to your configured value.
When scaling down, which EMR nodes get removed first?
Spot nodes (task and then core)
Can you specify the resources needed for your job in EMR Serverless?
Yes. Without configuring this, EMR will calculate the value on its own.
Is EMR multi-region?
No
How does an EMR Serverless application lifecycle move from step to step?
Through API calls. This is not automatic.
Can EMR run on EKS?
Yes. It can run alongside other applications.
What is a record made up of in Kinesis Data Streams when it is sent from the producer?
A partition key and a data blob (up to 1 MB)
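The partition key decides which shard a record lands on. As far as the Kinesis documentation describes it, the key is hashed with MD5 into a 128-bit integer and each shard owns a range of that hash space. The sketch below assumes the shards split the space evenly, which is only true for a freshly created stream. `hash_key` and `shard_for` are hypothetical helpers.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit value


def hash_key(partition_key):
    """Map a partition key to an integer in the 128-bit hash space."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def shard_for(partition_key, shard_count):
    """Pick a shard index, assuming shards split the hash space evenly."""
    width = HASH_SPACE // shard_count
    return min(hash_key(partition_key) // width, shard_count - 1)


# The same key always maps to the same shard, which preserves per-key ordering.
first = shard_for("user-42", shard_count=4)
second = shard_for("user-42", shard_count=4)
```

This is also why a single very busy partition key produces a hot shard: all of its records hash to the same place.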
How fast is a shard in Kinesis Data Streams when being sent from the producer to the stream?
1 MB per second or 1,000 messages per second per shard.
What is a record made up of in Kinesis Data Streams when it is sent to the consumer?
A partition key, sequence number, and a data blob (up to 1 MB)
How fast is a shard in Kinesis Data Streams when being sent from the stream to the consumer?
Shared Mode
2 MB per second per shard, shared across all consumers.
Enhanced Mode
2MB per second per shard per consumer
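These per-shard limits can be turned into a rough shard estimate. The sketch assumes 1 MB/s and 1,000 records/s in per shard and 2 MB/s out per shard, treats 1 MB as 1024 KB, and assumes that in shared mode every consumer reads the full stream. `shards_needed` and the workload numbers are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def shards_needed(write_kb_s, records_s, consumers, enhanced_fan_out=False):
    """Estimate shard count from write volume, record rate, and read demand."""
    by_write = ceil_div(write_kb_s, 1024)       # 1 MB/s in per shard
    by_count = ceil_div(records_s, 1000)        # 1,000 records/s in per shard
    # Shared mode: all consumers split the same 2 MB/s per shard.
    # Enhanced mode: each consumer gets its own 2 MB/s per shard.
    read_kb_s = write_kb_s if enhanced_fan_out else write_kb_s * consumers
    by_read = ceil_div(read_kb_s, 2048)
    return max(1, by_write, by_count, by_read)


shared = shards_needed(5000, 3000, consumers=3)
fan_out = shards_needed(5000, 3000, consumers=3, enhanced_fan_out=True)
```

With three consumers the shared-mode estimate is driven by read demand, while with enhanced fan-out it is driven by the write side.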
What is the maximum retention for a Kinesis Data Stream?
365 days. Retention can be set anywhere between 1 and 365 days.
Can you replay data in Kinesis?
Yes.
What is the provisioned capacity mode in Kinesis Data Streams?
You choose the number of shards and scale manually.
What is the on-demand capacity mode in Kinesis Data Streams?
It automatically scales based on observed throughput. The default capacity is 4 MB per second or 4,000 records per second.
How can you increase throughput using the Kinesis Producer SDK?
Use PutRecords for batching.
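Batching means grouping records before each call. A minimal sketch, assuming a cap of 500 records per request (size limits per request are ignored here) and using a hypothetical `batch_records` helper:

```python
def batch_records(records, max_batch=500):
    """Yield consecutive slices of at most max_batch records."""
    for start in range(0, len(records), max_batch):
        yield records[start:start + max_batch]


# Hypothetical payloads shaped like PutRecords entries.
records = [
    {"Data": f"event-{i}".encode("utf-8"), "PartitionKey": f"user-{i % 10}"}
    for i in range(1200)
]
batch_sizes = [len(batch) for batch in batch_records(records)]
```

Each batch would then be sent in one call, for example with boto3's `put_records(StreamName=..., Records=batch)`, and any entries reported as failed in the response would need to be retried.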
What is the best use case for the Kinesis Producer SDK?
Low throughput, higher latency.
What managed AWS sources send to Kinesis Data Streams?
CloudWatch, AWS IoT, and Kinesis Data Analytics
What APIs are included in the Kinesis Producer Library?
Synchronous and asynchronous
Does Kinesis Producer Library support record compression?
No
How do you add delay in Kinesis Producer Library batching?
RecordMaxBufferedTime
Can Apache Spark consume Kinesis Data Streams?
Yes
What is the maximum amount of data returned by the Kinesis SDK GetRecords function?
10MB or up to 10000 records
What is the Maximum GetRecords API calls per shard per second?
5
What is checkpointing in the Kinesis Client Library?
It marks your progress.
When you are checkpointing using the KCL and you receive the ExpiredIteratorException, what does this mean?
You need to increase the WCU of DynamoDB
What is the Kinesis Connector Library?
It sends data to S3, DynamoDB, Redshift, OpenSearch, etc. It lives on an EC2 instance. Kind of deprecated.
Why is Kinesis Enhanced Fanout fast?
It uses HTTP/2 to push to consumers.
What is the latency when Kinesis Enhanced Fanout is enabled?
Less than 70ms
When should you use Kinesis Standard Consumers?
When there is a low number of consumers
You can tolerate 200ms latency
Cost effective
When should you use Kinesis Enhanced Fan Out Consumers?
When you have multiple consumer applications for the same stream
Low Latency
Higher Cost
What is the default limit of consumers per data stream when using enhanced fan-out in Kinesis?
20, but you can submit a service request to increase it.
What happens when you split a hot shard in Kinesis?
Two new shards are created
The old shard will go away when the data expires
What happens when you merge a shard in Kinesis?
One shard is created
The old shards will go away when the data expires
What can cause out of order shards in Kinesis?
Resharding can cause this. Make sure you read entirely from the parent shard before reading from the new child shards. This is built into the KCL.
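The parent-first rule can be sketched as an ordering problem. This is not KCL code. `read_order` is a hypothetical helper, the shard names are made up, and the sketch models a single parent per shard (a merge has two parents, which would need a list instead).

```python
def read_order(parent_of):
    """Order shards so that every parent comes before its children.

    parent_of maps a shard id to its parent shard id, or None.
    """
    ordered, seen = [], set()

    def visit(shard):
        if shard in seen:
            return
        parent = parent_of.get(shard)
        # Skip parents that are no longer listed (already expired).
        if parent is not None and parent in parent_of:
            visit(parent)
        seen.add(shard)
        ordered.append(shard)

    for shard in sorted(parent_of):
        visit(shard)
    return ordered


# A split: "parent" produced "child-a" and "child-b"; "other" is unrelated.
lineage = {"child-a": "parent", "child-b": "parent", "parent": None, "other": None}
order = read_order(lineage)
```

A consumer following this order drains the parent completely before it touches either child, so records for a given key are still seen in sequence.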
Can Kinesis Resharding be done in Parallel?
No
How many resharding operations can be performed at once?
One. This is a problem when you have thousands of shards.
What can cause duplicates from your Kinesis Producer?
Network Timeouts. Use unique IDs to deduplicate records on the consumer side.
What use cases can cause a consumer duplicate in Kinesis?
A worker terminates unexpectedly
A worker instance is added or removed
Shards are merged or split
The application is deployed
What can you do to fix duplicate consumer records in Kinesis?
Make your application idempotent
Handle duplicates at the final destination
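An idempotent consumer can be sketched by tracking the unique IDs it has already applied. The `consume` helper, the record shape, and the in-memory `seen` set are all hypothetical. A real application would keep the seen IDs in durable storage or rely on upserts at the destination.

```python
def consume(records, seen_ids, sink):
    """Apply each record at most once, keyed by its unique id."""
    for record in records:
        if record["id"] in seen_ids:
            continue  # duplicate delivery: already applied, so skip it
        seen_ids.add(record["id"])
        sink.append(record["payload"])


seen, sink = set(), []
consume([{"id": "a", "payload": 1}, {"id": "b", "payload": 2}], seen, sink)
# After a worker restart, record "b" is delivered again along with "c".
consume([{"id": "b", "payload": 2}, {"id": "c", "payload": 3}], seen, sink)
```

Replaying the second batch any number of times leaves the sink unchanged, which is what idempotent means here.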
When using a Kinesis Data Stream, how can you transform the data before storing it in S3?
With a Lambda in Kinesis Data Firehose
Can Kinesis Firehose write to Redshift?
Yes. It loads to S3 first and then issues a COPY command.
Can Kinesis Data Firehose write to OpenSearch?
Yes
Can Firehose deliver to custom locations?
Yes as long as there is an HTTP endpoint
Can you store data sent into Kinesis Firehose?
Yes. Either all data or only failed data can be backed up to an S3 bucket, in addition to the batch write to the destination.
What is the minimum latency for Firehose?
60 seconds
Can Firehose perform data conversions?
Limited, but yes. JSON to Parquet or ORC, but only for S3. Other conversions are done using Lambda.
Can Firehose compress your data before sending it to S3?
Yes, using GZIP, ZIP, or Snappy
Can Spark or Kinesis Client Library read from Data Firehose?
No
What determines when records are sent in Kinesis Data Firehose?
The buffer size and buffer time. Whichever limit is hit first.
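The "whichever limit is hit first" rule can be sketched with a small buffer class. This is an illustration only, not Firehose's implementation. `DeliveryBuffer` is hypothetical, and for simplicity it checks the time limit only when a record arrives rather than on a background timer.

```python
class DeliveryBuffer:
    """Flush when either the size limit or the time limit is reached."""

    def __init__(self, max_bytes, max_seconds):
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.records, self.size, self.started = [], 0, None

    def add(self, record, now):
        """Buffer one record at time `now`; return a batch if a limit was hit."""
        if self.started is None:
            self.started = now  # the clock starts with the first buffered record
        self.records.append(record)
        self.size += len(record)
        size_hit = self.size >= self.max_bytes
        time_hit = now - self.started >= self.max_seconds
        if size_hit or time_hit:
            batch = self.records
            self.records, self.size, self.started = [], 0, None
            return batch
        return None


buf = DeliveryBuffer(max_bytes=10, max_seconds=60)
r1 = buf.add(b"aaaa", now=0)        # under both limits
r2 = buf.add(b"bbbbbbb", now=1)     # size limit reached first
r3 = buf.add(b"c", now=5)           # new buffer starts
r4 = buf.add(b"d", now=70)          # time limit reached first
```

High-volume streams tend to flush on size and low-volume streams on time, which is why both settings affect delivery latency.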
What are the minimum values for Kinesis buffer size and time?
Buffer size is a few MB
Buffer time is 1 minute
If you need real-time data made searchable using Kinesis, what would you use?
Kinesis streams with a Lambda to send the data to OpenSearch
What is a CloudWatch subscription filter?
A subscription filter allows you to connect to other AWS services like Lambda, Data Streams, etc.
Can Kinesis Data Analytics send to Lambda?
Yes. This can be used to encrypt, translate to another format, aggregate rows, etc.
What can Kinesis Data Analytics integrate with that Firehose cannot?
DynamoDB, Aurora, SNS, SQS, CloudWatch
What is Kinesis Data Analytics now called?
Managed Service for Apache Flink
What can you use in Managed Service for Apache Flink to access SQL?
Table API
What are some good use cases for Managed Service for Apache Flink?
Streaming ETL
Continuous metric generation
Responsive analytics
What is Kinesis Analytics Schema Discovery?
It analyzes the incoming data and infers its schema in real time.
What is Kinesis Data Analytics RANDOM_CUT_FOREST?
It detects anomalies in your data.
What is AWS MSK?
Managed streaming for Apache Kafka. An alternative to Kinesis.
What is the maximum message size for AWS MSK?
10 MB when configured for it (the default is 1 MB). This is much larger than Kinesis Data Streams at 1 MB.
Can you persist data in AWS MSK?
Yes. It uses EBS volumes and is more flexible than Kinesis Data Streams.
Can you control who writes to a topic in AWS MSK?
Yes. This can be done using:
Mutual TLS and Kafka ACLs
IAM Access control
SASL/SCRAM and Kafka ACLs
What is AWS MSK Connect?
It allows you to connect to other AWS services for delivery such as S3, Redshift, OpenSearch, etc.
Can AWS MSK be Serverless?
Yes.
What is AWS OpenSearch?
The service formerly known as Amazon Elasticsearch Service (OpenSearch is a fork of Elasticsearch). Petabyte scale analysis and reporting.
What are good use cases for OpenSearch?
Full-Text searching
Log analytics
Application Monitoring
Security Analytics
What are Types in OpenSearch?
They define the schema and mapping shared by documents.
What are Indices in OpenSearch?
An index is a collection of documents. It contains inverted indices that let you search across everything within it at once.
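An inverted index maps each term to the documents that contain it. The plain-Python sketch below is a toy version with whitespace tokenization only. `build_inverted_index` and `search` are hypothetical helpers and say nothing about how OpenSearch stores its indices internally.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    1: "error in payment service",
    2: "payment completed",
    3: "service restarted after error",
}
index = build_inverted_index(docs)
hits = search(index, "error service")
```

A query only touches the entries for its own terms, which is why full-text search stays fast as the number of documents grows.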
What is the structure of an index in OpenSearch?
They are split into shards and documents are hashed to a particular shard. Shards can be on different nodes in a cluster.
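Hashing a document to a shard can be sketched as a hash of the document id modulo the number of primary shards. CRC32 is used here purely as a stand-in for a stable hash. The real engine uses its own hash function and routing rules, and `shard_for_document` is a hypothetical helper.

```python
import zlib


def shard_for_document(doc_id, primary_shards):
    """Pick a shard by hashing the document id (CRC32 as a stand-in hash)."""
    return zlib.crc32(doc_id.encode("utf-8")) % primary_shards


doc_ids = [f"doc-{i}" for i in range(100)]
placement = {doc_id: shard_for_document(doc_id, 5) for doc_id in doc_ids}
shards_used = set(placement.values())
```

Because the shard is derived from the shard count, changing the number of primary shards would move documents around, so that number is normally fixed when the index is created.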
Can you offload reads in OpenSearch?
Yes, using replicas.
What is a domain in OpenSearch?
It is essentially the cluster.
How can you back your data up in OpenSearch?
Snapshot to S3
Does OpenSearch support resource or identity based policies?
Both. It also supports request signing and IP based policies.
How can you allow external users to access OpenSearch inside a VPC?
Using Cognito / SAML, Reverse Proxy, SSH, VPC Direct Connect, or a VPN
What type of storage does an OpenSearch data node use by default?
Hot storage. This is an instance store or EBS volume.
What is UltraWarm storage in OpenSearch?
It uses S3 and caching
Best for indices with few writes (log data / immutable data)
Slower performance
Requires a dedicated master node
What is Cold storage in OpenSearch?
Uses S3
Best for periodic research or forensic analysis on older data
Must have dedicated master node
UltraWarm must also be enabled
Can data in OpenSearch be migrated between storage types?
Yes
What is Index State Management in OpenSearch?
It automates index management policies:
Automates snapshots
Deletes indices over a period of time
Moves indices from hot to cold over time
Reduces replica count
How often are index state management policies run in OpenSearch?
Every 30 - 48 minutes
What are index Rollups in OpenSearch?
They roll up old data into summarized indices. New index may have fewer fields. Good to save on storage.
What are index transforms in OpenSearch?
Like rollups, but purpose is to create a different view to analyze the data differently.
Can you replicate data across clusters in OpenSearch?
Yes.
What is a follower index in OpenSearch?
It pulls from the leader index to replicate data.
How do you copy indices from cluster to cluster on demand in OpenSearch?
Remote Reindex
What is the best practice for master nodes?
Have three
What should you do when you see JVMMemoryPressure errors in OpenSearch?
Delete old or unused indices.
What is a big pro for OpenSearch Serverless?
On-Demand autoscaling
What are the two collection types in OpenSearch Serverless?
search or time series
What are the sources of QuickSight?
Redshift, Aurora, RDS, Athena, OpenSearch, IoT Analytics, your own database, and raw files such as CSV, Excel, and log files.
Can QuickSight perform ETL?
Very light ETL.
What is QuickSight SPICE?
Your datasets get imported into SPICE. Each user gets 10 GB of SPICE. It accelerates large queries.
What happens when importing data from Athena to SPICE takes more than 30 minutes?
It times out.
What is a good use case for QuickSight?
Ad-hoc exploration and visualization
Dashboards and KPIs
Does QuickSight support MFA?
Yes
Does QuickSight support row and column level security?
Yes. Row level security is available in standard, but column level security is only available in the enterprise edition.
What data security permissions need to be added to QuickSight?
You need to make sure QuickSight can access your data.
You need to create IAM policies that restrict what data in S3 users can see.
Can QuickSight access Redshift data in other regions?
No. QuickSight can only access Redshift data in the same region.
How do you access Redshift data in another region using QuickSight Standard?
Use an inbound security group rule to allow access to Redshift from the QuickSight IP range.
If you want to keep QuickSight in a private VPC, what version do you need?
Enterprise Edition
How do you access Redshift data in another region using QuickSight Enterprise?
Use private subnets and peering connections. Route tables tie it together. This can also be used for cross-account access using Transit Gateway.
If you want to use an Active Directory connector for QuickSight, what edition do you need?
Enterprise edition
Can you use customer managed keys in QuickSight?
No. Enterprise edition allows you to use KMS.
What is QuickSight Q?
An NLP interface on top of QuickSight
Can SPICE be added to a user?
Yes. It is billed per additional GB of SPICE needed.
Is encryption at-rest included in the standard version of QuickSight?
No
Can you embed dashboards into 3rd party apps using QuickSight?
Yes, using the JavaScript SDK
What needs to be done for embedded dashboards to work on a 3rd party site using QuickSight?
Domain Whitelisting
What ML capabilities does QuickSight have?
Anomaly detection
Forecasting - seasonality and trends over time. It also imputes missing values.
Autonarratives - a story of your data in paragraph format.
Suggested insights - helps decide which feature is right for your dataset.