Azure DP-201 Flashcards
What are the tiers of Azure Blob Storage?
- Hot: frequently accessed data; highest storage cost, lowest access cost
- Cool: infrequently accessed data stored for at least 30 days; lower storage cost, higher access cost
- Archive: rarely accessed data stored for at least 180 days; lowest storage cost, highest retrieval cost (data is offline)
What is the recommended file size for Azure Data Lake Storage Gen1 when POSIX permissions are required and diagnostics logging is enabled for auditing?
250 MB or greater
What is horizontal partitioning?
aka Sharding
Data is partitioned horizontally to distribute rows across a scaled-out data tier. The schema is identical on all participating databases.
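A minimal Python sketch of the routing idea (server names and the shard key are hypothetical): every shard holds the same schema, and a stable hash of the shard key decides which shard a row lives on.

```python
import hashlib

# Hypothetical shard catalog: identical schema on every participating database.
SHARDS = [
    "shard0.database.windows.net",
    "shard1.database.windows.net",
    "shard2.database.windows.net",
]

def shard_for(customer_id: str) -> str:
    """Route a row by a stable hash of its shard key (customer_id here)."""
    digest = hashlib.md5(customer_id.encode()).digest()
    return SHARDS[int.from_bytes(digest[:4], "big") % len(SHARDS)]

print(shard_for("C-1001"))  # every row for C-1001 lands on the same shard
```

hashlib is used instead of Python's built-in hash() because the built-in is salted per process and would route the same key differently across runs.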
*** Which data storage solution should you recommend if you need to represent data by using nodes and relationships in graph structures?
Cosmos DB (Gremlin API)
What are the distribution types for tables in Azure Synapse Analytics?
Hash-distributed
Round-robin
Replicate
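A rough illustration (not Synapse's actual internals) of how the three types place rows; a dedicated SQL pool spreads every table across 60 distributions:

```python
import hashlib
from itertools import count

N_DISTRIBUTIONS = 60  # fixed number in a Synapse dedicated SQL pool

def hash_distribution(key: str) -> int:
    # Hash-distributed: the same key always lands in the same distribution,
    # so joins/aggregations on that key avoid data movement.
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % N_DISTRIBUTIONS

_rr = count()
def round_robin_distribution() -> int:
    # Round-robin: rows are dealt out evenly regardless of content --
    # fastest to load, but joins may need to shuffle data later.
    return next(_rr) % N_DISTRIBUTIONS

# Replicate: a full copy of the table is cached on every compute node
# (best for small dimension tables), so there is nothing to route.
```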
What is Azure Synapse Analytics?
Formerly Azure SQL Data Warehouse.
Azure Synapse is an analytics service that brings together enterprise data warehousing and Big Data analytics.
In Azure Databricks, how would you keep an interactive cluster configuration even after it has been terminated for more than 30 days?
an administrator can pin a cluster to the cluster list
What are the core storage services in the Azure Storage platform?
- Azure Blobs
- Azure Files
- Azure Queues
- Azure Tables
- Azure Disks
Choosing Data Abstraction methods:
https://docs.microsoft.com/en-us/azure/hdinsight/spark/optimize-data-storage#choose-data-abstraction
What is the best data format for Spark jobs?
Parquet
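A quick PySpark sketch (paths are hypothetical) of converting raw CSV to Parquet; because Parquet is columnar, compressed, and splittable, later jobs read only the columns they need:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read raw CSV, then persist as Parquet for downstream Spark jobs.
df = spark.read.csv("/mnt/raw/trips.csv", header=True, inferSchema=True)
df.write.mode("overwrite").parquet("/mnt/curated/trips")

trips = spark.read.parquet("/mnt/curated/trips")  # column-pruned, fast reads
```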
Datasets vs. DataFrames
DataFrames:
- Best choice in most situations.
- Provides query optimization through Catalyst (see the sketch after this list).
- Whole-stage code generation.
- Direct memory access.
- Low garbage collection (GC) overhead.
- Not as developer-friendly as Datasets: no compile-time checks or domain object programming.
Datasets:
- Good in complex ETL pipelines where the performance impact is acceptable.
- Not good in aggregations where the performance impact can be considerable.
- Provides query optimization through Catalyst.
- Developer-friendly: provides domain object programming and compile-time checks.
- Adds serialization/deserialization overhead.
- High GC overhead.
- Breaks whole-stage code generation.
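To see Catalyst and whole-stage codegen at work on a DataFrame, print the query plans; a small PySpark sketch (the column names are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000).withColumnRenamed("id", "n")

# Catalyst rewrites this logical plan (e.g. pushing the filter down)
# before whole-stage code generation compiles it to JVM bytecode.
result = df.selectExpr("n", "n * 2 AS doubled").filter("n > 10")
result.explain(True)  # parsed, analyzed, optimized, and physical plans
```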
What data models does Cosmos DB support?
document, key-value, graph, and column-family data models.
You work for a transportation logistics company. You are incurring large costs in the transformation step of your big data architecture. What is a possible way to reduce this cost?
Use PolyBase.
PolyBase allows for ELT instead of ETL: data is loaded first, then transformed inside the warehouse, removing the separate transformation step.
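A sketch of the ELT pattern (connection details and table names are hypothetical; the T-SQL is submitted from Python with pyodbc): land raw files in Blob storage, expose them as an external table via PolyBase, then transform with a CTAS inside the warehouse:

```python
import pyodbc

# Hypothetical connection to a Synapse dedicated SQL pool.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myworkspace.sql.azuresynapse.net;DATABASE=mydw;UID=loader;PWD=<secret>"
)

# ELT: ext.TripsRaw is an external table over files already in Blob storage;
# the transform runs inside the warehouse as part of the load (CTAS).
conn.execute("""
    CREATE TABLE dbo.TripsClean
    WITH (DISTRIBUTION = ROUND_ROBIN)
    AS
    SELECT CAST(fare AS DECIMAL(10,2)) AS fare, pickup_datetime
    FROM ext.TripsRaw;
""")
conn.commit()
```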
What are two benefits of Databricks?
- It can utilize multiple APIs.
- It can visualize individual pieces of code.
What is Data Masking?
A way to hide sensitive data from users who should not have access to it.
Examples: Social Security number, credit card number
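In Azure SQL Database this is typically done with dynamic data masking; a sketch (the table and column names are hypothetical, T-SQL sent via pyodbc):

```python
import pyodbc

conn = pyodbc.connect("DSN=mydb")  # hypothetical DSN

# Mask all but the last four digits of the SSN for non-privileged users;
# the stored data is unchanged, only query results are masked.
conn.execute("""
    ALTER TABLE dbo.Customers
    ALTER COLUMN SSN ADD MASKED WITH (FUNCTION = 'partial(0,"XXX-XX-",4)');
""")
conn.commit()
```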
What are reasons to use Data Masking?
- Protect non-production data
- Protect against insider threats
- Comply with regulatory requirements
What are use cases for SQL Database Auditing?
- Retain Audit Trails (see who has accessed the service)
- Report on event activity (visualize audit trails)
- Analyze (spot trends or unusual activity)
You work for a retail sales chain. Your marketing department needs to access client data to design marketing promotions. Concerns have been raised about access to the data. What is the most appropriate solution to protect the data and allow the marketing department to function?
Data Masking
This would protect sensitive data while still granting the marketing department access.
What is defense in depth?
A layered approach to security: multiple independent layers protect the data, rather than a single all-or-nothing perimeter defense.
What is the difference between Blob Storage and Data Lake Storage Gen2?
Data Lake Storage Gen2 has a hierarchical namespace: objects and files are organized into directories and sub-directories, similar to File Explorer on your computer.
What are the two options Azure offers for a relational cloud data store (RDBMS)?
- SQL Database
- Azure Synapse (SQL Data Warehouse)
What Azure big data service is best for transaction processing of relational data?
SQL Database
What are advantages of SQL Database?
- Consistent data that can handle complex queries
- Designed for transactional processing
- Single-source data capture
- Scales vertically
- For relational data
What are advantages of SQL Data Warehouse (Synapse)?
- Parallel processing
- Multiple relational source data capture
- Handles complex queries
- Scales horizontally
What are benefits of Cosmos DB?
- Global replication
- Multi-model
- For non-relational data
What are the 5 levels of consistency for Cosmos DB?
- Strong (strongest consistency; most expensive)
- Bounded Staleness
- Session
- Consistent Prefix
- Eventual (weakest consistency; least expensive)
What are the options for storing non-relational data in Azure?
- Cosmos DB
- Data Lake Gen2
- Blob storage
What are the two types of partitioning in Cosmos DB?
Logical
Physical
Logical Partitions are based on:
Partition Keys
*** What are things to consider when developing a partition key?
- Should be a property that exists on every object
- Anticipate your top queries
- Avoid fan-out (queries that must touch many partitions)
- Keys are immutable; once set, they cannot change (see the sketch after this list)
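A sketch with the azure-cosmos Python SDK (the account, database, and key path are hypothetical) showing these rules applied when creating a container:

```python
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("https://myaccount.documents.azure.com:443/", credential="<key>")
db = client.create_database_if_not_exists("logistics")

# "/customerId" exists on every document, matches the top query
# ("all orders for a customer", avoiding fan-out), and never changes --
# important because the key is immutable once the container exists.
container = db.create_container_if_not_exists(
    id="orders",
    partition_key=PartitionKey(path="/customerId"),
)
```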
What is PolyBase used for?
Importing/exporting data between Azure Blob storage and Synapse (SQL Data Warehouse)
T/F Data Factory can ingest both structured and unstructured data.
True
What is Data Factory?
- An orchestration service.
- Primary method for ingesting data into an Azure architecture.
- Responsible for moving and monitoring the data
T/F Data Factory can be used for both ETL and ELT
True
*** What is a cluster in Databricks?
a group of compute resources
What are the languages available in Databricks?
R, SQL, Python, Scala, Java
T/F Databricks can be used for streaming and batch processing.
True
What is Databricks used for?
Exploration and visualization of data
What are components of Databricks?
Cluster: compute resources
Workspace: “filing cabinet” for Databricks work
Notebooks: “folders” that contain cells
Cells: individual pieces of code
Libraries: packages that provide additional functionality
Tables: where structured data is stored
What are ways to recover from failed queries when streaming in Databricks?
- Enable checkpointing (see the sketch after this list)
- Configure jobs to restart on failure
- Recover after changes to the streaming query
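A minimal PySpark Structured Streaming sketch of checkpointing (the rate source and paths are placeholders); on restart the query resumes from the checkpoint instead of reprocessing everything:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
stream = spark.readStream.format("rate").load()  # demo source

query = (stream.writeStream
    .format("parquet")
    .option("checkpointLocation", "/mnt/checkpoints/rate_demo")  # enables recovery
    .option("path", "/mnt/tables/rate_demo")
    .start())
```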
How do you optimize Databricks jobs using scheduler pools?
Group jobs into pools by weight.
By default, all queries in a notebook run in the same fair scheduler pool and are processed first in, first out (FIFO). Grouping jobs into separate pools with different weights lets more important jobs run first (see the sketch below).
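A sketch of assigning a query to its own pool (the pool name is arbitrary; pool weights themselves are defined in a fair-scheduler allocation file referenced by spark.scheduler.allocation.file):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.readStream.format("rate").load()

# Queries started after this call run in the "critical" pool; queries
# started without it stay in the default pool and no longer compete.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "critical")
query = df.writeStream.format("console").start()
```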
How do you optimize Databricks jobs using configuration settings?
use “compute-optimized” instances.
What are Watermark Policies in Databricks?
A way to set thresholds for late data coming in from input streams.
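A PySpark sketch (the rate source provides a timestamp column; the thresholds are arbitrary): events arriving more than 10 minutes behind the latest event time seen are dropped rather than kept in state:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.getOrCreate()
events = spark.readStream.format("rate").load()  # has a 'timestamp' column

# The watermark bounds how late data may arrive; state for windows older
# than (max event time - 10 minutes) can be dropped, capping memory use.
counts = (events
    .withWatermark("timestamp", "10 minutes")
    .groupBy(window("timestamp", "5 minutes"))
    .count())
```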
What are methods to optimize streaming in Databricks?
- Enable autoscaling
- Optimize configuration settings
- Group jobs into pools by weight
- Recover from query failures
*** How are you charged for Cosmos DB?
- Storage (GB consumed)
- Throughput (provisioned Request Units per second, RU/s)
Which are appropriate questions for determining what solution should be used for ingesting and moving data?
- How cost sensitive is the project?
- What is the end result of the data?