DP-200 Study Notes Flashcards
What is a data schema?
It’s the skeleton of your data: how the data is structured and how it relates to itself.
What is the difference between a physical schema and a logical schema?
A physical schema describes how the data is actually stored on disk.
The logical schema describes the data as presented to users: tables, views and relationships.
What’s the difference between a snowflake schema and a star schema?
A star schema has one fact table and a single level of dimension tables. It’s denormalised and very easy to query.
A snowflake schema has multiple levels of dimension tables. It’s normalised and slower to query, because the extra levels require more joins.
What’s the general difference between an Azure Data Lake, a SQL Database and a SQL Data Warehouse?
A Data Warehouse is used for complex analytical queries with some strong business incentive.
A Database holds very recent, operational data, such as how many current orders there are in the business.
A Data Lake is where you store all your data in its raw, unrefined (often unstructured) form.
Briefly describe NoSQL
- No Schema
- Best for large amounts of data whose relationships and shape change over time
- Horizontal Scaling (shards out across machines)
- Typically JSON (document), key-value, graph and wide-column data
Briefly describe a Data Lake
Data in its original form, sitting and waiting until it is needed for processing.
Ingest from the source into the Data Lake using Data Factory; Databricks can then be used to process it for reporting.
What are the 5 consistency levels of CosmosDB
Strong, Bounded Staleness, Session, Consistent Prefix, Eventual.
What does it mean that CosmosDB is multimodel?
It supports the SQL (Core) API, MongoDB API, Cassandra API, Gremlin API and Table API.
What is the SLA and latency of CosmosDB?
SLA is 99.999% availability (for multi-region accounts)
Read and write latency is <10ms at the 99th percentile
How many copies of your data in cosmosDB?
Four replicas per region, so copies of your data = 4 × n (where n is the number of regions you are in)
What is the idea behind logical partitions?
Use a partition key that spreads data well across partitions
10 GB limit per logical partition
Throughput costs depend on the number of containers, the choice of partition keys and the workload distribution
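The "spreads well" idea can be sketched in Python: a logical partition key is hashed to pick a physical partition. Cosmos DB's actual hash is internal; md5 here is purely illustrative.

```python
import hashlib

def partition_for(key: str, physical_partitions: int) -> int:
    """Toy sketch: hash the logical partition key to choose a physical
    partition, the way Cosmos DB spreads logical partitions out.
    (Cosmos DB's real hash function is internal; md5 is illustrative.)"""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % physical_partitions

# A high-cardinality key (e.g. user id) spreads load across partitions;
# a low-cardinality key (e.g. country) creates hot partitions.
buckets = [partition_for(f"user-{i}", 4) for i in range(1000)]
print({p: buckets.count(p) for p in range(4)})
```

With 1000 distinct keys the counts come out roughly even — which is exactly what a good partition key buys you.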
What is Azure Table Storage?
It’s a NoSQL key-value store, commonly used for large amounts of structured data that doesn’t require complex joins or queries.
Used when you don’t need global replication (which is when you might use CosmosDB instead)
What is Azure Data Warehouse?
A central repository of non-redundant data
ETL takes data from the lake into the warehouse
Cloud is a good place for this because of the scalability, cost and performance
The fact table contains the measures plus foreign keys to the dimension tables
Dimensions hold the descriptive attributes
How is PolyBase useful for data warehousing in Azure?
PolyBase lets the data warehouse do ELT instead of ETL: you load the data into the warehouse first, then transform it there.
What are the three types of sharding (table distribution) used in a data warehouse?
Hash, Round Robin, Replicate
What’s the difference between hash, round robin and replicate?
Hash has the highest query performance for large tables
Round Robin is the easiest to create and is good for staging tables
Replicate is fastest for small tables (e.g. dimension tables), since each compute node gets a full copy
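The three distribution styles can be sketched as row-to-distribution assignment. A Synapse SQL pool always has 60 distributions (the "rule of 60" below); everything else here is an illustrative model, not the engine's actual algorithm.

```python
from itertools import cycle
from zlib import crc32

DISTRIBUTIONS = 60  # a Synapse SQL pool always spreads data across 60 distributions

def round_robin(rows):
    """Round robin: deal rows out evenly, ignoring content (good for staging)."""
    dist = cycle(range(DISTRIBUTIONS))
    return [(next(dist), row) for row in rows]

def hash_distribute(rows, key):
    """Hash: a given key value always lands on the same distribution, so joins
    and aggregations on that key avoid data movement (large fact tables).
    crc32 stands in for the engine's internal hash."""
    return [(crc32(str(row[key]).encode()) % DISTRIBUTIONS, row) for row in rows]

def replicate(rows):
    """Replicate: a full copy everywhere (small dimension tables);
    modelled here as one copy per distribution for simplicity."""
    return [(d, row) for d in range(DISTRIBUTIONS) for row in rows]
```

Note the trade-off the flashcard hints at: hash needs a well-chosen key, round robin needs none, and replicate multiplies storage by 60 in this model.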
Is Azure SQL Database vertical or horizontal scaling?
It is vertical scaling
What are the three options for Azure SQL Database?
Single Database, elastic pool or managed instance
When do you use a single database in Azure SQL Database?
You use a single database when you want one database with its own dedicated set of resources.
When do you use elastic pool in Azure SQL Database?
Elastic pool is a group of databases that share resources.
What is a managed instance of Azure SQL Database?
A managed instance shares resources similarly to an elastic pool, but offers near-complete functional parity with on-prem SQL Server
What are the two pricing models for Azure SQL Database?
You can buy vCore or DTU
What’s the difference between DTUs and vCores?
DTUs are pre-configured bundles of compute, storage and IO. If you use more than 300 DTUs it is usually better to switch to vCores. vCores are better for expensive workloads, or when you need to monitor and trim costs. DTUs are best when running simple, inexpensive databases.
How do DTUs and vCores differ in terms of pricing?
With DTUs you pay one fixed price that covers compute, data storage and backup retention; with vCores you pay for each separately.
What is Hyperscale in Azure SQL Database?
Hyperscale, available with vCores, allows databases to grow up to 100 TB.
Can you use vCores or DTUs for elastic pools in Azure SQL Database?
Yes, you can use both
When should you use SQL Database Managed Instance?
It’s ideal for migrating from on-prem to the Azure cloud. It has native VNet integration and automated backups, but it is more expensive than an elastic pool.
Explain Azure SQL Database Backups
Backups are automatic and stored in read-access geo-redundant storage (RA-GRS).
Backups are full weekly, differential every 12 hours and transaction log every 5-10 minutes
Backups are good for restoring deleted tables. Restoring to a different region, and restoring to a point in time.
Long-term retention keeps backups for longer periods as blobs.
What is dynamic data masking?
DDM limits which users can see sensitive data items by masking them in query results in real time
How do you implement DDM?
Using the Azure portal, or by adding a rule in PowerShell
Should you use DDM with other security techniques?
Yes! DDM only masks query output, so combine it with encryption, auditing and access control.
What does the following datamask do?
EMAIL varchar(100) MASKED WITH (FUNCTION = 'email()') NULL
This exposes the first letter of the email address and masks the rest in the form aXXX@XXXX.com
What does the following datamask do?
FirstName varchar(100) MASKED WITH (FUNCTION = 'partial(1, "XXXXXXX", 0)') NULL
Gets the first letter of the FirstName and masks the rest with X.
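The two mask rules above are easy to mis-remember, so here is a Python sketch of what they do to a value. These are toy re-implementations of the behaviour, not the SQL Server code; the sample inputs are made up.

```python
def partial_mask(value: str, prefix: int, padding: str, suffix: int) -> str:
    """Sketch of DDM's partial(prefix, padding, suffix) rule: keep the first
    `prefix` and last `suffix` characters, replace the middle with `padding`."""
    return value[:prefix] + padding + (value[-suffix:] if suffix else "")

def email_mask(value: str) -> str:
    """Sketch of DDM's email() rule: first letter, then a constant mask."""
    return value[0] + "XXX@XXXX.com"

print(partial_mask("Jonathan", 1, "XXXXXXX", 0))  # JXXXXXXX
print(email_mask("alice@contoso.com"))            # aXXX@XXXX.com
```

Remember: the underlying data is unchanged; only the query result is masked.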
What does it mean to encrypt data at rest?
Transparent Data Encryption (TDE) - encrypting at the file level, preventing the data from being copied to another location and read. Done using AES (older versions used Triple DES).
CosmosDB encrypts at rest by default, with keys held in key storage.
Data Lake - encryption at rest is on by default; keys are managed by Microsoft, or you can manage them yourself via Key Vault.
What is encryption of data in motion?
TLS/SSL are used for transferring data between the cloud and the customer.
Perfect forward secrecy (PFS) is also used to protect data between customers’ client systems and the cloud
Shared access signatures
* Delegates access to Azure Storage objects
Data Lake
HTTPS is the only protocol available for its REST interface
What is batch processing?
It’s a method of running high-volume, repetitive jobs without user interaction
- In a bank, nightly run of all the transactions
- A challenge is the input data not being correct: what happens if an error occurs mid-run?
- Batch processing tools - Data Factory, U-SQL, Spark, Pig or Hive
- Extract, Transform and Load (ETL): the result is stored in a structured data warehouse
What is Apache Spark?
A cluster-computing framework for working on big data problems.
What is a driver process in Apache Spark?
The driver process maintains information about the Spark application, responds to the program or input, and analyses, distributes and schedules work across the cluster.
What does an executor process do in Apache Spark?
The executor does the work that the driver process tells it to do.
Can multiple applications work on the same cluster in Apache Spark?
Yes!
What is RDD in Apache Spark?
RDD stands for Resilient Distributed Dataset: fault-tolerant data spread across several nodes.
It is the fundamental data structure of Apache Spark.
What are transformations and actions in Apache Spark?
- Transformations (map, filter, ...) are lazy: they only describe the computation
- Data is only loaded and processed when an action requests a result
- An action runs the computation and returns a value
What is Databricks in Azure?
Basically Jupyter-style notebooks on top of managed Apache Spark.
What is Polybase and what is it used for?
It allows you to bypass the ETL process in favour of ELT.
PolyBase performs the transformation steps within the data warehouse.
What are the steps involved in Polybase?
- Extract the source data into text files
- Land the data into Data Lake or Azure Blob
- Prepare the data for loading
- Load the data into SQL Data Warehouse staging table with Polybase
- Transform the data
- Insert the data into production tables
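The staging-then-transform flow of steps 4-6 can be sketched with sqlite3 standing in for the warehouse (in Synapse the transform step would typically be a CTAS statement; table names here are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE stg_orders (order_id TEXT, amount TEXT);   -- raw landed text
CREATE TABLE orders     (order_id INTEGER, amount REAL); -- production table
""")

# Step 4: bulk-load the raw text files into a staging table (PolyBase's job).
raw = [("1", "9.50"), ("2", "12.00")]
con.executemany("INSERT INTO stg_orders VALUES (?, ?)", raw)

# Steps 5-6: transform inside the warehouse, then insert into production.
con.execute("""
    INSERT INTO orders
    SELECT CAST(order_id AS INTEGER), CAST(amount AS REAL) FROM stg_orders
""")
loaded = con.execute("SELECT * FROM orders ORDER BY order_id").fetchall()
print(loaded)  # [(1, 9.5), (2, 12.0)]
```

The point of ELT: the typing and cleansing happens after loading, using the warehouse's own compute.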
What is Stream Analytics?
Real-time processing of streaming data
What are some use cases of stream analytics?
Security or fraud detection, traffic sensors, healthcare
Where does stream analytics get data from usually?
Azure Event Hubs or IoT Hub
What are the 3 components of a Stream Analytics job?
Input, query and output
What are queries based on in Stream analytics?
Based on SQL, using different windowing functions
What is the idea behind windowing in Stream analytics?
You have to put some sort of boundary on streaming data.
Explain a tumbling window?
A tumbling window has a fixed length. The next window is placed right after the end of the previous one on the time axis. Tumbling windows do not overlap and span the whole time domain, i.e. each event is assigned to exactly one window
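The "exactly one window" property is just integer division on the time axis — a small sketch (not Stream Analytics syntax):

```python
def tumbling_window(ts: float, size: float) -> tuple:
    """Each event lands in exactly one fixed-length, non-overlapping window."""
    start = (ts // size) * size
    return (start, start + size)

events = [0.5, 4.9, 5.0, 9.9, 10.0]
print([tumbling_window(t, 5) for t in events])
# [(0.0, 5.0), (0.0, 5.0), (5.0, 10.0), (5.0, 10.0), (10.0, 15.0)]
```

Note the boundary behaviour: an event at exactly t=5.0 starts the next window rather than closing the previous one.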
Explain a hopping window?
A hopping window is similar to a tumbling window: it has a fixed size, but there is now a “hop” smaller than the size, creating an overlap, so a single event can belong to more than one window.
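The overlap can be sketched by listing every window an event falls into (again illustrative Python, not Stream Analytics syntax):

```python
def hopping_windows(ts, size, hop):
    """A hopping window starts every `hop` units but is `size` long, so when
    hop < size an event falls into several overlapping windows."""
    windows = []
    start = (ts // hop) * hop  # latest window start at or before ts
    while start > ts - size:   # walk back while the window still contains ts
        windows.append((start, start + size))
        start -= hop
    return sorted(windows)

print(hopping_windows(7, size=10, hop=5))  # [(0, 10), (5, 15)]
```

With size 10 and hop 5, every event is counted in two windows — that double-counting is the defining feature of hopping windows.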
Explain a sliding window?
A sliding window moves continuously along the time axis rather than at fixed offsets.
The window moves across and you calculate, for example, the number of events per window or a moving average.
Explain a session window?
A session window is used to group events that happen close together. There is usually a burst of events and then a gap.
A session window has a timeout (gap) and a maximum size; any events that arrive within the gap are added to the same window.
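The gap-and-maximum-size grouping can be sketched as follows (illustrative Python; the numbers are made up):

```python
def session_windows(timestamps, gap, max_size):
    """Group events into sessions: a new session starts after a quiet period
    longer than `gap`, or when the session would exceed `max_size`."""
    sessions, current = [], []
    for t in sorted(timestamps):
        if current and (t - current[-1] > gap or t - current[0] >= max_size):
            sessions.append(current)  # close the session: gap or size exceeded
            current = []
        current.append(t)
    if current:
        sessions.append(current)
    return sessions

print(session_windows([1, 2, 3, 10, 11, 30], gap=7, max_size=20))
# [[1, 2, 3, 10, 11], [30]]
```

The event at t=30 starts a new session because 30 − 11 exceeds the 7-unit gap.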
When are events triggered in Data Factory?
At certain times of day or when an event occurs
What are linked services used for in Data Factory?
They’re the connection information Data Factory uses to connect to the external resources that hold the datasets
How do you monitor Storage Accounts in Azure?
Use Azure Storage Analytics for metrics about storage services and logs for blobs, queues and tables.
Useful for tracking requests, analysing usage trends and diagnosing issues with storage.
Metrics: charts of capacity and transactions
Diagnostic settings: where you configure which metrics and logs are collected
Alerts: notify you when usage is getting high
What’s the difference between Azure Monitor and Azure Storage Analytics?
Azure Monitor is a centralised place for all of your monitoring. It assists with troubleshooting issues. Diagnose, trace and debug failures in near real time. Optimizes performance, reduces bottlenecks and reduces costs.
- Application monitoring - performance and functioning of code
- Guest OS monitoring data - operating system
- Azure resource monitoring
- Azure subscription - overall health of Azure and your subscription
- Azure tenant monitoring data
Logs and Metrics are the two pieces of Azure Monitor. Metrics = graphs. Logs = output
Can create a rule to tell Azure monitor to take corrective action and notify you of critical conditions. Can get this notification via email or sms.
Autoscale adjusts capacity automatically based on Azure Monitor metrics, so you have the right amount of resources for the load.
How do you monitor in Azure SQL Database?
Key metrics - CPU usage, wait statistics (why are queries waiting on resources), I/O usage - I/O limits of underlying storage, memory usage.
Tools include
- Metrics charts: for DTU consumption and resources approaching max utilisation
- SQL Server Management Studio - performance dashboard to monitor top resource consuming queries.
- SQL Database Advisor - view recommendations for creating and dropping indexes, parameterising queries and fixing schema issues
- Azure SQL Intelligent Insights - automatic monitoring of database performance
- Automatic tuning - let Azure automatically apply changes
How do you monitor in SQL Data Warehouse?
SQL Data Warehouse exposes largely the same metrics; you mostly use Azure Monitor.
What are the three ways to gather metrics?
- Performance metrics on the Metrics page in the side panel
- Performance metrics using Azure Monitor
- Performance metrics on the account page
How do you setup alerts in CosmosDB?
You can setup alerts in the portal. Rules can be configured through CosmosDB alert rules.
Alert rules can be setup in the side panel. With the resource, condition and action flow.
What are the four main monitoring items in Stream Analytics?
The four main ones are
- SU % utilisation (the more SUs, the more CPU and memory available to run your job - how well are we using the streaming units?)
- runtime errors (should have 0 - can test to see)
- Watermark delay - the most reliable signal of job health (input to output); calculated from when an event first occurs, through processing, to output
- Input deserialisation errors - raised when the input stream has malformed messages, e.g. a missing delimiter
- Alerts can be set up with the resource, condition and action flow
What does monitoring in data factory look like?
You can configure alerts by defining logic, severity and notification type.
No code required. Stored run data is kept for 45 days.
Use Azure Monitor for complex queries, for monitoring across multiple data factories, and for advanced scenarios; you can, for example, create a test alert for pipeline failures.
What is monitoring like in Azure Databricks?
Different from the other products.
Ganglia is used to check whether the cluster is healthy.
Azure Monitor is recommended but isn’t configured natively for Databricks. Grafana is an open-source graphing tool you can use to visualise the metrics.
How do you optimise Synapse Analytics (SQL Data Warehouse)?
- Have you checked for Advisor recommendations?
- PolyBase - use PolyBase for ELT and use CTAS (CREATE TABLE AS SELECT)
- Hash-distribute large tables
- Remember the rule of 60 (don’t over partition).
- Minimize transaction size
- Reduce query result sizes
- Minimize possible column size
- Maximise load throughput by splitting compressed (gzip) input into 60+ files, so all distributions load in parallel.
How do you optimise data lake?
Hard to do.
Parallel copy in Data Factory will maximise parallelisation.
Keep file sizes between 256 MB and 100 GB where possible.
How do you optimise Stream Analytics?
Review the SU utilisation % metric
- 80% is the red line
- Start with 6 SUs for queries not using PARTITION BY
- Outputs need to be partitioned
- Events should go to the same partition of your input
- PARTITION BY should be used in all steps
How do you optimise sql database?
Intelligent Performance tab
* Summary of database performance and the first stop for optimisation
Performance Recommendations
* Indexes to create or drop
* Schema issues identified in the database
* When queries can benefit from being parameterised
Query Performance Insight
- insight into DTU consumption
- CPU-consuming queries
- Drill down into detail on query results
Automatic tuning
* ML for appropriate tuning