DP-203 Flashcards

1
Q

Which roles does the user account used to sign into Azure need to be a member of to create a Data Factory instance?

A

Contributor or Owner role, or an administrator of the Azure subscription

2
Q

To create and manage Data Factory objects including datasets, linked services, pipelines, triggers, and integration runtimes, which requirements must be met?

A

To create and manage child resources in the Azure portal, you must belong to the Data Factory Contributor role at the resource group level or above.

To create and manage child resources with PowerShell or the SDK, the Contributor role at the resource level or above is sufficient.

3
Q

What are the 3 types of Integration runtime?

A

Azure
Self-hosted
Azure-SSIS

4
Q

Which Azure Data Factory component orchestrates a transformation job or runs a data movement command?

A

Activities

5
Q

You are moving data from an Azure Data Lake Gen2 store to Azure Synapse Analytics. Which Azure Data Factory integration runtime would be used in a data copy activity?

A

Azure

6
Q

What is the Azure IR used for?

A

The Azure integration runtime is used when copying data between two Azure data platform technologies.

7
Q

What is the self-hosted IR used for?

A

Self-hosted IR is used when working with data movement from private networks to the cloud and vice versa.

8
Q

What is the Azure-SSIS IR used for?

A

Azure-SSIS IR is used when you lift and shift existing SSIS workloads.

9
Q

What is an integration runtime?

A

An integration runtime provides the bridge between the activity and linked services.

10
Q

In Azure Data Factory authoring tool, where would you find the Copy data activity?

A

In the Move & Transform section, which contains the activities specific to Azure Data Factory for copying data and defining data flows.

11
Q

You want to ingest data from a SQL Server database hosted on an on-premises Windows Server. What integration runtime is required for Azure Data Factory to ingest data from the on-premises server?

A

A self-hosted integration runtime can run copy activities between a cloud data store and a data store in a private network. It also can dispatch transform activities against compute resources in an on-premises network or an Azure virtual network.

12
Q

By default, how long are the Azure Data Factory diagnostic logs retained for?

A

45 days

13
Q

Which transformation in the Mapping Data Flow routes data rows to different streams based on matching conditions?

A

A Conditional Split transformation routes data rows to different streams based on matching conditions. The conditional split transformation is similar to a CASE decision structure in a programming language.

14
Q

Which transformation is used to load data into a data store or compute resource?

A

A Sink transformation allows you to choose a dataset definition for the destination output data. You can have as many sink transformations as your data flow requires.

15
Q

How long does Databricks retain cluster configuration information?

A

Up to 70 all-purpose clusters terminated in the last 30 days and up to 30 job clusters recently terminated by the job scheduler.

16
Q

How do you keep an all-purpose cluster configuration in Databricks for more than 30 days after it has been terminated?

A

An administrator can pin a cluster to the cluster list.

17
Q

What are the 2 cluster types?

A

All-purpose cluster or a job cluster

18
Q

What is an All-purpose cluster?

A

All-purpose clusters can be shared by multiple users and are best for performing ad-hoc analysis, data exploration, or development. Once you’ve completed implementing your processing and are ready to operationalize your code, switch to running it on a job cluster.

19
Q

What is a job cluster?

A

Job clusters terminate when your job ends, reducing resource usage and cost.

20
Q

What are the 3 cluster modes?

A

Standard, High Concurrency, and Single Node. Most regular users use Standard or Single Node clusters.

21
Q

What is a standard cluster used for?

A

Standard clusters are ideal for processing large amounts of data with Apache Spark.

22
Q

What is a single node cluster used for?

A

Single Node clusters are intended for jobs that use small amounts of data or non-distributed workloads such as single-node machine learning libraries.

23
Q

What is a high concurrency cluster used for?

A

High Concurrency clusters are ideal for groups of users who need to share resources or run ad-hoc jobs. Administrators usually create High Concurrency clusters. Databricks recommends enabling autoscaling for High Concurrency clusters.

24
Q

Sharding pattern: What is hash?

A

A hash-distributed table can deliver the highest query performance for joins and aggregations on large tables.

25
Q

Sharding pattern: What is replicate?

A

A replicated table provides the fastest query performance for small tables.

26
Q

Sharding pattern: What is round-robin?

A

A round-robin table is the most straightforward table to create and delivers fast performance when used as a staging table for loads.

27
Q

Which sharding pattern is best for small dimension tables with less than 2GB of storage after compression?

A

Replicated

28
Q

Which sharding pattern is best for temporary/staging tables or tables with no obvious joining key or good candidate column?

A

Round-robin (Default)

29
Q

Which sharding pattern is best for large dimension tables and fact tables?

A

Hash

30
Q

Name the 5 types of window functions in Stream Analytics

A

Tumbling
Hopping
Sliding
Session
Snapshot

31
Q

When should a ‘Tumbling’ window be used?

A

Used to segment a data stream into distinct, non-overlapping time segments, e.g. the count of tweets per time zone every 10 seconds.
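
A minimal Stream Analytics query sketch for this scenario (the input name TwitterStream and its fields are illustrative assumptions):

SELECT TimeZone, COUNT(*) AS TweetCount
FROM TwitterStream TIMESTAMP BY CreatedAt
GROUP BY TimeZone, TumblingWindow(second, 10)   -- distinct, non-overlapping 10-second windows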

32
Q

When should a ‘Hopping’ window be used?

A

Windows that hop forward in time by a fixed period; they are like tumbling windows, but they can overlap, so an event can belong to more than one window. E.g. every 5 seconds, give the count of tweets over the last 10 seconds.
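
A sketch of that scenario with a hopping window (TwitterStream is an assumed input name; the arguments are time unit, window size, hop size):

SELECT Topic, COUNT(*) AS TweetCount
FROM TwitterStream TIMESTAMP BY CreatedAt
GROUP BY Topic, HoppingWindow(second, 10, 5)   -- 10-second windows that hop forward every 5 seconds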

33
Q

When should a ‘Sliding’ window be used?

A

Outputs events only for points in time when the content of the window changes. E.g. give the count of tweets for all topics that are tweeted more than 10 times in the last 10 seconds.
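
A sketch using a sliding window (assumed input TwitterStream; output is produced only when the window content changes):

SELECT Topic, COUNT(*) AS TweetCount
FROM TwitterStream TIMESTAMP BY CreatedAt
GROUP BY Topic, SlidingWindow(second, 10)
HAVING COUNT(*) > 10   -- only topics tweeted more than 10 times in the last 10 seconds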

34
Q

When should a ‘Session’ window be used?

A

Groups events that arrive at similar times, filtering out periods of time where there is no data. E.g. the count of tweets that occur within 5 minutes of each other.
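
A sketch using a session window (assumed input TwitterStream; the arguments are time unit, timeout, and maximum duration):

SELECT Topic, COUNT(*) AS TweetCount
FROM TwitterStream TIMESTAMP BY CreatedAt
GROUP BY Topic, SessionWindow(minute, 5, 10)   -- close the window after 5 idle minutes, cap it at 10 minutes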

35
Q

When should a ‘Snapshot’ window be used?

A

Groups events that have the same timestamp. E.g. give the count of tweets with the same topic type that occur at exactly the same time.
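
A sketch of a snapshot window (assumed input TwitterStream); no window function is needed, grouping by System.Timestamp() is enough:

SELECT Topic, COUNT(*) AS TweetCount
FROM TwitterStream TIMESTAMP BY CreatedAt
GROUP BY Topic, System.Timestamp()   -- groups events that share the same timestamp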

36
Q

How do you set up ‘Dynamic Data Masking’?

A

You set up a dynamic data masking policy in the Azure portal by selecting the Dynamic Data Masking blade under Security in your SQL Database configuration pane. This feature cannot be set up by using the portal for SQL Managed Instance.
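
A masking rule can also be defined in T-SQL; a minimal sketch with illustrative table, column, and role names:

ALTER TABLE dbo.Customers
ALTER COLUMN Email ADD MASKED WITH (FUNCTION = 'email()');

-- Users without the UNMASK permission see masked data; grant it only to trusted principals.
GRANT UNMASK TO DataAnalystRole;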

37
Q

In DDM what is ‘Default’?

A
  • Use XXXX (or fewer Xs if the size of the field is less than 4 characters) for string data types (nchar, ntext, nvarchar).
  • Use a zero value for numeric data types (bigint, bit, decimal, int, money, numeric, smallint, smallmoney, tinyint, float, real).
  • Use 01-01-1900 for date/time data types (date, datetime2, datetime, datetimeoffset, smalldatetime, time).
  • For SQL variant, the default value of the current type is used.
  • For XML, the document <masked/> is used.
  • Use an empty value for special data types (timestamp, table, hierarchyid, GUID, binary, image, varbinary, spatial types).
38
Q

In DDM what is ‘Credit Card’?

A

Masking method, which exposes the last four digits of the designated fields and adds a constant string as a prefix in the form of a credit card.

XXXX-XXXX-XXXX-1234

39
Q

In DDM what is ‘Email’?

A

Masking method, which exposes the first letter and replaces the domain with XXX.com using a constant string prefix in the form of an email address.

aXX@XXXX.com

40
Q

In DDM what is ‘Random number’?

A

Masking method, which generates a random number according to the selected boundaries and actual data types. If the designated boundaries are equal, then the masking function is a constant number.

41
Q

In DDM what is ‘Custom text’?

A

Masking method, which exposes the first and last characters and adds a custom padding string in the middle. If the original string is shorter than the exposed prefix and suffix, only the padding string is used.
prefix[padding]suffix
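
In T-SQL this corresponds to the partial() masking function; a sketch with an illustrative column:

ALTER TABLE dbo.Customers
ALTER COLUMN PhoneNumber ADD MASKED WITH (FUNCTION = 'partial(2,"XXX-XXX-",4)');
-- exposes the first 2 and last 4 characters and pads the middle with "XXX-XXX-"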

42
Q

What is ‘Always Encrypted’?

A

Always Encrypted is a feature designed to protect sensitive data, such as credit card numbers or national identification numbers (for example, U.S. social security numbers), stored in Azure SQL Database or SQL Server databases.

43
Q

What is ‘Deterministic’ encryption?

A

Deterministic encryption always generates the same encrypted value for any given plain text value. Using deterministic encryption allows point lookups, equality joins, grouping and indexing on encrypted columns. However, it may also allow unauthorized users to guess information about encrypted values by examining patterns in the encrypted column, especially if there’s a small set of possible encrypted values, such as True/False, or North/South/East/West region. Deterministic encryption must use a column collation with a binary2 sort order for character columns.

44
Q

What is ‘Randomized’ encryption?

A

Randomized encryption uses a method that encrypts data in a less predictable manner. Randomized encryption is more secure, but prevents searching, grouping, indexing, and joining on encrypted columns.
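
A T-SQL sketch showing both encryption types in a column definition (the table, columns, and the column encryption key name CEK1 are illustrative assumptions):

CREATE TABLE dbo.Patients
(
    PatientId int IDENTITY PRIMARY KEY,
    SSN char(11) COLLATE Latin1_General_BIN2          -- deterministic requires a BIN2 collation
        ENCRYPTED WITH (COLUMN_ENCRYPTION_KEY = CEK1,
                        ENCRYPTION_TYPE = DETERMINISTIC,
                        ALGORITHM = 'AEAD_AES_256_CBC_HMAC_SHA_256'),
    Salary money                                      -- randomized: more secure, but no lookups or joins
        ENCRYPTED WITH (COLUMN_ENCRYPTION_KEY = CEK1,
                        ENCRYPTION_TYPE = RANDOMIZED,
                        ALGORITHM = 'AEAD_AES_256_CBC_HMAC_SHA_256')
);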

45
Q

How would you configure ‘Always Encrypted’?

A

Using SSMS and PowerShell:

1) Provision column master keys, column encryption keys, and encrypted column encryption keys with their corresponding column master keys.
2) Create key metadata in the database.
3) Create new tables with encrypted columns.
4) Encrypt existing data in selected database columns.

46
Q

What is ‘Transparent Data Encryption (TDE)’?

A

Transparent Data Encryption (TDE) encrypts SQL Server, Azure SQL Database, and Azure Synapse Analytics data files. This encryption is known as encrypting data at rest.

47
Q

What steps would you take to enable TDE?

A

Create a master key.

Create or obtain a certificate protected by the master key.

Create a database encryption key and protect it by using the certificate.

Set the database to use encryption.
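
A hedged T-SQL sketch of these steps for SQL Server (certificate and database names are illustrative; Azure SQL Database normally has TDE enabled by default with a service-managed key):

USE master;
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';
CREATE CERTIFICATE TdeCert WITH SUBJECT = 'TDE certificate';

USE SalesDb;
CREATE DATABASE ENCRYPTION KEY
    WITH ALGORITHM = AES_256
    ENCRYPTION BY SERVER CERTIFICATE TdeCert;

ALTER DATABASE SalesDb SET ENCRYPTION ON;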

48
Q

In Data Lake Storage Gen2, when would you use the premium tier?

A

For scenarios that require low latency.

49
Q

Mike is creating an Azure Data Lake Storage Gen2 account. He must configure this account to be able to process analytical data workloads for best performance. Which option should he configure when creating the storage account?

A

On the Advanced tab, set the Hierarchical Namespace to Enabled.

If you want the best performance for analytical workloads in Data Lake Storage Gen2, set the Hierarchical Namespace to Enabled on the Advanced tab when creating the storage account.

50
Q

In which phase of big data processing is Azure Data Lake Storage located?

A

Store

51
Q

What are the key differences when creating tables in Synapse Analytics?

A

In Synapse, you do not have foreign keys and unique value constraints like you do in SQL Server. Since these rules are not enforced at the database layer, the jobs used to load data have more responsibility to maintain data integrity.

52
Q

In Synapse, how would you specify distribution type?

A

DISTRIBUTION = REPLICATE

DISTRIBUTION = HASH([SalesOrderNumber])
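
For example, in a Synapse dedicated SQL pool CREATE TABLE statement (the table and columns are illustrative; CLUSTERED COLUMNSTORE INDEX is the typical index choice for large tables):

CREATE TABLE dbo.FactSales
(
    SalesOrderNumber nvarchar(20) NOT NULL,
    SalesAmount money NOT NULL
)
WITH
(
    DISTRIBUTION = HASH([SalesOrderNumber]),
    CLUSTERED COLUMNSTORE INDEX
);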

53
Q

What distribution option would you use for a product dimension table that will contain 1,000 records in Synapse Analytics?

A

DISTRIBUTION = REPLICATE.

Replicate will result in a copy of the table on each compute node, which performs well with joins to the distributed fact table.

54
Q

What distribution option would be best for a sales fact table that will contain billions of records?

A

DISTRIBUTION = HASH([SalesOrderNumber]).

Hash distribution provides good read performance for a large table by distributing records across compute nodes based on the hash key.

55
Q

What is the difference between a star schema and a snowflake schema?

A

All dimensions in a star schema join directly to the fact table (denormalized) while some dimension tables in a snowflake schema are normalized.

A star schema is highly denormalized so that the fact table joins directly to dimension; a snowflake schema normalizes some dimensions into multiple tables such as DimProduct, DimProductSubcategory, and DimProductCategory.