DP-900 Flashcards

1
Q

How can we classify data?

A

structured, semi-structured, or unstructured

2
Q

What is structured data?

A

Structured data is data that adheres to a fixed schema, so all of the data has the same fields or properties.

3
Q

What is semi-structured data?

A

Semi-structured data is information that has some structure, but which allows for some variation between entity instances.

4
Q

What are some common formats of semi-structured data?

A

JSON and XML

5
Q

What is unstructured data?

A

Not all data is structured or even semi-structured. For example, documents, images, audio and video data, and binary files might not have a specific structure. This kind of data is referred to as unstructured data.

6
Q

What are some examples of unstructured data?

A

documents, images, audio and video data, and binary files

7
Q

What are the two categories of data store?

A

File stores and databases

8
Q

What should one consider when choosing between a file store and a database?

A
  • The type of data being stored (structured, semi-structured, or unstructured).
  • The applications and services that will need to read, write, and process the data.
  • The need for the data files to be readable by humans, or optimized for efficient storage and processing.
9
Q

What are delimited text files?

A

Data is often stored in plain text format with specific field delimiters and row terminators. The most common format for delimited data is comma-separated values (CSV) in which fields are separated by commas, and rows are terminated by a carriage return / new line.
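A minimal sketch of reading delimited text with Python's standard `csv` module (the sample data here is invented for illustration):

```python
import csv
import io

# Fields are separated by commas; each row is terminated by a newline.
data = "id,name,price\n1,Widget,2.50\n2,Gadget,3.75\n"

# DictReader treats the first row as the header and yields one dict per row.
rows = list(csv.DictReader(io.StringIO(data)))
print(rows[0]["name"])   # Widget
print(rows[1]["price"])  # 3.75
```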

10
Q

What type of files can one store in a file store?

A
  • Delimited text files
  • JSON
  • XML
  • BLOB
11
Q

What are some popular optimized file formats?

A
  • Avro
  • ORC
  • Parquet
12
Q

What is Avro?

A

Avro is a row-based format. It was created by Apache. Each record contains a header that describes the structure of the data in the record. This header is stored as JSON. The data is stored as binary information. An application uses the information in the header to parse the binary data and extract the fields it contains. Avro is a good format for compressing data and minimizing storage and network bandwidth requirements.

13
Q

What is ORC?

A

ORC (Optimized Row Columnar format) organizes data into columns rather than rows. It was developed by HortonWorks for optimizing read and write operations in Apache Hive (Hive is a data warehouse system that supports fast data summarization and querying over large datasets). An ORC file contains stripes of data. Each stripe holds the data for a column or set of columns. A stripe contains an index into the rows in the stripe, the data for each row, and a footer that holds statistical information (count, sum, max, min, and so on) for each column.

14
Q

What is Parquet?

A

Parquet is another columnar data format. It was created by Cloudera and Twitter. A Parquet file contains row groups. Data for each column is stored together in the same row group. Each row group contains one or more chunks of data. A Parquet file includes metadata that describes the set of rows found in each chunk. An application can use this metadata to quickly locate the correct chunk for a given set of rows, and retrieve the data in the specified columns for these rows. Parquet specializes in storing and processing nested data types efficiently. It supports very efficient compression and encoding schemes.

15
Q

What optimized file format should we use to compress data and minimize storage and network bandwidth requirements?

A

Avro

16
Q

What optimized file format should we use to optimize read and write operations in Apache Hive?

A

ORC

17
Q

What optimized file format should we use that specializes in storing and processing nested data types efficiently?

A

Parquet

18
Q

What is normalization of data?

A

The elimination of duplicate data values

19
Q

What types of non-relational database do we have?

A
  • Key-value databases
  • Document databases
  • Column family databases
  • Graph databases
20
Q

What is the key-value type of non-relational database?

A

Key-value databases in which each record consists of a unique key and an associated value, which can be in any format.

21
Q

What is the document type of non-relational database?

A

Document databases, which are a specific form of key-value database in which the value is a JSON document (which the system is optimized to parse and query)
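A small illustration with Python's `json` module of why JSON suits document databases: each document is self-describing, so fields can vary between entity instances (the documents here are invented):

```python
import json

# Two documents in the same collection; the second has an extra field.
docs = [
    '{"id": 1, "name": "Ana"}',
    '{"id": 2, "name": "Ben", "phones": ["555-0100", "555-0199"]}',
]

for raw in docs:
    doc = json.loads(raw)
    # Query a field that may or may not be present in this document.
    phones = doc.get("phones", [])
    print(doc["name"], len(phones))
```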

22
Q

What is the column-family type of non-relational database?

A

Column family databases, which store tabular data comprising rows and columns, but you can divide the columns into groups known as column-families. Each column family holds a set of columns that are logically related together.

23
Q

What is the graph type of non-relational database?

A

Graph databases, which store entities as nodes with links to define relationships between them.

24
Q

What is Online Transactional Processing (OLTP)?

A

OLTP solutions rely on a database system in which data storage is optimized for both read and write operations in order to support transactional workloads in which data records are created, retrieved, updated, and deleted (often referred to as CRUD operations).

25
Q

How does an OLTP system accomplish its goal?

A

ACID semantics

26
Q

What does the ACID stand for?

A

Atomicity – each transaction is treated as a single unit, which succeeds completely or fails completely. For example, a transaction that involved debiting funds from one account and crediting the same amount to another account must complete both actions. If either action can’t be completed, then the other action must fail.
Consistency – transactions can only take the data in the database from one valid state to another. To continue the debit and credit example above, the completed state of the transaction must reflect the transfer of funds from one account to the other.
Isolation – concurrent transactions cannot interfere with one another, and must result in a consistent database state. For example, while the transaction to transfer funds from one account to another is in-process, another transaction that checks the balance of these accounts must return consistent results - the balance-checking transaction can’t retrieve a value for one account that reflects the balance before the transfer, and a value for the other account that reflects the balance after the transfer.
Durability – when a transaction has been committed, it will remain committed. After the account transfer transaction has completed, the revised account balances are persisted so that even if the database system were to be switched off, the committed transaction would be reflected when it is switched on again.
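The account-transfer example above can be sketched with SQLite (via Python's `sqlite3` module) standing in for a full OLTP engine; the accounts and amounts are invented. The transaction either commits both updates or rolls both back:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 100), ("B", 50)])
conn.commit()

try:
    with conn:  # begins a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 200 WHERE id = 'A'")
        (bal,) = conn.execute("SELECT balance FROM accounts WHERE id = 'A'").fetchone()
        if bal < 0:
            # The debit cannot complete, so the whole transaction must fail.
            raise ValueError("insufficient funds")
        conn.execute("UPDATE accounts SET balance = balance + 200 WHERE id = 'B'")
except ValueError:
    pass

# Atomicity: the failed transfer left both balances unchanged.
balances = conn.execute("SELECT balance FROM accounts ORDER BY id").fetchall()
print(balances)  # [(100,), (50,)]
```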

27
Q

When are OLTP systems typically used?

A

OLTP systems are typically used to support live applications that process business data - often referred to as line of business (LOB) applications.

28
Q

What is LOB?

A

Line of business

29
Q

What is a common architecture for enterprise-scale analytics?

A
  1. Operational data is extracted, transformed, and loaded (ETL) into a data lake for analysis.
  2. Data is loaded into a schema of tables - typically in a Spark-based data lakehouse with tabular abstractions over files in the data lake, or a data warehouse with a fully relational SQL engine.
  3. Data in the data warehouse may be aggregated and loaded into an online analytical processing (OLAP) model, or cube. Aggregated numeric values (measures) from fact tables are calculated for intersections of dimensions from dimension tables. For example, sales revenue might be totaled by date, customer, and product.
  4. The data in the data lake, data warehouse, and analytical model can be queried to produce reports, visualizations, and dashboards.
30
Q

What is a data lake?

A

Data lakes are common in large-scale data analytical processing scenarios, where a large volume of file-based data must be collected and analyzed.

31
Q

What is data warehouse?

A

Data warehouses are an established way to store data in a relational schema that is optimized for read operations – primarily queries to support reporting and data visualization.

32
Q

What are data lakehouses?

A

Data Lakehouses are a more recent innovation that combine the flexible and scalable storage of a data lake with the relational querying semantics of a data warehouse. The table schema may require some denormalization of data in an OLTP data source (introducing some duplication to make queries perform faster).

33
Q

What is OLAP model?

A

An OLAP model is an aggregated type of data storage that is optimized for analytical workloads. Data aggregations are across dimensions at different levels, enabling you to drill up/down to view aggregations at multiple hierarchical levels; for example to find total sales by region, by city, or for an individual address. Because OLAP data is pre-aggregated, queries to return the summaries it contains can be run quickly.
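A toy sketch of the pre-aggregation idea in pure Python, with invented sales data: the same measure is totalled at two hierarchy levels, so "drill-up" (region) and "drill-down" (city) queries are answered from ready-made summaries:

```python
from collections import defaultdict

sales = [
    ("West", "Bergen", 10), ("West", "Bergen", 5),
    ("West", "Stavanger", 7), ("East", "Oslo", 20),
]

by_city = defaultdict(int)    # drill-down level
by_region = defaultdict(int)  # drill-up level
for region, city, amount in sales:
    by_city[(region, city)] += amount
    by_region[region] += amount

print(by_region["West"])            # 22
print(by_city[("West", "Bergen")])  # 15
```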

34
Q

Which data model should we use if we want to drill up/down?

A

OLAP

35
Q

What are three key job roles that deal with data in most organizations, and what do they do?

A
  • Database administrators manage databases, assigning permissions to users, storing backup copies of data, and restoring data in the event of a failure.
  • Data engineers manage infrastructure and processes for data integration across the organization, applying data cleaning routines, identifying data governance rules, and implementing pipelines to transfer and transform data between systems.
  • Data analysts explore and analyze data to create visualizations and charts that enable organizations to make informed decisions.
36
Q

What types of Azure SQL do we have?

A
  • Azure SQL Database
  • Azure SQL Managed Instance
  • Azure SQL VM
37
Q

What is Azure SQL Database?

A

Azure SQL Database – a fully managed platform-as-a-service (PaaS) database hosted in Azure

38
Q

What is Azure SQL Managed Instance

A

Azure SQL Managed Instance – a hosted instance of SQL Server with automated maintenance, which allows more flexible configuration than Azure SQL DB but with more administrative responsibility for the owner.

39
Q

What is Azure SQL VM?

A

Azure SQL VM – a virtual machine with an installation of SQL Server, allowing maximum configurability with full management responsibility.

40
Q

What open-source relational database services does Azure offer?

A
  • Azure Database for MySQL
  • Azure Database for MariaDB
  • Azure Database for PostgreSQL
41
Q

What is Azure Database for MySQL and when is it used?

A

Azure Database for MySQL - a simple-to-use open-source database management system that is commonly used in Linux, Apache, MySQL, and PHP (LAMP) stack apps.

42
Q

What is Azure Database for MariaDB and when is it used?

A

Azure Database for MariaDB - a newer database management system, created by the original developers of MySQL. The database engine has since been rewritten and optimized to improve performance. MariaDB offers compatibility with Oracle Database (another popular commercial database management system).

43
Q

What is Azure Database for PostgreSQL and when is it used?

A

Azure Database for PostgreSQL - a hybrid relational-object database. You can store data in relational tables, but a PostgreSQL database also enables you to store custom data types, with their own non-relational properties.

44
Q

What types of storage does an Azure Storage account support?

A
  • File store
  • Tables
  • Blob containers
45
Q

What is Azure Data Factory?

A

Azure Data Factory is an Azure service that enables you to define and schedule data pipelines to transfer and transform data. You can integrate your pipelines with other Azure services, enabling you to ingest data from cloud data stores, process the data using cloud-based compute, and persist the results in another data store.

Azure Data Factory is used by data engineers to build extract, transform, and load (ETL) solutions that populate analytical data stores with data from transactional systems across the organization.

46
Q

What is Azure Synapse Analytics?

A

Data engineers can use Azure Synapse Analytics to create a unified data analytics solution that combines data ingestion pipelines, data warehouse storage, and data lake storage through a single service.
Azure Synapse Analytics is a comprehensive, unified Platform-as-a-Service (PaaS) solution for data analytics that provides a single service interface for multiple analytical capabilities. Including:
- Pipelines - based on the same technology as Azure Data Factory.
- SQL - a highly scalable SQL database engine, optimized for data warehouse workloads.
- Apache Spark - an open-source distributed data processing system that supports multiple programming languages and APIs, including Java, Scala, Python, and SQL.
- Azure Synapse Data Explorer - a high-performance data analytics solution that is optimized for real-time querying of log and telemetry data using Kusto Query Language (KQL).

47
Q

What is Azure Databricks?

A

Azure Databricks is a fully managed first-party service that enables an open data lakehouse in Azure. Azure Databricks is an Azure-integrated version of the popular Databricks platform, which combines the Apache Spark data processing platform with SQL database semantics and an integrated management interface to enable large-scale data analytics.

48
Q

What is Azure HDInsight?

A

Azure HDInsight is a full-spectrum, managed cluster platform which simplifies running big data frameworks in large volume and velocity using Apache Spark, Apache Hive, LLAP, Apache Kafka, Apache Hadoop, and more in your Azure environment.

49
Q

What is Apache Spark?

A

Apache Spark - a distributed data processing system that supports multiple programming languages and APIs, including Java, Scala, Python, and SQL.

50
Q

What is Apache Hadoop

A

Apache Hadoop - a distributed system that uses MapReduce jobs to process large volumes of data efficiently across multiple cluster nodes. MapReduce jobs can be written in Java or abstracted by interfaces such as Apache Hive - a SQL-based API that runs on Hadoop.

51
Q

What is Apache HBase?

A

Apache HBase - an open-source system for large-scale NoSQL data storage and querying.

52
Q

What is Apache Kafka?

A

Apache Kafka - a message broker for data stream processing.

53
Q

What is Azure Stream Analytics?

A

Azure Stream Analytics is a real-time stream processing engine that captures a stream of data from an input, applies a query to extract and manipulate data from the input stream, and writes the results to an output for analysis or further processing.

Data engineers can incorporate Azure Stream Analytics into data analytics architectures that capture streaming data for ingestion into an analytical data store or for real-time visualization.

54
Q

What is Azure Data Explorer?

A

Azure Data Explorer is a standalone service that offers the same high-performance querying of log and telemetry data as the Azure Synapse Data Explorer runtime in Azure Synapse Analytics.

Data analysts can use Azure Data Explorer to query and analyze data that includes a timestamp attribute, such as is typically found in log files and Internet-of-things (IoT) telemetry data.

55
Q

What is Microsoft Purview?

A

Microsoft Purview provides a solution for enterprise-wide data governance and discoverability. You can use Microsoft Purview to create a map of your data and track data lineage across multiple data sources and systems, enabling you to find trustworthy data for analysis and reporting.

Data engineers can use Microsoft Purview to enforce data governance across the enterprise and ensure the integrity of data used to support analytical workloads.

56
Q

What is Microsoft Fabric?

A

Microsoft Fabric is a unified Software-as-a-Service (SaaS) analytics platform, based on an open and governed lakehouse, that includes functionality to support:
- Data ingestion and ETL
- Data lakehouse analytics
- Data warehouse analytics
- Data science and machine learning
- Real-time analytics
- Data visualization
- Data governance and management

57
Q

What are some of the dialects of SQL and where are they used?

A

Transact-SQL (T-SQL). This version of SQL is used by Microsoft SQL Server and Azure SQL services.

pgSQL. This is the dialect, with extensions, implemented in PostgreSQL.

PL/SQL. This is the dialect used by Oracle. PL/SQL stands for Procedural Language/SQL.

58
Q

What are the three main logical groups of SQL statements?

A

Data Definition Language (DDL)
Data Control Language (DCL)
Data Manipulation Language (DML)

59
Q

What SQL statements are included in DDL?

A
  • CREATE: Create a new object in the database, such as a table or a view.
  • ALTER: Modify the structure of an object. For instance, altering a table to add a new column.
  • DROP: Remove an object from the database.
  • RENAME: Rename an existing object.
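The DDL statements above can be tried with SQLite through Python's `sqlite3` module (a stand-in for an Azure SQL engine; note that in SQLite, RENAME is spelled as a form of ALTER TABLE, and dialects differ on exact syntax):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT)")  # CREATE
conn.execute("ALTER TABLE product ADD COLUMN price REAL")                 # ALTER
conn.execute("ALTER TABLE product RENAME TO item")                        # RENAME

tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]
print(tables)  # ['item']

conn.execute("DROP TABLE item")                                           # DROP
```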
60
Q

What SQL statements are included in DCL (Data Control Language), and what do they do?

A
  • GRANT: Grant permission to perform specific actions
  • DENY: Deny permission to perform specific actions
  • REVOKE: Remove a previously granted permission
61
Q

What SQL statements are included in DML (Data Manipulation Language)?

A
  • SELECT: Read rows from a table
  • INSERT: Insert new rows into a table
  • UPDATE: Modify data in existing rows
  • DELETE: Delete existing rows
  • MERGE: The MERGE statement in SQL can handle inserts, updates, and deletes all in a single transaction without having to write separate logic for each of these.
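A quick run of the DML statements with SQLite via Python's `sqlite3` module (standing in for an Azure SQL engine; SQLite itself does not implement MERGE, which is a standard-SQL / T-SQL feature):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, qty INTEGER)")

conn.execute("INSERT INTO orders (qty) VALUES (5)")           # INSERT
conn.execute("UPDATE orders SET qty = 7 WHERE id = 1")        # UPDATE
rows = conn.execute("SELECT id, qty FROM orders").fetchall()  # SELECT
print(rows)  # [(1, 7)]
conn.execute("DELETE FROM orders WHERE id = 1")               # DELETE
remaining = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(remaining)  # 0
```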
62
Q

What is a view in regards to a relational database?

A

A view is a virtual table based on the results of a SELECT query. You can think of a view as a window on specified rows in one or more underlying tables.
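A view in action with SQLite via `sqlite3` (sample data invented): the view stores no data of its own; it is a named window over the underlying table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("East", 10), ("West", 20), ("East", 30)])

# A view is a saved SELECT query that can then be queried like a table.
conn.execute("CREATE VIEW east_sales AS "
             "SELECT amount FROM sales WHERE region = 'East'")
total = conn.execute("SELECT SUM(amount) FROM east_sales").fetchone()[0]
print(total)  # 40
```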

63
Q

What are stored procedures?

A

A stored procedure defines SQL statements that can be run on command. Stored procedures are used to encapsulate programmatic logic in a database for actions that applications need to perform when working with data.

64
Q

What is an index in regards to a relational database?

A

An index helps you search for data in a table. Think of an index over a table like an index at the back of a book. A book index contains a sorted set of references, with the pages on which each reference occurs. When you want to find a reference to an item in the book, you look it up through the index. You can use the page numbers in the index to go directly to the correct pages in the book. Without an index, you might have to read through the entire book to find the references you’re looking for.
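The book-index analogy shows up in SQLite's query planner (via `sqlite3`; the table and index names here are invented). Once the index exists, the planner seeks through it rather than scanning every row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER, city TEXT)")
conn.executemany("INSERT INTO customer VALUES (?, ?)",
                 [(i, "Oslo" if i % 2 else "Bergen") for i in range(1000)])

conn.execute("CREATE INDEX ix_customer_city ON customer(city)")

# The query plan now mentions the index instead of a full table scan.
plan = conn.execute("EXPLAIN QUERY PLAN "
                    "SELECT * FROM customer WHERE city = 'Oslo'").fetchall()
print(plan[0][-1])
```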

65
Q

What are the negatives of using indexes in a relational database?

A

An index consumes storage space, and each time you insert, update, or delete data in a table, the indexes for that table must be maintained. This additional work can slow down insert, update, and delete operations.

66
Q

What is Azure SQL Edge?

A

Azure SQL Edge - A SQL engine that is optimized for Internet-of-things (IoT) scenarios that need to work with streaming time-series data.

67
Q

What is the difference between Azure SQL Server on Virtual Machine vs Managed Instance?

A

A managed instance runs on a server in the cloud, but it automates backups, software patching, database monitoring, and other general tasks that you would otherwise need to handle yourself on a virtual machine. SQL Server on a virtual machine gives you complete control over the instance, along with full management responsibility.

68
Q

What is the difference between an elastic pool vs a single database in Azure SQL database?

A

The single database option enables you to quickly set up and run a single SQL Server database: you create and run a database server in the cloud, and access your database through this server. In an elastic pool, by contrast, multiple databases share the same resources by default, such as memory, data storage space, and processing power, through multiple-tenancy.

69
Q

What notable feature does Azure Database for MariaDB have?

A

One notable feature of MariaDB is its built-in support for temporal data. A table can hold several versions of data, enabling an application to query the data as it appeared at some point in the past.

70
Q

What notable features does Azure Database for PostgreSQL have?

A

PostgreSQL is a hybrid relational-object database. You can store data in relational tables, but a PostgreSQL database also enables you to store custom data types, with their own non-relational properties. The database management system is extensible; you can add code modules to the database, which can be run by queries. Another key feature is the ability to store and manipulate geometric data, such as lines, circles, and polygons.

71
Q

How can we monitor queries in Azure PostgreSQL?

A

Azure Database for PostgreSQL records information about queries run against databases on the server, and saves them in a database named azure_sys. You query the query_store.qs_view view to see this information, and use it to monitor the queries that users are running. This information can prove invaluable if you need to fine-tune the queries performed by your applications.

72
Q

What is Azure Data Lake Storage Gen2?

A

Azure Data Lake Store (Gen1) is a separate service for hierarchical data storage for analytical data lakes, often used by so-called big data analytical solutions that work with structured, semi-structured, and unstructured data stored in files. Azure Data Lake Storage Gen2 is a newer version of this service that is integrated into Azure Storage; enabling you to take advantage of the scalability of blob storage and the cost-control of storage tiers, combined with the hierarchical file system capabilities and compatibility with major analytics systems of Azure Data Lake Store.

73
Q

How do we enable the Azure Data Lake Storage Gen2 file system?

A

To create an Azure Data Lake Storage Gen2 file system, you must enable the Hierarchical Namespace option of an Azure Storage account.

74
Q

What network protocols does Azure Files support?

A
  • Server Message Block (SMB) file sharing is commonly used across multiple operating systems (Windows, Linux, macOS).
  • Network File System (NFS) shares are used by some Linux and macOS versions. To create an NFS share, you must use a premium tier storage account and create and configure a virtual network through which access to the share can be controlled.
75
Q

What must each row in an Azure Table have?

A

All rows in a table must have a unique key (composed of a partition key and a row key)
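A sketch of that key model using a plain Python dict: each entity is addressed by the combination of partition key and row key (the keys and values here are invented; a real Azure Table also uses partitions to distribute data for scalability):

```python
# (partition key, row key) -> entity; the pair must be unique per table.
table = {}
table[("sensors", "device-01")] = {"temp": 21.5}
table[("sensors", "device-02")] = {"temp": 19.0}

print(table[("sensors", "device-01")]["temp"])  # 21.5
```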

76
Q

What is Azure Cosmos DB for Apache Cassandra?

A

Azure Cosmos DB for Apache Cassandra is compatible with Apache Cassandra, which is a popular open source database that uses a column-family storage structure. Column families are tables, similar to those in a relational database, with the exception that it’s not mandatory for every row to have the same columns.

77
Q

What is Azure Cosmos DB for Apache Gremlin?

A

Azure Cosmos DB for Apache Gremlin is used with data in a graph structure, in which entities are defined as vertices that form nodes in a connected graph.

78
Q

What is ELT?

A

A data pipeline architecture in which you extract, load, and then transform the data.

79
Q

What is ETL?

A

A data pipeline architecture in which you extract, transform, and then load the data.
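A toy ETL pipeline in Python, with invented data and function names, just to show the ordering: data is cleaned and reshaped before it reaches the target store:

```python
def extract():
    # Stand-in for reading from an operational data store.
    return [{"name": " Ana ", "sales": "10"}, {"name": "Ben", "sales": "20"}]

def transform(rows):
    # Transform before loading: trim whitespace, convert types.
    return [{"name": r["name"].strip(), "sales": int(r["sales"])} for r in rows]

def load(rows, target):
    target.extend(rows)

warehouse = []  # stand-in for the analytical data store
load(transform(extract()), warehouse)
print(warehouse)  # [{'name': 'Ana', 'sales': 10}, {'name': 'Ben', 'sales': 20}]
```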

80
Q

What are cubes in regards to analytics?

A

Data model in which numeric data values are aggregated across one or more dimensions (for example, to determine total sales by product and region). The model encapsulates the relationships between data values and dimensional entities to support “drill-up/drill-down” analysis.

81
Q

In a data warehouse, what is a dimension table?

A

In data warehousing, a dimension table is a database table that stores attributes describing the facts in a fact table

82
Q

In a data warehouse, what is a fact table?

A

A fact table stores the numeric measures to be aggregated, with each row typically representing a recorded event, along with key columns that reference the dimension tables.

83
Q

What is a star schema?

A

A star schema is a data warehouse design in which a central fact table is directly related to a set of dimension tables.
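A minimal star schema in SQLite via Python's `sqlite3` module (invented sample data): measures live in the fact table, descriptive attributes in the dimension table, and a join plus GROUP BY aggregates a measure by a dimension attribute:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Dimension table: descriptive attributes to aggregate by.
conn.execute("CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT)")
# Fact table: numeric measures plus keys into the dimensions.
conn.execute("CREATE TABLE fact_sales (product_key INTEGER, revenue INTEGER)")
conn.executemany("INSERT INTO dim_product VALUES (?, ?)",
                 [(1, "Bikes"), (2, "Helmets")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?)",
                 [(1, 100), (1, 250), (2, 40)])

totals = conn.execute("""
    SELECT d.category, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS d USING (product_key)
    GROUP BY d.category ORDER BY d.category
""").fetchall()
print(totals)  # [('Bikes', 350), ('Helmets', 40)]
```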

84
Q

What is a snowflake schema?

A

It has the same fact and dimension tables as a star schema, but the star schema is extended into a snowflake schema by adding additional tables, related to the dimension tables, that represent dimensional hierarchies.

85
Q

What is data lakehouse?

A

It combines the functionality of a data warehouse and a data lake.

86
Q

Which open-source distributed processing engine does Azure Synapse Analytics include?

A

Apache Spark

87
Q

What is batch processing?

A

Batch processing, in which multiple data records are collected and stored before being processed together in a single operation.

88
Q

What is stream processing?

A

Stream processing, in which a source of data is constantly monitored and processed in real time as new data events occur.

89
Q

What are the advantages of using batch processing?

A
  • Large volumes of data can be processed at a convenient time.
  • It can be scheduled to run at a time when computers or systems might otherwise be idle, such as overnight, or during off-peak hours.
90
Q

What are the disadvantages of using batch processing?

A
  • The time delay between ingesting the data and getting the results.
  • All of a batch job’s input data must be ready before a batch can be processed. This means data must be carefully checked. Problems with data, errors, and program crashes that occur during batch jobs bring the whole process to a halt. The input data must be carefully checked before the job can be run again. Even minor data errors can prevent a batch job from running.
91
Q

When should we use stream processing?

A

Stream processing is ideal for time-critical operations that require an instant real-time response. For example, a system that monitors a building for smoke and heat needs to trigger alarms and unlock doors to allow residents to escape immediately in the event of a fire.

92
Q

What real time analytics in Azure services do we have?

A
  • Azure Stream Analytics: A platform-as-a-service (PaaS) solution that you can use to define streaming jobs that ingest data from a streaming source, apply a perpetual query, and write the results to an output.
  • Spark Structured Streaming: An open-source library that enables you to develop complex streaming solutions on Apache Spark based services, including Azure Synapse Analytics, Azure Databricks, and Azure HDInsight.
  • Azure Data Explorer: A high-performance database and analytics service that is optimized for ingesting and querying batch or streaming data with a time-series element, and which can be used as a standalone Azure service or as an Azure Synapse Data Explorer runtime in an Azure Synapse Analytics workspace.
93
Q

What services do we have for stream processing?

A
  • Azure Event Hubs: A data ingestion service that you can use to manage queues of event data, ensuring that each event is processed in order, exactly once.
  • Azure IoT Hub: A data ingestion service that is similar to Azure Event Hubs, but which is optimized for managing event data from Internet-of-things (IoT) devices.
  • Azure Data Lake Store Gen 2: A highly scalable storage service that is often used in batch processing scenarios, but which can also be used as a source of streaming data.
  • Apache Kafka: An open-source data ingestion solution that is commonly used together with Apache Spark. You can use Azure HDInsight to create a Kafka cluster.
94
Q

What are sinks in regards to processing of data?

A

The destinations to which data is sent after it has been processed.

95
Q

What are some of the services in Azure that we can sink to?

A
  • Azure Event Hubs: Used to queue the processed data for further downstream processing.
  • Azure Data Lake Store Gen 2 or Azure blob storage: Used to persist the processed results as a file.
  • Azure SQL Database or Azure Synapse Analytics, or Azure Databricks: Used to persist the processed results in a database table for querying and analysis.
  • Microsoft Power BI: Used to generate real time data visualizations in reports and dashboards.
96
Q

What is an Azure Stream Analytics cluster?

A

If your stream processing requirements are complex or resource-intensive, you can create a Stream Analytics cluster, which uses the same underlying processing engine as a Stream Analytics job, but in a dedicated tenant (so your processing is not affected by other customers) and with configurable scalability that enables you to define the right balance of throughput and cost for your specific scenario.

97
Q

What is Apache Spark?

A

Apache Spark is a distributed processing framework for large scale data analytics. Spark can be used to run code (usually written in Python, Scala, or Java) in parallel across multiple cluster nodes, enabling it to process very large volumes of data efficiently. Spark can be used for both batch processing and stream processing.

98
Q

Which Azure services support Apache Spark?

A

Azure Synapse Analytics
Azure Databricks
Azure HDInsight

99
Q

How do we process streaming data on Spark?

A

By using the Spark Structured Streaming library

100
Q

What is a delta lake?

A

Delta Lake is an open-source storage layer that adds support for transactional consistency, schema enforcement, and other common data warehousing features to data lake storage. It also unifies storage for streaming and batch data, and can be used in Spark to define relational tables for both batch and stream processing.

101
Q

What is Power BI Desktop?

A

A Microsoft Windows application in which you can import data from a wide range of data sources, combine and organize the data from these sources in an analytics data model, and create reports that contain interactive visualizations of the data.

102
Q

What is the Power BI service?

A

A cloud service in which reports can be published and interacted with by business users. You can also do some basic data modeling and report editing directly in the service using a web browser, but the functionality for this is limited compared to the Power BI Desktop tool. You can use the service to schedule refreshes of the data sources on which your reports are based, and to share reports with other users. You can also define dashboards and apps that combine related reports in a single, easy to consume location.

103
Q

What are attribute hierarchies?

A

Additional attributes added to a table that enable drill-up and drill-down analysis, for example a date hierarchy with year, month, and day levels.

104
Q

In Power BI, what is a fact table?

A

The numeric measures that will be aggregated by the various dimensions in the model are stored in fact tables. Each row in a fact table represents a recorded event that has numeric measures associated with it.

105
Q

In Power BI, what is a dimension table?

A

Dimension tables represent the entities by which you want to aggregate numeric measures.
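As a minimal sketch of the fact/dimension split (all table and column names here are hypothetical, not from any Power BI model), aggregating a fact-table measure by a dimension attribute can be illustrated in plain Python:

```python
from collections import defaultdict

# Hypothetical star-schema data: a fact table of sales events (numeric
# measures) and a dimension table of products (attributes to group by).
fact_sales = [
    {"product_id": 1, "quantity": 2, "revenue": 20.0},
    {"product_id": 2, "quantity": 1, "revenue": 15.0},
    {"product_id": 1, "quantity": 3, "revenue": 30.0},
]
dim_product = {
    1: {"name": "Widget", "category": "Hardware"},
    2: {"name": "Manual", "category": "Books"},
}

# Aggregate the fact-table measure (revenue) by a dimension attribute (category).
revenue_by_category = defaultdict(float)
for row in fact_sales:
    category = dim_product[row["product_id"]]["category"]
    revenue_by_category[category] += row["revenue"]

print(dict(revenue_by_category))  # {'Hardware': 50.0, 'Books': 15.0}
```

Each fact row joins to a dimension row via a key, and the dimension supplies the grouping attribute — the same pattern Power BI applies when a visual aggregates a measure by a dimension field.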

106
Q

When should we display data as text and tables?

A

Tables and text are often the simplest way to communicate data. Tables are useful when numerous related values must be displayed, and individual text values in cards can be a useful way to show important figures or metrics.

107
Q

When should we display Bar and column charts?

A

Bar and column charts are a good way to visually compare numeric values for discrete categories.

108
Q

When should we display Line charts?

A

Line charts can also be used to compare categorized values and are useful when you need to examine trends, often over time.

109
Q

When should we display Pie charts?

A

Pie charts are often used in business reports to visually compare categorized values as proportions of a total.

110
Q

When should we display scatter plots?

A

Scatter plots are useful when you want to compare two numeric measures and identify a relationship or correlation between them.

111
Q

When should we display maps?

A

Maps are a great way to visually compare values for different geographic areas or locations.

112
Q

What is a linked service?

A

This is essentially a connection string that defines the connection information needed for Azure Data Factory to connect to external resources. Think of it as a way to link your Azure Data Factory to different data sources or computing resources. For example, you might need one linked service to connect to the location where your Microsoft Excel files are stored (like an Azure Blob storage or a SharePoint site) and another for the location where you want to store the Parquet files.
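As a rough sketch (the service name, account name, and key placeholders are hypothetical), a linked service in Azure Data Factory is defined as JSON along these lines:

```json
{
  "name": "ExampleBlobLinkedService",
  "properties": {
    "type": "AzureBlobStorage",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
    }
  }
}
```

The `type` identifies the kind of external resource, and `typeProperties` carries the connection details that datasets and pipeline activities then reference by the linked service's name.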

113
Q

What is Azure Data Explorer?

A

A high-performance database and analytics service that is optimized for ingesting and querying batch or streaming data with a time-series element, and which can be used as a standalone Azure service or as an Azure Synapse Data Explorer runtime in an Azure Synapse Analytics workspace.

114
Q

Which data model element represents the entities by which you want to aggregate measures in Microsoft Power BI?

A

Dimensions. Dimension tables represent the entities by which measures are aggregated; fact tables contain the measures that are aggregated, not the values to aggregate by.

115
Q

When should we display cards?

A

A card shows a single value and is useful for highlighting important metrics.

116
Q

Which Azure SQL option should we use if we want a fully managed, serverless model?

A

Azure SQL Database

117
Q

Which data service allows you to use every feature of Microsoft SQL Server in the cloud?

A

SQL Server on an Azure Virtual Machine running Windows

118
Q

Which type of database should you use to store sequential data in the fastest way possible?

A

A time series database

119
Q

Which two file formats store data in a columnar format?

A

Parquet and ORC
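As a conceptual sketch in plain Python (not an actual Parquet or ORC reader — the data is made up), the difference between row-oriented and columnar layout can be shown like this:

```python
# Row-oriented storage (like CSV) keeps each record together.
rows = [
    ("alice", 34, "NO"),
    ("bob", 29, "SE"),
    ("carol", 41, "DK"),
]

# Columnar formats such as Parquet and ORC store each column's values
# contiguously instead, one sequence per column.
columns = {
    "name": [r[0] for r in rows],
    "age": [r[1] for r in rows],
    "country": [r[2] for r in rows],
}

# A query that only needs one column can read just that contiguous run
# and skip the rest of the file, which also compresses better because
# values of the same type sit next to each other.
print(columns["age"])  # [34, 29, 41]
```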

120
Q
A