DP-900 Core Data Concepts Flashcards
DP-900 Azure Data Fundamentals
What are the three ways you can categorize data?
Structured
Semi-structured
Unstructured
What is tabular data?
data that is stored as rows and columns, in one or more tables.
A row represents an entity and a column represents an attribute of that entity.
What makes data ‘structured’?
it is tabular and adheres to a fixed schema.
What makes data ‘semi-structured’?
it contains entities that have some regularly occurring attributes, but with variation: sometimes an attribute is missing, or there are multiple values for a given attribute, etc.
What is an example of a format that is useful for ‘semi-structured’ data?
JSON, because it allows you to define fields for an entity but does not need to adhere to a predefined schema.
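As an illustration (a sketch, not part of the original card): two JSON documents for the same entity type can differ in their fields, which is exactly the variation that makes the data semi-structured.

```python
import json

# Two "customer" documents: same entity type, but the second has an
# extra field and the first has multiple values for "phone".
customers = [
    {"id": 1, "name": "Ana", "phone": ["555-0100", "555-0101"]},
    {"id": 2, "name": "Ben", "email": "ben@example.com"},
]

# Each document is valid JSON even though no fixed schema is shared.
for doc in customers:
    print(json.dumps(doc))
```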
What are some examples of ‘unstructured’ data?
audio, video, and images
What are the two broad categories of data stores?
File stores and Databases
What are some common ways to store files?
BLOB, CSV, XML, JSON, and optimized file formats like: Avro, ORC, and Parquet
What is XML?
a human-readable, semi-structured format that stores data in tags.
what is replacing XML?
JSON
What is the best format for storing large objects like videos, audio, and images?
BLOB
Binary Large Object
How is file storage different from a database?
A database deals with records (structured data entities) rather than whole files.
What is NoSQL?
databases that are not relational
What are the 4 common types of non-relational databases?
Key-value
Document
Column Family
Graph
In a key-value database, what format does the value have to be in?
In this type of database, it doesn’t matter what format the value is in. It can be numerical, text, etc.
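A minimal in-memory sketch of the key-value idea (a plain Python dict, not an actual key-value database): the store keeps opaque values of any type and never inspects them.

```python
# The store maps string keys to values of any type.
store = {}
store["user:1"] = "Ana"              # text value
store["count:logins"] = 42           # numeric value
store["img:logo"] = b"\x89PNG..."    # binary value

# Lookups are by key only; the store knows nothing about value structure.
print(store["count:logins"])
```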
In a document database, what format does the value have to be in?
JSON
What are the two types of data processing?
Transactional and Analytical
What is OLTP?
Online Transaction Processing
What does OLTP track?
Transactions, which are often CRUD operations
What does a transaction ensure?
ACIDity
What does ACID stand for?
Atomicity
Consistency
Isolation
Durability
What is atomicity?
All sub-components of a transaction must succeed in order for the transaction to take place. It is binary: either all of it completes or none of it does.
How do you know a transaction is consistent?
When a transaction takes the database from one valid state to another valid state. If you were to transfer funds from one account to another, the total amount of funds remains the same, because it is subtracted from one account and added to the other.
How do you know your transactions are ‘isolated’?
When the transaction does not interfere with another transaction. If I run a transaction to transfer funds from one account to another, and I also run a transaction to total the funds across all accounts, that second transaction should not read one account's balance from before the transfer and the other account's balance from after it.
What proves that a transaction was durable?
Once the transaction is committed, it persists. If the database is turned off and on, the change remains.
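The atomicity and consistency cards above can be demonstrated with SQLite (a local sketch, not an Azure product): a failed transfer rolls back entirely, so the total across accounts never changes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move funds atomically: both updates succeed or neither does."""
    try:
        with conn:  # opens a transaction; rolls back on exception
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
            (bal,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)).fetchone()
            if bal < 0:
                raise ValueError("insufficient funds")
    except ValueError:
        pass  # the rollback already restored both rows

transfer(conn, "alice", "bob", 30)    # succeeds: alice 70, bob 80
transfer(conn, "alice", "bob", 500)   # fails and rolls back entirely

total = conn.execute("SELECT SUM(balance) FROM accounts").fetchone()[0]
print(total)  # total is preserved: 150
```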
What is a data lake?
It is used to process large volumes of file based data
What is an OLAP model?
Online Analytical Processing model
ETL takes the data from where to where?
From operational data to the data lake, warehouse, or lakehouse
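A toy ETL sketch in plain Python (lists standing in for the operational store and the warehouse — not an Azure service): extract rows, transform them for analytics, then load them.

```python
# "Operational" source: amounts stored as strings, as an OLTP app might
operational = [{"id": 1, "amount": "19.99"}, {"id": 2, "amount": "5.00"}]

def extract(source):
    """Pull rows out of the operational store."""
    return list(source)

def transform(rows):
    """Cast string amounts to floats so they can be aggregated."""
    return [{"id": r["id"], "amount": float(r["amount"])} for r in rows]

def load(rows, warehouse):
    """Write the transformed rows into the analytical store."""
    warehouse.extend(rows)

warehouse = []
load(transform(extract(operational)), warehouse)
print(warehouse)
```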
What is a data warehouse?
A database optimized for analytics queries (read operations)
What does CRUD stand for?
Create
Retrieve
Update
Delete
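The four CRUD operations map directly onto SQL statements; a minimal sketch using SQLite (any relational database would look similar):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT)")

# Create
conn.execute("INSERT INTO products (id, name) VALUES (1, 'widget')")
# Retrieve
row = conn.execute("SELECT name FROM products WHERE id = 1").fetchone()
# Update
conn.execute("UPDATE products SET name = 'gadget' WHERE id = 1")
# Delete
conn.execute("DELETE FROM products WHERE id = 1")

print(row[0])  # 'widget' — the value retrieved before the update
```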
What is a lakehouse?
combines the flexible, scalable storage of a data lake with the relational querying semantics of a data warehouse.
What kind of denormalization takes place when OLTP data is transferred to a lakehouse?
Relational data is denormalized, so duplicate values appear across rows.
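A sketch of that denormalization in plain Python: joining normalized customers and orders into one flat analytical table repeats each customer's name on every one of their order rows.

```python
# Normalized OLTP tables: customer names stored once, referenced by key
customers = {1: "Ana", 2: "Ben"}
orders = [(101, 1, "pen"), (102, 1, "pad"), (103, 2, "ink")]

# Denormalized analytical table: the customer name is duplicated
# across every order row instead of being looked up by key.
flat = [(order_id, customers[cust_id], item)
        for order_id, cust_id, item in orders]

for row in flat:
    print(row)  # "Ana" appears twice, once per order
```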
What are the 3 main roles in Data?
Database Administrator
Data Engineer
Data Analyst
What does a database administrator do?
They are responsible for the design, implementation, and maintenance of databases. They do things like update the databases and manage permissions, and they are responsible for the performance and reliability of the databases.
What does a data engineer do?
They are responsible for building data workloads for databases and file stores that take transactional data and make them available for analytics. They own the management and monitoring of data pipelines to ensure that data loads perform as expected.
What does a data analyst do?
They investigate and transform data into reports and visualizations that provide insights and answer valuable business questions
What is Azure SQL?
A group of relational database solutions built on the SQL Server engine
What is Azure?
Azure is a collection of cloud based IT solutions
What is Azure SQL Database?
A fully managed platform-as-a-service (PaaS) product; of the Azure SQL options, it provides the least configuration flexibility
What is Azure SQL Managed Instance?
A hosted instance of SQL Server which provides automated maintenance, allowing more configuration flexibility than SQL DB
What is Azure SQL VM?
A virtual machine with SQL Server installed, allowing the maximum amount of configuration, but also placing the most responsibility on the DBA
What Azure products are offered for open-source relational databases?
Azure Database for MySQL
Azure Database for MariaDB
Azure Database for PostgreSQL
What is Azure Cosmos DB?
A global-scale non-relational (NoSQL) database which supports storing documents as JSON, key-value pairs, column-family tables, and graphs.
Sometimes DBAs have to manage this, but usually the software engineers do.
Often Data Engineers will need to extract data from here for a data lakehouse
What is Azure Storage?
A cloud service that allows you to store data in BLOB containers, file shares, and tables
What would a data engineer do with Azure Storage?
They would likely use it as a data lake
What is Azure Data Factory?
A service to define and schedule data pipelines to transfer and transform data. It can be integrated with other Azure products
What would a data engineer do with Azure Data Factory?
They would use it to build ETL pipelines that take operational data and populate data warehouses for analytics solutions
What is Azure Synapse Analytics?
Comprehensive PaaS for analytics
What does Synapse Analytics include?
Pipelines
SQL
Apache Spark
Synapse Analytics Data Explorer
What are Synapse Analytics pipelines?
Same technology as Azure Data Factory
What is Synapse SQL?
a highly scalable SQL database engine, optimized for data warehouse workloads (read queries)
What is Apache Spark?
An open-source distributed data processing system that provides APIs in Python, SQL, Java, and Scala
What is Synapse Analytics Data Explorer?
Uses the Kusto Query Language to provide extremely fast analytics processing optimized for real-time telemetry and log data
what can data engineers use Azure Synapse Analytics for?
They will use it to build comprehensive data analytics solutions for ingest pipelines, lake storage, and warehouse storage
what can data analysts use Azure Synapse Analytics for?
They can use SQL and Spark through interactive notebooks and integrate with Azure Machine Learning and Power BI to create models
What is Azure Databricks?
An Azure integrated version of a popular platform which combines Apache Spark and SQL database semantics for large scale analytics
what can data analysts use Azure Databricks for?
They can use the native notebook support to provide browser friendly data analysis
what can data engineers use Azure Databricks for?
They’ll use it to create analytical data stores
What is Azure HDInsight?
This provides Azure-hosted clusters for Apache technologies
What is Apache Hadoop?
An open-source framework for processing large volumes of data with MapReduce jobs written in Java or queries defined in Apache Hive
What is Apache HBase?
An open-source NoSQL database for storing and querying data at large scale
What is Apache Kafka?
a message broker for data stream processing
Data engineers can use Azure HDInsight for what?
They can use this to support big data processing jobs that use multiple Apache technologies
What is Azure Stream Analytics?
Captures a stream of data, applies queries/transformations to it, writes the results for analytics or further processing
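As a conceptual sketch only (plain Python with a made-up sensor stream — not the Stream Analytics service or its query language): capture a stream of events, apply a transformation, and write out the results.

```python
# Hypothetical sensor stream: (sensor_id, temperature) events
events = [("s1", 20.5), ("s2", 31.0), ("s1", 22.0), ("s2", 29.5)]

def over_threshold(stream, limit):
    """Transformation step: keep only readings above the limit."""
    for sensor, temp in stream:
        if temp > limit:
            yield {"sensor": sensor, "temp": temp}

# "Write the results" — here, simply collect them for further processing
results = list(over_threshold(events, 25.0))
print(results)  # only the two s2 readings exceed 25.0
```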
What can data engineers do with Azure Stream Analytics?
They can use this to write ETL pipelines for analytical data stores
What is Azure Data Explorer?
A standalone version of the Synapse Data Explorer engine, for fast querying of log and telemetry data
Data analysts can use Azure Data Explorer for what?
They can easily analyze timestamped log data
What is Microsoft Purview?
Enterprise solution for governance and discoverability, helping people find the data they need
What is Microsoft Fabric?
SaaS lakehouse platform that includes:
ETL
lakehouse analytics
warehouse analytics
data science and machine learning
real-time analytics
data visualization
data governance and management
Data engineers can use Microsoft Purview for what?
They can use it to enforce data governance and ensure the integrity of data