DP-900 core data concepts Flashcards
DP-900 Azure Data Fundamentals
What are the three ways you can categorize data?
Structured
Semi-structured
Unstructured
What is tabular data?
data that is stored as rows and columns, in one or more tables.
A row represents an entity and a column represents an attribute of that entity.
What makes data ‘structured’?
it is tabular and adheres to a fixed schema.
What makes data ‘semi-structured’?
it contains entities which have some regularly occurring attributes, but with variation: sometimes an attribute is missing, sometimes there are multiple values for a given attribute, etc.
What is an example of a format that is useful for ‘semi-structured’ data?
JSON, because it allows you to define fields for an entity but does not require adherence to a predefined schema.
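A minimal Python sketch of that flexibility, using two hypothetical customer documents (one missing an attribute, one with multiple values for an attribute):

    import json

    # Two hypothetical customer records: same entity type, varying attributes.
    docs = [
        '{"name": "Ana", "email": "ana@example.com"}',
        '{"name": "Ben", "phones": ["555-0100", "555-0199"]}',  # no email, two phones
    ]

    for doc in docs:
        record = json.loads(doc)           # each document is valid JSON on its own
        print(record.get("email", "n/a"))  # missing fields are handled at read time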
What are some examples of ‘unstructured’ data?
audio, video, and images
What are the two broad categories of data stores?
File stores and Databases
What are some common ways to store files?
BLOB, CSV, XML, JSON, and optimized file formats such as Avro, ORC, and Parquet
What is XML?
a human-readable, semi-structured format that stores data in tags.
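A tiny illustrative sketch, parsing a hypothetical product element with Python's built-in xml.etree.ElementTree:

    import xml.etree.ElementTree as ET

    # A hypothetical product record stored in tags.
    xml_data = "<product><name>Widget</name><price>9.99</price></product>"

    root = ET.fromstring(xml_data)
    print(root.find("name").text)   # Widget
    print(root.find("price").text)  # 9.99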
What is replacing XML?
JSON
What is the best format for storing large objects like videos, audio, and images?
BLOB
Binary Large Object
How is file storage different from a database?
A database deals with records (structured entities that can be queried and managed individually) rather than files.
What is NoSQL?
databases that are not relational
What are the 4 common types of non-relational databases?
Key-value
Document
Column Family
Graph
In a key-value database, what format does the value have to be in?
In this type of database, it doesn't matter what the format of the value is. It can be numerical, text, binary, etc.
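An illustrative sketch, with a plain Python dictionary standing in for a key-value store, showing values of mixed formats under different (hypothetical) keys:

    # The store only understands keys; the values are opaque and can be any format.
    store = {
        "user:42:name": "Ana",                      # text
        "user:42:visits": 17,                       # number
        "user:42:avatar": b"\x89PNG\r\n\x1a\n...",  # binary blob
    }
    print(store["user:42:visits"])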
In a document database, what format does the value have to be in?
JSON
What are the two types of data processing?
Transactional and Analytical
What is OLTP?
Online Transaction Processing
What does OLTP track?
Transactions, which are often CRUD operations
What does a transaction ensure?
ACIDity
What does ACID stand for?
Atomicity
Consistency
Isolation
Durability
What is atomicity?
All sub-components of a transaction must succeed in order for the transaction to take place. It is binary: either all of it completes or none of it does.
How do you know a transaction is consistent?
When a transaction takes the database from one valid state to another valid state. If you were to transfer funds from one account to another, the total amount of funds remains the same, because the amount is subtracted from one account and added to the other.
How do you know your transactions are ‘isolated’?
When the transaction does not interfere with another transaction. If I run a transaction to transfer funds from one account to another while also running a transaction that totals the funds across all accounts, the second transaction should not read one account's balance from before the transfer and the other account's balance from after it.
What proves that a transaction was durable?
Once the transaction is committed, it persists. If the database is turned off and back on, the change remains.
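A minimal sketch, using Python's built-in sqlite3 module and a hypothetical accounts table, tying a funds transfer to atomicity, consistency, and durability (isolation involves concurrent transactions and is not shown):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                     [("alice", 100.0), ("bob", 50.0)])
    conn.commit()

    try:
        # Atomicity/consistency: both updates succeed or neither does,
        # so the total balance across accounts is unchanged.
        conn.execute("UPDATE accounts SET balance = balance - 25 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 25 WHERE name = 'bob'")
        conn.commit()    # durability: once committed, the change persists
    except sqlite3.Error:
        conn.rollback()  # any failure undoes the partial work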
What is a data lake?
A storage repository used to hold and process large volumes of file-based data.
What is an OLAP model?
Online Analytical Processing model
ETL takes the data from where to where?
From operational data to the data lake, warehouse, or lakehouse
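A minimal ETL sketch in Python, using pandas and hypothetical file paths (assumes pandas and a Parquet engine such as pyarrow are installed):

    import pandas as pd

    # Extract: read operational data exported as CSV (hypothetical path).
    orders = pd.read_csv("/data/operational/orders.csv")

    # Transform: aggregate order amounts per customer.
    summary = orders.groupby("customer_id", as_index=False)["amount"].sum()

    # Load: write the result to the lake/warehouse landing zone as Parquet.
    summary.to_parquet("/data/warehouse/order_summary.parquet", index=False)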
What is a data warehouse?
A database optimized for analytics queries (read operations)
What does CRUD stand for?
Create
Retrieve
Update
Delete
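An illustrative sketch of these four operations as SQL statements, run against an in-memory SQLite table with a hypothetical schema:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")

    conn.execute("INSERT INTO customers (name) VALUES ('Ana')")         # Create
    rows = conn.execute("SELECT id, name FROM customers").fetchall()    # Retrieve
    conn.execute("UPDATE customers SET name = 'Ana B.' WHERE id = 1")   # Update
    conn.execute("DELETE FROM customers WHERE id = 1")                  # Delete
    conn.commit()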
What is a lakehouse?
combines the flexible, scalable storage of a data lake with the relational querying semantics of a data warehouse
What kind of denormalization takes place when OLTP data is transferred to a lakehouse?
The relational data is flattened so that duplicate values appear across rows.
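A small illustrative example with hypothetical order rows: after denormalization, customer attributes that lived in a separate table are repeated on every order row.

    # Denormalized order rows: the customer's name and city are copied
    # onto every order instead of being joined in from a customers table.
    orders = [
        {"order_id": 1, "customer": "Ana", "city": "Lisbon", "amount": 20.0},
        {"order_id": 2, "customer": "Ana", "city": "Lisbon", "amount": 35.0},
        {"order_id": 3, "customer": "Ben", "city": "Porto",  "amount": 12.5},
    ]
    # Analytical reads avoid joins at the cost of duplicated values.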
What are the 3 main roles in Data?
Database Administrator
Data Engineer
Data Analyst
What does a database administrator do?
They are responsible for the design, implementation, and maintenance of databases. They do things like update the databases and manage permissions, and they are responsible for the performance and reliability of the databases.
What does a data engineer do?
They are responsible for building data workloads across databases and file stores that take transactional data and make it available for analytics. They own the management and monitoring of data pipelines to ensure that data loads perform as expected.
What does a data analyst do?
They investigate and transform data into reports and visualizations that provide valuable insights into business questions
What is Azure SQL?
A group of relational database solutions built on the SQL Server engine
What is Azure?
Azure is Microsoft's collection of cloud-based IT solutions
What is Azure SQL Database?
A fully managed platform-as-a-service (PaaS) product; it provides the least configuration flexibility of the Azure SQL options
What is Azure SQL Managed Instance?
A hosted instance of SQL Server which provides automated maintenance, allowing more configuration flexibility than SQL DB
What is Azure SQL VM?
A virtual machine with SQL Server installed, allowing for the maximum amount of configuration flexibility, but also the most responsibility for the DBA
What Azure products are offered for open-source relational databases?
Azure Database for MySQL
Azure Database for MariaDB
Azure Database for PostgreSQL
What is Azure Cosmos DB?
A globally scalable non-relational (NoSQL) database which supports storing JSON documents, key-value pairs, column-family tables, and graphs.
Sometimes DBAs have to manage this, but usually the software engineers do.
Often Data Engineers will need to extract data from here for a data lakehouse
What is Azure Storage?
A cloud service that allows you to store data in BLOB containers, file shares, and tables
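A hedged sketch of uploading a file to a blob container with the azure-storage-blob Python SDK; the connection string, container name, and file path are hypothetical:

    from azure.storage.blob import BlobServiceClient

    # Hypothetical connection string and container name.
    service = BlobServiceClient.from_connection_string("<connection-string>")
    container = service.get_container_client("raw-data")

    # Upload a local CSV file as a blob.
    with open("orders.csv", "rb") as data:
        container.upload_blob(name="landing/orders.csv", data=data)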
What would a data engineer do with Azure Storage?
They would likely use it as a data lake
What is Azure Data Factory?
A service to define and schedule data pipelines to transfer and transform data. It can be integrated with other Azure products
What would a data engineer do with Azure Data Factory?
They would use it to build ETL pipelines that take operational data and populate data warehouses for analytics solutions
What is Azure Synapse Analytics?
Comprehensive PaaS for analytics
What does Synapse Analytics include?
Pipelines
SQL
Apache Spark
Synapse Analytics Data Explorer
What are Synapse Analytics pipelines?
The same pipeline technology as Azure Data Factory
What is Synapse Analytics SQL?
a highly scalable SQL database engine, optimized for data warehouse workloads (read queries)
What is Apache Spark?
An open-source distributed data processing system with APIs in Python, SQL, Java, and Scala
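A minimal PySpark sketch (assumes pyspark is installed; the Parquet path and column names are hypothetical):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("sales-summary").getOrCreate()

    # Read a hypothetical Parquet dataset from lake storage.
    orders = spark.read.parquet("/data/orders")

    # Aggregate revenue per product category.
    summary = orders.groupBy("category").agg(F.sum("amount").alias("total_revenue"))
    summary.show()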
What is Synapse Analytics Data Explorer?
Uses the Kusto Query Language (KQL) to provide extremely fast analytics processing, optimized for real-time telemetry and log data
What can data engineers use Azure Synapse Analytics for?
They use it to build comprehensive data analytics solutions spanning ingestion pipelines, data lake storage, and data warehouse storage
What can data analysts use Azure Synapse Analytics for?
They can query data with SQL and Spark through interactive notebooks and integrate with Azure Machine Learning and Power BI to create models
What is Azure Databricks?
An Azure-integrated version of the popular Databricks platform, which combines Apache Spark with SQL database semantics for large-scale analytics