Core data concepts Flashcards
What 3 factors influence the file format used for certain data?
- The type of data (structured, semi-structured, unstructured)
- The applications and services that need to read, write and process the data
- Whether files need to be readable by humans or optimized for some other factor, like storage
What is the general form of a csv file?
What are 3 other examples of that?
Delimited text file
Tab delimited (TSV), space delimited and fixed-width data.
What are blob files?
Binary large object files. Binary files that are meant for application and don’t have some human-readable encoding.
What is AVRO? When is it a good file format?
Row-based file format. Header is stored in json, the data in binary. Good for compressing data and minimizing storage and network bandwidth requirements.
What is ORC?
A columnar file format that organizes its columns into stripes. A stripe contains an index for the rows in the stripe, the data for each row and statistical information (count, max, etc) for each column.
What is parquet? When is it a good file format?
A columnar file format consisting of row groups. Every column is stored in one row group with chunks of data, with metadata describing these chunks.
Parquet specializes in storing and processing nested data, and supports compression and encoding.
What are 4 common types of nonrelational databases?
Key-value, (json) document, column family (tabular with column groups), graph
What are CRUD operations?
The transactional operations create, retrieve, updata and delete
What are OLTP and OLAP?
Online transactional processing (often rows) and online analytical processing (often columns)
What is ACID?
An acronym related to processing transactional data: atomicity, consistency, isolation and durability.
What are a data lake, data warehouse and data lakehouse?
Data lake: place to store high volumes of raw data
Data warehouse: data storage optimized for analytical reading
Data lakehouse combines the two
What 3 versions of Azure SQL are there?
Azure SQL database, PaaS
Azure SQL Managed instance, automated maintenance and more responsibilities for the owner
Azure SQL VM, IaaS, een VM met SQL Server erop
Name 3 open source databases offered on Azure
MySQL, MariaDB, PostgreSQL
What is an index?
A way to speed up searching through a table. It’s extra data stored as a tree.
What is automated and what is manual when choosing for Azure SQL Managed Instance?
Automated: backup, patching, database monitoring, other general tasks
Manual: security, resource allocation
What is the difference between Azure SQL Database single database vs elastic pool?
With an elastic pool, by default, multiple database can share the same pool of resources, while a single database is more isolated.
What is the data migration assistant?
A tool that can determine compatibility when migrating Azure SQL Database, migrating Azure SQL Managed instance or upgrading SQL server
What is one notable feature supported by MariaDB?
temporal data
What are 3 notable features supported by PostgreSQL?
Custom data types with non-relation properties, code modules that can be run by queries and the abilitiy to store and manipulate geometric data
In Azure blob storage, what can the folder structure do?
Very little, it’s just virtual. No support for access control or bulk operations
What 3 types of blobs does Azure blob storage support?
Block blobs for discrete, large binary objects that change infrequently
Page blobs, optimized for read and write, used for virtual disk storage for VMs
Append blobs, a block blob optimzed for appends that doesn’t support update or delete
How do you create a data lake?
What is one limitation of this process?
Enable “Hierarchical namespace” on an Azure storage account. You can upgrade anytime, but you can’t revert from a datalake to a regular storage account
What are the 4 key benefits of Microsoft OneLake in Fabric?
- Organization-wide data lake
- Distributed ownership and collaboration
- Open and compatible, built on Delta Parquet
- Easy to navigate via OneLake file explorer
What is cognitive analytics?
Using certain AI services to transform data
What are the 2 performance tiers for Azure files?
Standard: hard-disk based in a datacenter
Premium: solid-state disks
What 2 common network protocols are supported by Azure files?
Server Message Block (SMB): commonly used across multiple operating systems
Network File System (NFS): used by some Linux and MacOS versions, only available on premium
What is Azure tables?
A NoSQL storage solution using key/value data items. All rows have a unique key, timestamp for last update and variable columns
What is used for fast access for Azure tables? In what 4 ways does that improve scalability and performance?
Partitions. Partitions are independent, can grow and shrink and a table can have an number of partitions.
You can include your partition key in your searches.
What are the 2 elements of an Azure Table storage key?
Partition key and row key
What 6 database engines does Azure Cosmos DB support?
Azure Cosmos DB for:
1. NoSQL (native), 2. MongoDB, 3 PostGreSQL, 4 Table (Azure Table Storage), 5 Apache Cassandra, 6 Apache Gremlin
What is Microsoft Purview?
A unified solution for data governance, protection and management.