Azure Data Fundamentals Flashcards by Michal Lewandowski

Storage Account

Azure Storage is a core Azure service that enables you to store data in:
Blob containers - scalable, cost-effective storage for binary files.
File shares - network file shares such as you typically find in corporate networks.
Tables - key-value storage for applications that need to read and write data values quickly.

Blob Storage

Tables

Files

Entity

The data structures in which data is organized often represent entities. Each entity typically has one or more attributes, or characteristics.

Structured Data

Fixed schema, tabular, often stored in databases.

Semi Structured Data

Semi-structured data is information that has some structure but allows for some variation between entity instances. One common format for semi-structured data is JavaScript Object Notation (JSON).
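
As a minimal sketch of what "variation between entity instances" can look like, the snippet below parses two JSON customer records that share some fields but not others. The customer data and field names are made up for illustration.

```python
import json

# Two hypothetical customer entities. Both have "id" and "name",
# but each carries fields the other lacks.
raw = """
[
  {"id": 1, "name": "Ana", "email": "ana@example.com"},
  {"id": 2, "name": "Ben", "phones": ["555-0100", "555-0101"], "address": {"city": "Leeds"}}
]
"""

customers = json.loads(raw)

# Collect the field names each entity actually carries.
fields_per_customer = [sorted(c.keys()) for c in customers]

for customer, fields in zip(customers, fields_per_customer):
    print(customer["id"], fields)
```

A fixed-schema table would force both records into the same columns; here each record simply describes itself.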

Unstructured

Not all data is structured or even semi-structured. For example, documents, images, audio and video data, and binary files might not have a specific structure.

Data Stores

There are two broad categories: file stores and databases.

BLOB

Binary Large Objects such as video, audio, images, and application specific documents, stored as raw binary files

Optimized File Formats

Avro, ORC, Parquet
While human-readable formats for structured and semi-structured data can be useful, they’re typically not optimized for storage space or processing. Over time, some specialized file formats that enable compression, indexing, and efficient storage and processing have been developed.

Avro

Avro is a row-based format. It was created by Apache. Each record contains a header that describes the structure of the data in the record. This header is stored as JSON. The data is stored as binary information. An application uses the information in the header to parse the binary data and extract the fields it contains. Avro is a good format for compressing data and minimizing storage and network bandwidth requirements.

ORC (Optimized Row Columnar)

ORC organizes data into columns rather than rows. It was developed by Hortonworks for optimizing read and write operations in Apache Hive (Hive is a data warehouse system that supports fast data summarization and querying over large datasets). An ORC file contains stripes of data. Each stripe holds the data for a column or set of columns. A stripe contains an index into the rows in the stripe, the data for each row, and a footer that holds statistical information (count, sum, max, min, and so on) for each column.

Parquet

Parquet is another columnar data format. It was created by Cloudera and Twitter. A Parquet file contains row groups. Data for each column is stored together in the same row group. Each row group contains one or more chunks of data. A Parquet file includes metadata that describes the set of rows found in each chunk. An application can use this metadata to quickly locate the correct chunk for a given set of rows, and retrieve the data in the specified columns for these rows. Parquet specializes in storing and processing nested data types efficiently. It supports very efficient compression and encoding schemes.

Data Documents

Collective form in which data exists; common types:
Datasets, Databases, Datastores, Data Warehouses, Notebooks

Dataset

logical grouping of data

Database

structured or semi-structured data that can be quickly accessed and searched

Datastores

unstructured or semi-structured data (for example, Azure Data Lake)

Non-Relational Databases

often referred to as NoSQL databases. There are 4 common types: key-value; document (similar to key-value, where the value is a JSON document); column family; graph

Data Lake Store (Gen 2)

Hierarchical data storage for analytical data lakes; works with structured, semi-structured, and unstructured data. To create an Azure Data Lake Storage Gen2 file system, you must enable the Hierarchical Namespace option of an Azure Storage account.

Azure File Storage

Many on-premises systems comprising a network of in-house computers make use of file shares. A file share enables you to store a file on one computer, and grant access to that file to users and applications running on other computers. This strategy can work well for computers in the same local area network, but doesn’t scale well as the number of users increases, or if users are located at different sites.
Azure Files enables you to share up to 100 TB of data in a single storage account.
Offers Standard and Premium tiers

Azure Table Storage

NoSQL key-value storage solution. The key in an Azure Table Storage table comprises two elements: the partition key, which identifies the partition containing the row, and a row key that is unique to each row in the same partition.
Azure Table Storage allows you to store key-value data at the cheapest per-GB rate. Cosmos DB also has a key-value option, but it is not as cost-effective.
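
The two-part key can be pictured with a small in-memory sketch. This is only an illustration of the partition key plus row key idea using plain dictionaries; it does not use the Azure SDK, and the function names, keys, and values are hypothetical.

```python
from collections import defaultdict

# partition key -> row key -> entity properties
# (a plain-dict stand-in, not the Azure Table Storage API)
table = defaultdict(dict)

def upsert(partition_key, row_key, **properties):
    """Insert or overwrite the entity identified by the two-part key."""
    table[partition_key][row_key] = properties

def lookup(partition_key, row_key):
    """Fetch one entity; both key parts are needed to identify it."""
    return table[partition_key][row_key]

upsert("UK", "cust-001", name="Ana")
upsert("UK", "cust-002", name="Ben")
upsert("PL", "cust-001", name="Ola")  # same row key, different partition

print(lookup("UK", "cust-001"))
print(lookup("PL", "cust-001"))
```

The row key only has to be unique within its partition, which is why "cust-001" can appear in both "UK" and "PL".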

Storage Account

needs to be set up first before other storage services are added

Azure Cosmos DB

Azure Cosmos DB is a highly scalable cloud database service for NoSQL data. Cosmos DB uses indexes and partitioning to provide fast read and write performance and can scale to massive volumes of data. You can enable multi-region writes, adding the Azure regions of your choice to your Cosmos DB account so that globally distributed users can each work with data in their local replica. It includes the MongoDB API, Table API (key-value tables), Cassandra API (column family storage), and Gremlin API (graph structures).

Synapse Analytics

unified, end-to-end solution for large-scale data analytics; combines the benefits of SQL Server, data lake, and open-source Apache Spark. All services within it can be managed through a single interface, Azure Synapse Studio.

Data Explorer

standalone service for efficiently analyzing data

Kusto Query Language

used to query Azure Data Explorer tables

Manual Sharding

A common yet difficult method of distributing data transactions over multiple servers, used with a relational data store when transaction volumes get too high and database performance suffers.
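
One way application code might route rows to servers by hand is hashing a key and taking the remainder. The sketch below assumes that approach; the server names and customer IDs are hypothetical, and real deployments may shard by range or lookup table instead.

```python
import hashlib

# Hypothetical database servers that share the load.
SHARDS = ["server-0", "server-1", "server-2"]

def shard_for(customer_id: str) -> str:
    """Pick a server deterministically from a hash of the key."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(SHARDS)
    return SHARDS[index]

# The application, not the database, decides where each row lives.
placement = {cid: shard_for(cid) for cid in ["c-1", "c-2", "c-3", "c-4"]}

for customer_id, server in placement.items():
    print(customer_id, "->", server)
```

Part of the difficulty shows up here: changing the number of entries in SHARDS changes the remainder for most keys, so existing rows would have to be moved, and any query that spans customers must visit every server.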

Paginated Report

designed to be printed or shared; formatted to fit well on a page; displays all the data in a table; pixel perfect

Dynamic Data Masking

Dynamic data masking (DDM) limits sensitive data exposure by masking it to non-privileged users. It can be used to greatly simplify the design and coding of security in your application. Dynamic data masking helps prevent unauthorized access to sensitive data by enabling customers to specify how much sensitive data to reveal with minimal impact on the application layer. DDM can be configured on designated database fields to hide sensitive data in the result sets of queries. With DDM the data in the database is not changed.
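
The idea that masking applies to result sets while the stored data stays unchanged can be sketched as follows. This is a toy illustration, not how any database engine implements DDM; the row data, the masking rule, and the function names are assumptions.

```python
# Stored rows are never modified; masking happens only when a result is built.
ROWS = [{"name": "Ana", "email": "ana@example.com"}]

def mask_email(email: str) -> str:
    """Keep the first character and the domain, hide the rest."""
    local, _, domain = email.partition("@")
    return local[:1] + "***@" + domain

def query(rows, privileged: bool):
    """Return copies of the rows, masked for non-privileged callers."""
    if privileged:
        return [dict(r) for r in rows]
    return [{**r, "email": mask_email(r["email"])} for r in rows]

print(query(ROWS, privileged=False))
print(query(ROWS, privileged=True))
```

Because the mask is applied on the way out, the application layer does not need its own redaction logic for each caller.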

Transaction Optimized Storage

provides a better cost profile for frequently accessed files. It costs more to store the file, but much less to access it. It is available in the Standard tier of storage.

Azure Data Factory

has the ability to move data from a source to a destination with processing along the way. Azure Data Migration is more about simple movement of data from one source to another.

Azure Cache for Redis

Azure Cache for Redis provides an in-memory data store based on the Redis software. Redis improves the performance and scalability of an application that uses backend data stores heavily. It's able to process large volumes of application requests by keeping frequently accessed data in the server memory, which can be written to and read from quickly. Redis brings a critical low-latency and high-throughput data storage solution to modern applications.
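
Keeping frequently accessed data in memory is often done with a read-through (cache-aside) pattern. The sketch below uses a plain dictionary as a stand-in for the in-memory store and a counter in place of a real backend; it does not use a Redis client, and all names are hypothetical.

```python
cache = {}          # stand-in for the in-memory data store
backend_reads = 0   # counts trips to the slower backend data store

def read_from_backend(key):
    """Pretend to query a slow backend such as a database."""
    global backend_reads
    backend_reads += 1
    return f"value-for-{key}"

def get_cached(key):
    """Serve from memory when possible; otherwise load and remember."""
    if key in cache:
        return cache[key]
    value = read_from_backend(key)
    cache[key] = value
    return value

first = get_cached("product:42")   # miss: goes to the backend
second = get_cached("product:42")  # hit: served from memory

print(first, second, backend_reads)
```

Only the first request reaches the backend; repeat requests for the same key are answered from memory.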

Column-store data

data organized into columns; faster at aggregating values for analytics; a NoSQL store or SQL-like database; great for vast amounts of data; great when you only need a few columns
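
A small sketch of the same records laid out by row and by column shows why aggregation over one column is cheap in a column store. The sales figures are made up, and a real column store adds compression and on-disk layout that this toy example ignores.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"id": 1, "region": "UK", "amount": 10.0},
    {"id": 2, "region": "PL", "amount": 25.5},
    {"id": 3, "region": "UK", "amount": 4.5},
]

# Column-oriented layout: each field's values are stored together.
columns = {name: [r[name] for r in rows] for name in rows[0]}

# Aggregating by row touches every field of every record.
total_from_rows = sum(r["amount"] for r in rows)

# Aggregating by column reads only the one list that is needed.
total_from_columns = sum(columns["amount"])

print(total_from_rows, total_from_columns)
```

Both totals agree; the difference is how much unrelated data has to be read to compute them.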

Balanced Tree

a common data structure for storing an index

Data Consistency

Whether copies of data kept in two different places match exactly. Strongly consistent: every time you query, you get consistent data. Eventually consistent: when you request data, you may get back inconsistent data within 2 seconds.

Datamart

subset of data warehouse; has single business focus;

Data Lake

centralized storage repository that holds a vast amount of raw data (big data) in either semi-structured or unstructured format; hoarding for data scientists

Azure Data Fundamentals Flashcards

(38 cards)