Cloud Computing Concepts Flashcards
What is Big Data?
Types of Data
- Structured Data (tabular format)
- Semi-Structured Data (JSON, XML, HTML)
- Unstructured Data
Azure Storage Options (90+)
- Azure Blob Storage (40 GB, limited storage; all kinds of data)
- Azure Data Lake Storage Gen2 (unlimited; all kinds of data)
- Azure SQL DB (structured data only)
- Azure Cosmos DB (NoSQL; structured and semi-structured)
- Azure Elastic Pool (group of DBs; structured)
- Azure Synapse Analytics (data warehouse; structured)
ADB
ADB is built primarily around ETL and supports ELT as well.
ADB Components
Resource Group: it is like a folder; the ADB workspace resides in it
–> Workspace: provides an interface for all users to work collaboratively in a single session
————> Cluster: a group of machines working as a single machine
ADB
Azure Databricks is a cloud service that provides a scalable platform for data analytics using Apache Spark.
Cluster
A group of machines working as a single machine. Maximum worker nodes: 100,000.
Types:
1. All Purpose Compute
2. Job Compute
3. SQL Warehouse
4. Pools
All Purpose Cluster
All-purpose clusters can be shared by multiple users and are best for performing ad-hoc analysis, data exploration, or development. Once you’ve completed implementing your processing and are ready to operationalize your code, switch to running it on a job cluster. It can be created with or without a pool.
Job Cluster
It is created automatically when a job starts and terminated automatically when the job ends, without human intervention, reducing resource usage and cost.
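A minimal sketch of how a job cluster gets defined: the Databricks Jobs API lets you attach a `new_cluster` spec to a job, and Databricks creates and tears down that cluster around each run. The workspace URL, token, notebook path, VM size, and runtime version below are placeholders, not values from these notes.

```python
# Hypothetical sketch: create a job that runs on its own job cluster via the
# Databricks Jobs API (2.1). All values below are placeholders.
import requests

WORKSPACE_URL = "https://<your-workspace>.azuredatabricks.net"  # placeholder
TOKEN = "<personal-access-token>"                               # placeholder

job_spec = {
    "name": "nightly-etl",
    "tasks": [
        {
            "task_key": "run_notebook",
            "notebook_task": {"notebook_path": "/Repos/etl/nightly"},  # placeholder path
            # "new_cluster" makes this a job cluster: it is created when the job
            # starts and terminated automatically when the job ends.
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",  # example runtime
                "node_type_id": "Standard_DS3_v2",    # example Azure VM size
                "num_workers": 2,
            },
        }
    ],
}

resp = requests.post(
    f"{WORKSPACE_URL}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
print(resp.json())  # returns the new job_id on success
```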
Workspace
A collaborative environment that allows multiple users to work in a single session.
Magic Commands
Magic commands are used to override the default language for a single notebook cell (see the sketch after the list).
1. %python, %py
2. %scala
3. %r
4. %sql
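A rough illustration of magics in a notebook whose default language is Python; `my_table` is a hypothetical table name. In a real notebook the magic is the first line of a cell, so the SQL cell is shown here as comments.

```python
# Cell 1 - the notebook's default language is Python, so no magic is needed:
df = spark.range(5)   # `spark` is the SparkSession pre-created by Databricks
df.show()

# Cell 2 - overriding the language of just this cell with a magic command:
# %sql
# SELECT COUNT(*) FROM my_table    -- my_table is a placeholder name
```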
Dynamic data masking
Dynamic data masking helps prevent unauthorized access to sensitive data by enabling customers to designate how much of the sensitive data to reveal with minimal impact on the application layer. It is a policy-based security feature that hides the sensitive data in the result set of a query over designated database fields, while the data in the database is not changed.
Apache Spark Architecture
It consists of three major components:
1. Master Node (Driver Node)
2. Cluster Manager
3. Worker Node
Workspace Assets
In an ADB workspace, we can manage different assets:
* Cluster
* Notebook
* Jobs
* Libraries
* Folders
* Models
* Experiments
Workspace: a repository with a folder-like structure that contains all the Azure Databricks assets.
On-Demand Instances
On-demand instances in Azure are like renting a computer whenever you need it, without any long-term commitment. You pay for the time you use it, and you can start or stop it whenever you want. These instances are flexible and can be scaled up or down based on your needs.
Spot Instances
Azure’s Spot Virtual Machines offer discounted pricing by using spare capacity in Azure’s data centers. However, if there’s high demand for this capacity from paying customers or for Azure’s own needs, Azure can reclaim these Spot VMs.
When Azure reclaims a Spot VM:
Eviction Notice: Azure gives a 30-second eviction notice to the Spot VM.
Deallocation: The VM is gracefully deallocated, giving you a short time window to save work or perform any necessary shutdown procedures.
What is the purpose of Spark Context?
Spark Context serves as the entry point to Spark, managing connections to a Spark cluster and coordinating job execution.
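A minimal local sketch of the classic entry point; `local[*]` simply runs Spark on the local machine for illustration.

```python
# Pre-2.0 style: build a SparkContext directly and use it for RDD work.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("flashcards-demo").setMaster("local[*]")
sc = SparkContext(conf=conf)

rdd = sc.parallelize([1, 2, 3, 4])
print(rdd.sum())   # 10

sc.stop()
```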
In which Spark version was Spark Session introduced?
Spark Session was introduced in Spark 2.0 as an evolution of Spark Context and SQL Context.
What functionalities did Spark Context primarily handle?
Spark Context was mainly responsible for RDD-based operations, managing resources, and interacting with the Spark cluster.
What is the broader scope of Spark Session in comparison to Spark Context?
Spark Session covers a wider scope by providing a unified entry point, supporting DataFrames, Datasets, and simplifying interactions with structured data.
How does Spark Session differ in its API evolution from Spark Context?
While Spark Context was the primary entry point in older Spark versions, Spark Session is the recommended entry point in newer versions, offering a higher-level API and unifying different context functionalities.
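A short sketch of the Spark 2.0+ entry point; the app name and local master are illustrative only (in a Databricks notebook the session already exists as `spark`).

```python
# SparkSession unifies SparkContext and SQLContext behind one entry point.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("flashcards-demo")
    .master("local[*]")          # local mode, for illustration only
    .getOrCreate()
)

sc = spark.sparkContext          # the older SparkContext is still reachable

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
df.show()

spark.stop()
```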
wget
Similar to cURL, wget retrieves content from web servers but with a focus on downloading files. It’s capable of recursive downloads, supports resuming interrupted downloads, and works well for fetching entire websites or specific files.
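For completeness, a hedged sketch of invoking wget from Python; it assumes the wget binary is installed, and the URL is a placeholder.

```python
# Rough equivalent of running `wget -c <url>` (-c resumes interrupted downloads).
import subprocess

url = "https://example.com/data/sample.csv"      # placeholder URL
subprocess.run(["wget", "-c", url], check=True)  # raises if wget exits with an error
```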
What is the primary API provided by Spark for working with structured data?
The primary API for working with structured data in Spark is the DataFrame API.
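A minimal DataFrame sketch with made-up sample data; the column names and values are illustrative only.

```python
# DataFrame API: structured data with named columns and declarative operations.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("alice", 34), ("bob", 29), ("carol", 41)], ["name", "age"])

(df.filter(F.col("age") > 30)                 # declarative filter
   .withColumn("name_upper", F.upper("name"))
   .show())
```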
Which Spark API is used for working with distributed collections of data?
The Resilient Distributed Dataset (RDD) API is used for working with distributed collections of data in Spark.
Which API in Spark provides higher-level abstractions and optimizations for working with structured data?
DataFrames and Datasets APIs provide higher-level abstractions and optimizations for working with structured data compared to RDDs.
What is the main programming language used with Spark APIs?
Spark supports APIs in multiple languages, but the primary language is Scala, followed by Java, Python, and R.
Which Spark API provides SQL-like querying capabilities for working with data?
The DataFrame API offers SQL-like querying capabilities, allowing users to run SQL queries programmatically on distributed data.
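A small sketch of SQL over a DataFrame; the view name `people` is just an example.

```python
# Register a DataFrame as a temporary view and query it with plain SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])
df.createOrReplaceTempView("people")   # example view name

spark.sql("SELECT name FROM people WHERE age > 30").show()
```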
What are the characteristics of RDDs in Spark?
RDDs are immutable, fault-tolerant, and lazily evaluated. They can be rebuilt from lineage in case of failure.
How can you create an RDD in Spark?
RDDs can be created by parallelizing an existing collection in memory, by loading data from external storage (like HDFS, S3), or by transforming an existing RDD using operations like map, filter, etc.
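A sketch of the three creation paths; the file path is a placeholder and would need to exist for the textFile example to produce data.

```python
# Three common ways to obtain an RDD.
from pyspark.sql import SparkSession

sc = SparkSession.builder.getOrCreate().sparkContext

rdd_from_memory = sc.parallelize(range(10))              # 1. parallelize an in-memory collection
rdd_from_file   = sc.textFile("/tmp/sample.txt")         # 2. load from external storage (placeholder path)
rdd_transformed = rdd_from_memory.map(lambda x: x * 2)   # 3. transform an existing RDD
```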
What operations can you perform on RDDs in Spark?
RDDs support two types of operations: transformations (like map, filter, reduceByKey) and actions (like count, collect, saveAsTextFile) enabling data transformation and computation.
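A quick sketch contrasting the two: transformations only describe the computation, while actions execute it.

```python
# Transformations are lazy; actions trigger the actual computation.
from pyspark.sql import SparkSession

sc = SparkSession.builder.getOrCreate().sparkContext

rdd = sc.parallelize([1, 2, 3, 4, 5])

evens_doubled = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * 2)  # nothing runs yet

print(evens_doubled.count())    # action: 2
print(evens_doubled.collect())  # action: [4, 8]
```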
How does RDD lineage contribute to fault tolerance?
RDD lineage tracks the sequence of transformations, enabling Spark to recompute lost partitions by reapplying these transformations in case of node failures.
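A minimal way to see lineage from PySpark: `toDebugString()` prints the chain of transformations Spark would replay to rebuild a lost partition.

```python
# Inspect the lineage (dependency chain) of an RDD.
from pyspark.sql import SparkSession

sc = SparkSession.builder.getOrCreate().sparkContext

rdd = (sc.parallelize(range(100))
         .map(lambda x: (x % 10, x))
         .reduceByKey(lambda a, b: a + b))

print(rdd.toDebugString().decode())   # lineage graph as text (PySpark returns bytes)
```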