Cloud Computing Concepts Flashcards
What is Big Data?
Types of Data
- Structured Data (tabular format)
- Semi-Structured Data (JSON, XML, HTML)
- Unstructured Data
Azure Storage Options (90+)
- Azure Blob Storage (40 GB, limited storage; all kinds of data)
- Azure Data Lake Storage Gen2 (unlimited; all kinds of data)
- Azure SQL DB (structured data only)
- Azure Cosmos DB (NoSQL; structured and semi-structured)
- Azure Elastic Pool (group of DBs; structured)
- Azure Synapse Analytics (data warehouse; structured)
ADB
ADB is built primarily around ETL and supports ELT as well.
ADB Components
Resource Group: it is like a folder; the ADB workspace resides in it
–> Workspace: provides an interface for all users to work collaboratively in a single session
————> Cluster: a group of machines working as a single machine
ADB
Azure Databricks is a cloud service that provides a scalable platform for data analytics using Apache Spark.
Cluster
A group of machines working as a single machine. Maximum worker nodes: 100,000.
Types:
1. All Purpose Compute
2. Job Compute
3. SQL Warehouse
4. Pools
All Purpose Cluster
All-purpose clusters can be shared by multiple users and are best for performing ad-hoc analysis, data exploration, or development. Once you’ve completed implementing your processing and are ready to operationalize your code, switch to running it on a job cluster. It can be created with or without a pool.
Job Cluster
It is created automatically when a job starts and terminated automatically when the job ends, without human intervention, reducing resource usage and cost.
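A minimal sketch of how a job cluster gets defined: the Databricks Jobs API lets you attach a `new_cluster` spec to a job, and Databricks creates and tears down that cluster around each run. The workspace URL, token, notebook path, VM size, and runtime version below are placeholders, not values from these notes.

```python
# Hypothetical sketch: create a job that runs on its own job cluster via the
# Databricks Jobs API (2.1). All values below are placeholders.
import requests

WORKSPACE_URL = "https://<your-workspace>.azuredatabricks.net"  # placeholder
TOKEN = "<personal-access-token>"                               # placeholder

job_spec = {
    "name": "nightly-etl",
    "tasks": [
        {
            "task_key": "run_notebook",
            "notebook_task": {"notebook_path": "/Repos/etl/nightly"},  # placeholder path
            # "new_cluster" makes this a job cluster: it is created when the job
            # starts and terminated automatically when the job ends.
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",  # example runtime
                "node_type_id": "Standard_DS3_v2",    # example Azure VM size
                "num_workers": 2,
            },
        }
    ],
}

resp = requests.post(
    f"{WORKSPACE_URL}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
print(resp.json())  # returns the new job_id on success
```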
Workspace
A collaborative environment that allows multiple users to work in a single session.
Magic Commands
Magic commands are used to override the default language for a single notebook cell (see the sketch after the list).
1. %python, %py
2. %scala
3. %r
4. %sql
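A rough illustration of magics in a notebook whose default language is Python; `my_table` is a hypothetical table name. In a real notebook the magic is the first line of a cell, so the SQL cell is shown here as comments.

```python
# Cell 1 - the notebook's default language is Python, so no magic is needed:
df = spark.range(5)   # `spark` is the SparkSession pre-created by Databricks
df.show()

# Cell 2 - overriding the language of just this cell with a magic command:
# %sql
# SELECT COUNT(*) FROM my_table    -- my_table is a placeholder name
```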
Dynamic data masking
Dynamic data masking helps prevent unauthorized access to sensitive data by enabling customers to designate how much of the sensitive data to reveal with minimal impact on the application layer. It is a policy-based security feature that hides the sensitive data in the result set of a query over designated database fields, while the data in the database is not changed.
Apache Spark Architecture
It consists of three major components:
1. Master Node (Driver Node)
2. Cluster Manager
3. Worker Node
Workspace Assets
In an ADB workspace, we can manage different assets:
* Cluster
* Notebook
* Jobs
* Libraries
* Folders
* Models
* Experiments
Workspace: a repository with a folder-like structure that contains all the Azure Databricks assets.
On-Demand Instances
On-demand instances in Azure are like renting a computer whenever you need it, without any long-term commitment. You pay for the time you use it, and you can start or stop it whenever you want. These instances are flexible and can be scaled up or down based on your needs.
Spot Instances
Azure’s Spot Virtual Machines offer discounted pricing by using spare capacity in Azure’s data centers. However, if there’s high demand for this capacity from paying customers or for Azure’s own needs, Azure can reclaim these Spot VMs.
When Azure reclaims a Spot VM:
Eviction Notice: Azure gives a 30-second eviction notice to the Spot VM.
Deallocation: The VM is gracefully deallocated, giving you a short time window to save work or perform any necessary shutdown procedures.
What is the purpose of Spark Context?
Spark Context serves as the entry point to Spark, managing connections to a Spark cluster and coordinating job execution.
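A minimal local sketch of the classic entry point; `local[*]` simply runs Spark on the local machine for illustration.

```python
# Pre-2.0 style: build a SparkContext directly and use it for RDD work.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("flashcards-demo").setMaster("local[*]")
sc = SparkContext(conf=conf)

rdd = sc.parallelize([1, 2, 3, 4])
print(rdd.sum())   # 10

sc.stop()
```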
In which Spark version was Spark Session introduced?
Spark Session was introduced in Spark 2.0 as an evolution of Spark Context and SQL Context.
What functionalities did Spark Context primarily handle?
Spark Context was mainly responsible for RDD-based operations, managing resources, and interacting with the Spark cluster.
What is the broader scope of Spark Session in comparison to Spark Context?
Spark Session covers a wider scope by providing a unified entry point, supporting DataFrames, Datasets, and simplifying interactions with structured data.
How does Spark Session differ in its API evolution from Spark Context?
While Spark Context was the primary entry point in older Spark versions, Spark Session is the recommended entry point in newer versions, offering a higher-level API and unifying different context functionalities.
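A short sketch of the Spark 2.0+ entry point; the app name and local master are illustrative only (in a Databricks notebook the session already exists as `spark`).

```python
# SparkSession unifies SparkContext and SQLContext behind one entry point.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("flashcards-demo")
    .master("local[*]")          # local mode, for illustration only
    .getOrCreate()
)

sc = spark.sparkContext          # the older SparkContext is still reachable

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
df.show()

spark.stop()
```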
wget
Similar to cURL, wget retrieves content from web servers but with a focus on downloading files. It’s capable of recursive downloads, supports resuming interrupted downloads, and works well for fetching entire websites or specific files.
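For completeness, a hedged sketch of invoking wget from Python; it assumes the wget binary is installed, and the URL is a placeholder.

```python
# Rough equivalent of running `wget -c <url>` (-c resumes interrupted downloads).
import subprocess

url = "https://example.com/data/sample.csv"      # placeholder URL
subprocess.run(["wget", "-c", url], check=True)  # raises if wget exits with an error
```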
What is the primary API provided by Spark for working with structured data?
The primary API for working with structured data in Spark is the DataFrame API.
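A minimal DataFrame sketch with made-up sample data; the column names and values are illustrative only.

```python
# DataFrame API: structured data with named columns and declarative operations.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("alice", 34), ("bob", 29), ("carol", 41)], ["name", "age"])

(df.filter(F.col("age") > 30)                 # declarative filter
   .withColumn("name_upper", F.upper("name"))
   .show())
```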
Which Spark API is used for working with distributed collections of data?
The Resilient Distributed Dataset (RDD) API is used for working with distributed collections of data in Spark.
Which API in Spark provides higher-level abstractions and optimizations for working with structured data?
DataFrames and Datasets APIs provide higher-level abstractions and optimizations for working with structured data compared to RDDs.
What is the main programming language used with Spark APIs?
Spark supports APIs in multiple languages, but the primary language is Scala, followed by Java, Python, and R.
Which Spark API provides SQL-like querying capabilities for working with data?
The DataFrame API offers SQL-like querying capabilities, allowing users to run SQL queries programmatically on distributed data.
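A small sketch of SQL over a DataFrame; the view name `people` is just an example.

```python
# Register a DataFrame as a temporary view and query it with plain SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])
df.createOrReplaceTempView("people")   # example view name

spark.sql("SELECT name FROM people WHERE age > 30").show()
```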
What are the characteristics of RDDs in Spark?
RDDs are immutable, fault-tolerant, and lazily evaluated. They can be rebuilt from lineage in case of failure.
How can you create an RDD in Spark?
RDDs can be created by parallelizing an existing collection in memory, by loading data from external storage (like HDFS, S3), or by transforming an existing RDD using operations like map, filter, etc.
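A sketch of the three creation paths; the file path is a placeholder and would need to exist for the textFile example to produce data.

```python
# Three common ways to obtain an RDD.
from pyspark.sql import SparkSession

sc = SparkSession.builder.getOrCreate().sparkContext

rdd_from_memory = sc.parallelize(range(10))              # 1. parallelize an in-memory collection
rdd_from_file   = sc.textFile("/tmp/sample.txt")         # 2. load from external storage (placeholder path)
rdd_transformed = rdd_from_memory.map(lambda x: x * 2)   # 3. transform an existing RDD
```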
What operations can you perform on RDDs in Spark?
RDDs support two types of operations: transformations (like map, filter, reduceByKey) and actions (like count, collect, saveAsTextFile) enabling data transformation and computation.
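A quick sketch contrasting the two: transformations only describe the computation, while actions execute it.

```python
# Transformations are lazy; actions trigger the actual computation.
from pyspark.sql import SparkSession

sc = SparkSession.builder.getOrCreate().sparkContext

rdd = sc.parallelize([1, 2, 3, 4, 5])

evens_doubled = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * 2)  # nothing runs yet

print(evens_doubled.count())    # action: 2
print(evens_doubled.collect())  # action: [4, 8]
```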
How does RDD lineage contribute to fault tolerance?
RDD lineage tracks the sequence of transformations, enabling Spark to recompute lost partitions by reapplying these transformations in case of node failures.
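A minimal way to see lineage from PySpark: `toDebugString()` prints the chain of transformations Spark would replay to rebuild a lost partition.

```python
# Inspect the lineage (dependency chain) of an RDD.
from pyspark.sql import SparkSession

sc = SparkSession.builder.getOrCreate().sparkContext

rdd = (sc.parallelize(range(100))
         .map(lambda x: (x % 10, x))
         .reduceByKey(lambda a, b: a + b))

print(rdd.toDebugString().decode())   # lineage graph as text (PySpark returns bytes)
```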