Cloud Computing Concepts Flashcards
What is Big Data?
Types of Data
- Structured Data (tabular format)
- Semi-Structured Data (JSON, XML, HTML)
- Unstructured Data
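A small sketch of the three data types above, using only the Python standard library. The sample values (names, cities, the note text) are made up for illustration.

```python
import csv
import io
import json

# Structured: fixed columns, every row has the same shape (tabular).
csv_text = "id,name,city\n1,Asha,Pune\n2,Ravi,Delhi\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
structured_columns = list(rows[0].keys())

# Semi-structured: self-describing keys, but records may differ in shape.
json_text = '[{"id": 1, "name": "Asha"}, {"id": 2, "name": "Ravi", "tags": ["vip"]}]'
records = json.loads(json_text)
semi_structured_keys = [sorted(record.keys()) for record in records]

# Unstructured: no schema at all; here it is just free text.
note = "Customer called about a late delivery and asked for a refund."
word_count = len(note.split())

print(structured_columns)
print(semi_structured_keys)
print(word_count)
```

Note how the second JSON record carries an extra "tags" key; a table would need a new column for that, while JSON tolerates it.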
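Azure Storage Services (90+)
- Azure Blob Storage (object storage; capacity is capped by storage account limits, not by a fixed small quota) (All kinds of data)
- Azure Data Lake Storage Gen2 (Blob Storage plus a hierarchical namespace; scales to very large data volumes) (All kinds of data)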
- Azure SQL DB (only Structured data)
- Azure Cosmos DB: NoSQL (Structured, Semi-Structured)
- Azure Elastic Pool (group of Azure SQL databases sharing resources) (Structured)
- Azure Synapse Analytics (Data Warehouse) (Structured)
ADB
ADB (Azure Databricks) is built for ETL (extract, transform, load) workloads and supports ELT (extract, load, transform) as well
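The difference between ETL and ELT can be sketched with an in-memory SQLite database standing in for the target system. This is only an analogy, not Databricks code: the table names and sample rows are hypothetical.

```python
import sqlite3

# Raw source rows: names in lowercase, amounts still stored as text.
raw_rows = [("asha", "100"), ("ravi", "250")]

def etl(conn, rows):
    # ETL: transform in the pipeline first, then load the cleaned rows.
    cleaned = [(name.title(), int(amount)) for name, amount in rows]
    conn.execute("CREATE TABLE sales_etl (name TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales_etl VALUES (?, ?)", cleaned)

def elt(conn, rows):
    # ELT: load the raw rows as-is, then transform inside the target with SQL.
    conn.execute("CREATE TABLE sales_raw (name TEXT, amount TEXT)")
    conn.executemany("INSERT INTO sales_raw VALUES (?, ?)", rows)
    conn.execute(
        "CREATE TABLE sales_elt AS "
        "SELECT upper(substr(name, 1, 1)) || substr(name, 2) AS name, "
        "CAST(amount AS INTEGER) AS amount FROM sales_raw"
    )

conn = sqlite3.connect(":memory:")
etl(conn, raw_rows)
elt(conn, raw_rows)
etl_result = conn.execute("SELECT name, amount FROM sales_etl ORDER BY name").fetchall()
elt_result = conn.execute("SELECT name, amount FROM sales_elt ORDER BY name").fetchall()
print(etl_result)
print(elt_result)
```

Both paths end with the same cleaned table; what changes is where the transform step runs.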
ADB Components
Resource Group: a logical container, like a folder, in which the ADB workspace resides
–> WorkSpace: provides an interface for all users to work collaboratively in a single session
————> Cluster: Group of machines working as a single machine
ADB
Azure Databricks is a cloud service that provides a scalable platform for data analytics using Apache Spark.
Cluster
Group of machines working like a single machine. Max worker nodes 100,000
Types:
1. All Purpose Compute
2. Job Compute
3. SQL Warehouse
4. Pools
All Purpose Cluster
All-purpose clusters can be shared by multiple users and are best for performing ad-hoc analysis, data exploration, or development. Once you’ve completed implementing your processing and are ready to operationalize your code, switch to running it on a job cluster. It can be created with or without a pool.
Job Cluster
It is created automatically by the job scheduler when a job run starts and is terminated when the run ends, without human intervention. Because job clusters terminate when the job ends, they reduce resource usage and cost.
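As a rough sketch, a job that runs on its own job cluster is described by a payload that carries a cluster specification inside the task. The field names below follow the Databricks Jobs API 2.1 as I understand it; the job name, notebook path, runtime version, and node type are placeholder assumptions, so check them against the official documentation before use.

```python
def build_job_payload(notebook_path, num_workers):
    """Build a job definition whose task runs on a new (job) cluster."""
    return {
        "name": "nightly-report",  # hypothetical job name
        "tasks": [
            {
                "task_key": "run_notebook",
                "notebook_task": {"notebook_path": notebook_path},
                # "new_cluster" asks for a cluster that exists only for this run.
                "new_cluster": {
                    "spark_version": "13.3.x-scala2.12",  # assumed runtime version
                    "node_type_id": "Standard_DS3_v2",    # assumed Azure VM size
                    "num_workers": num_workers,
                },
            }
        ],
    }

payload = build_job_payload("/Shared/nightly_report", 2)
task = payload["tasks"][0]
print(task["new_cluster"]["num_workers"])
```

To run the same task on an all-purpose cluster instead, the task would reference a running cluster through "existing_cluster_id" rather than carrying a "new_cluster" block.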
Workspace
A collaborative environment that allows multiple users to work in a single session.
Magic Commands
Placed on the first line of a notebook cell to run that cell in a language other than the notebook's default language
1. %python, %py
2. %scala
3. %r
4. %sql
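To make the idea concrete, here is a toy sketch of how a notebook could pick a cell's language from its first token. This is not how Databricks implements it; the function name cell_language and the lookup table are hypothetical and only mirror the magic commands listed above.

```python
# Hypothetical mapping from magic command to language, taken from the list above.
MAGIC_LANGUAGES = {
    "%python": "python",
    "%py": "python",
    "%scala": "scala",
    "%r": "r",
    "%sql": "sql",
}

def cell_language(cell_source, default_language="python"):
    """Return the language a cell would run in, based on its leading magic command."""
    stripped = cell_source.lstrip()
    first_line = stripped.splitlines()[0] if stripped else ""
    parts = first_line.split(maxsplit=1)
    first_token = parts[0] if parts else ""
    # Cells without a known magic command fall back to the notebook default.
    return MAGIC_LANGUAGES.get(first_token, default_language)

print(cell_language("%sql\nSELECT 1"))
print(cell_language("print('hello')"))
```

The override applies to one cell only; the next cell without a magic command goes back to the notebook's default language.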
Dynamic data masking
Dynamic data masking helps prevent unauthorized access to sensitive data by enabling customers to designate how much of the sensitive data to reveal with minimal impact on the application layer. It is a policy-based security feature that hides the sensitive data in the result set of a query over designated database fields, while the data in the database is not changed.
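The behaviour described above can be imitated in a few lines of Python: a per-column policy is applied to query results for non-privileged users, while the stored rows stay untouched. This is a sketch of the concept only, not the database feature itself; the masking functions, the policy table, and the sample customer are all hypothetical.

```python
def mask_email(value):
    # Keep the first character and the domain, hide the rest of the local part.
    local, _, domain = value.partition("@")
    return local[:1] + "***@" + domain

def mask_last4(value):
    # Reveal only the last four characters.
    return "X" * max(0, len(value) - 4) + value[-4:]

# Policy: which columns are masked, and how.
MASKING_POLICY = {"email": mask_email, "card_number": mask_last4}

def run_query(table, user_is_privileged):
    """Return rows; apply the masking policy unless the user is privileged."""
    result = []
    for row in table:
        if user_is_privileged:
            result.append(dict(row))
        else:
            result.append(
                {col: MASKING_POLICY.get(col, lambda v: v)(val) for col, val in row.items()}
            )
    return result

customers = [
    {"name": "Asha", "email": "asha@example.com", "card_number": "4111111111111111"}
]
masked = run_query(customers, user_is_privileged=False)
unmasked = run_query(customers, user_is_privileged=True)
print(masked[0])
```

The masking happens only in the result set, so the application needs no changes and the underlying data remains complete for authorized users.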
Apache Spark Architecture
It contains 3 major components
1. Master Node (Driver Node)
2. Cluster Manager
3. Worker Node
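How the three components cooperate can be imitated with a thread pool: the driver splits the data into partitions, a pool (standing in for the cluster manager and its worker nodes) runs one task per partition, and the driver combines the partial results. This is an analogy in plain Python, not Spark code, and every function name here is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_partitions(data, num_partitions):
    # Driver side: divide the data so each worker gets its own slice.
    return [data[i::num_partitions] for i in range(num_partitions)]

def worker_task(partition):
    # Worker side: process one partition independently of the others.
    return sum(x * x for x in partition)

def driver_program(data, num_workers=3):
    partitions = split_into_partitions(data, num_workers)
    # The pool plays the role of the cluster manager handing tasks to workers.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_results = list(pool.map(worker_task, partitions))
    # Driver side: combine the partial results into the final answer.
    return sum(partial_results)

total = driver_program(list(range(1, 11)))
print(total)
```

In a real cluster the workers are separate machines, which is why the driver only ships tasks and collects results instead of doing the heavy computation itself.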
Workspace Assets
In an ADB workspace, we can manage different assets:
* Cluster
* Notebook
* Jobs
* Libraries
* Folders
* Models
* Experiments
Workspace: A repository with a folder-like structure that contains all the Azure Databricks assets.
On-Demand Instances
On-demand instances in Azure are like renting a computer whenever you need it without any long-term commitments. You pay for the time you use the computer, and you can start or stop it whenever you want. These instances are flexible and can be scaled up or down based on your needs.
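The pay-for-what-you-use idea comes down to simple arithmetic. The sketch below uses a made-up hourly rate; real Azure prices vary by VM size and region.

```python
def on_demand_cost(hours_used, hourly_rate, instance_count=1):
    """Cost of running instances on demand: billed only for the time used."""
    return hours_used * hourly_rate * instance_count

# Hypothetical rate of 0.5 per instance-hour.
scaled_up = on_demand_cost(hours_used=2.5, hourly_rate=0.5, instance_count=3)
scaled_down = on_demand_cost(hours_used=4, hourly_rate=0.5, instance_count=1)
stopped = on_demand_cost(hours_used=0, hourly_rate=0.5, instance_count=3)

print(scaled_up, scaled_down, stopped)
```

A stopped instance accrues no compute cost in this model, which is the main contrast with a long-term commitment.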