01 Understanding Big Data And PySpark Flashcards
Components of YARN
Resource manager, Node manager, Application master
Process of YARN
Resource manager takes requests from the client
Node managers send node status reports (heartbeats) to the resource manager
Application master negotiates CPU and RAM (containers) with the resource manager so the request can be processed
Semi-structured data
Key-value pairs
Does not fit a fixed RDBMS table schema
E.g. JSON, XML
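As a small illustration of the key-value shape of semi-structured data, the Python sketch below parses a made-up JSON document with the standard library. The records and their field names are invented for this example; the point is only that each record carries its own keys, so two records need not share the same fields.

```python
import json

# Hypothetical JSON records: both have "name", but only one has "email".
raw = '[{"name": "Asha", "age": 31}, {"name": "Ben", "email": "ben@example.com"}]'

# Parsing yields a list of dicts, i.e. collections of key-value pairs.
records = json.loads(raw)

# Fields can differ per record, so collect the union of all keys seen.
all_keys = sorted({key for record in records for key in record})

# A missing key is read with a default (None) instead of raising an error.
emails = [record.get("email") for record in records]
```

A table with fixed rows and columns would need a value (or NULL) for every column in every row; here the structure travels with each record instead.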
Lakehouse
Spark without Hadoop
Process of HDFS
The data is broken down into blocks
Each block is stored on a data node (and replicated to other data nodes for fault tolerance)
Name node holds the metadata: details of every block and its location
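The HDFS process above can be sketched as a toy model in plain Python. This is not the HDFS API: the function names, the node names, the tiny block size, and the round-robin placement are all assumptions made for illustration. Real HDFS uses much larger blocks (128 MB by default) and its own placement policy.

```python
def split_into_blocks(data: bytes, block_size: int) -> list:
    """Break the data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(blocks: list, data_nodes: list, replication: int = 2) -> dict:
    """Toy name node: record which data nodes hold each block.

    Placement is simple round-robin, an assumption for this sketch only.
    """
    block_map = {}
    for block_id in range(len(blocks)):
        block_map[block_id] = [
            data_nodes[(block_id + r) % len(data_nodes)]
            for r in range(replication)
        ]
    return block_map


# Hypothetical 10-byte "file" split into 4-byte blocks over three data nodes.
blocks = split_into_blocks(b"abcdefghij", block_size=4)
block_map = place_blocks(blocks, ["dn1", "dn2", "dn3"])
```

The dictionary plays the role of the name node's metadata: it stores no data itself, only which data nodes hold each block.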
Features of distributed approach
Horizontal scaling
Easily scalable
Cheap
Fault tolerance
Structured data
Rows and columns
COBOL & RDBMS
Map/Reduce
Dividing problems into simple parts and solving accordingly
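A minimal sketch of the idea, using the classic word-count example in plain Python rather than Hadoop itself. The function names and the input lines are invented for illustration, and everything runs in one process here, whereas a real cluster would run the map and reduce steps on many nodes in parallel.

```python
from collections import defaultdict
from functools import reduce


def map_phase(line: str) -> list:
    """Map: turn one line into (word, 1) pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: list) -> dict:
    """Shuffle: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped: dict) -> dict:
    """Reduce: combine the values of each key into a single count."""
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in grouped.items()}


# Hypothetical input: each line could be handled by a different worker.
lines = ["big data big problems", "big clusters"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
```

Each line is mapped independently, which is the "simple parts" of the problem; only the final reduce step needs to see all values for a given word.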
Approaches to solving big data problems
Monolithic & Distributed
Big data problems
Volume, velocity, variety
Unstructured data
No structure
E.g. images and audio
Monolithic
Vertical scaling
Difficult to scale
No fault tolerance
Expensive
Data lake
Spark with Hadoop
Components of HDFS
Name node
Data node
Hadoop Architecture
YARN, HDFS, Map/Reduce
Types of Data
structured, semi-structured, unstructured
Structure of a data lake
The data lake is where all raw data is dumped; a middle layer, the data warehouse, stores data from the data lake after ETL.
Data warehouse
Initially all the data was stored in the data lake, which led to data quality problems. So now data is stored in a data warehouse after ETL, from where it is given to Power BI.
Which approach does Apache Spark use
Distributed (not monolithic).
Two layers of the Spark ecosystem
Core layer
Set of DSLs (domain-specific languages)
Parts of the core layer
Distributed engine
Set of core APIs
What does the distributed engine do
Runs data processing workloads.
Responsible for map/reduce-style execution and fault tolerance.
What is the set of core APIs
Language APIs that let us write logic in Scala, R, Java, or Python
What is the set of DSLs
It has 4 groups
Spark SQL & DataFrames - allows using SQL to query structured data, e.g. from a database
Streaming - allows processing continuous, unbounded data
MLlib - has machine learning libraries
GraphX - has graph processing libraries.