01 Understanding Big Data And PySpark Flashcards
Components of YARN
Resource manager, Node manager, Application manager
Process of YARN
Resource manager takes requests from the client
Node manager sends node status reports
Application manager allocates CPU and RAM so the request can be processed
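The request flow above can be sketched as a toy simulation. This is only an illustration under stated assumptions, not the YARN API: the class `ToyCluster`, its methods, the node names, and the resource numbers are all hypothetical, and the manager roles are collapsed into one object for brevity.

```python
# Toy model of the YARN flow described above (NOT the real YARN API).
# All names and numbers are hypothetical, for illustration only.

class ToyCluster:
    def __init__(self):
        # node name -> free resources, as last reported by each node manager
        self.node_status = {}

    def report_status(self, node, free_cpu, free_ram_gb):
        """Node manager step: send a node status report."""
        self.node_status[node] = {"cpu": free_cpu, "ram_gb": free_ram_gb}

    def submit(self, cpu, ram_gb):
        """Accept a client request, then allocate CPU and RAM
        on the first reported node that can fit it."""
        for node, free in self.node_status.items():
            if free["cpu"] >= cpu and free["ram_gb"] >= ram_gb:
                free["cpu"] -= cpu
                free["ram_gb"] -= ram_gb
                return node
        return None  # no node has enough free resources


cluster = ToyCluster()
cluster.report_status("node-1", free_cpu=2, free_ram_gb=4)
cluster.report_status("node-2", free_cpu=8, free_ram_gb=32)
placed_on = cluster.submit(cpu=4, ram_gb=16)
print(placed_on)
```

Here the request does not fit on `node-1`, so it is placed on the next node whose reported free resources are large enough.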
Semi-structured data
Key and value pair
RDBMS
E.g. JSON, XML
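A small sketch of what "key and value pair" means in practice, using Python's standard `json` module. The records and field names are made up for illustration.

```python
import json

# Two hypothetical records: both are key-value pairs, but their keys differ.
raw = '[{"id": 1, "name": "Asha"}, {"id": 2, "name": "Ben", "tags": ["vip"]}]'
records = json.loads(raw)

# A missing key is handled per record; there is no fixed table schema.
tags = [record.get("tags", []) for record in records]
print(tags)
```

Unlike rows in a table, each record carries its own keys, so one record can have a `tags` field while another does not.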
Lakehouse
Spark without Hadoop
Process of HDFS
The data is broken down into blocks
Each block is stored on a data node
Name node holds the details of every block and its location
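The three steps above can be sketched in plain Python. This is a toy model, not the HDFS client API: the function names, the tiny block size, the node names, and the round-robin placement are assumptions made for illustration, and block replication is left out.

```python
# Toy model of the HDFS steps above (NOT the real HDFS API).
# Function names, block size, and node names are hypothetical.

def split_into_blocks(data, block_size):
    """Step 1: break the data into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(blocks, data_nodes):
    """Steps 2 and 3: store each block on a data node and record
    its location in a name-node-style index."""
    storage = {node: {} for node in data_nodes}   # what each data node holds
    block_locations = {}                          # block id -> data node
    for block_id, block in enumerate(blocks):
        node = data_nodes[block_id % len(data_nodes)]  # round-robin placement
        storage[node][block_id] = block
        block_locations[block_id] = node
    return storage, block_locations


data = b"abcdefghij"
blocks = split_into_blocks(data, block_size=4)
storage, block_locations = place_blocks(blocks, ["dn1", "dn2"])

# Reading back: look up each block's node in the index, then fetch it.
restored = b"".join(storage[block_locations[i]][i] for i in sorted(block_locations))
print(block_locations)
```

The index plays the name node's role: it holds no data itself, only which data node has which block.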
Features of distributed approach
Horizontal
Easily scalable
Cheap
Fault tolerance
Structured data
Rows and columns
COBOL & RDBMS
Map/Reduce
Dividing problems into simple parts and solving accordingly
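A word count is the usual illustration of this idea. The sketch below runs on one machine in plain Python, so it only shows the shape of the map, shuffle, and reduce phases, not Hadoop's distributed execution. The input lines are made up.

```python
from collections import defaultdict
from functools import reduce

lines = ["big data big ideas", "data beats ideas"]

# Map: turn each line into (word, 1) pairs, independently of the other lines.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group the pairs by key (the word).
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce: combine each word's values into a single total.
counts = {word: reduce(lambda a, b: a + b, values) for word, values in grouped.items()}
print(counts)
```

Because each line is mapped independently and each word is reduced independently, both phases can be spread across many machines.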
Approaches to solving big data problems
Monolithic & Distributed
Big data problems
Volume, velocity, variety
Unstructured data
No structure
E.g. images and audio
Monolithic
Vertical
Difficult to scale
No fault tolerance
Expensive
Data lake
Spark with Hadoop
Components of HDFS
Name node
Data node
Hadoop Architecture
YARN, HDFS, Map/Reduce