01 Understanding Big Data And PySpark Flashcards by Charita Rallabhandi

Components of YARN

Resource manager, Node manager, Application master

Process of YARN

The resource manager accepts requests from the client

Each node manager sends node status reports to the resource manager

The application master negotiates CPU and RAM from the resource manager so the request can be processed

Semi-structured data

Key-value pairs

RDBMS

E.g. JSON, XML
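
To make the key-value idea concrete, here is a minimal sketch that parses one JSON record with Python's standard `json` module. The record and its field names are made up for illustration.

```python
import json

# Hypothetical record: the field names and values are invented for this example.
raw = '{"id": 1, "name": "Asha", "tags": ["spark", "hadoop"], "address": {"city": "Pune"}}'

record = json.loads(raw)  # parse the JSON text into a Python dict

# Values are reached by key; values may be nested, and fields can differ per record.
city = record["address"]["city"]
tag_count = len(record["tags"])
print(city, tag_count)
```

Unlike a table with fixed columns, nothing forces the next record to carry the same keys.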

Lakehouse

Spark without Hadoop

Process of HDFS

The data is split into blocks

Each block is stored on a data node (and, by default, replicated to other data nodes)

The name node holds the metadata: which blocks make up each file and where each block is located
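
The process can be sketched in plain Python. This is a toy model, not the HDFS API: the function names, the round-robin placement, and the 16-byte block size are assumptions chosen for illustration, and replication is left out to keep it short.

```python
def split_into_blocks(data, block_size):
    """Break the data into fixed-size blocks (the last block may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(blocks, data_nodes):
    """Assign blocks to data nodes round-robin and record where each one went.

    Returns (storage, block_map):
      storage   - what each data node holds: node -> {block_id: bytes}
      block_map - the name node's metadata: block_id -> node
    """
    storage = {node: {} for node in data_nodes}
    block_map = {}
    for block_id, block in enumerate(blocks):
        node = data_nodes[block_id % len(data_nodes)]
        storage[node][block_id] = block
        block_map[block_id] = node
    return storage, block_map


def read_file(storage, block_map):
    """Reassemble the data by looking up each block's location in the block map."""
    return b"".join(storage[block_map[b]][b] for b in sorted(block_map))


data = b"0123456789" * 5                          # 50 bytes of toy data
blocks = split_into_blocks(data, block_size=16)   # toy size; real HDFS blocks are far larger
storage, block_map = place_blocks(blocks, ["dn1", "dn2", "dn3"])
restored = read_file(storage, block_map)
print(block_map)
```

The data nodes only hold bytes. The block map plays the role of the name node, which is why a read has to consult it first.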

Features of the distributed approach

Horizontal scaling

Easily scalable

Cheap

Fault tolerant

Structured data

Rows and columns

COBOL & RDBMS

Map/Reduce

Dividing a problem into smaller parts, solving each part independently (map), and combining the partial results (reduce)
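
A word count is the usual way to illustrate this. The sketch below runs on a single machine in plain Python and does not use Hadoop; the function names and the sample lines are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: turn one line into (word, 1) pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


lines = ["big data big problems", "big clusters"]  # each line could live on a different node
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

Each call to `map_phase` depends only on its own line, which is what allows the map step to run on many machines at once.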

Approaches to solving big data problems

Monolithic & Distributed

Big data problems

Volume, velocity, variety

Unstructured data

No structure

E.g. images and audio

Monolithic

Vertical scaling

Difficult to scale

No fault tolerance

Expensive

Data lake

Spark with Hadoop

Components of HDFS

Name node

Data node

Hadoop Architecture

YARN, HDFS, Map/Reduce

Types of Data

structured, semi-structured, unstructured

Structure of a data lake

The data lake is where all raw data is dumped; a middle layer, the data warehouse, stores the data from the data lake after ETL.

Data warehouse

Initially all the data was stored in the data lake, which led to data disturbances. So now the data is stored in a data warehouse after ETL, from where it is given to Power BI.

Which type of architecture does Apache Spark have?

Distributed.

2 layers of the Spark ecosystem

Core layer
Set of DSLs (domain-specific languages)

Parts of the core layer

Distributed compute engine
Set of core APIs

What does the distributed compute engine do?

Runs data processing workloads.
Responsible for map/reduce-style execution and fault tolerance.

What is the set of core APIs?

APIs that let us write logic in Scala, R, Java, or Python

What is the set of DSLs?

There are 4 groups:
Spark SQL & DataFrames - allow us to query structured data with SQL
Streaming - allows us to process continuous, unbounded data
MLlib - machine learning libraries
GraphX - graph processing libraries

Advantages of the Spark ecosystem

1. Abstraction - hides the fact that we are working on a cluster of computers
2. Unified platform - combines the capabilities of all the DSLs and APIs
3. Easy to use

Types of Structured Data

1. Time series
2. Cross-sectional
3. Panel data

What is time series data?

Data collected for a single observational unit over a period of time.

What is cross-sectional data?

Data collected for multiple observational units at a single point in time.

What is panel data?

Data collected for multiple observational units over a period of time.
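
The three types can be seen as different slices of one table. The sketch below uses a made-up set of (store, year, sales) observations; the stores, years, and figures are hypothetical.

```python
# Hypothetical observations: (store, year, sales). Values are invented for illustration.
panel = [
    ("A", 2021, 10), ("A", 2022, 12), ("A", 2023, 15),
    ("B", 2021, 7),  ("B", 2022, 9),  ("B", 2023, 8),
]

# Time series: one unit (store A) followed over time.
time_series = [(year, sales) for store, year, sales in panel if store == "A"]

# Cross-sectional: many units observed at one point in time (2022).
cross_section = [(store, sales) for store, year, sales in panel if year == 2022]

# Panel: the full table, many units followed over time.
units = sorted({store for store, _, _ in panel})
years = sorted({year for _, year, _ in panel})
print(time_series, cross_section, units, years)
```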