01 Understanding Big Data And PySpark Flashcards

1
Q

Components of YARN

A

ResourceManager, NodeManager, ApplicationMaster

2
Q

Process of YARN

A

The ResourceManager takes requests from the client

Each NodeManager sends node status reports to the ResourceManager

The ApplicationMaster negotiates CPU and RAM from the ResourceManager so the request can be processed
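
The request, report, and allocate flow on this card can be pictured with a toy model. This is only a sketch under loud assumptions: the class, method names, and first-fit rule below are hypothetical and are not the YARN API; real YARN schedulers are far more involved.

```python
class ResourceManager:
    """Toy stand-in (hypothetical): tracks free memory per node and grants containers."""

    def __init__(self):
        self.free_memory_mb = {}

    def receive_node_report(self, node: str, free_mb: int) -> None:
        # A node manager reports the status of its node.
        self.free_memory_mb[node] = free_mb

    def allocate(self, requested_mb: int):
        # Assumed first-fit rule: grant the request on the first node with room.
        for node, free_mb in self.free_memory_mb.items():
            if free_mb >= requested_mb:
                self.free_memory_mb[node] = free_mb - requested_mb
                return node
        return None  # no node can satisfy the request right now


rm = ResourceManager()
rm.receive_node_report("node-a", 1024)
rm.receive_node_report("node-b", 4096)

# A request for a 2048 MB container arrives for one task.
granted_on = rm.allocate(2048)
print(granted_on)         # node-b
print(rm.free_memory_mb)  # {'node-a': 1024, 'node-b': 2048}
```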

3
Q

Semi-structured data

A

Key and value pairs

Does not fit the fixed rows and columns of an RDBMS

E.g. JSON, XML
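
As a minimal sketch of the key-value idea, the hypothetical JSON records below each carry their own field names, so two records need not share the same keys. Only Python's standard `json` module is used; the field names and values are made up for illustration.

```python
import json

# Hypothetical JSON records: each one names its own fields (keys),
# and the second record has a field the first one lacks.
raw_records = [
    '{"id": 1, "name": "Asha"}',
    '{"id": 2, "name": "Ben", "email": "ben@example.com"}',
]

records = [json.loads(line) for line in raw_records]

# Collect every key seen across the records, in first-seen order.
all_keys = []
for record in records:
    for key in record:
        if key not in all_keys:
            all_keys.append(key)

# Flattening to rows and columns leaves gaps (None) where a key is missing.
rows = [[record.get(key) for key in all_keys] for record in records]
print(all_keys)  # ['id', 'name', 'email']
print(rows)      # [[1, 'Asha', None], [2, 'Ben', 'ben@example.com']]
```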

4
Q

Lakehouse

A

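Spark without Hadoop; combines low-cost data lake storage with data warehouse features such as transactions and schema enforcement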

5
Q

Process of HDFS

A

The data is broken down into blocks

Each block is stored on a data node (and is usually replicated to other data nodes)

The name node holds the details of every block, including its location
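
The steps on this card can be sketched in plain Python. This is a toy illustration, not HDFS itself: the function names, the tiny block size, and the round-robin placement are all assumptions made for the example, and replication is left out. Real HDFS uses much larger blocks and its own placement policy.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Break the data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(blocks: list[bytes], data_nodes: list[str]) -> dict[int, str]:
    """Assign each block to a data node round-robin (an assumed policy) and
    return the block-to-node map that a name node would keep as metadata."""
    return {block_id: data_nodes[block_id % len(data_nodes)]
            for block_id in range(len(blocks))}


data = b"abcdefghij"  # 10 bytes of example data
blocks = split_into_blocks(data, block_size=4)
name_node_metadata = place_blocks(blocks, ["node-a", "node-b"])

print(blocks)              # [b'abcd', b'efgh', b'ij']
print(name_node_metadata)  # {0: 'node-a', 1: 'node-b', 2: 'node-a'}
```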

6
Q

Features of the distributed approach

A

Horizontal scaling (add more machines)

Easily scalable

Cheap

Fault tolerance

7
Q

Structured data

A

Rows and columns

COBOL & RDBMS

8
Q

Map/Reduce

A

Dividing a problem into simple parts that are processed in parallel (map), then combining the partial results (reduce)
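
A word count is the usual way to picture this. The sketch below runs on one machine in plain Python, so it only shows the shape of the idea (map, group by key, reduce); the function names are hypothetical and nothing here is Hadoop's actual API.

```python
from collections import defaultdict
from functools import reduce


def map_phase(line: str) -> list[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group the emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values of each key into one result."""
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in grouped.items()}


lines = ["big data big", "data lake"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 2, 'data': 2, 'lake': 1}
```

Each line could be mapped on a different machine, because `map_phase` needs nothing but its own line; that independence is what makes the approach easy to distribute.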

9
Q

Approaches to solving big data problems

A

Monolithic & Distributed

10
Q

Big data problems

A

Volume, velocity, variety

11
Q

Unstructured data

A

No structure

E.g. images and audio

12
Q

Monolithic

A

Vertical scaling (upgrade a single machine)

Difficult to scale

No fault tolerance

Expensive

13
Q

Data lake

A

Spark with Hadoop

14
Q

Components of HDFS

A

Name node

Data node

15
Q

Hadoop Architecture

A

YARN, HDFS, Map/Reduce

16
Q

Types of Data

A

structured, semi-structured, unstructured

17
Q

Structure of a data lake

A

The data lake is where all raw data is dumped. A middle layer, the data warehouse, stores the data from the data lake after ETL.
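
A rough sketch of that flow, using only the Python standard library: a list of raw strings stands in for the data lake and an in-memory SQLite table stands in for the data warehouse. The record fields, table name, and cleaning rules are all hypothetical.

```python
import json
import sqlite3

# Hypothetical raw records "dumped" into a data lake: no enforced schema.
data_lake = [
    '{"order_id": 1, "amount": "19.99", "country": "in"}',
    '{"order_id": 2, "amount": "5.00"}',
    'not valid json',
]


def extract_transform(raw_lines):
    """Extract parseable records and transform them into clean, typed rows."""
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip records that cannot be parsed
        yield (
            int(record["order_id"]),
            float(record["amount"]),
            record.get("country", "unknown").upper(),
        )


# Load: an in-memory SQLite table stands in for the data warehouse.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE orders (order_id INTEGER, amount REAL, country TEXT)")
warehouse.executemany("INSERT INTO orders VALUES (?, ?, ?)", extract_transform(data_lake))

rows = warehouse.execute("SELECT * FROM orders ORDER BY order_id").fetchall()
print(rows)  # [(1, 19.99, 'IN'), (2, 5.0, 'UNKNOWN')]
```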

18
Q

Data warehouse

A

Initially all the data was stored in the data lake, which led to data inconsistencies. So now data is stored in a data warehouse after ETL, from where it is given to Power BI.

19
Q

Which type of architecture does Apache Spark have

A

Distributed architecture.

20
Q

2 layers of the Spark ecosystem

A

Core layer
Set of DSLs (domain-specific languages)

21
Q

Parts of the core layer

A

Distributed engine
Set of core APIs

22
Q

What does the distributed engine do

A

Runs data processing workloads.
Responsible for map/reduce-style processing and fault tolerance.

23
Q

What is the set of core APIs

A

Language APIs that let us write logic in Scala, Java, Python, or R

24
Q

What is the set of DSLs

A

They fall into 4 groups
Spark SQL & DataFrames - lets us use SQL to query data
Streaming - lets us process continuous, unbounded data
MLlib - machine learning libraries
GraphX - graph processing libraries

25
Q

Advantages of spark ecosystem

A
  1. Abstraction - hides the fact that we are working on a cluster of computers
  2. Unified platform - combines the capabilities of all the DSLs and APIs
  3. Easy to use
26
Q

Types of Structured Data

A
  1. Time Series
  2. Cross Sectional
  3. Panel Data
27
Q

What is time series data?

A

Data collected for a single observational unit over a period of time.

28
Q

What is cross-sectional data?

A

Data collected for multiple observational units at a single point in time.

29
Q

What is panel data?

A

Data collected for multiple observational units over a period of time.
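
The three kinds of structured data can be seen in one small, made-up dataset. In this sketch the store names and sales figures are hypothetical; the full list is a panel, and filtering it by unit or by year gives a time series or a cross section.

```python
# Hypothetical panel: yearly sales for several stores, as (unit, year, value).
panel = [
    ("store_a", 2021, 100),
    ("store_a", 2022, 120),
    ("store_b", 2021, 80),
    ("store_b", 2022, 95),
]

# Time series: one observational unit followed over time.
time_series = [(year, sales) for unit, year, sales in panel if unit == "store_a"]

# Cross-sectional: many units observed at a single point in time.
cross_section = [(unit, sales) for unit, year, sales in panel if year == 2022]

print(time_series)    # [(2021, 100), (2022, 120)]
print(cross_section)  # [('store_a', 120), ('store_b', 95)]
```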