Module 2_ 1. Plotting for exploratory data analysis (EDA) Flashcards

1
Q

What are Pair-plots?

A

-Humans can’t visualize data in more than 3 dimensions, so pair plots are a simple workaround
-We draw a 2-D scatter plot for every possible pair of features and arrange the plots in a grid

2
Q

What are the advantages and limitations of Pair-plots?

A

Advantages:
-Helps in visualizing and analyzing data
-Gives a sense of how the data is distributed
-Can be used to decide which pair of features best separate the data points

Limitations:
-Impractical for high-dimensional data (100-D, 500-D, etc.), because the number of plots grows as d(d-1)/2 for d features
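
To make the limitation concrete, the number of distinct feature pairs for d features is d(d-1)/2. The sketch below counts them with the Python standard library; the helper name `count_pair_plots` and the feature names are illustrative assumptions, not part of the original cards.

```python
from itertools import combinations
from math import comb

def count_pair_plots(num_features):
    """Number of distinct feature pairs, i.e. d * (d - 1) / 2."""
    return comb(num_features, 2)

# Hypothetical feature names for a 4-feature dataset such as iris.
features = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
pairs = list(combinations(features, 2))

print(len(pairs))             # 6
print(count_pair_plots(100))  # 4950
print(count_pair_plots(500))  # 124750
```

Six plots are easy to scan by eye; several thousand are not.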

3
Q

What are histograms?

A

A histogram is a type of bar chart that shows the frequency (number of observations) falling within each of a set of numerical ranges (bins).
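
As a rough illustration of the counting behind a histogram, the sketch below tallies values into bins using plain Python. The function name `histogram_counts`, the bin edges, and the petal-length values are made up for this example.

```python
def histogram_counts(values, bin_edges):
    """Count how many values fall into each bin [edge_i, edge_{i+1}).

    The last bin also includes its right edge.
    """
    counts = [0] * (len(bin_edges) - 1)
    last = len(counts) - 1
    for v in values:
        for i in range(len(counts)):
            low, high = bin_edges[i], bin_edges[i + 1]
            if low <= v < high or (i == last and v == high):
                counts[i] += 1
                break
    return counts

# Hypothetical petal lengths (cm), made up for illustration.
petal_lengths = [1.2, 1.4, 1.4, 1.5, 1.7, 4.1, 4.5, 4.7, 5.1, 6.0]
counts = histogram_counts(petal_lengths, [1, 2, 3, 4, 5, 6])
print(counts)  # [5, 0, 0, 3, 2]
```

Each count is the height of one bar.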

4
Q

Why do we need histograms?

A

-1-D scatter plots are hard to interpret because overlapping points hide how many observations share a value.
-Histograms make this visible by showing the count of points in each range.

5
Q

What is PDF? What does it show?

A

-A Probability Density Function (PDF) can be seen as a smoothed version of the normalized histogram
-PDF shows the density of points and not the number of points

6
Q

How do you obtain a PDF?

A

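An estimate of the PDF is obtained by performing Kernel Density Estimation (KDE) on the data points: a kernel (e.g. a Gaussian) is centered on each point, and the kernels are summed and normalized so that the total area is 1.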
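
A minimal sketch of Gaussian KDE in plain Python, assuming a fixed bandwidth chosen by hand. The function name `gaussian_kde`, the data values, and the bandwidth 0.2 are illustrative assumptions; plotting libraries normally pick the bandwidth automatically.

```python
import math

def gaussian_kde(data, bandwidth):
    """Return a function estimating the density at x.

    Each data point contributes a Gaussian bump of width `bandwidth`;
    the bumps are averaged so the total area stays 1.
    """
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data
        )

    return density

# Hypothetical petal lengths (cm); the bandwidth 0.2 is an arbitrary choice.
pdf = gaussian_kde([1.3, 1.4, 1.4, 1.5, 1.6, 1.7], bandwidth=0.2)

# Approximate the area under the estimated PDF with a Riemann sum on [0, 3].
step = 0.001
area = sum(pdf(i * step) * step for i in range(0, 3001))
print(round(area, 3))  # close to 1
```

The area check at the end reflects the fact that a PDF integrates to 1.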

7
Q

Area under PDF = ?

A

1

8
Q

What is CDF? How is it calculated?

A

-CDF at a value x gives the fraction (or percentage) of points that are less than or equal to x
-CDF can be calculated as the area under the PDF curve up to x
-For a sample, it can also be calculated with the formula below:
(Number of points at or below the value) / (Total number of points)
Eg. Suppose 41 setosa flowers have petal length (PL) < 1.6 and the total number of setosa flowers is 50. Using the formula above:
41/50 = 0.82
This means 82% of setosa flowers have PL < 1.6
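
The counting formula can be sketched in a few lines of Python. The data below is hypothetical, constructed only so that 41 of 50 values fall below 1.6 as in the example; the function name `empirical_cdf` is likewise an assumption, and it uses the usual "less than or equal to" convention.

```python
def empirical_cdf(values, threshold):
    """Fraction of values that are less than or equal to `threshold`."""
    at_or_below = sum(1 for v in values if v <= threshold)
    return at_or_below / len(values)

# Hypothetical data built to match the card: 41 of 50 values are below 1.6.
petal_lengths = [1.4] * 41 + [1.7] * 9
fraction = empirical_cdf(petal_lengths, 1.6)
print(fraction)  # 0.82
```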

9
Q

True or False.
Both PDF and CDF can be very handy in calculating threshold values for simple if-else models.

A

True

10
Q

Mean(μ) =

A

(X1 + X2 + … + Xn) / n
Also written as:
(1/n) Σ Xi

11
Q

Variance(σ^2) =

A

(1/n) Σ (Xi -μ)^2
i.e. the average squared distance of the points from the mean

12
Q

Standard Deviation(σ) =

A

√ ((1/n) Σ (Xi -μ)^2)
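
The three formulas (mean, variance, standard deviation) can be checked with a short Python sketch. It divides by n, matching the cards, rather than by n - 1 as the sample variance would. The function names and data values are made up for illustration.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    """Population variance: average squared distance from the mean."""
    mu = mean(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)

def std_dev(xs):
    return math.sqrt(variance(xs))

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values
print(mean(data))      # 5.0
print(variance(data))  # 4.0
print(std_dev(data))   # 2.0
```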

13
Q

Median =

A

-Sort the list
-If the number of elements is odd, pick the middle value
-If the number of elements is even, take the average of the two middle values
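
The odd and even cases can be sketched as follows; the function name `median` and the input lists are made up for illustration.

```python
def median(values):
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value.
        return ordered[mid]
    # Even count: average of the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([9, 1, 4, 2, 6]))  # 4
print(median([9, 1, 4, 2]))     # 3.0
```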

14
Q

What are Percentiles and Quantiles?

A

Percentile
-The value below which a given percentage of the observations fall
Quantile
-Quantiles are cut points that divide the range of a probability distribution into continuous intervals with equal probabilities, or divide the observations in a sample in the same way
Eg. 25th, 50th, 75th, 100th percentiles --> quartiles (the quantiles that split the data into 4 equal parts)
Eg. Delivery times = {1,1.5,2,…….} --> 10k data points
95th Percentile : 4 days (95% of deliveries took less than 4 days)
99th Percentile : 5.6 days (99% of deliveries took less than 5.6 days)
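
Percentiles can be computed in several slightly different ways. The sketch below uses the nearest-rank method, which is one common convention and not necessarily the one used by any particular library. The function name `percentile_nearest_rank` and the ten delivery times are hypothetical, not the 10k data points from the example.

```python
import math

def percentile_nearest_rank(values, p):
    """Smallest value such that at least p percent of the data is <= it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical delivery times in days, made up for illustration.
delivery_days = [1, 1.5, 2, 2, 2.5, 3, 3, 3.5, 4, 5.6]
print(percentile_nearest_rank(delivery_days, 50))   # 2.5
print(percentile_nearest_rank(delivery_days, 90))   # 4
print(percentile_nearest_rank(delivery_days, 100))  # 5.6
```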

15
Q

What is IQR? Why is it useful?

A

-Interquartile Range (IQR) is defined as the difference between the 75th and 25th percentiles of the data
-IQR = 75th Percentile - 25th Percentile
-Gives the range in which the central 50% of the points lie, so it measures spread without being affected by extreme values
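
A small sketch of the IQR calculation using `statistics.quantiles` from the Python standard library (Python 3.8+). The `method="inclusive"` setting is one of several quartile conventions, chosen here as an assumption; the data values are made up.

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # made-up values

# n=4 asks for the three quartile cut points (25th, 50th, 75th percentiles).
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
print(q1, q3, iqr)  # 3.0 7.0 4.0
```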

16
Q

What is Median Absolute Deviation(MAD)?

A

-Median Absolute Deviation (MAD) is defined as the median of the absolute deviations from the data’s median
-MAD = median(|Xi - median(X)|)
-Similar to std-dev, but it uses the median instead of the mean, so it is more robust to outliers
Eg.
-Data: (1,1,2,2,4,6,9)
-median = 2 --> absolute deviations: (1,1,0,0,2,4,7)
-Sorted: (0,0,1,1,2,4,7)
-MAD = 1
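
The worked example can be reproduced with a short Python sketch built on `statistics.median`; the function name `median_absolute_deviation` is an illustrative choice.

```python
from statistics import median

def median_absolute_deviation(values):
    center = median(values)
    # Absolute distance of every point from the median.
    deviations = [abs(x - center) for x in values]
    return median(deviations)

data = [1, 1, 2, 2, 4, 6, 9]  # the example from the card
print(median(data))                     # 2
print(median_absolute_deviation(data))  # 1
```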

17
Q

What are Box plots?

A

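-A box plot summarizes the distribution of a feature for each group
-Eg. for the iris data: X-axis --> type of flower, Y-axis --> petal length
-The box marks the 25th, 50th (median) and 75th percentiles
-Whiskers extend from the box to show the spread of the remaining points (the exact whisker rule varies by plotting library)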

18
Q

What are Violin plots?

A

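-Violin plot = box plot + PDF
-The box plot part in the middle shows the whiskers and the 25th, 50th and 75th percentiles
-The PDF (density estimate) is drawn on both sides, which gives the plot its violin shape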