Module 2_ 1. Plotting for exploratory data analysis (EDA) Flashcards
What are Pair-plots?
-Humans can’t visualize data in more than 3 dimensions, so pair-plots are a workaround
-We draw a 2-D scatter plot for every possible pair of features
What are the advantages and limitations of Pair-plots?
Advantages:
-Helps in visualizing and analyzing data
-Gives a sense of how the data is distributed
-Can be used to decide which pair of features best separates the data points
Limitations:
-Impractical for high-dimensional data (100-D, 500-D, etc.), since the number of plots grows quadratically with the number of features
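To see why pair-plots stop scaling, here is a minimal sketch that enumerates the feature pairs and counts them. The feature abbreviations (SL, SW, PL, PW) are an assumption based on the Iris example used later in these cards; the helper names are hypothetical.

```python
from itertools import combinations


def feature_pairs(features):
    """Return every unordered pair of distinct features."""
    return list(combinations(features, 2))


def num_pair_plots(d):
    """Number of distinct feature pairs for d features: d * (d - 1) / 2."""
    return d * (d - 1) // 2


# Assumed Iris-style feature abbreviations, for illustration only.
iris_features = ["SL", "SW", "PL", "PW"]
pairs = feature_pairs(iris_features)
print(len(pairs))  # 6

# The count grows quadratically with the number of features.
for d in (4, 10, 100, 500):
    print(d, num_pair_plots(d))
```

With 4 features there are only 6 plots to look at; with 100 features there are 4950.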
What are histograms?
Histograms are a type of bar chart showing the frequency (number of observations) within different numerical ranges (bins).
Why do we need histograms?
-1-D scatter plots are hard to interpret because overlapping points are difficult to see.
-Histograms make the data more interpretable by showing how many points fall in each range.
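A minimal sketch of what a histogram counts, using made-up values and hypothetical bin edges. Repeated values such as 1.5 would sit on top of each other in a 1-D scatter plot, but they show up clearly in the bin counts.

```python
def histogram_counts(values, bin_edges):
    """Count values per bin [edge_i, edge_{i+1}); the last bin also
    includes its right edge."""
    counts = [0] * (len(bin_edges) - 1)
    last = len(counts) - 1
    for v in values:
        for i in range(len(counts)):
            lo, hi = bin_edges[i], bin_edges[i + 1]
            if lo <= v < hi or (i == last and v == hi):
                counts[i] += 1
                break
    return counts


# Made-up measurements with several overlapping points.
values = [1.0, 1.2, 1.2, 1.4, 1.5, 1.5, 1.5, 1.9, 2.0]
bin_edges = [1.0, 1.5, 2.0]  # two bins: [1.0, 1.5) and [1.5, 2.0]
counts = histogram_counts(values, bin_edges)
print(counts)  # [4, 5]
```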
What is PDF? What does it show?
-A smooth approximation of the histogram is called Probability Density Function (PDF)
-PDF shows the density of points and not the number of points
How do you obtain a PDF?
An estimate of the PDF is obtained by performing Kernel Density Estimation (KDE) on the data points; the result looks like a smoothed histogram.
Area under PDF = ?
1
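The two cards above can be checked numerically. This is a rough sketch of a Gaussian KDE, assuming a hand-picked bandwidth and made-up data; the Gaussian kernel and the bandwidth value are assumptions, not something the cards specify. A simple Riemann sum then approximates the area under the estimated PDF.

```python
import math


def gaussian_kde(points, bandwidth):
    """Return a function that estimates the density at x by averaging
    one Gaussian bump per data point."""
    n = len(points)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points
        )

    return density


# Made-up data and an assumed bandwidth, for illustration only.
points = [1.3, 1.4, 1.4, 1.5, 1.5, 1.6, 1.7]
pdf = gaussian_kde(points, bandwidth=0.1)

# Riemann sum over [0.5, 2.5], wide enough to cover nearly all the mass.
step = 0.001
area = sum(pdf(0.5 + i * step) * step for i in range(2000))
print(round(area, 3))  # approximately 1.0
```

The density is highest around 1.5, where the points cluster, and the total area stays close to 1 regardless of how many points are used.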
What is CDF? How is it calculated?
-CDF gives the fraction (or percentage) of points whose value is less than or equal to a particular value
-CDF can be calculated as the area under the PDF curve up to that value
-For a data sample, it can also be calculated with the formula below:
Number of points at or below the value / Total number of points
Eg. Suppose 41 setosa flowers have PL (petal length) < 1.6 and the total number of setosa flowers is 50. Using the above formula:
41/50 = 0.82
which means 82% of setosa flowers have PL < 1.6
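The counting formula can be sketched as below. The petal lengths are synthetic, built only so that 41 of 50 values fall below 1.6 as in the example; the strict `<` mirrors the "PL < 1.6" wording, whereas the textbook CDF uses `<=`.

```python
def empirical_cdf(values, x):
    """Fraction of values strictly below x."""
    return sum(1 for v in values if v < x) / len(values)


# Synthetic data: 41 values below 1.6 and 9 values above it.
petal_lengths = [1.4] * 41 + [1.7] * 9
fraction = empirical_cdf(petal_lengths, 1.6)
print(fraction)  # 0.82
```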
True or False.
Both PDF and CDF can be very handy in calculating threshold values for simple if-else models.
True
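As a rough illustration of the idea, the sketch below tries a few candidate thresholds on one feature and keeps the one where the rule "class A if x < t, else class B" classifies the most points correctly. The data, candidate thresholds, and function name are all hypothetical.

```python
def best_threshold(class_a, class_b, candidates):
    """Return (threshold, accuracy) for the rule:
    predict A if x < threshold, otherwise predict B."""
    total = len(class_a) + len(class_b)
    best_t, best_acc = None, -1.0
    for t in candidates:
        correct = sum(1 for x in class_a if x < t)
        correct += sum(1 for x in class_b if x >= t)
        acc = correct / total
        if acc > best_acc:  # ties keep the earlier candidate
            best_t, best_acc = t, acc
    return best_t, best_acc


# Made-up feature values for two classes; one point of B overlaps with A.
class_a = [1.3, 1.4, 1.5, 1.6, 1.9]
class_b = [3.0, 3.5, 4.0, 4.5, 1.8]
threshold, accuracy = best_threshold(class_a, class_b, [1.0, 2.0, 3.0])
print(threshold, accuracy)  # 2.0 0.9
```

The fraction of class A below a threshold is exactly what class A's CDF reads off at that threshold, which is why the CDF plot makes such thresholds easy to pick by eye.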
Mean(μ) =
(X1 + X2 + … + Xn)/n
Also written as below:
(1/n) Σ Xi
Variance(σ^2) =
(1/n) Σ (Xi -μ)^2
a.k.a. the average squared distance from the mean
Standard Deviation(σ) =
√ ((1/n) Σ (Xi -μ)^2)
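The three formulas above translate directly into code. This sketch uses the 1/n (population) form exactly as written on the cards, not the 1/(n-1) sample form; the data values are made up so the results come out as round numbers.

```python
import math


def mean(xs):
    """(1/n) * sum of the values."""
    return sum(xs) / len(xs)


def variance(xs):
    """(1/n) * sum of squared distances from the mean."""
    mu = mean(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)


def std_dev(xs):
    """Square root of the variance."""
    return math.sqrt(variance(xs))


data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values
print(mean(data), variance(data), std_dev(data))  # 5.0 4.0 2.0
```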
Median =
For an odd number of elements in the list:
-Sort
-Pick the middle value
For an even number of elements in the list:
-Sort
-Take the average of the middle two values
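The two cases above can be sketched as a single function (a minimal version; the input lists are made up):

```python
def median(xs):
    """Middle value of the sorted list, or the average of the two
    middle values when the list has an even number of elements."""
    if not xs:
        raise ValueError("median of an empty list is undefined")
    s = sorted(xs)
    mid = len(s) // 2
    if len(s) % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2


print(median([5, 1, 3]))     # 3
print(median([4, 1, 3, 2]))  # 2.5
```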
What are Percentiles and Quantiles?
Percentile
-A measure used in statistics indicating the value below which a given percentage of observations in a group of observations fall
Quantile
-Quantiles are cut points dividing the range of a probability distribution into continuous intervals with equal probabilities, or dividing the observations in a sample in the same way
Eg. 25th, 50th, 75th, 100th percentiles -> Quartiles (the quantiles that split the data into four equal parts)
Eg. Delivery times (in days) = {1, 1.5, 2, …} -> 10k data points
95th Percentile : 4 days (95% of deliveries take 4 days or less)
99th Percentile : 5.6 days (99% of deliveries take 5.6 days or less)
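One way to compute a percentile is the nearest-rank method, sketched below. This is only one of several conventions (libraries often interpolate between neighbouring values instead), and the data here is a made-up list of 1 to 100 rather than the delivery times from the card.

```python
import math


def percentile_nearest_rank(values, p):
    """Smallest value such that at least p percent of the data is
    less than or equal to it (nearest-rank convention)."""
    if not values:
        raise ValueError("need at least one value")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    s = sorted(values)
    rank = math.ceil(p * len(s) / 100)  # 1-based position in sorted data
    return s[rank - 1]


data = list(range(1, 101))  # made-up data: 1, 2, ..., 100
print(percentile_nearest_rank(data, 95))  # 95
print(percentile_nearest_rank(data, 99))  # 99
```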
What is IQR? Why is it useful?
-Interquartile Range (IQR) is defined as the difference between the 75th and 25th percentiles of the data
-IQR = 75th Percentile - 25th Percentile
-Gives the range in which the central 50% of the points lie
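A short sketch of the IQR using Python's standard `statistics.quantiles`. The `method="inclusive"` choice is an assumption; other quantile conventions can give slightly different quartiles on small samples. The data is made up.

```python
import statistics


def iqr(values):
    """75th percentile minus 25th percentile."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


data = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # made-up values
print(iqr(data))  # 4.0
```

Here the quartiles are 3 and 7, so the central half of the data lies in a range of width 4.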