Module 2_ 3. Probability and Statistics Flashcards
E.g. rolling a fair die
Is it an example of a continuous random variable or a discrete random variable?
discrete random variable
E.g. measuring the height of a randomly picked student
Is it an example of a continuous random variable or a discrete random variable?
continuous random variable
What is the difference between population and sample? Explain with example.
Suppose we need to calculate the average height of people in the world.
With the population, we use the heights of all 7 billion people and calculate the population mean:
μ = (1/7B) Σ hi (sum over all 7 billion heights)
With a sample, we use only a random subset of those heights (say 1000 of them) and calculate the sample mean:
x̄ = (1/1000) Σ hi (sum over the 1000 sampled heights)
As the sample size increases, x̄ gets closer to μ:
x̄ ≈ μ
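A minimal sketch of the population vs sample idea, assuming a synthetic population (the 200,000 heights below are generated, not real data, and the 165cm / 10cm parameters are made up for illustration). It compares the sample mean with the population mean for a few sample sizes.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population: 200,000 synthetic heights in cm (not real data).
population = [random.gauss(165.0, 10.0) for _ in range(200_000)]
mu = statistics.fmean(population)  # population mean


def sample_mean(n):
    """Mean of a simple random sample of size n, drawn without replacement."""
    return statistics.fmean(random.sample(population, n))


# Absolute gap between the sample mean and the population mean.
errors = {n: abs(sample_mean(n) - mu) for n in (10, 1_000, 100_000)}
for n, err in errors.items():
    print(f"n={n:>6}: |x_bar - mu| = {err:.4f}")
```

The gap is random for any single run, but it tends to shrink as n grows.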
Does the Gaussian distribution occur in the real world? If yes, give 2 examples.
YES
- Sepal length (SL) and petal length (PL) of iris flowers (approximately).
- Heights and weights of people in the real world (approximately).
If X follows a Gaussian distribution with mean μ and variance σ^2, write this in mathematical form.
X ~ N(μ, σ^2)
What is the 68-95-99.7 rule?
- In range [μ - σ, μ + σ], about 68.3% of points lie
- In range [μ - 2σ, μ + 2σ], about 95.4% of points lie
- In range [μ - 3σ, μ + 3σ], about 99.7% of points lie
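The three percentages can be checked from the Gaussian CDF. A small sketch using Python's standard-library statistics.NormalDist; the helper name coverage is just for illustration.

```python
from statistics import NormalDist


def coverage(k, mu=0.0, sigma=1.0):
    """P(mu - k*sigma <= X <= mu + k*sigma) for X ~ N(mu, sigma^2)."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)


for k in (1, 2, 3):
    print(f"within {k} sigma: {coverage(k):.4f}")
# within 1 sigma: 0.6827
# within 2 sigma: 0.9545
# within 3 sigma: 0.9973
```

The result does not depend on μ or σ, e.g. coverage(1, mu=170, sigma=8) gives the same value as coverage(1).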
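What is the mathematical formulation of the Gaussian distribution?
Its probability density function (a density p(x), since P(X=x) = 0 for a continuous variable) is
p(x) = (1/(σ√(2π))) exp{-(x-μ)^2/(2σ^2)}
Simplifying the above equation,
Let μ=0, σ^2=1
p(x) = (1/√(2π)) exp{(-1/2)x^2}
After dropping the constant factor,
y ∝ exp{-x^2/2}
As x moves away from μ, y falls off as the exponential of the negative squared distance, i.e. faster than a plain exponential decay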
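A short sketch that evaluates the density formula directly and compares it with the standard library's NormalDist.pdf. The function name gaussian_pdf is an illustrative choice.

```python
import math
from statistics import NormalDist


def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) at x, written out from the formula."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))


# The density drops quickly as x moves away from the mean (here mu = 0).
for x in (0.0, 1.0, 2.0, 3.0):
    print(f"x={x}: pdf={gaussian_pdf(x):.6f}")

# Same value from the standard library, for a non-standard mean and variance.
print(gaussian_pdf(1.5, mu=2.0, sigma=3.0), NormalDist(2.0, 3.0).pdf(1.5))
```

The same function also shows the symmetry of the curve: gaussian_pdf(mu + d) equals gaussian_pdf(mu - d) for any d.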
True or False.
The PDF of a Gaussian distribution is symmetric about its mean.
True
What is Kurtosis? What is the formula for Excess kurtosis?
Kurtosis: a measure of tailedness, not peakedness
Excess kurtosis = kurtosis - 3 (a Gaussian has kurtosis 3, so its excess kurtosis is 0)
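A minimal sketch of excess kurtosis computed from data, assuming the plain moment definition (fourth central moment divided by the squared variance, minus 3) with no small-sample correction. The synthetic samples are for illustration only.

```python
import random
import statistics


def excess_kurtosis(data):
    """Moment-based excess kurtosis: m4 / m2^2 - 3 (no bias correction)."""
    mean = statistics.fmean(data)
    m2 = statistics.fmean([(x - mean) ** 2 for x in data])  # variance
    m4 = statistics.fmean([(x - mean) ** 4 for x in data])  # 4th central moment
    return m4 / m2 ** 2 - 3.0


random.seed(0)
gaussian_sample = [random.gauss(0.0, 1.0) for _ in range(100_000)]
uniform_sample = [random.random() for _ in range(100_000)]

k_gauss = excess_kurtosis(gaussian_sample)   # close to 0
k_uniform = excess_kurtosis(uniform_sample)  # close to -1.2 (lighter tails)
print(f"gaussian: {k_gauss:.3f}, uniform: {k_uniform:.3f}")
```

A negative value means lighter tails than a Gaussian, a positive value means heavier tails.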
What is a standard normal variate (Z)?
Z ~ N(0,1), i.e. a Gaussian random variable with μ=0 and σ^2=1
What is standardization?
Standardization converts a Gaussian-distributed variable with finite mean μ and variance σ^2 into a standard normal variate.
x’i = (xi - μ)/σ --> x’i ~ N(0,1)
Now we can say about 68.3% of the converted points lie between -1 and 1
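A small sketch of standardization on synthetic data (the 170cm mean and 8cm standard deviation are made-up values). Here μ and σ are estimated from the data itself, which is an assumption about how they are obtained.

```python
import random
import statistics

random.seed(1)

# Hypothetical Gaussian data: 20,000 synthetic heights in cm.
xs = [random.gauss(170.0, 8.0) for _ in range(20_000)]

mu = statistics.fmean(xs)      # estimated mean
sigma = statistics.pstdev(xs)  # estimated standard deviation

# Standardize: subtract the mean, divide by the standard deviation.
zs = [(x - mu) / sigma for x in xs]

share_within_1 = sum(-1.0 <= z <= 1.0 for z in zs) / len(zs)
print(f"mean={statistics.fmean(zs):.4f}, std={statistics.pstdev(zs):.4f}")
print(f"share in [-1, 1] = {share_within_1:.3f}")
```

The standardized values have mean 0 and standard deviation 1 by construction, and the share in [-1, 1] should land near 0.683.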
What is Kernel Density Estimation(KDE)?
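KDE estimates a PDF from a sample. A Gaussian kernel is centered on each data point (the point is the kernel's mean), and the kernels are added up and normalized to give a smooth curve. Regions with a higher density of points therefore get a greater height in the estimated PDF.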
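A minimal sketch of a Gaussian KDE written by hand. The kernel width (bandwidth) is a tuning parameter that the card does not mention, so the value 0.5 below is an assumption, and the data points are made up.

```python
import math


def gaussian_kde(points, bandwidth):
    """Return density(x): the average of Gaussian kernels, one per data point."""
    n = len(points)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Each data point p contributes a bump centered at p.
        return norm * sum(
            math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points
        )

    return density


points = [0.9, 1.0, 1.1, 1.2, 5.0]  # made-up sample: a cluster near 1, one point at 5
kde = gaussian_kde(points, bandwidth=0.5)

for x in (1.0, 3.0, 5.0):
    print(f"x={x}: estimated density={kde(x):.4f}")

# Rough numerical check that the estimate integrates to about 1.
step = 0.01
area = sum(kde(-5.0 + i * step) for i in range(1601)) * step
print(f"area under the curve = {area:.4f}")
```

The estimate is highest near 1, where four of the five points sit, lower at 5, and close to zero at 3.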
What is sampling distribution?
Let's say we take m random samples, each of size n.
Let's say n=30
S1, S2, S3, …, Sm (m samples)
x̄1, x̄2, x̄3, …, x̄m are the means of the m samples
Then the x̄i follow a distribution called the “sampling distribution of sample means”
What is Central Limit Theorem(CLT)?
If X has finite mean μ and finite variance σ^2,
--> S1, S2, S3, …, Sm (m samples of size n)
--> x̄1, x̄2, x̄3, …, x̄m (sample means)
--> x̄i ~ N(μ, (σ^2)/n) approximately, with the approximation improving as n->∞
here σ^2 is the variance of the original data
CLT is powerful because it works on data from any kind of distribution that has a finite mean μ and finite variance σ^2
Note: CLT doesn’t apply to a Pareto distribution with shape parameter α <= 2, since its variance is infinite (for α <= 1 its mean is infinite too)
As a rule of thumb, when n >= 30 the sampling distribution of sample means is usually already close to a Gaussian distribution
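A simulation sketch of the CLT, assuming an exponential distribution with rate 1 as the original data (a skewed, non-Gaussian choice made for illustration, with μ = 1 and σ^2 = 1). It draws m samples of size n = 30 and checks the mean and variance of the sample means against μ and σ^2/n.

```python
import random
import statistics

random.seed(42)

n, m = 30, 5_000  # sample size, number of samples

# Each entry is the mean of one sample of size n from Exponential(rate=1).
sample_means = [
    statistics.fmean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(m)
]

mean_of_means = statistics.fmean(sample_means)    # expected near mu = 1
var_of_means = statistics.pvariance(sample_means)  # expected near sigma^2 / n

print(f"mean of sample means     = {mean_of_means:.4f}")
print(f"variance of sample means = {var_of_means:.4f} (sigma^2/n = {1 / n:.4f})")
```

Plotting a histogram of sample_means would show a roughly bell-shaped curve even though the original data is strongly skewed.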
What are Q-Q plots? How to plot them?
Q-Q plots stand for Quantile-Quantile plots.
They can be used for comparing two distributions (X and Y) and finding out whether they have the same distribution.
E.g. given X: x1, x2, …, x500
Is X gaussian distributed?
Steps:
1. Sort the xi’s and compute percentiles
--> x1, x2, …, x500
--> Sort in ascending order
--> Calculate percentiles
--> x(1), x(2), …, x(100)
2. Generate Y ~ N(0,1) and compute its percentiles
--> y1, y2, …, y1000
--> Sort in ascending order
--> Calculate percentiles
--> y(1), y(2), …, y(100)
3. Plot the Q-Q plot using the pairs (y(i), x(i)): y(1), y(2), …, y(100) on one axis and x(1), x(2), …, x(100) on the other
If the points lie approximately on a straight line, then X and Y come from the same family of distributions (here, X is Gaussian).
But the line need not be y = x, so we can’t conclude that X also has μ=0 and σ^2=1
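The Q-Q steps can be sketched without a plotting library by computing the percentile pairs and checking how close they are to a straight line. The data for X is synthetic (drawn from N(50, 5^2) purely for illustration), and the helper fit_line is a hypothetical name.

```python
import random
import statistics

random.seed(7)

# X: 500 hypothetical observations; Y: 1000 draws from N(0, 1).
xs = [random.gauss(50.0, 5.0) for _ in range(500)]
ys = [random.gauss(0.0, 1.0) for _ in range(1000)]

# quantiles(..., n=100) sorts the data and returns the 99 percentile cut points.
x_q = statistics.quantiles(xs, n=100)
y_q = statistics.quantiles(ys, n=100)


def fit_line(a, b):
    """Least-squares slope and intercept of b on a, plus Pearson correlation."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    slope = sab / saa
    return slope, mb - slope * ma, sab / (saa * sbb) ** 0.5


# Y percentiles on the horizontal axis, X percentiles on the vertical axis.
slope, intercept, r = fit_line(y_q, x_q)
print(f"r = {r:.4f}, slope = {slope:.2f}, intercept = {intercept:.2f}")
```

A correlation close to 1 means the points are nearly on a straight line. The slope and intercept come out near 5 and 50 rather than 1 and 0, which is why a straight Q-Q line says X is Gaussian but says nothing about X having μ=0 and σ^2=1.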
Task: Order t-shirts for all employees (100k)
a. How many XL t-shirts should you order?
Domain knowledge:
height >= 180cm for XL t-shirt
height in [160cm, 180cm) for L t-shirt
Collect the heights of 500 random employees.
Assume heights ~ N(μ, σ^2), with μ and σ estimated from the sample.
Plot the CDF.
Suppose the CDF gives P(h >= 180cm) = 1 - CDF(180cm) = 1%
So we order 1000 XL t-shirts, i.e. 1% of 100k
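A sketch of the t-shirt calculation, assuming the 500 measured heights are replaced by synthetic values (the 165cm mean and 6.5cm standard deviation used to generate them are invented for illustration). A Gaussian is fitted to the sample and its CDF gives the share of employees per size.

```python
import random
from statistics import NormalDist

random.seed(3)

# Stand-in for the 500 measured heights in cm (synthetic, not real data).
heights = [random.gauss(165.0, 6.5) for _ in range(500)]

# Fit N(mu, sigma^2) using the sample mean and sample standard deviation.
fitted = NormalDist.from_samples(heights)

p_xl = 1.0 - fitted.cdf(180.0)               # P(h >= 180cm)
p_l = fitted.cdf(180.0) - fitted.cdf(160.0)  # P(160cm <= h < 180cm)

n_employees = 100_000
xl_order = round(n_employees * p_xl)
l_order = round(n_employees * p_l)

print(f"P(h >= 180cm) = {p_xl:.4f} -> order {xl_order} XL t-shirts")
print(f"P(160cm <= h < 180cm) = {p_l:.4f} -> order {l_order} L t-shirts")
```

With the card's figure of P(h >= 180cm) = 1%, the same arithmetic gives round(100_000 * 0.01) = 1000 XL t-shirts.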