Module 2_ 3. Probability and Statistics Flashcards
Eg. Rolling a fair die
Is it an example of continuous random variable or discrete random variable?
discrete random variable
Eg. Measuring height of a randomly picked student
Is it an example of continuous random variable or discrete random variable?
continuous random variable
What is the difference between population and sample? Explain with example.
Suppose we need to calculate the average height of people in the world.
If we go by population we will consider all the heights of 7 billion people and calculate the mean using the below formula:
μ = (1/7B) Σ hi
If we go by sample, we will consider a subset of the heights of 7 billion people (like take only 1000 heights) and calculate the mean using the below formula:
x̄ = (1/1000) Σ hi
As the sample size increases, the sample mean approaches the population mean:
x̄ ≈ μ
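A minimal simulation of the idea above. The population here is synthetic (100,000 heights drawn from a Gaussian with mean 168 cm and standard deviation 5 cm); these numbers are assumptions for illustration, not real data.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population: 100,000 synthetic heights in cm.
population = [random.gauss(168, 5) for _ in range(100_000)]
mu = statistics.fmean(population)  # population mean


def sample_mean(n):
    """Mean of a random sample of size n drawn without replacement."""
    return statistics.fmean(random.sample(population, n))


# The gap between x_bar and mu tends to shrink as n grows.
for n in (10, 1_000, 50_000):
    x_bar = sample_mean(n)
    print(f"n={n:>6}  x_bar={x_bar:.3f}  |x_bar - mu|={abs(x_bar - mu):.3f}")
```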
Does Gaussian distribution occur in real world? If yes give 2 examples.
YES
- Sepal length (SL) and petal length (PL) of iris flowers.
- Heights and weights of people in real world.
If X follows Gaussian distribution and has mean(μ) and variance(σ^2), then write it in mathematical form.
X ~ N(μ, σ^2)
What is the 68-95-99.7 rule?
- In range [μ - σ, μ + σ], about 68.3% of points lie
- In range [μ - 2σ, μ + 2σ], about 95.4% of points lie
- In range [μ - 3σ, μ + 3σ], about 99.7% of points lie
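The percentages can be checked with the error function, using the identity P(|X - μ| <= kσ) = erf(k/√2) for a Gaussian X. A short sketch; the helper name `prob_within` is made up for this card.

```python
import math


def prob_within(k):
    """P(mu - k*sigma <= X <= mu + k*sigma) for a Gaussian X."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} sigma: {prob_within(k):.4f}")
```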
What is the mathematical formulation of Gaussian distribution?
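f(x) = (1/(σ√(2π))) exp{-(x-μ)^2/(2σ^2)}
Here f(x) is the probability density at x; for a continuous random variable, P(X=x) itself is 0.
Simplifying the above equation,
Let μ=0, σ^2=1
f(x) = (1/√(2π)) exp{(-1/2)x^2}
After dropping the constant factor,
y ∝ exp{-x^2/2}
As x moves away from μ, y falls off faster than exponentially (as the exponential of -x^2/2).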
True or False.
PDF of Gaussian distribution is symmetric.
True
What is Kurtosis? What is the formula for Excess kurtosis?
Kurtosis - Measure of tailedness and not peakedness
Excess kurtosis = Kurtosis - 3 (a Gaussian has kurtosis 3, so its excess kurtosis is 0)
What is standard normal variate(Z)?
Z ~ N(0,1) where μ=0 and σ^2=1
What is standardization?
Standardization is converting any Gaussian distribution with a finite mean(μ) and variance(σ^2) to a standard normal variate.
x’i = (xi - μ)/σ --> x’i ~ N(0,1)
Now we can say that about 68.3% of these converted points lie between -1 and 1
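A small sketch of standardization on synthetic data. The heights are generated from N(168, 5^2), an assumption made only for the example; μ and σ are estimated from the data itself.

```python
import random
import statistics

random.seed(1)
xs = [random.gauss(168, 5) for _ in range(10_000)]  # synthetic heights in cm

mu = statistics.fmean(xs)
sigma = statistics.pstdev(xs)  # population-style standard deviation

# Standardize: subtract the mean, divide by the standard deviation.
zs = [(x - mu) / sigma for x in xs]

# Fraction of standardized points that fall in [-1, 1].
frac_within_1 = sum(-1 <= z <= 1 for z in zs) / len(zs)
print(f"mean={statistics.fmean(zs):.3f}  std={statistics.pstdev(zs):.3f}  "
      f"within [-1, 1]: {frac_within_1:.3f}")
```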
What is Kernel Density Estimation(KDE)?
A Gaussian kernel is centered on each point in the sample (the point being the kernel's mean), and the kernels are added up and normalized to give the estimated PDF. Regions with a higher density of points therefore get a taller PDF.
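A hand-rolled sketch of Gaussian KDE. The sample and the bandwidth of 0.3 are made up, and the bandwidth is picked by hand; real libraries usually choose it with a rule of thumb.

```python
import math


def gaussian_kde(points, bandwidth):
    """Return a function that estimates the PDF from `points`."""
    n = len(points)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def pdf(x):
        # Sum of one Gaussian bump per data point, each centered on that point.
        return norm * sum(
            math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points
        )

    return pdf


sample = [1.0, 1.2, 1.1, 0.9, 5.0]  # made-up data: dense near 1, sparse near 5
pdf = gaussian_kde(sample, bandwidth=0.3)

for x in (1.0, 3.0, 5.0):
    print(f"pdf({x}) = {pdf(x):.4f}")
```

The estimate is tallest near x = 1, where four of the five points sit, and close to zero near x = 3, where there are none.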
What is sampling distribution?
Let's say we take m random samples, each of size n (say n=30).
S1, S2, S3, …, Sm (m samples)
x̄1, x̄2, x̄3, …, x̄m are the means of the m samples
Then the x̄i follow a distribution called the “sampling distribution of sample means”.
What is Central Limit Theorem(CLT)?
If X has finite mean(μ) and variance(σ^2),
--> S1, S2, S3, …, Sm (m samples of size n)
--> x̄1, x̄2, x̄3, …, x̄m (sample means)
--> x̄i ~ N(μ, σ^2/n) approximately, as n->∞
here σ^2 is the variance of the original data
CLT is powerful because it works for data from any distribution that has a finite mean(μ) and variance(σ^2).
Note: CLT does not apply to a Pareto distribution with tail index α <= 2, since its variance is infinite (its mean is also infinite when α <= 1).
As a rule of thumb, once n >= 30 the sampling distribution of sample means is usually close to Gaussian.
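A quick simulation sketch of CLT. The source distribution is Exponential(1), chosen here as an assumption because it is strongly skewed yet has mean 1 and variance 1, so CLT predicts sample means with mean about 1 and variance about 1/n.

```python
import random
import statistics

random.seed(2)

n, m = 30, 5_000  # sample size and number of samples

# Draw m samples of size n and keep each sample's mean.
sample_means = [
    statistics.fmean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(m)
]

mean_of_means = statistics.fmean(sample_means)
var_of_means = statistics.pvariance(sample_means)
print(f"mean of sample means     = {mean_of_means:.4f} (CLT predicts 1)")
print(f"variance of sample means = {var_of_means:.4f} (CLT predicts {1 / n:.4f})")
```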
What are Q-Q plots? How to plot them?
Q-Q plots stand for Quantile-Quantile plots.
They can be used for comparing two distributions (X and Y) and finding out whether they have the same distribution.
Eg. Given X: x1, x2, …, x500
Is X gaussian distributed?
Steps:
1. Sort the xi's and compute percentiles
--> x1, x2, …, x500
--> Sort in ascending order
--> Calculate percentiles
--> x(1), x(2), …, x(100)
2. Generate Y ~ N(0,1)
--> y1, y2, …, y1000
--> Sort in ascending order
--> Calculate percentiles
--> y(1), y(2), …, y(100)
3. Plot the Q-Q plot using x(1), x(2), …, x(100) on one axis and y(1), y(2), …, y(100) on the other
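If all points lie approximately on a straight line, then X and Y come from the same family of distributions, so here X is Gaussian.
But we can’t conclude that X also has μ=0 and σ^2=1: the position and slope of the line depend on X's mean and standard deviation.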
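A sketch of the Q-Q computation without the plotting step. Two assumptions differ from the steps above: X is synthetic, and the percentiles of N(0,1) are taken exactly from `statistics.NormalDist` instead of from 1000 sampled yi's. The correlation between the two sets of percentiles stands in for "points lie on a straight line".

```python
import random
import statistics

random.seed(3)
xs = [random.gauss(50, 10) for _ in range(500)]  # synthetic X, Gaussian by construction

# Percentiles 1..99 of the sample (quantiles with n=100 returns 99 cut points).
x_q = statistics.quantiles(xs, n=100)

# Matching theoretical percentiles of N(0, 1).
std_normal = statistics.NormalDist(0, 1)
y_q = [std_normal.inv_cdf(p / 100) for p in range(1, 100)]


def pearson(a, b):
    """Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    sa = sum((u - ma) ** 2 for u in a) ** 0.5
    sb = sum((v - mb) ** 2 for v in b) ** 0.5
    return cov / (sa * sb)


# A value close to 1 means the Q-Q points are close to a straight line.
r = pearson(y_q, x_q)
print(f"correlation between percentiles: {r:.4f}")
```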
Task: Order t-shirts for all employees (100k)
a. How many XL t-shirts should you order?
Domain knowledge :
height >= 180cm for XL t-shirt
height [160cm,180cm] for L t-shirt
Collect heights of 500 random employees.
heights ~ N(μ, σ^2)
Plot the CDF of heights.
Suppose from the CDF we observed P(h >= 180cm) = 1 - CDF(180cm) = 1%
So now we will order 1000 XL t-shirts, i.e. 1% of 100K
Task: Salaries
If X ~ N(μ, σ^2),
a. Calculate how many employees make a salary >= $100K?
b. Calculate how many employees have salary between [$50K, $70K] ?
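a. Plot the CDF, read off P(X >= $100K) = 1 - CDF($100K), and multiply by the number of employees.
b. Plot the CDF, compute P($50K <= X <= $70K) = CDF($70K) - CDF($50K), and multiply by the number of employees.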
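The two CDF lookups can be sketched with `statistics.NormalDist`. The mean ($60K), standard deviation ($20K) and headcount (100,000) are hypothetical values picked for the example; the card does not give them.

```python
from statistics import NormalDist

# Hypothetical parameters, not given in the card.
salary = NormalDist(mu=60_000, sigma=20_000)
n_employees = 100_000  # assumed headcount

# a. P(X >= 100K) = 1 - CDF(100K)
p_ge_100k = 1 - salary.cdf(100_000)

# b. P(50K <= X <= 70K) = CDF(70K) - CDF(50K)
p_50k_70k = salary.cdf(70_000) - salary.cdf(50_000)

print(f"a. {p_ge_100k:.4f} of employees, about {round(p_ge_100k * n_employees)}")
print(f"b. {p_50k_70k:.4f} of employees, about {round(p_50k_70k * n_employees)}")
```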
Suppose we don't know the distribution, but we know that μ is finite and the variance is non-zero and finite.
Task: Salaries
μ=$40K and σ=$10K
a. What % of individuals have salary in range of [$20K, $60K] ?
Chebyshev’s Inequality Formula:
P(|X - μ| >= kσ) <= 1/k^2 --> P(μ - kσ < X < μ + kσ) >= 1 - 1/k^2
20K = μ - 2σ
40K = μ
60K = μ + 2σ
So k = 2:
P($20K < X < $60K) >= 1 - (1/2^2)
P($20K < X < $60K) >= 0.75
At least 75% of individuals have a salary in the range [$20K, $60K]
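The same bound as a small function. `chebyshev_lower_bound` is a name made up for this sketch, and it only handles intervals that are symmetric about μ.

```python
def chebyshev_lower_bound(mu, sigma, low, high):
    """Lower bound on P(low < X < high) for an interval symmetric about mu."""
    k_low = (mu - low) / sigma
    k_high = (high - mu) / sigma
    if k_low <= 0 or abs(k_low - k_high) > 1e-9:
        raise ValueError("interval must be symmetric about mu")
    # For k <= 1 the bound is not informative, so clamp it at 0.
    return max(0.0, 1 - 1 / k_high**2)


bound = chebyshev_lower_bound(mu=40_000, sigma=10_000, low=20_000, high=60_000)
print(bound)  # k = 2, so the bound is 1 - 1/4
```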
Explain Bernoulli Distribution.
Bernoulli Distribution:
Eg. X --> r.v. for getting heads in a coin toss
- Discrete distribution which has 2 outcomes
- Probabilities --> P for one outcome (e.g. heads) and (1 - P) for the other
Explain Binomial Distribution.
Binomial Distribution:
Eg. X --> Coin tossed n times (n=10)
- Y --> No. of times of getting heads
- Y ∈ {0, 1, 2, …, 10}
Y ~ Binomial(n, P) --> n = no. of trials & P = probability of getting heads
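A short sketch of the binomial PMF, P(Y = k) = C(n, k) P^k (1 - P)^(n - k), for the coin example. A fair coin (P = 0.5) is assumed; with n = 1 the same formula reduces to the Bernoulli distribution.

```python
import math


def binomial_pmf(k, n, p):
    """P(Y = k) when Y ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 10, 0.5  # 10 tosses of a fair coin (assumed fair for the example)
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

for k, prob in enumerate(pmf):
    print(f"P(Y={k:>2}) = {prob:.4f}")
```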
What is Log Normal Distribution?
X ~ log-normal(μ, σ^2)
if log(X) ~ N(μ, σ^2), i.e. log(X) is normally distributed
Note: As σ^2 increases, the PDF becomes more right-skewed.
What are the applications of Log Normal Distribution?
- Length of comments posted in internet discussions.
- User’s dwell time on online articles.
- Salaries of people
In general, many measurements of human behaviour are approximately log-normally distributed.
How to find whether X ~ log-normal(μ, σ^2) ?
x1, x2, …, xn --> log(x1), log(x2), …, log(xn) --> yi’s
Now we can use a Q-Q plot to determine whether the yi’s follow a normal/gaussian distribution or not.
If they do, then X is log-normally distributed.
What is Power-law Distribution (a.k.a. Pareto distribution) ? Give some examples.
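- Follows the 80-20 rule
- Roughly 80% of the total comes from 20% of the items (e.g. 80% of the wealth is held by 20% of the people)
- Heavy tail: the variance is infinite when the tail index α <= 2, and the mean is also infinite when α <= 1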
Eg.
1. File size distribution in internet traffic(many small files & few large files).
2. Hard disk drive error rates.
How to check if distribution is pareto distribution?
- Q-Q plots
- log-log plots
What is Box-Cox transform?
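Power-law/Pareto --(box-cox transform)--> approximately Gaussian/Normal
- box-cox(X) --> finds the lambda(λ) that makes the transformed data as close to Gaussian as possible
- to calculate yi (xi must be strictly positive):
- IF λ != 0, yi = (xi^λ - 1)/λ
- ELSE yi = log(xi), i.e. if λ=0
If λ=0,
xi ~ log-normal distribution (since log(xi) is Gaussian)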
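A sketch of the transform itself for a given λ. Finding the best λ is left out here; as far as I know, library routines such as `scipy.stats.boxcox` estimate it by maximum likelihood, but that is not shown or relied on below.

```python
import math


def box_cox(xs, lam):
    """Box-Cox transform of strictly positive values for a given lambda."""
    if any(x <= 0 for x in xs):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(x) for x in xs]
    return [(x**lam - 1) / lam for x in xs]


xs = [1.0, 2.0, 4.0, 8.0, 16.0]  # made-up positive data
print(box_cox(xs, 0))    # plain log transform
print(box_cox(xs, 0.5))  # square-root-like transform
```

As λ approaches 0, (x^λ - 1)/λ approaches log(x), which is why the λ = 0 case is defined as the log.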
Suppose,
X : heights
Y : weights
Which three measures can be used to quantify the below type of relationships?
- As X increases, Y increases
- As X increases, Y decreases
- Covariance
- Pearson correlation coefficient
- Spearman rank correlation coefficient
What is covariance? What are its drawbacks/limitations?
Covariance(X,Y) = (1/n) Σ (xi - μx) * (yi - μy)
If Covariance(X,Y) = +ve --> As X increases, Y increases
If Covariance(X,Y) = -ve --> As X increases, Y decreases
Drawbacks/Limitations:
1. If X = height in cm, Y = weight in kg, X’ = height in ft, Y’ = weight in lbs
then Covariance(X,Y) != Covariance(X’,Y’)
Changing the units changes the covariance, so its magnitude is hard to interpret on its own.
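A small sketch of the units problem. The five height/weight pairs are made up; the conversion factors (30.48 cm per ft, 2.20462 lb per kg) are standard.

```python
def covariance(xs, ys):
    """Population covariance: (1/n) * sum((x - mean_x) * (y - mean_y))."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n


heights_cm = [150, 158, 163, 170, 181]  # made-up data
weights_kg = [48, 55, 60, 66, 80]

heights_ft = [h / 30.48 for h in heights_cm]
weights_lb = [w * 2.20462 for w in weights_kg]

cov_metric = covariance(heights_cm, weights_kg)
cov_imperial = covariance(heights_ft, weights_lb)
print(f"cov(cm, kg) = {cov_metric:.3f}")
print(f"cov(ft, lb) = {cov_imperial:.3f}")  # same data, different number
```

The sign stays the same, but the value is rescaled by the product of the two conversion factors.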
What is the Pearson correlation coefficient? What are its drawbacks/limitations?
ρx,y = Covariance(X,Y)/(σx σy)
where σx = √variance(X) and σy = √variance(Y); ρx,y always lies in [-1, +1]
If ρx,y = +ve --> As X increases, Y increases
If ρx,y = -ve --> As X increases, Y decreases
Drawbacks/Limitations:
1. ρx,y = +1 only if an exact increasing linear relationship exists between X & Y.
So, if y=x^2 with x >= 0, ρ<1 (even though y is monotonically increasing in x).
2. The slope of the straight line doesn’t affect ρx,y (only its sign does).
3. Complex relationships are not captured. Eg. sine wave
Fix for monotonic non-linear relationships --> Spearman rank correlation coefficient
What is the Spearman rank correlation coefficient?
      X    Y   rx   ry
s1  160   52    4    3
s2  150  166    2    5
s3  170   68    5    4
s4  140   46    1    1
s5  158   51    3    2
Here rx and ry are the ranks of X and Y when each is sorted in ascending order.
We saw that the Pearson coefficient measures linear relationships.
r = Pearson(rx, ry)
This means the Spearman rank correlation between two variables is equal to the Pearson correlation between the rank values of those two variables.
- If Y always increases as X increases (linear or not doesn’t matter) --> r = 1
- If Y always decreases as X increases (linear or not doesn’t matter) --> r = -1
Also, Spearman rank correlation is more robust to outliers than Pearson correlation.
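A minimal sketch of "Spearman = Pearson on ranks". The `ranks` helper assumes there are no ties, and the y = x^3 data is made up to show a monotonic but non-linear relationship.

```python
def pearson(xs, ys):
    """Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def ranks(values):
    """Ascending ranks, 1 = smallest (assumes no tied values)."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


xs = [1, 2, 3, 4, 5]
ys = [x**3 for x in xs]  # monotonic increasing, but not linear

print(f"Pearson  = {pearson(xs, ys):.4f}")   # below 1
print(f"Spearman = {spearman(xs, ys):.4f}")  # 1 for a monotonic increase
```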
Explain Correlation vs Causation.
- “Correlation” does not imply “Causation”.
- Just because two random variables are correlated (eg. X increases, Y increases) doesn’t mean X causes Y or vice versa.
Eg. Graph of nobel laureates vs chocolate consumption.
Give 4 examples of how to use correlations.
- Is salary correlated with sq. footage of your home?
- Is no. of years of education correlated with income?
- E-commerce(Amazon):
  - Time spent in 24 hrs vs money spent in 24 hrs
  - # unique visitors in a day vs $ sales in a day
- Medicine:
  - Dosage of a drug vs Reduction in blood sugar
Explain confidence interval with example.
- A confidence interval, in statistics, is a range of values computed from a sample, constructed so that it contains the population parameter in a stated proportion (the confidence level) of repeated samples
Eg.
X ~ any distribution
Also X --> heights of people --> {x1, x2, …, x10} --> random sample of size 10
Estimate population mean i.e. μ of X.
μ ≈ x̄ --> where x̄ is the sample mean --> This is a point estimate. Not bad, but we can do better
If we say, μ ∈ [162.1, 174.9] with 95% confidence
- Interval with some confidence value
- Richer than previous in terms of information
Note: If we repeat the sampling multiple times, each time we get a different value for x̄. In 95% of sampling experiments, μ will be between endpoints of C.I. calculated using x̄, but in 5% of cases it will not be. C.I. does not mean that μ lies in the interval with 95% probability.
How to compute confidence interval in case of gaussian distribution?
Say X ~ N(μ, σ^2)
Let μ = 168cm and σ = 5cm
We know from the Gaussian distribution that (μ-2σ, μ+2σ) contains about 95% of the observations.
So we can say heights of people lie in [158, 178] with 95% confidence.
Similarly, other values like 90%, 80%, etc. can be found using normal distribution tables.
Eg. Suppose C=90%
(1 - C)/2 = (1 - 0.9)/2 = 5% of the observations lie in each tail
- Observations lie in [x’, x”] with 90% confidence
- x’ and x” are the 5th and 95th percentiles, i.e. μ ± 1.645σ; these values are tabulated in normal distribution tables.
How to compute confidence interval if we don’t know the distribution but we know that it has a finite mean and variance?
Case 1:
X~ some dist. with finite μ and σ^2
Q. What is the 95% C.I. of μ?
Let σ = 5cm
{x1, x2, …, x10} --> sample of size n = 10
x̄ = sample mean = (1/10) Σ xi
As we learnt earlier using CLT, we can say,
x̄ ~ N(μ, σ^2/n), i.e. x̄ has standard deviation σ/√n
Hence we can say that,
μ ∈ [x̄ - (2σ/√n), x̄ + (2σ/√n)] with 95% confidence
Case 2: If we don't know σ
Use Student's t-distribution, with the sample standard deviation s in place of σ:
(x̄ - μ)/(s/√n) ~ t(n-1)
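A sketch of Case 1 (σ known). The ten heights are made up, and the function uses z = 1.96, the more precise value that the "2" in 2σ/√n rounds. Case 2 is not shown because the t-distribution quantile is not available in the standard library.

```python
import math
import statistics


def mean_ci_known_sigma(sample, sigma, z=1.96):
    """CLT-based confidence interval for the mean when sigma is known."""
    x_bar = statistics.fmean(sample)
    half_width = z * sigma / math.sqrt(len(sample))
    return x_bar - half_width, x_bar + half_width


heights = [165, 172, 168, 171, 160, 175, 169, 166, 170, 164]  # made-up sample, n = 10
low, high = mean_ci_known_sigma(heights, sigma=5)
print(f"95% C.I. for mu: [{low:.2f}, {high:.2f}]")
```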
Explain confidence interval using bootstrapping with example.
Task : Estimate the 95% C.I. for the median of X using only the given sample of X
S = {x1, x2, …, xn}
--> Draw bootstrap samples from S by sampling with replacement, picking each index using u(1,n), i.e. a uniform random integer between 1 and n
Let k = 1000 (number of bootstrap samples) and m <= n (size of each bootstrap sample)
- S1 : x1’, x2’, …, xm’ --> m1 --> median of sample 1
- S2 : x1’, x2’, …, xm’ --> m2 --> median of sample 2
:
- Sk : x1’, x2’, …, xm’ --> mk --> median of sample k
--> m1, m2, …, m1000
--> sort --> m1’ <= m2’ <= m3’ <= … <= m1000’ (increasing order)
--> The middle 950 of the 1000 sorted medians cover 950/1000 = 95%
Therefore the 95% C.I. is [m25’, m975’]
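The procedure can be sketched as follows. Assumptions: each bootstrap sample has the same size as the original (m = n), the data is a made-up list of the integers 1 to 101, and the function and parameter names are illustrative.

```python
import random
import statistics


def bootstrap_median_ci(sample, num_resamples=1000, confidence=0.95, seed=0):
    """Percentile bootstrap C.I. for the median of `sample`."""
    rng = random.Random(seed)
    n = len(sample)
    # Resample with replacement, take each resample's median, then sort.
    medians = sorted(
        statistics.median(rng.choices(sample, k=n)) for _ in range(num_resamples)
    )
    tail = (1 - confidence) / 2
    # 1-based positions 25 and 975 when num_resamples = 1000.
    lo_idx = max(round(tail * num_resamples) - 1, 0)
    hi_idx = min(round((1 - tail) * num_resamples) - 1, num_resamples - 1)
    return medians[lo_idx], medians[hi_idx]


data = list(range(1, 102))  # made-up sample; its median is 51
low, high = bootstrap_median_ci(data)
print(f"95% C.I. for the median: [{low}, {high}]")
```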
Explain hypothesis testing with example.
Task : Given a coin, determine if the coin is biased towards heads or not
- Test Statistic : Flip coin 5 times and count no. of heads = X
- Perform experiment --> H H H H H --> X = 5 --> This is our observation
Let H0 = Coin is not biased towards heads
P(observation | H0) = P(X=5 | Coin is not biased towards heads)
= 1/(2^5) = 1/32 ≈ 0.03 = 3%
The p-value is the probability, assuming H0, of an observation at least as extreme as the one we got. Here X=5 is the most extreme possible outcome, so the p-value is P(X=5 | H0).
Typically, p-value < 5% is said to be small.
Here P(X=5 | H0) = 3%
So there is a 3% chance of getting 5 heads in 5 flips if the coin is not biased.
3% --> quite low
The observation was actually made, so it is the ground truth.
Hence, our assumption i.e. H0 may be incorrect.
So, we reject our null hypothesis (H0) --> Reject the idea that the coin is not biased
H0 : Coin is not biased --> Null hypothesis
H1 : Coin is biased --> Alternate hypothesis
Rejecting H0 means accepting H1
Failing to reject H0 means the data are consistent with H0; it does not prove H0
So, we conclude that the coin is biased towards heads.
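The p-value for the coin example, computed as P(X >= observed heads | fair coin). The function name is made up for this sketch.

```python
import math


def p_value_heads(observed_heads, flips, p_heads=0.5):
    """P(X >= observed_heads) when each flip lands heads with probability p_heads."""
    return sum(
        math.comb(flips, k) * p_heads**k * (1 - p_heads) ** (flips - k)
        for k in range(observed_heads, flips + 1)
    )


p = p_value_heads(observed_heads=5, flips=5)
print(f"p-value = {p:.5f}")  # 1/32
print("reject H0" if p < 0.05 else "fail to reject H0")
```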
Explain resampling and permutation test with example.
Task : Determine if population mean of heights in two cities is same or not
Experiment : Measure the heights of 50 random people for each city.
Let μ1 and μ2 be the sample means of the two cities, say 162cm and 167cm.
Test statistic : μ2 - μ1 = 167 - 162 = 5cm (X)
Null hypothesis (H0) : There is no difference in the population means of the two cities
Computing P(X >= 5 | H0) :
1. Take all heights of both cities and put them together in a new set (S).
2. Randomly select 50 pts from S into S1 and put the remaining 50 into S2. This is resampling. Since S1 and S2 come randomly from the same pool (S), this simulates 2 cities having the same population mean, i.e. it simulates the null hypothesis (H0).
Calculate μ1 and μ2 and also μ2 - μ1 = δ
3. Repeat the 2nd step k times
4. Sort the δi’s in increasing order
δ1 <= δ2 <= … <= δk (Our observed difference = 5cm)
Say k = 1000 and our observed difference falls at δ801. So 20% of the simulated differences are greater than or equal to the observed difference.
P(diff >= 5cm | H0) = 0.2 --> well above 5%, so the observed difference is not statistically significant --> Fail to reject H0
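A sketch of the permutation test above, one-sided as in the card (it counts simulated differences >= the observed one). The two city samples are synthetic, drawn with means 162 and 167 and a standard deviation of 6, all assumed values.

```python
import random
import statistics


def permutation_test(a, b, num_permutations=1000, seed=0):
    """Return (observed mean(b) - mean(a), one-sided p-value under H0)."""
    rng = random.Random(seed)
    observed = statistics.fmean(b) - statistics.fmean(a)
    pooled = list(a) + list(b)
    count = 0
    for _ in range(num_permutations):
        rng.shuffle(pooled)                      # random re-split simulates H0
        s1, s2 = pooled[:len(a)], pooled[len(a):]
        delta = statistics.fmean(s2) - statistics.fmean(s1)
        if delta >= observed:
            count += 1
    return observed, count / num_permutations


random.seed(4)
city1 = [random.gauss(162, 6) for _ in range(50)]  # synthetic heights
city2 = [random.gauss(167, 6) for _ in range(50)]
observed, p_value = permutation_test(city1, city2)
print(f"observed difference = {observed:.2f} cm, p-value = {p_value:.3f}")
```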
Explain how to perform K-S test for similarity of two distributions.
Let X1 and X2 be the two samples of size m and n respectively.
Also, let Dm,n be the maximum difference in their empirical CDFs
Test Statistic : Dm,n = sup|F1,m(x) - F2,n(x)| --> maximum diff. in their CDFs
Null Hypothesis (H0) : X1 and X2 have the same distribution.
If, Dm,n > c(α) √((m+n)/mn)
then,
We reject our null hypothesis (H0) and conclude that X1 and X2 have diff distributions
else,
We fail to reject H0
Note : α is chosen by us and the matching c(α) value is taken from a table
Eg. If we decide α=0.05, then the corresponding value c(α) ≈ 1.36 is taken from the table
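A sketch of the two-sample K-S decision rule. Instead of a table lookup it uses the large-sample approximation c(α) = √(-ln(α/2)/2), which gives about 1.36 for α = 0.05; the input lists are made up.

```python
import bisect
import math


def ks_two_sample(x1, x2, alpha=0.05):
    """Return (D, threshold, reject_H0) for the two-sample K-S test."""
    m, n = len(x1), len(x2)
    s1, s2 = sorted(x1), sorted(x2)
    d = 0.0
    # The largest gap between two empirical CDFs occurs at a data point.
    for v in s1 + s2:
        f1 = bisect.bisect_right(s1, v) / m  # empirical CDF of x1 at v
        f2 = bisect.bisect_right(s2, v) / n  # empirical CDF of x2 at v
        d = max(d, abs(f1 - f2))
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    threshold = c_alpha * math.sqrt((m + n) / (m * n))
    return d, threshold, d > threshold


same = ks_two_sample(list(range(100)), list(range(100)))
shifted = ks_two_sample(list(range(100)), list(range(100, 200)))
print(same)     # D = 0, H0 not rejected
print(shifted)  # D = 1, H0 rejected
```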
Explain proportional sampling with example.
d = [2.0, 6.0, 1.2, 5.8, 20.0]
Task : Pick an element amongst the n elements s.t. probability of picking an element is proportional to the di’s
Step1 :
a. s = Σ di = 35
b. di’ = di/s
--> d1’ = 0.0571
--> d2’ = 0.1714
--> d3’ = 0.0343
--> d4’ = 0.1657
--> d5’ = 0.5714
Here Σ di’ = 1
c. cumulative normalized sum
--> d1” = d1’ = 0.0571
--> d2” = d1” + d2’ = 0.2286
--> d3” = d2” + d3’ = 0.2629
--> d4” = d3” + d4’ = 0.4286
--> d5” = d4” + d5’ = 1
Step2 :
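sample one value from unif(0.0, 1.0)
r = numpy.random.uniform(0.0, 1.0)
let r = 0.6
Step 3 :
Proportional sampling
if r <= d1”
return 1
elif r <= d2”
return 2
elif r <= d3”
return 3
elif r <= d4”
return 4
else
return 5
With r = 0.6 we have d4” < r <= d5”, so element 5 (value 20.0) is picked.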
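A runnable sketch of the three steps. It uses the standard library's `random` and `bisect` in place of numpy, and the function names `pick` and `proportional_sample` are made up for illustration.

```python
import bisect
import itertools
import random


def pick(d, r):
    """1-based index selected by a uniform draw r in [0, 1]."""
    total = sum(d)
    # Cumulative normalized sums: d1'', d2'', ..., dn''.
    cumulative = list(itertools.accumulate(x / total for x in d))
    # First position whose cumulative sum is >= r (clamped for rounding error).
    return min(bisect.bisect_left(cumulative, r), len(d) - 1) + 1


def proportional_sample(d, rng=random):
    """Pick an index with probability proportional to the values in d."""
    return pick(d, rng.random())


d = [2.0, 6.0, 1.2, 5.8, 20.0]
print(pick(d, 0.6))  # r = 0.6 falls in the last interval, so element 5

random.seed(5)
draws = [proportional_sample(d) for _ in range(20_000)]
for i in range(1, len(d) + 1):
    print(f"element {i}: picked {draws.count(i) / len(draws):.3f} of the time")
```

Over many draws, the observed frequencies should come out close to the di’ values computed in Step 1.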