Module 2_ 3. Probability and Statistics Flashcards

Question 1

Q

Eg. Rolling a fair dice
Is it an example of continuous random variable or discrete random variable?

Answer

A

discrete random variable

Question 2

Q

Eg. Measuring height of a randomly picked student
Is it an example of continuous random variable or discrete random variable?

Answer

A

continuous random variable

Question 3

Q

What is the difference between population and sample? Explain with example.

Answer

A

Suppose we need to calculate the average height of people in the world.
Population: the heights of all 7 billion people. The population mean is
μ = (1/7B) Σ hi
Sample: a subset of those heights (say only 1000 randomly chosen heights). The sample mean is
x̄ = (1/1000) Σ hi

As the sample size increases,
x̄ ≈ μ
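
A minimal sketch of the idea that the sample mean approaches the population mean as the sample grows. The population here is synthetic (100,000 simulated heights, an assumption made only so the example runs quickly), not real data.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Assumption: a synthetic "population" of 100,000 heights in cm, not real data.
population = [rng.gauss(168.0, 7.0) for _ in range(100_000)]
mu = statistics.fmean(population)  # population mean

# Sample means for growing sample sizes, drawn without replacement.
sample_means = {}
for n in (10, 1_000, 50_000):
    sample = rng.sample(population, n)
    sample_means[n] = statistics.fmean(sample)
    print(f"n={n:>6}  x_bar={sample_means[n]:.2f}  mu={mu:.2f}")
```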

Question 4

Q

Does Gaussian distribution occur in real world? If yes give 2 examples.

Answer

A

YES
- Sepal length (SL) and petal length (PL) of iris flowers.
- Heights and weights of people in real world.

Question 5

Q

If X follows Gaussian distribution and has mean(μ) and variance(σ^2), then write it in mathematical form.

Answer

A

X ~ N(μ, σ^2)

Question 6

Q

What is the 68-95-99.7 rule?

Answer

A

In range [μ - σ, μ + σ], about 68.3% of points lie
In range [μ - 2σ, μ + 2σ], about 95.4% of points lie
In range [μ - 3σ, μ + 3σ], about 99.7% of points lie
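
The percentages in this rule can be checked from the standard normal CDF. A small sketch using Python's `statistics.NormalDist`; the helper name `mass_within` is made up for this example.

```python
from statistics import NormalDist

z = NormalDist(mu=0.0, sigma=1.0)  # standard normal

def mass_within(k: float) -> float:
    """Probability that a Gaussian value lies within k standard deviations of its mean."""
    return z.cdf(k) - z.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} sigma: {mass_within(k):.4f}")
```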

Question 7

Q

What is the mathematical formulation of Gaussian distribution?

Answer

A

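f(x) = (1/(σ√(2π))) exp{-(x-μ)^2/(2σ^2)}
(f(x) is the probability density at x; for a continuous variable, P(X=x) itself is 0)

Simplifying the above equation,
Let μ=0, σ^2=1
f(x) = (1/√(2π)) exp{(-1/2)x^2}

After dropping the constants,
y = exp{-x^2}
As x moves away from μ, y falls off as the exponential of -x^2, i.e. faster than a plain exponential decay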
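
The density formula translates directly into code. This is a sketch; the function name `gaussian_pdf` and the evaluation points are chosen only for illustration.

```python
import math

def gaussian_pdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """Density of N(mu, sigma^2) at x."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

peak = gaussian_pdf(0.0)                              # highest point, at x = mu
left, right = gaussian_pdf(-1.5), gaussian_pdf(1.5)   # equal, by symmetry about mu
print(peak, left, right)
```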

Question 8

Q

True or False.
PDF of Gaussian distribution is symmetric.

Answer

A

True. The PDF is symmetric about the mean μ.

Question 9

Q

What is Kurtosis? What is the formula for Excess kurtosis?

Answer

A

Kurtosis - a measure of the tailedness of a distribution (how heavy its tails are), not of its peakedness
Excess kurtosis = Kurtosis - 3 (3 is the kurtosis of a Gaussian)

Question 10

Q

What is standard normal variate(Z)?

Answer

A

Z ~ N(0,1) where μ=0 and σ^2=1

Question 11

Q

What is standardization?

Answer

A

Standardization converts a Gaussian random variable with finite mean (μ) and variance (σ^2) into a standard normal variate.
x’i = (xi - μ)/σ -> x’i ~ N(0,1)
Now we can say about 68.3% of these converted points lie between -1 and 1
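
A short sketch of standardization on a handful of made-up heights (the values are hypothetical). Note that subtracting the mean and dividing by the standard deviation only shifts and rescales the data; the result is N(0,1) only if the original data were Gaussian.

```python
import statistics

heights = [150.0, 158.0, 160.0, 165.0, 172.0, 180.0]  # hypothetical heights in cm
mu = statistics.fmean(heights)
sigma = statistics.pstdev(heights)  # population standard deviation

standardized = [(x - mu) / sigma for x in heights]
print([round(z, 2) for z in standardized])
```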

Question 12

Q

What is Kernel Density Estimation(KDE)?

Answer

A

KDE estimates the PDF from a sample. A Gaussian kernel is centred on each observed point (the point is the kernel's mean), and the kernels are added up and normalized. Regions with a higher density of points therefore get a higher value in the estimated PDF.
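
A minimal KDE sketch with a Gaussian kernel. The data points and the bandwidth of 0.5 are assumptions picked for illustration; in practice the bandwidth has to be chosen with some care.

```python
import math

def gaussian_kernel(u: float) -> float:
    """Standard normal density, used as the kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(x: float, data: list, bandwidth: float) -> float:
    """Average of kernels centred on each data point, scaled by the bandwidth."""
    n = len(data)
    return sum(gaussian_kernel((x - xi) / bandwidth) for xi in data) / (n * bandwidth)

data = [0.9, 1.0, 1.1, 1.2, 5.0]  # hypothetical sample: four points near 1, one near 5
bandwidth = 0.5                   # assumed smoothing width

dense = kde(1.0, data, bandwidth)   # estimate where points are crowded
sparse = kde(5.0, data, bandwidth)  # estimate where there is a single point
print(dense, sparse)
```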

Question 13

Q

What is sampling distribution?

Answer

A

Let's say we take m random samples, each of size n (say n=30).
S1, S2, S3, ..., Sm (m samples)
x̄1, x̄2, x̄3, ..., x̄m are the means of the m samples
Then the x̄i follow a distribution called the “sampling distribution of sample means”

Question 14

Q

What is Central Limit Theorem(CLT)?

Answer

A

If X has finite mean (μ) and variance (σ^2),
-> S1, S2, S3, ..., Sm (m samples of size n)
-> x̄1, x̄2, x̄3, ..., x̄m (sample means)
-> x̄i ~ N(μ, (σ^2)/n) approximately, as n->∞

here σ^2 is the variance of the original data

CLT is powerful because it works for data with any kind of distribution, as long as it has a finite mean (μ) and variance (σ^2)

Note: CLT doesn’t apply to a Pareto distribution whose variance is infinite (shape parameter α <= 2; for α <= 1 the mean is infinite as well)

As a rule of thumb, in practice the sampling distribution of sample means is already close to Gaussian when n >= 30
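
The CLT can be seen in a quick simulation. This sketch assumes X is exponential with rate 1 (so μ = 1 and σ^2 = 1), a clearly non-Gaussian choice made only for the demonstration, and checks that the sample means centre on μ with variance close to σ^2/n.

```python
import random
import statistics

rng = random.Random(42)
n, m = 30, 5_000  # sample size and number of samples

# Assumption: X ~ Exponential(rate=1), so mu = 1 and sigma^2 = 1.
sample_means = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(m)
]

mean_of_means = statistics.fmean(sample_means)    # should be close to mu = 1
var_of_means = statistics.pvariance(sample_means)  # should be close to sigma^2 / n = 1/30
print(mean_of_means, var_of_means)
```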

Question 15

Q

What are Q-Q plots? How to plot them?

Answer

A

Q-Q plots stand for Quantile-Quantile plots.
They can be used for comparing two distributions (X and Y) and finding out whether they have the same distribution.

Eg. Given X: x1, x2, ..., x500
Is X Gaussian distributed?

Steps:
1. Sort the xi’s and compute percentiles
-> x1, x2, ..., x500
-> Sort in ascending order
-> Calculate the 1st to 100th percentiles
-> x(1), x(2), ..., x(100)

2. Generate Y ~ N(0,1) and compute its percentiles
-> y1, y2, ..., y1000
-> Sort in ascending order
-> Calculate the 1st to 100th percentiles
-> y(1), y(2), ..., y(100)

3. Plot the Q-Q plot with y(1), y(2), ..., y(100) on one axis and x(1), x(2), ..., x(100) on the other

If all points lie roughly on a straight line, then X and Y belong to the same family of distributions, so X is Gaussian.
But we can’t conclude that X also has μ=0 and σ^2=1
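
A sketch of the quantile computation behind a Q-Q plot, without the plotting itself. Two liberties are taken and should be read as assumptions: X is a synthetic Gaussian sample, and the percentiles of Y are taken from the exact N(0,1) inverse CDF instead of from 1000 simulated points. `statistics.quantiles(..., n=100)` returns the 99 cut points between the 1st and 99th percentiles.

```python
import math
import random
from statistics import NormalDist, fmean, quantiles

rng = random.Random(1)
x = [rng.gauss(10.0, 3.0) for _ in range(500)]  # synthetic sample standing in for X

x_q = quantiles(x, n=100)                                      # 1st..99th percentiles of X
y_q = [NormalDist().inv_cdf(p / 100) for p in range(1, 100)]   # same percentiles of N(0,1)

def pearson(a, b):
    """Pearson correlation, used here only to measure how straight the Q-Q points are."""
    ma, mb = fmean(a), fmean(b)
    num = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    den = math.sqrt(sum((u - ma) ** 2 for u in a) * sum((v - mb) ** 2 for v in b))
    return num / den

straightness = pearson(y_q, x_q)  # close to 1 when the points are nearly collinear
print(straightness)
```

Plotting the pairs `zip(y_q, x_q)` with any plotting library gives the Q-Q plot; the slope and intercept of the line reflect σ and μ of X, which is why a straight line alone does not imply μ=0 and σ^2=1.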

Question 16

Q

Task: Order t-shirts for all employees (100k)
a. How many XL t-shirts should you order?
Domain knowledge :
height >= 180cm for XL t-shirt
height in [160cm, 180cm) for L t-shirt

Answer

A

Collect heights of 500 random employees.
Assume heights ~ N(μ, σ^2) and estimate μ and σ from the sample.
Plot the CDF.
Suppose from the CDF we observe P(h >= 180cm) = 1%
So we order 1000 XL t-shirts, i.e. 1% of 100K
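
The same calculation can be done from the Gaussian CDF directly. In this sketch μ = 166 and σ = 6 are hypothetical values standing in for estimates from the 500 measured heights; with them P(h >= 180cm) comes out close to the 1% used above.

```python
from statistics import NormalDist

# Assumption: mu and sigma are hypothetical estimates from the sampled heights.
heights = NormalDist(mu=166.0, sigma=6.0)
employees = 100_000

p_xl = 1.0 - heights.cdf(180.0)                # P(h >= 180)
p_l = heights.cdf(180.0) - heights.cdf(160.0)  # P(160 <= h < 180)

xl_order = round(p_xl * employees)
l_order = round(p_l * employees)
print(xl_order, l_order)
```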

Question 17

Q

Task: Salaries
If X ~ N(μ, σ^2),
a. Calculate how many employees make a salary >= $100K?
b. Calculate how many employees have salary between [$50K, $70K] ?

Answer

A

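a. Plot the CDF. The fraction earning >= $100K is 1 - CDF($100K); multiply it by the number of employees to get a count.
b. Plot the CDF and compute CDF($70K) - CDF($50K); multiply it by the number of employees to get a count.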

Question 18

Q

Suppose we don’t know the distribution, but we know that μ is finite and the variance is non-zero and finite.
Task: Salaries
μ=$40K and σ=$10K
a. What % of individuals have salary in range of [$20K, $60K] ?

Answer

A

Chebyshev’s inequality:
P(|X - μ| >= kσ) <= 1/k^2 -> P(μ - kσ < X < μ + kσ) >= 1 - 1/k^2

20K = μ - 2σ
40K = μ
60K = μ + 2σ

P($20K < X < $60K) >= 1 - (1/2^2)
P($20K < X < $60K) >= 0.75
At least 75% of individuals have a salary in the range [$20K, $60K]
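
A small sketch of the bound as a function of k, applied to the salary numbers from the question. The function name `chebyshev_lower_bound` is made up for this example.

```python
def chebyshev_lower_bound(k: float) -> float:
    """Lower bound on P(mu - k*sigma < X < mu + k*sigma) for any distribution
    with finite mean and finite non-zero variance."""
    if k <= 0:
        raise ValueError("k must be positive")
    return max(0.0, 1.0 - 1.0 / k ** 2)  # the bound is vacuous (0) for k <= 1

mu, sigma = 40_000, 10_000
high = 60_000
k = (high - mu) / sigma            # 2 standard deviations
bound = chebyshev_lower_bound(k)   # 1 - 1/4 = 0.75
print(bound)
```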

Question 19

Q

Explain Bernoulli Distribution.

Answer

A

Bernoulli Distribution:
Eg. X -> r.v. for getting heads in a single coin toss
- Discrete distribution with exactly 2 outcomes
- Probabilities -> p for one outcome & (1 - p) for the other

Question 20

Q

Explain Binomial Distribution.

Answer

A

Binomial Distribution:
Eg. A coin is tossed n times (n=10)
- Y -> number of heads in the n tosses
- Y ∈ {0, 1, 2, ..., 10}
Y ~ Binomial(n, p) -> n = no. of independent trials & p = probability of getting heads in each trial

Question 21

Q

What is Log Normal Distribution?

Answer

A

X ~ log-normal(μ, σ^2)
if log(X) ~ N(μ, σ^2), i.e. the natural log of X is normally distributed

Note: As σ^2 increases, the PDF becomes more right-skewed.

Question 22

Q

What are the applications of Log Normal Distribution?

Answer

A

Length of comments posted in internet discussions.
User’s dwell time on online articles.
Salaries of people

In general, many quantities driven by human behaviour are approximately log-normally distributed.

Question 23

Q

How to find whether X ~ log-normal(μ, σ^2) ?

Answer

A

x1,x2,…..,xn ———–> log(x1), log(x2), ….., log(xn) ———–>yi’s

Now we can use QQ plot to determine if yi’s follow normal/gaussian distribution or not.
If they follow then X is log-normally distributed.

Question 24

Q

What is Power-law Distribution (a.k.a. Pareto distribution) ? Give some examples.

Answer

A

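Follows the 80-20 rule
About 80% of the points lie in 20% of the range of values, and the remaining 20% are spread over the long tail
Heavy-tailed: depending on the shape parameter α, the variance (α <= 2) and even the mean (α <= 1) can be infinite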

Eg.
1. File size distribution in internet traffic(many small files & few large files).
2. Hard disk drive error rates.

Question 25

Q

How to check if distribution is pareto distribution?

Answer

A

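Q-Q plots (against a Pareto reference distribution)
log-log plots (a power law shows up as an approximately straight line when both axes are on a log scale)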

Question 26

Q

What is Box-Cox transform?

Answer

A

Power-law/Pareto --(Box-Cox transform)--> approximately Gaussian/Normal

box-cox(X) -> finds the best lambda (λ)
to calculate yi
- IF λ != 0, yi = (xi^λ - 1)/λ
- ELSE yi = log(xi), i.e. if λ=0

If λ=0 and the yi’s are Gaussian,
xi ~ log-normal distribution
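
A sketch of the transform for a given λ, using the standard Box-Cox definition y = (x^λ - 1)/λ for λ != 0 and y = log(x) for λ = 0. Choosing λ itself (usually by maximum likelihood) is outside this sketch, and the function name `box_cox` is hypothetical.

```python
import math

def box_cox(x: float, lam: float) -> float:
    """Box-Cox transform of a single positive value x for a given lambda."""
    if x <= 0:
        raise ValueError("Box-Cox is defined only for positive x")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam

print(box_cox(4.0, 0.5))     # (sqrt(4) - 1) / 0.5
print(box_cox(math.e, 0.0))  # log(e)
```

As λ approaches 0, (x^λ - 1)/λ approaches log(x), which is why the λ = 0 case is defined as the logarithm.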

Question 27

Q

Suppose,
X : heights
Y : weights
Which three measures can be used to quantify the below type of relationships?
- As X increases, Y increases
- As X increases, Y decreases

Answer

A

Co-variance
Pearson correlation coefficient
Spearman rank correlation coefficient

Question 28

Q

What is co-variance? What are its drawbacks/limitations?

Answer

A

Co-variance(X,Y) = (1/n) Σ (xi - μx) * (yi - μy)

If Co-variance(X,Y) = +ve -> As X increases, Y tends to increase
If Co-variance(X,Y) = -ve -> As X increases, Y tends to decrease

Drawbacks/Limitations:
1. If X = height in cm, Y = weight in kg, X’ = height in ft, Y’ = weight in lbs
then Co-variance(X,Y) != Co-variance(X’,Y’)

If we change the units, the covariance also changes, so its magnitude cannot be used to judge how strong the relationship is
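
The unit dependence is easy to demonstrate. In this sketch the heights and weights are hypothetical, and the conversion factors (30.48 cm per ft, 2.20462 lb per kg) are the usual ones.

```python
from statistics import fmean

def covariance(xs, ys):
    """Population covariance: (1/n) * sum((x - mean_x) * (y - mean_y))."""
    mx, my = fmean(xs), fmean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

heights_cm = [150.0, 160.0, 165.0, 172.0, 180.0]  # hypothetical data
weights_kg = [50.0, 58.0, 63.0, 70.0, 82.0]
heights_ft = [h / 30.48 for h in heights_cm]
weights_lb = [w * 2.20462 for w in weights_kg]

cov_metric = covariance(heights_cm, weights_kg)
cov_imperial = covariance(heights_ft, weights_lb)
print(cov_metric, cov_imperial)  # same relationship, different numbers
```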

Question 29

Q

What is Pearson correlation coefficient? What are its drawbacks/limitations?

Answer

A

ρx,y = Co-variance(X,Y)/(σx σy)
where σx = √variance(X) and σy = √variance(Y), so -1 <= ρx,y <= +1

If ρx,y = +ve -> As X increases, Y tends to increase
If ρx,y = -ve -> As X increases, Y tends to decrease

Drawbacks/Limitations:
1. ρx,y = +1 only if an exact increasing linear relationship exists between X & Y.
So, if y=x^2 with x >= 0, ρ<1 (even though y is monotonically increasing in x).
2. The slope of the straight line doesn’t affect ρx,y (only its sign does), so ρ says nothing about how steep the relationship is.
3. Complex non-monotonic relationships are not captured. Eg. sine wave

Fix (for monotonic relationships) -> Spearman rank correlation coefficient

Question 30

Q

What is Spearman rank correlation coefficient?

Answer

A

      X     Y     rx   ry
s1    160   52    4    3
s2    150   166   2    5
s3    170   68    5    4
s4    140   46    1    1
s5    158   51    3    2

Here rx and ry are the ranks of each X and Y value when sorted in ascending order.

We saw that ρx,y only measures linear relationships.

r = ρrx,ry
This means the Spearman rank correlation between two variables is equal to the Pearson correlation between the rank values of those two variables

If Y always increases as X increases (linear or not doesn’t matter) -> r = 1
If Y always decreases as X increases (linear or not doesn’t matter) -> r = -1

Also, Spearman rank correlation is more robust to outliers than Pearson correlation.
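
A sketch of Spearman's coefficient as "Pearson on the ranks", run on the five (X, Y) pairs from the table and on a monotonic but non-linear pair (y = x^3). The helper names are made up, and `ranks` assumes there are no tied values.

```python
import math
from statistics import fmean

def pearson(xs, ys):
    mx, my = fmean(xs), fmean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den

def ranks(values):
    """Rank 1 for the smallest value; assumes no ties."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(xs, ys):
    return pearson(ranks(xs), ranks(ys))

X = [160, 150, 170, 140, 158]
Y = [52, 166, 68, 46, 51]
print(ranks(X), ranks(Y), spearman(X, Y))

xs = [1, 2, 3, 4, 5]
cubes = [x ** 3 for x in xs]                     # monotonic, not linear
print(pearson(xs, cubes), spearman(xs, cubes))   # Pearson < 1, Spearman = 1
```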

Question 31

Q

Explain Correlation vs Causation.

Answer

A

“Correlation” does not imply “Causation”.
Just because two random variables are correlated (eg. X increases, Y increases) doesn’t mean X causes Y or vice versa.
Eg. Graph of Nobel laureates vs chocolate consumption.

Question 32

Q

Give 4 examples of how to use correlations?

Answer

A

Is salary correlated with sq. footage of your home?
Is no. of years of education correlated with income?
E-commerce(Amazon):
- Time spent in 24 hrs vs money spent in 24 hrs
- # unique visitors in a day vs $ sales in a day
Medicine:
- Dosage of a drug vs Reduction in blood sugar

Question 33

Q

Explain confidence interval with example.

Answer

A

A confidence interval, in statistics, is a range of values computed from a sample and constructed so that, over repeated sampling, a stated proportion of such intervals (the confidence level) contains the population parameter.

Eg.
X ~ any distribution
Also X -> heights of people -> {x1, x2, ..., x10} -> random sample of size 10

Estimate the population mean, i.e. μ of X.

μ ≈ x̄, where x̄ is the sample mean. This is a point estimate. Not bad, but we can do better.

If we say, μ ∈ [162.1, 174.9] with 95% confidence
- Interval with some confidence value
- Richer than previous in terms of information

Note: If we repeat the sampling multiple times, each time we get a different value for x̄. In 95% of sampling experiments, μ will be between endpoints of C.I. calculated using x̄, but in 5% of cases it will not be. C.I. does not mean that μ lies in the interval with 95% probability.

Question 34

Q

How to compute confidence interval in case of gaussian distribution?

Answer

A

Say X ~ N(μ, σ^2)
Let μ = 168cm and σ = 5cm
We know from the Gaussian distribution that (μ-2σ, μ+2σ) contains about 95% of the observations.

So we can say heights of people lie in [158, 178] with 95% confidence.

Similarly, intervals for other values like 90%, 80%, etc. can be found using normal distribution tables.

Eg. Suppose C=90%
(1 - C)/2 = (1 - 0.9)/2 = 5% -> probability left out in each tail
- Observations lie in [x’, x”] with 90% confidence
- x’ and x” are the 5th and 95th percentiles; they can be looked up in the normal distribution tables. All this data is tabulated.

Question 35

Q

How to compute confidence interval if we don’t know the distribution but we know that it has a finite mean and variance?

Answer

A

Case 1: σ is known
X ~ some dist. with finite μ and σ^2
Q. What is the 95% C.I. of μ?
Let σ = 5cm
{x1, x2, ..., x10} -> sample of size n = 10
x̄ = sample mean = (1/n) Σ xi, with n=10
As we learnt earlier, using CLT we can say,
x̄ ~ N(μ, σ^2/n) approximately, i.e. x̄ has standard deviation σ/√n

Hence we can say that,
μ ∈ [x̄ - (2σ/√n), x̄ + (2σ/√n)] with 95% confidence

Case 2: If we don’t know σ
Use the sample standard deviation s and Student’s t-distribution with n-1 degrees of freedom:
(x̄ - μ)/(s/√n) ~ t(n-1)
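
A sketch of Case 1 (σ known). The ten heights are hypothetical, and the exact 97.5th percentile of N(0,1) (about 1.96) is used in place of the rounded factor 2. Case 2 needs t-distribution quantiles, which the Python standard library does not provide, so it is left out here.

```python
import math
from statistics import NormalDist, fmean

sample = [162.0, 171.5, 168.0, 165.5, 174.0,
          169.0, 166.5, 172.5, 163.0, 170.0]  # hypothetical heights in cm
sigma = 5.0                                    # population std. dev., assumed known
n = len(sample)

x_bar = fmean(sample)
z = NormalDist().inv_cdf(0.975)        # about 1.96 for a 95% interval
half_width = z * sigma / math.sqrt(n)  # z * (sigma / sqrt(n))
ci = (x_bar - half_width, x_bar + half_width)
print(ci)
```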

Question 36

Q

Explain confidence interval using bootstrapping with example.

Answer

A

Task : Estimate the 95% C.I. for the median of X using only the given sample of X

S = {x1, x2, ..., xn}
-> draw new samples from S with replacement, picking indices with u(1,n), i.e. a uniform random integer between 1 and n

Let k = 1000 (number of bootstrap samples) and m <= n (size of each bootstrap sample)

S1 : x1’, x2’, ..., xm’ -> m1 -> median of sample 1
S2 : x1’, x2’, ..., xm’ -> m2 -> median of sample 2
:
:
:
Sk : x1’, x2’, ..., xm’ -> mk -> median of sample k

-> m1, m2, ..., m1000
-> sort -> m1’ <= m2’ <= m3’ <= ... <= m1000’ (increasing order)
-> the middle 950 of the 1000 sorted medians cover 950/1000 = 95%
Therefore the 95% C.I. is [m25’, m975’]
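
A sketch of the bootstrap procedure with k = 1000 resamples. The original sample is synthetic (an assumption), and each resample has the same size as the sample (m = n).

```python
import random
import statistics

rng = random.Random(7)
sample = [rng.gauss(168.0, 5.0) for _ in range(200)]  # synthetic stand-in for S
n = len(sample)
k = 1000  # number of bootstrap samples

# Resample with replacement, take the median each time, then sort the medians.
medians = sorted(
    statistics.median(rng.choices(sample, k=n)) for _ in range(k)
)

ci = (medians[24], medians[974])  # 25th and 975th sorted medians (0-based indices)
print(ci)
```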

Question 37

Q

Explain hypothesis testing with example.

Answer

A

Task : Given a coin, determine if the coin is biased towards heads or not

Test Statistic : Flip the coin 5 times and count the no. of heads = X
Perform experiment -> H H H H H -> X = 5 -> This is our observation

Let H0 = Coin is not biased towards heads

P(observation | H0) = P(X=5 | Coin is not biased towards heads)
= 1/(2^5) = 1/32 ≈ 0.03 = 3%

The probability, assuming H0 is true, of an observation at least as extreme as the one we got is called the p-value. Here nothing is more extreme than 5 heads in 5 flips, so the p-value is P(X=5 | H0).
Typically, a p-value < 5% is said to be small.

Here P(X=5 | H0) = 3%
So there is a 3% chance of getting 5 heads in 5 flips if the coin is not biased.
3% -> quite low
The observation comes from an actual experiment, so it is the ground truth.
Hence, our assumption, i.e. H0, is probably incorrect.
So, we reject our null hypothesis (H0) -> Reject the idea that the coin is not biased
H0 : Coin is not biased -> Null hypothesis
H1 : Coin is biased -> Alternate hypothesis

Rejecting H0 means accepting H1
If the p-value is not small, we fail to reject H0 (this is weaker than proving H0)

So, we conclude that the coin is biased towards heads.
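
The p-value for the coin example can be computed from the binomial distribution. A sketch; the function name `p_value_at_least` is made up, and it returns the probability of seeing at least the observed number of heads when the coin is fair.

```python
from math import comb

def p_value_at_least(heads: int, flips: int, p: float = 0.5) -> float:
    """P(X >= heads) when X ~ Binomial(flips, p), i.e. under H0 with p = 0.5."""
    return sum(
        comb(flips, k) * p ** k * (1 - p) ** (flips - k)
        for k in range(heads, flips + 1)
    )

p_val = p_value_at_least(5, 5)  # 1/32
alpha = 0.05
reject_h0 = p_val < alpha
print(p_val, reject_h0)
```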

Question 38

Q

Explain resampling and permutation test with example.

Answer

A

Task : Determine if the population mean of heights in two cities is the same or not

Experiment : Measure the heights of 50 random people in each city.
Let μ1 and μ2 be the sample means of the two cities, say 162cm and 167cm

Test statistic : X = μ2 - μ1 = 167 - 162 = 5cm

Null hypothesis (H0) : There is no difference in the population means of the two cities

Computing P(X >= 5 | H0) :
1. Take all 100 heights from both cities and put them together in a new set (S).
2. Randomly select 50 pts from S into S1 and the remaining 50 into S2. This is resampling. Since S1 and S2 come from the same pool (S) at random, this simulates two cities having the same population mean, i.e. it simulates the null hypothesis (H0).
Calculate μ1 and μ2 of S1 and S2, and also δ = μ2 - μ1
3. Repeat the 2nd step k times
4. Sort the δi’s in increasing order
δ1 <= δ2 <= ... <= δk, and locate the observed difference (5cm) in this list
Say k = 1000 and our observed difference falls at δ801. So 20% of the simulated differences are greater than or equal to the observed difference.

P(diff >= 5cm | H0) = 0.2 -> not small (> 5%) -> not significant -> Fail to reject H0
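
A sketch of the permutation test. The two city samples are synthetic (an assumption), so the resulting p-value will differ from the 0.2 used in the worked example; the point is the procedure of pooling, reshuffling and counting.

```python
import random
from statistics import fmean

rng = random.Random(3)
city1 = [rng.gauss(162.0, 8.0) for _ in range(50)]  # synthetic heights, city 1
city2 = [rng.gauss(167.0, 8.0) for _ in range(50)]  # synthetic heights, city 2

observed = fmean(city2) - fmean(city1)  # observed test statistic

pooled = city1 + city2
k = 1000
at_least_as_large = 0
for _ in range(k):
    rng.shuffle(pooled)                           # random split simulates H0
    s1, s2 = pooled[:50], pooled[50:]
    delta = fmean(s2) - fmean(s1)
    if delta >= observed:
        at_least_as_large += 1

p_value = at_least_as_large / k  # estimate of P(diff >= observed | H0)
print(observed, p_value)
```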

Question 39

Q

Explain how to perform K-S test for similarity of two distributions.

Answer

A

Let X1 and X2 be the two samples of size m and n respectively.
Let F1,m(x) and F2,n(x) be their empirical CDFs.

Test Statistic : Dm,n = sup|F1,m(x) - F2,n(x)| -> maximum difference between their CDFs

Null Hypothesis (H0) : X1 and X2 have the same distribution.

If Dm,n > c(α) √((m+n)/(m·n))
then,
We reject our null hypothesis (H0) and conclude that X1 and X2 have different distributions
else,
We fail to reject H0

Note : α is chosen by us and the c(α) value is taken from a table
Eg. If we decide α=0.05 then the corresponding value for c(α) (about 1.36) is taken from the table
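
A sketch of the two-sample K-S statistic and its rejection threshold. Instead of a table lookup it uses the common large-sample approximation c(α) = √(-ln(α/2)/2), which is an assumption of this example; the function names are made up.

```python
import bisect
import math

def ecdf(sorted_sample, x):
    """Empirical CDF: fraction of sample values <= x."""
    return bisect.bisect_right(sorted_sample, x) / len(sorted_sample)

def ks_statistic(a, b):
    """Maximum absolute difference between the two empirical CDFs."""
    sa, sb = sorted(a), sorted(b)
    return max(abs(ecdf(sa, x) - ecdf(sb, x)) for x in sa + sb)

def ks_critical_value(alpha, m, n):
    """Threshold c(alpha) * sqrt((m + n) / (m * n)), with approximate c(alpha)."""
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2.0))
    return c_alpha * math.sqrt((m + n) / (m * n))

x1 = [1.0, 2.0, 3.0]
x2 = [4.0, 5.0, 6.0]
d = ks_statistic(x1, x2)  # the two samples do not overlap at all
print(d, ks_critical_value(0.05, len(x1), len(x2)))
```

With samples this small the approximation for c(α) is rough, so the comparison against the threshold is only meaningful for larger m and n.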

Question 40

Q

Explain proportional sampling with example.

Answer

A

d = [2.0, 6.0, 1.2, 5.8, 20.0]
Task : Pick one element amongst the n elements s.t. the probability of picking an element is proportional to its di

Step 1 :
a. s = Σ di = 35
b. di’ = di/s
-> d1’ = 0.0571
-> d2’ = 0.1714
-> d3’ = 0.0343
-> d4’ = 0.1657
-> d5’ = 0.5714
Here Σ di’ = 1
c. cumulative normalized sum
-> d1” = d1’ = 0.0571
-> d2” = d1” + d2’ = 0.2286
-> d3” = d2” + d3’ = 0.2629
-> d4” = d3” + d4’ = 0.4286
-> d5” = d4” + d5’ = 1

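Step 2 :
sample one value from unif(0.0, 1.0)
r = numpy.random.uniform(0.0, 1.0, 1)
let r = 0.6

Step 3 :
Proportional sampling
if r <= d1”:
    return 1
elif r <= d2”:
    return 2
elif r <= d3”:
    return 3
elif r <= d4”:
    return 4
else:
    return 5

With r = 0.6 we have d4” < r <= d5”, so element 5 (di = 20.0) is picked.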
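
The three steps can be sketched as one function. This version uses the standard library's `random` and `bisect` instead of `numpy.random.uniform`, which is a substitution made to keep the example dependency-free; the function name `proportional_sample` is hypothetical.

```python
import bisect
import itertools
import random

def proportional_sample(weights, rng):
    """Return a 1-based index, picked with probability proportional to its weight."""
    total = sum(weights)
    # Steps 1b and 1c: normalize, then take the cumulative sum.
    cumulative = list(itertools.accumulate(w / total for w in weights))
    # Step 2: one uniform draw.
    r = rng.uniform(0.0, 1.0)
    # Step 3: first position whose cumulative value is >= r.
    i = bisect.bisect_left(cumulative, r)
    return min(i, len(weights) - 1) + 1  # guard against floating-point round-off

d = [2.0, 6.0, 1.2, 5.8, 20.0]
rng = random.Random(0)
draws = [proportional_sample(d, rng) for _ in range(20_000)]
share_of_5 = draws.count(5) / len(draws)  # should be near 20/35
print(share_of_5)
```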