Stats Interview Questions Flashcards

1
Q

What is the Central Limit Theorem?

A
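  • The CLT states that, for a sufficiently large sample size (a common rule of thumb is N > 30), the mean of a sample drawn from a population is approximately normally distributed, whatever the shape of the population distribution (provided its variance is finite)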
  • Central Limit Theorem is widely used in the calculation of confidence intervals and hypothesis testing

Example – We want to calculate the average height of people in the world, and we take some samples from the general population, which serves as the data set. Since it is hard or impossible to obtain data regarding the height of every person in the world, we will simply calculate the mean of our sample.

By repeating this sampling several times, we obtain a collection of sample means whose frequencies we can plot on a graph. The plot forms a bell-shaped (approximately normal) curve centred on the population mean, even if the original data set itself is not normally distributed.
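
A minimal simulation sketch of the idea, using only the Python standard library. The exponential population, the sample size of 50, and the function name `sample_means` are illustrative assumptions, not part of the original card.

```python
import random
import statistics

def sample_means(n_samples, sample_size, seed=0):
    """Draw repeated samples from a skewed Exp(1) population; return their means."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(sample_size)])
        for _ in range(n_samples)
    ]

means = sample_means(n_samples=2000, sample_size=50)

# Exp(1) has mean 1 and std 1, so the CLT suggests the sample means are
# roughly normal with mean 1 and std 1 / sqrt(50) (about 0.14).
print("mean of sample means:", statistics.fmean(means))
print("std of sample means: ", statistics.stdev(means))
```

Although the population is strongly skewed, a histogram of `means` should look close to a bell curve.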

2
Q

What is the assumption of normality?

A

The assumption of normality states that the sampling distribution of the mean is normal. This holds across independent samples as well.

3
Q

Describe Hypothesis Testing. How is the statistical significance of an insight assessed?

A

Hypothesis Testing in statistics is used to see if a certain experiment yields meaningful results, i.e. assess the statistical significance of an insight.

Example:
H0: “assume status quo, i.e. no effect, e.g. a strategy has no positive effect on the PnL”
H1: “observe significant effect, e.g. strategy has a positive effect on the PnL”

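Then we assume H0 and compute the test statistic z = (x - mu)/sigma, followed by the p-value (probability value), i.e. the probability under H0 of a result at least as extreme as the one observed (for the one-sided H1 above, P(Z >= z)). If the p-value < alpha, we reject H0. The rejection of the null hypothesis indicates that the results obtained are statistically significant.

Note, we used:
x = observed sample mean
mu = mean of the sampling distribution under H0
sigma = sampling std (standard error) = true std of the population / sqrt(N) [we use CLT here]
s = sample std = sqrt(Sum (x_i - \bar{x})^2 / (N - 1)) [used in place of the true std when it is unknown]
N = sample size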
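
A rough sketch of a one-sided z-test along these lines, assuming the population std is known. The function name `one_sided_z_test` and the PnL-style numbers are made up for illustration.

```python
import math
from statistics import NormalDist

def one_sided_z_test(sample_mean, mu0, pop_std, n, alpha=0.05):
    """Test H0: mean = mu0 against H1: mean > mu0, with known population std."""
    standard_error = pop_std / math.sqrt(n)      # sigma in the notes above
    z = (sample_mean - mu0) / standard_error     # test statistic
    p_value = 1.0 - NormalDist().cdf(z)          # P(Z >= z) under H0
    return z, p_value, p_value < alpha

# Hypothetical data: average daily PnL of 0.5 over 100 days, population std 2.0.
z, p_value, reject_h0 = one_sided_z_test(sample_mean=0.5, mu0=0.0, pop_std=2.0, n=100)
print(f"z = {z:.2f}, p-value = {p_value:.4f}, reject H0: {reject_h0}")
```

Note that the comparison against alpha uses the p-value, not the z-statistic itself.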

4
Q

Standard normal distribution percentages

A

[mu - 3sigma, mu - 2sigma] = 2.14%
[mu - 2sigma, mu - sigma] = 13.59%
[mu - sigma, mu] = 34.13%
[mu, mu + sigma] = 34.13%
[mu + sigma, mu + 2sigma] = 13.59%
[mu + 2sigma, mu + 3sigma] = 2.14%
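
The band probabilities can be checked numerically from the standard normal CDF. This is a small sketch using `statistics.NormalDist`; the helper name `band_probability` is an assumption for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

def band_probability(lo, hi):
    """Probability that a standard normal variable falls in [lo, hi]."""
    return std_normal.cdf(hi) - std_normal.cdf(lo)

# One-sigma-wide bands from -3 sigma to +3 sigma.
for lo in range(-3, 3):
    share = 100 * band_probability(lo, lo + 1)
    print(f"[{lo:+d} sigma, {lo + 1:+d} sigma]: {share:.2f}%")
```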

5
Q

What are observational and experimental data in statistics?

A

Observational data comes from observational studies, in which variables are recorded as they occur, without any intervention. The variables are observed to determine any correlation between them.

Experimental data comes from experimental studies, in which the experimenter deliberately varies certain variables while keeping the others constant, in order to determine any causal effect.

6
Q

Z-statistics vs. T-statistics

A

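If the population std is known, or s (sample std) is computed from a large sample (N > 30), the statistic (\bar{x} - mu) / (s / sqrt(N)) is treated as a z-statistic and is approximately normally distributed. Otherwise, for small N with the std estimated by s, it is a t-statistic and follows a t-distribution with N - 1 degrees of freedom (assuming the data are normal).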

7
Q

What is an outlier?

A

Outliers are data points within a data set that differ markedly from the other observations. Depending on its cause, an outlier can decrease the accuracy as well as the efficiency of a model. It is therefore important to detect them and, when they stem from errors, remove them from the data set.

8
Q

How to screen for outliers in a data set?

A

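(i) Z-score: flag every point whose z-score has absolute value larger than 3, i.e. that lies more than 3 standard deviations from the mean in either direction

(ii) Interquartile range (IQR), aka midspread: the range covered by the middle 50% of a data set. Formula: IQR = Q3 - Q1. Points below Q1 - 1.5 IQR or above Q3 + 1.5 IQR are commonly flagged as outliers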
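
Both screens can be sketched in a few lines with the standard library. The function names, the 1.5 multiplier default, and the toy data below are illustrative assumptions; quartile conventions differ between tools, so the fences may shift slightly elsewhere.

```python
import statistics

def zscore_outliers(data, threshold=3.0):
    """Return points lying more than `threshold` sample stds from the mean."""
    mean = statistics.fmean(data)
    std = statistics.stdev(data)
    return [x for x in data if abs(x - mean) / std > threshold]

def iqr_outliers(data, k=1.5):
    """Return points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _median, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < lower or x > upper]

# Hypothetical measurements with one obviously suspicious value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 95]
print("z-score screen:", zscore_outliers(data))
print("IQR screen:    ", iqr_outliers(data))
```

The IQR screen is generally the more robust of the two, because an extreme value inflates the mean and std that the z-score itself relies on.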

9
Q

What is the meaning of an inlier?

A

An inlier is a data point that lies within the general range of the other observations in a data set but is nevertheless erroneous. It is usually removed to improve the model accuracy. Unlike outliers, inliers are hard to find and often require external data for accurate identification.

10
Q

What is the meaning of KPI in statistics?

A

KPI is an acronym for a key performance indicator. It can be defined as a quantifiable measure to understand whether the goal is being achieved or not.

11
Q

What is the Pareto principle?

A

Also known as the 80/20 rule, the Pareto principle states that 80% of the effects or results in an experiment are obtained from 20% of the causes.

Example: 80% of sales come from 20% of customers.
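
A small sketch of how one might check the rule on data: compute the share of the total contributed by the top 20% of items. The function name `top_share` and the sales figures are invented for illustration.

```python
def top_share(values, top_fraction=0.2):
    """Share of the total contributed by the largest `top_fraction` of values."""
    ordered = sorted(values, reverse=True)
    k = max(1, round(len(ordered) * top_fraction))
    return sum(ordered[:k]) / sum(ordered)

# Hypothetical sales per customer (10 customers).
sales = [500, 300, 40, 35, 30, 25, 25, 20, 15, 10]
share = top_share(sales)
print(f"Top 20% of customers account for {share:.0%} of sales")
```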

12
Q

What is the Law of Large Numbers in statistics?

A

According to the law of large numbers, as the number of trials in an experiment increases, the average of the results converges to the expected value.
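
As an illustrative sketch, the running average of fair die rolls should drift towards the expected value 3.5 as the number of rolls grows. The function name and the seed are arbitrary choices.

```python
import random

def running_average_of_die_rolls(n_rolls, seed=0):
    """Return the running average after each of n_rolls fair die rolls."""
    rng = random.Random(seed)
    total = 0
    averages = []
    for i in range(1, n_rolls + 1):
        total += rng.randint(1, 6)   # fair six-sided die
        averages.append(total / i)
    return averages

averages = running_average_of_die_rolls(100_000)
for n in (10, 1_000, 100_000):
    print(f"after {n:>7} rolls: average = {averages[n - 1]:.4f}")
```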

13
Q

How would you describe a ‘p-value’?

A

The p-value is calculated during hypothesis testing. It is the probability, assuming the null hypothesis is true, of observing results at least as extreme as those actually obtained. If the p-value is below alpha (e.g. 0.05), we reject the null hypothesis. A p-value of 0.05 means that, if there were truly no effect, results this extreme would occur by chance about 5% of the time; it is not the probability that the null hypothesis is true.

14
Q

What are some of the properties of a normal distribution?

A
  • Symmetrical – The curve is symmetric about the mean
  • Unimodal – Has only one mode
  • Parameters – Fully determined by the mean (location) and the standard deviation (spread)
  • Central tendency – the mean, median, and mode lie at the centre, which means that they are all equal, and the curve is perfectly symmetrical at the midpoint
15
Q

What are the types of biases that you can encounter while sampling?

A

Sampling bias occurs when the sample does not fairly represent the population under investigation or survey. Six types commonly listed are:
(i) Undercoverage bias
(ii) Observer Bias
(iii) Survivorship bias
(iv) Self-Selection/Voluntary Response Bias
(v) Recall Bias
(vi) Exclusion Bias

16
Q

Compute the confidence intervals for means

A

We’re interested in the interval estimation of mu.

Collect data:
Y1, …, Yn iid ~ N(mu, sigma^2),
where
mu is unknown
sigma^2 is known

What values of mu are believable given Y1, …, Yn?

MLE estimator: \bar{Y} = 1/n Sum Yi

The CLT tells us:
(\bar{Y} - mu) / (sigma / sqrt(n)) ~ N(0,1)

=> P(-z_{alpha/2} <= (\bar{Y} - mu) / (sigma / sqrt(n)) <= z_{alpha/2}) = 1 - alpha (e.g. 0.95)
<=> …
<=> P(\bar{Y} - z_{alpha/2} sigma / sqrt(n) <= mu <= \bar{Y} + z_{alpha/2} sigma / sqrt(n)) = 1 - alpha

Thus the random interval \bar{Y} +- z_{alpha/2} sigma / sqrt(n) is a 1 - alpha (e.g. 95%) confidence interval (CI) for mu.
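
A minimal numerical sketch of this interval, assuming sigma is known. The function name `z_confidence_interval` and the sample values are hypothetical.

```python
import math
from statistics import NormalDist, fmean

def z_confidence_interval(sample, sigma, alpha=0.05):
    """(1 - alpha) CI for the mean of normal data with known std `sigma`."""
    n = len(sample)
    y_bar = fmean(sample)
    z = NormalDist().inv_cdf(1 - alpha / 2)      # z_{alpha/2}
    half_width = z * sigma / math.sqrt(n)
    return y_bar - half_width, y_bar + half_width

# Hypothetical measurements with known sigma = 0.3.
sample = [4.9, 5.1, 5.3, 4.7, 5.0, 5.2, 4.8, 5.0, 5.1]
ci_low, ci_high = z_confidence_interval(sample, sigma=0.3)
print(f"95% CI for mu: [{ci_low:.3f}, {ci_high:.3f}]")
```

The half-width shrinks like 1 / sqrt(n), so quadrupling the sample size halves the interval.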

17
Q

How do you test for normality?

A

You could draw a box plot and check if it’s symmetric. Another method is to do a QQ plot (plot sorted sample quantiles vs true normal quantiles) and check if the points fall on a straight line (the 45 degree line if the sample is standardised first). One more possibility is to look at the corresponding histogram plot.
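
The QQ-plot idea can be sketched without a plotting library by computing the QQ points and measuring how close they are to a straight line via their correlation. This is an informal diagnostic, not a formal normality test; the helper names and the plotting positions (i + 0.5) / n are assumptions.

```python
import random
from statistics import NormalDist, fmean

def qq_points(sample):
    """Pair theoretical standard-normal quantiles with the sorted sample."""
    n = len(sample)
    theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return theoretical, sorted(sample)

def pearson_r(xs, ys):
    """Pearson correlation; values near 1 mean the QQ points are nearly linear."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)
normal_sample = [rng.gauss(0.0, 1.0) for _ in range(500)]
skewed_sample = [rng.expovariate(1.0) for _ in range(500)]

r_normal = pearson_r(*qq_points(normal_sample))
r_skewed = pearson_r(*qq_points(skewed_sample))
print(f"QQ correlation, normal sample: {r_normal:.4f}")
print(f"QQ correlation, skewed sample: {r_skewed:.4f}")
```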

18
Q

Regression Analysis – Linear Model Assumptions

A

(i) Linear relationship
[There exists a linear relationship between the independent variable, x, and the dependent variable, y]

(ii) Independence
[The residuals are independent. In particular, there is no correlation between consecutive residuals in time series data]

(iii) Homoscedasticity
[The residuals have constant variance at every level of x]

(iv) Normality
[The residuals of the model are normally distributed]
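
A rough sketch of how the residual-based assumptions might be eyeballed after fitting a simple least-squares line. The synthetic data, the helper names, and the two crude diagnostics (lag-1 autocorrelation for independence, a variance ratio between the two halves of x for homoscedasticity) are illustrative assumptions, not formal tests.

```python
import random
from statistics import fmean, pvariance

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = fmean(xs), fmean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx

def lag1_autocorrelation(values):
    """Correlation between consecutive values; near 0 suggests independence."""
    m = fmean(values)
    num = sum((a - m) * (b - m) for a, b in zip(values, values[1:]))
    return num / sum((v - m) ** 2 for v in values)

# Synthetic data that satisfies the assumptions: y = 2x + 1 + Gaussian noise.
rng = random.Random(0)
xs = [i / 10 for i in range(200)]
ys = [2.0 * x + 1.0 + rng.gauss(0.0, 0.5) for x in xs]

slope, intercept = fit_line(xs, ys)
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]

half = len(residuals) // 2
variance_ratio = pvariance(residuals[:half]) / pvariance(residuals[half:])
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
print(f"lag-1 autocorrelation of residuals = {lag1_autocorrelation(residuals):.3f}")
print(f"residual variance ratio (low x / high x) = {variance_ratio:.3f}")
```

Normality of the residuals can then be checked with the same tools used for any sample, such as a QQ plot or histogram.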