Stats Interview Questions Flashcards
What is the Central Limit Theorem?
- The CLT states that, for a sufficiently large sample size (rule of thumb: N > 30), the distribution of the sample mean is approximately normal, regardless of the shape of the population distribution (provided the population has finite variance)
- The Central Limit Theorem is widely used in the calculation of confidence intervals and in hypothesis testing
Example – We want to estimate the average height of people in the world. Since it is hard or impossible to obtain the height of every person, we take a random sample from the general population and calculate its mean.
By repeating this several times, we obtain many sample means, whose frequencies we can plot on a graph. The plot will form a bell-shaped curve (a normal distribution) centred on the true population mean, even if the heights themselves are not normally distributed.
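The repeated-sampling idea can be sketched with a small simulation. This is only an illustration under stated assumptions: instead of heights it uses a hypothetical exponential population with mean 1.0 (clearly non-normal), and the names draw_sample, N and num_samples are introduced here.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

def draw_sample(n):
    """Draw n observations from a hypothetical exponential population (mean 1.0, std 1.0)."""
    return [random.expovariate(1.0) for _ in range(n)]

N = 50              # observations per sample
num_samples = 2000  # how many times we repeat the sampling

# One mean per sample: this list approximates the sampling distribution of the mean
sample_means = [statistics.fmean(draw_sample(N)) for _ in range(num_samples)]

mean_of_means = statistics.fmean(sample_means)
std_of_means = statistics.stdev(sample_means)

# CLT prediction: centred on the population mean, with std = population std / sqrt(N)
predicted_std = 1.0 / N ** 0.5

print(f"mean of sample means: {mean_of_means:.3f} (population mean 1.0)")
print(f"std of sample means:  {std_of_means:.3f} (CLT prediction {predicted_std:.3f})")
```

A histogram of sample_means would look roughly bell-shaped even though the underlying population is heavily skewed.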
What is the assumption of normality?
The assumption of normality states that the sampling distribution of the mean is normal. Many parametric tests (e.g. z-tests and t-tests) rely on it. When comparing independent samples, it is assumed to hold for each of them.
Describe Hypothesis Testing. How is the statistical significance of an insight assessed?
Hypothesis Testing in statistics is used to see if a certain experiment yields meaningful results, i.e. assess the statistical significance of an insight.
Example:
H0: “assume status quo, i.e. no effect, e.g. a strategy has no positive effect on the PnL”
H1: “observe significant effect, e.g. strategy has a positive effect on the PnL”
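Then we assume H0 and compute the test statistic (z-score) z = (x - mu)/sigma. The p-value (probability value) is the probability, under H0, of observing a statistic at least as extreme as z. If the p-value < alpha (the significance level, e.g. 0.05), we reject H0. The rejection of the null hypothesis indicates that the results obtained are statistically significant.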
Note, we used:
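x = observed sample mean (written \bar{x} below)
mu = mean of the sampling distribution under H0
sigma = std of the sampling distribution (standard error) = true std of the population / sqrt(N) [we use CLT here]
s = sample std = sqrt(Sum (x_i - \bar{x})^2 / (N - 1)), used to estimate the population std when it is unknown
N = number of observations in the sample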
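A worked one-sided z-test may make the steps concrete. This is a minimal sketch, not a definitive procedure: every number below is hypothetical, and the population std is assumed to be known.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical inputs, chosen only for illustration
mu_0 = 0.0            # mean daily PnL under H0 (strategy has no effect)
population_std = 2.0  # assumed known population std
N = 100               # number of observations in the sample
x_bar = 0.5           # observed sample mean
alpha = 0.05          # significance level

sigma = population_std / sqrt(N)  # standard error of the mean
z = (x_bar - mu_0) / sigma        # test statistic (z-score)

# One-sided test (H1: positive effect): P(Z >= z) under H0
p_value = 1.0 - NormalDist().cdf(z)

reject_h0 = p_value < alpha
print(f"z = {z:.2f}, p-value = {p_value:.4f}, reject H0: {reject_h0}")
```

With these made-up inputs the z-score is 2.5, the p-value is well below alpha, and H0 is rejected. For a two-sided H1 the p-value would be doubled.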
Standard normal distribution percentages
[mu - 3sigma, mu - 2sigma] = 2.1%
[mu - 2sigma, mu - sigma] = 13.5%
[mu - sigma, mu] = 34%
[mu, mu + sigma] = 34%
[mu + sigma, mu + 2sigma] = 13.5%
[mu + 2sigma, mu + 3sigma] = 2.1%
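These band probabilities do not have to be memorised; they follow from the normal CDF. A minimal sketch using the standard-library statistics.NormalDist (the helper name band_probability is introduced here for illustration):

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

def band_probability(lo, hi):
    """P(lo <= Z <= hi) for a standard normal Z, with lo and hi in units of sigma."""
    return std_normal.cdf(hi) - std_normal.cdf(lo)

# Bands on one side of the mean; the other side is identical by symmetry
for lo, hi in [(0, 1), (1, 2), (2, 3)]:
    print(f"[mu + {lo}sigma, mu + {hi}sigma]: {band_probability(lo, hi):.2%}")

# Cumulative coverage around the mean (the 68-95-99.7 rule)
within_1 = band_probability(-1, 1)
within_2 = band_probability(-2, 2)
within_3 = band_probability(-3, 3)
```

The bands come out at roughly 34.13%, 13.59% and 2.14%; tables of these percentages are usually rounded.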
What are observational and experimental data in statistics?
Observational data comes from observational studies, where variables are recorded as they occur, without intervention. It can reveal correlations between variables but cannot, on its own, establish causality.
Experimental data comes from experimental studies, where the researcher deliberately changes some variables while keeping others constant in order to determine any discrepancy or causality.
Z-statistics vs. T-statistics
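If the population std is known, or s (sample std) is computed from N > 30 observations, use the z-statistic, which follows the standard normal distribution. Otherwise (small N, unknown population std) use the t-statistic, which follows a t-distribution with N - 1 degrees of freedom.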
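For a small sample, the t-statistic is computed like a z-score but with the sample std s in place of the population std. A minimal sketch with hypothetical data and a hypothetical null-hypothesis mean mu_0:

```python
from math import sqrt
from statistics import fmean, stdev

# Hypothetical small sample (N < 30) and null-hypothesis mean
sample = [5.1, 4.9, 5.6, 5.2, 4.8, 5.4, 5.0, 5.3]
mu_0 = 5.0

N = len(sample)
x_bar = fmean(sample)
s = stdev(sample)            # sample std, uses the (N - 1) denominator
standard_error = s / sqrt(N)

t_stat = (x_bar - mu_0) / standard_error
degrees_of_freedom = N - 1

print(f"t = {t_stat:.3f} with {degrees_of_freedom} degrees of freedom")
```

Turning t_stat into a p-value requires the CDF of the t-distribution with N - 1 degrees of freedom, which the Python standard library does not provide; a statistics package would be needed for that step.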
What is an outlier?
An outlier is a data point that differs greatly from the other observations in a data set. Depending on its cause, an outlier can decrease the accuracy as well as the efficiency of a model. Therefore, it is crucial to detect outliers and, when they are errors rather than genuine extreme values, remove them from the data set.
How to screen for outliers in a data set?
(i) Z-score: flag every point whose z-score is larger than 3 in absolute value, i.e. that lies more than 3 std from the mean in either direction
(ii) Interquartile range (IQR), aka midspread, is the range spanned by the middle 50% of a data set. Formula: IQR = Q3 - Q1. Points below Q1 - 1.5*IQR or above Q3 + 1.5*IQR are commonly treated as outliers
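Both screens can be sketched in a few lines of standard-library Python. The data set and the function names zscore_outliers and iqr_outliers are hypothetical; the 1.5*IQR fences are the common convention, not the only possible choice.

```python
from statistics import fmean, pstdev, quantiles

# Hypothetical data with one planted outlier (95)
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 10, 11, 13, 12, 95]

def zscore_outliers(values, threshold=3.0):
    """Return values lying more than `threshold` std from the mean."""
    mean = fmean(values)
    std = pstdev(values)
    return [v for v in values if abs(v - mean) / std > threshold]

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

print("z-score screen:", zscore_outliers(data))
print("IQR screen:    ", iqr_outliers(data))
```

Note that an extreme value inflates the mean and std used by the z-score screen, so in small data sets it can mask itself; the IQR screen is based on quartiles and is less affected.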
What is the meaning of an inlier?
An inlier is a data point that lies within the general range of the other observations in a data set. It is usually an error and is removed to improve the model accuracy. Unlike outliers, inliers are hard to find and often require external data for accurate identification.
What is the meaning of KPI in statistics?
KPI is an acronym for a key performance indicator. It can be defined as a quantifiable measure to understand whether the goal is being achieved or not.
What is the Pareto principle?
Also known as the 80/20 rule, the Pareto principle states that 80% of the effects or results in an experiment are obtained from 20% of the causes.
Example: 80% of sales come from 20% of customers.
What is the Law of Large Numbers in statistics?
According to the law of large numbers, as the number of trials in an experiment increases, the average of the results gets closer to the expected value.
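A quick simulation illustrates this. The sketch below rolls a fair six-sided die (expected value 3.5); the function name mean_of_die_rolls is introduced here for illustration.

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

def mean_of_die_rolls(num_rolls):
    """Average result of `num_rolls` rolls of a fair six-sided die."""
    total = 0
    for _ in range(num_rolls):
        total += random.randint(1, 6)
    return total / num_rolls

expected_value = 3.5
for n in (10, 1_000, 100_000):
    avg = mean_of_die_rolls(n)
    print(f"{n:>7} rolls: average = {avg:.4f}, error = {abs(avg - expected_value):.4f}")
```

The error typically shrinks as the number of rolls grows, although for any single run it need not decrease at every step.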
How would you describe a ‘p-value’?
The p-value is calculated during hypothesis testing. It is the probability, assuming the null hypothesis is true, of observing results at least as extreme as those actually observed. For example, a p-value of 0.05 means that, if the null hypothesis were true, results at least this extreme would occur by chance 5% of the time. If the p-value is less than alpha, we reject the null hypothesis. Note that the p-value is not the probability that the null hypothesis is true.
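The "by chance under H0" reading can be checked directly on a toy case. The sketch below assumes a hypothetical experiment: 60 heads in 100 coin flips, with H0 being a fair coin and a one-sided alternative. It computes the p-value exactly and then approximates it by simulating the null hypothesis; both function names are introduced here.

```python
import random
from math import comb

n_flips = 100
observed_heads = 60  # hypothetical observation

def exact_p_value(n, k):
    """P(X >= k) for X ~ Binomial(n, 0.5), i.e. under H0 of a fair coin."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

def simulated_p_value(n, k, trials=5000, seed=0):
    """Fraction of simulated fair-coin experiments with at least k heads."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        heads = sum(rng.random() < 0.5 for _ in range(n))
        if heads >= k:
            hits += 1
    return hits / trials

p_exact = exact_p_value(n_flips, observed_heads)
p_sim = simulated_p_value(n_flips, observed_heads)
print(f"exact p-value: {p_exact:.4f}, simulated: {p_sim:.4f}")
```

The exact value is about 0.028: a fair coin would produce 60 or more heads in roughly 2.8% of such experiments, so at alpha = 0.05 the null hypothesis would be rejected.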
What are some of the properties of a normal distribution?
- Symmetrical – The curve is symmetric about the mean; the parameters (mu, sigma) change only its location and spread
- Unimodal – Has only one mode
- Mean – the measure of central tendency
- Central tendency – the mean, median, and mode lie at the centre, which means that they are all equal, and the curve is perfectly symmetrical at the midpoint
What are the types of biases that you can encounter while sampling?
Sampling bias occurs when the sample does not fairly represent the population during an investigation or a survey. Six common types are:
(i) Undercoverage bias
(ii) Observer Bias
(iii) Survivorship bias
(iv) Self-Selection/Voluntary Response Bias
(v) Recall Bias
(vi) Exclusion Bias