Questions Flashcards
What are the formulas for sensitivity and specificity?
Sensitivity (true positive rate, or recall, in binary classification): TP / (TP + FN)
Specificity (true negative rate): TN / (TN + FP)
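A minimal sketch of the two formulas in Python. The confusion-matrix counts are hypothetical and chosen only for illustration.

```python
def sensitivity(tp: int, fn: int) -> float:
    """True positive rate (recall): TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """True negative rate: TN / (TN + FP)."""
    return tn / (tn + fp)


# Hypothetical counts, for illustration only.
tp, fn, tn, fp = 40, 10, 45, 5

sens = sensitivity(tp, fn)  # 40 / (40 + 10)
spec = specificity(tn, fp)  # 45 / (45 + 5)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```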
What are parametric and non-parametric statistical tests?
How are confidence interval and confidence level related?
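Using the CLT, given a sample's mean and its standard deviation (or the population standard deviation), we can use the z score that corresponds to a chosen confidence level to calculate the confidence interval. A higher confidence level means a larger z score and therefore a wider interval.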
Example: Website
Suppose we have a sample of size 100 with:
sample mean: 20
population STD: 10
The sample mean approximates the population mean, and the standard error of the mean (the std of the sampling distribution of the mean) is population std / √n = 10 / √100 = 1.
By the CLT, the sampling distribution of the mean is approximately Gaussian, so ±1.96 standard errors covers 95% of it (that is why the z score for a 95% confidence level is 1.96). So we can say we are 95% confident that the population mean lies within sample mean ± 1.96 × standard error = 20 ± 1.96, i.e. between 18.04 and 21.96.
The premise is that we can estimate the parameters of the sampling distribution of the mean from a single sample, and then use that distribution to calculate the confidence interval.
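The example above can be reproduced with a short sketch. It uses only the Python standard library; the variable names are illustrative, and the numbers are the ones from the example.

```python
from math import sqrt
from statistics import NormalDist

n = 100            # sample size (from the example)
sample_mean = 20   # sample mean (from the example)
pop_std = 10       # population standard deviation (from the example)
confidence = 0.95  # confidence level

# Standard error: the std of the sampling distribution of the mean.
standard_error = pop_std / sqrt(n)

# Two-sided critical value: the z score leaving (1 - confidence) / 2 in each tail.
z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

lower = sample_mean - z * standard_error
upper = sample_mean + z * standard_error
print(f"z={z:.2f}, CI=({lower:.2f}, {upper:.2f})")  # z=1.96, CI=(18.04, 21.96)
```

Changing `confidence` to 0.99 gives a larger z score and a wider interval.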
What’s the difference between using CLT and bootstrapping for sampling?
Given a large enough sample size, confidence intervals for the mean can be constructed either by applying the Central Limit Theorem or by the bootstrap method. Bootstrap-estimated distributions of test statistics are not always Gaussian; the appeal of the bootstrap is that you need not assume a shape for that distribution, an assumption that can often be wrong.
Bootstrapping is done by sampling with replacement from the original sample, whereas the CLT comes with assumptions, such as that the samples are independent of each other.
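As a rough comparison, the sketch below builds a 95% interval for the mean both ways on the same data. The skewed sample, the seed, and the number of resamples are all assumptions made for illustration, and the percentile method shown is only one of several bootstrap interval variants.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical right-skewed sample (exponential draws with mean 20).
sample = [rng.expovariate(1 / 20) for _ in range(200)]
n = len(sample)

# CLT-based interval: sample mean +/- z * (sample std / sqrt(n)).
z = NormalDist().inv_cdf(0.975)
half_width = z * stdev(sample) / sqrt(n)
clt_ci = (mean(sample) - half_width, mean(sample) + half_width)

# Percentile bootstrap interval: resample with replacement, collect the means,
# then read off the 2.5th and 97.5th percentiles of those means.
n_resamples = 2000
boot_means = sorted(mean(rng.choices(sample, k=n)) for _ in range(n_resamples))
boot_ci = (
    boot_means[int(0.025 * n_resamples)],
    boot_means[int(0.975 * n_resamples) - 1],
)

print("CLT:      ", clt_ci)
print("bootstrap:", boot_ci)
```

With a sample this large the two intervals should land close to each other; the bootstrap interval does not have to be symmetric around the sample mean.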
What is average precision?
Average Precision (AP) is generally defined as the area under the precision-recall curve.
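One common way to compute that area is a step-wise sum over the ranked predictions: AP = Σ (recall_k − recall_{k−1}) × precision_k. The sketch below assumes this step-wise definition (other interpolation schemes exist), ignores tied scores, and uses made-up labels and scores.

```python
def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve.

    Items are ranked by descending score; at each rank k the increase
    in recall is weighted by the precision at that rank.
    """
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(y_true)
    true_positives = 0
    previous_recall = 0.0
    ap = 0.0
    for k, (_, label) in enumerate(ranked, start=1):
        true_positives += label
        precision = true_positives / k
        recall = true_positives / total_positives
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap


# Hypothetical binary labels and model scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

ap = average_precision(y_true, scores)
print(f"AP={ap:.4f}")  # AP=0.8333
```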
How is balanced accuracy calculated? When is it used?
Balanced accuracy is used in both binary and multi-class classification. In the binary case it is the arithmetic mean of sensitivity and specificity, (sensitivity + specificity) / 2; in the multi-class case it is the mean of the per-class recall. It is useful when dealing with imbalanced data, i.e. when one of the target classes appears much more often than the others.
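A small sketch of why this matters on imbalanced data. The labels are hypothetical: 90 negatives, 10 positives, and a model that always predicts the majority class.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall.

    For two classes this equals (sensitivity + specificity) / 2.
    """
    recalls = []
    for c in sorted(set(y_true)):
        support = sum(1 for t in y_true if t == c)
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(correct / support)
    return sum(recalls) / len(recalls)


# Hypothetical imbalanced data; the model predicts "negative" for everything.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bal_acc = balanced_accuracy(y_true, y_pred)  # (1.0 + 0.0) / 2

print(f"accuracy={accuracy:.2f}, balanced accuracy={bal_acc:.2f}")
```

Plain accuracy looks good here (0.90) even though the model never finds a positive; balanced accuracy drops to 0.50.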
What are the assumptions of CLT?
- The data must satisfy the randomization condition: it must be sampled randomly
- Samples should be independent of each other: one sample should not influence the others
- The sample size should be no more than 10% of the population when sampling is done without replacement
- The sample size should be sufficiently large. How large depends on the population: when the population is skewed or asymmetric, the sample size should be large; if the population is symmetric, small samples can work as well
- In general, a sample size of 30 is considered sufficient when the population is symmetric.
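To see the theorem at work on a skewed population, the simulation below draws many random, independent samples and looks at the distribution of their means. The exponential population, the sample size of 50, and the seed are assumptions chosen for illustration.

```python
import random
from math import sqrt
from statistics import mean, pstdev

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Skewed population: exponential with rate 1, so mean = 1 and std = 1.
pop_mean, pop_std = 1.0, 1.0
n = 50           # size of each sample
n_samples = 5000 # number of independent samples

sample_means = [
    mean(rng.expovariate(1.0) for _ in range(n)) for _ in range(n_samples)
]

# The CLT predicts: centre near pop_mean, spread near pop_std / sqrt(n).
centre = mean(sample_means)
spread = pstdev(sample_means)
se = pop_std / sqrt(n)

# Share of sample means within 1.96 standard errors of the population mean.
coverage = sum(abs(m - pop_mean) <= 1.96 * se for m in sample_means) / n_samples

print(f"centre={centre:.3f}, spread={spread:.3f}, predicted spread={se:.3f}")
print(f"share within 1.96 SE: {coverage:.3f}")
```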
What does the mean of a bootstrapped sample approximate?
The mean of the bootstrapped samples approximates the mean of the original sample.
I.e. the distribution of the means of the samples obtained by bootstrapping is centered on the mean of the original sample.
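A quick check of this claim, using a hypothetical Gaussian sample and an arbitrary seed:

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical original sample.
sample = [rng.gauss(50, 10) for _ in range(100)]

# Each bootstrap sample is drawn from the original sample with replacement.
boot_means = [mean(rng.choices(sample, k=len(sample))) for _ in range(2000)]

# Distance between the centre of the bootstrap means and the original sample mean.
gap = abs(mean(boot_means) - mean(sample))
print(f"original mean={mean(sample):.3f}, mean of bootstrap means={mean(boot_means):.3f}")
```

Note that the bootstrap means centre on the original sample's mean, not on the (unknown) population mean.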
Micro-precision values can be high even if the model is performing very poorly on a rare class since it gives more weight to the common classes. True/False?
True
For single-label multi-class problems, micro-averaging would result in precision being exactly the same as accuracy. That does not provide any additional information about the model’s performance. True/False
True
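Both statements about micro-averaging can be checked on a toy single-label problem. The labels below are hypothetical: class "b" is rare and the model never predicts it. Treating the precision of a never-predicted class as 0.0 is a convention assumed here, not something fixed by the definition.

```python
# Hypothetical data: "b" is the rare class, and the model always predicts "a".
y_true = ["a"] * 95 + ["b"] * 5
y_pred = ["a"] * 100
classes = ["a", "b"]

pairs = list(zip(y_true, y_pred))
tp = {c: sum(t == c and p == c for t, p in pairs) for c in classes}
fp = {c: sum(t != c and p == c for t, p in pairs) for c in classes}

# Micro-averaging pools TP and FP over all classes before dividing.
micro_precision = sum(tp.values()) / (sum(tp.values()) + sum(fp.values()))

# Macro-averaging computes precision per class, then takes the unweighted mean.
# A class that is never predicted gets 0.0 here (assumed convention).
per_class = {
    c: tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0 for c in classes
}
macro_precision = sum(per_class.values()) / len(classes)

accuracy = sum(t == p for t, p in pairs) / len(pairs)

print(f"micro={micro_precision:.3f}, macro={macro_precision:.3f}, accuracy={accuracy:.3f}")
```

Micro-precision equals accuracy (0.95) and stays high even though the rare class is never found, while the macro average (0.475) exposes the problem.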
What is the definition of P-value?
In null-hypothesis significance testing, the p-value is the probability of obtaining test results at least as extreme as the result actually observed, under the assumption that the null hypothesis is correct. A very small p-value means that such an extreme observed outcome would be very unlikely under the null hypothesis.
How is P-value calculated?
Assuming the null hypothesis is true, the sample mean follows an approximately normal distribution (the sampling distribution of the mean) whose mean equals the null hypothesis's mean and whose std is the standard error, sample std / √(number of samples). We then calculate the z score as:
z = (sample mean - null mean) / (sample std / √number of samples)
From the z score we find the p-value, which is the probability of observing a sample at least as extreme as ours, assuming the null hypothesis is true. If it is less than the significance level alpha (commonly 0.05), a result this extreme would occur less than 5% of the time under the null hypothesis, so we reject the null hypothesis.
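The calculation can be sketched with the standard library's normal distribution. All numbers are hypothetical, and the sketch assumes a sample large enough for the normal approximation to hold (for small samples with an estimated std, a t distribution would normally be used instead).

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical values, for illustration only.
null_mean = 20.0    # mean claimed by the null hypothesis
sample_mean = 22.5  # mean of the observed sample
sample_std = 10.0   # std of the observed sample
n = 100             # number of observations
alpha = 0.05        # significance level

# z score: how many standard errors the sample mean lies from the null mean.
z = (sample_mean - null_mean) / (sample_std / sqrt(n))

standard_normal = NormalDist()
p_one_sided = 1 - standard_normal.cdf(z)             # H1: mean > null_mean
p_two_sided = 2 * (1 - standard_normal.cdf(abs(z)))  # H1: mean != null_mean

reject_null = p_two_sided < alpha
print(f"z={z:.2f}, one-sided p={p_one_sided:.4f}, two-sided p={p_two_sided:.4f}")
print("reject null" if reject_null else "fail to reject null")
```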
Can we use a sample’s STD as an approximation of the population STD?
Yes. The standard deviation measures the spread of the data: it is roughly the typical distance of the data points from the mean. We are rarely interested in the amount of variation in the sample itself; the sample standard deviation is mainly useful as an approximation of the population standard deviation.
When do we use a t-distribution table?
Use a t-distribution table in either of these cases:
The population standard deviation is UNKNOWN and the original population is normal or symmetrical
OR
The sample size is greater than or equal to 30 and the population standard deviation is UNKNOWN.
A statistic is an unbiased estimator of a parameter when the ____ of its sampling distribution is equal to the actual value of the parameter.
Mean. In other words, a statistic is unbiased when, on average, it equals the value of the population parameter it is estimating.
So if, for example, the Q1 of the population is 70, then the mean of the sampling distribution of the sample Q1 should equal 70 if the statistic is unbiased.
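The same idea can be checked by simulation. The sketch below swaps in a different statistic from the Q1 example: it compares two estimators of the population variance, one dividing by n - 1 (unbiased) and one dividing by n (biased). The population, sample size, and seed are assumptions for illustration.

```python
import random

rng = random.Random(7)  # fixed seed so the sketch is repeatable

# Hypothetical population: Gaussian with std 2, so the true variance is 4.0.
pop_mean, pop_std = 0.0, 2.0
n, trials = 5, 20000

unbiased_total = 0.0
biased_total = 0.0
for _ in range(trials):
    sample = [rng.gauss(pop_mean, pop_std) for _ in range(n)]
    m = sum(sample) / n
    sum_sq = sum((x - m) ** 2 for x in sample)
    unbiased_total += sum_sq / (n - 1)  # divides by n - 1
    biased_total += sum_sq / n          # divides by n

# Means of the two sampling distributions, estimated over all trials.
mean_unbiased = unbiased_total / trials  # expected to be near 4.0
mean_biased = biased_total / trials      # expected to be near 4.0 * (n - 1) / n = 3.2

print(f"n-1 estimator averages {mean_unbiased:.2f}; n estimator averages {mean_biased:.2f}")
```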