Statistics Flashcards
What is a statistic?
What is a statistical estimate? Give an example.
Statistic:
• A value calculated from data which can be used to describe that data
• It can also be used to make inferences about the distribution the data came from
Statistical Estimate:
• An estimate of a population value (parameter), calculated from a sample of data
• E.g. the average height in the UK can be estimated as the average height of Kubrick employees
What is the law of large numbers?
As the size of a dataset increases, the estimates of parameters calculated from it converge on the real values of those parameters.
• This is to say that more evidence is better than less evidence
• This applies to the mean, standard deviation, and many other statistical values
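A minimal simulation sketch of this, assuming rolls of a fair six-sided die as the data (the die, the seed, and the sample sizes are illustrative choices, not part of the notes). The real mean is 3.5, and the sample mean should sit closer to it as the dataset grows:

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable (arbitrary choice)

TRUE_MEAN = 3.5  # expected value of a fair six-sided die


def sample_mean_of_rolls(n):
    """Roll a fair die n times and return the average roll."""
    return statistics.fmean(random.randint(1, 6) for _ in range(n))


# Absolute error of the estimated mean for increasingly large datasets
errors = {n: abs(sample_mean_of_rolls(n) - TRUE_MEAN) for n in (10, 1_000, 100_000)}

for n, err in errors.items():
    print(f"n={n:>7}: |sample mean - real mean| = {err:.4f}")
```

Any single small sample can land close to 3.5 by luck, so the point is the typical size of the error, not a guaranteed decrease at every step.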
Explain correlations, reliability and validity.
Correlations
Many variables are correlated, meaning they tend to vary together. An example might be height and weight, or the heights of fathers and sons.
• Correlation does NOT imply causation. E.g. you going to bed doesn’t cause the sun to set.
Reliability
This is the degree to which a measure gives the same values across observations.
• E.g. An IQ test today and tomorrow should give the same results, and if it doesn’t, it’s not a reliable test.
Validity
This is the degree to which a variable measures what it is supposed to.
• E.g. An IQ test should tally with other indicators of academic performance.
What is a normal (Gaussian) Distribution? Give an example.
This is a very common type of distribution which is bell-shaped and symmetric about the mean.
- About 68% of values are within one standard deviation of the mean.
- About 95% of values are within two standard deviations of the mean.
- About 99.7% of values are within three standard deviations of the mean.
An example might be the heights of females, with µ = 162cm and σ = 7.5cm, and hence:
• 68% of the female population have heights between 154.5cm and 169.5cm.
• 95% of the female population have heights between 147cm and 177cm.
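These percentages can be checked with Python's standard-library `statistics.NormalDist`. This is a small sketch using the µ and σ from the heights example; the helper name `prob_within` is made up for illustration:

```python
from statistics import NormalDist

heights = NormalDist(mu=162, sigma=7.5)  # heights example, in cm


def prob_within(dist, k):
    """Probability of lying within k standard deviations of the mean."""
    upper = dist.cdf(dist.mean + k * dist.stdev)
    lower = dist.cdf(dist.mean - k * dist.stdev)
    return upper - lower


for k in (1, 2, 3):
    low = heights.mean - k * heights.stdev
    high = heights.mean + k * heights.stdev
    print(f"{low:.1f}cm to {high:.1f}cm: {prob_within(heights, k):.1%}")
```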
What are the two main properties of normally distributed variables?
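The sum of two independent normally distributed random variables is also normally distributed: if X ~ N(µ1, σ1²) and Y ~ N(µ2, σ2²), then X + Y ~ N(µ1 + µ2, σ1² + σ2²).
Multiplying a normally distributed random variable by a non-zero constant yields another normally distributed variable: if X ~ N(µ, σ²), then cX ~ N(cµ, c²σ²).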
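As a rough illustration, `statistics.NormalDist` objects support adding two independent normal variables and multiplying one by a constant. The two distributions below are hypothetical values chosen so the arithmetic is easy to follow:

```python
from statistics import NormalDist

# Hypothetical example variables, assumed to be independent
x = NormalDist(mu=1, sigma=3)
y = NormalDist(mu=2, sigma=4)

total = x + y   # sum of two independent normal variables
scaled = x * 2  # normal variable multiplied by a constant

print(total)    # means add; variances add (9 + 16 = 25, so sigma = 5)
print(scaled)   # mean and sigma are both multiplied by the constant
```

Note that it is the variances that add, not the standard deviations, which is why the sum has σ = 5 and not 7.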
What is a z-score?
A z-score rescales a value from any normal distribution onto the standard normal distribution (mean 0, standard deviation 1), so that probabilities can be found for any dataset. The equation for z is:
z = (x - µ) / σ
This z value can then be used either with a package in a coding language or with lookup tables in order to calculate probabilities.
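A short sketch of the package route, using the standard-library `statistics.NormalDist` and the heights example (µ = 162, σ = 7.5). The height of 177cm is an arbitrary value picked for illustration:

```python
from statistics import NormalDist


def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above the mean."""
    return (x - mu) / sigma


# Heights example: mu = 162cm, sigma = 7.5cm; 177cm is an illustrative value
z = z_score(177, mu=162, sigma=7.5)

# P(Z < z) for a standard normal Z, i.e. the fraction of heights below 177cm
p_shorter = NormalDist().cdf(z)

print(f"z = {z:.2f}, P(height < 177cm) = {p_shorter:.4f}")
```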
What is a critical value of a z-score?
Critical values are values of z which correspond to a chosen tail probability. The critical value z_(α/2) is the z-score such that the probability of a standard normal random variable being greater than z_(α/2) is α/2.
For example, z_0.25 would be the z-score such that:
P(Z > z_0.25) = 0.25
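Critical values can be looked up with the inverse CDF of the standard normal. The sketch below follows the upper-tail definition given above (probability of being greater than the critical value); the function name `critical_value` is an illustrative choice:

```python
from statistics import NormalDist


def critical_value(tail_prob):
    """Return z such that P(Z > z) = tail_prob for a standard normal Z."""
    return NormalDist().inv_cdf(1 - tail_prob)


z_0_25 = critical_value(0.25)    # upper-tail probability of 0.25
z_0_025 = critical_value(0.025)  # the value used for 95% confidence intervals

print(f"z_0.25  = {z_0_25:.3f}")
print(f"z_0.025 = {z_0_025:.3f}")
```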
Explain sampling distributions.
The statistics that are calculated on a random variable/sample will themselves have some kind of distribution. This simple fact can be really useful.
E.g. the means of repeated samples of baby weights will form an approximately normal distribution.
E.g. the sums of coin flip results (heads = 1, tails = 0) will form a binomial distribution.
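The coin-flip case can be sketched as a simulation: take many samples of 10 fair flips, record the sum of each, and compare the observed frequencies with the binomial formula. The number of flips, number of samples, and seed are arbitrary assumptions:

```python
import random
from collections import Counter
from math import comb

random.seed(1)  # fixed seed so the run is repeatable (arbitrary choice)

N_FLIPS = 10        # flips per sample
N_SAMPLES = 20_000  # number of repeated samples


def heads_in_sample(n_flips):
    """Sum of n_flips fair coin flips, counting heads as 1 and tails as 0."""
    return sum(random.randint(0, 1) for _ in range(n_flips))


def binomial_pmf(k, n, p=0.5):
    """Probability of exactly k heads in n flips."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Sampling distribution of the statistic "number of heads"
counts = Counter(heads_in_sample(N_FLIPS) for _ in range(N_SAMPLES))

for k in range(N_FLIPS + 1):
    observed = counts[k] / N_SAMPLES
    print(f"{k:>2} heads: observed {observed:.3f}, binomial {binomial_pmf(k, N_FLIPS):.3f}")
```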
What is the standard error?
The standard error of a statistic is the standard deviation of its sampling distribution. The standard error of the mean, for a sample of size N from a population with standard deviation σ, is the following:
SE = σ / sqrt(N)
So we get more error if N is small, and as N approaches infinity, our error approaches 0.
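One way to see where the formula comes from is to simulate it: draw many samples, take the mean of each, and compare the standard deviation of those means with σ / sqrt(N). The population values, sample size, and seed below are illustrative assumptions:

```python
import math
import random
import statistics

random.seed(2)  # fixed seed so the run is repeatable (arbitrary choice)

MU = 50.0          # population mean (illustrative)
SIGMA = 3.0        # population standard deviation (illustrative)
N = 25             # size of each sample
N_SAMPLES = 4_000  # number of repeated samples


def sample_mean(n):
    """Mean of one sample of size n from a normal population."""
    return statistics.fmean(random.gauss(MU, SIGMA) for _ in range(n))


means = [sample_mean(N) for _ in range(N_SAMPLES)]

observed_se = statistics.stdev(means)  # spread of the sampling distribution
predicted_se = SIGMA / math.sqrt(N)    # the formula: sigma / sqrt(N)

print(f"observed SE  = {observed_se:.3f}")
print(f"predicted SE = {predicted_se:.3f}")
```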
What is a confidence interval?
What is a confidence level?
Why are these useful in ML?
When we report data, it is useful to know the uncertainty in those values. One of the ways we can do this is to define the following:
- Confidence Interval: A plausible range of values for a population parameter.
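- Confidence Level: The percentage chance that the real value lies within your confidence interval. More precisely, it is the proportion of such intervals that would contain the real value if the sampling were repeated many times.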
These are useful in ML because we want to make a prediction and know how confident we are that the prediction is correct.
If the estimate of your population parameter has a normal sampling distribution (e.g. the mean, or the sum), what is the confidence interval?
Point estimate ± z_(α/2) × SE
Here your confidence level is 1 – α. This means if you want a 95% confidence interval you need to set α = 0.05. Generally, a higher confidence level requires a wider confidence interval.
Suppose we collected the BMI for a sample of 100 people. The sample mean is 24.4 and the standard deviation is 3.
• Q1. What is the standard error for the sample mean?
• Q2. What distribution does the sample mean follow?
• Q3. Construct the 95% confidence interval for the mean.
The standard error is defined as follows, and hence:
SE = σ / sqrt(N) = 3 / sqrt(100) = 0.3
The sample mean should follow an approximately normal distribution, because the central limit theorem says that the sampling distribution of the mean approaches a normal distribution as the sample size grows, and N = 100 is reasonably large.
Since we want a confidence level of 95%, we must have α = 0.05 from (Confidence Level = 1 – α).
Point estimate ± z_(α/2) × SE = 24.4 ± z_0.025 × 0.3 = 24.4 ± 1.96 × 0.3 = (23.8, 25.0)
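The same calculation can be sketched in Python. It assumes, as the worked answer does, that the sample standard deviation stands in for σ and that the normal approximation applies; the function name `confidence_interval` is an illustrative choice:

```python
import math
from statistics import NormalDist


def confidence_interval(point_estimate, sd, n, level=0.95):
    """Normal-based confidence interval for a mean."""
    alpha = 1 - level
    z = NormalDist().inv_cdf(1 - alpha / 2)  # critical value z_(alpha/2)
    se = sd / math.sqrt(n)                   # standard error of the mean
    return point_estimate - z * se, point_estimate + z * se


# BMI example: sample mean 24.4, standard deviation 3, N = 100
lower, upper = confidence_interval(24.4, sd=3, n=100)

print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```

This gives roughly 23.81 to 24.99.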
In the context of confidence intervals, what are accuracy and precision?
Give some examples.
Confidence intervals have a trade-off between accuracy and precision.
• Accuracy: Whether the confidence interval contains the real parameter.
• Precision: The width of the confidence interval.
For example:
• If your weather forecast is some value between -10° and 50°, then it is accurate but not precise.
• If your weather forecast is between 18.5° and 18.52°, it is precise but likely inaccurate.
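The trade-off can be made concrete with the BMI example (standard deviation 3, N = 100). In this sketch a higher confidence level, meaning the interval captures the real value more often, costs a wider and so less precise interval. The three levels shown are arbitrary choices:

```python
import math
from statistics import NormalDist

SE = 3 / math.sqrt(100)  # standard error from the BMI example


def interval_width(level, se):
    """Full width of a normal-based confidence interval at the given level."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return 2 * z * se


widths = {level: interval_width(level, SE) for level in (0.80, 0.95, 0.99)}

for level, width in widths.items():
    print(f"{level:.0%} confidence level: interval width = {width:.3f}")
```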
What is Hypothesis testing?
How do you perform hypothesis testing?
Often, we have an idea about the nature of a relationship in the data, and we need to test whether our hypothesis is correct or not. There is an established way of doing this.
We define the current state of affairs as the null hypothesis, and anything that isn’t this as the alternative hypothesis. Suppose a class has low participation. The teacher tries to improve this by asking random students some questions. In your class of 31, the teacher asks 10 questions and puts 5 of them to you. Is something going on here?
- Null Hypothesis: The teacher is fairly and randomly selecting students, you just got unlucky.
- Alternative Hypothesis: The teacher is not selecting fairly.
You reject the null hypothesis in favour of the alternative hypothesis only if there is significant evidence to suggest that the null hypothesis is not true.
Explain the concept of p-values.
In order to decide whether to reject a null hypothesis, we need to find the p-value.
P-value: The probability, assuming the null hypothesis is true, of seeing data at least as extreme as the data you observed. As this value gets lower, there is more evidence against your null hypothesis.
We need to define an α, the cut-off point below which we say that the null hypothesis is rejected. Common values for this are 0.05 and 0.01.
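As a worked sketch, take the teacher example from the hypothesis testing card. It assumes that under the null hypothesis each of the 10 questions goes independently to one of the 31 students at random, so the number of questions you get is binomial with n = 10 and p = 1/31. The p-value is then the chance of being asked 5 or more times; treating "at least as extreme" as "5 or more" is a modelling choice:

```python
from math import comb


def binomial_tail(k_min, n, p):
    """P(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


CLASS_SIZE = 31   # students in the class
N_QUESTIONS = 10  # questions the teacher asked
TIMES_ASKED = 5   # questions that went to you
ALPHA = 0.05      # chosen cut-off

# Probability of being asked at least 5 times if selection is fair
p_value = binomial_tail(TIMES_ASKED, N_QUESTIONS, 1 / CLASS_SIZE)
reject_null = p_value < ALPHA

print(f"p-value = {p_value:.2e}, reject null hypothesis: {reject_null}")
```

Under these assumptions the p-value is far below 0.05, so the data count as strong evidence against fair selection.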