Statistics Flashcards

Question 1

Q

What is a statistic?

What is a statistical estimate? Give an example.

Answer

A

Statistic:
• A value calculated from data which can be used to describe that data
• It can also be used to infer properties of the distribution the data came from
Statistical Estimate:
• An estimate of a population value (parameter), calculated as a statistic on some sample data
• E.g. the average height in the UK can be estimated as the average height of Kubrick employees

Question 2

Q

What is the law of large numbers?

Answer

A

As the size of a dataset increases, estimates of parameters converge on the true values of those parameters.
• This is to say that more evidence is better than less evidence
• This applies to the mean, standard deviation, and many other statistical values
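
A minimal sketch of this convergence, assuming draws from a uniform(0, 1) distribution whose true mean is 0.5. The function name, seed and sample sizes are illustrative choices, not part of the card.

```python
import random

def sample_mean_error(n, true_mean=0.5, seed=0):
    """Absolute error of the sample mean of n uniform(0, 1) draws."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    sample_mean = sum(rng.random() for _ in range(n)) / n
    return abs(sample_mean - true_mean)

# The error tends to shrink as n grows (exact values depend on the seed).
for n in (10, 1_000, 100_000):
    print(n, sample_mean_error(n))
```

The shrinkage is a tendency, not a guarantee for every individual run, which is why the law is stated as convergence.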

Question 3

Q

Explain correlations, reliability and validity.

Answer

A

Correlations

Many variables are correlated. Examples might be height and weight, or the heights of fathers and sons.
• Correlation does NOT imply causation. E.g. you going to bed doesn’t cause the sun to set.

Reliability

This is the degree to which a measure gives the same values across observations.
• E.g. An IQ test today and tomorrow should give the same results, and if it doesn’t, it’s not a reliable test.

Validity

This is the degree to which a variable measures what it is supposed to.
• E.g. An IQ test should tally with other indicators of academic performance.

Question 4

Q

What is a normal (Gaussian) Distribution? Give an example.

Answer

A

This is a very common type of distribution which is bell-shaped and symmetric about the mean.

About 68% of values are within one standard deviation of the mean.
About 95% of values are within two standard deviations of the mean.
About 99.7% of values are within three standard deviations of the mean.

An example might be the heights of females, with µ = 162cm and σ = 7.5cm, and hence:
• 68% of the female population have heights between 154.5cm and 169.5cm.
• 95% of the female population have heights between 147cm and 177cm.
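
These shares can be checked with Python's standard library. The sketch below uses the µ and σ from the example; the helper name `share_within` is an illustrative choice.

```python
from statistics import NormalDist

heights = NormalDist(mu=162, sigma=7.5)  # values from the example, in cm

def share_within(dist, k):
    """Fraction of the distribution within k standard deviations of the mean."""
    low = dist.mean - k * dist.stdev
    high = dist.mean + k * dist.stdev
    return dist.cdf(high) - dist.cdf(low)

for k in (1, 2, 3):
    print(k, round(share_within(heights, k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```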

Question 5

Q

What are the two main properties of normally distributed variables?

Answer

A

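The sum of two independent normally distributed random variables is also normally distributed:
X + Y ~ N(μ1 + μ2, σ1² + σ2²), where X ~ N(μ1, σ1²) and Y ~ N(μ2, σ2²)

Multiplying a normally distributed random variable by a non-zero constant yields another normally distributed variable:
cX ~ N(cμ1, c²σ1²)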
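
A short sketch of both properties using `statistics.NormalDist`, which treats two instances as independent when they are added. The particular means and standard deviations are made up for illustration.

```python
from statistics import NormalDist

x = NormalDist(mu=2, sigma=3)
y = NormalDist(mu=3, sigma=4)

total = x + y    # sum of two independent normal variables
scaled = x * 2   # a normal variable multiplied by a constant

print(total.mean, total.stdev)    # 5.0 5.0  (means add, variances add: 9 + 16 = 25)
print(scaled.mean, scaled.stdev)  # 4.0 6.0  (mean and standard deviation both scale by 2)
```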

Question 6

Q

What is a z-score?

Answer

A

A z-score rescales a value from any normal distribution onto the standard normal distribution (mean 0, standard deviation 1), so that probabilities can be found for any dataset. The equation for z is:

z = (x - μ) / σ

This z value can then be used either with a package in a coding language or with lookup tables in order to calculate probabilities.
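
As a sketch of the "package in a coding language" route, the example below uses Python's `statistics.NormalDist`. The height figures (177cm, µ = 162, σ = 7.5) are borrowed from the normal distribution card purely for illustration.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above the mean."""
    return (x - mu) / sigma

z = z_score(177, mu=162, sigma=7.5)
p_below = NormalDist().cdf(z)  # P(Z < z) for a standard normal Z

print(z)                  # 2.0
print(round(p_below, 3))  # 0.977
```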

Question 7

Q

What is a critical value of a z-score?

Answer

A

A critical value is the z-score that cuts off a given tail probability. The critical value z_(a/2) is the z-score such that the probability of a standard normal random variable being greater than z_(a/2) is a/2.

For example, z_0.25 would be the z-score such that:
P(Z > z_0.25) = 0.25
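
Critical values can be looked up numerically. This sketch takes the upper-tail definition given above (the probability of exceeding the critical value) and uses the inverse CDF from the standard library; `critical_z` is an illustrative name.

```python
from statistics import NormalDist

def critical_z(upper_tail):
    """z such that P(Z > z) = upper_tail for a standard normal Z."""
    return NormalDist().inv_cdf(1 - upper_tail)

print(round(critical_z(0.025), 2))  # 1.96
print(round(critical_z(0.25), 2))   # 0.67
```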

Question 8

Q

Explain sampling distributions.

Answer

A

The statistics that are calculated on a random variable/sample will themselves have some kind of distribution. This simple fact can be really useful.

The means of repeated samples of baby weights will be approximately normally distributed.

The number of heads in a fixed number of coin flips will follow a binomial distribution.

Question 9

Q

What is the standard error?

Answer

A

The standard error of a statistic is the standard deviation of its sampling distribution. The standard error of the mean is the following:

SE = σ / sqrt(N)

So we get more error if N is small, and as N approaches infinity, our error approaches 0.
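
The formula can be checked by simulation: draw many samples, take the mean of each, and measure how much those means vary. This is a rough sketch assuming a normal population with σ = 2 and samples of size 25; all parameter values are illustrative.

```python
import random
import statistics

def empirical_standard_error(n, sigma=2.0, repeats=2000, seed=1):
    """Standard deviation of the means of `repeats` samples of size n."""
    rng = random.Random(seed)
    means = [
        sum(rng.gauss(0.0, sigma) for _ in range(n)) / n
        for _ in range(repeats)
    ]
    return statistics.stdev(means)

n, sigma = 25, 2.0
theoretical = sigma / n ** 0.5  # 0.4
empirical = empirical_standard_error(n, sigma)
print(theoretical, empirical)   # the two should be close
```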

Question 10

Q

What is a confidence interval?
What is a confidence level?

Why are these useful in ML?

Answer

A

When we report data, it is useful to know the uncertainty in those values. One of the ways we can do this is to define the following:

Confidence Interval: A plausible range of values for a population parameter.
Confidence Level: The percentage chance that the real value lies within your confidence interval, i.e. the share of intervals built this way that would contain it over many repeated samples.

These are useful in ML because we want to make a prediction and know how confident we are that the prediction is correct.
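
One way to see what a confidence level means is to build many intervals from fresh samples and count how often they contain the true value. The sketch below assumes a normal population with a known standard deviation and the usual interval of sample mean ± z × SE; the population values and function name are made up.

```python
import random
from statistics import NormalDist

def coverage(level=0.95, mu=10.0, sigma=2.0, n=30, repeats=2000, seed=2):
    """Share of simulated confidence intervals that contain the true mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    se = sigma / n ** 0.5
    hits = 0
    for _ in range(repeats):
        mean = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        if mean - z * se <= mu <= mean + z * se:
            hits += 1
    return hits / repeats

print(coverage())  # close to 0.95
```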

Question 11

Q

If your point estimate has a normal sampling distribution (e.g. the mean, or the sum), what is the confidence interval?

Answer

A

Point estimate ± z_(α/2) × SE

Here your confidence level is 1 – α. This means if you want a 95% confidence interval you need to set α = 0.05. Generally, as the confidence interval is made wider, the confidence level increases.

Question 12

Q

Suppose we collected the BMI for a sample of 100 people. The sample mean is 24.4 and the standard deviation is 3.
• Q1. What is the standard error for the sample mean?
• Q2. What distribution does the sample mean follow?
• Q3. Construct the 95% confidence interval for the mean.

Answer

A

The standard error is defined as follows, and hence:
SE = σ / sqrt(N) = 3 / sqrt(100) = 0.3

The sample mean should be approximately normally distributed, because the central limit theorem says that the mean of a large sample is approximately normal whatever the distribution of the underlying data.

Since we want an interval of 95%, we must have α = 0.05 from (Confidence Level = 1 – α).
Point estimate ± z_(α/2) × SE = 24.4 ± z_(0.05/2) × SE = 24.4 ± 1.96 × 0.3 = (23.8, 25.0)
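
The same calculation as a small sketch in Python, using the numbers from the question. The function name is an illustrative choice, and the interval assumes the normal approximation described above.

```python
from statistics import NormalDist

def mean_confidence_interval(mean, sd, n, level=0.95):
    """Normal-approximation confidence interval for a mean."""
    alpha = 1 - level
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z_(alpha/2), about 1.96 for 95%
    se = sd / n ** 0.5                       # standard error of the mean
    return mean - z * se, mean + z * se

low, high = mean_confidence_interval(24.4, 3, 100)
print(round(low, 1), round(high, 1))  # 23.8 25.0
```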

Question 13

Q

In the context of confidence intervals, what are accuracy and precision?
Give some examples.

Answer

A

Confidence intervals have a trade-off between accuracy and precision.
• Accuracy: Whether the confidence interval contains the real parameter.
• Precision: The width of the confidence interval.

For example:
• If your weather forecast is some value between -10° and 50°, then it is accurate but not precise.
• If your weather forecast is between 18.5° and 18.52°, it is precise but likely inaccurate.

Question 14

Q

What is Hypothesis testing?

How do you perform hypothesis testing?

Answer

A

Often, we have an idea about the nature of a relationship in the data, and we need to test whether our hypothesis is correct or not. There is an established way of doing this.

We define the current state of affairs as the null hypothesis, and anything that isn’t this as the alternate hypothesis. 
Suppose a class had low participation. The teacher tries to improve this by asking random students some questions. In your class of 31, the teacher asks 10 questions and 5 of them go to you. Is something going on here?

Null Hypothesis: The teacher is fairly and randomly selecting students, you just got unlucky.
Alternate Hypothesis: The teacher is not selecting fairly.
You accept the alternate hypothesis only if there is significant evidence to suggest that the null hypothesis is not true.
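
To put a number on the evidence in the classroom example, one can ask how likely it is to be picked at least 5 times out of 10 under the null hypothesis. This sketch assumes each question goes to one of the 31 students independently and uniformly at random, which is one reading of "fairly and randomly"; the function name is illustrative.

```python
from math import comb

def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Probability of being asked 5 or more of the 10 questions by chance alone.
p_value = binomial_upper_tail(5, 10, 1 / 31)
print(p_value)  # far below 0.001
```

A probability this small is the kind of significant evidence that would justify rejecting the null hypothesis.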

Question 15

Q

Explain the concept of p-values.

Answer

A

In order to decide whether to reject a null hypothesis, we need to find the p-value.
P-value: The probability of seeing data at least as extreme as yours if the null hypothesis were true, i.e. as the result of normal statistical fluctuations alone. As this value gets lower, there is more evidence against your null hypothesis.
We need to define an α, the cut-off point below which we reject the null hypothesis. Common values for this are 0.05, 0.01, and so on.

Question 16

Q

Explain the concept of statistical significance.

Suppose a coin is flipped 10 times and 9 of those times it lands on a head. What are the null and alternate hypotheses?
How would you find the p-value?

Answer

A

We say something is statistically significant if it is unlikely to have occurred given the null hypothesis.

Null Hypothesis: The coin is fair, it just happened to land like that.
Alternate Hypothesis: The coin is biased towards heads.
In this case, the p value can be found with the binomial distribution.
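
For only 10 flips, the p-value can be found by brute force: list every equally likely sequence for a fair coin and count those at least as extreme as the one observed. This sketch uses a one-sided test, since the alternate hypothesis is a bias towards heads.

```python
from itertools import product

flips = 10
observed_heads = 9

# All 2**10 = 1024 sequences are equally likely under the null hypothesis.
outcomes = list(product("HT", repeat=flips))
as_extreme = [o for o in outcomes if o.count("H") >= observed_heads]

p_value = len(as_extreme) / len(outcomes)
print(len(as_extreme), len(outcomes), p_value)  # 11 1024 0.0107421875
```

With a cut-off of 0.05 this result would count as statistically significant; with a cut-off of 0.01 it would not.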

Question 17

Q

Explain how to perform a T-test.

Answer

A

Here we test whether the true mean μ is equal to some value μ0.

If we don’t know the standard deviation, we can estimate it, and use that estimate to find an approximate standard error:
SE^ = sigma^ / sqrt(N)
Because we are approximating the standard deviation, we can’t use a z-score. We instead construct a t-score:
t = (xbar - μ0) / SE^

Under the null hypothesis, this statistic follows a t-distribution with ‘degrees of freedom’ equal to N-1.

Using the t distribution, we can find the p-value and compare it to our cut-off alpha value. We use the t-test if we need to estimate the standard error of our sampling distribution. When we have large sample sizes (say N>30), the difference between the z-test and the t-test should be small.
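
A minimal sketch of the t-score calculation: the distance between the sample mean and μ0, measured in units of the estimated standard error. The sample values and μ0 = 4 are made up for illustration.

```python
import statistics

def one_sample_t(sample, mu0):
    """Return the t-score and degrees of freedom for H0: mean == mu0."""
    n = len(sample)
    se_hat = statistics.stdev(sample) / n ** 0.5  # estimated standard error
    t = (statistics.mean(sample) - mu0) / se_hat
    return t, n - 1

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical measurements
t, df = one_sample_t(sample, mu0=4)
print(round(t, 3), df)  # 1.323 7
```

Turning t and the degrees of freedom into a p-value needs the t distribution, which the standard library does not provide; a package such as SciPy (`scipy.stats.t`) or a lookup table can be used for that step.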

Question 18

Q

How would you compare means between two independent groups?

Answer

A

Often, we want to draw inferences on the means of two populations using a sample from each. We can perform a hypothesis test on the difference between the two means.
Our Null Hypothesis is that the difference between the means is equal to some null value d0 (often 0):
μ1 – μ2 = d0
Our alternate hypotheses are that this is not the case:
μ1 – μ2 > d0
μ1 – μ2 < d0
μ1 – μ2 ≠ d0
The standard error for the difference between two means is:
SE^ = sqrt( σ1² / N1 + σ2² / N2 )
And the t-statistic is given by:
t = ((xbar1 - xbar2) - d0) / SE^
Under the null hypothesis, this approximately follows a t distribution. A conservative choice is N-1 degrees of freedom, where N is the smaller of the two sample sizes.
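
A sketch of the two-sample calculation, with the standard error built from each group's sample variance and the smaller-sample rule for degrees of freedom. The two groups are made-up data and the function name is illustrative.

```python
import statistics

def two_sample_t(a, b, d0=0.0):
    """t-score for H0: mean(a) - mean(b) == d0, with independent groups."""
    se_hat = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    t = (statistics.mean(a) - statistics.mean(b) - d0) / se_hat
    df = min(len(a), len(b)) - 1  # degrees of freedom from the smaller sample
    return t, df

group_a = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical measurements
group_b = [1, 2, 3, 4, 5]
t, df = two_sample_t(group_a, group_b)
print(round(t, 2), df)  # 1.93 4
```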

Question 19

Q

How would you compare means between two groups with paired data?

Answer

A

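Sometimes, our subjects participate in both control and treatment groups. In this case the data is paired. To find the t-statistic in this situation we take the difference within each pair and run a one-sample t-test on those differences:
t = xbar_diff / SE^
Here xbar_diff is the mean of the differences and SE^ is the estimated standard deviation of the differences divided by sqrt(N), where N is the number of pairs.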
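
A sketch of the paired calculation: reduce each pair to a single difference, then treat the differences as one sample and test whether their mean is zero. The before and after values are made up for illustration.

```python
import statistics

def paired_t(before, after):
    """t-score and degrees of freedom for H0: mean difference == 0."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    se_hat = statistics.stdev(diffs) / n ** 0.5  # standard error of the mean difference
    return statistics.mean(diffs) / se_hat, n - 1

before = [10, 12, 9, 11, 13]  # hypothetical scores under control
after = [12, 14, 10, 13, 16]  # same subjects under treatment
t, df = paired_t(before, after)
print(round(t, 2), df)  # 6.32 4
```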

Question 20

Q

Explain the concept of error rates.

Answer

A

Just as is the case in other areas of machine learning, sometimes we want to optimise for a certain kind of error. In hypothesis testing, we talk about type 1 and type 2 errors:

Type 1: Reject the null hypothesis H0 incorrectly.
Type 2: Fail to reject the null hypothesis when the alternative is true.

These types of errors can be tuned by adjusting alpha. If alpha is too small then we might miss a relationship (a type 2 error); if it is too large we might think one is there when it isn’t (a type 1 error).
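
The link between alpha and type 1 errors can be seen by simulation: when the null hypothesis is actually true, a test run at level alpha should reject it in roughly a fraction alpha of experiments. This sketch assumes a two-sided z-test on data drawn from a standard normal population; the sample size, trial count and seed are illustrative.

```python
import random
from statistics import NormalDist

def type_1_error_rate(alpha=0.05, n=20, trials=2000, seed=3):
    """Share of experiments that wrongly reject a true null hypothesis."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        mean = sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n
        z = mean / (1.0 / n ** 0.5)  # the null is true here: mu = 0, sigma = 1
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials

for alpha in (0.01, 0.05, 0.2):
    print(alpha, type_1_error_rate(alpha))  # each rate should be near its alpha
```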
