Probability, Correlation And Hypothesis Testing Flashcards
Comparative pie charts formula
Outliers formula
Comparative pie charts
The ratio of the sample sizes is the same as the ratio of the areas of the pie charts
Population mean
Sample mean
‘Sum of’
The sample mean when xi occurs with a frequency fi
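A minimal sketch of the frequency version of the sample mean, the sum of fi × xi divided by the sum of fi. The frequency table below is made-up illustration data, not from the cards.

```python
# Hypothetical frequency table: value x_i -> frequency f_i
freq = {1: 3, 2: 5, 3: 2}

def frequency_mean(table):
    """Sample mean when each value x_i occurs with frequency f_i."""
    total = sum(x * f for x, f in table.items())  # sum of f_i * x_i
    n = sum(table.values())                       # sum of f_i
    return total / n

mean = frequency_mean(freq)
print(mean)
```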
What is discrete data?
Data that can only take certain distinct values, which are often but not always integers, for example shoe size
What is continuous data?
Data that can take any numerical value within a range, such as height
What is the range?
Highest value - lowest value
What is IQR?
Q3 - Q1
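The range and IQR can be sketched as below. The data are made up, and the quartile convention is an assumption: different textbooks and tools place Q1 and Q3 slightly differently, so the `method="inclusive"` choice here may not match the one used in a given course.

```python
import statistics

# Hypothetical data set for illustration
data = [2, 4, 4, 5, 7, 9, 10, 12]

# Range: highest value - lowest value
data_range = max(data) - min(data)

# Quartiles: n=4 splits the data into four parts, giving Q1, Q2, Q3
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

# IQR: Q3 - Q1
iqr = q3 - q1
print(data_range, q1, q3, iqr)
```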
Standard deviation formulas
Variance formulas
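The formulas on these two cards did not survive, so the sketch below assumes the usual "divide by n" forms: variance as the mean of the squared deviations, or equivalently the mean of the squares minus the square of the mean, with the standard deviation as its square root. The data are made up.

```python
import math

# Hypothetical data set for illustration
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)
mean = sum(data) / n

# Assumed definition: mean of the squared deviations from the mean
variance_def = sum((x - mean) ** 2 for x in data) / n

# Equivalent shortcut: mean of the squares minus the square of the mean
variance_short = sum(x * x for x in data) / n - mean ** 2

# Standard deviation is the square root of the variance
std_dev = math.sqrt(variance_def)
print(variance_def, variance_short, std_dev)
```

If a course uses the sample version that divides by n - 1, the results will differ slightly from these.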
What is probability?
What is a set?
A collection of distinct objects (often numbers) in which no element is repeated
What is a subset?
‘A’ is a subset of ‘S’ if all the elements in ‘A’ are also in ‘S’
What is an empty set?
The set with no elements, written ∅
What is a sample space?
All the possible outcomes of a random experiment
Complement of A
A’ (not A)
B is a subset of A
If B occurs so does A
Mutually exclusive
The occurrence of one event excludes the possibility that any other events could occur (they cannot happen at the same time)
If A and B are mutually exclusive, the probability of A or B occurring is the sum of their probabilities: P(A ∪ B) = P(A) + P(B)
For three mutually exclusive events: P(A ∪ B ∪ C) = P(A) + P(B) + P(C)
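As a quick check of the rule for mutually exclusive events, here is a small sketch using one roll of a fair die. The events A, B and C are made up for illustration and share no outcomes.

```python
from fractions import Fraction

# Sample space for one roll of a fair die
sample_space = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Probability of an event, assuming equally likely outcomes."""
    return Fraction(len(event & sample_space), len(sample_space))

# Hypothetical events with no outcomes in common (mutually exclusive)
A = {1}
B = {2, 3}
C = {6}

p_union = prob(A | B | C)                  # P(A or B or C)
p_sum = prob(A) + prob(B) + prob(C)        # P(A) + P(B) + P(C)
print(p_union, p_sum)
```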
Independent events
The probability of event A occurring is unaffected by whether or not B occurs
If A and B are independent then P(A ∩ B) = P(A) × P(B)
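The independence rule can be checked by listing a sample space. This sketch assumes two fair dice and two made-up events, one depending only on the first die and one only on the second.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes for two fair dice
sample_space = set(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event, assuming equally likely outcomes."""
    return Fraction(len(event), len(sample_space))

# Hypothetical events for illustration
A = {o for o in sample_space if o[0] % 2 == 0}  # first die is even
B = {o for o in sample_space if o[1] > 4}       # second die shows 5 or 6

p_both = prob(A & B)                            # P(A and B)
independent = p_both == prob(A) * prob(B)
print(p_both, independent)
```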
The addition law of probability
Multiplication law
What is Pearson’s Product Moment Correlation Coefficient
The PMCC is denoted by r and named after Pearson, an applied mathematician who worked on the application of statistics to genetics and evolution
PMCC formulas
Interpreting PMCC values
r = 1: perfect positive correlation
r = -1: perfect negative correlation
r = 0: no linear correlation
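The three reference values can be reproduced with a short sketch. The formula card is missing, so this assumes the common form Sxy / sqrt(Sxx × Syy); the helper name `pmcc` and the data are made up for illustration.

```python
import math

def pmcc(xs, ys):
    """Product moment correlation coefficient, assumed as Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5]
r_pos = pmcc(xs, [2, 4, 6, 8, 10])   # points on a rising straight line
r_neg = pmcc(xs, [10, 8, 6, 4, 2])   # points on a falling straight line
r_none = pmcc(xs, [1, 2, 3, 2, 1])   # symmetric rise and fall, no linear trend
print(r_pos, r_neg, r_none)
```

The last data set has a clear pattern but still gives a value of zero, which is why the card says "no linear correlation" rather than "no relationship".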
What does a measure of correlation indicate?
A relationship between the two variables; however, it does not indicate a causal relationship
Spearman’s correlation coefficient formula
Spearman’s
Makes no assumptions about the distribution of the original data, and the relationship does not need to be linear because it works on ranks
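A minimal sketch of Spearman's coefficient, assuming the common rank-difference form 1 - 6Σd² / (n(n² - 1)), which is only valid when there are no tied values. The helper names and data are made up for illustration.

```python
def ranks(values):
    """Rank each value from 1 (smallest) upwards. Assumes no tied values."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def spearman(xs, ys):
    """Spearman's rank correlation, assumed form: 1 - 6*sum(d^2) / (n(n^2 - 1))."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

xs = [1, 2, 3, 4, 5]
rs_cubic = spearman(xs, [1, 8, 27, 64, 125])  # curved but always increasing
rs_mixed = spearman(xs, [3, 1, 4, 5, 2])      # ranks only partly agree
print(rs_cubic, rs_mixed)
```

The cubic data are not linear, yet the ranks agree exactly, which illustrates why the original data do not need to be linear.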
PMCC
We can only do a hypothesis test here if the variables are jointly normally distributed
H0 and H1
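H0: null hypothesis (no correlation in the population, ρ = 0)
H1: alternative hypothesis (there is correlation: ρ ≠ 0 for a two-tailed test, or ρ > 0 or ρ < 0 for a one-tailed test)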
Hypothesis testing
What is a regression line?
The straight line of best fit for bivariate data; it should pass through the double mean point (x̄, ȳ)
The equation for the linear regression line is given as:
y = ax + b
Where a is the gradient and b is the y-intercept
x is the independent variable and y is the dependent variable
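A short sketch of fitting y = ax + b. The cards do not say how the line is found, so this assumes the usual least squares method, with gradient a = Sxy / Sxx and intercept chosen so that the line passes through the double mean point. The data are made up.

```python
def regression_line(xs, ys):
    """Least squares line y = a*x + b (assumed method) for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx              # gradient
    b = mean_y - a * mean_x    # intercept, forces the line through the mean point
    return a, b

# Hypothetical bivariate data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = regression_line(xs, ys)

# The fitted line evaluated at the mean of x should give the mean of y
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
print(a, b, a * mean_x + b, mean_y)
```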
Things to consider when analysing the regression model
How do we interpret the model?
How can we interpret the coefficient of x in context?
How can we interpret the constant term in context?
What is a residual?
An error the model produces when trying to predict a data point
It is the vertical distance between the data point and the regression line
For y on x regression it is only sensible to consider predictions for y
How to calculate a residual?
What does a positive residual indicate?
Where the model is giving an underprediction
What does a negative residual indicate?
An overprediction
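A small sketch of residuals, taking the residual as observed y minus predicted y. That sign convention is inferred from the cards' rule that a positive residual means an underprediction. The model coefficients and data points are made up.

```python
# Hypothetical regression model y = a*x + b
a, b = 0.6, 2.2

# Hypothetical observed data points (x, y)
points = [(1, 2), (2, 4), (3, 5)]

# Residual = observed y - predicted y (assumed sign convention)
residuals = [y - (a * x + b) for x, y in points]

def describe(residual):
    """Positive: model predicted too low. Negative: model predicted too high."""
    if residual > 0:
        return "underprediction"
    if residual < 0:
        return "overprediction"
    return "exact"

labels = [describe(r) for r in residuals]
print(residuals, labels)
```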
What should we see when we plot predicted vs actual?
Strong positive correlation
What should we see when we plot predicted vs residual?
A random scatter of points around zero with no pattern
Anscombe’s quartet
Four data sets that have almost the same summary statistics but are clearly different when plotted
Unstructured statistics
Each data set has the same summary statistics but they are visually different
The normal distribution diagram
The normal distribution formula
What is the z-value?
The number of standard deviations a value is above/below the mean
Because the normal distribution is symmetrical, we can use the positive z-value to calculate the probability for the negative z-value
We can only use the z-table when…
The z-value is positive (on the right of the graph)
We’re finding the probability to the left of this z-value
Changing the direction of the inequality
Changing the sign of the z-value or the direction of the inequality means we do ‘1 -’ the table probability
If we do both they cancel out
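These rules can be checked numerically. The sketch below uses Python's `statistics.NormalDist` in place of a printed z-table and assumes the usual standardising step z = (x - mean) / standard deviation, which follows from the definition of the z-value above. The mean, standard deviation and x value are made up.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

def z_value(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma

# Hypothetical example: x = 115 from a normal with mean 100 and sd 10
z = z_value(115, 100, 10)

p_left = Z.cdf(z)            # P(Z < z): the value a z-table gives directly
p_right = 1 - p_left         # inequality flipped: do "1 -"
p_left_neg = Z.cdf(-z)       # sign flipped: same as doing "1 -"
p_right_neg = 1 - Z.cdf(-z)  # both flipped: the two changes cancel out
print(z, p_left, p_right, p_left_neg, p_right_neg)
```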
Standardising formula
To find the z score?
To find the z value for a probability?
Use the z table backwards
Find the value on the table and work backwards
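Working backwards from a probability can be sketched with `NormalDist.inv_cdf`, which plays the role of reading the z-table in reverse. The probability, mean and standard deviation below are made up for illustration.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Hypothetical target: the z-value with probability 0.975 to its left
p = 0.975
z = Z.inv_cdf(p)

# Undo the standardising step to get back to the original scale,
# assuming a normal distribution with mean 100 and sd 10
mu, sigma = 100, 10
x = mu + z * sigma
print(z, x)
```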
Central limit theorem
If we repeatedly take samples of the same size and record their sample means, the sample means will themselves be approximately normally distributed around the population mean, provided the sample size is large enough
How is the sample mean normally distributed
Standard deviation
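A rough simulation of this idea, assuming fair die rolls as the population (mean 3.5, variance 35/12). It compares the spread of the sample means with the standard result that their standard deviation is the population standard deviation divided by the square root of the sample size. The sample size and number of samples are arbitrary choices.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

sample_size = 30    # rolls per sample (arbitrary)
num_samples = 2000  # number of samples taken (arbitrary)

def sample_mean(n):
    """Mean of n rolls of a fair six-sided die."""
    rolls = [random.randint(1, 6) for _ in range(n)]
    return statistics.mean(rolls)

sample_means = [sample_mean(sample_size) for _ in range(num_samples)]

mean_of_means = statistics.mean(sample_means)  # should be close to 3.5
sd_of_means = statistics.pstdev(sample_means)  # spread of the sample means

# Standard result: sd of the sample mean = population sd / sqrt(sample size)
expected_sd = math.sqrt(35 / 12) / math.sqrt(sample_size)
print(mean_of_means, sd_of_means, expected_sd)
```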
Continuity corrections
We can treat discrete data as continuous by adjusting each boundary by 0.5, for example X ≤ 5 becomes Y < 5.5
Approximating
To approximate a binomial distribution as a normal we can copy over the mean and variance of the binomial: mean np and variance np(1 - p)
We must change the letter for the random variable as it follows a different distribution
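A sketch of the approximation with a continuity correction, comparing the exact binomial probability with the normal one. The values n = 100, p = 0.5 and k = 55 are made up, and X and Y are just illustrative letters for the binomial and normal variables.

```python
from math import comb, sqrt
from statistics import NormalDist

# Hypothetical binomial: X ~ B(n, p)
n, p = 100, 0.5

# Copy over the mean and variance: Y ~ N(np, np(1 - p))
mu = n * p
variance = n * p * (1 - p)
Y = NormalDist(mu, sqrt(variance))

k = 55

# Exact binomial probability P(X <= k)
exact = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

# Normal approximation with continuity correction: P(Y < k + 0.5)
approx = Y.cdf(k + 0.5)
print(exact, approx)
```

The approximation tends to work best when n is large and p is not too close to 0 or 1.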