statistics year 2 Flashcards
how to prove a normal distribution can be used to model a variable
- state that the data is continuous
- show that nearly all of the data lies within three standard deviations of the mean
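when calculating normal distribution questions on a calculator, for an unknown upper/lower bound use a value far out in the tail: at least three standard deviations beyond the mean, and five or more if the answer is needed to several decimal places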
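A rough check of how far out the stand-in bound needs to be, sketched with Python's standard-library statistics.NormalDist. The distribution N(50, 4²) and the value 46 are made-up numbers for illustration only.

```python
from statistics import NormalDist

# Hypothetical distribution for illustration: X ~ N(50, 4^2)
X = NormalDist(mu=50, sigma=4)

def prob_between(dist, lower, upper):
    """P(lower < X < upper), the calculation a calculator needs two bounds for."""
    return dist.cdf(upper) - dist.cdf(lower)

exact = X.cdf(46)                             # true P(X < 46), no lower bound needed
three_sd = prob_between(X, 50 - 3 * 4, 46)    # lower bound = mean - 3 standard deviations
five_sd = prob_between(X, 50 - 5 * 4, 46)     # lower bound = mean - 5 standard deviations

print(round(exact, 4), round(three_sd, 4), round(five_sd, 4))
```

With a bound three standard deviations out, the answer is short by roughly 0.001; with five, the difference is negligible at four decimal places.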
‘given’ formula
P(B|A) = P(A∩B) / P(A)
P(A∩B) = P(A) × P(B|A)
this is given in the formula booklet
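A small worked example of the formula and its rearranged form, using exact fractions. The probabilities 2/5 and 1/10 are hypothetical values chosen for illustration.

```python
from fractions import Fraction

# Hypothetical probabilities for illustration
p_a = Fraction(2, 5)           # P(A)
p_a_and_b = Fraction(1, 10)    # P(A ∩ B)

# P(B|A) = P(A ∩ B) / P(A)
p_b_given_a = p_a_and_b / p_a

# Rearranged form: P(A ∩ B) = P(A) × P(B|A)
recovered = p_a * p_b_given_a

print(p_b_given_a)   # 1/4
print(recovered)     # 1/10
```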
standardising a score formula
(score − mean) / standard deviation
probability of exact number using normal distribution
Zero. The region above a single value is a straight line with no width, so it has no area
- mention that the distribution is continuous
independent events
Two events, A and B, are independent if P(A|B) = P(A) or if P(A∩B) = P(A) × P(B)
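One way to check the product rule is with exact fractions, sketched below. The die-rolling events are a hypothetical example, not taken from these cards.

```python
from fractions import Fraction

def are_independent(p_a, p_b, p_a_and_b):
    """A and B are independent when P(A ∩ B) = P(A) × P(B)."""
    return p_a_and_b == p_a * p_b

# Hypothetical example: one roll of a fair six-sided die
# A = "even" {2, 4, 6}, B = "at most 4" {1, 2, 3, 4}, A ∩ B = {2, 4}
p_a = Fraction(3, 6)
p_b = Fraction(4, 6)
p_a_and_b = Fraction(2, 6)

print(are_independent(p_a, p_b, p_a_and_b))   # True

# Equivalent check using the conditional form: P(A|B) = P(A)
p_a_given_b = p_a_and_b / p_b
print(p_a_given_b == p_a)                     # True
```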
modelling with probability
To model real-life situations mathematically, you often have to make simplifying assumptions. You can analyse and improve your model by comparing predicted results with actual data, questioning any assumptions that have been made.
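To test a binomial model you can use the mean and variance. For X ∼ B(n, p), the mean (μ) and variance (σ²) are given by μ = np and σ² = np(1 − p). If the model is suitable, the mean and variance of the observed data should be close to these values.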
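A minimal sketch of that comparison. The observed counts and the choice n = 10 are invented for illustration, and estimating p from the sample mean is an assumption of this sketch rather than something stated on the card.

```python
from statistics import mean, pvariance

def binomial_mean_variance(n, p):
    """Mean np and variance np(1 - p) of X ~ B(n, p)."""
    return n * p, n * p * (1 - p)

# Hypothetical data: number of successes in each of 12 experiments of n = 10 trials
observed = [3, 5, 4, 6, 5, 4, 7, 5, 4, 5, 6, 3]
n = 10

sample_mean = mean(observed)
sample_var = pvariance(observed)   # variance of the observed data

# Assumption of this sketch: estimate p so that np matches the sample mean
p_hat = sample_mean / n
model_mean, model_var = binomial_mean_variance(n, p_hat)

print("sample:", sample_mean, round(sample_var, 3))
print("model: ", model_mean, round(model_var, 3))
```

A large gap between the sample variance and np(1 − p) is a reason to question the binomial assumptions, such as a constant probability of success or independent trials.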
continuous random variable (CRV), X
- can take any one of an infinite number of values on a given interval. Instead of assigning probabilities to individual values of X, you assign probabilities to ranges of values of X and the probability distribution is represented by a curve or a sequence of curves called a probability density function
P(a ≤ X ≤ b) = integral of f(x) from a to b
- The ≤ and the strict less-than signs are interchangeable because the probability of any single exact value is zero, so both lead to the same area under the curve
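The area under a probability density function can be approximated numerically. The sketch below uses the midpoint rule on a hypothetical density f(x) = 3x² on [0, 1]; both the density and the helper names are assumptions made for illustration.

```python
def f(x):
    """Hypothetical probability density function: 3x^2 on [0, 1], 0 elsewhere."""
    return 3 * x**2 if 0 <= x <= 1 else 0.0

def probability(pdf, a, b, steps=10_000):
    """Approximate P(a <= X <= b) as the area under pdf, using the midpoint rule."""
    width = (b - a) / steps
    return sum(pdf(a + (i + 0.5) * width) for i in range(steps)) * width

print(round(probability(f, 0.5, 1), 4))   # exact answer is 1 - 0.5^3 = 0.875
print(round(probability(f, 0, 1), 4))     # total area under a density is 1
```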
Normal distribution
- The normal probability density function has a bell-shaped curve. It is a continuous function so the area under the curve can be used to calculate probabilities
- Total area under the curve=1
- In the normal distribution:
- mean=median=mode
- distribution is symmetrical
- points of inflection one standard deviation from the mean
- roughly 68% of values lie within one standard deviation of the mean
- roughly 99.7% of values lie within three standard deviations of the mean
- If a variable X follows a normal distribution you write X ∼ N ( μ , σ 2 )
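The proportions within one, two and three standard deviations of the mean can be checked numerically. This is a sketch using Python's standard-library statistics.NormalDist; the helper name within is made up for illustration.

```python
from statistics import NormalDist

Z = NormalDist()   # standard normal: mean 0, standard deviation 1

def within(k):
    """Proportion of values within k standard deviations of the mean."""
    return Z.cdf(k) - Z.cdf(-k)

for k in (1, 2, 3):
    print(k, round(within(k), 4))
```

The results are the same for any mean and standard deviation, because every normal distribution has the same shape once it is standardised.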
standard normal distribution
given as symbol Z
Z ∼ N ( 0, 1 )
Z = (X − μ) / σ
- you often need to use the inverse normal function to find the z-value, then rearrange Z = (X − μ) / σ to solve for the mean or standard deviation as required
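A sketch of the inverse-normal method for an unknown mean. The question itself (X ∼ N(μ, 5²) with P(X < 20) = 0.9) is a hypothetical example; statistics.NormalDist.inv_cdf plays the role of the calculator's inverse normal function.

```python
from statistics import NormalDist

Z = NormalDist()   # standard normal, Z ~ N(0, 1)

# Hypothetical question: X ~ N(mu, 5^2) and P(X < 20) = 0.9. Find mu.
sigma = 5
x = 20

# Step 1: inverse normal gives the z-value with P(Z < z) = 0.9
z = Z.inv_cdf(0.9)

# Step 2: rearrange z = (x - mu) / sigma to get mu = x - z * sigma
mu = x - z * sigma

print(round(z, 4), round(mu, 2))

# Check: the fitted distribution should give the stated probability
print(round(NormalDist(mu, sigma).cdf(x), 4))
```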
using normal distribution as approximation to the binomial
- the binomial distribution models situations where a random variable takes only discrete values. The normal distribution models continuous variables. If n is large enough (usually n bigger than 30) and p is close to 0.5, you can use a normal distribution to approximate a binomial distribution.
- As the number of trials of a binomial distribution increases, the shape of the distribution becomes increasingly symmetric about its mean and increasingly resembles a normal distribution
- For X ∼ B(n, p), as n increases the distribution of X tends to that of the random variable Y where Y ∼ N(np, np(1 − p)). Use this only if n is large and p is close to 0.5
- P(X=x) ≈ P(x-0.5< Y < x+0.5)
- Inclusion of the 0.5 increases the accuracy of the approximation. It is known as a continuity correction. You should always use it when approximating a discrete distribution by a continuous distribution
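The effect of the continuity correction can be seen by comparing the approximation with the exact binomial probability. The values n = 100, p = 0.5 and the range 45 to 55 are hypothetical choices for this sketch.

```python
from math import comb, sqrt
from statistics import NormalDist

def binomial_pmf(n, p, x):
    """Exact P(X = x) for X ~ B(n, p)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Hypothetical example: X ~ B(100, 0.5), find P(45 <= X <= 55)
n, p = 100, 0.5
Y = NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p)))   # Y ~ N(np, np(1 - p))

exact = sum(binomial_pmf(n, p, x) for x in range(45, 56))
with_correction = Y.cdf(55.5) - Y.cdf(44.5)    # P(44.5 < Y < 55.5)
without_correction = Y.cdf(55) - Y.cdf(45)     # ignores the half-unit either side

print(round(exact, 4), round(with_correction, 4), round(without_correction, 4))
```

The corrected value is much closer to the exact probability than the uncorrected one.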
correlation hypothesis testing
- tests for evidence of linear correlation in the population. The test can only tell you whether there is evidence of correlation and its type (positive or negative), not its strength
- the null hypothesis is always ρ = 0, where ρ is the population correlation coefficient
- the alternative hypothesis can be ρ > 0, ρ < 0 (one-tailed tests) or ρ ≠ 0 (a two-tailed test)
- reject the null hypothesis if the p-value is less than the significance level
- or if the PMCC (r) of the sample falls in the critical region, so r is closer to −1 or 1 than the critical value is
- a hypothesis test is used because it would be too difficult to work out the PMCC for the whole population. Instead, it is estimated from a sample taken from the population, and the sample PMCC is denoted by r
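The critical-region decision can be sketched as a small function. The sample value r = 0.62 is hypothetical, and the critical values for a sample of size 10 (0.5494 one-tailed at 5%, 0.6319 two-tailed at 5%) are quoted from a standard PMCC critical-value table; check them against the table in your own formula booklet.

```python
def reject_null(r, critical_value, alternative):
    """Decide whether r lies in the critical region for H0: rho = 0.

    alternative is 'greater' (H1: rho > 0), 'less' (H1: rho < 0)
    or 'two-sided' (H1: rho != 0).
    """
    if alternative == "greater":
        return r > critical_value
    if alternative == "less":
        return r < -critical_value
    if alternative == "two-sided":
        return abs(r) > critical_value
    raise ValueError("alternative must be 'greater', 'less' or 'two-sided'")

r = 0.62   # hypothetical sample PMCC from n = 10 pairs

# One-tailed test at 5%: critical value 0.5494 (from tables, n = 10)
print(reject_null(r, 0.5494, "greater"))     # True: evidence of positive correlation

# Two-tailed test at 5% puts 2.5% in each tail: critical value 0.6319 (from tables, n = 10)
print(reject_null(r, 0.6319, "two-sided"))   # False: not enough evidence
```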
PMCC
- Pearson's product moment correlation coefficient, r, is a statistic that estimates ρ, the population correlation coefficient
- it measures the linear correlation in a sample. The estimate becomes better as the sample size increases, but it is still likely to differ from the true value
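A minimal sketch of calculating r from a sample, using the sums of squares Sxy, Sxx and Syy. The data values are invented for illustration.

```python
from math import sqrt

def pmcc(xs, ys):
    """Pearson's product moment correlation coefficient r = Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

# Hypothetical sample of five (x, y) pairs
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = pmcc(xs, ys)
print(round(r, 4))
```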
what does it mean for a test to have a 5% significance level
- if the null hypothesis is actually true, there is a 5% probability of rejecting it anyway
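This can be illustrated with a simulation. The sketch below assumes a two-tailed test whose test statistic follows the standard normal distribution when the null hypothesis is true; the seed and the number of trials are arbitrary choices.

```python
import random
from statistics import NormalDist

random.seed(1)   # arbitrary seed so the run is repeatable

# Two-tailed test at the 5% level: reject when |Z| exceeds this critical value
critical = NormalDist().inv_cdf(0.975)

# Simulate test statistics when the null hypothesis is TRUE, so Z ~ N(0, 1)
trials = 20_000
rejections = sum(1 for _ in range(trials) if abs(random.gauss(0, 1)) > critical)
rate = rejections / trials

print(round(critical, 2), rate)
```

The rejection rate should come out close to 0.05, even though the null hypothesis is true in every simulated trial.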