Statistics Flashcards
Bivariate data
Data relating to pairs of variables
Variables that are statistically related
Correlated
How do you identify correlation
Scatter graph
What goes on x axis
Explanatory or independent variable
Population
The set of things you are interested in
E.g. all people in the UK
Census
Observes or measures every member of the population
Parameter
A number that describes the entire population
E.g. the mean or standard deviation
Sample
Subset of a population
Used to find out information about the whole population
Statistic
A value calculated from a sample
E.g. the mean or standard deviation of the sample that can be used to estimate the mean of the population or standard deviation of the population
Sampling unit
An individual unit from the population
E.g. a particular person living in the UK
Sampling frame
A list of all the sampling units in the population
E.g. the electoral register for the UK
Advantage of using a sample over a census
Quicker, fewer people have to respond and less data to process
Less expensive
Advantage of using a census over a sample
Should be a completely accurate result
A sample may not be large enough to give information about small sub groups of the population
Sample disadvantage
Data may not be accurate
Sample might not be large enough to give information about small sub groups
Census disadvantage
Takes a long time and expensive
Hard to process the large quantities of data
Cannot be used if the testing process destroys the items
Advantages of sampling
Quick and not as expensive
Fewer people have to respond
Less data to process
Advantages of a census
Should give a completely accurate result
If you want to know the mean number of sweets in a packet of sweets, why is it not possible to use a census
Destroying all the sweets
Can’t use a census if all the sampling units are being destroyed
3 methods of random sampling
Simple random sampling
Systematic sampling
Stratified sampling
Simple random sampling method
Number all the items in the population
Use a random number generator to select sample of desired size
If a number is repeated, ignore it and generate another number
Systematic sampling method
Number all items in the population
Let n=population size/sample size
Use random number generator from 1 to n to select the first item
Choose every nth item after that
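The systematic sampling steps above can be sketched in Python. This is only an illustration: the function name `systematic_sample` and the numbered population of 100 items are assumptions, not part of the original cards.

```python
import random

def systematic_sample(population, sample_size, rng=None):
    """Pick a random start in the first interval, then take every nth item."""
    rng = rng or random.Random()
    n = len(population) // sample_size      # interval: population size / sample size
    start = rng.randint(1, n)               # random start from 1 to n inclusive
    # Items are numbered from 1, so item number k sits at list index k - 1.
    return [population[k - 1] for k in range(start, start + n * sample_size, n)]

items = list(range(1, 101))                 # hypothetical population numbered 1..100
sample = systematic_sample(items, 10, random.Random(1))
print(sample)
```

Every selected item is exactly n places after the previous one, which is why an ordered or patterned sampling frame can introduce bias.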
Stratified sampling method
The population divided into groups
Decide how many to sample from each group using…
(Number in group/Number in population)×sample size
Use simple random sampling to select the items from each group
So it is proportional and representative
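A minimal sketch of the stratified method, assuming two hypothetical year groups of 120 and 80 students and a sample of 20. The name `stratified_sample` is illustrative only.

```python
import random

def stratified_sample(groups, sample_size, rng=None):
    """Sample each group in proportion to its share of the population."""
    rng = rng or random.Random()
    total = sum(len(members) for members in groups.values())
    sample = {}
    for name, members in groups.items():
        # (number in group / number in population) x sample size, rounded.
        # With awkward numbers the rounded counts may not add up to sample_size.
        k = round(len(members) * sample_size / total)
        sample[name] = rng.sample(members, k)   # simple random sample within the group
    return sample

groups = {"Year 12": list(range(120)), "Year 13": list(range(80))}
chosen = stratified_sample(groups, 20, random.Random(0))
print({name: len(picked) for name, picked in chosen.items()})
```

Here the groups contribute 12 and 8 items, matching their 60% and 40% shares of the population.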
2 methods of non random sampling
Opportunity sampling/convenience sampling
Quota sampling
Opportunity sampling method
Sample consists of any items available to be sampled
Items are sampled as they become available until the required sample size is reached
E.g. who walks into the frozen aisle of a supermarket
Quota sampling method
The population divided into groups
Decide how many to sample from each group using…
(Number in group/Number in population)×sample size
Sample the first “X” for each group and ignore any further items
Advantages of simple random sampling
Free of bias
Easy and cheap to implement for small populations and small samples
Each sampling unit has a known and equal chance of selection
Disadvantages of simple random sampling
Not suitable for a large population size or sample size
Large samples are expensive and time consuming and disruptive
Need a sampling frame
Advantages of systematic sampling
Simple and quick to use
Suitable for a large sample/population
Disadvantages of systematic sampling
Sampling frame needed
Can introduce bias if the sampling frame isn’t random
Advantages of stratified sampling
Sample accurately reflects the population
Guarantees proportional representation of groups within a population
Disadvantages of stratified sampling
Population must be clearly classified into discrete groups
Selection within each group has the same disadvantages as simple random sampling, e.g. need a sampling frame
Advantages of opportunity sampling
Easy to carry out
Inexpensive
Disadvantages of opportunity sampling
Unlikely to provide representative sample
Highly dependent on individual researcher
Not random so may introduce bias
Advantages of quota sampling
Allows a small sample to still be representative of the population
No sampling frame needed
Quick and easy and inexpensive
Allows for easy comparison between different groups in a population
Disadvantages of quota sampling
Not random so may introduce bias
Population must be divided into groups which can be costly or inaccurate
Non-responses are not recorded as such
Increasing scope of study increases number of groups which adds time and expense
Mean
A numerical measure
Given by Σx/n
What’s a median
A numerical measure
Position given by (n+1)/2 for non-grouped data
And n/2 for grouped data
Mode
Most common value
Range
Difference between the highest and lowest data value
Lower Quartile
Q1
Point that is a quarter of the way along an ordered data set
Given by (n+1)/4 for non-grouped data
And n/4 for grouped data
Upper Quartile
Q3
Point that is three quarters of the way along an ordered data set
Given by 3(n+1)/4 for non-grouped data
And 3n/4 for grouped data
IQR
Interquartile range
The difference between the lower and upper quartile
Q3-Q1
Variance
A measure of spread of data
σ²=Σ(x-x̄)²/n
Where x̄ is the mean
Standard deviation
A measure of spread of data
σ=sqrt of variance
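As a quick check of the formulas on these cards, here is a small sketch using the population forms (dividing by n). The data list is made up for illustration.

```python
from math import sqrt

def mean(values):
    """Sum of the values divided by how many there are."""
    return sum(values) / len(values)

def variance(values):
    """Mean of the squared deviations from the mean (divides by n)."""
    x_bar = mean(values)
    return sum((x - x_bar) ** 2 for x in values) / len(values)

def standard_deviation(values):
    """Square root of the variance."""
    return sqrt(variance(values))

data = [2, 4, 4, 4, 5, 5, 7, 9]             # hypothetical data set
print(mean(data), variance(data), standard_deviation(data))
```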
Can you use your calculator to get the median for linear interpolation
No
Not accurate
How do you use a calculator to get the mean, median, standard deviation, variance and quartiles
Shift, Menu/setup
3. Statistics
Frequency on
Menu/setup
6. Statistics
1. 1-Variable
Input values
AC (sets table)
OPTN 2 (1-Variable calc)
Discrete data
Can only take certain values and can have gaps
shoe size, money, number of sweets
Median for grouped
n/2
Median for non-grouped data
(n+1)/2
Continuous data
Can take any value in a certain range
height, time, length
Linear interpolation assumption
Assuming that the data values are evenly distributed within each class
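Under that assumption, the median of grouped data can be estimated as sketched below. The class boundaries and frequencies are hypothetical, and the helper name `interpolate` is not from the cards.

```python
def interpolate(lower_boundary, class_width, class_freq, cum_freq_before, position):
    """Estimate the value at a position, assuming values are evenly spread in the class."""
    fraction_into_class = (position - cum_freq_before) / class_freq
    return lower_boundary + fraction_into_class * class_width

# Hypothetical grouped data: 0-10 (f=5), 10-20 (f=10), 20-30 (f=5), so n = 20.
n = 20
median_position = n / 2                     # grouped data uses n/2
# Position 10 falls in the 10-20 class, which has 5 values before it.
median = interpolate(10, 10, 10, 5, median_position)
print(median)
```

The same helper works for quartiles by changing the position to n/4 or 3n/4 and using the class that contains that position.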
How do you work out standard deviation
Root of variance
Or
Page 3/4 of formula book
What is coding
A way of simplifying statistical calculations
Each data value is coded to make a new set of data values that are easier to work with
Coding formula for mean and standard deviation
Using the coding y=(x-a)/b
Mean: ȳ=(x̄-a)/b
Standard deviation: σy=σx/b
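The sketch below checks both coding results numerically, assuming the coding y=(x-a)/b with made-up values a=1000 and b=10.

```python
from statistics import mean, pstdev

x = [1010, 1020, 1030, 1040, 1050]          # hypothetical raw data
a, b = 1000, 10                             # assumed coding constants
y = [(xi - a) / b for xi in x]              # coded data: 1, 2, 3, 4, 5

# Subtracting a shifts the mean but not the spread; dividing by b scales both.
print(mean(y), (mean(x) - a) / b)
print(pstdev(y), pstdev(x) / b)
```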
What is an outlier
An extreme piece of data which differs significantly from other observed data values
Expected formula will be given in exam
What does it mean to clean data
Remove outliers
But keep the outliers in unless told otherwise
Mark with an x if you are able to identify them
Advantage of mode
Useful for non numerical data
Not usually affected by outliers or omissions
Always an observed data value
Disadvantage of mode
Does not use all data values
May not be representative if low frequency
May not be representative if in a small population
Advantage of median
Not usually affected by outliers or errors
Disadvantage of median
Not always a data value
Does not use all data values
Advantage of mean
When data is large a few extreme values have little effect
Uses all data values
Disadvantage of mean
May not always be a data value
Affected by outliers and errors if in a small population
Advantage of range as a measure of spread
Reflects the full data set
Disadvantage of range
Distorted by outliers
Advantage of using the IQR as a measure of spread
Not distorted by outliers
Disadvantages of using the IQR as a measure of spread
Does not reflect the full data set
Advantage of using the standard deviation as a measure of spread
When data is large a few outliers have negligible impact
Disadvantage of using the standard deviation as a measure of spread
When a data set is small a few outliers have a large impact
What is a box plot
Can be drawn to represent important features of data
AKA FIVE FIGURE SUMMARY since it displays the lowest and highest values, the quartiles and the median
Can display any outliers with an x or *
When can cumulative frequency be used
For grouped data
Can be an alternative way to estimate the median, quartiles or percentiles
Do you include outliers in range
Yes
Unless told otherwise
How do you construct a cumulative frequency graph
Calculate cumulative frequency
Appropriate scale
Plot each point at the upper class boundary, NOT the middle of the class
Find the quartile necessary by using the cumulative frequency to read off the value of ‘variable’ like height or time
When can a histogram be used
Grouped continuous data
Gives a good picture of how data is distributed and allows you to see the rough location and shape of the data spread
Relationship between area and frequency in histogram
Area is proportional to frequency
Frequency density
Frequency/class width
Assume there is an equal spread so you use the midpoint of each class
Don't join first and last point
What is a frequency polygon
Obtained by joining the middle of the top of each bar
How do you construct a histogram
Frequency density: frequency/class width
Frequency density on the y axis
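A short sketch of the frequency density calculation, using hypothetical classes with unequal widths. It also shows that bar area (density times width) recovers the frequency.

```python
def frequency_density(frequency, lower, upper):
    """Bar height for a histogram class: frequency divided by class width."""
    return frequency / (upper - lower)

# Hypothetical classes as (lower boundary, upper boundary, frequency).
classes = [(0, 10, 5), (10, 30, 20), (30, 40, 15)]
densities = [frequency_density(f, lo, hi) for lo, hi, f in classes]
areas = [d * (hi - lo) for d, (lo, hi, f) in zip(densities, classes)]
print(densities, areas)
```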
Assumption for using a frequency polygon
That the data is spread equally in classes
What is an experiment
A repeatable process that gives rise to a number of outcomes
What is an event
A collection of outcomes from an experiment to which a probability is assigned
What is a sample space
A set of all possible outcomes
How are probabilities written
Decimals
Fractions
What is a random variable
A variable whose value depends on the outcome of an event
Sample space in terms of discrete probability distribution
Range of values a random variable can take
When is a variable discrete
If it can only take certain numerical values
What is a probability distribution
Describes the probability of every outcome in the sample space
Several ways this can be displayed, e.g.
Table
Probability mass function
How can probability distribution be displayed
Table
Probability mass function
What is a discrete uniform distribution
Probability distribution in which the probability of each outcome is the same
What is a probability density function
The distribution for a continuous random variable
The area under the graph of this function represents probability
Explain the notation X~B(n,p)
Notation for the binomial distribution of X
Where n is the number of trials carried out
And p is the probability of success
When can a binomial distribution be applied
Two outcomes only, win and lose
There are a fixed number of trials, n
The probability of success is the same for each trial, p
All trials are independent
How do you find the probability for a binomial that’s equal to something
Menu 7 4 - Binomial PD 2 Enter with =
How do you find the probability for a binomial that’s less than or equal to
Menu 7 Down 1 - Binomial CD 2 Enter with =
How do you find the probability of a binomial that’s less than something
Calculator only works out less than or equal to
Rewrite P(X<k) as P(X≤k-1)
How do you find the probability of a binomial that’s greater than or greater than or equal to
Calculator only works out less than or equal to
Rewrite using 1 minus: P(X>k)=1-P(X≤k) and P(X≥k)=1-P(X≤k-1)
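The calculator's Binomial PD and Binomial CD can be mimicked in Python to check the rewriting rules above. The function names `binomial_pd` and `binomial_cd` and the example X~B(10, 0.5) are assumptions for illustration.

```python
from math import comb

def binomial_pd(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binomial_cd(k, n, p):
    """P(X <= k) for X ~ B(n, p)."""
    return sum(binomial_pd(i, n, p) for i in range(k + 1))

n, p = 10, 0.5                              # hypothetical distribution
p_equal = binomial_pd(3, n, p)              # P(X = 3)
p_at_most = binomial_cd(3, n, p)            # P(X <= 3)
p_less = binomial_cd(2, n, p)               # P(X < 3)  = P(X <= 2)
p_greater = 1 - binomial_cd(3, n, p)        # P(X > 3)  = 1 - P(X <= 3)
p_at_least = 1 - binomial_cd(2, n, p)       # P(X >= 3) = 1 - P(X <= 2)
print(p_equal, p_at_most, p_less, p_greater, p_at_least)
```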
How do you calculate the expected value of a binomial distribution
np
What is the expectation of a distribution
The long-term average
If the event was repeated many times the expected value would be the average of the outcomes
E[X] = μ = np
Where n is the number of trials and p is the probability
What is a hypothesis
A statement made about the value of a population parameter
Testing a hypothesis is done by carrying out an experiment or taking a sample from the population
What is a test statistic in terms of hypothesis testing
The result of the experiment or the statistic that is calculated from the sample
In order to carry out the test there must be two hypotheses
What is a null hypothesis
H0
The default position
What is expected
What is an alternative hypothesis
H1
Describes an alternative possibility
More, less, different
What is a one tailed test
Describes when you are testing whether a parameter is more or less than some number
What is a two tailed test
When you are testing whether a parameter is not equal to some number
2 methods to conduct a hypothesis test
Find critical region and compare to test statistic
Find the probability of being at least as extreme as the test statistic and compare it to the significance level
How do you construct a hypothesis testing conclusion
Compare the probability to significance level or test statistic to critical region
Accept/Reject H0
State the outcome - is/is not enough evidence to suggest…
Method for hypothesis testing with probabilities
State hypotheses
Assume H0 true and state the distribution being used
Expected value and diagram
Calculator to find the probability of interest
Compare to the given significance level and be careful of the tail of interest
If the probability is greater than the given significance then accept H0
Conclusion
When do you accept or reject H0 in hypothesis testing with probabilities
When calculated probability is greater than significance level you accept H0 so insufficient evidence to suggest…
When the calculated probability is less than the significance level then you reject H0 so there is sufficient evidence to suggest…
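The probability method can be sketched for a one-tailed binomial test. The scenario (20 coin tosses, 15 heads, 5% significance) is a made-up example, and `binomial_cd` is a helper written for this sketch.

```python
from math import comb

def binomial_cd(k, n, p):
    """P(X <= k) for X ~ B(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# Hypothetical test: H0: p = 0.5, H1: p > 0.5, 5% significance level.
n, p0, observed, significance = 20, 0.5, 15, 0.05

# Upper tail, assuming H0 is true: P(X >= 15) = 1 - P(X <= 14).
p_value = 1 - binomial_cd(observed - 1, n, p0)
reject_h0 = p_value < significance

print(round(p_value, 4), "reject H0" if reject_h0 else "accept H0")
```

Because the probability is below the significance level, this example ends with rejecting H0: there is sufficient evidence to suggest p is greater than 0.5.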
What is a critical region
The region of a probability distribution which, if the test statistic falls within, would cause the null hypothesis to be rejected
The critical value is the first value to fall inside the critical region
What is the critical value
The first value to fall inside the critical region
Acceptance region
Region of a probability distribution which, if the test statistic falls within, would cause the null hypothesis to be accepted (not rejected)
What is the actual significance level of a hypothesis test
The probability of the test statistic falling in the critical region assuming the null hypothesis is correct
What does the location of the critical region depend on
The type of alternative hypothesis
What is the significance level
The probability of incorrectly rejecting the null hypothesis
How do you find the critical region
State hypotheses
Assume H0 true and state binomial distribution
Calculate the expected value
Determine whether the critical region is before or after this
‘What numbers lie in the “significance level” percent?’
Menu
7
Down
Binomial CD
1 list
Input estimate numbers until you get a suitable value
For a critical region at the bottom, the cumulative probability must be lower than the significance level; for one at the top, it must be greater than 1 minus the significance level
How do you find the actual significance level
Once you’ve found the critical region
It is the probability of the test statistic falling in that critical region
What do you have to be careful of when finding the critical region above the expected value
Add one to the first value whose cumulative probability is just above 1 minus the significance level
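Instead of trying values in the calculator list, the search for a critical region can be sketched as a loop. This assumes X~B(20, 0.5) and a 5% level in each tail as a made-up example, and it treats the region's probability as needing to be strictly below the significance level.

```python
from math import comb

def binomial_cd(k, n, p):
    """P(X <= k) for X ~ B(n, p); returns 0 when k is negative."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def lower_critical_value(n, p, alpha):
    """Largest k with P(X <= k) < alpha, or None if there is no such k."""
    best = None
    for k in range(n + 1):
        if binomial_cd(k, n, p) < alpha:
            best = k
        else:
            break
    return best

def upper_critical_value(n, p, alpha):
    """Smallest k with P(X >= k) = 1 - P(X <= k - 1) < alpha, or None."""
    for k in range(n + 1):
        if 1 - binomial_cd(k - 1, n, p) < alpha:
            return k
    return None

n, p, alpha = 20, 0.5, 0.05                 # hypothetical test set-up
lower = lower_critical_value(n, p, alpha)   # critical region X <= lower
upper = upper_critical_value(n, p, alpha)   # critical region X >= upper
actual_significance = binomial_cd(lower, n, p)
print(lower, upper, round(actual_significance, 4))
```

Note how the upper critical value is one more than the last k for which P(X <= k) is still at or below 1 minus the significance level, which is the "add one" step on the card.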
How do you test hypotheses with the critical region
State hypotheses
Assume H0 true and state binomial distribution
E[X] and graph to determine location of critical region
Find critical region:
Menu 7
Down Binomial CD
1 List
Input approximate values
If test value is in the critical region then you reject H0
If test value is not in the critical region it is in the acceptance region for H0 so accept
Conclusion
Ø
The empty set
No intersections
Definition and formula for mutually exclusive events
When the events have no outcomes in common
P(AnB) = 0 and P(AuB) = P(A) + P(B)
Definition for independent events and formula
When one event has no effect on another
P(AnB) = P(A) x P(B)
Formula used to prove and test if independent
What can tree diagrams be used for
To show two or more events happening in succession
Explain the notation P(B|A)
The probability that B occurs given A has already occurred
What is conditional probability
A way of modelling situations in which the probability of an event can change depending on the outcome of a previous event
Formula for conditional probability
P(B|A) =P(BnA)/P(A)
Rule for independent events in conditional probability
P(A|B) = P(A|B') = P(A)
P(B|A) = P(B|A') = P(B)
Addition formula for the events A and B
P(AuB) = P(A) + P(B) - P(AnB)
Multiplication rule for conditional probability
P(B|A) = P(BnA)/P(A)
So
P(AnB) = P(B|A) x P(A)
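The probability formulas on these cards can be checked together with exact fractions. The three starting probabilities are invented for the example.

```python
from fractions import Fraction

# Hypothetical probabilities for two events A and B.
p_a = Fraction(1, 2)
p_b = Fraction(2, 5)
p_a_and_b = Fraction(1, 5)

p_b_given_a = p_a_and_b / p_a               # conditional: P(B|A) = P(BnA)/P(A)
p_a_or_b = p_a + p_b - p_a_and_b            # addition formula
independent = p_a_and_b == p_a * p_b        # independence test
mutually_exclusive = p_a_and_b == 0         # no outcomes in common

print(p_b_given_a, p_a_or_b, independent, mutually_exclusive)
```

In this example P(B|A) equals P(B), which is another way of seeing that the events are independent.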
Binomial vs normal distribution
Binomial is for discrete data
Normal is for continuous
What is a continuous random variable
A variable that can take any one of infinitely many values
What is the normal distribution
A continuous probability distribution that can model naturally occurring characteristics
Notation for normal distribution
X~N(μ,σ²)
If X is a normally distributed random variable with the population mean μ and variance σ²
What are the conditions for normal distribution
Symmetrical, mean=median=mode
Has a bell shaped curve with asymptote at each end
Has a total area under the curve of 1
Has points of inflection at μ+σ and μ-σ
What is a point of inflection
A point where a curve changes from convex to concave or vice versa
Rules for a normally distributed variable
Approximately 68% of data lies within one standard deviation of the mean (μ+/-σ)
Approximately 95% of the data lies within two standard deviations of the mean (μ+/-2σ)
Nearly all data (99.7%) lies within three standard deviations of the mean (μ+/-3σ)
How do you find probabilities using the normal distribution
Menu 7: Distribution
2: Normal CD
Enter μ and σ using =
Fill in the upper and lower boundaries
If only one boundary is given, use Lower = -99999 or Upper = 99999 for the missing one
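The same Normal CD calculation can be sketched with Python's standard library `statistics.NormalDist`. The distribution X~N(100, 15²) is an assumed example. Note that `NormalDist` takes the standard deviation, not the variance.

```python
from statistics import NormalDist

X = NormalDist(mu=100, sigma=15)            # hypothetical X ~ N(100, 15^2)

p_between = X.cdf(115) - X.cdf(85)          # P(85 < X < 115): both boundaries
p_below = X.cdf(85)                         # P(X < 85): no lower boundary needed
p_above = 1 - X.cdf(115)                    # P(X > 115): no upper boundary needed

# Check the 68%, 95%, 99.7% rules for 1, 2 and 3 standard deviations.
within = [X.cdf(100 + k * 15) - X.cdf(100 - k * 15) for k in (1, 2, 3)]
print(round(p_between, 4), [round(w, 4) for w in within])
```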
Explain P(X=a)=0
The probability of X taking any single exact value, a, is zero
A single value has no width, so there is no area under the curve above it
When do you use the inverse normal
When given a probability to calculate a value that satisfies an inequality
Calculator only calculates less than
How do you calculate inverse normal
Menu 7: Distribution
3: Inverse normal
Area is the area less than the value that satisfies the inequality
Since the calculator only works out P(X<a)
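A sketch of the inverse normal using `statistics.NormalDist.inv_cdf`, which, like the calculator, expects the area to the left of the value. The distribution and probabilities are assumed for illustration.

```python
from statistics import NormalDist

X = NormalDist(mu=100, sigma=15)            # hypothetical X ~ N(100, 15^2)

# Find a such that P(X < a) = 0.9: the area to the left is given directly.
a = X.inv_cdf(0.9)

# Find b such that P(X > b) = 0.2: convert to the area on the left, 1 - 0.2.
b = X.inv_cdf(1 - 0.2)

print(round(a, 2), round(b, 2))
```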
Why can’t you use PD for P(X<=1)
Must be CD
Since X can also be zero
How do you standardise a normally distributed variable
By coding the data
So that it is modelled by the standard normal distribution
Why is the standard normal distribution useful
To standardise a normally distributed variable
By coding it
Rules for the standard normal distribution
Z~N(0,1)
Has mean 0 and standard deviation 1
What can the standard normal distribution be used to find
μ or σ if they are unknown
Z=(x-μ)/σ
How do you use the standard normal distribution to find μ or σ
Z~N(0,1)
Draw both graphs with equivalent areas
Find the value of z for which P(Z<z) (or P(Z>z), etc.) equals the area
Substitute into Z=(x-μ)/σ to get the unknown value
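The steps above can be sketched as follows. Both scenarios are made up: in the first σ is known and μ is found, in the second μ is known and σ is found.

```python
from statistics import NormalDist

Z = NormalDist(0, 1)                        # standard normal, Z ~ N(0, 1)

# Case 1 (hypothetical): sigma = 5 and P(X < 20) = 0.9. Find mu.
z = Z.inv_cdf(0.9)                          # z with P(Z < z) = 0.9
sigma_known = 5
mu_found = 20 - z * sigma_known             # rearranged from z = (20 - mu) / sigma

# Case 2 (hypothetical): mu = 20 and P(X > 30) = 0.1. Find sigma.
z_upper = Z.inv_cdf(1 - 0.1)                # area to the left is 0.9
mu_known = 20
sigma_found = (30 - mu_known) / z_upper     # rearranged from z = (30 - mu) / sigma

print(round(mu_found, 3), round(sigma_found, 3))
```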
How can you test hypotheses about the mean of a normally distributed random variable
By looking at the mean of a sample called the sample mean
Formulas to use for hypothesis testing with the normal distribution
For a random sample of size n taken from a random variable X~N(μ,σ²), the sample mean distribution is given by
X̄~N(μ,σ²/n)
The mean is the same but the variance is different
What must be used when completing a hypothesis test with the normal distribution
The sample mean
Because you are using a sample of a given size and extrapolating that to give conclusions about the whole population
Method for using hypothesis testing with normal distribution
State hypotheses
Assume H0 true and state the sample mean distribution
Sketch the graph
Find the probability
Compare
Conclusion
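A sketch of this method for a one-tailed test on the mean. All the numbers (μ=50 under H0, σ=10, n=25, sample mean 53.5, 5% level) are assumptions chosen for the example.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical test: H0: mu = 50, H1: mu > 50, population sigma = 10.
mu_0, sigma, n = 50, 10, 25
sample_mean, significance = 53.5, 0.05

# Under H0 the sample mean follows N(mu, sigma^2 / n), so its sd is sigma / sqrt(n).
sample_mean_dist = NormalDist(mu_0, sigma / sqrt(n))

# Upper tail: probability of a sample mean at least as large as the one observed.
p_value = 1 - sample_mean_dist.cdf(sample_mean)
reject_h0 = p_value < significance

print(round(p_value, 4), "reject H0" if reject_h0 else "accept H0")
```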
What goes on y axis
Response/dependent variable
Expected to change in response to the other variable
What is a regression line
A line which fits as well as possible to the points on the scatter graph
Useful to identify a trend
PMCC
Product Moment Correlation Coefficient
Provides information on the type and strength of the correlation between two variables
Described by ‘r’
PMCC of 1
Perfect positive correlation
PMCC of -1
Perfect negative correlation
PMCC of 0
No correlation
PMCC between -0.2 and 0.2
Weak/poor correlation
PMCC of 0.75 to 1
Strong positive correlation
PMCC of -1 to -0.75
Strong negative correlation
Type and strength of correlation for a town’s annual income and the crime rate
Moderate negative correlation
Give the type and strength of correlation for the heights of fathers and their sons
Positive correlation
Give the type and strength of correlation between the cooking time for a chicken and the weight of the chicken
Strong positive correlation
Give the type and strength of the correlation between shoe size and salary
No correlation
When is a prediction made using a regression line unreliable
When the prediction is made in different conditions than those for the original sample data
Interpolation in terms of regression line
Using a regression line to make predictions which fall within the range of observed data
Stronger correlation means more reliable prediction
Extrapolation in terms of regression line
Making predictions outside of the range of observed data
Unreliable since no evidence that the pattern extends beyond the observed range
How do you find the regression line and PMCC (r)
Frequency off
Menu 6: Statistics
2: y=ax+b
Enter data items x and y in table
OPTN 4: Regression calc
Displays a and b for the regression line of form y=ax+b and 'r'/PMCC
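For comparison with the calculator output, the regression line of y on x and the PMCC can be computed from the usual least-squares sums. The function name and the five data points are hypothetical.

```python
from math import sqrt

def regression_and_pmcc(xs, ys):
    """Return (a, b, r) for the line y = ax + b and the PMCC r."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    a = s_xy / s_xx                         # gradient
    b = y_bar - a * x_bar                   # intercept: line passes through the means
    r = s_xy / sqrt(s_xx * s_yy)            # product moment correlation coefficient
    return a, b, r

xs = [1, 2, 3, 4, 5]                        # hypothetical explanatory values
ys = [2, 4, 5, 4, 5]                        # hypothetical response values
a, b, r = regression_and_pmcc(xs, ys)
print(round(a, 3), round(b, 3), round(r, 3))
```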
How do you put frequency on
Shift
Setup
3
1 or 2
What is causal correlation
When a change in one variable causes a change in the other
What is spurious correlation
Correlation without causal connection
What is a regression line
Line of best fit
y=c+mx
c: when “x” is zero “units”, “c” is the predicted number of “y”
m: every increase in “x” by “1 unit” corresponds to an increase/decrease in “y” by “m units”
What is curve fitting used for
To model polynomial and exponential relationships
Polynomial curve fitting equations
If y=ax^n
Then log(y)=log(a)+nlog(x)
Where Y=log(y) and X=log(x)
Exponential curve fitting equations
If y=kb^x
Then log(y)=log(k)+xlog(b)
Where Y=log(y)
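Both log transformations turn a curve into a straight line, so the constants can be recovered from a linear fit. The sketch below uses exact made-up data generated from y=3×2^x and y=2x^3, and `fit_line` is a helper written for this example.

```python
from math import log10

def fit_line(xs, ys):
    """Least-squares gradient and intercept of ys on xs."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    gradient = s_xy / s_xx
    return gradient, y_bar - gradient * x_bar

# Exponential model y = k b^x: plot log(y) against x.
xs = [0, 1, 2, 3, 4]
ys = [3, 6, 12, 24, 48]                     # hypothetical data from y = 3 * 2^x
gradient, intercept = fit_line(xs, [log10(y) for y in ys])
b, k = 10 ** gradient, 10 ** intercept      # gradient = log(b), intercept = log(k)

# Polynomial model y = a x^n: plot log(y) against log(x).
us = [1, 2, 3, 4, 5]
vs = [2, 16, 54, 128, 250]                  # hypothetical data from y = 2 * x^3
power, log_a = fit_line([log10(u) for u in us], [log10(v) for v in vs])
a = 10 ** log_a                             # gradient = n, intercept = log(a)

print(round(k, 6), round(b, 6), round(a, 6), round(power, 6))
```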
a and b in y=ab^t
a=initial value of y (when t=0)
b=factor by which y is multiplied each time t increases by 1
Why can you use hypothesis testing with correlation coefficient
To determine whether the product moment correlation coefficient, r, for a particular sample indicates that there is likely to be a linear relationship within the whole population
r vs ρ for correlation hypothesis testing
r is the PMCC for a sample
ρ is the PMCC for the population
Explain the hypotheses for correlation hypothesis testing
H0: ρ=0
H1: ρ>0 or ρ<0 or ρ≠0
Positive correlation, negative correlation, correlation
Method to find the critical region with PMCC then test hypothesis
Page 37 of the formula book, read off to find the critical region for r using the significance level and sample size
Sketch number line to determine if r is negative or positive
Assume no correlation to test alternative hypothesis
If r lies in the critical region (beyond the critical value) then reject H0
Conclusion
What is the large data set
Contains the weather data
For 5 UK weather stations
And 3 weather stations overseas
Why can’t you predict x for a value of y for the regression line y=mx+c
Regression line for y on x
Can only reliably be used to predict the y value