Statistics Flashcards
What value should you use if there is a trace amount of rainfall?
You should treat it as 0.025mm
International locations have more/less data than those in the UK.
less - they have limited data
How do you clean the data? (4)
- Missing data (n/a or -) should be removed (e.g. left out when calculating the mean)
- A value is assigned to trace
- Find and exclude any anomalies due to errors
- Make sure all values are given to the same number of decimal places/significant figures (generally already done)
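The four cleaning steps above can be sketched in Python. This is only an illustration: the function name, the "tr" marker for trace, the plausibility limit and the rounding precision are all assumptions, and the trace value follows the 0.025 mm convention from the first card.

```python
def clean_rainfall(raw, trace_value=0.025, max_plausible=500.0, dp=3):
    """Clean raw daily rainfall readings (strings) into a list of floats in mm."""
    cleaned = []
    for reading in raw:
        text = reading.strip().lower()
        if text in ("n/a", "-", ""):
            continue  # step 1: remove missing data
        if text == "tr":
            value = trace_value  # step 2: assign a value to trace
        else:
            value = float(text)
        if value < 0 or value > max_plausible:
            continue  # step 3: exclude anomalies caused by recording errors
        cleaned.append(round(value, dp))  # step 4: consistent precision
    return cleaned


# Hypothetical readings: one trace, two missing values, one obvious error.
readings = ["1.2", "tr", "n/a", "-", "9999", "0.4"]
cleaned = clean_rainfall(readings)
```

Only the three usable readings survive, with the trace replaced by its assigned value.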
What order are the UK cities in?
From south to north they are in alphabetical order, apart from Heathrow and Hurn (which are switched around)
Camborne, Hurn, Heathrow, Leeming, Leuchars
As we move further north, during May to October, the maximum hours of sunshine ___
increases
How is daily maximum relative humidity presented?
Percentages given to the nearest integer
Above what percentage of daily maximum relative humidity do you get mist and fog?
95%
What is daily mean windspeed measured in?
Knots to the nearest integer
1 knot = ?
1.15mph
What is daily mean direction measured in?
- Degrees clockwise from the north (like bearings) rounded to the nearest 10°
- Cardinal directions
Wind/gust/cardinal direction refers to the direction the wind is blowing ____.
from
Beaufort conversion daily mean windspeed
A discrete 13 point scale from 0 (calm) to 12 (hurricane)
In the LDS, there is light from 1 – 3, moderate at 4 and fresh at 5.
less windy to more windy on the Beaufort scale
light, moderate, fresh
daily maximum gust
The highest instantaneous windspeed recorded, measured in knots
daily maximum gust direction
The direction of the maximum gust of wind recorded
pressure units
hPa (hectopascals)
1hPa = ?
2 conversions
100 Pa or 1 millibar
Approximate low pressure
< 990 - 1000 hPa
Approximate average air pressure (at sea level)
1013 hPa
Approximate high pressure
1025 hPa
visibility
The greatest distance that an object can be seen and recognized in daylight
visibility units
Dm (decametres)
1 Dm = ?
10 m
What qualitative data is found in the large data set?
Beaufort scale
Calculate the lower quartile for:
9, 9, 10, 11, 12, 12, 12, 13, 14
Q1 position = 9 * 1/4 = 2.25, this rounds up to the 3rd pos
Q1 = 10
You round up to the next position, even if it’s only 0.25
Calculate the upper quartile for:
7, 9, 9, 10, 10, 11, 12, 12, 12, 13, 14, 14
Q3 position = 12 * 3/4 = 9, a whole number, so use the 9.5th pos (mean of 9th and 10th positions)
Q3 = (12 + 13) / 2 = 12.5
When the position is a whole number, you go halfway to the next position
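The quartile rule used on these cards can be sketched in Python. The function name and the way the quartile is passed in (as a numerator and denominator, to avoid floating-point positions) are assumptions for illustration, and this is the textbook rule rather than the interpolation method most software uses.

```python
def quartile(data, numerator, denominator):
    """Quartile by the rule: position = n * numerator / denominator.

    A fractional position rounds up to the next position; a whole-number
    position k uses the mean of the kth and (k+1)th values (counting from 1).
    """
    values = sorted(data)
    n = len(values)
    whole, remainder = divmod(n * numerator, denominator)
    if remainder == 0:
        return (values[whole - 1] + values[whole]) / 2
    return values[whole]  # 0-indexed 'whole' is the rounded-up position


# Lower quartile card: position 9 * 1/4 = 2.25, so take the 3rd value.
q1 = quartile([9, 9, 10, 11, 12, 12, 12, 13, 14], 1, 4)

# Whole-number case with made-up data: position 4 * 1/4 = 1,
# so take the mean of the 1st and 2nd values.
q1_whole = quartile([1, 2, 3, 4], 1, 4)
```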
interpercentile range
the difference between the values for two given percentiles
Describe a correlation vs. interpret a correlation between variables
Describe = weak/strong positive/negative or no correlation
Interpret = as (e.g. the rainfall) increases, the (e.g. sunshine hours) decreases, but you must be specific to the question
Bivariate data
Data which has pairs of values for two variables
How do you represent bivariate data?
On a scatter diagram
Regression line
The straight line that minimises the sum of the squares of the distances of each data point from it
Another name for the least squares regression line
Essentially it’s the ‘best’ line of best fit
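For a concrete picture of the least squares line y = a + bx, here is a minimal Python sketch using the standard formulas b = Sxy / Sxx and a = (mean of y) - b * (mean of x). The function name and the data are made up for illustration.

```python
def regression_line(xs, ys):
    """Return (a, b) for the least squares regression line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sxy and Sxx are the usual summary statistics.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b = s_xy / s_xx
    a = mean_y - b * mean_x
    return a, b


# Hypothetical points that lie exactly on y = 1 + 2x.
a, b = regression_line([1, 2, 3, 4], [3, 5, 7, 9])
```

Because these points are perfectly linear, the fitted intercept and gradient recover the line exactly. With real data the line only minimises the squared vertical distances.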
The equation of the regression line of GNP (y) on energy consumption (x) is y = 225 + 12.9x.
An economist uses this regression equation to estimate the energy consumption of a country with a Gross National Product of 3500.
Give one reason why this may not be a valid estimate.
The regression equation should only be used to predict a value of GNP (y) given the energy consumption (x).
In maths, can you extrapolate data?
No - you can only make valid estimates for values of x within the range of the data set
standard deviation definition
The average variability in the data set; i.e. how spread apart the data is from the mean.
The data (represented by x) is coded using the formula y = (x - a) / b
How do you get from the coded data’s mean to the original data’s mean?
Coded data (y) * b + a
The data (represented by x) is coded using the formula y = (x - a) / b
How do you get from the coded data’s standard deviation to the original data’s standard deviation?
Coded data (y) * b
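Both decoding rules can be checked with a small Python sketch. The data and the values a = 100 and b = 10 are made up for illustration, and b is assumed to be positive.

```python
import statistics


def decode_mean(coded_mean, a, b):
    """Undo y = (x - a) / b for the mean: multiply by b, then add a."""
    return coded_mean * b + a


def decode_sd(coded_sd, b):
    """Undo the coding for the standard deviation: multiply by b only.

    Adding or subtracting a shifts the data without changing its spread.
    Assumes b > 0.
    """
    return coded_sd * b


# Hypothetical data and coding constants.
a, b = 100, 10
x = [110, 120, 130, 140]
y = [(value - a) / b for value in x]

original_mean = decode_mean(statistics.mean(y), a, b)
original_sd = decode_sd(statistics.pstdev(y), b)
```

The decoded results agree with the mean and standard deviation calculated directly from x.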
You are given a grouped frequency table showing the time taken for children to finish a race. How do you calculate an estimate for the standard deviation of the children’s race times?
- Fix the class widths (if needed)
- Calculate the total frequency (Σf)
- Find the midpoints (x)
- Multiply the midpoints by the frequency (fx)
- Square the midpoints and multiply by the frequency (fx²)
- Calculate the totals Σfx and Σfx²
- Calculate the standard deviation:
√( Σfx² / Σf − (Σfx / Σf)² )
You can use the table function in your calculator to speed things up
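The same steps can be sketched in Python. The function name and the class midpoints and frequencies are made up for illustration, and the result is only an estimate because every value is treated as sitting at its class midpoint.

```python
import math


def grouped_sd(midpoints, frequencies):
    """Estimate the standard deviation from class midpoints and frequencies."""
    total_f = sum(frequencies)
    total_fx = sum(f * x for x, f in zip(midpoints, frequencies))
    total_fx2 = sum(f * x ** 2 for x, f in zip(midpoints, frequencies))
    mean = total_fx / total_f
    variance = total_fx2 / total_f - mean ** 2  # mean of squares minus square of mean
    return math.sqrt(variance)


# Hypothetical classes 0-10, 10-20, 20-30 seconds with frequencies 1, 2, 1.
sd = grouped_sd([5, 15, 25], [1, 2, 1])
```

Here the totals are 4, 60 and 1100, so the variance is 1100/4 - 15² = 50.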
Advantage of box plots
Can easily compare multiple different groups
Disadvantage of box plots
Doesn’t show trends as easily as e.g. scatter graphs
cleaning the data
removing anomalies from a data set
experiment
A repeatable process that gives lots of outcomes
event
A collection of one or more outcomes
another way of saying “not A”
complement of A
mutually exclusive events
When events have no outcomes in common; they can’t happen at the same time so the circles don’t overlap in Venn diagrams
independent events
When one event has no effect on another; the probability of A occurring is the same whether or not B happens
How can you calculate mutually exclusive events?
P(A or B) = P(A) + P(B)
How can you calculate independent events?
P(A and B) = P(A) × P(B).
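Both rules can be checked on a fair six-sided die with a short Python sketch. The events and helper names are made up for illustration, and exact fractions are used so the comparisons are not affected by rounding.

```python
from fractions import Fraction

SAMPLE_SPACE = frozenset(range(1, 7))  # outcomes of one roll of a fair die


def prob(event):
    """Probability of an event (a set of outcomes), all outcomes equally likely."""
    return Fraction(len(event & SAMPLE_SPACE), len(SAMPLE_SPACE))


def mutually_exclusive(a, b):
    """True when the events share no outcomes."""
    return len(a & b) == 0


def independent(a, b):
    """True when P(A and B) = P(A) * P(B)."""
    return prob(a & b) == prob(a) * prob(b)


low, high = {1, 2}, {5, 6}
even = {2, 4, 6}

# Mutually exclusive: P(A or B) = P(A) + P(B).
addition_rule_holds = prob(low | high) == prob(low) + prob(high)

# Independent: P(even and low) = P({2}) = 1/6 = 1/2 * 1/3.
even_low_independent = independent(even, low)
```

Note that "low" and "high" are mutually exclusive but not independent: knowing one happened tells you the other did not.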
A sample of 10 children is taken. 4 children have a height between 80 and 90cm. Estimate how many have a height between 80 and 85cm, and state one assumption you made.
5/10 * 4 = 2
The children’s heights are uniformly distributed in the 80 < h < 90cm class.
P(X = x) = 2 / x², x = 2, 3, 4
Explain how you know that Marie’s function does not describe a probability distribution.
The sum of the probabilities does not equal 1.
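The check can be done exactly in Python, reading the function as 2 divided by x squared (an assumption about the notation on the card). The helper name is made up for illustration.

```python
from fractions import Fraction


def total_probability(pmf, values):
    """Add up P(X = x) over every value x that X can take."""
    return sum(pmf(x) for x in values)


# P(X = x) = 2 / x^2 for x = 2, 3, 4
total = total_probability(lambda x: Fraction(2, x ** 2), [2, 3, 4])
is_valid_distribution = total == 1
```

The total is 1/2 + 2/9 + 1/8 = 61/72, which is not 1, so the function is not a probability distribution.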
When can you model X with a binomial distribution? What needs to happen? (4)
- There must be a fixed number of trials (n)
- There must be two set outcomes
- There must be a constant probability of success
- Each trial is independent of one another
These are assumptions you make when modelling a binomial distribution.
probability mass function
A function over the sample space of a discrete random variable which gives the probability that X is equal to a certain value. Can be presented as a function, table or graph
written as e.g. P(X = x) = 1/6
probability distribution
A function that describes the probability of any outcome in the sample space.
It can be represented as a function, table or diagram.
test statistic (+ example)
The result of an experiment or the statistic that is calculated from the sample e.g. the number of heads out of 10 trials
population parameter
The probability of something occurring in the hypothesis
hypothesis
A statement made about the value of a population parameter
A researcher asks some people whether they shop with their own carrier bag. 17 out of 25 people sampled said they do.
They want to test, at the 5% significance level, whether over 60% of shoppers try to be sustainable by using their own carrier bag.
Explain the condition under which the null hypothesis would be rejected.
H0: p = 0.6
H1: p > 0.6
The null hypothesis would be rejected when the probability of 17 or more people from a sample of 25 using their own carrier bag is less than 0.05, given that p = 0.6.
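The probability in this condition can be worked out with a short Python sketch, using the numbers from the card (n = 25, p = 0.6, 17 observed). The function name is an assumption for illustration; a calculator's binomial cumulative function does the same job.

```python
from math import comb


def binomial_upper_tail(n, p, x):
    """P(X >= x) for X ~ B(n, p), summing the individual probabilities."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(x, n + 1))


# Assume H0 is true (p = 0.6) and find P(X >= 17) for a sample of 25.
p_value = binomial_upper_tail(25, 0.6, 17)
reject_h0 = p_value < 0.05
```

For these numbers the probability comes out well above 0.05, so H0 would not be rejected.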
critical value
The first value to fall inside of the critical region
significance level
The probability (usually given as a percentage) of rejecting the null hypothesis, when in fact it is true
actual significance level
The probability of the test statistic falling within the critical region, given that H0 is true
How does the actual significance level differ from the tested significance level (threshold probability)?
They are the same for continuous data but may differ for discrete data
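To see why the two can differ for discrete data, here is a Python sketch that finds the upper critical region for the carrier bag test (X ~ B(25, 0.6), 5% level) and the probability of landing in it. The function name is an assumption for illustration.

```python
from math import comb


def upper_critical_region(n, p, alpha):
    """Return (critical value c, P(X >= c)) for the smallest c with P(X >= c) < alpha.

    Works down from x = n, adding one probability at a time.
    Returns None if even P(X = n) is not below alpha.
    """
    tail = 0.0
    result = None
    for c in range(n, -1, -1):
        tail += comb(n, c) * p ** c * (1 - p) ** (n - c)
        if tail < alpha:
            result = (c, tail)
        else:
            break
    return result


critical_value, actual_level = upper_critical_region(25, 0.6, 0.05)
```

The probabilities jump in steps, so the region cannot hit 5% exactly: the actual significance level is the largest tail probability that still stays below it.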
How can you find which tail a test statistic lies in a two-tailed test?
X ~ B(n,p)
n * p is the expected value of X (the expected number of successes)
If x < np, then you consider P(X ≤ x)
If x > np, then you consider P(X ≥ x)
You can find which tail a test statistic lies in a two-tailed test, so you don’t have to test both tails. Explain why this works.
for understanding
X ~ B(n,p)
n * p is the expected value of X (the expected number of successes)
If x < np, then you consider P(X ≤ x)
This means that the thing being tested occurred less than expected, so you need to find the lower critical value to see if the test statistic is low enough to fall within the lower critical region or not. (The higher critical region is somewhere off in the far distance - it’s far too common)
If x > np, then you consider P(X ≥ x)
This means that the thing being tested occurred more than expected, so you need to find the higher critical value to see if the test statistic is high enough to fall within the higher critical region or not. (The lower critical region is somewhere off in the far distance - it’s far too uncommon)
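Choosing the tail can be sketched in Python as follows. The function name and the example numbers are made up for illustration, and an observation exactly equal to np is sent to the upper tail here, which is an arbitrary choice the cards do not cover.

```python
from math import comb


def tail_probability(n, p, x):
    """Pick the tail the observed value x lies in and return (tail, probability)."""
    def pmf(k):
        return comb(n, k) * p ** k * (1 - p) ** (n - k)

    expected = n * p  # expected number of successes
    if x < expected:
        return "lower", sum(pmf(k) for k in range(0, x + 1))  # P(X <= x)
    return "upper", sum(pmf(k) for k in range(x, n + 1))      # P(X >= x)


# Hypothetical test: X ~ B(10, 0.5), so np = 5.
low_tail, low_prob = tail_probability(10, 0.5, 2)    # P(X <= 2) = 56/1024
high_tail, high_prob = tail_probability(10, 0.5, 9)  # P(X >= 9) = 11/1024
```

In a two-tailed test the probability found this way is compared with half of the significance level, since the level is split between the two tails.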