S1 Statistics Flashcards
Positive Skew
Q3 - Q2 > Q2 - Q1 or Mean > Median
Negative Skew
Q2 - Q1 > Q3 - Q2 or Median > Mean
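The two quartile tests for skew can be sketched in a few lines of Python. The function name and the sample quartiles are made up for illustration, and calling the equal case "symmetric" is an assumption beyond what the cards state.

```python
def skew_from_quartiles(q1, q2, q3):
    """Classify skew by comparing the two halves of the interquartile range."""
    upper = q3 - q2  # distance from the median up to the upper quartile
    lower = q2 - q1  # distance from the lower quartile up to the median
    if upper > lower:
        return "positive"
    if lower > upper:
        return "negative"
    return "symmetric"  # assumption: equal halves suggest no skew

# Hypothetical quartiles
print(skew_from_quartiles(10, 14, 25))  # positive
print(skew_from_quartiles(10, 21, 25))  # negative
```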
Frequency
frequency density × class width
Qualitative Variables
Non-numerical - e.g. red, blue or long, short etc.
Quantitative Variables
Numerical - e.g. length, age, time, number of coins in pocket, etc
Continuous Variables
Can take any value within a given range - e.g. height, time, age, etc.
Discrete Variables
Can only take certain values - e.g. shoe size, cost in £ and p, number of coins.
Mode
The value, or class interval, which occurs most often.
Linear Interpolation
Median = lower boundary of median class + ((n/2 - cumulative frequency before the median class) / frequency of median class) x class width
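A minimal sketch of linear interpolation for the median of grouped data, using the usual form: lower class boundary plus the fraction of the way into the class, times the class width. The function name, parameter names and the example table are hypothetical.

```python
def interpolated_median(lower_boundary, class_width, class_frequency,
                        cumulative_before, n):
    """Estimate the median of grouped data by linear interpolation."""
    # Fraction of the way through the median class at which the (n/2)th value lies
    fraction = (n / 2 - cumulative_before) / class_frequency
    return lower_boundary + fraction * class_width

# Hypothetical table: 40 values in total; the median class is 20-30
# with frequency 16, and 14 values fall in the classes before it.
median = interpolated_median(20, 10, 16, 14, 40)
print(median)  # 23.75
```

The same function gives Q1 or Q3 if n/2 is swapped for n/4 or 3n/4 and the matching class is used.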
Interquartile Range
Q3 - Q1
Variance Formula
(sum of the squares of x / n) - ((sum of x / n) squared)
Standard Deviation
the square root of the variance
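The variance and standard deviation cards translate directly into code. This is a sketch with made-up data; the function names are for illustration only.

```python
from math import sqrt

def variance(values):
    """Mean of the squares minus the square of the mean."""
    n = len(values)
    mean_of_squares = sum(x * x for x in values) / n
    square_of_mean = (sum(values) / n) ** 2
    return mean_of_squares - square_of_mean

def standard_deviation(values):
    """Square root of the variance."""
    return sqrt(variance(values))

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data set
print(variance(data))            # 4.0
print(standard_deviation(data))  # 2.0
```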
Addition Law
P(AUB) = P(A) + P(B) - P(A intersect B)
P(B|A)
P(A and B)/P(A)
Mutually Exclusive Addition Rule
P(AUB) = P(A) + P(B)
Mutually exclusive intersection
P(A intersect B) = 0
Independent Event
One event does not affect the probability of the other
Independent: P(A|B) =
P(A)
Independent multiplication rule
P(A intersect B) = P(A) x P(B)
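The probability rules above can be checked on a small example. The sketch below assumes a fair six-sided die with two made-up events; exact fractions are used so the independence test is not upset by rounding.

```python
from fractions import Fraction

def p_union(p_a, p_b, p_a_and_b):
    """Addition law: P(A u B) = P(A) + P(B) - P(A n B)."""
    return p_a + p_b - p_a_and_b

def p_b_given_a(p_a_and_b, p_a):
    """Conditional probability: P(B|A) = P(A n B) / P(A)."""
    return p_a_and_b / p_a

def independent(p_a, p_b, p_a_and_b):
    """Multiplication rule test: P(A n B) = P(A) x P(B)."""
    return p_a_and_b == p_a * p_b

# Hypothetical events on a fair die: A = even {2, 4, 6}, B = at most 2 {1, 2}
p_a = Fraction(3, 6)
p_b = Fraction(2, 6)
p_both = Fraction(1, 6)  # only the outcome 2 is in both events

print(p_union(p_a, p_b, p_both))      # 2/3
print(p_b_given_a(p_both, p_a))       # 1/3, the same as P(B)
print(independent(p_a, p_b, p_both))  # True
```

For mutually exclusive events the intersection is 0, so the addition law reduces to P(A) + P(B).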
Product Moment Correlation Coefficient (PMCC)
a quantity between -1.0 and 1.0 that estimates the strength of the linear relationship between two random variables. Close to -1, strong negative correlation. Close to 1, strong positive correlation.
If the scale of the data changes (coding), the PMCC (correlation) ….
PMCC (correlation) stays the same
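A sketch of why coding leaves the PMCC unchanged: the coefficient is computed from Sxy, Sxx and Syy, and a linear coding with a positive scale factor cancels out of the ratio. The data and the coding below are made up. A negative scale factor would flip the sign of the PMCC, which this example does not cover.

```python
from math import sqrt

def pmcc(xs, ys):
    """Product moment correlation coefficient r = Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

# Hypothetical data: coin age (years) and weight (grams)
ages = [1, 2, 3, 4, 5]
weights = [9.8, 9.5, 9.1, 9.0, 8.6]

# Code the weights: subtract 9, then multiply by 10
coded_weights = [(w - 9) * 10 for w in weights]

r = pmcc(ages, weights)
r_coded = pmcc(ages, coded_weights)
print(abs(r - r_coded) < 1e-9)  # True: coding has not changed the PMCC
```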
Linear Regression
a statistical method used to fit a linear model to a given data set (basically best fit line)
Reliable Regression
Estimates made for values within the range of the data (interpolation); estimates outside the range (extrapolation) are unreliable
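A minimal sketch of fitting a least-squares line y = a + bx and flagging whether an estimate is an interpolation. It assumes the standard gradient b = Sxy / Sxx and intercept a = (mean of y) - b x (mean of x). The function names and data are hypothetical.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares regression line y = a + bx."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b = s_xy / s_xx
    a = mean_y - b * mean_x
    return a, b

def estimate(a, b, x, xs):
    """Return (estimate, reliable); reliable is True only inside the data range."""
    return a + b * x, min(xs) <= x <= max(xs)

xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]  # made-up points lying exactly on y = 1 + 2x
a, b = fit_line(xs, ys)
print(a, b)                     # 1.0 2.0
print(estimate(a, b, 2.5, xs))  # (6.0, True)   interpolation
print(estimate(a, b, 10, xs))   # (21.0, False) extrapolation
```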
Standard Deviation Definition
a measure that is used to quantify the amount of variation or dispersion of a set of data values
Discrete Random Variable
A random variable that can only take certain values
Discrete Uniform Distribution
Every outcome has the same probability
F(x)
Cumulative distribution function, F(x) = P(X ≤ x); the probabilities accumulate up to 1
E(x)
The expected value (the long-run mean if the experiment were repeated many times): the sum of x × P(X = x) over all values of x
Variance of X
Var(X) = E(X^2) - (E(X))^2
Mean of the squares minus the square of the mean
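For a discrete random variable, the expected value, the variance (mean of the squares minus the square of the mean) and the cumulative distribution function can be sketched as follows. A fair six-sided die is assumed as the example; the function names are for illustration only.

```python
from fractions import Fraction

def expectation(dist):
    """E(X): the sum of x * P(X = x)."""
    return sum(x * p for x, p in dist.items())

def var_x(dist):
    """Var(X) = E(X^2) - (E(X))^2."""
    e_x2 = sum(x * x * p for x, p in dist.items())
    return e_x2 - expectation(dist) ** 2

def cdf(dist, x):
    """F(x) = P(X <= x)."""
    return sum(p for value, p in dist.items() if value <= x)

# Assumed example: a fair die, a discrete uniform distribution on 1..6
die = {x: Fraction(1, 6) for x in range(1, 7)}
print(expectation(die))  # 7/2
print(var_x(die))        # 35/12
print(cdf(die, 2))       # 1/3
print(cdf(die, 6))       # 1
```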
Expected value is affected by multiplication, division, subtraction, addition
E(4x+1) = 4E(x) + 1
E(1 - x) = 1 - E(x)
E(x/2) = E(x) / 2
Variance is not affected by addition and subtraction, but is affected by multiplication and division
For variance, you have to square the multiplier:
Var(4x) = 16Var(x)
Var(x+1) = Var(x)
Var(3x+2) = 9Var(x)
Var(x/2) = 1/4 Var(x)
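The coding rules E(aX + b) = aE(X) + b and Var(aX + b) = a^2 Var(X) can be checked numerically. The distribution below is made up, and `transform` is a hypothetical helper that builds the distribution of aX + b.

```python
from fractions import Fraction

def expectation(dist):
    """E(X): the sum of x * P(X = x)."""
    return sum(x * p for x, p in dist.items())

def variance(dist):
    """Var(X) = E(X^2) - (E(X))^2."""
    return sum(x * x * p for x, p in dist.items()) - expectation(dist) ** 2

def transform(dist, a, b):
    """Distribution of aX + b (assumes a != 0, so coded values stay distinct)."""
    return {a * x + b: p for x, p in dist.items()}

# Made-up distribution of X
X = {1: Fraction(1, 2), 2: Fraction(1, 3), 3: Fraction(1, 6)}
Y = transform(X, 3, 2)  # Y = 3X + 2

print(expectation(X), variance(X))  # 5/3 5/9
print(expectation(Y))               # 7, which is 3 * 5/3 + 2
print(variance(Y))                  # 5, which is 9 * 5/9
```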
Independent Rule: P(A intersect B) =
P(A) x P(B)
Normal Distribution Formula
X ~ N(μ, σ^2)
Standardizing Formula: Z =
(x - Mean) / standard deviation
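A short sketch of standardising a value and looking up the probability, using `NormalDist` from Python's standard `statistics` module in place of printed tables. The distribution N(50, 4^2) and the value 56 are made up for illustration.

```python
from statistics import NormalDist

def standardise(x, mean, sd):
    """Z = (x - mean) / standard deviation."""
    return (x - mean) / sd

# Hypothetical: X ~ N(50, 4^2), so the standard deviation is 4 (not 16)
z = standardise(56, 50, 4)
print(z)  # 1.5

# P(X < 56) = P(Z < 1.5), read from the standard normal distribution
p = NormalDist().cdf(z)
print(round(p, 4))  # 0.9332
```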
Explain why a histogram is appropriate for this data
Data is continuous
Explain why this diagram would support the fitting of a regression line of x onto y.
The points are close to an implied straight line of best fit. There is a strong correlation within the data
Which is the explanatory variable?
The variable that influences the other variable. E.g. the explanatory variable is the age of each coin, because the age is set and the weight varies.
Give a reason to support the use of normal distribution in this case
Mean and median are very close
- but when data is skewed, normal distribution will not be a good fit
It was discovered that a coin in the original sample, which was 5 years old and weighed 20 grams, was a fake.
State whether the exclusion of this coin would increase or decrease the value of the PMCC. Give a reason for your answer
It would decrease the value of the PMCC, moving it closer to -1, because removing the fake results in a better linear fit
Write down 2 of these events that are mutually exclusive. Give a reason for your answer
No overlap between events
State whether the estimate of the mean or median is a better representation of the average speed of traffic on the road
- Mean is a better representation because it uses all the data
OR - Median is better because data is skewed and median is not affected by extreme values
Comment on shape of distribution
- skewness because …
- symmetric because median is similar to mean
What happens to the median estimate when values below the median change?
The median remains the same because the values that change are still below the median
What happens to the mean estimate when values become lower than they used to be?
The mean would decrease because the changes reduce the total of x
What happens to the standard deviation estimate when values become lower than they used to be?
The standard deviation would increase because the data is more spread out
Response Variable
The dependent variable. The variable being studied