Topic 2 Flashcards
What is the purpose of descriptive statistics?
To describe or summarise the overall pattern of data
How do you describe numerical data?
The three S’s - shape, centre and spread (plus outliers)
How do you describe categorical data?
Table of frequencies or proportions
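As a rough illustration of the card above, a frequency table with proportions can be built in a few lines. The eye-colour sample and the variable names are made up for this sketch.

```python
from collections import Counter

# Hypothetical sample of one categorical variable (not from the course notes).
eye_colour = ["brown", "blue", "brown", "green", "brown", "blue"]

counts = Counter(eye_colour)          # frequency of each category
n = len(eye_colour)
proportions = {category: count / n for category, count in counts.items()}

# Print one row per category: name, frequency, proportion.
for category, count in counts.most_common():
    print(f"{category:<6} {count:>3} {proportions[category]:.2f}")
```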
What is a symmetrical shape?
Right and left side mirrored, can also be bell-shaped
What is a shape that is skewed to the left?
Left side extends further out than the right side
What is a shape that is skewed to the right?
Right side extends further out than the left side
What is a symmetrical, bimodal shape?
Symmetrical with two peaks
What is a symmetrical, uniform shape?
Symmetrical and flat
What is an outlier? What may be the cause of them?
Observations that deviate from the overall pattern of the distribution. They may be caused by natural variation or measurement error.
What are numerical summaries for centre or location? (3)
Mode, median, mean
What are numerical summaries for spread? (3)
Range, inter-quartile range (IQR), standard deviation
What is mode?
The most common value or peak of data
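A small sketch of finding the mode, assuming Python's standard `statistics` module. The data are invented; `multimode` is used because it returns every tied value rather than just one.

```python
from statistics import multimode

# Hypothetical data sets (not from the course notes).
shoe_sizes = [7, 8, 8, 9, 9, 9, 10]
tied = [1, 2, 2, 3, 3]

modes_single = multimode(shoe_sizes)  # one most common value: 9
modes_tied = multimode(tied)          # two values tie for most common: 2 and 3

print(modes_single, modes_tied)
```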
What is median?
The middle; the value that divides an ordered data set into two equal halves
For what types of variables would you find the median?
Ordinal, discrete and continuous
What is mean?
The average of the data, found by adding all values and dividing by the number of cases
What does the ‘x bar’ symbol represent?
Mean
Is mean or median resistant to outliers/skewness and why?
Median, because it depends only on the middle position of the ordered data. The mean uses every value, so outliers and long tails pull it towards them.
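To see the resistance of the median in practice, the sketch below replaces the largest value of a small made-up data set with an extreme one and recomputes both summaries.

```python
from statistics import mean, median

# Hypothetical data: identical except for the last value.
data = [2, 3, 4, 5, 6]
with_outlier = [2, 3, 4, 5, 60]

mean_before, median_before = mean(data), median(data)
mean_after, median_after = mean(with_outlier), median(with_outlier)

# The mean jumps from 4 to 14.8, while the median stays at 4.
print(mean_before, median_before)
print(mean_after, median_after)
```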
Mean ? median in symmetrical data?
Mean = median
Mean ? median in skewed left data?
Mean < median
Mean ? median in skewed right data?
Mean > median
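The three comparisons above can be checked on small invented samples. The data sets below are assumptions chosen only to have the stated shapes.

```python
from statistics import mean, median

# Hypothetical samples, one per shape.
symmetric = [1, 2, 3, 4, 5]       # mirrored around 3
left_skewed = [1, 8, 9, 9, 10]    # long tail towards low values
right_skewed = [1, 2, 2, 3, 10]   # long tail towards high values

for name, sample in [("symmetric", symmetric),
                     ("left-skewed", left_skewed),
                     ("right-skewed", right_skewed)]:
    print(f"{name:<12} mean={mean(sample):.1f} median={median(sample)}")
```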
What is the ‘range’ of data?
The difference between the largest and smallest values in the data set
What are the first, second and third quartiles?
Q1 - 25% of data below Q1
Q2 - 50% of data below Q2 - aka the median
Q3 - 75% of data below Q3
How do you calculate quartiles? (4 steps)
- Arrange data from lowest to highest
- Calculate the median (M)
- Calculate Q1 - median of the first half of data (excluding M)
- Calculate Q3 - median of the second half of the data (excluding M)
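The four steps above can be sketched as a short function. The function name `quartiles` and the sample are illustrative. When the number of values is even, M falls between two observations, so the halves are simply the lower and upper halves of the data. Software packages may use slightly different quartile conventions and so give slightly different answers.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, M, Q3) using the 'median of each half' method."""
    ordered = sorted(values)              # step 1: arrange lowest to highest
    n = len(ordered)
    m = median(ordered)                   # step 2: the median M
    half = n // 2
    lower = ordered[:half]                # values below M
    # For odd n the middle value is M itself, so skip it.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), m, median(upper)  # steps 3 and 4

# Hypothetical data with an odd number of values.
q1, m, q3 = quartiles([7, 1, 3, 9, 5, 11, 13])
print(q1, m, q3)
```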
How do you find the interquartile range (IQR)?
IQR = Q3-Q1
What is the 1.5 × IQR rule used for?
A criterion used to identify outliers
How do you find the lower threshold to identify any low outliers?
Q1 - 1.5 × IQR
How do you find the upper threshold to identify any high outliers?
Q3 + 1.5 × IQR
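A minimal sketch of applying the two thresholds, using made-up data and the "median of each half" method for Q1 and Q3. Any value below the lower threshold or above the upper threshold is flagged.

```python
from statistics import median

# Hypothetical data containing one suspiciously large value.
data = sorted([5, 1, 40, 7, 3, 11, 9])
n = len(data)
half = n // 2

q1 = median(data[:half])
# For odd n, skip the middle value (the median itself).
q3 = median(data[half + 1:] if n % 2 else data[half:])
iqr = q3 - q1

lower = q1 - 1.5 * iqr   # lower threshold
upper = q3 + 1.5 * iqr   # upper threshold
outliers = [x for x in data if x < lower or x > upper]

print(q1, q3, iqr, lower, upper, outliers)
```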
What is the 5-number summary?
Summary of the minimum, Q1, median, Q3 and maximum
What two values are represented by the sides of a box on a boxplot?
Q1 and Q3
What does the line in a box of a boxplot indicate?
The median
What value is s squared?
Variance
What do you do to the value of variance to get the standard deviation?
Find the square root of variance
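A short worked sketch of the link between variance and standard deviation. It assumes the usual sample formula, dividing by n - 1, which the cards do not state explicitly. The data are invented.

```python
from math import sqrt

# Hypothetical sample.
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)

x_bar = sum(data) / n                                      # sample mean
# Assumption: sample variance divides by n - 1, not n.
variance = sum((x - x_bar) ** 2 for x in data) / (n - 1)   # s squared
sd = sqrt(variance)                                        # s

print(x_bar, round(variance, 3), round(sd, 3))
```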
What does a small standard deviation imply?
The data is concentrated around the mean
What does a large standard deviation imply?
The data is widely spread around the mean
Is standard deviation or IQR used more commonly? Which is resistant and sensitive to outliers?
Standard deviation is used more commonly; however, it is sensitive to outliers. IQR is resistant to outliers.
What measure of centre is used for symmetrical data?
Mean
What measure of spread is used for symmetrical data?
Standard deviation
What measures of centre are used for data that is skewed or with outliers?
Median and mean (the median is resistant, so it is the more reliable of the two)
What measures of spread are used for data that is skewed or with outliers?
Standard deviation and IQR (the IQR is resistant, so it is the more reliable of the two)
What graphs are used with one categorical and one numerical variable? (2)
- Side-by-side boxplots
- Side-by-side (vertically aligned) histograms
What graph is used with two numerical variables?
Scatterplot
What numerical summary is used with two numerical variables?
Correlation coefficient - r
What does a response variable measure/record? On which axis is it plotted?
A response variable measures the outcome of a study. It is plotted on the y-axis
What does an explanatory variable measure/record? On which axis is it plotted?
An explanatory variable explains the changes in the response variable. It is plotted on the x-axis
What is an independent variable compared to a dependent variable?
A variable that can be controlled to determine the value of a dependent variable
What are some synonymous terms for independent variable? (6)
- Explanatory variable
- Predictor variable
- Controlled variable
- Regressor
- Manipulated variable
- Input variable
What are some synonymous terms for dependent variable? (6)
- Outcome variable
- Response variable
- Measured variable
- Regressand
- Observed variable
- Output variable
Does correlation always imply causation?
No
What graphs would be used for a continuous Y variable and a categorical X variable? (2)
- Side-by-side boxplots
- Vertically aligned histograms
What graph would be used for a continuous Y variable and a continuous X variable?
Scatterplot
What graph would be used for a categorical Y variable and a categorical X variable?
Clustered bar chart
What is the correlation coefficient a measure of?
It is a measurement of the strength of the linear relationship between two continuous variables, X and Y
With what graph do you always use the correlation coefficient?
Scatterplot
If the correlation coefficient r > 0, what does this mean for the linear relationship between X and Y?
r > 0 means as X increases, Y tends to increase
If the correlation coefficient r < 0, what does this mean for the linear relationship?
r < 0 means as X increases, Y tends to decrease
If r=0, what does this mean?
There is no linear relationship between X and Y, although there could be some other kind of relationship.
What values of r indicate a stronger linear relationship?
The closer r is to 1 or -1, the stronger the linear relationship
What would the graph show if r = -1 or r = 1?
The observations lie exactly on a line, with no scatter
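The cards do not give a formula for r. The sketch below assumes the standard Pearson definition (the sum of products of deviations divided by the square root of the product of the sums of squared deviations) and checks it on invented data. The function name `correlation` is illustrative.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation coefficient r for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
r_up = correlation(x, [2, 4, 6, 8, 10])     # points exactly on a rising line
r_down = correlation(x, [10, 8, 6, 4, 2])   # points exactly on a falling line
r_scatter = correlation(x, [2, 1, 4, 3, 5]) # increasing trend with scatter

print(r_up, r_down, r_scatter)
```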
Is r sensitive to outliers?
Yes
Can r be used for curved relationships?
No
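As an illustration of why r is not suitable for curved relationships, the made-up data below follow Y = X squared exactly, yet r comes out as 0. The helper `correlation` is an illustrative implementation of the standard Pearson formula, which the cards do not state.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation coefficient r for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: Y is completely determined by X, but along a curve.
x = [-2, -1, 0, 1, 2]
y = [value ** 2 for value in x]

r_curved = correlation(x, y)
print(r_curved)  # no linear relationship detected despite the exact curve
```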
Does r (correlation) distinguish between a predictor variable and a response variable?
No
What four questions should be asked of a scatterplot?
- Does a relationship exist between the two variables?
- What is its form? (linear, curved etc.)
- Is it increasing or decreasing?
- How strong is the relationship? (Correlation coefficient r)
What descriptive statistics can be used for one categorical variable?
Frequency table
What graphs can be used with two categorical variables? (2)
- Clustered bar chart
- Stacked bar chart
What three questions should be asked of clustered/stacked bar charts?
- Does a relationship exist between the two variables?
- Is it increasing or decreasing?
- How strong is the relationship?