Data Analytics Theory Flashcards

1
Q

Which of the mean, mode and median are resistant to outliers?

A

The mean is very sensitive to the presence of outliers. The median and mode are very resistant to outliers.

2
Q

True or false - the median is calculated differently depending on whether the sample contains an even or odd number of observations?

A

True

3
Q

What are the steps for determining the mean?

A

The sum of all sample values (Xi) divided by the number of observations (n).

4
Q

What are the steps for determining the median?

A

Order the sample values in ascending order. For odd total n the median is found at (n+1)/2. For even total n, the median is the average of the value at n/2 and (n+2)/2.
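
The positions in this answer are 1-based, which is easy to get wrong in a 0-based language. A minimal Python sketch of the same steps (the function name `median` and the sample values are illustrative, not from the course):

```python
def median(values):
    """Median following the card's steps, using 1-based positions."""
    ordered = sorted(values)              # step 1: ascending order
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty sample is undefined")
    if n % 2 == 1:
        # odd n: the value at position (n + 1) / 2
        return ordered[(n + 1) // 2 - 1]
    # even n: average of the values at positions n / 2 and (n + 2) / 2
    lower = ordered[n // 2 - 1]
    upper = ordered[(n + 2) // 2 - 1]
    return (lower + upper) / 2

print(median([7, 1, 5]))       # 5
print(median([7, 1, 5, 3]))    # 4.0
```

The `- 1` in each index converts the card's 1-based position into Python's 0-based indexing.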

5
Q

What are the steps for determining the mode?

A

Create a frequency table and find the highest frequency. The mode is the value that this frequency belongs to.
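
A frequency table can be built in a few lines. The sketch below assumes Python's `collections.Counter`; the helper name `modes` is hypothetical, and it returns every value tied for the highest frequency, since a dataset can have more than one mode.

```python
from collections import Counter

def modes(values):
    """Return every value that attains the highest frequency."""
    table = Counter(values)            # frequency table: value -> count
    highest = max(table.values())      # the highest frequency in the table
    return sorted(v for v, f in table.items() if f == highest)

print(modes([2, 3, 3, 5, 5, 5, 8]))    # [5]
print(modes([1, 1, 2, 2, 3]))          # [1, 2]
```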

6
Q

What is the mode?

A

The observation that occurs most frequently in the dataset.

7
Q

Why should we not look at measures of centrality in isolation?

A

Comparing the measures of centrality between datasets may indicate that they are similar when in reality they have different amounts of dispersion.

8
Q

What are the measures of centrality?

A

Mean, mode, median

9
Q

What do measures of variability describe?

A

Measures of variability describe how dispersed observations in the univariate dataset are. They describe whether observations are tightly clustered or spread out.

10
Q

What are the measures of variability?

A

Variance and standard deviation. Range (though very sensitive to outliers). Five number summary provides basic information about variability.

11
Q

What are synonyms of the mean?

A

Arithmetic mean or average

12
Q

What is the formula for calculating the mean?

A

[See flashcard]
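
The formula image for this card is not reproduced in the text. Written out from the verbal description elsewhere in the deck (the sum of the sample values divided by how many there are), it would read as follows; the symbols $\bar{x}$, $x_i$ and $n$ are assumed notation, not necessarily the card's own.

```latex
\[
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
\]
```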

13
Q

What is the mean?

A

The mean is considered to be the central (typical) measurement of a collection of observations.

14
Q

What is the formula for calculating the standard deviation?

A

[See flashcard]

15
Q

What is the formula for calculating the variance?

A

[See flashcard]
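
The formula images for this card and for the standard deviation card are not reproduced in the text. A common textbook form for a sample is sketched below. The $n-1$ denominator is an assumption (the population form divides by $n$ instead), as is the notation $s^2$, $s$ and $\bar{x}$; check it against the course's own formula sheet.

```latex
\[
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2,
\qquad
s = \sqrt{s^2}
\]
```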

16
Q

What is the variance?

A

The average squared distance of each observation from the mean. Measured in units squared.

17
Q

What units is the variance measured in?

A

Units squared

18
Q

What is the standard deviation?

A

The square root of the variance - it is useful for considering how close the observations are to the mean. Measured in the same units/same scale as the observations in the numerical variable.
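
As a numerical check of how the two measures relate, here is a small sketch. The `sample` flag is an assumption: it switches between dividing by n - 1 (common for samples) and by n (populations), so use whichever the course's formula specifies. The function names and data values are made up.

```python
import math

def variance(values, sample=True):
    """Average squared distance from the mean (units squared)."""
    n = len(values)
    mean = sum(values) / n
    squared_distances = sum((x - mean) ** 2 for x in values)
    return squared_distances / (n - 1 if sample else n)

def std_dev(values, sample=True):
    """Square root of the variance (same units as the data)."""
    return math.sqrt(variance(values, sample))

lengths_cm = [2, 4, 4, 4, 5, 5, 7, 9]          # hypothetical measurements in cm
print(variance(lengths_cm, sample=False))      # 4.0  (cm squared)
print(std_dev(lengths_cm, sample=False))       # 2.0  (cm)
```

Taking the square root is what brings the measure back from squared units to the units of the observations.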

19
Q

What units is the standard deviation measured in?

A

Same units/scale as the observations in the numerical variable.

20
Q

How much of the data is usually within one standard deviation from the mean?

A

About 68%, if the data are approximately normally distributed.

21
Q

How much of the data is usually within two standard deviations from the mean?

A

About 95%, if the data are approximately normally distributed.

22
Q

What are order statistics?

A

Statistics based on sorted (ranked) data

23
Q

Define a quantile.

A

The value computed from a sorted collection of numerical measurements (in ascending order) that indicates an observation’s rank when compared to all other present observations. It can take a value between 0 and 1.

24
Q

What does the 0.5th quantile mean?

A

This is the median value, below which half (50%) of the measurements lie.

25
Q

What values can a quantile take?

A

Between 0 and 1.

26
Q

What values can a percentile take?

A

Between 0 and 100.

27
Q

What is the relationship between a quantile and percentile?

A

The percentile is the quantile expressed on a “percent scale” of 0 to 100, ie the pth quantile (p between 0 and 1) is the (100 x p)th percentile.

28
Q

Define percentile.

A

The percentile is the quantile expressed on a “percent scale” of 0 to 100, ie the pth quantile (p between 0 and 1) is the (100 x p)th percentile. The Pth percentile is the cutoff point such that at least P percent of the observations in the dataset take on this value or less.
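
The "at least P percent at or below" definition can be turned into code directly. The sketch below uses the nearest-rank convention, which is one of several in use (software packages often interpolate between observations and can give slightly different answers). The function names and data are illustrative.

```python
import math

def percentile(values, p):
    """Smallest observation with at least p percent of the data at or below it."""
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    ordered = sorted(values)                    # order statistics need sorted data
    n = len(ordered)
    rank = max(1, math.ceil(p * n / 100))       # 1-based position in the sorted data
    return ordered[rank - 1]

def quantile(values, q):
    """The q-th quantile (0 to 1) is the (100 * q)-th percentile."""
    return percentile(values, 100 * q)

scores = list(range(1, 11))                     # made-up data: 1, 2, ..., 10
print(percentile(scores, 80))                   # 8
print(quantile(scores, 0.5) == percentile(scores, 50))   # True
```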

29
Q

What does the 80th percentile represent?

A

The 80th percentile is the cutoff point which indicates that 80% of observations in the dataset may be found at this point or below.

30
Q

What are quartiles?

A

Quartiles are three cut-off points that divide the dataset into four equal groups (Q1, Q2, Q3).

31
Q

Define the first quartile

A

Q1 = 0.25th quantile = 25th percentile. This is the middle value between the smallest observation and the median. Ie it is the median of the lower half of the dataset.

32
Q

Define the second quartile.

A

Q2 = 0.5th quantile = 50th percentile. This is the median of the dataset (the value which splits the dataset in half).

33
Q

Define the third quartile.

A

Q3 = 0.75th quantile = 75th percentile. This is the middle value between the median and the highest observation in the dataset. Ie it is the median of the upper half of the dataset.

34
Q

Define the range.

A

The range is the difference between the smallest and largest observations in a numerical variable. It is extremely sensitive to outliers and therefore not very useful as a general measure of dispersion in the data.

35
Q

Why is the range not very useful as a general measure of dispersion in the data?

A

It is extremely sensitive to outliers - its calculation involves the use of extreme values.

36
Q

What is the five number summary?

A

This provides basic information about variability in the dataset. It consists of the 0th percentile (minimum), 25th percentile (Q1), 50th percentile (Q2), 75th percentile (Q3) and 100th percentile (maximum). Ie it is the quartiles plus the maximum and minimum values.

37
Q

What is the interquartile range?

A

The interquartile range (IQR) measures the width of the “middle 50 percent” of the data. It is the difference between Q3 (0.75 quantile) and Q1 (0.25 quantile), ie IQR = Q3 - Q1. It is very resistant to outliers as it doesn’t consider the extremes where outliers are present.
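
To see the outlier resistance in numbers, the sketch below computes the five number summary and the IQR using the "median of each half" reading of Q1 and Q3. Leaving the middle observation out of both halves when n is odd is an assumption (conventions differ between textbooks and software), and the helper names and data are made up.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum); needs at least 2 values."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]          # observations below the median position
    upper_half = ordered[n - half:]      # observations above the median position
    return (
        ordered[0],
        statistics.median(lower_half),
        statistics.median(ordered),
        statistics.median(upper_half),
        ordered[-1],
    )

def iqr(values):
    _, q1, _, q3, _ = five_number_summary(values)
    return q3 - q1

data = [1, 3, 5, 7, 9, 11, 13, 100]      # 100 is a deliberate outlier
print(five_number_summary(data))         # (1, 4.0, 8.0, 12.0, 100)
print(max(data) - min(data))             # range: 99
print(iqr(data))                         # IQR: 8.0
```

Replacing 100 with 15 would shrink the range from 99 to 14 while leaving the IQR at 8.0.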

38
Q

Why is the IQR resistant to outliers?

A

The IQR measures the range across the middle 50% of the data, and therefore unlike the range it doesn’t consider the extremes where the outliers are present.

39
Q

What is the first step to carry out before determining order statistics?

A

Sort the data in ascending order.

40
Q

What is covariance?

A

Covariance measures joint variability — the extent of variation between two random variables. It quantifies how two variables vary together.

41
Q

What are the possible outcomes for covariance and what does each mean?

A

R = 0 - there is no linear relationship between numerical variables x and y.
R > 0 - there is a positive linear relationship between numerical variables x and y (as x increases, y increases and vice versa).
R < 0 - there is a negative linear relationship between numerical variables x and y (as x increases, y decreases and vice versa)
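
The three cases can be checked numerically. This is a sketch of the sample covariance with an n - 1 denominator, which is an assumption about the course's formula; the function name and the data are illustrative.

```python
def covariance(xs, ys):
    """Sample covariance (n - 1 denominator) of two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length, at least 2 values each")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    products = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return products / (n - 1)

x = [1, 2, 3, 4]
y_up = [2, 4, 6, 8]                  # increases with x
y_down = [8, 6, 4, 2]                # decreases as x increases

print(covariance(x, y_up) > 0)       # True: positive linear relationship
print(covariance(x, y_down) < 0)     # True: negative linear relationship

# Rescaling y (for example metres to centimetres) rescales the covariance,
# which is why its size alone says nothing about strength.
y_up_scaled = [100 * v for v in y_up]
print(covariance(x, y_up_scaled) / covariance(x, y_up))
```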

42
Q

What does a positive linear relationship mean?

A

R > 0 - as x increases, y increases and vice versa

43
Q

What does a negative linear relationship mean?

A

R < 0 - as x increases, y decreases and vice versa

44
Q

Does correlation or covariance measure how strong a relationship is?

A

Correlation

45
Q

Why does calculating the covariance not tell us how strong a relationship is?

A

Covariance can tell us if there is a relationship between two variables, but it cannot measure how strong the relationship is as there is no scale to compare the value of r to.

46
Q

What type of variable can covariance and correlation be calculated for?

A

Numerical variables.

47
Q

What is the problem with covariance?

A

We cannot quantify the strength of the linear relationship between two variables. There are no upper or lower limits on the values the covariance can take, and its size depends on the units of the variables.

48
Q

What does correlation measure?

A

The direction and strength of an association between two variables. It is used to interpret the covariance.

49
Q

What coefficient do we use for correlation?

A

Pearson’s product-moment correlation coefficient (ρxy, read “rho xy”).

50
Q

What are the interpretations of the absolute strength of the Pearson’s product-moment correlation coefficient?

A

There are guidelines available to interpret the value of rho.
|rho| = 0.0 – no linear relationship
0.0 < |rho| <= 0.19 – very weak L.R.
0.20 <= |rho| <= 0.39 – weak L.R.
0.40 <= |rho| <= 0.59 – moderate L.R.
0.60 <= |rho| <= 0.79 – strong L.R.
0.80 <= |rho| < 1.0 – very strong L.R.
|rho| = 1.0 – perfect L.R.
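
These guideline bands translate naturally into a lookup. The sketch below computes Pearson's coefficient and labels its strength. The bands as listed leave small gaps (for example between 0.19 and 0.20), so assigning those values to the lower band is an assumption of this sketch, as are the function names.

```python
import math

def pearson(xs, ys):
    """Pearson's product-moment correlation coefficient of two samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def strength(rho):
    """Map |rho| to the guideline label for the linear relationship."""
    size = abs(rho)
    if size == 0:
        return "no linear relationship"
    if size < 0.20:
        return "very weak"
    if size < 0.40:
        return "weak"
    if size < 0.60:
        return "moderate"
    if size < 0.80:
        return "strong"
    if size < 1.0:
        return "very strong"
    return "perfect"

rho = pearson([1, 2, 3, 4], [2, 4, 6, 8])
print(rho, strength(rho))
print(strength(-0.45))      # moderate: the sign gives direction, |rho| gives strength
```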

51
Q

What are the basic interpretations of the Pearson’s product-moment correlation coefficient?

A

If rho = 1, there is a perfect positive linear relationship between variables x and y.
If 0 < rho < 1, there is a positive linear relationship between x and y. The closer to 1 the stronger it is.
If rho = -1, there is a perfect negative linear relationship between x and y.
If -1 < rho < 0, there is a negative linear relationship between x and y. The closer to -1 the stronger it is.
If rho = 0, there is no linear relationship between x and y.

52
Q

What values can Pearson’s product-moment correlation coefficient take on?

A

Rho is between -1 and 1.

53
Q

Why are we able to say how strong the relationship is using Pearson’s product-moment correlation coefficient?

A

It is scaled between -1 and 1.

54
Q

What is a frequency table?

A

A table that lists each category of a categorical variable together with how often it occurs. It is used to get more insight into the properties of categorical variables.

55
Q

What are the columns of a frequency table?

A

1 - category
2 - frequency column (F) - the number of occurrences of each category. Will total to n
3 - relative frequency (RF) - the proportion of occurrences of each category (F/n). The sum of all relative frequencies when written as proportions must be equal to 1.
4 - percentages (P) - proportions multiplied by 100. The sum of this column must equal 100.
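
The four columns can be produced with a short sketch. It assumes Python's `collections.Counter`; the function name `frequency_table` and the colour data are made up for illustration.

```python
from collections import Counter

def frequency_table(observations):
    """Return rows of (category, F, RF, P), most frequent category first."""
    counts = Counter(observations)
    n = len(observations)
    rows = []
    for category, frequency in counts.most_common():
        relative = frequency / n                      # RF = F / n
        rows.append((category, frequency, relative, 100 * relative))
    return rows

colours = ["red", "blue", "red", "green", "red", "blue", "red", "blue"]
table = frequency_table(colours)
for category, f, rf, p in table:
    print(f"{category:<6} F={f}  RF={rf:.3f}  P={p:.1f}")

print(sum(f for _, f, _, _ in table))     # 8, the number of observations n
print(sum(rf for _, _, rf, _ in table))   # 1.0
```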

56
Q

What does the relative frequency column of a frequency table sum to?

A

1

57
Q

Why are frequency tables useful?

A

They help us to summarise large amounts of data and display this information clearly. We can see the most/least common categories and can calculate proportions.

58
Q

What are contingency tables used for?

A

A contingency table summarises data for two categorical variables (table of counts by category). Each value in the table represents the number of times a particular combination of variable outcomes occurred.
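
A sketch of how such a table of counts can be built and queried. The email data, the category labels and the helper names are all hypothetical; each observation is a pair of outcomes, one per categorical variable.

```python
from collections import Counter

def contingency_table(pairs):
    """Count how often each (row category, column category) combination occurs."""
    return Counter(pairs)

def row_proportion(table, row, column):
    """Proportion of the observations in `row` that also fall in `column`."""
    row_total = sum(count for (r, _), count in table.items() if r == row)
    return table[(row, column)] / row_total

# Hypothetical observations: (spam status, amount of numbers in the text)
emails = [
    ("spam", "none"), ("spam", "small"), ("spam", "none"),
    ("not spam", "small"), ("not spam", "big"), ("not spam", "small"),
]
table = contingency_table(emails)
print(table[("spam", "none")])                  # 2
print(row_proportion(table, "spam", "none"))    # share of spam emails with no numbers
```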

59
Q

What is the relationship between a frequency table and contingency table?

A

Both tables are used to summarise information on categorical variables. A frequency table is used to summarise information on a single categorical variable whereas a contingency table summarises the data for two categorical variables.

60
Q

What kind of tool can be used to answer questions like “what proportion of spam emails contains text without numbers?”

A

Two categorical variables - contingency table

61
Q

What are bar charts used to visualise?

A

Categorical variables. This can be represented as frequency or proportion.

62
Q

How are categorical variables visualised?

A

Bar charts - this can be by frequency or proportion.

63
Q

What are the different axes of a bar chart?

A

The x-axis represents the different symbols (categories) of a categorical variable. The y-axis represents the frequency or proportion of the occurrence of each category.

64
Q

What is a mosaic plot?

A

A graphical representation of the information in a contingency table. It is similar to a bar plot.

65
Q

How many variables can a mosaic plot represent?

A

A mosaic plot can be used to visualise one or two categorical variables from a contingency table.

66
Q

How do mosaic plots represent the number of observations?

A

Mosaic plots use the area of each box to represent the number of observations in that box.

67
Q

What is used to visualise contingency tables?

A

A mosaic plot

68
Q

How does a two-variable mosaic plot represent the two variables?

A

One variable (x) is used to create an initial one-variable mosaic plot where the area of each bar represents the number of observations in that category. The second variable (y) is represented by splitting each bar proportionally according to the fractions of the categories of y.

69
Q

What types of variables are plotted on a scatterplot?

A

Numerical variables

70
Q

What is a scatterplot?

A

A plot that provides a case-by-case view of data for two numerical variables.

71
Q

What are scatterplots useful for?

A

Scatterplots are helpful in quickly spotting associations between two numerical variables.

72
Q

What is a box plot?

A

A visualisation technique used for explaining important features of the distribution of the target numerical variable. It provides insight into centrality, spread, skewness and possible outliers.

73
Q

What does a box plot show?

A

Centrality (median), spread (quartiles), skewness and possible outliers.

74
Q

Do the whiskers of a box plot represent the full range?

A

No, the whiskers may not capture the maximum and minimum values. How the whiskers are determined depends on the software package used. Eg a common convention extends each whisker to the most extreme observation that lies within 1.5 x IQR of the box.
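
A sketch of the 1.5 x IQR convention: compute the limits that the whiskers may not exceed, then flag anything beyond them as a suspected outlier. The quartiles here are taken as the medians of the lower and upper halves, which is one convention among several; the function names, the default `k=1.5` and the data are assumptions for illustration.

```python
import statistics

def whisker_limits(values, k=1.5):
    """Return (lower, upper) limits: Q1 - k*IQR and Q3 + k*IQR."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = statistics.median(ordered[:half])                    # median of lower half
    q3 = statistics.median(ordered[len(ordered) - half:])     # median of upper half
    spread = q3 - q1                                          # the IQR
    return q1 - k * spread, q3 + k * spread

def suspected_outliers(values, k=1.5):
    """Observations that fall beyond the reach of the whiskers."""
    low, high = whisker_limits(values, k)
    return [x for x in values if x < low or x > high]

data = [1, 3, 5, 7, 9, 11, 13, 100]
print(whisker_limits(data))         # (-8.0, 24.0)
print(suspected_outliers(data))     # [100]
```

Here the upper whisker would stop at 13, the largest observation inside the limit, and 100 would be drawn as a separate point.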

75
Q

What can box plots be useful for?

A

Identifying outliers.

76
Q

If a box plot shows a lot of outliers above the maximum whisker (high positive) what does this indicate about the skew of the data?

A

Right-skewed

77
Q

If a box plot shows a lot of outliers below the minimum whisker what does this indicate about the skew of the data?

A

Left-skewed

78
Q

How can you identify suspected outliers on a box plot?

A

Suspected outliers are the observations beyond the maximum reach of the whiskers.

79
Q

What is an outlier?

A

An outlier is an observation that appears extreme relative to the rest of the data

80
Q

Why is it important to look for outliers?

A
  • To identify a strong skew in the distribution
  • To identify data collection or entry errors
  • To get an insight into interesting properties of the data
81
Q

What are side-by-side box plots used for?

A

Side-by-side box plots are a traditional tool for comparing numerical observations across categories. They are particularly useful for comparing centrality and spread of numerical observations between categories.

82
Q

What visualisation technique can you use for exploring the distribution of numerical and categorical variables together?

A

Side-by-side box plots

83
Q

What measures are side-by-side box plots particularly useful for?

A

Comparison of centrality and spread of numerical observations between categories.

84
Q

How should you answer questions describing graphs?

A
  • Describe what you see
  • Relate this to the question (ie what does this mean in real life)
  • Support with figures from the graph
85
Q

What are histograms?

A

Histograms are plots that are used for describing the shape of the data distribution of the target numerical variable. They also provide a view of the data density of the target numerical variable (higher bars represent where data is more common).

86
Q

What kind of data type is plotted in a histogram?

A

Numerical

87
Q

What kind of visualisation describes the shape of the data distribution of a numerical variable?

A

Histogram

88
Q

What does a higher bar in a histogram represent?

A

Where the data are relatively more common.

88
Q

What kind of visualisation describes the data density of a numerical variable?

A

Histogram - where higher bars represent where the data are relatively more common.

89
Q

What are the similarities of bar charts and histograms?

A

They use bars to represent frequencies / they both measure frequencies.

90
Q

What are the differences of bar charts and histograms?

A
  • Histograms are used for displaying distributions of numerical variables while bar charts are used for categorical variables.
  • Both measure frequencies, but in histograms, observations first need to be “binned”
91
Q

What is a “bin” in a histogram?

A

A defined interval (used to group individual numerical values). The number of observations that fall within each interval are counted and this frequency is used to determine the height of the bar for that interval.

92
Q

Why does bin width matter when plotting histograms?

A

The chosen bin width can alter the story that the histogram is telling. Increasing the bin width may reduce the number of modes that are visible.

93
Q

What are the steps of constructing a histogram?

A

1 - define the bins and bin sizes (software may determine this)
2 - once defined, count how many observations fall into each interval
3 - plot
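
The three steps can be sketched without a plotting library by printing the bars as text. The function name, its parameters (`bin_width`, `start`) and the ages are illustrative assumptions; each bin includes its left edge and excludes its right edge.

```python
def histogram_counts(values, bin_width, start):
    """Return (left, right, count) for each bin [left, right)."""
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    if min(values) < start:
        raise ValueError("start must not exceed the smallest value")
    # Step 1: define the bins, with enough of them to cover the largest value.
    n_bins = int((max(values) - start) // bin_width) + 1
    counts = [0] * n_bins
    # Step 2: count how many observations fall into each interval.
    for x in values:
        counts[int((x - start) // bin_width)] += 1
    return [(start + i * bin_width, start + (i + 1) * bin_width, c)
            for i, c in enumerate(counts)]

ages = [21, 22, 25, 29, 31, 34, 35, 38, 52]     # made-up data

# Step 3: plot, here as one row of '#' characters per bin.
for left, right, count in histogram_counts(ages, bin_width=10, start=20):
    print(f"[{left}, {right}) {'#' * count}")
```

Calling the same function with `bin_width=20` merges the first two bins into one and hides the empty interval between 40 and 50, a small illustration of how bin width changes the picture.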

94
Q

How is the mode represented in a histogram?

A

The mode is represented by a prominent peak in the distribution.

95
Q

What can histograms show?

A

Histograms can show how many and what the modes of a distribution are.
- Unimodal / bimodal / multimodal

96
Q

Describe a right-skewed distribution.

A

When data trails off to the right ie observations are clustered on the left of the axis and there is a long tail to the right.

97
Q

Describe a left-skewed distribution.

A

When data trails off to the left ie observations are clustered on the right of the axis and there is a long tail to the left.

98
Q

If observations are clustered on the left of a histogram and there is a long tail to the right - what kind of skew is this?

A

Right-skewed

99
Q

If observations are clustered on the right of a histogram and there is a long tail to the left - what kind of skew is this?

A

Left-skewed

100
Q

How do you describe a dataset that shows roughly equal trailing off in both directions?

A

Symmetric

101
Q

What is a symmetric distribution?

A

A dataset that shows roughly equal trailing off in both directions.

102
Q

Why is it important to check if data is normally distributed?

A

A lot of statistical inference relies on data being normally distributed.

103
Q

If the distribution of a dataset is symmetric, what measures should you use to describe the centre and spread?

A

Mean and standard deviation

104
Q

What kind of distribution is best described by the mean and standard deviation?

A

Symmetric

105
Q

If the distribution of a dataset is skewed, what measures should you use to describe the centre and spread?

A

Median and IQR - they are robust to outliers.

106
Q

What is the relationship between the median, mean and mode of a symmetric distribution?

A

mean ~ median ~ mode

107
Q

What is the relationship between the median, mean and mode of a right-skewed distribution?

A

mode < median < mean

108
Q

What is the relationship between the median, mean and mode of a left-skewed distribution?

A

mean < median < mode

109
Q

Why does mean ~ median ~ mode not hold for skewed data?

A

The mean is pulled in the direction of the tail, towards the extremes. The mode is pulled in the opposite direction (where the data is clustered)

110
Q

If data is right-skewed, what kind of transformations can result in new samples which are less skewed?

A

y = sqrt(x)
y = ln(x)
y = -1/x
Listed in order of increasing strength: use the later transformations for more severe skew.
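
The effect of each transformation can be checked with a skewness statistic. The sketch below uses the moment coefficient of skewness (third central moment divided by the 1.5 power of the second), which is an assumption since the deck does not define a skewness measure; positive values indicate right skew. The data are made up, and `ln(x)` and `-1/x` require strictly positive values.

```python
import math

def skewness(values):
    """Moment coefficient of skewness: positive = right-skewed, negative = left-skewed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n    # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n    # third central moment
    return m3 / m2 ** 1.5

raw = [1, 10, 100, 1000, 10000]                      # strongly right-skewed
rooted = [math.sqrt(x) for x in raw]                 # y = sqrt(x)
logged = [math.log(x) for x in raw]                  # y = ln(x)
reciprocal = [-1 / x for x in raw]                   # y = -1/x

for name, sample in [("raw", raw), ("sqrt", rooted),
                     ("ln", logged), ("-1/x", reciprocal)]:
    print(f"{name:<5} skewness = {skewness(sample):.3f}")
```

For this sample the square root reduces the skew, the logarithm removes it almost entirely, and `-1/x` overshoots into left skew, which is why the stronger transformations are reserved for more severe cases.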

111
Q

If data is left-skewed, what kind of transformations can result in new samples which are less skewed?

A

y = x^2
y = x^3
Listed in order of increasing strength: use the later transformation for more severe skew.

112
Q

Why is the bin width choice important?

A

Depending on bin size, the story the graph tells can change. If the bins are too wide, the histogram may mislead you into thinking that the data is normally distributed.

113
Q

What are the options to have on the y-axis of a histogram?

A

Absolute frequency or relative frequency (F/n)

114
Q

What is the difference in shape of the relative histogram in relation to the absolute frequency histogram?

A

They have the same shape. The difference is the Y-axis and the fact that the heights of the bars of the relative frequency histogram add up to one.

115
Q

How do you calculate the relative frequency?

A

The absolute frequency divided by the total number of observations

116
Q

When do we use the relative frequency histogram over the absolute frequency histogram?

A

Use the relative frequency histogram when we want to investigate whether the proportion is less than or greater than a certain value. Ie we want to look at proportion rather than frequency.

117
Q

How do you answer a question to determine how many people fall in a certain interval, if the bin widths are too big to answer this accurately?

A

Can’t determine an exact answer with these bin widths, we can only estimate. To answer accurately we need a histogram with narrower bins.

118
Q

If you keep changing the bin widths of a histogram to become smaller, what happens?

A

The histogram forms a smoother curve, approaching the density curve.

119
Q

What is a density curve?

A

A density curve is a smoothed version of the relative frequency histogram. It is used for the visualisation of continuous variables or very large populations. It also represents a probability density function. The area under the curve is equal to 1.

120
Q

What kind of variables is visualised in a density curve or probability density function?

A

A continuous variable.

121
Q

What does the area under a density curve represent?

A

The area corresponds to measuring probabilities. The total area is equal to 1. Similar to the bars in a relative frequency diagram.

122
Q

From a probability density curve, what is the probability that x = a particular value from the continuous distribution?

A

The probability that x is equal to some value from the continuous distribution is ALWAYS equal to 0. This happens because a single point on the density curve diagram has a width of 0, so the area underneath the curve at a single point is 0.

123
Q

Which distribution is the most common?

A

The normal curve or normal distribution.

124
Q

What are the properties of the normal distribution?

A
  • It is a unimodal, bell-shaped curve that is symmetric around its mean
  • Mean, mode and median are equal
  • It is determined by two parameters (mu and sigma), usually denoted as N(mu, sigma)
  • The area under the normal curve is 1
125
Q

What parameters determine the normal distribution?

A

Mu and sigma - N(mu, sigma)

126
Q

What is the standard normal distribution?

A

A normal distribution where mu = 0 and sigma = 1, represented as N(0,1)

127
Q

Which normal distribution is represented by N(mu = 0, sigma = 1)?

A

The standard normal distribution

128
Q

We want our dataset to be normally distributed, but in practice our data comes from lots of different types of distributions, with lots of different influencing parameters. How do we account for this?

A

Transform our dataset onto the standard normal distribution. This enables us to refer to the standardised tables.

129
Q

What parameters determine the shape of the normal distribution?

A

Mu (mean) - the centre of the curve, changing mu shifts the curve left / right
Sigma (standard deviation) - the width of the curve. Changing sigma stretches or constricts the curve

130
Q

Which rule describes how many observations lie within different numbers of standard deviations from the mean in the normal distribution?

A

68-95-99.7 Rule
- 68% of observations lie within 1 SD away from the mean in the normal distribution
- 95% of observations lie within 2 SDs
- 99.7% of observations lie within 3 SDs
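
The three percentages can be reproduced from the normal distribution itself. This sketch assumes Python 3.8 or later, where `statistics.NormalDist` is available; the helper name `share_within` is hypothetical.

```python
from statistics import NormalDist

def share_within(k, dist=NormalDist(mu=0, sigma=1)):
    """Proportion of a normal distribution lying within k SDs of its mean."""
    upper = dist.cdf(dist.mean + k * dist.stdev)   # area to the left of mean + k SDs
    lower = dist.cdf(dist.mean - k * dist.stdev)   # area to the left of mean - k SDs
    return upper - lower

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.1%}")
```

Because the result depends only on k, the same percentages hold for any mu and sigma, for example `share_within(1, NormalDist(mu=70, sigma=10))`.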

131
Q

How many observations lie within 1/2/3 SDs away from the mean in the normal distribution?

A

68%, 95%, 99.7%

132
Q

How do we analyse normally distributed data?

A

We should convert the available observations into standard deviation units and measure their distances from the mean.
To perform this type of conversion we use the standardisation technique called Z-score.

133
Q

What is a Z-score?

A

The Z-score of an observation is the number of standard deviations it falls above or below the mean. It is used to analyse normally distributed data.

134
Q

How do we calculate the Z score?

A

For an observation x that follows the normal distribution N(mu, sigma)
Z = (x - mu) / sigma
By calculating a Z-score we “convert” the data value from its normal distribution N(mu, sigma) to a value from the standard normal distribution N(0,1) in such a way that it maintains all the properties of the original dataset.
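
A small sketch of the calculation, with `mu` and `sigma` standing for the mean and standard deviation of the distribution. The test scores and their parameters are invented for illustration.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (x - mu) / sigma

# Two hypothetical tests marked on different scales.
z_a = z_score(85, mu=70, sigma=10)    # 1.5 SDs above the mean
z_b = z_score(30, mu=24, sigma=3)     # 2.0 SDs above the mean
print(z_a, z_b)

# The larger absolute Z-score marks the more unusual observation.
print("test B" if abs(z_b) > abs(z_a) else "test A")   # test B
```

Standardising puts both scores on the same scale, so they can be compared even though the raw marks cannot.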

135
Q

What does a Z score of 1 mean?

A

The observation is one standard deviation above the mean.

136
Q

If an observation is 1.5 standard deviations below the mean, what is its Z-score?

A

z = -1.5.

137
Q

What can comparing the Z-scores of two observations allow you to determine?

A

You can use Z-scores to roughly identify which observations are more unusual than others. The larger the absolute value of the Z-score, the more unusual the observation: |z1| > |z2| means observation 1 is more unusual than observation 2.

138
Q

How can you tell which of two observations is more unusual?

A

The more unusual observation will have the Z-score with the larger absolute value, ie it will be more standard deviations away from the mean.

139
Q

What does the magnitude and sign of the Z-score indicate?

A

Magnitude - the number of standard deviations away from the mean the observation is.
Sign - whether the observation is above (positive) or below (negative) the mean.

140
Q

If a random variable X~N(mu, sigma), then what can we say about the random variable Z = (x-mu)/sigma?

A

Z ~ N(0,1)
It follows that Z is normally distributed, with mean 0 and standard deviation 1, once transformed.

141
Q

How do we calculate percentiles for a N(mu, sigma) distribution?

A

We transform it to the standard normal distribution (Z scores) and use the N(0,1) percentiles, which are listed in a normal probability table to determine the percentile based on the Z score.

142
Q

What are the steps to follow for solving normal probability problems?

A

1 – draw and label a picture of the normal distribution (doesn’t need to be exact)
2 – shade in the region of interest
3 – calculate the Z-score of the cutoff value
4 – look up the percentile for the Z-score in the normal probability table
5 – do you need to subtract from 1?
Always verify that the final answer makes sense with the picture you drew.
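
Steps 3 to 5 can be sketched in code, with `statistics.NormalDist` standing in for the printed normal probability table. The problem itself (scores distributed N(1500, 300), cutoff 1800) is hypothetical.

```python
from statistics import NormalDist

# Hypothetical problem: X ~ N(mu=1500, sigma=300). What is P(X > 1800)?
mu, sigma, cutoff = 1500, 300, 1800

# Step 3: Z-score of the cutoff value
z = (cutoff - mu) / sigma

# Step 4: percentile of that Z-score (area to the LEFT of the cutoff)
percentile = NormalDist().cdf(z)

# Step 5: the shaded region is to the RIGHT of the cutoff, so subtract from 1
p_above = 1 - percentile

print(f"z = {z:.2f}, percentile = {percentile:.4f}, P(X > cutoff) = {p_above:.4f}")
```

The answer (about 16%) agrees with the picture: the cutoff sits one SD above the mean, so the right-hand tail should be roughly half of the 32% lying outside 1 SD.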

143
Q

What is the textbook definition of a Z-score?

A

Z-score is a statistical measurement that describes a value’s relationship to the mean of a group of values. Z-score is measured in terms of standard deviations from the mean.

144
Q

What are the two types of categorical variables?

A

Nominal categorical variable - have no implied order
Ordinal categorical variable - have a natural ordering

145
Q

What are the two most popular approaches to evaluating whether sample data follows the normal distribution or not?

A
  • Statistical tests
  • Visualisation techniques
    Ideal to use both
146
Q

What are some statistical approaches of evaluating whether a given sample of data follows the normal distribution?

A
  • Shapiro-Wilk test
  • Kolmogorov – Smirnov test
  • Anderson – Darling test, etc.
147
Q

What is a drawback of relying on statistical tests for evaluating if data follows the normal distribution?

A

Statistical tests are very sensitive to the presence of outliers. If a certain number of outliers are present in a normally distributed data set, statistical tests may report that the data set is not drawn from a normal distribution. Visualisation techniques may help overcome this problem.

148
Q

What are two visualisation techniques for normality assessment?

A
  • Histograms with the best fitting normal curve overlaid on the plot
  • The normal probability plot (quantile-quantile plot or QQ plot)
149
Q

What is the quantile-quantile (QQ) plot a synonym for?

A

The normal probability plot.

150
Q

What does a histogram with the best fitting normal curve overlaid on the plot show?

A

This is used to visualise normality assessment. The sample mean and SD are used as the parameters for the best fitting normal curve. The closer the curve is to the histogram, the more reasonable the normal model assumption is.

151
Q

What does the normal probability plot show?

A

This is used to visualise normality assessment. Data are plotted on the y-axis of the plot and theoretical quantiles (following normal distribution) are plotted on the x-axis. The closer the points are to a perfect straight line, the more confident we can be that the data follow the normal model.
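
A rough sketch of how the points of a normal probability plot can be built. The plotting positions (i - 0.5) / n are one common convention and are an assumption here, as are the function name `qq_points` and the sample values.

```python
from statistics import NormalDist

def qq_points(data):
    """Return (theoretical quantile, observed value) pairs for a normal probability plot."""
    ordered = sorted(data)
    n = len(ordered)
    std_normal = NormalDist()
    # Plotting positions (i - 0.5) / n are one common convention, assumed here
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))

# Hypothetical sample
sample = [4.1, 5.0, 5.2, 4.7, 6.3, 5.5, 4.9, 5.8]
for theo, obs in qq_points(sample):
    print(f"{theo:6.2f}  {obs:4.1f}")
```

Plotting the first column on the x-axis and the second on the y-axis gives the plot described above.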

152
Q

When using a histogram with the best fitting normal curve overlaid or the normal probability plot to visualise normality, how does sample size affect this?

A

A smaller sample size will show more variability around the curve. A larger sample size increases the confidence.

153
Q

When using a histogram with the best fitting normal curve overlaid to visualise normality, what does a curve close to the histogram represent?

A

A curve closer to the histogram means it is more reasonable to assume the data is normally distributed.

154
Q

How is right-skew observed on a normal probability plot?

A

Points bend up and to the left of the line.

155
Q

How is left-skew observed on a normal probability plot?

A

Points bend down and to the right of the line.

156
Q

If visualisations show that the data isn’t normally distributed what should you do?

A

Perform further analysis, eg different visualisations or investigating if and why there are outliers.

157
Q

How are short tails visualised on a normal probability plot?

A

Short tails (narrower than the normal distribution) - points follow an S-shaped curve.

158
Q

How are long tails visualised on a normal probability plot?

A

Long tails (wider than the normal distribution) - points start below the line, bend to follow it, and end above it.

159
Q

What is the purpose of statistical inference?

A

To draw conclusions about and assess population parameters for a specific population based on a sample of data taken from that population.

160
Q

Why do we use sample statistics?

A

Sample statistics (mean, proportions etc) are used as point estimates for the unknown population parameters of interest, as it is difficult (or impossible) to collect data from the complete population.

161
Q

What is a point estimate?

A

In statistics, a point estimate is a single value that is calculated from sample data to estimate an unknown population parameter. It is a “best guess” or “best estimate” of the population parameter.

They generally vary from one sample to another and this sampling variation suggests our estimates may be close, but not exactly the true population parameter.

162
Q

What does variation in point estimates between different samples suggest?

A

This sampling variation suggests that the estimate is not exactly equal to the true population parameter.

163
Q

What does the sampling distribution represent?

A

The distribution of point estimates based on samples of a fixed size from a certain population.

164
Q

What are the parameters of a sampling distribution?

A

The central “balance” point of a sampling distribution is its mean.

The standard deviation of a sampling distribution is referred to as a standard error.

165
Q

What is the standard error?

A

The standard deviation of a sampling distribution. Reflects the fact that probabilities are no longer tied to raw measurements/observations, but rather to a quantity calculated from a sample of such observations.

The standard error of an estimate describes how far the point estimate typically is from the true population parameter, eg how far the typical sample mean is from the actual population mean.

166
Q

What is the difference between the standard deviation and the standard error?

A

The standard deviation measures the variability of individual data points inside the sample

The standard error measures how far the point estimate typically is from the population parameter, ie the variability of the estimate from sample to sample.
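
A short sketch contrasting the two quantities for a sample mean. It assumes the usual estimate SE = s / sqrt(n), which is not spelled out on this card, and the data are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical sample of n = 9 measurements
sample = [12, 15, 11, 14, 13, 16, 12, 14, 10]

n = len(sample)
sd = stdev(sample)   # spread of individual observations (sample SD, n - 1 denominator)
se = sd / sqrt(n)    # estimated spread of the sample mean across repeated samples

print(f"mean = {mean(sample):.2f}, SD = {sd:.2f}, SE = {se:.2f}")
```

The SE is smaller than the SD by a factor of sqrt(n), so it shrinks as the sample grows while the SD does not.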

167
Q

What is the central limit theorem?

A

If a sample consists of at least 30 independent observations and the data are not strongly skewed, then the distribution of the sample mean is approximated well by the normal distribution.

The central limit theorem says that the sampling distribution of the mean will be approximately normal, as long as the sample size is large enough. This holds whether the population has a normal, Poisson, binomial, or any other distribution with finite variance.
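
A simulation sketch of the idea, under assumed settings: the population is exponential with rate 1 (strongly right-skewed, mean 1, SD 1), and the sample size and number of samples are arbitrary choices.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(0)  # fixed seed so the sketch is repeatable

n = 40              # observations per sample
num_samples = 5000  # how many samples we draw

# Exponential population with rate 1: strongly right-skewed, mean 1 and SD 1
sample_means = [
    sum(random.expovariate(1.0) for _ in range(n)) / n
    for _ in range(num_samples)
]

print(f"mean of sample means: {mean(sample_means):.3f} (population mean = 1)")
print(f"SD of sample means:   {stdev(sample_means):.3f} (sigma / sqrt(n) = {1 / sqrt(n):.3f})")
```

Even though individual observations are skewed, the sample means cluster around the population mean with a spread close to sigma / sqrt(n); a histogram of `sample_means` would look roughly bell-shaped.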

168
Q

What is the normal model which approximates the distribution of the sample mean? For both known standard deviation of the population and unknown.

A

[See flashcard]

169
Q

What conditions need to be met for the central limit theorem (CLT) to apply (to the mean)?

A

Independence - sample observations must be independent.

Sample size/skew - either the population distribution is normal, or if the population distribution is skewed, the sample size is large.

170
Q

For the central limit theorem (CLT), independence is difficult to verify, but it is more likely if:

A
  • Random sampling / assignment is used
  • If sampling without replacement, n is less than 10% of the population
171
Q

To apply the central limit theorem (CLT), if the population distribution is skewed, what do we need to ensure?

A

The more skewed the population distribution, the larger the sample size we need for the CLT to apply.

For moderately skewed distributions, n > 30 is a widely used rule of thumb.

172
Q

For the central limit theorem (CLT), it is difficult to verify the skew / if it is normally distributed, but how can we check it?

A

We can check it using the sample data and assume that the sample mirrors the population.

173
Q

What conditions need to be met for the central limit theorem (CLT) to apply (for the single proportion)?

A
  • Independence - sampled observations must be independent.
  • Sample size / skew - at least 10 success and 10 failure observations. eg for the marathon example, at least 10 who ran < 2 hours and 10 who ran > 2 hours
174
Q

If you report a single point estimate, such as the sample mean, are you likely to capture the exact population parameter, eg the population mean?

A

It is very likely that we will not capture the exact population parameter. Instead, if we report a range of the plausible values, we have a good chance to capture a true population parameter.

A plausible range of values for the population parameter is called a confidence interval.

175
Q

What is a confidence interval?

A

A plausible range of values for the population parameter.

They may be constructed in different ways, depending on the type of statistic and therefore the shape of the corresponding sampling distribution.

176
Q

For symmetrically distributed sample statistics, like those involving means and proportions, what is the general formula of the confidence interval?

A

[See flashcard]

177
Q

What is Z* and what does it depend on?

A

Z* is the critical value and can have a different value depending on the confidence level.

178
Q

What is Z* x SE in the confidence interval?

A

The margin of error. For a given sample the margin of error changes as the confidence level changes.

179
Q

What is the formula for the margin of error?

A

Z* x SE
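
A sketch of how the margin of error turns a point estimate into an interval, assuming the standard symmetric form point estimate ± Z* x SE. The function name and the numbers are illustrative only.

```python
def confidence_interval(point_estimate, z_star, se):
    """Return (lower, upper) as point estimate -/+ the margin of error."""
    margin_of_error = z_star * se
    return point_estimate - margin_of_error, point_estimate + margin_of_error

# Hypothetical sample mean of 13.0 with SE 0.5, at 95% (Z* = 1.96) and 99% (Z* = 2.58)
print(confidence_interval(13.0, 1.96, 0.5))
print(confidence_interval(13.0, 2.58, 0.5))
```

The second interval is wider: for the same sample, a higher confidence level means a larger Z* and therefore a larger margin of error.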

180
Q

How do we adjust the confidence level?

A

Adjust Z* in the formula

181
Q

What are the two most commonly used confidence intervals in practice?

A

95% confidence interval, Z* = 1.96
99% confidence interval, Z* = 2.58

182
Q

How do we find Z* values to use in the calculation of confidence intervals?

A

Use the normal Z-table.
eg to be 96% confident, find the Z value that leaves 2% in each tail (Z* ≈ 2.05).
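
Instead of a printed table, the critical value can be looked up with the inverse CDF of the standard normal. A minimal sketch; the helper name `z_star` is hypothetical.

```python
from statistics import NormalDist

def z_star(confidence_level):
    """Critical value leaving (1 - confidence_level) / 2 in each tail of N(0, 1)."""
    tail = (1 - confidence_level) / 2
    return NormalDist().inv_cdf(1 - tail)

for level in (0.95, 0.96, 0.99):
    print(f"{level:.0%} confidence: Z* = {z_star(level):.2f}")
```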

183
Q

If we want to be more certain that we will capture the population parameter, how does our confidence interval need to change?

A

The confidence interval needs to increase ie become wider. This will increase our confidence level.

Too wide an interval may not be very informative.

184
Q

Why may too wide of a confidence interval be a problem?

A

It may not be very informative eg weather example.

185
Q

How should we interpret a confidence interval [l,u]?

A

We are XY% (eg 95%) confident that the true population parameter is between the lower bound (l) and upper bound (u) of our confidence interval.

186
Q

What do confidence intervals attempt to do?

A

Confidence intervals try to capture the population parameter - they say nothing about the confidence of capturing individual observations, a proportion of observations or about capturing point estimates.

187
Q

What are the steps for significance tests / statistical inference?

A

1 - formulation of the practical problem in terms of statistical hypotheses
2 - construction of a test statistic
3 - description of a critical region and/or the calculation of the p-value
4 - significance level or size of the test
5 - further assessment

188
Q

What is the null hypothesis?

A

The null hypothesis H0 represents what we currently hold as true. H0 is basically a standard with which the evidence for HA can be compared.

One-sample: there is no difference from our previous knowledge (maintenance of status quo)

Two-sample: there is no difference between the populations being compared.

189
Q

What is the alternative hypothesis?

A

HA represents what we want to test.

It expresses the range of situations that we wish the test to be able to diagnose. Depending upon the outcome of the test we may take action.

190
Q

What language is used to summarise the outcome of the null hypothesis?

A

Language - is there enough evidence to reject the null hypothesis (we never accept it).

“H0 is rejected in favour of HA”
“There is insufficient evidence to reject H0 in favour of HA”

191
Q

What is a test statistic?

A

The test statistic is a number calculated from a statistical test of a hypothesis. It shows how closely your observed data match the distribution expected under the null hypothesis of that statistical test.

The test statistic is used to calculate the p value of your results, helping to decide whether to reject your null hypothesis.

It is a function of the data plus the information in the hypothesis H0.

192
Q

What properties should a test statistic satisfy?

A

1 - its probability distribution must be calculable (at least approximately) under the assumption that H0 is true
2 - it should behave differently when H0 is true from when HA is true

193
Q

What is the critical region?

A

A region of values of the test statistic t which support our preference for HA rather than H0

194
Q

If the calculated value of t (calculated under the assumption that H0 is true) falls in a suitable critical region, what do we do?

A

We reject H0 in favour of HA

Otherwise, we are unable to reject H0 in favour of HA

195
Q

If the calculated value of t (calculated under the assumption that H0 is true) does not fall in a suitable critical region, what do we do?

A

We are unable to reject H0 in favour of HA.

196
Q

How are tests constructed in significance tests?

A

So that the lack of information, particularly too little data, tends to result in non-critical values of the test statistic.

Hence, it is unwise to talk positively about “accepting H0”.

Lack of strong evidence to reject H0 in favour of HA may indicate that we have not collected enough data to reject it.

197
Q

Why do we not “accept H0”.

A

Lack of information, particularly too little data, tends to result in non-critical values of the test statistic. Lack of strong evidence to reject H0 in favour of HA may indicate that we have not collected enough data to reject it.

198
Q

What is the p-value?

A

A p-value, or probability value, is the probability of obtaining data at least as extreme as the data observed, assuming the null hypothesis of the statistical test is true.

The p-value quantifies the strength of the evidence against the null hypothesis H0 and in favour of the alternative hypothesis HA.

199
Q

What results in a small p-value?

A

Either H0 is true and an improbable event has occurred, or
HA is true

200
Q

How do you interpret the p-value?

A

If the p-value is small, H0 is rejected in favour of HA
If the p-value is not “small”, the evidence does not support the rejection of H0 in favour of HA.

201
Q

What are two approaches to investigating the null hypothesis?

A

Calculate the p-value
Investigate the test statistic and the critical region

202
Q

In making a test of H0 against HA we can make two kinds of error - what are they?

A

Type 1 - false positive. H0 is rejected when in fact it is true.

Type 2 - false negative. H0 is not rejected when in fact it is false.

203
Q

If conducting a serious clinical trial would you want the significance level alpha to be larger or smaller?

A

Would choose a smaller significance level - we would rather have 1 in 100 errors than 5 in 100 errors.

204
Q

What is the significance level alpha?

A

The significance level of a statistical test is the probability threshold below which a result is considered too unlikely to have occurred by chance alone under H0.

It is the probability of rejecting H0 when in fact it is true, ie committing a Type 1 error.

205
Q

What influences the choice of the significance level alpha?

A

Depends on the particular problem and how serious it is if a true H0 is rejected (false positive), eg medical trials.

206
Q

What does a significance level (alpha) of 0.05 mean?

A

We will allow 5 incorrect rejections of H0 in every 100 tests where H0 is actually true, ie when H0 is true there is a 5% chance of rejecting it purely by chance.
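
A simulation sketch of this interpretation. Every sample is drawn from a population where H0 really is true, and a two-sided Z test with known sigma is run each time; all settings (mean, SD, sample size, number of trials) are arbitrary assumptions.

```python
import random
from math import sqrt
from statistics import NormalDist

random.seed(1)  # fixed seed so the sketch is repeatable

mu0, sigma, n = 100, 15, 30   # hypothetical H0 mean, known population SD, sample size
alpha = 0.05
z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value, about 1.96

trials = 4000
rejections = 0
for _ in range(trials):
    # Draw a sample from a population where H0 really is true
    sample = [random.gauss(mu0, sigma) for _ in range(n)]
    z = (sum(sample) / n - mu0) / (sigma / sqrt(n))
    if abs(z) > z_crit:
        rejections += 1  # a Type 1 error: H0 is true but was rejected

print(f"Type 1 error rate: {rejections / trials:.3f}")
```

The printed rate should land close to alpha, which is what a significance level of 0.05 promises in the long run.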

207
Q

What are the basic conclusions when working with a significance level of 5%?

A

P <= 5% (p <= 0.05) – the test is significant at 5% level and H0 is rejected in favour of HA
P > 5% (p > 0.05) – the test is not significant at the 5% level and H0 is not rejected in favour of HA
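
The decision rule can be sketched for a two-sided Z test of a single mean. This assumes the standard textbook form z = (sample mean - mu0) / (sigma / sqrt(n)) with known population SD; the function name and data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def z_test_single_mean(sample_mean, mu0, sigma, n, alpha=0.05):
    """Two-sided Z test of H0: mu = mu0 against HA: mu != mu0."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value, p_value <= alpha

# Hypothetical data: sample mean 104 from n = 36, testing mu0 = 100 with sigma = 12
z, p, reject = z_test_single_mean(104, 100, 12, 36)
print(f"z = {z:.2f}, p = {p:.4f}, reject H0: {reject}")
```

Here p is just under 0.05, so the test is significant at the 5% level and H0 is rejected in favour of HA.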

208
Q

What are the further assessments that can be made when working with a significance level of 5%

A
  • P > 10% - there is no (or very little) evidence for rejecting H0 in favour of HA
  • 5% < P <= 10% - on the available evidence, we cannot reject H0 in favour of HA but we have some suspicion (ie we would like to obtain more evidence)
    Eg you didn’t reject the null due to a small dataset
  • 1% < p <= 5% - significant at 5% level and H0 is rejected in favour of HA. If the decision to change is important, we should probably seek further evidence
  • 0.1% < p <= 1% - highly significant at the 5% level. There is considerable evidence for rejection of H0 in favour of HA
  • P <= 0.1% - very highly significant at the 5% level. We are very confident that HA is to be preferred to H0
209
Q

When testing a single mean using the Z statistic, what are H0 and HA? How are the test statistic and confidence interval calculated?

A

[See flashcard]

210
Q

When testing the comparison of two means using the Z statistic, what are H0 and HA? How are the test statistic and confidence interval calculated?

A

[See flashcard]

211
Q

For testing a single proportion, what is the formula for the test statistic Z and the confidence interval?

A

[See flashcard]

212
Q

For testing the comparison of two proportions, what is the formula for the test statistic Z and the confidence interval?

A

[See flashcard]

213
Q

What is a t-distribution?

A

The t-distribution, also known as the Student’s t-distribution, is a probability distribution that is similar to the normal distribution, with its bell shape, but has heavier (fatter) tails, so extreme values are more likely. It is used for estimating population parameters when the sample size is small or the population variance is unknown.

214
Q

How does the shape of the t-distribution compare to the standard normal distribution?

A

They are both bell-shaped curves centred at 0. The t-distribution has fatter tails, meaning observations are more likely to fall further away from the mean (over 2 SDs from the mean).

The thicker tails are helpful for resolving our problem with a less reliable estimate of the standard error (since n is small).

215
Q

What are the conditions for the t-distribution?

A

When the population SD is unknown and we have a small data sample (n<30) we address the uncertainty of the standard error using the t distribution.

216
Q

What influences the shape of the t-distribution.

A

It is centred at zero and influenced by one parameter, the degrees of freedom (df). The larger the degrees of freedom, the more closely the t-distribution resembles the standard normal model. When df >= 30, it is nearly indistinguishable from the normal distribution.

217
Q

What are degrees of freedom?

A

Degrees of freedom are the maximum number of logically independent values, which may vary in a data sample. Degrees of freedom are calculated by subtracting one from the number of items within the data sample.

218
Q

What is the cut off value of n for the t-distribution and why?

A

n < 30 - for n >= 30, the t-distribution and the normal distribution are nearly indistinguishable

219
Q

Describe the t-table

A

A t table is a reference statistical table that contains critical values of the t distribution, also known as the t score or t value.

Each row represents a t-distribution with different degrees of freedom. The columns correspond to tail probabilities.

220
Q

What are the formulas for obtaining the t-statistic and confidence intervals for a single mean?

A

[See flashcard]

221
Q

What is a paired comparison t-test?

A

The Paired Samples t Test compares the means of two measurements taken from the same individual, object, or related units.

Each subject has two observations.
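
A sketch of the paired calculation, assuming the usual form t = mean difference / (SD of differences / sqrt(n)) with n - 1 degrees of freedom. The function name and the scores are made up.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t_statistic(before, after):
    """Return (t, df) for H0: the mean of the paired differences is 0."""
    if len(before) != len(after):
        raise ValueError("each subject needs exactly two observations")
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    se = stdev(diffs) / sqrt(n)   # standard error of the mean difference
    return mean(diffs) / se, n - 1

# Hypothetical scores for four subjects measured twice
t, df = paired_t_statistic(before=[10, 12, 9, 11], after=[12, 15, 10, 13])
print(f"t = {t:.2f} on {df} degrees of freedom")
```

The resulting t value would then be compared with the t-table row for the given degrees of freedom.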

222
Q

What are the formulas for obtaining the t-statistic and confidence intervals for a paired comparison?

A

[See flashcard]

223
Q

What is the formula for the test based on the t-distribution for comparison of two means - independent samples?

A

[See flashcard]

224
Q

What is an assumption made in the formula for the test based on the t-distribution for comparison of two means - independent samples?

A

The two populations are assumed to have equal variances, so the pooled variance is used in the calculations.

224
Q

What is the formula for S-pooled in the test based on the t-distribution for comparison of two means - independent samples?

A

[See flashcard]

225
Q

What is the chi squared test?

A

Goodness-of-fit test for classified data - The distribution of a categorical variable in a sample often needs to be compared with the distribution of a categorical variable in another sample.

A chi-squared test is a statistical hypothesis test used in the analysis of contingency tables when the sample sizes are large. In simpler terms, this test is primarily used to examine whether two categorical variables are independent of each other.

226
Q

What is a difference of the chi squared test from the test of proportions?

A

In chi squared tests we don’t assume normal distribution.

227
Q

In a chi squared test, observations are classified into classes. What is the condition for this?

A

Each observation is classified into k mutually exclusive and exhaustive classes ie each observation belongs to one and only one class.

228
Q

Where does the critical region of the chi squared distribution lie and why?

A

The critical region lies in the right hand tail only.

This is because, if H0 is not true, we would expect the expected counts (Ei) to be quite different from the observed counts (Oi), resulting in a larger than expected phi squared value.

A small phi squared results when the Ei and Oi are in good agreement - we wouldn’t want to reject H0 in this case.

229
Q

What is the difference between the distribution of phi squared and chi squared?

A

The exact distribution of phi squared is discrete and is approximated by the continuous chi squared distribution.

  • For this approximation to be reasonable, Ei should be > 5 for each class
  • If not, combine adjacent classes with resultant loss of one or more degrees of freedom

230
Q

To approximate the phi squared distribution (discrete) as the chi squared distribution (continuous) to be reasonable, what needs to be in place?

A

Ei should be > 5 for each class.

If not, combine adjacent classes with the resultant loss of one or more degrees of freedom.

231
Q

In the chi squared test, if the number of observations of Ei is not >5, what must you do?

A

Combine adjacent classes with the resultant loss of one or more degrees of freedom.

232
Q

What is the formula for the test statistic phi in the chi squared test?

A

[See flashcard]

233
Q

What adjustment do you have to make in the test statistic phi in the chi squared test if there is only 1 degree of freedom?

A

Yates’ continuity correction - take the magnitude of each difference (Oi - Ei) and subtract 1/2 before squaring
[See flashcard]
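
A sketch of the statistic with and without the correction, assuming the standard Pearson form: the sum over classes of (Oi - Ei)^2 / Ei. The function name and the counts are hypothetical.

```python
def chi_squared_statistic(observed, expected, yates=False):
    """Sum of (Oi - Ei)^2 / Ei over all classes, optionally with Yates' correction."""
    total = 0.0
    for o, e in zip(observed, expected):
        diff = abs(o - e)
        if yates:
            diff -= 0.5   # continuity correction, intended for 1 degree of freedom
        total += diff ** 2 / e
    return total

# Hypothetical counts: three classes (df = 2), then two classes (df = 1)
print(chi_squared_statistic([18, 22, 20], [20, 20, 20]))
print(chi_squared_statistic([30, 20], [25, 25], yates=True))
```

The correction makes the statistic slightly smaller, which makes the test a little more conservative when there is only one degree of freedom.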

234
Q

What parameters influence the chi squared distribution?

A

The chi squared distribution has just one parameter called the degrees of freedom (df) which influence the shape, centre and spread of the distribution.

235
Q

How does changing the degrees of freedom influence the shape of the chi squared distribution?

A

Higher degrees of freedom – the distribution shifts to the right and becomes flatter

236
Q

How do the t-table and chi-square table differ?

A

One important difference from the t-table is that the chi-square table only provides upper tail values

237
Q

What is an ANOVA test?

A

ANOVA, or Analysis of Variance, is a test used to determine differences between results from three or more unrelated samples or groups.

ANOVA is used to assess whether the mean of the outcome variable is different for different levels of a categorical variable.

238
Q

How do we compare the means of 2 groups?
How do we compare the means for 3 groups?

A
  • 2 groups: Z or a T statistic
  • 3 groups: Analysis of Variance (ANOVA) test and a new statistic called F
239
Q

What are the conditions to be met for ANOVA?

A

1 - The observations should be independent within and between groups.
If the data are a simple random sample from less than 10% of the population, the condition is satisfied. Eg no pairing
2 - The observations within each group should be nearly normal (important when sample sizes are small)
3 - The variability across the groups should be about equal (especially important when the sample sizes differ between groups).

240
Q

What test statistic is used for ANOVA?

A

F statistic

241
Q

What is the purpose of carrying out statistical tests to compare statistics?

A

Compare to see whether they are so far apart that the observed difference cannot reasonably be attributed to sampling variability.

242
Q

With only two groups, how do the t-test and ANOVA compare?

A

They are equivalent, but only if we use a pooled variance in the denominator of the test statistic.

243
Q

With more than two groups, what does ANOVA compare the sample mean to?

A

An overall grand mean

244
Q

What is the formula for the F statistic?

A

F = variability between sample groups / variability within sample groups

245
Q

In order to reject H0, what size does the F statistic need to be?

A

A large F statistic is needed for the p-value to be small to reject the H0.

A large F statistic means the variability between sample groups is greater than the variability within sample groups.

246
Q

What are the different degrees of freedom associated with the ANOVA table?

A

Group - k - 1
Total - n - 1
Error - dft - dfg

ie the difference between the total and the grouped degrees of freedom

247
Q

What are the different sum squares columns in the ANOVA table and how are they calculated?

A

SSG - sum of squares between groups, measures the variability between the groups [see flashcard]

SST - sum squares total, measures the total variability in the dataset [see flashcard]

SSE - sum squares error, measures variability within groups SSE = SST - SSG

247
Q

How do you compare the F value to the F tables / probability value?

A

From F-tables, find the F* value as the value from the column dfg and the row dfe. If F > F*, it is in the critical region therefore it is significant and at least one mean is different (different for at least one group).

The p-value can also be computed. A larger F value corresponds to a smaller p-value, so if F > F* at alpha = 0.05 then p < 0.05.

248
Q

What is the Mean Sq column in the ANOVA table?

A

The mean squares: mean square between groups (MSG) and mean square error (MSE).

Calculated for the group and error row as Sum of squares / degrees of freedom
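
A sketch that builds the ANOVA table quantities end to end. It assumes the standard definitions SSG = sum of n_i x (group mean - grand mean)^2 and SST = sum of (x - grand mean)^2; the function name and data are hypothetical.

```python
from statistics import mean

def anova_table(groups):
    """Return df, sums of squares, mean squares and F for a one-way ANOVA."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)

    ssg = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)  # between groups
    sst = sum((x - grand_mean) ** 2 for x in all_values)             # total
    sse = sst - ssg                                                  # within groups

    df_g, df_t = k - 1, n - 1
    df_e = df_t - df_g
    msg, mse = ssg / df_g, sse / df_e
    return {"df_g": df_g, "df_e": df_e, "SSG": ssg, "SSE": sse,
            "SST": sst, "MSG": msg, "MSE": mse, "F": msg / mse}

# Hypothetical data: three groups of three observations
table = anova_table([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(table)
```

For these data the between-group mean square is much larger than the within-group mean square, so F is large; it would then be compared with F* from the F-table at (df_g, df_e).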

249
Q

What adjustments do we make to the t-test following an ANOVA?

A

Use common variance (MSE from the ANOVA table) instead of each group’s variances in the calculation of the SE.

Use common degrees of freedom (dfE from the ANOVA table).

Use a modified significance level. This addresses the inflated Type I error rate (false positives) that results from running many tests.

250
Q

What is the scenario of testing many pairs of groups with a t test called?

A

Multiple comparisons

251
Q

What significance level adjustment is used for the post-ANOVA t-test?

A

The Bonferroni correction, which is a more stringent significance level.

alpha* = alpha / K

K - number of comparisons being considered

K = k(k-1) / 2, where k is the number of groups

252
Q

How do you calculate the significance level of the Bonferroni correction?

A

alpha* = alpha / K

K - number of comparisons being considered

K = k(k-1) / 2, where k is the number of groups
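
A minimal sketch of the adjustment; the function name `bonferroni_alpha` is hypothetical.

```python
def bonferroni_alpha(alpha, k):
    """Adjusted significance level for all pairwise comparisons among k groups."""
    if k < 2:
        raise ValueError("need at least two groups to compare")
    num_comparisons = k * (k - 1) // 2   # K = k(k-1)/2
    return alpha / num_comparisons

# Four groups give K = 6 pairwise comparisons
print(bonferroni_alpha(0.05, 4))
```

With four groups each pairwise t-test is judged against roughly 0.0083 rather than 0.05.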

253
Q

What is the formula for the standard error of the differences in two means after ANOVA?

A

[see flashcard]

254
Q

What is linear regression?

A

Linear regression is a statistical technique that can be used for prediction and evaluating whether there is a linear relationship between two numerical variables x and y.

Linear regression assumes that the relationship between two variables can be modelled by a straight line

255
Q

Linear regression assumes the relationship between two variables can be modelled by what straight line?

A

y = B0 + B1x

x - predictor variable (explanatory variable, independent variable)
y - response variable (dependent variable)
B0 - intercept (expected value of the response variable when the predictor is 0)
B1 - slope parameter (the change in the mean response for each one-unit increase in the predictor)

256
Q

What does it mean if the slope of the linear regression model line is 0?

A

The predictor x has no effect on the value of the response y

257
Q

How are the parameters B0 and B1 of the linear regression model estimated?

A

Using data - these are point estimates b0 and b1

258
Q

What does y hat represent in the linear regression model determined from point estimates?

A

y_hat indicates it is a collection of estimated (predicted) observations of observed variable y, based on the input collection of predictor observations x

258
Q

How do we rewrite the linear regression model using the point estimate from the data?

A

y_hat = b0 + b1x

259
Q

What are the differences between observed and estimated values termed in the linear regression model?

A

Residuals (epsilon)

There is one residual per observation, so n points give n residuals.

260
Q

What are residuals (epsilon) in the linear regression model?

A

The differences between the observed and estimated values.

261
Q

What is the residual of the i-th observation (xi, yi)

A

The difference of the observed response (yi) and the response we would predict based on the model fit (y_hati)

Ei = yi - y_hati

262
Q

When the regression line represents a good approximation of our dataset, what happens to the residuals?

A

The residuals are small.

The best fitting regression line is the line that has the smallest possible residuals. A poor fitting regression line has large residuals.

263
Q

What is one of the most common approaches of finding the line with the smallest possible residuals?

A

Ordinary least squares regression (OLS)

264
Q

What is the goal of OLS?

A

OLS - ordinary least squares regression (OLS)

Goal is to find the line that minimises the least square criterion ie minimises the sum of the squared residuals [see flashcard]

The line that minimises this least squares criterion is usually called the least squares line
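
A sketch of fitting the least squares line by hand, assuming the standard closed-form estimates b1 = Sxy / Sxx and b0 = y_bar - b1 x x_bar. The function name and data are hypothetical.

```python
from statistics import mean

def least_squares_line(x, y):
    """Return (b0, b1) minimising the sum of squared residuals."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = least_squares_line(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(f"y_hat = {b0:.1f} + {b1:.1f}x")
print("sum of squared residuals:", round(sum(e ** 2 for e in residuals), 2))
```

Any other choice of intercept and slope for these data would give a larger sum of squared residuals.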

265
Q

What is the line that minimises the least squares criterion?

A

The least squares line

266
Q

How do you find the least squares line?

A

[see flashcard]

267
Q

What is the assumption of linearity for the least squares line?

A

The data should show a linear trend. If there is a nonlinear trend, an advanced regression method should be applied.

267
Q

When fitting a least squares line, what conditions need to be met?

A
  • Linearity
  • Nearly normal residuals
  • Constant variability
267
Q

What can we do once we have a formula of the least squares line?

A

We can use input values of x to get predicted values y_hat

With a fitted simple linear model, you’re able to calculate a point estimate y_hati of the mean response at xi

268
Q

What is the assumption of nearly normal residuals for the least squares line?

A

Generally, the residuals must be nearly normal.
When this condition is found to be unreasonable, it is usually because of outliers or concerns about influential points.
Nearly normal residuals are scattered roughly symmetrically around 0, with most values close to 0 and no extreme outliers.

269
Q

What is the assumption of constant variability for the least squares line?

A

The variability of the points around the least squares line remains roughly constant

270
Q

What should we do when we have estimated the regression coefficients b0 and b1?

A

We want to determine how good our model is.

One approach is using the coefficient of determination R^2.

R^2 describes the proportion of the variation in the response that can be attributed to the predictor ie is explained by the least squares line.

Formula [ see flashcard ]

If we can calculate how much of the variation in the response is left in the residuals, we can calculate how much is explained by the predictor.

271
Q

What is the coefficient of determination?

A

We want to determine how good our model is.

One approach is using the coefficient of determination R^2.

R^2 describes the proportion of the variation in the response that can be attributed to the predictor ie is explained by the least squares line.
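
A minimal sketch of one common way to compute R^2, as 1 minus the ratio of residual variation to total variation in the response. The data and the fitted line y_hat = 1.9x are assumptions for illustration, and the helper name `r_squared` is hypothetical.

```python
def r_squared(ys, y_hat):
    """Proportion of the variation in ys explained by the fitted values y_hat."""
    y_bar = sum(ys) / len(ys)
    ss_tot = sum((y - y_bar) ** 2 for y in ys)               # total variation
    ss_res = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))  # left in residuals
    return 1 - ss_res / ss_tot


# Hypothetical data and hypothetical fitted line y_hat = 0 + 1.9*x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 5.0, 8.0]
y_hat = [1.9 * x for x in xs]

r2 = r_squared(ys, y_hat)
print(f"R^2 = {r2:.3f}")
```

A value close to 1 means the line explains most of the variation in the response. A value close to 0 means it explains very little.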

272
Q

What is one of the first steps of data analysis?

A

Descriptive analysis - this helps to understand how the data is distributed and provides important information for further steps.

273
Q

How does R match the input values to the function arguments?

A

By position or by name

274
Q

What is data science?

A

Turning raw data into understanding, insight and knowledge

275
Q

What is a variable?

A

A quantity, quality or property that you can measure.

(values may vary from measurement to measurement)

276
Q

What are synonyms of “variable”?

A
  • Table column
  • Field
  • Attribute
  • Property
  • Feature
  • Vector
  • Dimension
277
Q

What are the two basic types of variable?

A

Numeric
Categorical

278
Q

What are numeric variables?

A

Variables whose values are recorded as numbers (integer or real values)

279
Q

What are categorical variables?

A

Variables whose values are recorded as symbols.

Eg - gender
Eg - countries

280
Q

What are the types of numeric variables?

A

Discrete - numeric variables that may only take on certain (distinct) numeric values. Usually obtained by counting, eg people in a class. Synonyms: integer, count.

Continuous - numeric variables that may take any real value in some interval. Synonyms: float, double, interval, numeric

281
Q

What are discrete variables?

A

Discrete - numeric variables that may only take on certain (distinct) numeric values. Usually obtained by counting, eg people in a class. Synonyms: integer, count.

282
Q

What are continuous variables?

A

Continuous - numeric variables that may take any real value in some interval. Synonyms: float, double, interval, numeric

283
Q

What are the two types of categorical variables?

A

Ordinal - categorical variables whose values can be naturally ranked (eg education levels, driving speed categories).

Nominal - categorical variables whose values cannot be naturally ranked (eg eye colour, gender)

284
Q

What are ordinal variables?

A

Ordinal - categorical variables whose values can be naturally ranked (eg education levels, driving speed categories).

285
Q

What are nominal variables?

A

Nominal - categorical variables whose values cannot be naturally ranked (eg eye colour, gender)

286
Q

What is a dataset?

A

How we store collections of variables

287
Q

What are the different types of datasets?

A

Univariate dataset – a dataset consisting of measurements that correspond to a single variable

Multivariate dataset – a dataset consisting of measurements that correspond to two or more variables. Most relevant when individual components aren’t as useful when considered on their own, eg spatial coordinates. Allows us to think about two or more variables together.

Corresponding data analysis

Univariate data analysis – the analysis performed on a single variable
Multivariate data analysis – the simultaneous analysis of two or more variables

288
Q

What are observations?

A

Measurements made under similar conditions

289
Q

What is a tabular dataset?

A

A set of values, each associated with a variable and an observation.

Variables are table columns.
Observations are table rows.

290
Q

What is tidy tabular data?

A

Tabular data - a set of values, each associated with a variable and an observation.

Tabular data is tidy if each value is placed in its own “cell” - each variable in its own column, each observation in its own row.

291
Q

What is the size of a dataset?

A

Defined by the number of observations (rows) in the table

292
Q

What is the dimensionality of a dataset?

A

Defined by the number of variables (columns) in the table

293
Q

How do we describe a dataset?

A

Size - observations (rows)

Dimensionality - variables (columns)
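
As a rough illustration, the sketch below stores a tiny tidy table as a list of rows and reads off its size and dimensionality. The variable names and values are made up for the example.

```python
# A small hypothetical tidy table: each dict is one observation (row),
# and each key is one variable (column).
rows = [
    {"name": "Ana", "height_cm": 165, "eye_colour": "brown"},
    {"name": "Ben", "height_cm": 180, "eye_colour": "blue"},
    {"name": "Cai", "height_cm": 172, "eye_colour": "green"},
    {"name": "Dee", "height_cm": 158, "eye_colour": "brown"},
]

size = len(rows)               # number of observations (rows)
dimensionality = len(rows[0])  # number of variables (columns)
print(size, dimensionality)
```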

294
Q

What is the population?

A

The (usually) large pool of observational units that we are interested in.

295
Q

What is a sample?

A

A smaller collection of observational units selected from the population.

296
Q

What is sampling?

A

Sampling refers to the process of selecting observations from a population.

Simple random sampling
Stratified sampling
Cluster sampling
Multistage sampling

297
Q

What are the four common sampling strategies covered in this module?

A
  • Simple random sampling
  • Stratified sampling
  • Cluster sampling
  • Multistage sampling
298
Q

Why do we sample?

A

It is usually impractical or impossible to collect data for the whole population and calculate the actual population parameter (eg the population mean), so we use a sample to estimate it.

299
Q

Define a representative sample.

A

A sample is said to be a representative sample if the characteristics of the observational units selected are a good approximation of the characteristics from the original population.

Meal analogy.

300
Q

What is bias?

A

Bias corresponds to a favouring of one group in a population over another group

301
Q

Define generalisability.

A

Generalisability refers to the largest group about which it makes sense to make inferences from the sample collected.

This is directly related to how the sample was selected.

302
Q

What are parameters and statistics?

A

Parameters and statistics are calculations based on the population and sample respectively.

  • Population - parameter - Greek letters
  • Sample - statistic - Latin (Roman) letters
  • The differences are denoted in the notation used
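
To illustrate the distinction, the sketch below computes a parameter from a simulated population and the matching statistic from a sample of it. The population (1,000 simulated heights), the sample size and the seed are all assumptions made for the example.

```python
import random

# Hypothetical population: simulated heights (cm) of 1,000 individuals.
rng = random.Random(1)  # fixed seed so the sketch is repeatable
population = [rng.gauss(170, 10) for _ in range(1000)]

# Parameter: calculated from the whole population (usually written mu).
mu = sum(population) / len(population)

# Statistic: calculated from a sample (usually written x_bar).
sample = rng.sample(population, k=50)
x_bar = sum(sample) / len(sample)

print(f"population mean (parameter): {mu:.2f}")
print(f"sample mean (statistic):     {x_bar:.2f}")
```

The statistic is used as an estimate of the parameter. It will generally be close to the parameter but not equal to it.
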
303
Q

What is a parameter?

A

A calculation based on one or more variables measured in the population.

Denoted by Greek letters.

304
Q

What is a statistic?

A

A calculation based on one or more variables measured in the sample.

Denoted by lower case Latin (Roman) letters (sometimes in combination with other symbols)

305
Q

Describe Simple Random Sampling.

A

A sampling strategy where individuals are selected from the list of units in the population by means of some random process, in such a way that each individual has an equal chance of being selected.

Eg random number tables or pseudo-random number generators.

Selection can be performed sequentially (one at a time without replacement, so that at each stage, remaining individuals in the population have the same probability of being selected).
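
A minimal sketch of simple random sampling with a pseudo-random number generator, using Python's standard `random` module. The population of unit IDs 1 to 100, the sample size and the seed are assumptions for illustration.

```python
import random

rng = random.Random(42)           # fixed seed so the sketch is repeatable
population = list(range(1, 101))  # hypothetical unit IDs 1..100

# Without replacement: each unit can appear in the sample at most once.
sample_without = rng.sample(population, k=10)

# With replacement: the same unit may be drawn more than once.
sample_with = rng.choices(population, k=10)

print(sample_without)
print(sample_with)
```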

306
Q

What is sequential selection?

A

In simple random sampling, selection can be performed sequentially. Individuals can be selected from the population one at a time without replacement, so that at each stage, remaining individuals in the population have the same probability of being selected.

307
Q

Why is selection with replacement less common in practice?

A

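With replacement, the same individual can be selected more than once, and a repeated observation adds no new information about the population. When the sample is small relative to the population, observations selected without replacement are still treated as approximately independent of each other.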

308
Q

What is stratified sampling?

A

Stratified sampling is a divide-and-conquer sampling strategy. The population is divided into groups called strata. The sample of individuals is then drawn from each stratum using some other random sampling process, usually simple random sampling.

Strata are chosen so that units in each stratum are as alike as possible and units in different strata are as different as possible.

This sampling strategy is used in cases when it is known that the population is heterogeneous with respect to one or more variables which may have a bearing on the factor being studied.

Eg if there was a difference in height by gender, you know to take it into consideration.

This ensures things are well represented.
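
The procedure can be sketched as follows: group the units by stratum, then draw a simple random sample within each stratum. The function name `stratified_sample`, the two strata "A" and "B", and the equal allocation of 5 units per stratum are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(units, stratum_of, n_per_stratum, rng):
    """Draw a simple random sample of n_per_stratum units from each stratum."""
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, n_per_stratum))
    return sample


# Hypothetical population: (id, stratum) pairs, 60 in stratum A and 40 in B.
population = [(i, "A") for i in range(60)] + [(i, "B") for i in range(60, 100)]

sample = stratified_sample(population, lambda u: u[1], 5, random.Random(7))
print(sample)
```

Equal allocation is only one option. Sampling each stratum in proportion to its size is another common choice.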

309
Q

Which sampling strategy is described as a divide-and-conquer strategy?

A

Stratified sampling

310
Q

How are strata chosen for stratified sampling?

A

Strata are chosen so that units in each stratum are as alike as possible and units in different strata are as different as possible.

311
Q

What are the purposes of stratification?

A

1 - to increase the accuracy and precision of the overall population estimates.

2 - to ensure that domains of study are adequately represented.

312
Q

What is cluster sampling?

A

A sampling strategy where the population is divided into many groups, called clusters, and then we sample a fixed number of clusters and include all observations from each of those clusters in the sample.

[Clusters are formed based on convenience, not on a measure of interest, ie the measure of interest is not why a unit is in that cluster]

Eg divide the class into tables and pick a sample of two tables.

313
Q

What is multistage sampling?

A

A sampling strategy where the population is divided into many groups, called clusters. We sample a fixed number of clusters and then collect a random sample within each selected cluster.

Similar to cluster sampling (but rather than keeping all observations in each cluster, we collect a random sample within each selected cluster)
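
The difference between cluster and multistage sampling can be sketched in a few lines. The class of 6 tables with 5 students each, the number of clusters chosen and the within-cluster sample size are all assumptions for illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical class of 6 tables (clusters) with 5 students each.
clusters = {t: [f"table{t}_student{s}" for s in range(5)] for t in range(6)}

# Stage 1 (both strategies): randomly choose a fixed number of clusters.
chosen = rng.sample(sorted(clusters), k=2)

# Cluster sampling: keep every observation in each chosen cluster.
cluster_sample = [u for t in chosen for u in clusters[t]]

# Multistage sampling: take a random sample within each chosen cluster.
multistage_sample = [u for t in chosen for u in rng.sample(clusters[t], k=3)]

print(cluster_sample)
print(multistage_sample)
```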

314
Q

Why might cluster or multistage sampling be preferred?

A

Sometimes it can be more economical than the alternative sampling techniques.

They are most helpful when there is a lot of case-to-case variability within the cluster, but the clusters themselves don’t look very different from one another

eg neighbourhoods as clusters

315
Q

What are negatives of cluster/multistage sampling?

A

More advanced analysis techniques are typically required.

316
Q

What should you consider when selecting a sampling strategy?

A

The situation, time and money.

Simple random sampling may be the best to get representation but it can be expensive.

Multistage sampling can reduce the costs without reducing reliability.

317
Q

What are the stages of data science?

A

Collect data, process it and clean it.

Exploratory data analysis (EDA) and use of machine learning, algorithms and statistical models

Communicate: visualise and report findings. [Which leads to making decisions]

Build data product.

Data science is a cyclical process - once you build the data product, more data becomes available.

318
Q

What is exploratory data analysis (EDA)?

A

A creative process of exploring data sets for patterns and relationships.

Starting with lots of visualisations and summaries is a good idea.

319
Q

What are the goals of EDA?

A

1 - Develop an understanding about data by formulating questions
2 - Search for answers using visualisation techniques and summary statistics
3 - Use answers obtained to refine questions and/or generate new questions

320
Q

What are 5 techniques used in EDA to search for answers?

A

Using visualisations and summary techniques

  • Visualise distributions of all variables (using box plots and histograms)
  • Visualise time series of data
  • Investigate all pairwise relationships between variables using scatterplots
  • Perform data cleaning and variable transformation
  • Perform summary statistics (mean, median, lower and upper quartiles, minimum and maximum values, identify missing data, errors and outliers)
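
As a hedged sketch of the summary statistics step, the example below uses Python's standard `statistics` module on a made-up dataset. The quartiles come from `statistics.quantiles` with its default method, which may differ slightly from the quartile rule taught in the course.

```python
import statistics

data = [2, 4, 4, 5, 7, 9, 30]  # hypothetical values; 30 is a possible outlier

mean = statistics.mean(data)
median = statistics.median(data)
q1, q2, q3 = statistics.quantiles(data, n=4)  # quartile cut points

# Five number summary: minimum, lower quartile, median, upper quartile, maximum.
five_number = (min(data), q1, median, q3, max(data))

print(f"mean = {mean:.2f}, median = {median}")
print("five number summary:", five_number)
```

The mean is pulled well above the median by the single large value, which is one quick way to spot a possible outlier.
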
321
Q

What kind of questions do you need to ask at the beginning of the EDA process?

A

Start simple. It is difficult to ask revealing questions at the start of an analysis because you do not yet know what insights are hidden in your dataset.

There are no universal rules about which questions to ask to guide your research.

Useful starting points
- What type of variation occurs within my variables?
- What is the relationship between variables?

322
Q

What are summary statistics?

A

Statistics used to quantitatively describe a collection of measurements by summarising them in the form of a single value

323
Q

Describe the summary statistics and visualisation techniques for numerical variables.

A

Summary statistics:
- Measures of centrality (mean, mode, median) ie the most typical values
- Measures of variability (variance, standard deviation, range, quantiles, five number summary) ie the spread of the data

Visualisation techniques:
- Histograms
- Boxplots

323
Q

How do you answer, what type of variation occurs within my variables?

A

Summary statistics and visualisation techniques

Numeric:
- Measures of centrality
- Measures of variability
- Histograms and box plots

Categorical:
- Counts, percentages and proportions
- Bar charts

324
Q

Describe the summary statistics and visualisation techniques for categorical variables

A

Summary statistics
- Counts
- Percentages
- Proportions

Visualisation techniques
- Bar charts

325
Q

How do you investigate the relationship between variables?

A

Summary statistics
- Covariance and correlation (N-N)
- Contingency tables (C-C)

Visualisation techniques
- Scatterplots (N-N)
- Paired boxplots (N-C)
- Paired histograms (N-C)
- Mosaic plots (C-C)
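
A minimal sketch of the two summary statistics listed above: sample covariance and correlation for a numeric-numeric pair, and a contingency table for a categorical-categorical pair. All data values and variable names are made up for illustration.

```python
from collections import Counter
from math import sqrt

# Numeric-numeric (N-N): sample covariance and correlation.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 5.0, 8.0]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
sx = sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))  # sample std dev of x
sy = sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))  # sample std dev of y
correlation = covariance / (sx * sy)

# Categorical-categorical (C-C): contingency table as counts of value pairs.
eye_colour = ["brown", "blue", "brown", "green", "blue", "brown"]
glasses = ["yes", "no", "no", "yes", "no", "yes"]
contingency = Counter(zip(eye_colour, glasses))

print(f"covariance = {covariance:.3f}, correlation = {correlation:.3f}")
print(dict(contingency))
```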

326
Q
A