Data Analytics Theory Flashcards

Question 1

Q

Which of the mean, mode and median are resistant to outliers?

Answer

A

The mean is very sensitive to the presence of outliers. The median and mode are very resistant to outliers.
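
A quick way to see this is to add one extreme value to a small sample and compare the three measures before and after. The sketch below uses made-up numbers (the lists `values` and `with_outlier` are illustrative assumptions, not data from the course).

```python
from statistics import mean, median, mode

# Hypothetical sample; the second list adds a single extreme value.
values = [2, 3, 3, 4, 5]
with_outlier = values + [100]

for name, stat in [("mean", mean), ("median", median), ("mode", mode)]:
    # Print each measure without and with the outlier.
    print(name, stat(values), stat(with_outlier))
# mean moves from 3.4 to 19.5, median from 3 to 3.5, mode stays at 3
```

The mean is dragged towards the outlier, while the median barely moves and the mode does not move at all.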

Question 2

Q

True or false - the median is calculated differently depending on whether the sample contains an even or odd number of observations?

Question 3

Q

What are the steps for determining the mean?

Answer

A

The sum of all sample values (Xi) divided by the number of samples.

Question 4

Q

What are the steps for determining the median?

Answer

A

Sort the sample values in ascending order. For an odd number of observations n, the median is the value at position (n+1)/2. For an even n, the median is the average of the values at positions n/2 and (n+2)/2.
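
The position rule can be sketched in a few lines. The function name `median_by_position` is an illustrative choice; positions in the card are 1-based, so the code subtracts 1 when indexing.

```python
def median_by_position(values):
    """Median using the 1-based position rule from the card."""
    ordered = sorted(values)  # step 1: ascending order
    n = len(ordered)
    if n % 2 == 1:
        # odd n: the value at position (n + 1) / 2
        return ordered[(n + 1) // 2 - 1]
    # even n: average of the values at positions n/2 and (n + 2)/2
    lower = ordered[n // 2 - 1]
    upper = ordered[(n + 2) // 2 - 1]
    return (lower + upper) / 2


print(median_by_position([7, 1, 5]))     # 5
print(median_by_position([4, 1, 3, 2]))  # 2.5
```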

Question 5

Q

What are the steps for determining the mode?

Answer

A

Create a frequency table, find the highest frequency, then read off the value (or values) that it belongs to.
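
As a minimal sketch of these steps, `collections.Counter` can play the role of the frequency table. The helper name `modes` is an assumption, and it returns a list because more than one value can share the highest frequency.

```python
from collections import Counter


def modes(values):
    """Return every value that has the highest frequency."""
    freq = Counter(values)          # frequency table: value -> count
    highest = max(freq.values())    # the highest frequency
    return sorted(v for v, count in freq.items() if count == highest)


print(modes(["a", "b", "b", "c"]))  # ['b']
print(modes([1, 1, 2, 2, 3]))       # [1, 2]
```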

Question 6

Q

What is the mode?

Answer

A

The observation that occurs most frequently in the dataset.

Question 7

Q

Why should we not look at measures of centrality in isolation?

Answer

A

Comparing the measures of centrality between datasets may indicate that they are similar when in reality they have different amounts of dispersion.

Question 8

Q

What are the measures of centrality?

Answer

A

Mean, mode, median

Question 9

Q

What do measures of variability describe?

Answer

A

Measures of variability describe how dispersed observations in the univariate dataset are. They describe whether observations are tightly clustered or spread out.

Question 10

Q

What are the measures of variability?

Answer

A

Variance and standard deviation. Range (though very sensitive to outliers). Five number summary provides basic information about variability.

Question 11

Q

What are synonyms of the mean?

Answer

A

Arithmetic mean or average

Question 12

Q

What is the formula for calculating the mean?

Answer

A

[See flashcard]

Question 13

Q

What is the mean?

Answer

A

The mean is considered to be the central (typical) measurement of a collection of observations.

Question 14

Q

What is the formula for calculating the standard deviation?

Answer

A

[See flashcard]

Question 15

Q

What is the formula for calculating the variance?

Answer

A

[See flashcard]

Question 16

Q

What is the variance?

Answer

A

The average squared distance of each observation from the mean. Measured in units squared.

Question 17

Q

What units is the variance measured in?

Answer

A

Units squared

Question 18

Q

What is the standard deviation?

Answer

A

The square root of the variance - it is useful for judging how close the observations are to the mean. Measured in the same units/same scale as the observations in the numerical variable.
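
The formula cards are not reproduced here, so the sketch below is an assumption about their content: it offers both the population form (divide by n, the literal "average squared distance") and the sample form (divide by n - 1), selected by a hypothetical `sample` flag.

```python
import math


def variance(values, sample=True):
    """Variance: squared distances from the mean, averaged."""
    n = len(values)
    m = sum(values) / n
    squared = sum((x - m) ** 2 for x in values)
    # sample variance divides by n - 1, population variance by n
    return squared / (n - 1 if sample else n)


def standard_deviation(values, sample=True):
    """Square root of the variance, in the original units."""
    return math.sqrt(variance(values, sample))


data = [2, 4, 4, 4, 5, 5, 7, 9]
print(variance(data, sample=False))            # 4.0
print(standard_deviation(data, sample=False))  # 2.0
```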

Question 19

Q

What units is the standard deviation measured in?

Answer

A

Same units/scale as the observations in the numerical variable.

Question 20

Q

How much of the data is usually within one standard deviation from the mean?

Question 21

Q

How much of the data is usually within two standard deviations from the mean?

Question 22

Q

What are order statistics?

Answer

A

Statistics based on sorted (ranked) data

Question 23

Q

Define a quantile.

Answer

A

The value computed from a sorted collection of numerical measurements (in ascending order) that indicates an observation’s rank when compared to all other present observations. It can take a value between 0 and 1.

Question 24

Q

What does the 0.5th quantile mean?

Answer

A

This is the median value, below which half (50%) of the measurements lie.

Question 25

Q

What values can a quantile take?

Answer

A

Between 0 and 1.

Question 26

Q

What values can a percentile take?

Answer

A

Between 0 and 100.

Question 27

Q

What is the relationship between a quantile and a percentile?

Answer

A

The percentile is the quantile expressed on a “percent scale” of 0 to 100, ie the p-th quantile (p between 0 and 1) is the (100 x p)th percentile. For example, the 0.25 quantile is the 25th percentile.

Question 28

Q

Define percentile.

Answer

A

The percentile is the quantile expressed on a “percent scale” of 0 to 100, ie the p-th quantile (p between 0 and 1) is the (100 x p)th percentile. The Pth percentile is the cutoff point that indicates that at least P percent of the observations in the dataset take on this value or less.

Question 29

Q

What does the 80th percentile represent?

Answer

A

The 80th percentile is the cutoff point which indicates that 80% of observations in the dataset may be found at this point or below.

Question 30

Q

What are quartiles?

Answer

A

Quartiles are three cut off points that divide the dataset into four equal groups (Q1, Q2, Q3)

Question 31

Q

Define the first quartile

Answer

A

Q1 = 0.25th quantile = 25th percentile. This is the middle value between the smallest observation and the median. Ie it is the median of the lower half of the dataset.

Question 32

Q

Define the second quartile.

Answer

A

Q2 = 0.5th quantile = 50th percentile. This is the median of the dataset (the value which splits the dataset in half).

Question 33

Q

Define the third quartile.

Answer

A

Q3 = 0.75th quantile = 75th percentile. This is the middle value between the median and the highest observation in the dataset. Ie it is the median of the upper half of the dataset.

Question 34

Q

Define the range.

Answer

A

The range is the difference between the smallest and largest observations in a numerical variable. It is extremely sensitive to outliers and therefore not very useful as a general measure of dispersion in the data.

Question 35

Q

Why is the range not very useful as a general measure of dispersion in the data?

Answer

A

It is extremely sensitive to outliers - its calculation involves the use of extreme values.

Question 36

Q

What is the five number summary?

Answer

A

This provides basic information about variability in the dataset. It consists of the 0th percentile (minimum), 25th percentile (Q1), 50th percentile (Q2), 75th percentile (Q3) and 100th percentile (maximum). Ie it is the quartiles plus the maximum and minimum values.

Question 37

Q

What is the interquartile range?

Answer

A

The interquartile range (IQR) measures the width of the “middle 50 percent” of the data. It is the range of values between Q1 (0.25 quantile) and Q3 (0.75 quantile). It is very resistant to outliers as it doesn’t consider the extremes where outliers are present.
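
A small sketch of the five number summary and the IQR, using the "median of the lower/upper half" definition of Q1 and Q3 from the quartile cards. Excluding the middle value from both halves when n is odd is an assumption (software packages differ), and the function names are illustrative. It needs at least two observations.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(values)        # order statistics need sorted data
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]          # lower half of the data
    upper = ordered[half + n % 2:]  # upper half (skips the middle value if n is odd)
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])


def iqr(values):
    """Width of the middle 50 percent: Q3 - Q1."""
    _, q1, _, q3, _ = five_number_summary(values)
    return q3 - q1


data = [1, 2, 3, 4, 5, 6, 7, 100]
print(five_number_summary(data))  # (1, 2.5, 4.5, 6.5, 100)
print(iqr(data))                  # 4.0
print(max(data) - min(data))      # 99, the range is dominated by the outlier
```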

Question 38

Q

Why is the IQR resistant to outliers?

Answer

A

The IQR measures the range across the middle 50% of the data, and therefore unlike the range it doesn’t consider the extremes where the outliers are present.

Question 39

Q

What is the first step to carry out before determining order statistics?

Answer

A

Sort the data in ascending order.

Question 40

Q

What is covariance?

Answer

A

Covariance measures joint variability — the extent of variation between two random variables. It quantifies how two variables vary together.

Question 41

Q

What are the possible outcomes for covariance and what does each mean?

Answer

A

cov(x, y) = 0 - there is no linear relationship between numerical variables x and y.
cov(x, y) > 0 - there is a positive linear relationship between numerical variables x and y (as x increases, y increases and vice versa).
cov(x, y) < 0 - there is a negative linear relationship between numerical variables x and y (as x increases, y decreases and vice versa).
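
The sign interpretation can be checked with a short sketch. It assumes the sample covariance (dividing by n - 1), and the data are made up. The last line shows that rescaling x rescales the covariance, so its size alone says nothing about strength.

```python
def covariance(xs, ys):
    """Sample covariance of two equally long numerical variables."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


xs = [1, 2, 3, 4]
print(covariance(xs, [2, 4, 6, 8]) > 0)  # True: y rises with x
print(covariance(xs, [8, 6, 4, 2]) < 0)  # True: y falls as x rises
# Same relationship, x measured in different units: covariance is 100 times larger
print(covariance([100 * x for x in xs], [2, 4, 6, 8]))
```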

Question 42

Q

What does a positive linear relationship mean?

Answer

A

R > 0 - as x increases, y increases and vice versa

Question 43

Q

What does a negative linear relationship mean?

Answer

A

R < 0 - as x increases, y decreases and vice versa

Question 44

Q

Does correlation or covariance measure how strong a relationship is?

Answer

A

Correlation

Question 45

Q

Why does calculating the covariance not tell us how strong a relationship is?

Answer

A

Covariance can tell us if there is a relationship between two variables, but it cannot measure how strong the relationship is, as there is no fixed scale to compare the covariance value against.

Question 46

Q

What type of variable can covariance and correlation be calculated for?

Answer

A

Numerical variables.

Question 47

Q

What is the problem with covariance?

Answer

A

We cannot quantify the strength of the linear relationship between two variables. There are no upper or lower limits on the values that the covariance can take.

Question 48

Q

What does correlation measure?

Answer

A

The direction and strength of an association between two variables. It is used to interpret the covariance.

Question 49

Q

What coefficient do we use for correlation?

Answer

A

Pearson’s product-moment correlation coefficient (ρxy, “rho xy”).

Question 50

Q

What are the interpretations of the absolute strength of the Pearson’s product-moment correlation coefficient?

Answer

A

There are guidelines available to interpret the value of rho.
|rho| = 0.0 – no linear relationship
0.0 < |rho| <= 0.19 – very weak L.R.
0.20 <= |rho| <= 0.39 – weak L.R.
0.40 <= |rho| <= 0.59 – moderate L.R.
0.60 <= |rho| <= 0.79 – strong L.R.
0.80 <= |rho| < 1.0 – very strong L.R.
|rho| = 1.0 – perfect L.R.
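
The guideline bands can be turned into a small lookup. This is a sketch, not a standard library routine: `pearson` and `strength` are illustrative names, and the cut-offs are the ones listed on this card.

```python
import math


def pearson(xs, ys):
    """Pearson's product-moment correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def strength(rho):
    """Label |rho| using the guideline bands from the card."""
    a = abs(rho)
    if a == 0:
        return "no linear relationship"
    for upper, label in [(0.20, "very weak"), (0.40, "weak"), (0.60, "moderate"),
                         (0.80, "strong"), (1.0, "very strong")]:
        if a < upper:
            return label
    return "perfect"


rho = pearson([1, 2, 3, 4], [2, 4, 6, 8])
print(rho, strength(rho))
print(strength(-0.45))  # moderate
```

With real data, rounding error can leave a perfectly linear relationship a hair below 1, so an exact "perfect" label should be treated with care.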

Question 51

Q

What are the basic interpretations of the Pearson’s product-moment correlation coefficient?

Answer

A

If rho = 1, there is a perfect positive linear relationship between variables x and y.
If 0 < rho < 1, there is a positive linear relationship between x and y. The closer to 1 the stronger it is.
If rho = -1, there is a perfect negative linear relationship between x and y.
If -1 < rho < 0, there is a negative linear relationship between x and y. The closer to -1 the stronger it is.
If rho = 0, there is no linear relationship between x and y.

Question 52

Q

What values can Pearson’s product-moment correlation coefficient take on?

Answer

A

Rho is between -1 and 1.

Question 53

Q

Why are we able to say how strong the relationship is using Pearson’s product-moment correlation coefficient?

Answer

A

It is scaled between - 1 and 1.

Question 54

Q

What is a frequency table?

Answer

A

A statistical technique used to get more insight into the properties of categorical variables.

Question 55

Q

What are the columns of a frequency table?

Answer

A

1 - category
2 - frequency column (F) - the number of occurrences of each category. Will total to n
3 - relative frequency (RF) - the proportion of occurrences of each category (F/n). The sum of all relative frequencies when written as proportions must be equal to 1.
4 - percentages (P) - proportions multiplied by 100. The sum of this column must equal 100.
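
A frequency table with these four columns can be sketched as follows. The function name and the example observations are assumptions for illustration.

```python
from collections import Counter


def frequency_table(observations):
    """Rows of (category, F, RF, P) for one categorical variable."""
    counts = Counter(observations)
    n = len(observations)
    return [(category, f, f / n, 100 * f / n)
            for category, f in counts.most_common()]


table = frequency_table(["ham", "spam", "ham", "ham"])
for row in table:
    print(row)
# ('ham', 3, 0.75, 75.0)
# ('spam', 1, 0.25, 25.0)
```

The F column sums to n, the RF column to 1 and the P column to 100.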

Question 56

Q

What does the relative frequency column of a frequency table sum to?

Question 57

Q

Why are frequency tables useful?

Answer

A

They help us to summarise large amounts of data and display this information clearly. We can see the most/least common categories and can calculate proportions.

Question 58

Q

What are contingency tables used for?

Answer

A

A contingency table summarises data for two categorical variables (table of counts by category). Each value in the table represents the number of times a particular combination of variable outcomes occurred.

Question 59

Q

What is the relationship between a frequency table and contingency table?

Answer

A

Both tables are used to summarise information on categorical variables. A frequency table summarises information on a single categorical variable, whereas a contingency table summarises the data for two categorical variables.

Question 60

Q

What kind of tool can be used to answer questions like “what proportion of spam emails contains text without numbers?”

Answer

A

Two categorical variables - contingency table
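
A question like this could be answered from a contingency table along the following lines. The email records are invented for illustration; each one is a pair (type, number content).

```python
from collections import Counter

# Hypothetical observations of two categorical variables per email.
emails = [
    ("spam", "none"), ("spam", "small"), ("spam", "none"),
    ("ham", "none"), ("ham", "small"),
]

# Contingency table: count of each combination of outcomes.
table = Counter(emails)

spam_total = sum(count for (kind, _), count in table.items() if kind == "spam")
proportion = table[("spam", "none")] / spam_total
print(table[("spam", "none")], spam_total)  # 2 3
```

Here `proportion` is the share of spam emails that contain no numbers, read from one cell and its row total.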

Question 61

Q

What are bar charts used to visualise?

Answer

A

Categorical variables. This can be represented as frequency or proportion.

Question 62

Q

How are categorical variables visualised?

Answer

A

Bar charts - this can be by frequency or proportion.

Question 63

Q

What are the different axes of a bar chart?

Answer

A

The x-axis represents the different symbols (categories) of a categorical variable. The y-axis represents the frequency or proportion of the occurrence of each category.

Question 64

Q

What is a mosaic plot?

Answer

A

A graphical representation of the information in a contingency table. It is similar to a bar plot.

Answer 61

A

A mosaic plot can be used to visualise one or two categorical variables from a contingency table.

Answer 62

A

Mosaic plots use the area of each box to represent the number of observations that the box represents.

Answer 63

A

A mosaic plot

Answer 64

A

One category (x) is used to create an initial one variable mosaic plot where the area represents the number of observations for that category. The second category (y) is represented by splitting each bar proportionally according to the fractions of y.

Answer 65

A

Numerical variables

Answer 66

A

A plot that provides a case-by-case view of data for two numerical variables.

Answer 67

A

Scatterplots are helpful in quickly spotting associations between two numerical variables.

Answer 68

A

A visualisation technique used for explaining important features of the distribution of the target numerical variable. It provides insight into centrality, spread, skewness and possible outliers.

Answer 69

A

Centrality (median), spread (quartiles), skewness and possible outliers.

Answer 70

A

No, the whiskers may not capture the maximum and minimum values. The whiskers are determined differently depending on the software package used, eg they may reach at most 1.5 times the IQR beyond the quartiles.

Answer 71

A

Identifying outliers.

Answer 72

A

Right-skewed

Answer 73

A

Left-skewed

Answer 74

A

Suspected outliers are the observations beyond the maximum reach of the whiskers.

Answer 75

A

An outlier is an observation that appears extreme relative to the rest of the data

Answer 76

A

To identify a strong skew in the distribution
To identify data collection or entry errors
To get an insight into interesting properties of the data

Answer 77

A

Side-by-side box plots are a traditional tool for comparing numerical observations across categories. They are particularly useful for comparing centrality and spread of numerical observations between categories.

Answer 78

A

Side-by-side box plots

Answer 79

A

Comparison of centrality and spread of numerical observations between categories.

Answer 80

A

Describe what you see
Relate this to the question (ie what does this mean in real life)
Support with figures from the graph

Answer 81

A

Histograms are plots that are used for describing the shape of the data distribution of the target numerical variable. They also provide a view of the data density of the target numerical variable (higher bars represent where data is more common).

Answer 82

A

Numerical

Answer 83

A

Histogram

Answer 84

A

Where the data are relatively more common.

Answer 85

A

Histogram - where higher bars represent where the data are relatively more common.

Answer 86

A

They use bars to represent frequencies / they both measure frequencies.

Answer 87

A

Histograms are used for displaying distributions of numerical variables, while bar charts are used for categorical variables.
Both measure frequencies, but in histograms, observations first need to be “binned”.

Answer 88

A

A defined interval (used to group individual numerical values). The number of observations that fall within each interval are counted and this frequency is used to determine the height of the bar for that interval.

Answer 89

A

The chosen bin width can alter the story that the histogram is telling. Increasing the bin width may decrease the number of modes that are visible.

Answer 90

A

1 - define the bins and bin sizes (software may determine this)
2 - once defined, count how many observations fall into each interval
3 - plot
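
Steps 1 and 2 can be sketched without any plotting library. The function below assumes equal-width, half-open bins [low + i*width, low + (i+1)*width) and ignores values outside them; the names and data are illustrative.

```python
def histogram_counts(values, low, width, n_bins):
    """Count how many observations fall into each equal-width bin."""
    counts = [0] * n_bins
    for v in values:
        index = int((v - low) // width)  # which interval v falls into
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts


values = [1, 2, 2, 3, 7, 8, 9]
print(histogram_counts(values, low=0, width=2, n_bins=5))  # [1, 3, 0, 1, 2]
print(histogram_counts(values, low=0, width=5, n_bins=2))  # [4, 3]
```

The same data show two separate peaks with bins of width 2, but look almost flat with bins of width 5.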

Answer 91

A

The mode is represented by a prominent peak in the distribution.

Answer 92

A

Histograms can show how many and what the modes of a distribution are.
- Unimodal / bimodal / multimodal

Answer 93

A

When data trails off to the right ie observations are clustered on the left of the axis and there is a long tail to the right.

Answer 94

A

When data trails off to the left ie observations are clustered on the right of the axis and there is a long tail to the left.

Answer 95

A

Right-skewed

Answer 96

A

Left-skewed

Answer 97

A

Symmetric

Answer 98

A

A dataset that shows roughly equal trailing off in both directions.

Answer 99

A

A lot of statistical inference relies on data being normally distributed.

Answer 100

A

Mean and standard deviation

Answer 101

A

Symmetric

Answer 102

A

Median and IQR - they are robust to outliers.

Answer 103

A

mean ~ median ~ mode

Answer 104

A

mode < median < mean

Answer 105

A

mean < median < mode

Answer 106

A

The mean is pulled in the direction of the tail, towards the extremes. The mode is pulled in the opposite direction (where the data is clustered)

Answer 107

A

y = sqrt(x)
y = ln(x)
y = -1/x
In increasing order of skewness severity
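
Assuming this card is about right-skewed data, the effect of the three transformations can be compared with a moment-based skewness measure. The helper `skewness` and the sample are illustrative assumptions; all values must be positive for the log and reciprocal.

```python
import math


def skewness(values):
    """Moment-based skewness: positive means a longer right tail."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((x - m) ** 2 for x in values) / n
    m3 = sum((x - m) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


data = [1, 2, 3, 4, 5, 50]  # hypothetical right-skewed sample
raw = skewness(data)
root = skewness([math.sqrt(x) for x in data])
log = skewness([math.log(x) for x in data])
inverse = skewness([-1 / x for x in data])
print(raw, root, log, inverse)
```

Each transformation in the list pulls the right tail in harder than the one before it, which is why the stronger ones are kept for more severe skew; on this sample the reciprocal even overshoots into negative skewness.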

Answer 108

A

y = x^2
y = x^3
In increasing order of skewness severity

Answer 109

A

Depending on bin size, the story the graph tells can change. If the bin size is too wide, it may mislead you into thinking that the data is normally distributed.

Answer 110

A

Absolute frequency or relative frequency (F/n)

Answer 111

A

They have the same shape. The difference is the Y-axis and the fact that the areas of the bars of the relative frequency histogram add up to one.

Answer 112

A

The absolute frequency divided by the total number of observations

Answer 113

A

Use the relative frequency histogram when we want to investigate whether the proportion is less than or greater than a certain value. Ie we want to look at proportion rather than frequency.

Answer 114

A

Can’t determine an exact answer with these bin widths, we can only estimate. To answer accurately we need to have a narrower histogram (one with smaller bins)

Answer 115

A

The histogram forms a smoother curve, approaching the density curve.

Answer 116

A

A density curve is a smoothed version of the relative frequency histogram. It is used for the visualisation of continuous variables or very large populations. It also represents a probability density function. The area under the curve is equal to 1.

Answer 117

A

A continuous variable.

Answer 118

A

The area corresponds to measuring probabilities. The total area is equal to 1. Similar to the bars in a relative frequency diagram.

Answer 119

A

The probability that x is equal to some value from the continuous distribution is ALWAYS equal to 0. This happens because a single point on the density curve diagram has a width of 0 and therefore we can’t obtain the area underneath the curve at a single point.

Answer 120

A

The normal curve or normal distribution.

Answer 121

A

It is a unimodal, bell-shaped curve that is symmetric around its mean
Mean, mode and median are equal
It is determined by two parameters (mu and sigma), usually denoted as N(mu, sigma)
The area under the normal curve is 1

Answer 122

A

Mu and sigma - N(mu, sigma)

Answer 123

A

A normal distribution where mu = 0 and sigma = 1, represented as N(0,1)

Answer 124

A

The standard normal distribution

Answer 125

A

Transform our dataset onto the standard normal distribution. This enables us to refer to the standardised tables.

Answer 126

A

Mu (mean) - the centre of the curve, changing mu shifts the curve left / right
Sigma (standard deviation) - the width of the curve. Changing sigma stretches or constricts the curve

Answer 127

A

68-95-99.7 Rule
- 68% of observations lie within 1 SD away from the mean in the normal distribution
- 95% of observations lie within 2 SDs
- 99.7% of observations lie within 3 SDs
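
The three percentages can be checked against the standard normal distribution with Python's `statistics.NormalDist`. The helper name `within` is an illustrative choice.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)


def within(k):
    """Proportion of a normal distribution within k SDs of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(within(k), 4))
```

The results are about 0.6827, 0.9545 and 0.9973, which round to the 68%, 95% and 99.7% of the rule.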

Answer 128

A

68%, 95%, 99.7%

Answer 129

A

We should convert available observations into the standard deviation units and measure their distances from the mean.
To perform this type of conversion we use the standardisation technique called Z-score.

Answer 130

A

The Z-score of an observation is the number of standard deviations it falls above or below the mean. It is used to analyse normally distributed data.

Answer 131

A

For an observation x that follows the normal distribution N(mu, sigma)
Z = (x - mu) / sigma
By calculating a Z-score we “convert” the data value from its normal distribution N(mu, sigma) to a value from the standard normal distribution N(0,1) in such a way that it maintains all the properties of the original dataset.
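
A minimal sketch of the conversion, with made-up values for the mean and standard deviation:

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma


print(z_score(130, mu=100, sigma=20))  # 1.5
print(z_score(70, mu=100, sigma=20))   # -1.5
```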

Answer 132

A

The observation is one standard deviation above the mean.

Answer 133

A

z = -1.5.

Answer 134

A

You can use Z-scores to roughly identify which observations are more unusual than others. If the absolute value of the Z-score is larger, it is more unusual - |z1| > |z2| means z1 is more unusual.

Answer 135

A

The more unusual observation will have a larger absolute Z-score, ie it will be more standard deviations away from the mean.

Answer 136

A

Magnitude - the number of standard deviations away from the mean the observation is.
Sign - whether the observation lies above (positive) or below (negative) the mean.

Answer 137

A

Z ~ N(0,1)
It follows that it is normally distributed once transformed.

Answer 138

A

We transform it to the standard normal distribution (Z scores) and use the N(0,1) percentiles, which are listed in a normal probability table to determine the percentile based on the Z score.

Answer 139

A

1 – draw and label a picture of the normal distribution (doesn’t need to be exact)
2 – shade in the region of interest
3 – calculate the Z-score of the cutoff value
4 – look up the percentile for the Z-score in the normal probability table
5 – do you need to subtract from 1?
Always verify that the final answer makes sense with the picture you drew.
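
Steps 3 to 5 can be sketched with `statistics.NormalDist` standing in for the printed normal probability table. The function names and the example numbers are assumptions for illustration.

```python
from statistics import NormalDist


def proportion_below(x, mu, sigma):
    """Percentile of the cutoff x under N(mu, sigma), as a proportion."""
    z = (x - mu) / sigma       # step 3: Z-score of the cutoff value
    return NormalDist().cdf(z)  # step 4: look up the N(0,1) percentile


def proportion_above(x, mu, sigma):
    """Step 5: subtract from 1 when the shaded region is to the right."""
    return 1 - proportion_below(x, mu, sigma)


print(round(proportion_below(130, mu=100, sigma=15), 4))  # 0.9772
print(round(proportion_above(130, mu=100, sigma=15), 4))  # 0.0228
```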

Answer 140

A

Z-score is a statistical measurement that describes a value’s relationship to the mean of a group of values. Z-score is measured in terms of standard deviations from the mean.

Answer 141

A

Nominal categorical variable - have no implied order
Ordinal categorical variable - have a natural ordering

Answer 142

A

Statistical tests
Visualisation techniques
Ideal to use both

Answer 143

A

Shapiro-Wilk test
Kolmogorov-Smirnov test
Anderson-Darling test, etc.

Answer 144

A

Statistical tests are very sensitive to the presence of outliers. If a certain number of outliers are present in a normally distributed data set, statistical tests may report that the data set is not drawn from a normal distribution. Visualisation techniques may help overcome this problem.

Answer 145

A

Histograms with the best fitting normal curve overlaid on the plot
The normal probability plot (quantile-quantile plot or QQ plot)

Answer 146

A

The normal probability plot.

Answer 147

A

This is used to visualise normality assessment. The sample mean and SD are used as the parameters for the best fitting normal curve. The closer the curve is to the histogram, the more reasonable the normal model assumption is.

Answer 148

A

This is used to visualise normality assessment. Data are plotted on the y-axis of the plot and theoretical quantiles (following normal distribution) are plotted on the x-axis. The closer the points are to a perfect straight line, the more confident we can be that the data follow the normal model.

Answer 149

A

A smaller sample size will show more variability around the curve. A larger sample size increases the confidence.

Answer 150

A

A curve closer to the histogram means it is more reasonable to assume the data is normally distributed.

Answer 151

A

Points bend up and to the left of the line.

Answer 152

A

Points bend down and to the right of the line.

Answer 153

A

Perform further analysis, eg different visualisations or investigating if and why there are outliers.

Answer 154

A

Short tails (narrower than the normal distribution) - points follow an S-shaped curve.

Answer 155

A

Long tails (wider than the normal distribution) - points start below the line, bend to follow it, and end above it.

Answer 156

A

To draw conclusions about and assess population parameters for a specific population based on a sample of data taken from that population.

Answer 157

A

Sample statistics (mean, proportions etc) are used as point estimates for the unknown population parameters of interest, as it is difficult (or impossible) to collect data from the complete population.

Answer 158

A

In statistics, a point estimate is a single value that is calculated from sample data to estimate an unknown population parameter. It is a “best guess” or “best estimate” of the population parameter.

They generally vary from one sample to another and this sampling variation suggests our estimates may be close, but not exactly the true population parameter.

Answer 159

A

This sampling variation suggests that the estimate is not exactly equal to the true population parameter.

Answer 160

A

The distribution of point estimates based on samples of a fixed size from a certain population.

Answer 161

A

The central “balance” point of a sampling distribution is its mean.

The standard deviation of a sampling distribution is referred to as a standard error.

Answer 162

A

The standard deviation of a sampling distribution. Reflects the fact that probabilities are no longer tied to raw measurements/observations, but rather to a quantity calculated from a sample of such observations.

The standard error of an estimate describes how far a typical point estimate is from the true population parameter, eg how far the typical sample mean is from the actual population mean.

Answer 163

A

The standard deviation measures the variability of individual data points inside the sample

The standard error measures how far the point estimate is from the population parameter.

Answer 164

A

If a sample consists of at least 30 independent observations and the data are not strongly skewed, then the distribution of the sample mean is approximated well by the normal distribution.

The central limit theorem says that the sampling distribution of the mean will be approximately normally distributed, as long as the sample size is large enough. Regardless of whether the population has a normal, Poisson, binomial, or any other distribution, the sampling distribution of the mean will be approximately normal.
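
The theorem can be illustrated with a small simulation. The set-up is an assumption chosen for the sketch: a strongly right-skewed exponential population with mean 1 and standard deviation 1, samples of size 50, and a fixed random seed so the run is repeatable.

```python
import random
from statistics import mean, stdev

random.seed(0)  # fixed seed so the sketch is repeatable


def sample_means(n, repeats):
    """Means of `repeats` samples of size n from an exponential population."""
    return [mean([random.expovariate(1.0) for _ in range(n)])
            for _ in range(repeats)]


means = sample_means(n=50, repeats=2000)
print(mean(means))   # close to the population mean, 1
print(stdev(means))  # close to 1 / sqrt(50), about 0.14
```

Even though individual observations are skewed, the sample means cluster symmetrically around the population mean, with a spread close to the population SD divided by the square root of n.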

Answer 165

A

[See flashcard]

Answer 166

A

Independence - sample observations must be independent.

Sample size/skew - either the population distribution is normal, or if the population distribution is skewed, the sample size is large.

Answer 167

A

Random sampling / assignment is used
If sampling without replacement, n is less than 10% of the population

Answer 168

A

The more skewed the population distribution, the larger the sample size we need in order to apply the CLT.

For moderately skewed distributions, n > 30 is a widely used rule of thumb.

Answer 169

A

We can check it using the sample data and assume that the sample mirrors the population.

Answer 170

A

Independence - sampled observations must be independent.
Sample size / skew - at least 10 success and 10 failure observations. eg for the marathon example, at least 10 who ran < 2 hours and 10 who ran > 2 hours

Answer 171

A

It is very likely that we will not capture the exact population parameter. Instead, if we report a range of the plausible values, we have a good chance to capture a true population parameter.

A plausible range of values for the population parameter is called a confidence interval.

Answer 172

A

A plausible range of values for the population parameter.

They may be constructed in different ways, depending on the type of statistic and therefore shape of the corresponding sample distribution.

Answer 173

A

[See flashcard]

Answer 174

A

Z* is the critical value and can have a different value depending on the confidence level.

Answer 175

A

The margin of error. For a given sample the margin of error changes as the confidence level changes.

Answer 176

A

Adjust Z* in the formula

Answer 177

A

95% confidence interval, Z* = 1.96
99% confidence interval, Z* = 2.58

Answer 178

A

Use the normal Z-table.
eg to find the Z* needed to be 96% confident
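
Instead of a printed Z-table, the critical value can be read from the inverse of the standard normal CDF. The interval function assumes the common form "point estimate plus or minus Z* times the standard error", since the formula card itself is not reproduced here; both function names are illustrative.

```python
from statistics import NormalDist


def critical_value(confidence):
    """Z* that leaves (1 - confidence) / 2 in each tail of N(0,1)."""
    tail = (1 - confidence) / 2
    return NormalDist().inv_cdf(1 - tail)


def confidence_interval(point_estimate, standard_error, confidence=0.95):
    """Assumed form: point estimate +/- Z* x standard error."""
    margin = critical_value(confidence) * standard_error  # margin of error
    return point_estimate - margin, point_estimate + margin


for level in (0.95, 0.96, 0.99):
    print(level, round(critical_value(level), 2))  # 1.96, 2.05, 2.58
```

A higher confidence level gives a larger Z*, a larger margin of error and therefore a wider interval.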

Answer 179

A

The confidence interval needs to increase ie become wider. This will increase our confidence level.

Too wide an interval may not be very informative.

Answer 180

A

It may not be very informative eg weather example.

Answer 181

A

We are XY% (eg 95%) confident that the true population parameter is between the lower bound (l) and upper bound (u) of our confidence interval.

Answer 182

A

Confidence intervals try to capture the population parameter - they say nothing about the confidence of capturing individual observations, a proportion of observations or about capturing point estimates.

Answer 183

A

1 - formulation of the practical problem in terms of statistical hypotheses
2 - construction of a test statistic
3 - description of a critical region and/or the calculation of the p-value
4 - significance level or size of the test
5 - further assessment

Answer 184

A

The null hypothesis H0 represents what we currently hold as true. H0 is basically a standard with which the evidence for HA can be compared.

One-sample: there is no difference from our previous knowledge (maintenance of status quo)

Two-sample: there is no difference between the populations being compared.

Answer 185

A

HA represents what we want to test.

It expresses the range of situations that we wish the test to be able to diagnose. Depending upon the outcome of the test we may take action.

Answer 186

A

Language - is there enough evidence to reject the null hypothesis (we never accept it).

“H0 is rejected in favour of HA”
“There is insufficient evidence to reject H0 in favour of HA”

Answer 187

A

The test statistic is a number calculated from a statistical test of a hypothesis. It shows how closely your observed data match the distribution expected under the null hypothesis of that statistical test.

The test statistic is used to calculate the p value of your results, helping to decide whether to reject your null hypothesis.

It is a function of the data plus the information in the hypothesis H0.

Answer 188

A

1 - its probability distribution must be calculable (at least approximately) under the assumption that H0 is true
2 - it should behave differently when H0 is true from when HA is true

Answer 189

A

A region of values of the test statistic t which support our preference for HA rather than H0

Answer 190

A

We reject H0 in favour of HA

Otherwise, we are unable to reject H0 in favour of HA

Answer 191

A

We are unable to reject H0 in favour of HA.

Answer 192

A

Because a lack of information, particularly too little data, tends to result in non-critical values of the test statistic.

Hence, it is unwise to talk positively about “accepting H0”.

Lack of strong evidence to reject H0 in favour of HA may indicate that we have not collected enough data to reject it.

Answer 193

A

Lack of information, particularly too little data, tends to result in non-critical values of the test statistic. Lack of strong evidence to reject H0 in favour of HA may indicate that we have not collected enough data to reject it.

Answer 194

A

A p-value, or probability value, is the probability of obtaining data at least as extreme as the observed data, assuming the null hypothesis of the statistical test is true.

The p-value quantifies the strength of the evidence against the null hypothesis H0 and in favour of the alternative hypothesis HA.

Answer 195

A

H0 is true and an improbable event has occurred
HA is true

Answer 196

A

If the p-value is small, H0 is rejected in favour of HA
If the p-value is not “small”, the evidence does not support the rejection of H0 in favour of HA.

Answer 197

A

Calculate the p-value
Investigate the test statistic t and the critical region

Answer 198

A

Type 1 - false positive. H0 is rejected when in fact it is true.

Type 2 - false negative. H0 is not rejected when in fact it is false.

Answer 199

A

Would choose a smaller significance level - we would rather have 1 in 100 errors than 5 in 100 errors.

Answer 200

A

The significance level of an event (such as a statistical test) is the probability that the event could have occurred by chance.

It is the probability of rejecting H0 when in fact it is true, ie committing a Type 1 error.

Answer 201

A

Depends on the particular problem and how serious it is if a true H0 is rejected (false positive), eg medical trials

Answer 202

A

We accept that, out of every 100 tests in which H0 is actually true, about 5 will incorrectly reject it. That is, there is a 5% chance of rejecting a true H0.

Answer 203

A

P <= 5% (p <= 0.05) – the test is significant at 5% level and H0 is rejected in favour of HA
P > 5% (p > 0.05) – the test is not significant at the 5% level and H0 is not rejected in favour of HA

Answer 204

A

P > 10% - there is no (or very little) evidence for rejecting H0 in favour of HA
5% < P <= 10% - on the available evidence, we cannot reject H0 in favour of HA but we have some suspicion (ie we would like to obtain more evidence)
Eg you didn’t reject the null due to a small dataset
1% < p <= 5% - significant at 5% level and H0 is rejected in favour of HA. If the decision to change is important, we should probably seek further evidence
0.1% < p <= 1% - highly significant (significant at the 1% level). There is considerable evidence for rejection of H0 in favour of HA
P <= 0.1% - very highly significant (significant at the 0.1% level). We are very confident that HA is to be preferred to H0
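
The scale above can be sketched as a small lookup. This is only an illustration: the function name and the exact wording of the labels are assumptions, not part of the flashcards.

```python
def evidence_against_h0(p):
    """Map a p-value to the verbal evidence scale used on this card."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("a p-value must lie between 0 and 1")
    if p <= 0.001:
        return "very highly significant"
    if p <= 0.01:
        return "highly significant"
    if p <= 0.05:
        return "significant at the 5% level"
    if p <= 0.10:
        return "not significant, but some suspicion"
    return "no (or very little) evidence against H0"


for p in (0.0004, 0.02, 0.07, 0.5):
    print(p, "->", evidence_against_h0(p))
```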

Answer 205

A

[See flashcard]

Answer 206

A

[See flashcard]

Answer 207

A

[See flashcard]

Answer 208

A

[See flashcard]

Answer 209

A

The t-distribution, also known as the Student’s t-distribution, is a probability distribution. It is similar to the normal distribution, with its bell shape, but it has heavier (fatter) tails, so extreme values are more likely than under a normal distribution. It is used for estimating population parameters when sample sizes are small or the population variance is unknown.

Answer 210

A

They are both bell-shaped curves centred at 0. The t-distribution has fatter tails, meaning observations are more likely to fall further away from the mean (over 2 SDs from the mean).

The thicker tails are helpful for resolving our problem with a less reliable estimate of the standard error (since n is small).

Answer 211

A

When the population SD is unknown and we have a small data sample (n<30) we address the uncertainty of the standard error using the t distribution.

Answer 212

A

It is centred at zero and influenced by one parameter, the degrees of freedom (df). The larger the degrees of freedom, the more closely the t-distribution resembles the standard normal model. When df >= 30, it is nearly indistinguishable from the normal distribution.

Answer 213

A

Degrees of freedom are the maximum number of logically independent values that are free to vary in a data sample. For a single sample, they are calculated by subtracting one from the number of items in the sample (df = n - 1).

Answer 214

A

n < 30 - for n >= 30, the t-distribution and the normal distribution are nearly indistinguishable

Answer 215

A

A t table is a reference statistical table that contains critical values of the t distribution, also known as the t score or t value.

Each row represents a t-distribution with different degrees of freedom. The columns correspond to tail probabilities.

Answer 216

A

[See flashcard]

Answer 217

A

The Paired Samples t Test compares the means of two measurements taken from the same individual, object, or related units.

Each subject has two observations.
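
A minimal sketch of the paired calculation, assuming the usual approach of running a one-sample t test on the per-subject differences. The function name and the example numbers are made up for illustration.

```python
import math
import statistics


def paired_t_statistic(first, second):
    """Return (t, df) for paired measurements on the same subjects."""
    if len(first) != len(second):
        raise ValueError("each subject needs exactly two observations")
    diffs = [b - a for a, b in zip(first, second)]  # one difference per subject
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    se = statistics.stdev(diffs) / math.sqrt(n)  # standard error of the mean difference
    return mean_diff / se, n - 1


before = [10, 12, 9, 11]
after = [12, 14, 10, 14]
t, df = paired_t_statistic(before, after)
print(round(t, 3), df)
```

The resulting t is compared with the t-distribution on n - 1 degrees of freedom, where n is the number of pairs.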

Answer 218

A

[See flashcard]

Answer 219

A

[See flashcard]

Answer 220

A

Use the pooled variance in the calculations
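
A sketch of the pooled variance, assuming this card refers to a two-sample comparison where both groups are taken to share a common variance. The function name and data are illustrative only.

```python
import statistics


def pooled_variance(sample1, sample2):
    """Weighted average of the two sample variances, weighted by degrees of freedom."""
    n1, n2 = len(sample1), len(sample2)
    s1_sq = statistics.variance(sample1)  # sample variance (divides by n - 1)
    s2_sq = statistics.variance(sample2)
    return ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)


print(pooled_variance([1, 2, 3], [2, 4, 6, 8]))
```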

Answer 221

A

[See flashcard]

Answer 222

A

Goodness-of-fit test for classified data - The distribution of a categorical variable in a sample often needs to be compared with the distribution of a categorical variable in another sample.

A chi-squared test is a statistical hypothesis test used in the analysis of contingency tables when the sample sizes are large. In simpler terms, this test is primarily used to examine whether two categorical variables are independent of each other.

Answer 223

A

In chi squared tests we don’t assume normal distribution.

Answer 224

A

Each observation is classified into k mutually exclusive and exhaustive classes ie each observation belongs to one and only one class.

Answer 225

A

The critical region lies in the right hand tail only.

This is because, if H0 is not true, we would expect the Eis to be quite different from the Ois, resulting in a larger than expected chi squared value.

A small chi squared value results when the Eis and Ois are in good agreement - we wouldn’t want to reject H0 in this case.
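
To see why only large values count against H0, the test statistic can be sketched as the sum of (Oi - Ei)^2 / Ei over the classes. The function name and the counts below are illustrative assumptions.

```python
def chi_squared_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all classes."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same number of classes")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


good_fit = chi_squared_statistic([25, 25, 25, 25], [25, 25, 25, 25])
poor_fit = chi_squared_statistic([18, 22, 20, 40], [25, 25, 25, 25])
print(good_fit, poor_fit)
```

Perfect agreement gives a statistic of zero, and every disagreement can only push it upwards, so the critical region sits in the right-hand tail.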

Answer 226

A

The exact distribution of the chi squared test statistic is discrete and is approximated by the continuous chi squared distribution.

- For this approximation to be reasonable, Ei should be > 5 for each class
- If not, combine adjacent classes with resultant loss of one or more degrees of freedom
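
One possible way to combine adjacent classes is a simple left-to-right merge, sketched below. The greedy strategy, the function name and the counts are assumptions for illustration; in practice the merge is usually chosen by judgement.

```python
def combine_small_classes(observed, expected, minimum=5.0):
    """Merge adjacent classes until every merged class has expected count > minimum."""
    merged_obs, merged_exp = [], []
    acc_o, acc_e = 0.0, 0.0
    for o, e in zip(observed, expected):
        acc_o += o
        acc_e += e
        if acc_e > minimum:  # class is now large enough, so close it
            merged_obs.append(acc_o)
            merged_exp.append(acc_e)
            acc_o, acc_e = 0.0, 0.0
    if acc_e > 0:  # leftover small class at the end
        if merged_exp:
            merged_obs[-1] += acc_o  # fold it into the previous class
            merged_exp[-1] += acc_e
        else:
            merged_obs.append(acc_o)
            merged_exp.append(acc_e)
    return merged_obs, merged_exp


obs, exp = combine_small_classes([3, 7, 12, 6, 2], [2.0, 8.0, 11.0, 6.5, 2.5])
print(obs, exp)
```

Here five classes become three, so two degrees of freedom are lost.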

Answer 227

A

Ei should be > 5 for each class.

If not, combine adjacent classes with the resultant loss of one or more degrees of freedom.

Answer 228

A

Combine adjacent classes with the resultant loss of one or more degrees of freedom.

Answer 229

A

[See flashcard]

Answer 230

A

Yates’ continuity correction - take the magnitude (absolute value) of each difference Oi - Ei and subtract 1/2 before squaring
[See flashcard]
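
A sketch of the corrected statistic, assuming the standard form of Yates' correction, sum of (|Oi - Ei| - 1/2)^2 / Ei, which is usually applied when there is one degree of freedom (eg a 2x2 table). The function name and counts are illustrative.

```python
def yates_chi_squared(observed, expected):
    """Chi-squared statistic with Yates' continuity correction."""
    return sum((abs(o - e) - 0.5) ** 2 / e for o, e in zip(observed, expected))


corrected = yates_chi_squared([12, 8], [10, 10])
print(corrected)
```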

Answer 231

A

The chi squared distribution has just one parameter, the degrees of freedom (df), which influences the shape, centre and spread of the distribution.

Answer 232

A

Higher degrees of freedom – the distribution shifts to the right and becomes flatter

Answer 233

A

One important difference from the t-table is that the chi-square table only provides upper tail values

Answer 234

A

ANOVA, or Analysis of Variance, is a test used to determine differences between results from three or more unrelated samples or groups.

ANOVA is used to assess whether the mean of the outcome variable is different for different levels of a categorical variable.

Answer 235

A

2 groups: a Z or T statistic
3 or more groups: Analysis of Variance (ANOVA), which uses a new statistic called F

Answer 236

A

1 - The observations should be independent within and between groups.
If the data are a simple random sample from less than 10% of the population, the condition is satisfied. Eg no pairing
2 - The observations within each group should be nearly normal (important when sample sizes are small)
3 - The variability across the groups should be about equal (especially important when the sample sizes differ between groups).

Answer 237

A

F statistic

Answer 238

A

Compare to see whether they are so far apart that the observed difference cannot reasonably be attributed to sampling variability.

Answer 239

A

They are equivalent, but only if we use a pooled variance in the denominator of the test statistic.

Answer 240

A

An overall grand mean

Answer 241

A

F = variability between sample groups / variability within sample groups

Answer 242

A

A large F statistic is needed for the p-value to be small enough to reject H0.

A large F statistic means the variability between sample groups is greater than the variability within sample groups.

Answer 243

A

Group: dfG = k - 1
Total: dfT = n - 1
Error: dfE = dfT - dfG = n - k

ie the error degrees of freedom are the difference between the total and the group degrees of freedom

Answer 244

A

SSG - sum of squares between groups, measures the variability between the groups [see flashcard]

SST - sum squares total, measures the total variability in the dataset [see flashcard]

SSE - sum squares error, measures variability within groups SSE = SST - SSG

Answer 245

A

From F-tables, find the F* value as the value from the column dfg and the row dfe. If F > F*, it is in the critical region therefore it is significant and at least one mean is different (different for at least one group).

The p-value can also be computed. The larger the F value, the smaller the p-value, so F > F* corresponds to p < alpha (eg 0.05).

Answer 246

A

The mean square error.

Calculated for the group and error row as Sum of squares / degrees of freedom
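
The pieces of the ANOVA table can be sketched together as follows. This assumes the usual one-way layout (k groups, n observations in total); the function name, dictionary keys and data are made up for illustration.

```python
import statistics


def one_way_anova(groups):
    """Return sums of squares, degrees of freedom, mean squares and F."""
    k = len(groups)
    all_values = [x for g in groups for x in g]
    n = len(all_values)
    grand_mean = statistics.mean(all_values)
    # Between-group variability: group means around the grand mean
    ssg = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    # Total variability: every observation around the grand mean
    sst = sum((x - grand_mean) ** 2 for x in all_values)
    sse = sst - ssg  # within-group variability
    df_g, df_t = k - 1, n - 1
    df_e = df_t - df_g
    msg, mse = ssg / df_g, sse / df_e  # mean square = sum of squares / df
    return {"SSG": ssg, "SSE": sse, "SST": sst,
            "dfG": df_g, "dfE": df_e, "dfT": df_t,
            "MSG": msg, "MSE": mse, "F": msg / mse}


table = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(table)
```

The F value is then compared with F* from the F-table using dfG and dfE.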

Answer 247

A

Use common variance (MSE from the ANOVA table) instead of each group’s variances in the calculation of the SE.

Use common degrees of freedom (dfE from the ANOVA table).

Use a modified significance level. This controls the increase in the type I error rate (false positives) that occurs when we run many tests.

Answer 248

A

Multiple comparisons

Answer 249

A

The Bonferroni correction, which is a more stringent significance level.

alpha* = alpha / K

K - number of comparisons being considered

K = k(k-1) / 2
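
A short worked sketch of the correction, where k is the number of groups. The function name is an assumption.

```python
def bonferroni_alpha(alpha, k):
    """Return (alpha*, K) for all pairwise comparisons among k groups."""
    if k < 2:
        raise ValueError("need at least two groups to compare")
    K = k * (k - 1) // 2  # number of pairwise comparisons
    return alpha / K, K


alpha_star, K = bonferroni_alpha(0.05, 4)
print(K, alpha_star)
```

With 4 groups there are 6 pairwise comparisons, so each one is tested at 0.05 / 6 rather than 0.05.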

Answer 250

A

alpha* = alpha / K

K - number of comparisons being considered

K = k(k-1) / 2

Answer 251

A

[see flashcard]

Answer 252

A

Linear regression is a statistical technique that can be used for prediction and evaluating whether there is a linear relationship between two numerical variables x and y.

Linear regression assumes that the relationship between two variables can be modelled by a straight line

Answer 253

A

y = B0 + B1x

x - predictor variable (explanatory variable, independent variable)
y - response variable (dependent variable)
B0 - intercept (expected value of the response variable when the predictor is 0)
B1 - slope parameter (the change in the mean response for each one-unit increase in the predictor)

Answer 254

A

The predictor x has no effect on the value of the response y

Answer 255

A

Using data - these are point estimates b0 and b1

Answer 256

A

The hat in y_hat indicates estimated (predicted) values of the response variable y, computed from the observed values of the predictor x

Answer 257

A

y_hat = b0 + b1x

Answer 258

A

Residuals (epsilon)

n is the same, the same number of points

Answer 259

A

The differences between the observed and estimated values.

Answer 260

A

The difference of the observed response (yi) and the response we would predict based on the model fit (y_hati)

Ei = yi - y_hati
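
A small sketch of the residual calculation. The coefficients b0 and b1 and the data are hypothetical values chosen for illustration, not a fitted model from the flashcards.

```python
def residuals(x, y, b0, b1):
    """Observed response minus the response predicted by y_hat = b0 + b1 * x."""
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


res = residuals([1, 2, 3], [2, 4, 7], b0=0, b1=2)
print(res)
```

A positive residual means the point lies above the line; a negative one means it lies below.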

Answer 261

A

The residuals are pretty small.

The best fitting regression line is the one with the smallest possible residuals. A poorly fitting regression line has large residuals.

Answer 262

A

Ordinary least squares regression (OLS)

Answer 263

A

OLS stands for ordinary least squares regression.

Goal is to find the line that minimises the least square criterion ie minimises the sum of the squared residuals [see flashcard]

The line that minimises this least squares criterion is usually called the least squares line
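
A minimal sketch of finding the least squares line, assuming the standard closed-form solution b1 = Sxy / Sxx and b0 = y_bar - b1 * x_bar. The function name and data are illustrative.

```python
import statistics


def least_squares_line(x, y):
    """Return (b0, b1) minimising the sum of squared residuals."""
    x_bar, y_bar = statistics.mean(x), statistics.mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx            # slope estimate
    b0 = y_bar - b1 * x_bar   # intercept estimate
    return b0, b1


b0, b1 = least_squares_line([1, 2, 3, 4], [3, 5, 7, 9])
print(b0, b1)
```

These points lie exactly on a line, so the fit recovers intercept 1 and slope 2 with zero residuals.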

Answer 264

A

The least squares line

Answer 265

A

[see flashcard]

Answer 266

A

The data should show a linear trend. If there is a nonlinear trend, an advanced regression method should be applied.

Answer 267

A

Linearity
Nearly normal residuals
Constant variability

Answer 268

A

We can use input values of x to get predicted values y_hat

With a fitted simple linear model, you’re able to calculate a point estimate y_hati of the mean response value yi

Answer 269

A

Generally, the residuals must be nearly normal.
When this condition is found to be unreasonable, it is usually because of outliers or concerns about influential points
In a residual plot, well-behaved residuals are scattered around 0 with roughly uniform variance. Normality itself is checked with a histogram or normal probability plot of the residuals.

Answer 270

A

The variability of the points around the least squares line remains roughly constant

Answer 271

A

We want to determine how good our model is.

One approach is using the coefficient of determination R^2.

R^2 describes the proportion of the variation in the response that can be attributed to the predictor ie is explained by the least squares line.

Formula [ see flashcard ]

If we can calculate how much of the variance in the response is left in the residuals, we can calculate how much is explained by the predictor
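
A sketch of that idea, assuming the common form R^2 = 1 - SSE / SST. The function name and the fitted values passed in are illustrative assumptions.

```python
import statistics


def r_squared(y, y_hat):
    """Proportion of the variation in y that is not left in the residuals."""
    y_bar = statistics.mean(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # residual variation
    sst = sum((yi - y_bar) ** 2 for yi in y)               # total variation
    return 1 - sse / sst


r2 = r_squared([2, 4, 7], [2, 4, 6])
print(round(r2, 4))
```

A value near 1 means little variation is left in the residuals; a value near 0 means the line explains almost nothing.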

Answer 272

A

We want to determine how good our model is.

One approach is using the coefficient of determination R^2.

R^2 describes the proportion of the variation in the response that can be attributed to the predictor ie is explained by the least squares line.

Answer 273

A

Descriptive analysis - this helps to understand how the data is distributed and provides important information for further steps.

Answer 274

A

By position or by name

Answer 275

A

Turning raw data into understanding, insight and knowledge

Answer 276

A

A quantity, quality or property that you can measure.

(values may vary from measurement to measurement)

Answer 277

A

Table column
Field
Attribute
Property
Feature
Vector
Dimension

Answer 278

A

Numeric
Categorical

Answer 279

A

Variables whose values are recorded as numbers (integer or real values)

Answer 280

A

Variables whose values are recorded as symbols.

Eg - gender
Eg - countries

Answer 281

A

Discrete - numeric variables that may only take on certain (distinct) numeric values. Usually obtained by counting eg people in a class. Synonyms: integer, count.

Continuous - numeric variables that may take any real value in some interval. Synonyms: float, double, interval, numeric

Answer 282

A

Discrete - numeric variables that may only take on certain (distinct) numeric values. Usually obtained by counting eg people in a class. Synonyms: integer, count.

Answer 283

A

Continuous - numeric variables that may take any real value in some interval. Synonyms: float, double, interval, numeric

Answer 284

A

Ordinal - categorical variables whose values can be naturally ranked (eg education levels, driving speed categories).

Nominal - categorical variables whose values cannot be naturally ranked (eg eye colour, gender)

Answer 285

A

Ordinal - categorical variables whose values can be naturally ranked (eg education levels, driving speed categories).

Answer 286

A

Nominal - categorical variables whose values cannot be naturally ranked (eg eye colour, gender)

Answer 287

A

How we store collections of variables

Answer 288

A

Univariate dataset – a dataset consisting of measurements that correspond to a single variable

Multivariate dataset – a dataset consisting of measurements that correspond to two or more variables. Most relevant when individual components aren’t as useful when considered on their own, eg spatial coordinates. Allows us to think about two or more variables together

Corresponding data analysis

Univariate data analysis – the analysis performed on a single variable
Multivariate data analysis – the simultaneous analysis of two or more variables

Answer 289

A

Measurements made under similar conditions

Answer 290

A

A set of values, each associated with a variable and an observation.

Variables are table columns.
Observations are table rows.

Answer 291

A

Tabular data - a set of values, each associated with a variable and an observation.

Tabular data is tidy if each value is placed in its own “cell” - each variable in its own column, each observation in its own row.

Answer 292

A

Defined by the number of observations (rows) in the table

Answer 293

A

Defined by the number of variables (columns) in the table

Answer 294

A

Size - observations (row)

Dimensionality - variables (columns)

Answer 295

A

The (usually) large pool of observational units that we are interested in.

Answer 296

A

A smaller collection of observational units selected from the population.

Answer 297

A

Sampling refers to the process of selecting observations from a population.

Simple random sampling
Stratified sampling
Cluster sampling
Multistage sampling

Answer 298

A

Simple random sampling
Stratified sampling
Cluster sampling
Multistage sampling

Answer 299

A

It doesn’t make sense to collect data for the whole population - it is probably impossible to collect and calculate the actual population mean so we need a sample.

Answer 300

A

A sample is said to be a representative sample if the characteristics of the observational units selected are a good approximation of the characteristics of the original population.

Meal analogy.

Answer 301

A

Bias corresponds to a favouring of one group in a population over another group

Answer 302

A

Generalisability refers to the largest group about which it makes sense to make inferences from the sample collected.

This is directly related to how the sample was selected.

Answer 303

A

Parameters and statistics are calculations based on the population and sample respectively.

Population - parameter - Greek letters
Sample - statistic - Latin (Roman) letters
The difference is reflected in the notation used

Answer 304

A

A calculation based on one or more variables measured in the population.

Denoted by Greek letters.

Answer 305

A

A calculation based on one or more variables measured in the sample.

Denoted by lower case Latin (Roman) letters (sometimes in combination with other symbols)

Answer 306

A

A sampling strategy where the individuals are selected from the list of units in the population, by means of some random process, in such a way that each individual has equal chance to be selected.

Eg random number tables or pseudo-random number generators.

Selection can be performed sequentially (one at a time without replacement, so that at each stage, remaining individuals in the population have the same probability of being selected).
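
A minimal sketch using a pseudo-random number generator. `random.sample` draws without replacement, so every unit has the same chance of selection; the population of 30 numbered students and the seed are made up for illustration.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Draw n distinct units, each equally likely to be chosen."""
    rng = random.Random(seed)  # a seed makes the draw repeatable
    return rng.sample(list(population), n)


students = list(range(1, 31))
sample = simple_random_sample(students, 5, seed=42)
print(sample)
```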

Answer 307

A

In simple random sampling, selection can be performed sequentially. Individuals can be selected from the population one at a time without replacement, so that at each stage, remaining individuals in the population have the same probability of being selected.

Answer 308

A

There is usually an assumption that all observations are independent of each other - replacing them would lose this.

Answer 309

A

Stratified sampling is a divide-and-conquer sampling strategy. The population is divided into groups called strata. The sample of individuals is then drawn from each stratum using some other random sampling process, usually simple random sampling.

Strata are chosen so that units in each stratum are as alike as possible and units in different strata are as different as possible.

This sampling strategy is used in cases when it is known that the population is heterogeneous with respect to one or more variables which may have a bearing on the factor being studied.

Eg if there was a difference in height by gender, you know to take it into consideration.

This ensures things are well represented.
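
A sketch of the strategy, assuming a simple random sample of equal size is taken from every stratum. The function name, the `stratum_of` helper and the data are hypothetical.

```python
import random


def stratified_sample(units, stratum_of, per_stratum, seed=None):
    """Group units into strata, then draw a simple random sample from each."""
    rng = random.Random(seed)
    strata = {}
    for unit in units:
        strata.setdefault(stratum_of(unit), []).append(unit)
    sample = []
    for members in strata.values():
        # Never ask for more units than the stratum contains
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample


units = [("female", i) for i in range(20)] + [("male", i) for i in range(10)]
chosen = stratified_sample(units, lambda u: u[0], per_stratum=3, seed=1)
print(chosen)
```

Both strata appear in the sample even though one is half the size of the other.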

Answer 310

A

Stratified sampling

Answer 311

A

Strata are chosen so that units in each stratum are as alike as possible and units in different strata are as different as possible.

Answer 312

A

1 - to increase the accuracy and precision of the overall population estimates.

2 - to ensure that domains of study are adequately represented.

Answer 313

A

A sampling strategy where the population is divided into many groups, called clusters, and then we sample a fixed number of clusters and include all observations from each of those clusters in the sample.

[Clusters are separated based on convenience, not a measure of interest ie the measure of interest is not why you’re in that cluster]

Eg divide the class into tables and pick a sample of two tables.
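
The class-and-tables example can be sketched as follows. The table names, student names and function name are invented for illustration.

```python
import random


def cluster_sample(clusters, n_clusters, seed=None):
    """Pick whole clusters at random and keep every unit in them."""
    rng = random.Random(seed)
    picked = rng.sample(sorted(clusters), n_clusters)
    units = [u for name in picked for u in clusters[name]]
    return picked, units


tables = {
    "table1": ["Ann", "Bob", "Cy"],
    "table2": ["Dee", "Eli"],
    "table3": ["Fay", "Gus", "Hal"],
    "table4": ["Ivy", "Jon"],
}
picked, units = cluster_sample(tables, 2, seed=0)
print(picked, units)
```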

Answer 314

A

A sampling strategy where the population is divided into many groups, called clusters. We sample a fixed number of clusters and then collect a random sample within each selected cluster.

Similar to cluster sampling (but rather than keeping all observations in each cluster, we collect a random sample within each selected cluster)
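
A sketch of the two stages: first sample clusters, then sample units inside each selected cluster. All names and data are illustrative assumptions.

```python
import random


def multistage_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Stage 1: pick clusters. Stage 2: simple random sample within each."""
    rng = random.Random(seed)
    selected = rng.sample(sorted(clusters), n_clusters)
    result = {}
    for name in selected:
        members = clusters[name]
        result[name] = rng.sample(members, min(n_per_cluster, len(members)))
    return result


neighbourhoods = {
    "north": list(range(0, 10)),
    "south": list(range(10, 20)),
    "east": list(range(20, 30)),
    "west": list(range(30, 40)),
}
staged = multistage_sample(neighbourhoods, n_clusters=2, n_per_cluster=3, seed=7)
print(staged)
```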

Answer 315

A

Sometimes it can be more economical than the alternative sampling techniques.

They are most helpful when there is a lot of case-to-case variability within the cluster, but the clusters themselves don’t look very different from one another

eg neighbourhoods as clusters

Answer 316

A

More advanced analysis techniques are typically required.

Answer 317

A

The situation, time and money.

Simple random sampling may be the best to get representation but it can be expensive.

Multistage sampling can reduce the costs without reducing reliability.

Answer 318

A

Collect data, process it and clean it.

EXDA and use of machine learning, algorithms and statistical models

Communicate, visualisations and report findings. [Which leads to making decisions]

Build data product.

Data analysis is a cyclical process - once you build the data product, more data becomes available.

Answer 319

A

A creative process of exploring data sets for patterns and relationships.

Starting with lots of visualisations and summaries is a good idea.

Answer 320

A

1 - Develop an understanding about data by formulating questions
2 - Search for answers using visualisation techniques and summary statistics
3 - use answers obtained to refine questions and/or generate new questions

Answer 321

A

Using visualisations and summary techniques

Visualise distributions of all variables (using box plots and histograms)
Visualise time series of data
Investigate all pairwise relationships between variables using scatterplots
Perform data cleaning and variable transformation
Perform summary statistics (mean, median, lower and upper quartiles, minimum and maximum values, identify missing data, errors and outliers)
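
A short sketch of the summary-statistics step using only the standard library. The data are made up, and the quartiles use the default method of `statistics.quantiles`, which may differ slightly from other quartile conventions.

```python
import statistics


def five_number_summary(values):
    """Return minimum, lower quartile, median, upper quartile and maximum."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # three quartile cut points
    return {
        "min": min(values),
        "q1": q1,
        "median": statistics.median(values),
        "q3": q3,
        "max": max(values),
    }


data = [1, 2, 3, 4, 5, 6, 7, 8, 100]
summary = five_number_summary(data)
print(summary, "mean:", statistics.mean(data))
```

The single large value pulls the mean well above the median, which is one quick way to spot a possible outlier.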

Answer 322

A

Start simple, it is difficult to ask revealing questions at the start of analysis as you do not know what insights are hidden in your dataset.

There are no universal rules of questions to ask to guide research.

Useful starting points
- What type of variation occurs within my variables?
- What is the relationship between variables?

Answer 323

A

Statistics used to quantitatively describe a collection of measurements by summarising them in the form of a single value

Answer 324

A

Summary statistics:
- Measures of centrality (mean, mode, median) ie the most typical values
- Measures of variability (variance, standard deviation, range, quantiles, five number summary) ie the spread of the data

Visualisation techniques:
- Histograms
- Boxplots

Answer 325

A

Summary statistics and visualisation techniques

Numeric:
- Measures of centrality
- Measures of variability
- Histograms and box plots

Categorical:

Answer 326

A

Summary statistics
- Counts
- Percentages
- Proportions

Visualisation techniques
- Bar charts

Answer 327

A

Summary statistics
- Covariance and correlation (N-N)
- Contingency tables (C-C)

Visualisation techniques
- Scatterplots (N-N)
- Paired boxplots (N-C)
- Paired histograms (N-C)
- Mosaic plots (C-C)
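
For the numeric-numeric (N-N) case, covariance and correlation can be sketched as below, assuming the sample versions (dividing by n - 1) and Pearson's correlation. The function names and data are illustrative.

```python
import math
import statistics


def sample_covariance(x, y):
    """Average product of deviations from the means, divided by n - 1."""
    x_bar, y_bar = statistics.mean(x), statistics.mean(y)
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (len(x) - 1)


def correlation(x, y):
    """Covariance rescaled by both standard deviations, giving a value in [-1, 1]."""
    return sample_covariance(x, y) / (statistics.stdev(x) * statistics.stdev(y))


x = [1, 2, 3, 4]
y = [2, 4, 6, 8]
cov = sample_covariance(x, y)
corr = correlation(x, y)
print(cov, corr)
```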