Statistics in R: Class test 1 Flashcards

Question 1

Q

Describe the two different types of categorical data

Answer

A

Nominal data - data items are placed into discrete classes with no inherent order between categories; any numbers (or letters) act as labels only (Gender, Blood Group)
Ordinal data - data items are placed into categories that can be logically arranged in a meaningful order (Ranks, Grades)

Question 2

Q

Describe the difference between interval and ratio data

Answer

A

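Interval data has no true zero, so ratios such as doubling are not meaningful (temperature in degrees Celsius: 20 degrees is not twice as hot as 10 degrees). Ratio data has a true zero, so doubling is meaningful (distance: 20 km is twice as far as 10 km)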

Question 3

Q

Describe the difference between continuous and discrete data

Answer

A

Continuous data can take any value within a range, such as weight, where a value of 1.5kg is meaningful.
Discrete data can only take certain separate values (usually whole-number counts), for example 1.5 fingers is not meaningful

Question 4

Q

What indicates that a dataset is normally distributed?

Answer

A

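The data form a symmetric, bell-shaped distribution centred on the mean, with about 68% of the data lying within 1 standard deviation either side of the mean (and about 95% within 2 standard deviations)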
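A minimal Python sketch of this check, assuming simulated data (the mean of 50, standard deviation of 10 and sample size are made-up values for illustration). It only counts the proportion of values within 1 standard deviation of the mean; for normally distributed data this should come out close to 0.68.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Simulated sample; the parameters here are illustrative assumptions
sample = [random.gauss(50, 10) for _ in range(10_000)]

mean = statistics.mean(sample)
sd = statistics.stdev(sample)

# Proportion of values lying within 1 standard deviation of the mean
within_1sd = sum(mean - sd <= x <= mean + sd for x in sample) / len(sample)
print(round(within_1sd, 2))
```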

Question 5

Q

Define the mean, median and mode

Answer

A

Mean - the sum of all values divided by the number of values
Mode - the value that occurs most frequently
Median - the middle value when the values are sorted in order
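As a small illustration, the three measures can be computed with Python's standard `statistics` module. The values below are made up for the example.

```python
import statistics

values = [2, 3, 3, 5, 7, 10]  # illustrative data only

mean = statistics.mean(values)      # 30 / 6 = 5
median = statistics.median(values)  # even count: average of the middle pair, 3 and 5
mode = statistics.mode(values)      # 3 occurs twice, more than any other value

print(mean, median, mode)
```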

Question 6

Q

What are the null hypothesis and the alternative hypothesis?

Answer

A

Null hypothesis - there is no significant difference, correlation or relationship
Alternative hypothesis - there is a significant difference, correlation or relationship

Question 7

Q

What is a Type I and Type II error?

Answer

A

A Type I error is the rejection of a null hypothesis that is actually true (a false positive)
A Type II error is the failure to reject a null hypothesis that is actually false (a false negative)

Question 8

Q

What is the p-value?

Answer

A

The p-value is the probability of obtaining results at least as extreme as those observed if the null hypothesis were true, i.e. by chance alone. If the p-value is <=0.05 (the conventional threshold) then the results are considered significant and the null hypothesis can be rejected

Question 9

Q

What does Pearson’s correlation coefficient (r) tell us?

Answer

A

The strength and direction of the linear relationship between two numerical, normally distributed variables. r lies between -1 and +1: values close to +1 indicate a strong positive correlation, values close to -1 a strong negative correlation, and values close to 0 no linear relationship
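A minimal Python sketch of how r is calculated, written out by hand so each step is visible. The function name `pearson_r` and the data are assumptions for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson's r: covariation of x and y scaled by the variation of each."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]    # rises perfectly with x
y_down = [10, 8, 6, 4, 2]  # falls perfectly with x

print(pearson_r(x, y_up))    # perfect positive correlation
print(pearson_r(x, y_down))  # perfect negative correlation
```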

Question 10

Q

Why is linear regression used?

Answer

A

To determine the line of best fit for numerical, normally distributed data with a linear relationship, and to make predictions based on that line

Question 11

Q

What are residuals?

Answer

A

Residuals are the differences between the observed values and the values predicted by the line of best fit (residual = observed - predicted)
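A minimal Python sketch, assuming a least-squares line of best fit and made-up data. The helper name `fit_line` is hypothetical. It fits the line, predicts a y-value for each x, and subtracts the prediction from the observation to get each residual.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for a single explanatory variable."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]  # illustrative observations

slope, intercept = fit_line(x, y)
predicted = [slope * xi + intercept for xi in x]

# Residual = observed value minus the value predicted by the line
residuals = [yi - pi for yi, pi in zip(y, predicted)]
print(slope, intercept)
print(residuals)
```

For a least-squares line with an intercept, the residuals always sum to zero (up to rounding), which is a quick sanity check on the fit.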

Question 12

Q

What does the Coefficient of Determination (r^2) tell us and how is it found?

Answer

A

The Coefficient of Determination tells us the proportion of the variance in the response variable that is explained by the explanatory variable (this does not by itself show causation). For a single explanatory variable it is found by squaring Pearson’s r
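A minimal Python sketch, assuming a single explanatory variable and made-up data. It calculates r^2 two ways: as the proportion of variance explained by the fitted line, and by squaring Pearson's r. The two should agree.

```python
import math

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]  # illustrative data only

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
sxx = sum((a - mean_x) ** 2 for a in x)
syy = sum((b - mean_y) ** 2 for b in y)  # total variation in y

# Least-squares line and the variation left unexplained by it
slope = sxy / sxx
intercept = mean_y - slope * mean_x
ss_residual = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))

r_squared = 1 - ss_residual / syy  # proportion of variance explained
r = sxy / math.sqrt(sxx * syy)     # Pearson's r

print(r_squared, r ** 2)
```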

Question 13

Q

What does the Adjusted r^2 value tell us?

Answer

A

Adjusted r^2 tells us the proportion of variance explained by the model, adjusted for the number of terms in the model, so that adding terms that do not improve the fit lowers the value
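A minimal Python sketch of the standard adjustment, 1 - (1 - r^2)(n - 1)/(n - p - 1), where n is the number of observations and p the number of explanatory terms. The function name and the input values are assumptions for illustration.

```python
def adjusted_r_squared(r_squared, n_obs, n_terms):
    """Penalise r^2 for the number of explanatory terms in the model."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_terms - 1)

# Same r^2 and sample size, different numbers of terms (made-up values)
one_term = adjusted_r_squared(0.6, n_obs=30, n_terms=1)
five_terms = adjusted_r_squared(0.6, n_obs=30, n_terms=5)

print(one_term, five_terms)
```

With the same r^2, the model with more terms gets the lower adjusted value.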

Question 14

Q

What is the formula for multiple linear regression analysis?

Answer

A

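y = m1*ev1 + m2*ev2 + … + c, where y is the response variable, ev1, ev2, … are the explanatory variables, m1, m2, … are their slopes (coefficients) and c is the intercept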
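A minimal Python sketch of using this formula to make a prediction. The function name `predict`, the slopes, the intercept and the explanatory values are all made up for illustration.

```python
def predict(slopes, evs, intercept):
    """y = m1*ev1 + m2*ev2 + ... + c for one set of explanatory values."""
    return sum(m * ev for m, ev in zip(slopes, evs)) + intercept

# Two explanatory variables with illustrative coefficients
y = predict(slopes=[2.0, -0.5], evs=[3.0, 4.0], intercept=1.0)
print(y)  # 2.0*3.0 + (-0.5)*4.0 + 1.0
```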

Question 15

Q

How would you visualise the relationships between the RV and EVs in a multiple linear regression analysis?

Answer

A

Using a pairs plot, which draws a scatter plot for every pair of variables so that each pairwise relationship can be inspected

Question 16

Q

What are the steps to determining the minimum sufficient model?

Answer

A

Create a model that includes all explanatory variables
Check whether all explanatory variables are significant (p<=0.05)
If not, remove the least significant term (the one with the highest p-value) from the model
Run the model again and repeat until all remaining explanatory variables are significant
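The steps above can be sketched as a loop in Python. This is only an outline of the control flow: `p_values_for` is a hypothetical function, supplied by the caller, that refits the model on the given terms and returns a p-value for each. The stand-in used here returns fixed, made-up p-values, whereas a real refit would change them at every step.

```python
def minimum_sufficient_model(terms, p_values_for, alpha=0.05):
    """Drop the least significant term until every remaining term has p <= alpha."""
    terms = list(terms)
    while terms:
        p_values = p_values_for(terms)  # refit with the current terms
        worst = max(terms, key=lambda t: p_values[t])
        if p_values[worst] <= alpha:
            break  # every remaining term is significant
        terms.remove(worst)  # remove the least significant term and refit
    return terms

def fake_p_values(terms):
    """Hypothetical stand-in for a model fit: fixed, made-up p-values."""
    table = {"rainfall": 0.001, "altitude": 0.30, "ph": 0.04, "slope": 0.62}
    return {t: table[t] for t in terms}

kept = minimum_sufficient_model(["rainfall", "altitude", "ph", "slope"], fake_p_values)
print(kept)
```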

Question 17

Q

What are the key assumptions when undertaking linear regression?

Answer

A

Constant variance of residuals
Normal distribution of residuals

Question 18

Q

On a plot of residuals vs fitted y-values, what indicates that the residuals have constant variance?

Answer

A

A “sky at night” pattern, with points scattered randomly and no obvious structure (such as a funnel or curve), indicates constant variance

Question 19

Q

What indicates on a QQ plot that residuals are normally distributed?

Answer

A

All points lie on, or close to, the straight 45 degree reference line

Question 20

Q

When is it appropriate to use Pearson’s correlation coefficient and linear regression?

Answer

A

When the variables are numerical and normally distributed and the relationship between them is linear

Question 21

Q

What is a contingency table and what does it show?

Answer

A

A contingency table is a table in matrix format that displays the frequency distribution of two or more categorical variables, with each cell showing the count for one combination of categories
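A minimal Python sketch of building a two-way contingency table by counting combinations of categories. The variables (habitat and species presence) and the observations are made up for illustration.

```python
from collections import Counter

# Each observation is a (habitat, species_present) pair; illustrative data only
observations = [
    ("forest", "yes"), ("forest", "yes"), ("forest", "no"),
    ("meadow", "no"), ("meadow", "no"), ("meadow", "yes"), ("meadow", "no"),
]

counts = Counter(observations)
rows = sorted({habitat for habitat, _ in observations})
cols = sorted({present for _, present in observations})

# table[row][col] holds the frequency of that combination of categories
table = {r: {c: counts[(r, c)] for c in cols} for r in rows}

for r in rows:
    print(r, table[r])
```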

Question 22

Q

When can the Kendall and Spearman correlation coefficients be used?

Answer

A

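When the data are ordinal, or interval data that are not normally distributed, or when the relationship is non-linear. Both coefficients work on the ranks of the data, so they measure whether one variable consistently increases or decreases with the other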
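A minimal Python sketch of Spearman's coefficient, calculated as Pearson's r on the ranks of the data (tied values share their average rank). The function names and the data are assumptions for illustration; Kendall's coefficient is not shown.

```python
import math

def ranks(values):
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman(x, y):
    """Spearman's coefficient: Pearson's r applied to the ranks."""
    return pearson_r(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]  # always increasing with x, but not linearly

print(spearman(x, y), pearson_r(x, y))
```

Here y always increases with x, so the rank-based coefficient is 1 even though the relationship is not a straight line and Pearson's r falls below 1.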