Statistics in R: Class test 1 Flashcards

1
Q

Describe the two different types of categorical data

A

Nominal data - data items placed into discrete classes with no inherent order between the categories (Gender, Blood Group)
Ordinal data - numbers (or letters) act as labels only, but the categories can be logically arranged in a meaningful order (Ranks, Grades)

2
Q

Describe the difference between interval and ratio data

A

Interval data has no true zero, so ratios such as doubling are not meaningful (temperature in degrees Celsius). Ratio data has a true zero, so doubling is meaningful (distance).
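
A small arithmetic sketch of why doubling fails on an interval scale, written in Python purely for illustration. The temperatures and distances are made-up values; the only outside fact assumed is the Celsius-to-Kelvin offset of 273.15.

```python
# Celsius is an interval scale: its zero point is arbitrary.
celsius_low, celsius_high = 10.0, 20.0
naive_ratio = celsius_high / celsius_low  # 2.0, but 20 C is not "twice as hot"

# Kelvin has a true zero, so the same temperatures give a meaningful ratio.
kelvin_low = celsius_low + 273.15
kelvin_high = celsius_high + 273.15
true_ratio = kelvin_high / kelvin_low  # roughly 1.035

# Distance has a true zero: 20 km really is twice 10 km.
distance_ratio = 20.0 / 10.0

print(naive_ratio, round(true_ratio, 3), distance_ratio)
```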

3
Q

Describe the difference between continuous and discrete data

A

Continuous data can take any value within a range, such as weight, where a value of 1.5kg is meaningful.
Discrete data can take only a set of distinct values, for example 1.5 fingers is not meaningful

4
Q

What indicates that a dataset is normally distributed?

A

About 68% of the data lie within 1 standard deviation either side of the mean (and about 95% within 2 standard deviations)
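
The 68% figure can be checked against a theoretical normal distribution. This is a minimal sketch using `NormalDist` from Python's standard `statistics` module; it describes the ideal curve, not any particular dataset.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)

# Proportion of values within 1 and within 2 standard deviations of the mean.
within_one_sd = standard_normal.cdf(1.0) - standard_normal.cdf(-1.0)
within_two_sd = standard_normal.cdf(2.0) - standard_normal.cdf(-2.0)

print(round(within_one_sd, 3), round(within_two_sd, 3))
```

A real sample will only approximate these proportions, so a match is a hint of normality rather than proof.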

5
Q

Define the mean, median and mode

A

Mean - the sum of all values divided by the number of values
Mode - the value that occurs most frequently
Median - the middle value when the values are sorted in order
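
As a quick illustration, the three measures can be computed with Python's standard `statistics` module. The numbers are made-up example data.

```python
from statistics import mean, median, mode

values = [2, 3, 3, 5, 7, 10]  # made-up illustrative data

mean_value = mean(values)      # sum of values / number of values
median_value = median(values)  # even count, so the average of the two middle values
mode_value = mode(values)      # most frequent value

print(mean_value, median_value, mode_value)
```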

6
Q

What are the null hypothesis and the alternative hypothesis?

A

Null hypothesis - there is no significant difference, correlation or relationship
Alternative hypothesis - there is a significant difference, correlation or relationship

7
Q

What is a Type I and Type II error?

A

A Type I error is the rejection of a true null hypothesis (a false positive)
A Type II error is the non-rejection of a false null hypothesis (a false negative)

8
Q

What is the p-value?

A

The p-value is the probability of obtaining results at least as extreme as those observed if the null hypothesis were true. By convention, if the p-value is <=0.05 the results are considered significant and the null hypothesis can be rejected
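
To make the definition concrete, here is a hedged sketch of an exact two-sided p-value for a coin-flip experiment, where the null hypothesis is that the coin is fair. The function name `two_sided_binomial_p` and the figures (16 heads in 20 flips) are illustrative assumptions, not part of the original card.

```python
from math import comb

def two_sided_binomial_p(successes, trials):
    """Exact two-sided p-value for the null hypothesis P(success) = 0.5."""
    # Under a fair coin the distribution is symmetric, so "at least as extreme"
    # means at least as far from trials / 2 as the observed count.
    distance = abs(successes - trials / 2)
    extreme_outcomes = sum(
        comb(trials, k)
        for k in range(trials + 1)
        if abs(k - trials / 2) >= distance
    )
    return extreme_outcomes / 2 ** trials

p_value = two_sided_binomial_p(successes=16, trials=20)
reject_null = p_value <= 0.05  # conventional 5% significance threshold

print(round(p_value, 4), reject_null)
```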

9
Q

What does Pearson’s correlation coefficient (r) tell us?

A

The strength and direction of the linear relationship between two numerical, normally distributed variables. r ranges from -1 to +1: values close to +1 or -1 indicate a strong positive or negative correlation, and values close to 0 indicate little or no linear relationship
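
A minimal sketch of how r is calculated, written from the textbook definition (covariance divided by the product of the spreads). The helper name `pearson_r` and the data are illustrative assumptions.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

hours = [1, 2, 3, 4, 5]        # made-up explanatory values
scores = [52, 55, 61, 64, 70]  # made-up response values

r = pearson_r(hours, scores)
print(round(r, 3))  # close to +1: strong positive linear relationship
```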

10
Q

Why is linear regression used?

A

To determine the line of best fit in a dataset that is linear, normally distributed and numerical, and to make predictions based on it

11
Q

What are residuals?

A

Residuals are the differences between the observed values and the values predicted by the line of best fit
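
The sketch below fits a least-squares line of best fit and then computes the residuals as observed minus fitted values. It assumes ordinary least squares with a single explanatory variable; the function name `fit_line` and the data are made up for illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [1, 2, 3, 4, 5]       # explanatory variable (made-up data)
ys = [52, 55, 61, 64, 70]  # response variable (made-up data)

slope, intercept = fit_line(xs, ys)
fitted = [slope * x + intercept for x in xs]

# Residual = observed value minus the value predicted by the line.
residuals = [y - f for y, f in zip(ys, fitted)]

print(slope, intercept)
print([round(res, 2) for res in residuals])
```

For a least-squares line with an intercept, the residuals sum to zero (up to rounding error), which is a handy sanity check.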

12
Q

What does the Coefficient of Determination (r^2) tell us and how is it found?

A

The Coefficient of Determination tells us what proportion of the variance in the response variable is explained by the explanatory variable. In simple linear regression it is found by squaring Pearson’s r
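
As a rough check, r^2 can be computed two ways for a simple linear regression: by squaring Pearson's r, and as 1 minus the ratio of residual to total sum of squares. The data are made up and the variable names are illustrative.

```python
from math import sqrt

xs = [1, 2, 3, 4, 5]       # explanatory variable (made-up data)
ys = [52, 55, 61, 64, 70]  # response variable (made-up data)

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)  # total sum of squares

# Route 1: square Pearson's r.
r = sxy / sqrt(sxx * syy)
r_squared_from_r = r ** 2

# Route 2: 1 - (residual sum of squares / total sum of squares).
slope = sxy / sxx
intercept = mean_y - slope * mean_x
ss_residual = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
r_squared_from_fit = 1 - ss_residual / syy

print(round(r_squared_from_r, 4), round(r_squared_from_fit, 4))
```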

13
Q

What does the Adjusted r^2 value tell us?

A

Adjusted r^2 tells us the proportion of variance explained by the model, adjusted downwards for the number of terms in the model so that adding unhelpful terms is penalised
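
A small sketch of the adjustment, assuming the commonly used formula 1 - (1 - r^2)(n - 1)/(n - p - 1), where n is the number of observations and p the number of explanatory variables. The formula is not stated in the original card, and the input figures are made up.

```python
def adjusted_r_squared(r_squared, n_observations, n_explanatory):
    """Adjusted r^2 using the common (n - 1) / (n - p - 1) correction."""
    penalty = (n_observations - 1) / (n_observations - n_explanatory - 1)
    return 1 - (1 - r_squared) * penalty

# Same raw r^2 and sample size, different numbers of terms (made-up figures).
one_term = adjusted_r_squared(0.80, n_observations=30, n_explanatory=1)
five_terms = adjusted_r_squared(0.80, n_observations=30, n_explanatory=5)

print(round(one_term, 3), round(five_terms, 3))
```

With the raw r^2 held fixed, the model with more terms receives the lower adjusted value.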

14
Q

What is the formula for multiple linear regression analysis?

A

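y = m1*ev1 + m2*ev2 + … + c
where y is the response variable, ev1, ev2, … are the explanatory variables, m1, m2, … are their coefficients and c is the intercept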
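
The formula above can be read as a prediction rule: multiply each explanatory variable by its coefficient, add the results, then add the intercept. The sketch below applies it with made-up coefficients; the function name `predict` is an illustrative assumption.

```python
def predict(coefficients, explanatory_values, intercept):
    """Evaluate y = m1*ev1 + m2*ev2 + ... + c for one observation."""
    if len(coefficients) != len(explanatory_values):
        raise ValueError("need one coefficient per explanatory variable")
    return sum(m * ev for m, ev in zip(coefficients, explanatory_values)) + intercept

# Made-up model with two explanatory variables: m1 = 2.0, m2 = 0.5, c = 1.0.
y = predict(coefficients=[2.0, 0.5], explanatory_values=[3.0, 10.0], intercept=1.0)
print(y)
```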

15
Q

How would you visualise the relationships between the RV and EVs in a multiple linear regression analysis?

A

Using a pairs plot, which draws a scatterplot for every pair of variables, so you can see each single linear relationship between the variables at a glance

16
Q

What are the steps to determining the minimum sufficient model?

A
  1. Create your model that includes all explanatory variables
  2. Check whether all explanatory variables are significant (p<=0.05)
  3. If any are not, remove the least significant term (the one with the highest p-value) from the model
  4. Run the model again and repeat until all remaining explanatory variables are significant
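
The steps above can be sketched as a loop. This is only an outline under stated assumptions: `fit_p_values` is a hypothetical callable that refits the model on the given terms and returns a p-value per term, and `toy_p_values` is a stand-in with made-up, fixed p-values so the loop can run. In a real analysis the p-values change every time the model is refitted.

```python
def backward_eliminate(terms, fit_p_values, alpha=0.05):
    """Drop the least significant term until every remaining p-value <= alpha."""
    remaining = list(terms)
    while remaining:
        p_values = fit_p_values(remaining)  # refit with the current terms
        worst = max(remaining, key=lambda term: p_values[term])
        if p_values[worst] <= alpha:
            break  # every remaining term is significant
        remaining.remove(worst)  # remove the least significant term
    return remaining

def toy_p_values(terms):
    # Stand-in for a real model fit: fixed, made-up p-values for illustration.
    made_up = {"rainfall": 0.001, "altitude": 0.03, "soil_ph": 0.40, "slope": 0.12}
    return {term: made_up[term] for term in terms}

selected = backward_eliminate(
    ["rainfall", "altitude", "soil_ph", "slope"], toy_p_values
)
print(selected)
```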
17
Q

What are the key assumptions when undertaking linear regression?

A
  1. Constant variance of residuals
  2. Normal distribution of residuals
18
Q

On a plot of residuals vs fitted y-values, what indicates that the residuals have constant variance?

A

A “sky at night” pattern, with points scattered randomly and no obvious structure, indicates constant variance

19
Q

What indicates on a QQ plot that residuals are normally distributed?

A

All points lie on, or close to, the straight 45 degree line

20
Q

When is it appropriate to use Pearson’s correlation coefficient and linear regression?

A

When the variables are numerical and normally distributed and the relationship between them is linear

21
Q

What is a contingency table and what does it show?

A

A contingency table is a table in matrix format that displays the frequency distribution of the variables: each cell gives the number of observations for one combination of categories
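
A minimal sketch of building a two-way contingency table by counting category pairs with `collections.Counter`. The habitat and presence observations are made-up data.

```python
from collections import Counter

# Each observation is a (habitat, species status) pair; made-up data.
observations = [
    ("woodland", "present"), ("woodland", "present"), ("woodland", "absent"),
    ("grassland", "absent"), ("grassland", "absent"),
    ("grassland", "present"), ("grassland", "absent"),
]

counts = Counter(observations)
row_labels = sorted({habitat for habitat, _ in observations})
column_labels = sorted({status for _, status in observations})

# One row per habitat, one column per status; each cell is a frequency.
table = [[counts[(row, col)] for col in column_labels] for row in row_labels]

print("columns:", column_labels)
for label, row in zip(row_labels, table):
    print(label, row)
```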

22
Q

When can the Kendall and Spearman correlation coefficients be used?

A

When the data are interval or ordinal, are not normally distributed, or the relationship is non-linear (but still consistently increasing or decreasing)
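
To illustrate the rank-based idea, the sketch below computes Spearman's coefficient as Pearson's r applied to ranks, and compares it with Pearson's r on a curved but steadily increasing relationship. It assumes there are no tied values; the helper names and data are illustrative.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """Rank from 1 (smallest) upwards; assumes no tied values."""
    ordered = sorted(values)
    return [ordered.index(value) + 1 for value in values]

def spearman_rho(xs, ys):
    """Spearman's coefficient: Pearson's r computed on the ranks."""
    return pearson_r(ranks(xs), ranks(ys))

xs = [1, 2, 3, 4, 5, 6]
ys = [x ** 3 for x in xs]  # non-linear but always increasing

pearson = pearson_r(xs, ys)
spearman = spearman_rho(xs, ys)
print(round(pearson, 3), round(spearman, 3))
```

Because the ranks of the two variables agree exactly, the rank-based coefficient reaches 1 even though the relationship is not a straight line.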