Week 2 SCM (regressions) Flashcards

1
Q

What general formula is basic statistical modelling based on?

A

outcome = model + error

2
Q

What are two features we should aim for in a statistical model?

A

we should aim for a statistical model that minimises error and can be generalised beyond the dataset

3
Q

what statistical test should we use when there are two levels of our independent variable?

A
  • a t-test
4
Q

what statistical test should we use when our independent variable is continuous, with more than two levels?

A
  • a linear regression
5
Q

what is a general description of the general linear model?

A

the general linear model is a model for which the DV is composed of a linear combination of independent variables

each independent variable has a weight, given by b

this weight determines the relative contribution each variable makes to the overall prediction

6
Q

what can and can't we use the correlation coefficient for?

A

we can use the coefficient to describe the relationship between two variables. We can then test this relationship for significance

we cannot use the coefficient to make predictions

7
Q

what can/can't we use the GLM for?

A

we can use the GLM to describe a relationship, to decide significance (p), and also to make predictions

8
Q

what are the differences between correlation and linear regression?

A
  • correlation quantifies the direction and strength of the relationship between two numeric variables (x & y). correlation always lies between -1 & 1.
  • simple linear regression relates the two variables x & y to each other through an equation y = a + bx
  • therefore if visualised on a graph, a linear regression appears as a straight line. This line can then be used to make predictions
9
Q

what is the equation of a simple linear regression, and which terms represent the slope and the intercept of the line?

A
  • if x and y are the variables a linear regression equation is: y= a + bx
  • b is the slope of the line and a is the intercept
10
Q

what is the difference between the line prediction and the specific data value called?

A

the residuals

11
Q

what would the best line from a linear regression analysis show?

A

minimised residuals

12
Q

what letter/symbol is used to define the slope and intercept of the line in a linear regression analysis

A

the slope is represented by b1
the intercept is represented by b0

13
Q

what is the (more complicated) equation of a linear regression analysis?

A

Yi = (b0 + b1Xi) + εi

Yi is the outcome that we want to predict
Xi is the ith participant's score on the predictor variable
b1 is the gradient of the regression line
b0 is the intercept of the regression line
εi is the residual, which represents the difference between the score predicted by the line and the score that the participant actually obtained
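To make the notation concrete, here is a minimal sketch that computes the intercept, slope and residuals for a small made-up dataset (the numbers are purely illustrative):

```python
# Least-squares estimates for Yi = b0 + b1*Xi + residual, on made-up data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

# b1 = covariance term / variance term; b0 = mean(y) - b1 * mean(x)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
sxx = sum((xi - mx) ** 2 for xi in x)
b1 = sxy / sxx        # gradient of the regression line
b0 = my - b1 * mx     # intercept of the regression line

# Each residual is the observed score minus the score the line predicts
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(b0, b1)  # roughly 2.2 and 0.6 for these data
```

For these numbers the residuals sum to (essentially) zero, which is always true of a least-squares fit.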

14
Q

how can you determine the type of relationship from the gradient of a line?

A

if the gradient is a positive value there is a positive relationship
if the gradient is a negative value there is a negative relationship

15
Q

how do we asses the fit of a line?

A
  • the smaller the residuals, the better the line fits
  • therefore to assess the fit of a line we look at the values of the residuals (the vertical deviations)
  • because the residuals can be either positive or negative, we must square them in this analysis
  • therefore the line with the smallest sum of squared residuals is the best-fitting line
  • when conducting a linear regression, the mathematics gives us the line with the smallest sum of squared residuals
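As a sketch of the smallest-sum-of-squared-residuals criterion, the code below (made-up data; 2.2 and 0.6 are the least-squares estimates for this dataset) shows the fitted line beating an arbitrary rival line:

```python
# Compare the sum of squared residuals of two candidate lines on made-up data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

def sum_sq_residuals(intercept, slope):
    # Square each residual so positive and negative deviations can't cancel
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

best = sum_sq_residuals(2.2, 0.6)   # the least-squares line for these data
rival = sum_sq_residuals(2.0, 0.8)  # any other line does worse
print(best < rival)  # True
```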
16
Q

what is a simple definition of regression towards the mean

A

if a variable is extreme the first time you measure it, it will tend to be closer to the average the next time you measure it

this is because an extreme value is more likely to have been influenced by chance

the next time you measure it, the value is less likely to be extreme by chance, and so it will tend to be closer to the mean

17
Q

how can regression to the mean trick us and how can we counteract this?

A

it can make it seem like an intervention is working but actually it is just the effect of regression to the mean

we can avoid being tricked by this by adding a control group

18
Q

what is the difference between pearsons correlation coefficient and the regression coefficient?

A
  • pearsons correlation coefficient is the covariance / (the SD of x * the SD of y)
  • the regression coefficient is the covariance / (the SD of x * the SD of x), i.e. the variance of x
19
Q

how is the slope of a regression related to the correlation coefficient?

A

the slope (b1) = R * SDy/SDx

so the slope is equal to R * the ratio of the standard deviations of y and x

  • this is because the covariance is R * SDx * SDy
  • and the slope/regression coefficient is the covariance / (SDx * SDx)

so if you substitute the covariance formula into the formula for the slope, it simplifies to the equation above

  • this means that if the SD of x is the same as the SD of y, the correlation coefficient is equal to the regression coefficient
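These relationships can be checked numerically. The sketch below uses made-up data; note that the slope comes out as R * SDy/SDx (the covariance over the variance of x):

```python
import math

# Check: r = cov/(SDx*SDy), b1 = cov/(SDx*SDx), hence b1 = r * SDy/SDx.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
sdx = math.sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))
sdy = math.sqrt(sum((b - my) ** 2 for b in y) / (n - 1))

r = cov / (sdx * sdy)   # correlation coefficient
b1 = cov / (sdx * sdx)  # regression coefficient (slope)
print(abs(b1 - r * sdy / sdx) < 1e-12)  # True: the identity holds
```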
20
Q

in what situation would the correlation coefficient (R) be equal to the regression coefficient

A

if the SD of x is equal to the SD of y

21
Q

what would the variability of a regression model show and how would we calculate it?

A
  • it would show how much variability in the outcome is not explained by the model
  • we would calculate it by looking at the sum of squared errors
  • each error is also known as the residual, and is the difference between the measured value and the value predicted by the regression line
  • we then square these to get the sum of squared errors
22
Q

how do you calculate the mean squared error from the sum of squared errors?

A
  • the SSE/df
  • the degrees of freedom = N- 2 for a simple regression
23
Q

how do you calculate the standard error of the model from the mean squared error of a linear regression?

A

the standard error of the model is the square root of the mean squared error
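The chain from sum of squared errors to mean squared error to the model's standard error can be sketched with made-up numbers (2.2 and 0.6 are the least-squares estimates for this toy dataset):

```python
import math

# SSE -> MSE -> standard error of the model, with df = N - 2.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = 2.2, 0.6  # least-squares fit for these made-up data

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
mse = sse / (len(x) - 2)   # divide by the degrees of freedom, N - 2
se_model = math.sqrt(mse)  # standard error of the model
print(round(mse, 3), round(se_model, 3))
```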

24
Q

what does b0 represent in regression?

A

the intercept of the regression line

25
Q

what does b1 represent in regression?

A

the gradient of the regression line

26
Q

what is the meaning of parameter estimates

A
  • parameter estimates are also known as coefficients
  • so in regression they are b0 and b1
27
Q

what is the formula of the t-statistic of a simple linear model

A

t = b1 / SE of b1

28
Q

What is the difference between SSt, SSr and SSm in a linear regression?

A

SSt uses the difference between the observed data and the mean value of y. This shows the total variance in the data.

SSr uses the difference between the observed data and the regression line. This shows the error in the model.

SSm uses the difference between the mean value of y and the regression line. This shows the improvement due to the model.

29
Q

what is the difference between the coefficient of determination (R squared) and the coefficient of correlation (R)?

A
  • The coefficient of determination shows the percentage variation in Y which is explained by all the x values together
  • the coefficient of correlation is the degree of the relationship between two variables x and y
  • when there is only one x variable the coefficient of determination is the same as the coefficient of correlation
  • the coefficient of determination is between 0 and 1. It cannot be negative because it is squared. The higher the better.
  • The coefficient of correlation is between -1 and 1. 1 would indicate that the two variables are moving in unison and -1 would indicate that the two variables are perfect opposites. 0 would suggest that they are not correlated at all.
30
Q

How do you express the different sum of squares as a formula?

A

SSt = SSm + SSr

31
Q

what is the formula for R squared in terms of the sum of squares?

A

R squared = SSm/SSt

also R squared = 1 - SSr/SSt
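Both routes to R squared agree, as this sketch on made-up data shows (b0 = 2.2, b1 = 0.6 are the least-squares estimates for this dataset):

```python
# R squared two ways: SSm/SSt and 1 - SSr/SSt, on made-up data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
my = sum(y) / len(y)
b0, b1 = 2.2, 0.6  # least-squares fit for these data

sst = sum((yi - my) ** 2 for yi in y)                          # data vs mean
ssr = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))  # data vs line
ssm = sst - ssr                                                # improvement due to the model

r_squared = ssm / sst
print(round(r_squared, 3), round(1 - ssr / sst, 3))  # the two agree
```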

32
Q

what are the key points when interpreting R squared?

A
  • its value always falls between 0 and 1 because it is a proportion
  • If R squared is 1, all the data points fit perfectly on the regression line. The predictor x accounts for all the variation in y.
  • If R squared is 0 the estimated regression line is perfectly horizontal. The predictor x accounts for none of the variation in y.
  • So if R squared = 0.2, 20% of the variation in y is explained by variation in x
33
Q

what is the general idea when using the sum of squares strategy to assess the fit of a regression line?

A
  • we calculate the fit of the data against the mean, and compare it with the fit of the data to the regression line
  • the equation we use to calculate the fit of each model is this:

deviation = sum((observed - model)^2)

  • The SSt represents how good of a model the mean is
  • The SSr represents how good of a model the regression is
  • We then use these to calculate the SSm, which represents how much better the regression model is than the mean model
  • We can then calculate Rsquared to represent the proportion of improvement due to the model
  • Rsquared = SSm/SSt
  • in simple regression the square root of this value is the same as pearsons correlation coefficient
34
Q

Describe F tests and linear regressions

A
  • F tests follow the usual test-statistic formula: the amount of systematic variance / the amount of unsystematic variance
  • F is based on the ratio of the improvement due to the model (SSm) and the difference between the model and the observed data (SSr)
  • however, instead of using the sums of squares we standardise them to mean sums of squares by dividing by the degrees of freedom
  • for SSm the degrees of freedom are the number of variables in the model. For SSr they are the number of observations minus the number of parameters being estimated
  • so: F = MSm/MSr
  • so it's a measure of how much the model has improved the prediction of the outcome compared to the level of inaccuracy in the model
  • if the model is good we would expect the improvement (top) to be large and the inaccuracy (bottom) to be small. Therefore the larger the F value, the better the model
  • a good model should have an F value greater than 1
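A minimal sketch of F = MSm/MSr on made-up data (one predictor, so the model degrees of freedom are 1 and the residual degrees of freedom are N - 2):

```python
# F = MSm / MSr: mean squares from the model and residual sums of squares.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
my = sum(y) / len(y)
b0, b1 = 2.2, 0.6  # least-squares fit for these made-up data

ssr = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
ssm = sum((yi - my) ** 2 for yi in y) - ssr

df_m = 1           # number of variables in the model
df_r = len(x) - 2  # observations minus parameters estimated (b0 and b1)
f = (ssm / df_m) / (ssr / df_r)
print(f > 1)  # True here: the improvement outweighs the inaccuracy
```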
35
Q

how would you use a t test to assess a regression coefficient?

A
  • the regression coefficient b1 represents the regression line's predictive power
  • if the regression line was useless at prediction, b1 would = 0 and the line would be flat
  • so if a variable significantly predicts an outcome, its value of b1 should be significantly different from 0
  • we can use a t-test to test this
  • this t-test tests the null hypothesis that b1 = 0
  • the t-test is based on whether the value of b1 is big compared to the amount of error in that estimate
  • to estimate how much error we would expect from sampling variation alone, we use the standard error of the b values across samples (if we were to plot a frequency distribution of the b values from all our samples, this would be the standard deviation of that distribution)
  • if the standard error is very small it means there is little variation between b values
  • the t-test therefore tells us whether b is different from 0 relative to the level of variation we'd expect from sampling differences
  • t = (b observed - b expected) / standard error of b
  • because we are testing the null hypothesis that b = 0, we can simplify this to:

t = b / SE of b

  • the larger t is, the better
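A sketch of this t-test on made-up data. The formula SE(b1) = (standard error of the model) / sqrt(Sxx) is the usual simple-regression estimate and is an assumption here, not stated on the card:

```python
import math

# t = b1 / SE(b1), testing the null hypothesis that b1 = 0.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
mx = sum(x) / len(x)
b0, b1 = 2.2, 0.6  # least-squares fit for these made-up data

ssr = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
se_model = math.sqrt(ssr / (len(x) - 2))  # sqrt of the mean squared error
sxx = sum((xi - mx) ** 2 for xi in x)
se_b1 = se_model / math.sqrt(sxx)         # standard error of the slope

t = b1 / se_b1
print(round(t, 3))
```

For a simple regression this t is the square root of the F statistic (t squared = F).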
36
Q

What is the degrees of freedom for a regression?

A

N - P - 1

N = total sample size

P = number of predictors

37
Q

What is the F statistic in linear regression output?

A

The ratio of the mean square explained by the model (MSm) to the mean square error not explained by the model (MSr)

MSm/MSr

38
Q

what is R squared in linear regression output?

A

The ratio of the variance that is explained by the model to the total variance

SSm/SSt
