GLM 1 - Simple linear regression Flashcards
In a linear relationship, how is the value of an outcome variable Y approximated?
Y ≈ β0 + β1X.
Y = dependent variable
β0 = intercept
β1 = slope coefficient of X
What is the intercept/B0 (often labelled the constant)?
The expected value of Y when X = 0.
What is β1?
The slope: how Y changes per unit increase in X. When X is increased by one unit, Y increases by β1.
What is the terminology of a linear regression?
- We say that Y is regressed on X .
- We are expressing Y in terms of X .
- The dependent variable, Y , depends on X .
- The independent variable, X , doesn’t depend on anything.
How are the coefficients or parameters B0 and B1 estimated?
Using the available data:
(x1, y1), (x2, y2), . . . , (xn, yn ) - We have here a sample size of n data points.
How are the estimates of parameters written?
The estimates of the parameters are written with a circumflex or hat: ^
We then write our linear equation with these estimated coefficients: y^i = β^0 + β^1 xi.
Only the dependent variable carries a hat.
The independent variable (xi) does not have a hat, as it is treated as fixed.
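A minimal sketch of this prediction step, assuming hypothetical estimated coefficients and example x values (none of these numbers come from the flashcards):

```python
# Minimal sketch: predictions y^i = β^0 + β^1 * xi from estimated coefficients.
# The coefficient values and x data below are made-up examples.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # independent variable, treated as fixed (no hat)
beta0_hat, beta1_hat = 0.5, 2.0      # hypothetical estimated coefficients

y_hat = beta0_hat + beta1_hat * x    # predicted values (these carry the hat)
print(y_hat)                         # [2.5 4.5 6.5 8.5]
```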
B0 and B1 are independent of each other
True or false
True
What does the circumflex allow us to differentiate between?
True value and estimated value
What happens if we add a value to β0?
This would only shift y, not the β1xi term; β0 can change independently of β1.
What is y^i?
The predictions, or predicted values, of the outcomes y, given the independent variables, the xi’s.
What are the differences between the predicted values, y^ i’s, and the observed values, yi ’s?
The residuals:
e^i := yi − y^i.
That is, these are the values that remain after we have removed the
predictions from the observations.
Why are the residuals, e^i ’s, also equipped with a hat?
Because these are also estimated values.
Why are the black error bars vertical, and not perpendicular to the line in blue?
Because residuals are measured in the y direction: each residual is the vertical difference between the observed yi and the predicted value y^i, so yi = y^i + e^i.
How can the optimal value of the parameters, β0 and β1 be found?
By considering the sum of the squares of the residuals:
RSS := (e^1)² + (e^2)² + . . . + (e^n)².
Why do we square residuals?
Residuals are defined by subtracting the predicted values from the observed values, so we can rewrite the RSS as RSS = (y1 − y^1)² + . . . + (yn − y^n)². Some residuals are negative and some are positive; squaring ensures that each one makes a positive contribution to the RSS, so they cannot cancel out.
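A minimal sketch of this point, using made-up observations and a hypothetical fitted line: residuals of mixed sign all contribute positively to the RSS once squared.

```python
# Minimal sketch with made-up data: squaring residuals of mixed sign.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.3, 4.1, 6.2, 7.9])   # hypothetical observed values
y_hat = 0.4 + 1.9 * x                # hypothetical fitted line

residuals = y - y_hat                # e^i = yi - y^i, some negative, some positive
rss = np.sum(residuals ** 2)         # every squared term contributes positively

print(residuals)
print(residuals.sum(), rss)          # raw residuals can cancel; squares cannot
```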
What is the goal when choosing optimal values for β0 and β1?
To minimise the distance of the fitted line from all the data points.
What is RSS a function of and why?
β0 and β1, because every residual depends on the values of β0 and β1. Thus, we may write the RSS as a function of these quantities:
RSS(β^0, β^1) = (e^1)² + (e^2)² + . . . + (e^n)².
The value taken by the RSS can therefore be minimized for some values of β^0 and β^1.
How do we write this?
(β^0, β^1) := argmin RSS(β0, β1),
argmin RSS means the argument that minimizes the RSS,
where the hats on the right-hand side of the RSS have been suppressed.
The RSS is a function of the parameters β0 and β1 therefore…
it can take a range of values across a two dimensional landscape
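A minimal sketch of this idea, with made-up data: the RSS is a function over the two-dimensional (β0, β1) landscape, and the standard least-squares closed form gives the minimizing values.

```python
# Minimal sketch with made-up data: RSS as a function of (beta0, beta1),
# and the least-squares closed form for the values that minimize it.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.3, 4.1, 6.2, 7.9])

def rss(beta0, beta1):
    residuals = y - (beta0 + beta1 * x)
    return np.sum(residuals ** 2)

print(rss(0.0, 2.0), rss(1.0, 1.5))   # two points in the 2-D parameter landscape

# Closed-form least-squares estimates (the argmin of the RSS):
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()
print(beta0_hat, beta1_hat, rss(beta0_hat, beta1_hat))
```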
How can we assess the accuracy/goodness of fit of our model?
Using the previously minimized value of the RSS
What is one way of quantifying the accuracy of the model?
Compare the RSS with the total sum of squares, TSS (which can be viewed as the RSS of the null model, since the null model is the model with only the y-intercept).
What is R2 also known as?
Coefficient of determination
What does R2 measure?
Proportion of variance in the dependent variable explained by the independent variable.
For simple regression, the R2 can be shown to be equivalent to what?
The squared correlation of the IV with the DV. That is, R² and the square of Cor(Y, X) are equal: R² = Cor(Y, X)².
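A minimal sketch of this equivalence, reusing the same made-up data: R² computed as 1 − RSS/TSS matches the squared correlation Cor(Y, X)².

```python
# Minimal sketch with made-up data: R^2 = 1 - RSS/TSS equals Cor(X, Y)^2
# in simple regression.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.3, 4.1, 6.2, 7.9])

beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()
y_hat = beta0_hat + beta1_hat * x

rss = np.sum((y - y_hat) ** 2)        # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)     # total sum of squares (null model)
r_squared = 1 - rss / tss

print(r_squared, np.corrcoef(x, y)[0, 1] ** 2)   # the two values agree
```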
What is a random variable?
A function, from a sample space Ω to the real numbers, R, such that
X : Ω → R.
Uppercase X denotes the random variable.
For every point in the sample space, ω ∈ Ω, the random variable X may (or may not) take what?
A different value, such that we have:
X (ω) = x.
We call x the realization of X (the random variable) at ω (the point in the sample space).
What does the probability of obtaining x count?
The number of ω producing x , written as
P[X = x ] := P[{ω ∈ Ω : X (ω) = x }].
What is the random variable for the toss of a fair (even) coin, with head, H, and tail, T?
X : {H, T} → {0, 1},
with X(H) = 0 and X(T) = 1, producing the probabilities P[X = 0] = P[X = 1] = 1/2.
In other words, X has {H, T} as its sample space, and X assigns 0 to heads and 1 to tails.
If we have a single-faced coin (one that always lands heads), Y : {H, T} → {0, 1}, such that Y(H) = 0, and Y(T) = 1, what are the probabilities?
P[Y = 0] = 1, and P[Y = 1] = 0
The measure P is used to give probability mass to each element in Ω.
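A minimal sketch of the two coins as explicit mappings from the sample space {H, T} to {0, 1}, with P assigning probability mass to each ω (the masses shown are the example ones above):

```python
# Minimal sketch: random variables as mappings omega -> x, with P giving
# probability mass to each element of the sample space {H, T}.
X = {"H": 0, "T": 1}          # fair coin: X(H) = 0, X(T) = 1
P_X = {"H": 0.5, "T": 0.5}    # P[X = 0] = P[X = 1] = 1/2

Y = {"H": 0, "T": 1}          # single-faced coin: Y(H) = 0, Y(T) = 1
P_Y = {"H": 1.0, "T": 0.0}    # P[Y = 0] = 1, P[Y = 1] = 0

def prob(rv, P, value):
    # P[X = x] := P[{omega : X(omega) = x}] -- sum the mass of the omegas producing x
    return sum(P[omega] for omega in rv if rv[omega] == value)

print(prob(X, P_X, 0), prob(X, P_X, 1))   # 0.5 0.5
print(prob(Y, P_Y, 0), prob(Y, P_Y, 1))   # 1.0 0.0
```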
What is the discrete expectation?
For a discrete random variable Y, the expectation E[Y] is the sum over all the possible values y (the realizations, i.e. all the values taken by Y over the sample space), each weighted by the probability of obtaining that value: E[Y] = Σ y · P[Y = y]. A discrete random variable takes a finite number of values here. This is almost identical to the arithmetic mean.
What is the Arithmetic mean ?
A special case of the expectation in which the probabilities are uniform across all possible values of y (each equal to 1/n).
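A minimal sketch with made-up values and probabilities, showing the discrete expectation as a probability-weighted sum and the arithmetic mean as the uniform-weight (1/n) special case:

```python
# Minimal sketch: discrete expectation E[Y] = sum_y y * P[Y = y], and the
# arithmetic mean as the special case with uniform weights 1/n (values made up).
import numpy as np

values = np.array([0.0, 1.0, 2.0])    # the possible realizations of Y
probs = np.array([0.5, 0.3, 0.2])     # P[Y = y] for each value; sums to 1

expectation = np.sum(values * probs)  # weight each value by its probability
arithmetic_mean = np.mean(values)     # equivalent to weights of 1/n each

print(expectation, arithmetic_mean)   # 0.7 1.0
```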
In simple regression what are we given?
Two sequences of data points.
Each pair of observations is a case, (yi , xi ), with i = 1, . . . , n.
What is the deterministic and stochastic part of a statistical model?
yi = β0 + β1xi + εi
β0 + β1xi = deterministic part
εi = stochastic part
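A minimal sketch of this decomposition, simulating data under hypothetical parameter values (β0, β1, σ chosen arbitrarily):

```python
# Minimal sketch: yi = beta0 + beta1*xi + eps_i, split into its deterministic
# and stochastic parts; all parameter values here are made up.
import numpy as np

rng = np.random.default_rng(0)
beta0, beta1, sigma = 1.0, 2.0, 0.5            # hypothetical true parameters

x = np.linspace(0.0, 5.0, 20)
deterministic = beta0 + beta1 * x              # E[Y | X = xi]
epsilon = rng.normal(0.0, sigma, size=x.size)  # additive noise with variance sigma^2
y = deterministic + epsilon                    # observed outcomes

print(y[:5])
```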
What is one difference between regression and correlation?
In regression one of the variables is treated as the outcome variable or dependent variable, generally denoted by the yi ’s
We will then use the other variables for predicting that outcome. As a result, the other variables are referred to as predictors, or independent variables, and are denoted by xi ’s, or features in the machine learning literature.
What is the deterministic part of a univariate simple linear regression made up of?
- Mean expressed as a conditional expectation:
E[Y |X = xi ] = β0 + β1xi ,
- Variance function, expressed as a conditional variance operator:
Var[Y |X = xi ] = σ2, ∀i = 1, . . . , n.
The (unknown) parameters in this model are therefore (β0, β1, σ2).
What are the unknown parameters in the deterministic part of a simple linear regression?
- β0 is the y-intercept of E[Y |X], when X = 0. Thus, we have E[Y |X = 0] = β0.
- β1 is the rate of change of E[Y |X], such that E[Y |X = x + 1] − E[Y |X = x] = β1.
- σ2 is the (conditional) variance of Y, given X. It is strictly positive, σ2 > 0.
What does the stochastic part of a simple linear regression consist of?
Random noise- In general, the observables or observed data, denoted by yi ’s, differ from the expected values of Y , given xi , such that
yi = E[Y |X = xi ] + εi , i = 1, . . . , n,
where the εi ’s are the statistical errors, collectively referred to as additive noise.
The εi ’s are defined as the difference between the observables and the conditional expected values –that is,
εi = yi − E[Y |X = xi ].
Geometrically, the errors correspond to the vertical distances between each yi and its conditional expectation. Note that the error terms are not observable, since they depend on the unknown parameters (β0, β1).