PREFI LEC 1: REGRESSION ANALYSIS Flashcards
✔ A form of PREDICTIVE MODELING TECHNIQUE which investigates
the relationship between a DEPENDENT (target) and
INDEPENDENT variable (predictor).
REGRESSION ANALYSIS
Father of regression analysis
Carl Friedrich Gauss (1777-1855)
first person who used the term
regression
Francis Galton (1877)
graphical representation of the relation between two or more variables.
- with two variables x and y, each point on the plot is an x-y pair.
A. GRAPH PLOT
B. REGRESSION PLOT
C. SCATTER PLOT
Scatter plot
We use _________ and ___________ to describe the variation in
one or more variables.
A. REGRESSION; CORRELATION
B. CORRELATION; REGRESSION
regression; correlation
The _______ is the SUM of the squared deviations
of a variable from its mean.
Variation
The variation is the numerator of the _______ of a
sample
Variance
Both the variation and the variance are ____________________ of a sample.
measures of the dispersion
The ___________between two random variables is a statistical measure of the DEGREE TO WHICH THE 2 VARIABLES MOVE TOGETHER.
- captures how one variable is different from its mean as the other variable is different from its mean.
- is calculated as the RATIO OF THE COVARIATION TO the SAMPLE SIZE LESS ONE
- actual value is NOT MEANINGFUL because it is AFFECTED BY THE SCALE of 2 VARIABLES. That is why we calculate the correlation coefficient – to make something interpretable from the covariance information.
covariance
- indicates that the variables TEND TO MOVE TOGETHER
POSITIVE COVARIANCE
- indicates that the variables tend to move in
OPPOSITE DIRECTIONS.
NEGATIVE COVARIANCE
is a measure of the STRENGTH OF THE RELATIONSHIP between or among variables.
correlation coefficient (r)
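The covariance and correlation cards above can be sketched numerically. This is a minimal Python illustration with made-up numbers (the data are hypothetical); it shows why the covariance's actual value is not meaningful on its own: rescaling a variable changes the covariance, but the correlation coefficient is unaffected.

```python
import numpy as np

# Hypothetical sample data: two variables that tend to move together.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])
n = len(x)

# Sample covariance: the ratio of the covariation to the sample size less one.
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# Correlation coefficient: covariance rescaled by the standard deviations,
# which removes the effect of the variables' units.
r = cov_xy / (x.std(ddof=1) * y.std(ddof=1))

# Rescaling x (e.g., cents instead of dollars) inflates the covariance
# but leaves the correlation coefficient unchanged.
x_cents = 100 * x
cov_scaled = np.sum((x_cents - x_cents.mean()) * (y - y.mean())) / (n - 1)
```

A positive `cov_xy` matches the POSITIVE COVARIANCE card: the variables tend to move together.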
is an EXTREME VALUE of a variable.
- may be quite large or small (where large and small are defined relative to the rest of the sample).
-may affect the sample statistics, such as a correlation coefficient.
- may result in spurious correlation.
OUTLIER
is the appearance of a relationship when in fact there is no relation.
Spurious correlation
The correlation coefficient DOES NOT INDICATE A CAUSAL RELATIONSHIP. Certain data items may be highly correlated, but not necessarily a result of a causal relationship.
T or F?
T
- is the analysis of the relation between one variable and some other variable(s), assuming a linear relation.
- Also referred to as LEAST SQUARES REGRESSION
and ORDINARY LEAST SQUARES (OLS).
a. The purpose is to explain the variation in a variable (that is, how a variable differs from its mean value) using the variation in one or more other variables.
b. Suppose we want to describe, explain, or
predict why a variable differs from its mean.
c. The least squares principle is that the regression line is determined by minimizing the sum of the squares of the vertical distances between the actual Y values and the predicted values of Y. A line is fit through the XY points such that the sum of the squared residuals (that is, the sum of the squared vertical distances between the observations and the line) is minimized.
Regression
- is the variable whose variation is BEING EXPLAINED by the other variable(s).
Also referred to as the
EXPLAINED VARIABLE, the ENDOGENOUS VARIABLE, or the PREDICTED VARIABLE.
DEPENDENT VARIABLE
- is the variable whose variation is used to explain that of the dependent variable.
- Also referred to as the EXPLANATORY VARIABLE , the EXOGENOUS VARIABLE, or the PREDICTING VARIABLE
INDEPENDENT VARIABLE
The parameters in a simple regression
equation are the slope (b1) and the intercept
(b0):
yi = b0 + b1xi + εi
b1, is the change in Y for a given one unit change in X.
- can be positive, negative, or zero
SLOPE
- b0, is the line's intersection with the Y-axis at X = 0
- can be positive, negative, or zero
INTERCEPT
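The slope and intercept cards can be computed directly from the least squares principle. A minimal Python sketch with hypothetical numbers: the slope is the covariation of x and y over the variation of x, and the intercept puts the line through the sample means.

```python
import numpy as np

# Hypothetical data for a simple regression y = b0 + b1*x + ε.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least squares estimates: the slope is the covariation of x and y over
# the variation of x; the intercept forces the line through the means.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# The fitted line minimizes the sum of squared vertical distances
# (residuals); OLS residuals sum to zero when an intercept is included.
y_hat = b0 + b1 * x
residuals = y - y_hat
```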
exists between dependent and independent variable.
LINEAR RELATIONSHIP
if the relation is not linear, it is not possible to transform one or both variables so that there is a linear relation.
true or false?
FALSE, it is POSSIBLE
Linear regression assumes the following:
The independent variable is ____________ with the residuals; that is, the independent variable is not random.
uncorrelated
Linear regression assumes the following:
The expected value of the disturbance term is
ZERO
Linear regression assumes the following:
There is a CONSTANT VARIANCE of the disturbance term; that is, the residuals are homoskedastic. [A violation of this is referred to as ____________.]
HETEROSKEDASTICITY
Linear regression assumes the following:
The residuals are ___________________; that is, the residual or disturbance for one observation is not correlated with that of another observation. [A violation of this is referred to as
AUTOCORRELATION.]
independently distributed
Linear regression assumes the following:
The disturbance term (a.k.a. residual, a.k.a.
error term) is __________________
normally distributed.
is the standard deviation of predicted dependent variable values about the estimated regression line.
- helps us gauge the FIT of the regression line; that is, how well we have described the variation in the dependent variable.
(also referred to as the standard error of the residual or standard error of the regression, and often indicated as se)
- is a measure of how close the estimated values (using the estimated regression), the Ŷ's, are to the actual values, the Y's.
c. The εi's (a.k.a. the disturbance terms; a.k.a. the residuals) are the vertical distances between the observed values of Y and those predicted by the equation, the Ŷ's.
- The εi's are in the same terms (unit of measure) as the Y's (e.g., dollars, pounds, billions)
standard error of the estimate (SEE)
The smaller the standard error, the better the
fit.
T OR F?
T
is the PERCENTAGE OF THE VARIATION in the dependent variable (variation of the Yi's, or the sum of squares total, SST) explained by the independent variable(s).
coefficient of determination (R2)
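The SEE and R² cards can be illustrated together. A minimal sketch with hypothetical data, assuming a simple regression so the SEE uses n − 2 degrees of freedom (two estimated parameters):

```python
import numpy as np

# Hypothetical simple-regression data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sse = np.sum((y - y_hat) ** 2)     # unexplained (residual) sum of squares
sst = np.sum((y - y.mean()) ** 2)  # sum of squares total

# Coefficient of determination: the share of total variation explained.
r_squared = 1.0 - sse / sst

# Standard error of the estimate: residual standard deviation, using
# n - 2 degrees of freedom for a simple regression (two parameters).
n = len(x)
see = np.sqrt(sse / (n - 2))
```

In a simple regression, R² also equals the square of the correlation coefficient r, and a smaller SEE means a better fit.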
is the RANGE OF REGRESSION COEFFICIENT VALUES for a given estimate of the coefficient and a given level of probability.
confidence interval
The __________ is the square root of the ratio of the variance of the regression to the variation in the independent variable
standard error (SE) of the coefficient
Interpretation of coefficients.
- is interpreted as the VALUE of the dependent variable (the Y) if the independent variable (the X) takes on a value of ZERO
ESTIMATE INTERCEPT
Interpretation of coefficients
- is interpreted as the CHANGE in the dependent variable for a given one-unit change in the independent variable.
ESTIMATE SLOPE COEFFICIENT
- is using regression to make predictions about the dependent variable based on average relationships observed in the estimated regression.
FORECASTING
are values of the dependent variable based on the ESTIMATED REGRESSION COEFFICIENTS and a prediction about the values of the independent variables.
PREDICTED VALUES
- is regression analysis with MORE THAN ONE INDEPENDENT VARIABLE
- is identical to that of simple regression analysis except that two or more independent variables are used simultaneously to explain variations in the dependent variable.
y = b0 + b1x1 + b2x2 + b3x3 + b4x4 + ε
- the goal is to MINIMIZE THE SUM OF THE SQUARED ERRORS. Each slope coefficient is estimated while holding the other variables constant.
MULTIPLE REGRESSION
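A minimal multiple-regression sketch in Python with hypothetical simulated data; `numpy.linalg.lstsq` solves the least squares problem (minimizing the sum of squared errors) directly.

```python
import numpy as np

# Hypothetical data generated from y = 1 + 2*x1 - 3*x2 + noise.
rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 3.0 * x2 + rng.normal(scale=0.1, size=n)

# Design matrix: a column of ones for the intercept, then each predictor.
X = np.column_stack([np.ones(n), x1, x2])

# OLS minimizes the sum of squared errors; each slope is estimated
# while holding the other variables constant.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
b0_hat, b1_hat, b2_hat = coef
```

With little noise, the estimated coefficients land close to the true values used to generate the data.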
The assumptions of the multiple regression model are
as follows:
a. A LINEAR RELATIONSHIP EXISTS between dependent and independent variables.
b. The independent variables are UNCORRELATED with the residuals; that is, the independent variable is not random. In addition, there is no
exact linear relation between two or more independent variables. [Note: this is modified slightly from the assumptions of the simple regression model.]
c. The expected value of the disturbance term is ZERO; that is, E(εi) = 0
d. There is a constant variance of the
disturbance term; that is, the disturbance or residual terms are all drawn from a distribution
with an identical variance. In other words, the disturbance terms are homoskedastic. [A violation of this is referred to as heteroskedasticity.]
e. The residuals are independently
distributed; that is, the residual or
disturbance for one observation is not correlated with that of another observation. [A violation of this is referred to as autocorrelation.]
f. The disturbance term (a.k.a. residual, a.k.a. error term) is normally distributed.
g. The residual (a.k.a. disturbance term, a.k.a. error term) is what is not explained by the independent variables.
- are the NUMBER OF INDEPENDENT PIECES of information that are used to estimate the regression parameters. In calculating the regression parameters, we use the following pieces of information:
a. The mean of the dependent variable.
b. The mean of each of the independent variables.
c. Therefore,
if the regression is a simple regression, we use TWO (2) degrees of freedom in estimating the regression line.
if the regression is a multiple regression with four independent variables, we use FIVE (5) degrees of freedom in the estimation of the regression line.
DEGREES OF FREEDOM
is a measure of HOW WELL a set of independent variables, as a group, explain the variation
in the dependent variable.
F - STATISTIC
are qualitative variables that take on
a value of zero or one.
DUMMY VARIABLE
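A quick sketch of a dummy variable in use, with hypothetical data: the 0/1 indicator shifts the regression intercept for one group (the group labels here are made up for illustration).

```python
import numpy as np

# Hypothetical example: a 0/1 dummy variable shifts the regression
# intercept for observations in one group (here, "tech" firms).
sector = ["tech", "energy", "tech", "energy", "tech", "energy"]
d = np.array([1.0 if s == "tech" else 0.0 for s in sector])
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = 0.5 + 2.0 * x + 3.0 * d  # tech observations sit 3 units higher

# Regress y on a constant, x, and the dummy.
X = np.column_stack([np.ones(len(x)), x, d])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)  # recovers (0.5, 2.0, 3.0)
```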
is the situation in which the VARIANCE of the RESIDUALS is NOT CONSTANT across all observations
- An assumption of the regression methodology
is that the sample is drawn from the same
population, and that the variance of residuals
is constant across observations; in other
words, the residuals are homoskedastic.
- is a problem because the
estimators DO NOT HAVE THE SMALLEST POSSIBLE VARIANCE, and therefore the standard errors of
the coefficients would not be correct.
Heteroskedasticity
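A hypothetical illustration of the card above: simulated residuals whose spread grows with the independent variable, violating the constant-variance (homoskedasticity) assumption.

```python
import numpy as np

# Hypothetical heteroskedastic residuals: their standard deviation grows
# with x, so the constant-variance assumption fails.
rng = np.random.default_rng(3)
x = np.linspace(1.0, 10.0, 500)
resid = rng.normal(scale=x)  # spread proportional to x (scale broadcasts)

# Residuals at large x are visibly more dispersed than at small x.
low_spread = resid[x < 5.0].std()
high_spread = resid[x >= 5.0].std()
```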
is the situation in which the residual
terms are CORRELATED WITH ONE ANOTHER. This occurs frequently in TIME-SERIES analysis.
- usually appears in time-series data. If last year's earnings were high, this year's earnings may have a greater probability of being high than being low. This is an example of POSITIVE autocorrelation. When a good year is always followed by a bad year, this is NEGATIVE autocorrelation.
- is a problem because the
estimators DO NOT HAVE THE SMALLEST POSSIBLE VARIANCE, and therefore the standard errors of
the coefficients would not be correct.
Autocorrelation
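Positive autocorrelation can be sketched with a hypothetical AR(1)-style residual series, where each term carries over part of the previous one, as the earnings example describes.

```python
import numpy as np

# Hypothetical positively autocorrelated residual series: each term
# carries over 0.8 of the previous one plus fresh noise.
rng = np.random.default_rng(2)
e = np.zeros(500)
for t in range(1, 500):
    e[t] = 0.8 * e[t - 1] + rng.normal()

# Correlation between the series and its own one-period lag is positive.
lag1_r = np.corrcoef(e[:-1], e[1:])[0, 1]
```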
- is the problem of HIGH CORRELATION
between or among two or more independent variables.
IT IS A PROBLEM BECAUSE:
a. The presence of it can cause DISTORTIONS in the standard error and may lead to problems with significance testing of individual coefficients, and
Estimates are SENSITIVE TO CHANGES in the
sample observations or the model
specification.
b. If there is ___________, we are more likely
to conclude a variable is not important.
c. It is likely present to some degree in most economic models. PERFECT ________ would prohibit us from estimating the regression parameters. The issue then is really one of degree.
Multicollinearity
Form of regression that allows the prediction of
DISCRETE VARIABLES by a mix of continuous and
discrete predictors.
Addresses the same questions that discriminant
function analysis and multiple regression do but
with no distributional assumptions on the predictors
(the predictors do not have to be normally distributed, linearly related, or have equal variance in each group)
LOGISTIC REGRESSION
TYPES OF LOGISTIC REGRESSION
- It is used when the dependent variable is
DICHOTOMOUS
BINARY LOGISTIC REGRESSION
TYPES OF LOGISTIC REGRESSION
- It is used when the dependent or
outcomes variable has MORE THAN TWO CATEGORIES
MULTINOMIAL LOGISTIC REGRESSION
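A minimal binary logistic regression sketch on hypothetical data, fit by gradient descent on the mean negative log-likelihood. This is an illustration of the model form, not a production estimator.

```python
import numpy as np

# Hypothetical dichotomous data: y is more likely 1 when x is large.
rng = np.random.default_rng(1)
x = rng.normal(size=200)
p_true = 1.0 / (1.0 + np.exp(-(0.5 + 2.0 * x)))
y = (rng.uniform(size=200) < p_true).astype(float)

# Fit P(y=1|x) = 1 / (1 + exp(-(b0 + b1*x))) by gradient descent.
b0, b1 = 0.0, 0.0
lr = 0.1
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))
    b0 -= lr * np.mean(p - y)        # gradient w.r.t. the intercept
    b1 -= lr * np.mean((p - y) * x)  # gradient w.r.t. the slope
```

Note that no normality or equal-variance assumption on x was needed, matching the assumption cards below.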
WHEN TO USE LOGISTIC REGRESSION?
o When the dependent variable is NONPARAMETRIC
and we don't have homoscedasticity (the variance of the dependent variable is not constant across values of the independent variable).
o Used when the dependent variable has only 2 LEVELS (Yes/No, Male/Female, Taken/Not Taken)
o If multivariate normality is SUSPECT
o If we DON’T have LINEARITY
ASSUMPTIONS ON LOGISTIC REGRESSION
o No assumptions about the distributions of the
predictor variables
o Predictors do not have to be normally distributed
o Predictors do not have to be linearly related
o Predictors do not have to have equal variance within each group
o There should be a minimum of 20 cases per predictor, with a minimum of 60 total cases.