PREFI LEC 1: REGRESSION ANALYSIS Flashcards
✔ A form of PREDICTIVE MODELING TECHNIQUE which investigates
the relationship between a DEPENDENT (target) and
INDEPENDENT variable (predictor).
REGRESSION ANALYSIS
Father of regression analysis
Carl Friedrich Gauss (1777-1855)
first person who used the term
regression
Francis Galton (1877)
graphical representation of the relation between two or more variables.
- with two variables x and y, each point on the plot is an x-y pair.
A. GRAPH PLOT
B. REGRESSION PLOT
C. SCATTER PLOT
Scatter plot
We use _________ and ___________ to describe the variation in
one or more variables.
A. REGRESSION; CORRELATION
B. CORRELATION; REGRESSION
regression; correlation
The _______ is the SUM of the squared deviations
of a variable from its mean.
Variation
The variation is the numerator of the _______ of a
sample
Variance
Both the variation and the variance are ____________________ of a sample.
measures of the dispersion
The ___________between two random variables is a statistical measure of the DEGREE TO WHICH THE 2 VARIABLES MOVE TOGETHER.
- captures how one variable is different from its mean as the other variable is different from its mean.
- is calculated as the RATIO OF THE COVARIATION TO the SAMPLE SIZE LESS ONE
- actual value is NOT MEANINGFUL because it is AFFECTED BY THE SCALE of 2 VARIABLES. That is why we calculate the correlation coefficient – to make something interpretable from the covariance information.
covariance
- indicates that the variables TEND TO MOVE TOGETHER
POSITIVE COVARIANCE
- indicates that the variables tend to move in
OPPOSITE DIRECTIONS.
NEGATIVE COVARIANCE
is a measure of the STRENGTH OF THE RELATIONSHIP between or among variables.
correlation coefficient (r)
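The covariance and correlation cards above can be sketched numerically. This is a minimal Python illustration with made-up numbers (the data are hypothetical); it shows why the covariance's actual value is not meaningful on its own: rescaling a variable changes the covariance, but the correlation coefficient is unaffected.

```python
import numpy as np

# Hypothetical sample data: two variables that tend to move together.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])
n = len(x)

# Sample covariance: the ratio of the covariation to the sample size less one.
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# Correlation coefficient: covariance rescaled by the standard deviations,
# which removes the effect of the variables' units.
r = cov_xy / (x.std(ddof=1) * y.std(ddof=1))

# Rescaling x (e.g., cents instead of dollars) inflates the covariance
# but leaves the correlation coefficient unchanged.
x_cents = 100 * x
cov_scaled = np.sum((x_cents - x_cents.mean()) * (y - y.mean())) / (n - 1)
```

A positive `cov_xy` matches the POSITIVE COVARIANCE card: the variables tend to move together.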
is an EXTREME VALUE of a variable.
- may be quite large or small (where large and small are defined relative to the rest of the sample).
-may affect the sample statistics, such as a correlation coefficient.
- may result in spurious correlation.
OUTLIER
is the appearance of a relationship when in fact there is no relation.
Spurious correlation
The correlation coefficient DOES NOT INDICATE A CAUSAL RELATIONSHIP. Certain data items may be highly correlated, but not necessarily a result of a causal relationship.
T or F?
T
- is the analysis of the relation between one variable and some other variable(s), assuming a linear relation.
- Also referred to as LEAST SQUARES REGRESSION
and ORDINARY LEAST SQUARES (OLS).
a. The purpose is to explain the variation in a variable (that is, how a variable differs from its mean value) using the variation in one or more other variables.
b. Suppose we want to describe, explain, or
predict why a variable differs from its mean.
c. The least squares principle is that the regression line is determined by minimizing the sum of the squares of the vertical distances between the actual Y values and the predicted values of Y. A line is fit through the XY points such that the sum of the squared residuals (that is, the sum of the squared vertical distances between the observations and the line) is minimized.
Regression
- is the variable whose variation is BEING EXPLAINED by the other variable(s).
Also referred to as the
EXPLAINED VARIABLE, the ENDOGENOUS VARIABLE, or the PREDICTED VARIABLE.
DEPENDENT VARIABLE
- is the variable whose variation is used to explain that of the dependent variable.
- Also referred to as the EXPLANATORY VARIABLE , the EXOGENOUS VARIABLE, or the PREDICTING VARIABLE
INDEPENDENT VARIABLE
The parameters in a simple regression
equation are the slope (b1) and the intercept
(b0):
yi = b0 + b1xi + εi
b1, is the change in Y for a given one unit change in X.
- can be positive, negative, or zero
SLOPE
- b0, is the line's intersection with the Y-axis at X = 0
- can be positive, negative, or zero
INTERCEPT
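The slope and intercept cards can be computed directly from the least squares principle. A minimal Python sketch with hypothetical numbers: the slope is the covariation of x and y over the variation of x, and the intercept puts the line through the sample means.

```python
import numpy as np

# Hypothetical data for a simple regression y = b0 + b1*x + ε.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least squares estimates: the slope is the covariation of x and y over
# the variation of x; the intercept forces the line through the means.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# The fitted line minimizes the sum of squared vertical distances
# (residuals); OLS residuals sum to zero when an intercept is included.
y_hat = b0 + b1 * x
residuals = y - y_hat
```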
exists between dependent and independent variable.
LINEAR RELATIONSHIP
if the relation is not linear, it is not possible to transform one or both variables so that there is a linear relation.
true or false?
FALSE, it is POSSIBLE
Linear regression assumes the following:
The independent variable is ____________ with the residuals; that is, the independent variable is not random.
uncorrelated
Linear regression assumes the following:
The expected value of the disturbance term is
ZERO
Linear regression assumes the following:
There is a CONSTANT VARIANCE of the disturbance term; that is, the residuals are homoskedastic. [A violation of this is referred to as ____________.]
HETEROSKEDASTICITY
Linear regression assumes the following:
The residuals are ___________________; that is, the residual or disturbance for one observation is not correlated with that of another observation. [A violation of this is referred to as
AUTOCORRELATION.]
independently distributed
Linear regression assumes the following:
The disturbance term (a.k.a. residual, a.k.a.
error term) is __________________
normally distributed.
is the standard deviation of predicted dependent variable values about the estimated regression line.
- helps us gauge the FIT of the regression line; that is, how well we have described the variation in the dependent variable.
(also referred to as the standard error of the residual or standard error of the regression, and often indicated as se)
- is a measure of how close the estimated values (using the estimated regression), the Ŷ's, are to the actual values, the Y's.
c. The εi's (a.k.a. the disturbance terms; a.k.a. the residuals) are the vertical distances between the observed values of Y and those predicted by the equation, the Ŷ's.
- The εi's are in the same terms (unit of measure) as the Y's (e.g., dollars, pounds, billions)
standard error of the estimate (SEE)
The smaller the standard error, the better the
fit.
T OR F?
T
is the PERCENTAGE OF THE VARIATION in the dependent variable (variation of the Yi's, or the sum of squares total, SST) explained by the independent variable(s).
coefficient of determination (R2)
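The SEE and R² cards can be illustrated together. A minimal sketch with hypothetical data, assuming a simple regression so the SEE uses n − 2 degrees of freedom (two estimated parameters):

```python
import numpy as np

# Hypothetical simple-regression data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sse = np.sum((y - y_hat) ** 2)     # unexplained (residual) sum of squares
sst = np.sum((y - y.mean()) ** 2)  # sum of squares total

# Coefficient of determination: the share of total variation explained.
r_squared = 1.0 - sse / sst

# Standard error of the estimate: residual standard deviation, using
# n - 2 degrees of freedom for a simple regression (two parameters).
n = len(x)
see = np.sqrt(sse / (n - 2))
```

In a simple regression, R² also equals the square of the correlation coefficient r, and a smaller SEE means a better fit.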
is the RANGE OF REGRESSION COEFFICIENT VALUES for a given estimate of the coefficient and a given level of probability.
confidence interval
The __________ is the square root of the ratio of the variance of the regression to the variation in the independent variable
standard error (SE) of the coefficient
Interpretation of coefficients.
- is interpreted as the VALUE of the dependent variable (the Y) if the independent variable (the X) takes on a value of ZERO
ESTIMATE INTERCEPT
Interpretation of coefficients
- is interpreted as the CHANGE in the dependent variable for a given one-unit change in the independent variable.
ESTIMATE SLOPE COEFFICIENT
- is using regression to make predictions about the dependent variable based on average relationships observed in the estimated regression.
FORECASTING
are values of the dependent variable based on the ESTIMATED REGRESSION COEFFICIENTS and a prediction about the values of the independent variables.
PREDICTED VALUES
- is regression analysis with MORE THAN ONE INDEPENDENT VARIABLE
- is identical to that of simple regression analysis except that two or more independent variables are used simultaneously to explain variations in the dependent variable.
y = b0 + b1x1 + b2x2 + b3x3 + b4x4 + ε
- the goal is to MINIMIZE THE SUM OF THE SQUARED ERRORS. Each slope coefficient is estimated while holding the other variables constant.
MULTIPLE REGRESSION
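A minimal multiple-regression sketch in Python with hypothetical simulated data; `numpy.linalg.lstsq` solves the least squares problem (minimizing the sum of squared errors) directly.

```python
import numpy as np

# Hypothetical data generated from y = 1 + 2*x1 - 3*x2 + noise.
rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 3.0 * x2 + rng.normal(scale=0.1, size=n)

# Design matrix: a column of ones for the intercept, then each predictor.
X = np.column_stack([np.ones(n), x1, x2])

# OLS minimizes the sum of squared errors; each slope is estimated
# while holding the other variables constant.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
b0_hat, b1_hat, b2_hat = coef
```

With little noise, the estimated coefficients land close to the true values used to generate the data.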
The assumptions of the multiple regression model are
as follows:
a. A LINEAR RELATIONSHIP EXISTS between dependent and independent variables.
b. The independent variables are UNCORRELATED with the residuals; that is, the independent variable is not random. In addition, there is no
exact linear relation between two or more independent variables. [Note: this is modified slightly from the assumptions of the simple regression model.]
c. The expected value of the disturbance term is ZERO; that is, E(εi) = 0
d. There is a constant variance of the
disturbance term; that is, the disturbance or residual terms are all drawn from a distribution
with an identical variance. In other words, the disturbance terms are homoskedastic. [A violation of this is referred to as heteroskedasticity.]
e. The residuals are independently
distributed; that is, the residual or
disturbance for one observation is not correlated with that of another observation. [A violation of this is referred to as autocorrelation.]
f. The disturbance term (a.k.a. residual, a.k.a. error term) is normally distributed.
g. The residual (a.k.a. disturbance term, a.k.a. error term) is what is not explained by the independent variables.
- are the NUMBER OF INDEPENDENT PIECES of information that are used to estimate the regression parameters. In calculating the regression parameters, we use the following pieces of information:
a. The mean of the dependent variable.
b. The mean of each of the independent variables.
c. Therefore,
if the regression is a simple regression, we use TWO (2) degrees of freedom in estimating the regression line.
if the regression is a multiple regression with four independent variables, we use FIVE (5) degrees of freedom in the estimation of the regression line.
DEGREES OF FREEDOM
is a measure of HOW WELL a set of independent variables, as a group, explain the variation
in the dependent variable.
F - STATISTIC
are qualitative variables that take on
a value of zero or one.
DUMMY VARIABLE
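A quick sketch of a dummy variable in use, with hypothetical data: the 0/1 indicator shifts the regression intercept for one group (the group labels here are made up for illustration).

```python
import numpy as np

# Hypothetical example: a 0/1 dummy variable shifts the regression
# intercept for observations in one group (here, "tech" firms).
sector = ["tech", "energy", "tech", "energy", "tech", "energy"]
d = np.array([1.0 if s == "tech" else 0.0 for s in sector])
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = 0.5 + 2.0 * x + 3.0 * d  # tech observations sit 3 units higher

# Regress y on a constant, x, and the dummy.
X = np.column_stack([np.ones(len(x)), x, d])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)  # recovers (0.5, 2.0, 3.0)
```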
is the situation in which the VARIANCE of the RESIDUALS is NOT CONSTANT across all observations
- An assumption of the regression methodology
is that the sample is drawn from the same
population, and that the variance of residuals
is constant across observations; in other
words, the residuals are homoskedastic.
- is a problem because the
estimators DO NOT HAVE THE SMALLEST POSSIBLE VARIANCE, and therefore the standard errors of
the coefficients would not be correct.
Heteroskedasticity
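A hypothetical illustration of the card above: simulated residuals whose spread grows with the independent variable, violating the constant-variance (homoskedasticity) assumption.

```python
import numpy as np

# Hypothetical heteroskedastic residuals: their standard deviation grows
# with x, so the constant-variance assumption fails.
rng = np.random.default_rng(3)
x = np.linspace(1.0, 10.0, 500)
resid = rng.normal(scale=x)  # spread proportional to x (scale broadcasts)

# Residuals at large x are visibly more dispersed than at small x.
low_spread = resid[x < 5.0].std()
high_spread = resid[x >= 5.0].std()
```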
is the situation in which the residual
terms are CORRELATED WITH ONE ANOTHER. This occurs frequently in TIME-SERIES analysis.
- usually appears in time-series data. If last year's earnings were high, this year's earnings may have a greater probability of being high than being low. This is an example of POSITIVE autocorrelation. When a good year is always followed by a bad year, this is NEGATIVE autocorrelation.
- is a problem because the
estimators DO NOT HAVE THE SMALLEST POSSIBLE VARIANCE, and therefore the standard errors of
the coefficients would not be correct.
Autocorrelation
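Positive autocorrelation can be sketched with a hypothetical AR(1)-style residual series, where each term carries over part of the previous one, as the earnings example describes.

```python
import numpy as np

# Hypothetical positively autocorrelated residual series: each term
# carries over 0.8 of the previous one plus fresh noise.
rng = np.random.default_rng(2)
e = np.zeros(500)
for t in range(1, 500):
    e[t] = 0.8 * e[t - 1] + rng.normal()

# Correlation between the series and its own one-period lag is positive.
lag1_r = np.corrcoef(e[:-1], e[1:])[0, 1]
```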
- is the problem of HIGH CORRELATION
between or among two or more independent variables.
IT IS A PROBLEM BECAUSE:
a. The presence of it can cause DISTORTIONS in the standard error and may lead to problems with significance testing of individual coefficients, and
Estimates are SENSITIVE TO CHANGES in the
sample observations or the model
specification.
b. If there is ___________, we are more likely
to conclude a variable is not important.
c. It is likely present to some degree in most economic models. PERFECT ________ would prohibit us from estimating the regression parameters. The issue then is really one of degree.
Multicollinearity
Form of regression that allows the prediction of
DISCRETE VARIABLES by a mix of continuous and
discrete predictors.
Addresses the same questions that discriminant
function analysis and multiple regression do but
with no distributional assumptions on the predictors
(the predictors do not have to be normally distributed, linearly related, or have equal variance in each group)
LOGISTIC REGRESSION
TYPES OF LOGISTIC REGRESSION
- It is used when the dependent variable is
DICHOTOMOUS
BINARY LOGISTIC REGRESSION
TYPES OF LOGISTIC REGRESSION
- It is used when the dependent or
outcomes variable has MORE THAN TWO CATEGORIES
MULTINOMIAL LOGISTIC REGRESSION
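A minimal binary logistic regression sketch on hypothetical data, fit by gradient descent on the mean negative log-likelihood. This is an illustration of the model form, not a production estimator.

```python
import numpy as np

# Hypothetical dichotomous data: y is more likely 1 when x is large.
rng = np.random.default_rng(1)
x = rng.normal(size=200)
p_true = 1.0 / (1.0 + np.exp(-(0.5 + 2.0 * x)))
y = (rng.uniform(size=200) < p_true).astype(float)

# Fit P(y=1|x) = 1 / (1 + exp(-(b0 + b1*x))) by gradient descent.
b0, b1 = 0.0, 0.0
lr = 0.1
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))
    b0 -= lr * np.mean(p - y)        # gradient w.r.t. the intercept
    b1 -= lr * np.mean((p - y) * x)  # gradient w.r.t. the slope
```

Note that no normality or equal-variance assumption on x was needed, matching the assumption cards below.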
WHEN TO USE LOGISTIC REGRESSION?
o When the dependent variable is NONPARAMETRIC
and we don't have homoscedasticity (the variance of the dependent variable is not constant across values of the independent variable).
o Used when the dependent variable has only 2 LEVELS (Yes/No, Male/Female, Taken/Not Taken)
o If multivariate normality is SUSPECT
o If we DON’T have LINEARITY
ASSUMPTIONS ON LOGISTIC REGRESSION
o No assumptions about the distributions of the
predictor variables
o Predictors do not have to be normally distributed
o Predictors do not have to be linearly related
o Predictors do not have to have equal variance within each group
o There should be a minimum of 20 cases per predictor, with a minimum of 60 total cases.