PRE-FI LEC 1: REGRESSION ANALYSIS Flashcards

1
Q

✔ A form of PREDICTIVE MODELING TECHNIQUE that investigates
the relationship between a DEPENDENT (target) variable and
one or more INDEPENDENT (predictor) variables.

A

REGRESSION ANALYSIS

2
Q

Father of regression analysis

A

Carl Friedrich Gauss (1777-1855)

3
Q

first person to use the term
"regression"

A

Francis Galton (1877)

4
Q

graphical representation of the relation between two or more variables.
- with two variables x and y, each point on the plot is an (x, y) pair.

A. GRAPH PLOT
B. REGRESSION PLOT
C. SCATTER PLOT

A

Scatter plot

5
Q

We use _________ and ___________ to describe the variation in
one or more variables.

A. REGRESSION; CORRELATION
B. CORRELATION; REGRESSION

A

regression; correlation

6
Q

The _______ is the SUM of the squared deviations
of a variable from its mean.

A

Variation

7
Q

The variation is the numerator of the _______ of a
sample

A

Variance

8
Q

Both the variation and the variance are ____________________ of a sample.

A

measures of the dispersion

9
Q

The ___________ between two random variables is a statistical measure of the DEGREE TO WHICH THE 2 VARIABLES MOVE TOGETHER.
- captures how one variable differs from its mean as the other variable differs from its mean.
- is calculated as the RATIO OF THE COVARIATION to the SAMPLE SIZE LESS ONE.
- its actual value is NOT MEANINGFUL because it is AFFECTED BY THE SCALE of the 2 VARIABLES. That is why we calculate the correlation coefficient – to make something interpretable from the covariance information.

A

covariance

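A minimal Python sketch of this card's definition, using invented numbers: the covariation (the sum of cross-deviations) divided by the sample size less one.

```python
# Sample covariance: covariation / (n - 1). Data are invented for illustration.

def sample_covariance(x, y):
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    # Covariation: how each x deviates from its mean as y deviates from its mean.
    covariation = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return covariation / (n - 1)

print(sample_covariance([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))   # 5.0
print(sample_covariance([1, 2, 3], [3, 2, 1]))                # -1.0
```

The sign of the result is what cards 10 and 11 describe: positive when the variables tend to move together, negative when they move in opposite directions.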
10
Q
  • indicates that the variables TEND TO MOVE TOGETHER
A

POSITIVE COVARIANCE

11
Q
  • indicates that the variables tend to move in
    OPPOSITE DIRECTIONS.
A

NEGATIVE COVARIANCE

12
Q

is a measure of the STRENGTH OF THE RELATIONSHIP between or among variables.

A

correlation coefficient (r)

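A sketch in Python with invented data: dividing the covariation by the product of the deviation norms removes the scale of the two variables, which is why r is interpretable where the raw covariance (card 9) is not.

```python
import math

# Pearson correlation coefficient r, computed from scratch.
def correlation(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    # Scaling by sx * sy makes r unit-free and bounded in [-1, +1].
    return cov / (sx * sy)

print(correlation([1, 2, 3], [2, 4, 6]))   # ~ +1.0: perfect positive linear relation
print(correlation([1, 2, 3], [6, 4, 2]))   # ~ -1.0: perfect negative linear relation
```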
13
Q

is an EXTREME VALUE of a variable.
- may be quite large or small (where large and small are defined relative to the rest of the sample).
- may affect the sample statistics, such as a correlation coefficient.
- may result in spurious correlation.

A

OUTLIER

14
Q

is the appearance of a relationship when in fact there is no relation.

A

Spurious correlation

15
Q

The correlation coefficient DOES NOT INDICATE A CAUSAL RELATIONSHIP. Certain data items may be highly correlated, but not necessarily a result of a causal relationship.
T or F?

A

T

16
Q
  • is the analysis of the relation between one variable and some other variable(s), assuming a linear relation.
  • Also referred to as LEAST SQUARES REGRESSION
    and ORDINARY LEAST SQUARES (OLS).
    a. The purpose is to explain the variation in a variable (that is, how a variable differs from its mean value) using the variation in one or more other variables.
    b. Suppose we want to describe, explain, or
    predict why a variable differs from its mean.
    c. The least squares principle is that the
    regression line is determined by minimizing
    the sum of the squares of the vertical distances between the actual Y values and the predicted values of Y. A line is fit through the XY points such that the sum of the squared residuals (that is, the sum of the squared vertical distances between the observations and the line) is minimized.
A

Regression

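For simple regression the least squares principle on this card has a closed form; a minimal sketch with made-up data (the slope is the covariation of x and y over the variation of x):

```python
# Simple OLS: b1 = covariation(x, y) / variation(x), b0 = mean(y) - b1 * mean(x).
# This choice of b0, b1 minimizes the sum of squared vertical distances.

def ols_fit(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = (sum((a - mx) * (b - my) for a, b in zip(x, y))
          / sum((a - mx) ** 2 for a in x))
    b0 = my - b1 * mx
    return b0, b1

x = [1, 2, 3, 4]
y = [3, 5, 7, 9]            # lies exactly on y = 1 + 2x, so residuals are zero
b0, b1 = ols_fit(x, y)
print(b0, b1)               # 1.0 2.0
```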
17
Q
  • is the variable whose variation is BEING EXPLAINED by the other variable(s).
    Also referred to as the
    EXPLAINED VARIABLE, the ENDOGENOUS VARIABLE, or the PREDICTED VARIABLE.
A

DEPENDENT VARIABLE

18
Q
  • is the variable whose variation is used to explain that of the dependent variable.
  • Also referred to as the EXPLANATORY VARIABLE, the EXOGENOUS VARIABLE, or the PREDICTING VARIABLE
A

INDEPENDENT VARIABLE

19
Q

The parameters in a simple regression
equation are the slope (b1) and the intercept
(b0):

A

yi = b0 + b1xi + εi

20
Q

b1, is the change in Y for a given one-unit change in X.
- can be positive, negative, or zero

A

SLOPE

21
Q
  • b0, is the line's intersection with the Y-axis at X = 0
  • can be positive, negative, or zero
A

INTERCEPT

22
Q

exists between dependent and independent variable.

A

LINEAR RELATIONSHIP

23
Q

if the relation is not linear, it is not possible to transform one or both variables so that there is a linear relation.

true or false?

A

FALSE, it is POSSIBLE

24
Q

Linear regression assumes the following:
The independent variable is ____________ with the residuals; that is, the independent variable is not random.

A

uncorrelated

25
Q

Linear regression assumes the following:
The expected value of the disturbance term is

A

ZERO

26
Q

Linear regression assumes the following:
There is a CONSTANT VARIANCE of the disturbance term; that is, the disturbance terms are HOMOSKEDASTIC. [A violation of this is referred to as ________.]

A

heteroskedasticity

27
Q

Linear regression assumes the following:
The residuals are ___________________; that is, the residual or disturbance for one observation is not correlated with that of another observation. [A violation of this is referred to as
AUTOCORRELATION.]

A

independently distributed

28
Q

Linear regression assumes the following:
The disturbance term (a.k.a. residual, a.k.a.
error term) is __________________

A

normally distributed.

29
Q

is the standard deviation of the predicted dependent variable values about the estimated regression line.
- helps us gauge the FIT of the regression line; that is, how well we have described the variation in the dependent variable.
(also referred to as the standard error of the residual or standard error of the regression, and often indicated as se)
- is a measure of how close the estimated values (using the estimated regression), the ŷ's, are to the actual values, the Y's.
- The εi's (a.k.a. the disturbance terms; a.k.a. the residuals) are the vertical distances between the observed values of Y and those predicted by the equation, the ŷ's.
- The εi's are in the same terms (unit of measure) as the Y's (e.g., dollars, pounds, billions).

A

standard error of the estimate (SEE)

30
Q

The smaller the standard error, the better the
fit.

T OR F?

A

T

31
Q

is the PERCENTAGE OF THE VARIATION in the dependent variable (variation of the Yi's, or the sum of squares total, SST) explained by the independent variable(s).

A

coefficient of determination (R2)
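Cards 29 and 31 can be illustrated together; a sketch assuming a simple regression (two estimated parameters, so the SEE divides by n - 2), with invented data:

```python
import math

def see_and_r2(y, y_hat, n_params=2):
    n = len(y)
    my = sum(y) / n
    sse = sum((a - f) ** 2 for a, f in zip(y, y_hat))   # unexplained (squared residuals)
    sst = sum((a - my) ** 2 for a in y)                 # total variation of the Yi's
    see = math.sqrt(sse / (n - n_params))               # standard error of the estimate
    r2 = 1 - sse / sst                                  # share of SST explained
    return see, r2

see, r2 = see_and_r2([3, 5, 7, 9], [3, 5, 7, 9])   # a perfect fit
print(see, r2)                                      # 0.0 1.0
```

The smaller the SEE and the larger R², the better the fit, matching card 30.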

32
Q

is the RANGE OF REGRESSION coefficient values for a given estimated value of the coefficient and a given level of probability.

A

confidence interval

33
Q

The __________ is the square root of the ratio of the variance of the regression to the variation in the independent variable

A

standard error (SE) of the coefficient

34
Q

Interpretation of coefficients.
- is interpreted as the VALUE of the dependent variable (the Y) if the independent variable (the X) takes on a value of ZERO

A

ESTIMATED INTERCEPT

35
Q

Interpretation of coefficients
- is interpreted as the CHANGE in the dependent variable for a given one-unit change in the independent variable.

A

ESTIMATED SLOPE COEFFICIENT

36
Q
  • is the use of regression to make predictions about the dependent variable based on average relationships observed in the estimated regression.
A

FORECASTING

37
Q

are values of the dependent variable based on the ESTIMATED REGRESSION COEFFICIENTS and a prediction about the values of the independent variables.

A

PREDICTED VALUES

38
Q
  • is regression analysis with MORE THAN ONE INDEPENDENT VARIABLE
  • is identical to that of simple regression analysis except that two or more independent variables are used simultaneously to explain variations in the dependent variable.
    y = b0 + b1x1 + b2x2 + b3x3 + b4x4 + ε
  • the goal is to MINIMIZE THE SUM OF THE SQUARED ERRORS. Each slope coefficient is estimated while holding the other variables constant.
A

MULTIPLE REGRESSION
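A sketch of the idea with two invented predictors: minimizing the sum of squared errors leads to the normal equations (XᵀX)b = Xᵀy, solved here with a tiny Gaussian elimination instead of a linear-algebra library.

```python
# Multiple regression via the normal equations; data are made up for illustration.

def solve(A, b):
    # Gauss-Jordan elimination with partial pivoting on the augmented matrix.
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def multiple_regression(X, y):
    # Prepend a column of ones so the first coefficient is the intercept b0.
    X = [[1.0] + row for row in X]
    k = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    return solve(XtX, Xty)

X = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]
y = [1 + 2 * a + 3 * b for a, b in X]     # exact plane y = 1 + 2x1 + 3x2
print(multiple_regression(X, y))          # close to [1.0, 2.0, 3.0]
```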

39
Q

The assumptions of the multiple regression model are
as follows:

A

a. A LINEAR RELATIONSHIP EXISTS between the dependent and independent variables.
b. The independent variables are UNCORRELATED with the residuals; that is, the independent variables are not random. In addition, there is no exact linear relation between two or more independent variables. [Note: this is modified slightly from the assumptions of the simple regression model.]
c. The expected value of the disturbance term is ZERO; that is, E(εi) = 0.
d. There is a constant variance of the disturbance term; that is, the disturbance or residual terms are all drawn from a distribution with an identical variance. In other words, the disturbance terms are HOMOSKEDASTIC. [A violation of this is referred to as heteroskedasticity.]
e. The residuals are independently distributed; that is, the residual or disturbance for one observation is not correlated with that of another observation. [A violation of this is referred to as autocorrelation.]
f. The disturbance term (a.k.a. residual, a.k.a. error term) is normally distributed.
g. The residual (a.k.a. disturbance term, a.k.a. error term) is what is not explained by the independent variables.

40
Q
  • are the NUMBER OF INDEPENDENT PIECES of information that are used to estimate the regression parameters. In calculating the regression parameters, we use the following pieces of information:
    a. The mean of the dependent variable.
    b. The mean of each of the independent variables.
    c. Therefore,
     - if the regression is a simple regression, we use TWO (2) degrees of freedom in estimating the regression line.
     - if the regression is a multiple regression with four independent variables, we use FIVE (5) degrees of freedom in the estimation of the regression line.
A

DEGREES OF FREEDOM

41
Q

is a measure of HOW WELL a set of independent variables, as a group, explain the variation
in the dependent variable.

A

F - STATISTIC

42
Q

are qualitative variables that take on
a value of zero or one.

A

DUMMY VARIABLES
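A minimal sketch (the category names are invented): encoding a qualitative trait as 0/1 so it can enter a regression like any numeric predictor.

```python
# Dummy-encode "is this observation from Q4?" as a 0/1 variable.
quarters = ["Q1", "Q2", "Q3", "Q4", "Q1", "Q3"]
is_q4 = [1 if q == "Q4" else 0 for q in quarters]
print(is_q4)   # [0, 0, 0, 1, 0, 0]
```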

43
Q

is the situation in which the VARIANCE of the RESIDUALS is NOT CONSTANT across all observations
- An assumption of the regression methodology
is that the sample is drawn from the same
population, and that the variance of residuals
is constant across observations; in other
words, the residuals are homoskedastic.
- is a problem because the
estimators DO NOT HAVE THE SMALLEST POSSIBLE VARIANCE, and therefore the standard errors of
the coefficients would not be correct.

A

Heteroskedasticity

44
Q

is the situation in which the residual
terms are CORRELATED WITH ONE ANOTHER. This occurs frequently in TIME-SERIES analysis.
- usually appears in time series data. If last year's earnings were high, this year's earnings may have a greater probability of being high than being low. This is an example of positive
autocorrelation. When a good year is always
followed by a bad year, this is negative
autocorrelation.
- is a problem because the
estimators DO NOT HAVE THE SMALLEST POSSIBLE VARIANCE, and therefore the standard errors of
the coefficients would not be correct.

A

Autocorrelation

45
Q
  • is the problem of HIGH CORRELATION
    between or among two or more independent variables.
    IT IS A PROBLEM BECAUSE:
    a. Its presence can cause DISTORTIONS in the standard errors and may lead to problems with significance testing of individual coefficients, and estimates are SENSITIVE TO CHANGES in the sample observations or the model specification.
    b. If there is ___________, we are more likely
    to conclude a variable is not important.
    c. It is likely present to some degree in most economic models.
  • PERFECT ________ would prohibit us from estimating the regression parameters. The issue then is really one of degree.
A

Multicollinearity

46
Q

  • Form of regression that allows the prediction of
DISCRETE VARIABLES by a mix of continuous and
discrete predictors.
  • Addresses the same questions that discriminant
function analysis and multiple regression do, but
with no distributional assumptions on the predictors
(the predictors do not have to be normally distributed, linearly related, or have equal variance in each group)

A

LOGISTIC REGRESSION
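A sketch of the mechanics with invented, unfitted coefficients: a linear combination of predictors is passed through the logistic (sigmoid) function, which maps any real number into (0, 1) and so can be read as a probability of the discrete outcome.

```python
import math

def sigmoid(z):
    # The logistic function squeezes any real z into the interval (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(x, b0, b1):
    # Probability that the dichotomous outcome equals 1, given predictor x.
    return sigmoid(b0 + b1 * x)

print(sigmoid(0.0))                                 # 0.5, the decision boundary
print(predict_probability(2.0, -1.0, 1.5) > 0.5)    # True: log-odds = -1 + 1.5*2 = 2 > 0
```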

47
Q

TYPES OF LOGISTIC REGRESSION
- It is used when the dependent variable is
DICHOTOMOUS

A

BINARY LOGISTIC REGRESSION

48
Q

TYPES OF LOGISTIC REGRESSION
- It is used when the dependent or
outcomes variable has MORE THAN TWO CATEGORIES

A

MULTINOMIAL LOGISTIC REGRESSION

49
Q

WHEN TO USE LOGISTIC REGRESSION?

A

o When the dependent variable is nonparametric
and we don't have homoscedasticity (the variance is not constant across values of the independent variable).
o Used when the dependent variable has only 2 LEVELS (Yes/No, Male/Female, Taken/Not Taken).
o If multivariate normality is SUSPECT
o If we DON'T have LINEARITY

50
Q

ASSUMPTIONS ON LOGISTIC REGRESSION

A

o No assumptions about the distributions of the
predictor variables
o Predictors do not have to be normally distributed
o Predictors do not have to be linearly related
o Predictors do not have to have equal variance within
each group
o There should be a minimum of 20 cases per
predictor, with a minimum of 60 total cases.