Regression Analysis Flashcards

1
Q

what does covariance measure?

A

—Measures the direction of linear relationship between two (continuous) variables.
—Can be positive or negative
—Positive: as x increases, y tends to increase
—Negative: as x increases, y tends to decrease
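The sign convention above can be checked with a minimal sketch of the sample covariance (the data values are made up for illustration):

```python
def covariance(x, y):
    """Sample covariance: average product of deviations from the means (n - 1 divisor)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

# hypothetical data where y tends to rise with x
print(covariance([1, 2, 3, 4, 5], [2, 4, 5, 4, 5]))  # 1.5 (positive)
```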

2
Q

what does correlation coefficient measure?

A

strength (and direction) of the linear relationship between two variables, X and Y.

—Indicates the degree to which the variation in X is related to the variation in Y.

3
Q

if the correlation coefficient is measured for a population, it is called?

A

ρ

4
Q

if the correlation coefficient is estimated for a sample,

A

—use r; i.e. r estimates ρ

5
Q

describe results from correlation coefficient

A

—Always between -1 and +1.
—ρ =+1: perfect positive linear relationship
—ρ =-1: perfect negative linear relationship
—ρ =0: no linear relationship (could be a different sort of relationship between the variables)
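A minimal sketch of Pearson's r, using the standard formula r = Sxy / √(Sxx·Syy); the exactly linear data are made up to show the ±1 endpoints:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient: r = Sxy / sqrt(Sxx * Syy)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

print(pearson_r([1, 2, 3], [2, 4, 6]))  # 1.0 (perfect positive linear relationship)
print(pearson_r([1, 2, 3], [6, 4, 2]))  # -1.0 (perfect negative linear relationship)
```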

6
Q

what happens if covariance and the correlation coefficient are used with variables that are not continuous or not normally distributed?

A

—r will be “deflated” and underestimates ρ.
—E.g. marketing research often uses 1-5 Likert scales: if rating scales have a small number of categories, the data is not strictly continuous, so r will underestimate ρ

7
Q

covariance and correlation coefficient are appropriate for use with

A

—continuous variables whose distributions have the same shape (e.g. both normally distributed).

8
Q

describe hypothesis test for correlation

1. Hypothesis

  1. Test statistic
  2. Decision rule
  3. Conclusion
A

—H0: ρ=0 (if correlation is zero, then there is no significant linear relationship)
—H1: ρ≠0

9
Q

describe hypothesis test for correlation

  1. Hypothesis

2. Test statistic

  1. Decision rule
  2. Conclusion
A
10
Q

describe hypothesis test for correlation

  1. Hypothesis
  2. Test statistic
  3. Decision rule

4. Conclusion

A

in terms of whether a significant linear relationship exists.

11
Q

describe hypothesis test for correlation

  1. Hypothesis
  2. Test statistic

3. Decision rule

  1. Conclusion
A

—Decision Rule: Compare to a t-distribution with n-2 degrees of freedom
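The usual test statistic for this hypothesis is t = r·√(n−2)/√(1−r²), compared to the t-distribution with n−2 degrees of freedom. A sketch, with a made-up sample (r = 0.6, n = 27):

```python
import math

def corr_t_stat(r, n):
    """t = r * sqrt(n - 2) / sqrt(1 - r^2), with df = n - 2."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

t = corr_t_stat(0.6, 27)  # hypothetical sample: r = 0.6, n = 27, so df = 25
print(round(t, 3))  # 3.75
# compare |t| with the critical value (about 2.06 for alpha = 0.05, df = 25):
# here |t| > 2.06, so we would reject H0 and conclude a significant linear relationship
```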

12
Q

if correlation is positive, conclude that?

A

the greater the increase in A, the greater the increase in B

13
Q

describe moderate to weak relationship of r

A

r is around 0.4-0.5

14
Q

why do we use regression analysis?

A

whether and how (continuous) variables are related to each other
—“Whether” – does the value of one variable have any effects on the values of another?
—“How” – as one variable changes, does another tend to increase or decrease?

15
Q

what data is used in regression analysis?

A

—One continuous response variable (called y - dependent variable, response variable)
—One or more continuous explanatory variables (called x - independent variable, explanatory variable, predictor variable, regressor variable)

16
Q

what regression does

A

—Develops an equation which represents the relationship between the variables.

  • Simple linear regression – straight line relationship between y and x (i.e. one explanatory variable)
  • Multiple linear regression – “straight line” relationship between y and x1, x2, …, xk where we have k explanatory variables
  • Non-linear regression – relationship not a “straight line” (i.e. y is related to some function of x, e.g. log(x))
17
Q

what is the objective of regression?

A

Interested in predicting values of Y when X takes on a specific value
—model relationship through a linear model
—Express random variable Y in terms of random variable X

18
Q

feature of the population/ true regression line

A

—β0 and β1 are constants to be estimated
—εi is a random variable with mean = 0

Yi = β0 + β1xi + εi

—response of particular retail spending to a particular value of disposable income will be in two parts – an expectation (β0+β1x) which reflects the systematic relationship, and a discrepancy (εi) which represents all the other many factors (apart from disposable income) which may affect spending.

19
Q

what is the residual?

A

—Vertical distance between observed point and fitted line is called the residual.
—That is ri=yi-(b0+b1xi)
—ri estimates εi, the error variable

20
Q

how do you determine the values of b0 and b1 that best fit the data?

A

—choose values of slope and intercept which minimise the sum of squared residuals

21
Q

describe the residual sum of squares method

A

—Choose our estimates of slope and intercept to give the smallest residual sum of squares
—Uses calculus to find estimates
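The calculus gives closed-form estimates: b1 = Sxy/Sxx and b0 = ȳ − b1·x̄. A minimal sketch (the data are made up to lie exactly on y = 1 + 2x):

```python
def fit_line(x, y):
    """Least-squares estimates: b1 = Sxy / Sxx, b0 = ybar - b1 * xbar."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1

b0, b1 = fit_line([1, 2, 3, 4], [3, 5, 7, 9])  # points on y = 1 + 2x
print(b0, b1)  # 1.0 2.0
```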

22
Q

residual sum of squares method

how do you estimate slope and intercept?

A

—b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² = Sxy/Sxx
—b0 = ȳ − b1x̄
23
Q

how can we show that residual sum of squares is minimised by solution?

A

—by checking the second-order conditions: the second partial derivatives of the sum of squares with respect to b0 and b1 are positive, so the calculus solution is a minimum
24
Q

what is a residual?

A

ri = yi − ŷi is a residual

—Residuals are observed values of the errors, εi, i=1, 2, …, n.

—The error sum of squares is then SSE = Σ ri²

The procedure gives the “line of best fit” in the sense that the SSE is minimised
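SSE can be computed directly from a fitted line; a sketch with made-up data and a hypothetical fitted line ŷ = 0 + 2x:

```python
def sse(x, y, b0, b1):
    """Sum of squared residuals ri = yi - (b0 + b1 * xi)."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# hypothetical data around the line y = 2x
print(sse([1, 2, 3], [2.1, 3.9, 6.2], 0.0, 2.0))  # ≈ 0.06
```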

25
Q

what is the general rule?

A

should not predict the value of y for a value of X outside our sample range of x (extrapolation).

26
Q

how to assess the model?

A

—If fit is poor, discard the model and fit another
—Different shape, e.g. not a straight line – could mean fitting a quadratic, cubic etc; or fitting something completely different
—Different predictors

27
Q

assessing the model,

in our fitting, we assume

A

—the errors have a particular distribution – that is, ε~N(0,σε²)
—Normal distribution
—Mean = 0
—Constant variance = σε²
—If σε² is small, then small spread of observations around fitted line
—If σε² is large, then observations have wide spread around fitted line
—Errors associated with any two y values are independent

28
Q

how do you test the slope?

  1. Hypothesis
  2. Test Statistic
  3. Decision Rule
  4. Conclusion
A

—H0: β1=constant

HA: β1≠constant

29
Q

how do you test the slope?

  1. Hypothesis

2. Test Statistic

  1. Decision Rule
  2. Conclusion
A
30
Q

how do you test the slope?

  1. Hypothesis
  2. Test Statistic

3. Decision Rule

4. Conclusion

A

—Decision Rule: Compare to a t-distribution with n-2 degrees of freedom
—Conclusion: In terms of whether evidence is sufficient to reject null hypothesis.
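The test statistic for the slope is t = (b1 − constant)/se(b1), with n−2 degrees of freedom. A sketch with hypothetical fitted values (b1 = 1.8, se(b1) = 0.6):

```python
def slope_t(b1, se_b1, beta1_0=0.0):
    """t = (b1 - beta1_0) / se(b1), compared with a t-distribution, df = n - 2."""
    return (b1 - beta1_0) / se_b1

# hypothetical fit: b1 = 1.8 with standard error 0.6, testing H0: beta1 = 0
print(round(slope_t(1.8, 0.6), 3))  # 3.0
```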

31
Q

why do you have to be careful when testing intercept?

A

as it may be outside the prediction range, and so will not have an interpretation.

32
Q

what assumptions do we have for error terms?

A

—Assumptions:
—Error terms are normally distributed
—Error terms have mean of 0, constant variance
—Error terms are independent – observations are independent

33
Q

what does the intercept refer to?

A

what happens to Y when X= 0

that is,
—what happens to gross sales when no money is spent on newspaper advertising

34
Q

how to determine

the strength and significance of association

A

—Measured by R², the coefficient of determination.
—This measures the proportion of variation in Y that is explained by variation in the independent variable X in the regression model

R² = explained variation / total variation = (correlation coefficient)²

eg. 90.42% of the variation in annual sales is explained by variability in the size of the store
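R² can also be computed as 1 − SSE/SST; a sketch with made-up fitted values:

```python
def r_squared(y, yhat):
    """R^2 = 1 - SSE/SST: proportion of variation in y explained by the model."""
    ybar = sum(y) / len(y)
    sse = sum((a - b) ** 2 for a, b in zip(y, yhat))
    sst = sum((a - ybar) ** 2 for a in y)
    return 1 - sse / sst

print(r_squared([3, 5, 7, 9], [3, 5, 7, 9]))   # 1.0: perfect fit
print(r_squared([1, 2, 3], [1, 2, 2]))         # 0.5: half the variation explained
```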

35
Q

features of R2 ?

A

—Will be between 0 and 1; a value close to 1 indicates most of the variation in y is explained by the regression equation


36
Q

you cannot say model fits data well unless,

A

Cannot say model fits data well unless assumptions about errors are met:

  1. —Independence
  2. —Normally distributed
  3. —Zero mean, constant variance

—(Note that zero mean of residuals is ensured by estimation process)
—Examine residuals (estimates of errors) to see if assumptions are met

—Graphical techniques
—Assess normality from histogram, normality plot
—Assess independence and variance from scatterplots of residuals vs fitted values, predictor values, order

37
Q

when plotting residuals,

A

for preference use Standardised residuals (will have standard deviation of 1; if normally distributed, will fit a standard normal distribution).
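A minimal sketch of the simple standardisation (residual divided by the residual standard deviation); note that statistical software often uses leverage-adjusted (internally studentised) residuals instead:

```python
import math

def standardise(resids):
    """Scale residuals to roughly unit standard deviation."""
    n = len(resids)
    mean = sum(resids) / n
    sd = math.sqrt(sum((r - mean) ** 2 for r in resids) / (n - 1))
    return [(r - mean) / sd for r in resids]

# hypothetical residuals
print([round(v, 3) for v in standardise([1, -1, 1, -1])])  # [0.866, -0.866, 0.866, -0.866]
```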

38
Q

how is normality shown?

A

on a normal probability plot, normality is shown by points lying close to a straight line

39
Q

how is normality depicted in histograms?

A

by the bell shape

40
Q

Residuals vs fitted values

How do you indicate independence and constant variation?

A

—Randomness indicates independence; equal spread indicates constant variation

41
Q

Residuals vs order

how do you indicate independence and constant variation?

A

—Randomness indicates independence; equal spread indicates constant variation

42
Q

what does homoscedasticity appear like?

A
43
Q

how does heteroscedasticity appear?

A
44
Q

distinguish between homoscedasticity and heteroscedasticity

A

—If variation is constant (residuals show constant spread around zero), called homoscedastic
—If variation is non-constant (residuals show varying spread around zero), called heteroscedastic

45
Q

independence of error terms

if error terms are correlated over time,

A

—If error terms are correlated over time (or in order of collection/entry) they are said to be autocorrelated or serially correlated.
—If residuals independent, should be no relationship among them
—If residuals related, autocorrelation present – often happens with economic and financial data
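One simple numerical check is the lag-1 autocorrelation of the residuals (near zero suggests independence; the formal test commonly used is Durbin-Watson). A sketch with made-up residuals that alternate in sign:

```python
def lag1_autocorr(r):
    """Sample lag-1 autocorrelation of residuals; values near 0 suggest independence."""
    n = len(r)
    mean = sum(r) / n
    num = sum((r[i] - mean) * (r[i + 1] - mean) for i in range(n - 1))
    den = sum((ri - mean) ** 2 for ri in r)
    return num / den

# hypothetical alternating residuals: strong negative autocorrelation
print(round(lag1_autocorr([1, -1, 1, -1, 1, -1]), 3))  # -0.833
```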

46
Q

when there is an outlier, what steps should be taken?

A
  • —Should investigate further
    • —Might have been a typo (should have been 30,000)
    • —Might not have been appropriate for sample (only 3 months old)

—If all evidence indicates it is valid, should still be included (i.e. don’t just throw out data because it is unusual!)

47
Q

describe influential observations

A

—If an x-value is far away from the mean, far away from other x-observations, called “influential”
—Will have a great impact on where line goes – a small change in response will result in a big change in fitted line (coefficients estimated)

48
Q

what should you do when influential observations are present

A

—Should be checked for validity, accuracy etc

49
Q

when can you assume model fits data well?

A

—High R-sq, small std error of estimate
—All assumptions appear valid

50
Q

what should you do when the model fits the data well?

A

—May want to use model to predict values of response for given values of predictor.
—Remember: predictions should only be made for values of x within or not too far from the upper and lower observed x limits.
—e.g. substitute values for X in the equation to yield predicted Y values

51
Q

describe the two types of confidence intervals

A

—A prediction interval for a single observation of y (an interval within which we expect to find single observations of the response)

  • the further away from the x average we are predicting, the wider our prediction interval will be.

—A confidence interval for the expected value of y (an interval within which we expect to find the average response)

  • CI is narrower than PI for the same value of x
52
Q

describe multiple regressions

A

two or more independent variables are used to predict the value of the dependent variable

Example: Are consumers’ perceptions of quality determined by the perceptions of prices, brand image and brand attributes

53
Q

Multiple regressions

describe additive effects

A

—Combined effects of X1 and X2 are additive – if both X1 and X2 are increased by one unit, the expected change in Y would be (β1 + β2).
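The additive effect can be seen directly by plugging values into a hypothetical fitted equation ŷ = 10 + 2·X1 + 3·X2 (all coefficients made up):

```python
def yhat(x1, x2):
    """Hypothetical fitted model: yhat = 10 + 2*x1 + 3*x2."""
    return 10 + 2 * x1 + 3 * x2

# raising both predictors by one unit changes yhat by b1 + b2 = 2 + 3 = 5
print(yhat(4, 6) - yhat(3, 5))  # 5
```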

54
Q

Multiple regressions

For least squares solution, we can find solution only if

A

we can only find a solution if

  • —Number of predictors is less than number of observations
  • —None of the independent variables are perfectly correlated with each other
55
Q

Describe strength of association (R2) for multiple regression

A

—coefficient of multiple determination

  • Will go up as we add more explanatory terms to the model whether they are “important” or not
  • —Often we use “adjusted R2” – adjusts for the number of independent variables (and sample size)
  • —So, if comparing models with differing numbers of predictors, use “adjusted R2” to compare how much variation in response is explained by model.
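The standard adjustment is 1 − (1−R²)(n−1)/(n−k−1), which penalises extra predictors; a sketch with a hypothetical fit (R² = 0.90, n = 25, k = 3 predictors):

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# hypothetical model: R^2 = 0.90 with n = 25 observations and k = 3 predictors
print(round(adjusted_r2(0.90, 25, 3), 4))  # 0.8857
```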
56
Q

multiple regression,

when significance testing what can we test?

A

—Can test two different things
1. Significance of the overall regression
2. Significance of specific partial regression coefficients.

57
Q

multiple regression,

when significance testing

1. hypothesis

  1. Test statistic

3. Decision Rule

4. Conclusion

A

—H0: β1= β2= β3=…= βk=0 (no linear relationship between dependent variable and independent variables)
—HA: not all slopes = 0
(at least one of the independent variables is related to the response)

—Test Statistic: Found in Minitab’s “ANOVA” table
—Decision Rule: Compared to an F-distribution with k, (n-k-1) degrees of freedom.
—If H0 is rejected, one or more slopes are not zero. Additional tests are needed to determine which slopes are significant.
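One common equivalent form of the overall F statistic is computed from R² alone: F = (R²/k) / ((1−R²)/(n−k−1)), with k and n−k−1 degrees of freedom. A sketch with hypothetical values (R² = 0.90, n = 25, k = 3):

```python
def overall_f(r2, n, k):
    """F = (R^2 / k) / ((1 - R^2) / (n - k - 1)), with df = k and n - k - 1."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))

# hypothetical model: R^2 = 0.90, n = 25, k = 3 -> compare with F(3, 21)
print(round(overall_f(0.90, 25, 3), 1))  # 63.0
```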

58
Q

Significance of specific partial regression coefficients

  1. hypothesis
  2. Testing statistic

3. decision rule

4. conclusion

A

—Decision Rule: Compared to a t-distribution with (n-k-1) degrees of freedom (i.e. residual d.f. from ANOVA table) [k is the number of predictors being fitted.]

—If H0 is rejected, the slope of the ith variable is significantly different from zero. That is, once the other variables are considered, the ith predictor has a significant linear relationship with the response.

59
Q

Significance of specific partial regression coefficients

1. hypothesis

  1. Testing statistic
  2. decision rule
  3. conclusion
A

—H0: βi=0
—HA: βi≠0

60
Q

what assumptions are made for residuals?

A

—Assumptions made (LINE):

Linearity: relationship between variables is linear

Independence of errors: errors are independent of one another

Normality: errors (εi) are normally distributed at each value of X. Regression analysis is robust against departures from the normality assumption

Equal variance (homoscedasticity): variance of the errors (εi) is constant for all values of X. Variability of Y values is the same when X is a low value as when X is high

Errors have mean 0

61
Q

what is the definition of residual

A

A residual (the observed counterpart of the error term) is the difference between the observed response value Yi and the value predicted by the regression equation, Ŷi

—(Vertical distance between point and line/plane.)

62
Q

Residuals

—Error terms normally distributed
—Error terms have mean 0, constant variance
—Error terms are independent

A

—Can be checked by looking at a histogram of the residuals - look for bell-shaped distribution.
—Also normal probability plot – look for straight line.
—For preference, use standardised residuals – have a std dev of 1.

63
Q

Residuals

—Error terms normally distributed
—Error terms have mean 0, constant variance
—Error terms are independent

A

Checked by using plots of

  • residuals vs predicted values
  • residuals vs independent variables.

—Look for random scatter of points around zero.
—If not, (esp res vs indep), may indicate linear regression is not appropriate – may need to transform data (see tutorial)

64
Q

Residuals

—Error terms normally distributed
—Error terms have mean 0, constant variance
—Error terms are independent

A

—Check in previous plots; also in residuals vs time/order.
—Look for random scatter of residuals.

65
Q

what model does a polynomial model fit?

A

X1 = x

X2 = x²

X3 = x³
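In other words, a polynomial model is fitted as a multiple regression on powers of the one predictor. A sketch of building those columns:

```python
def poly_features(x, degree=3):
    """Expand a single predictor x into columns x, x^2, ..., x^degree."""
    return [[xi ** d for d in range(1, degree + 1)] for xi in x]

# hypothetical predictor values 2 and 3 -> rows of (x, x^2, x^3)
print(poly_features([2, 3]))  # [[2, 4, 8], [3, 9, 27]]
```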

66
Q

Polynomial regression

<!--StartFragment-->

Interaction term<!--EndFragment-->

A

—This is needed if the level of X1 affects the relationship between X2 and Y.

67
Q

purpose of regression analysis

A

develop model to predict values of a numerical variable, based on value of other variables

68
Q

describe simple linear regression

A

a single numerical independent variable, X, is used to predict the numerical dependent variable Y

69
Q

the simplest relationship between two variables is known as

A

a linear relationship or straight-line relationship

70
Q

what is the simple linear regression model and what does each symbol represent?

A

Yi = β0 + β1Xi + εi

β0 = Y intercept for population (mean value of Y when X= 0)

β1 = slope for population (change in Y per unit change in X)

εi = random error in Y for each observation i (vertical distance of actual value of Yi above or below the expected value of Yi on the line)

Yi = dependent variable (response variable) for observation i

Xi = independent variable (predictor/explanatory variable) for observation i

71
Q

list 6 possible relationships found in scatterplots

A
  1. positive linear relationship
  2. negative linear relationship
  3. positive curvilinear relationship
  4. negative curvilinear relationship
  5. U shaped curvilinear relationship
  6. No relationship
72
Q

what is the simple linear regression equation: the prediction line, and why is it used?

A

ŷi= b0 + b1Xi

population parameters in practice are estimated

ŷi= predicted value of Y for observation i

  • Xi = value of X for observation i
  • b0 = sample Y intercept
  • b1 = sample slope
73
Q

how do you determine two regression coefficients b0 and b1?

A

by using least squares estimation

minimises the sum of the squared differences between the actual values Yi and the predicted values Ŷi from the simple linear regression equation

74
Q

least squares method/solution produces

A

the line that fits the data with the minimum amount of prediction error

provides line of best fit so SSE is minimised

75
Q

what does standard error of the estimate show?

A

measures variability of observed Y values from the predicted Y values

standard deviation around the prediction line
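For simple linear regression the standard error of the estimate is s_e = √(SSE/(n−2)); a sketch with made-up observed and predicted values:

```python
import math

def std_error_estimate(y, yhat):
    """s_e = sqrt(SSE / (n - 2)): typical distance of observed y from the fitted line."""
    sse = sum((a - b) ** 2 for a, b in zip(y, yhat))
    return math.sqrt(sse / (len(y) - 2))

# hypothetical observed vs predicted values
print(round(std_error_estimate([2, 4, 5, 7], [2.5, 3.5, 5.5, 6.5]), 3))  # 0.707
```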

76
Q

when there is autocorrelation,

A

there is a pattern in the residuals. This can put the validity of the regression model in serious doubt because it violates the independence of errors assumption

eg. after plotting the residuals, if they fluctuate up and down in a cyclical pattern, there is a high chance autocorrelation exists, violating the independence of errors assumption

77
Q

regression coefficients in multiple regression are called…. why?

A

net regression coefficients. They estimate the predicted change in Y per unit change in a particular X, holding constant the effect of the other X variables

78
Q

what is a dummy variable regression?

A

—To include a categorical variable in a regression model, use a dummy variable (converts the categorical variable to a numerical variable)
—Recodes categories of a categorical variable using the numerical values 0 and 1
—0 is assigned to absence of a characteristic; 1 is assigned to presence of the characteristic
—e.g. X2 = 0 if the house does not have a fireplace
—X2 = 1 if the house does have a fireplace (substituted in model)
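A sketch of the 0/1 recoding, using the fireplace example (the house data are made up):

```python
def dummy(value, present="yes"):
    """Code a two-level categorical variable as 1 (present) or 0 (absent)."""
    return 1 if value == present else 0

houses = [{"size": 120, "fireplace": "yes"}, {"size": 95, "fireplace": "no"}]
x2 = [dummy(h["fireplace"]) for h in houses]
print(x2)  # [1, 0]
```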

79
Q

describe interactions in multiregression models

A

Interaction occurs if the effect of an independent variable on the dependent variable changes according to the value of a second independent variable – an interaction between the two independent variables

eg. advertising has large effect on sales of product when price of product is low

80
Q

when there is interaction, what should you do?

A

use an interaction term (cross-product term) to model an interaction effect in a regression model. Then assess whether the interaction variable makes a significant contribution to the regression model. If significant, cannot use the original regression model for prediction

X3 = X1 × X2

eg. Size × FireplaceCoded
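The interaction column is just the element-wise product of the two predictors; a sketch with made-up (size, fireplace-coded) rows:

```python
# hypothetical rows of (size, fireplace_coded); the interaction column is X3 = X1 * X2
rows = [(120, 1), (95, 0), (150, 1)]
x3 = [size * fp for size, fp in rows]
print(x3)  # [120, 0, 150]
```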

81
Q

When the assumptions about residuals are violated, then?

A

The violation of assumptions means that the regression is invalid and should not be used for prediction or further analysis.