Regression Analysis Flashcards
what does covariance measure?
Measures the direction of linear relationship between two (continuous) variables.
Can be positive or negative
Positive: as x increases, y tends to increase
Negative: as x increases, y tends to decrease
what does correlation coefficient measure?
strength (and direction) of the linear relationship between two variables, X and Y.
Indicates the degree to which the variation in X is related to the variation in Y.
if the correlation coefficient is measured for the population, it is called?
ρ
if the correlation coefficient is estimated from a sample,
use r; i.e. r estimates ρ
describe results from correlation coefficient
Always between -1 and +1.
ρ =+1: perfect positive linear relationship
ρ =-1: perfect negative linear relationship
ρ =0: no linear relationship (could be a different sort of relationship between the variables)
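A quick way to see both quantities on data is with NumPy; the values below are hypothetical:

```python
import numpy as np

# Hypothetical data: as x increases, y tends to increase, so r should be near +1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Sample covariance (direction of the linear relationship)
cov_xy = np.cov(x, y, ddof=1)[0, 1]
# Correlation coefficient r (direction AND strength, always in [-1, +1])
r = np.corrcoef(x, y)[0, 1]
```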
what happens if the covariance and correlation coefficient are used with variables that are not continuous or not normally distributed?
r will be “deflated” and underestimates ρ.
E.g. marketing research often uses 1-5 Likert scales: if rating scales have a small number of categories, the data is not strictly continuous, so r will underestimate ρ
covariance and correlation coefficient are appropriate for use with
continuous variables whose distributions have the same shape (e.g. both normally distributed).
describe hypothesis test for correlation
1. Hypothesis
- Test statistic
- Decision rule
- Conclusion
H0: ρ=0 (if correlation is zero, then there is no significant linear relationship)
H1: ρ≠0
describe hypothesis test for correlation
- Hypothesis
2. Test statistic
- Decision rule
- Conclusion
t = r√((n−2)/(1−r²))
describe hypothesis test for correlation
- Hypothesis
- Test statistic
- Decision rule
4. Conclusion
In terms of whether a significant linear relationship exists.
describe hypothesis test for correlation
- Hypothesis
- Test statistic
3. Decision rule
- Conclusion
Compare the test statistic to a t-distribution with n-2 degrees of freedom
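Putting the four steps together in code (hypothetical values for r and n; SciPy supplies the t-distribution):

```python
import math
from scipy import stats

# Hypothetical sample correlation and sample size
r, n = 0.75, 20

# Test statistic for H0: rho = 0
t_stat = r * math.sqrt((n - 2) / (1 - r**2))
# Two-sided p-value from a t-distribution with n-2 degrees of freedom
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
```

A small p-value leads us to reject H0 and conclude a significant linear relationship exists.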
if correlation is positive, conclude that?
the greater the increase in A, the greater the increase in B
describe moderate to weak relationship of r
r is around 0.4-0.5
why do we use regression analysis?
whether and how (continuous) variables are related to each other
“Whether” – does the value of one variable have any effects on the values of another?
“How” – as one variable changes, does another tend to increase or decrease?
what data is used in regression analysis?
One continuous response variable (called y - dependent variable, response variable)
One or more continuous explanatory variables (called x - independent variable, explanatory variable, predictor variable, regressor variable)
what regression does
Develops an equation which represents the relationship between the variables.
- Simple linear regression* – straight line relationship between y and x (i.e. one explanatory variable)
- Multiple linear regression* – “straight line” relationship between y and x1, x2, .., xk where we have k explanatory variables
- Non-linear regression* – relationship not a “straight line” (i.e. y is related to some function of x, e.g. log(x))
what is the objective of regression?
Interested in predicting values of Y when X takes on a specific value
model relationship through a linear model
Express random variable Y in terms of random variable X
feature of the population/ true regression line
β0 and β1 are constants to be estimated
εi is a random variable with mean = 0
Yi = β0 + β1xi + εi
The response of retail spending to a particular value of disposable income will be in two parts – an expectation (β0+β1x) which reflects the systematic relationship, and a discrepancy (εi) which represents all the other many factors (apart from disposable income) which may affect spending.
what is the residual?
Vertical distance between observed point and fitted line is called the residual.
That is ri=yi-(b0+b1xi)
ri estimates εi, the error variable

how do you determine values of b0 and b1 that best fit the data?
choose values of slope and intercept which minimise the residual sum of squares, SSE = Σ(yi − (b0 + b1xi))²
describe the residual sum of squares method
Choose our estimates of slope and intercept to give the smallest residual sum of squares
Uses calculus to find estimates
residual sum of squares method
how do you estimate slope and intercept?
b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²
b0 = ȳ − b1x̄
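A minimal sketch of the least-squares slope and intercept formulas, using made-up data that lies exactly on y = 1 + 2x:

```python
import numpy as np

# Hypothetical data falling exactly on the line y = 1 + 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

# b1 = sum of (xi - xbar)(yi - ybar) / sum of (xi - xbar)^2
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
# b0 = ybar - b1 * xbar
b0 = y.mean() - b1 * x.mean()
```

Because the data is noise-free, the estimates recover the true slope 2 and intercept 1 exactly.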
how can we show that residual sum of squares is minimised by solution?
Take partial derivatives of SSE with respect to b0 and b1, set them to zero, and solve; the second-order conditions confirm a minimum.
what is a residual?
ri = yi − ŷi is a residual
Residuals are observed values of the errors, εi, i=1, 2, …, n.
The error sum of squares is then SSE = Σri²
The procedure gives the “line of best fit” in the sense that the SSE is minimised

what is the general rule?
we cannot reliably determine the value of Y for a value of X outside our sample range of x (extrapolation).
how to assess the model?
If fit is poor, discard the model and fit another
Different shape, e.g. not a straight line – could mean fitting a quadratic, cubic etc; or fitting something completely different
Different predictors
assessing the model,
in our fitting, we assume
the errors have a particular distribution – that is, ε~N(0,σε²)
Normal distribution
Mean = 0
Constant variance = σε²
If σε² is small, then small spread of observations around fitted line
If σε² is large, then observations have wide spread around fitted line
Errors associated with any two y values are independent
how do you test the slope?
- Hypothesis
- Test Statistic
- Decision Rule
- Conclusion
H0: β1=constant
HA: β1≠constant
how do you test the slope?
- Hypothesis
2. Test Statistic
- Decision Rule
- Conclusion
t = (b1 − β1)/s(b1), where s(b1) is the standard error of the slope
how do you test the slope?
- Hypothesis
- Test Statistic
3. Decision Rule
4. Conclusion
Decision Rule: Compare to a t-distribution with n-2 degrees of freedom
Conclusion: In terms of whether evidence is sufficient to reject null hypothesis.
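The slope test can be sketched in code, with hypothetical values for b1 and its standard error:

```python
from scipy import stats

# Hypothetical estimates: slope, its standard error, sample size
b1, se_b1, n = 2.0, 0.25, 12

# Test statistic for H0: beta1 = 0
t_stat = (b1 - 0.0) / se_b1
# Two-sided p-value from a t-distribution with n-2 degrees of freedom
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
```

Here the large t statistic and tiny p-value would lead us to reject H0 and conclude the slope is significant.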
why do you have to be careful when testing intercept?
as it may be outside the prediction range, and so will not have an interpretation.
what assumptions do we have for error terms?
Assumptions:
Error terms are normally distributed
Error terms have mean of 0, constant variance
Error terms are independent – observations are independent
what does the intercept refer to?
what happens to Y when X= 0
that is,
what happens to gross sales when no money is spent on newspaper advertising
how to determine
the strength and significance of association
Measured by R2 – coefficient of determination.
This measures proportion of variation in Y that is explained by variation in the independent variable X in the regression model
R2 = explained variation / total variation = (correlation coefficient)2
eg. 90.42% of the variation in annual sales is explained by variability in the size of the store
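R² computed directly from its definition, on hypothetical data (y_hat here is the least-squares fit for x = 1..5):

```python
import numpy as np

# Hypothetical observed and fitted values
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])
y_hat = np.array([2.8, 3.4, 4.0, 4.6, 5.2])  # least-squares fit for x = 1..5

# R2 = explained variation / total variation = 1 - SSE/SST
ss_total = np.sum((y - y.mean())**2)
ss_error = np.sum((y - y_hat)**2)
r2 = 1 - ss_error / ss_total
```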
features of R2 ?
Will be between 0 and 1; a value close to 1 indicates most of the variation in y is explained by the regression equation
you cannot say model fits data well unless,
Cannot say model fits data well unless assumptions about errors are met:
- Independence
- Normally distributed
- Zero mean, constant variance
(Note that zero mean of residuals is ensured by estimation process)
Examine residuals (estimates of errors) to see if assumptions are met
Graphical techniques
Assess normality from histogram, normality plot
Assess independence and variance from scatterplots of residuals vs fitted values, predictor values, order
when plotting residuals,
for preference use Standardised residuals (will have standard deviation of 1; if normally distributed, will fit a standard normal distribution).
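Standardising residuals is essentially dividing by their standard deviation (a rough sketch; statistical software often uses a leverage-adjusted version):

```python
import numpy as np

# Hypothetical residuals (least-squares residuals have mean zero)
residuals = np.array([0.5, -1.2, 0.3, 0.9, -0.5])

# Divide by the residual standard deviation so the result has std dev 1
std_resid = residuals / residuals.std(ddof=1)
```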
how is normality shown?

normality shown by points lying close to the straight line on a normal probability plot
how is normality depicted in histograms?
by the bell shape
Residuals vs fitted values
How do you indicate independence and constant variation?

Randomness indicates independence; equal spread indicates constant variation
Residuals vs order
how do you indicate independence and constant variation?

Randomness indicates independence; equal spread indicates constant variation
what does homoscedasticity appear like?
Residuals show a constant spread around zero across all fitted values.
how does heteroscedasticity appear?
Residuals show a changing spread around zero – e.g. fanning out as fitted values increase.
distinguish between homoscedasticity and heteroscedasticity
If variation is constant (residuals show constant spread around zero), called homoscedastic
If variation is non-constant (residuals show varying spread around zero), called heteroscedastic
independence of error terms
if error terms are correlated over time,
If error terms are correlated over time (or in order of collection/entry) they are said to be autocorrelated or serially correlated.
If residuals independent, should be no relationship among them
If residuals related, autocorrelation present – often happens with economic and financial data
when there is an outlier, what steps should be taken?
- Should investigate further
- Might have been a typo (should have been 30,000)
- Might not have been appropriate for sample (only 3 months old)
If all evidence indicates it is valid, should still be included (i.e. don’t just throw out data because it is unusual!)
describe influential observations
If an x-value is far away from the mean, far away from other x-observations, called “influential”
Will have a great impact on where line goes – a small change in response will result in a big change in fitted line (coefficients estimated)
what should you do when influential observations are present
Should be checked for validity, accuracy etc
when can you assume model fits data well?
High R-sq, small std error of estimate
All assumptions appear valid
when the model fits the data well, what should you do?
May want to use model to predict values of response for given values of predictor.
Remember: predictions should only be made for values of x within or not too far from the upper and lower observed x limits.
eg. substitute values for X in the equation to yield Y values
describe the two types of confidence intervals
A prediction interval for a single observation of y (an interval within which we expect single observations of the response)
- further away from the x average we are predicting, the wider our prediction interval will be.
A confidence interval for the expected value of y (an interval within which we expect to find the average response)
- CI is narrower than PI for same value of x
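The two intervals differ only by an extra "1" under the square root, which is why the PI is always wider. A sketch with hypothetical summary values from a fitted simple regression:

```python
import math
from scipy import stats

# Hypothetical summary values: n, mean of x, Sxx, std error of estimate
n, x_bar, s_xx, s = 25, 10.0, 180.0, 2.5
x0 = 12.0                                   # value of x we predict at
t_crit = stats.t.ppf(0.975, df=n - 2)       # 95% critical value

core = 1 / n + (x0 - x_bar) ** 2 / s_xx
ci_half = t_crit * s * math.sqrt(core)      # CI for the mean response
pi_half = t_crit * s * math.sqrt(1 + core)  # PI for a single observation
```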
describe multiple regression
two or more independent variables are used to predict the value of the dependent variable
Example: Are consumers’ perceptions of quality determined by their perceptions of price, brand image and brand attributes?
Multiple regressions
describe additive effects
Combined effects of X1 and X2 are additive – if both X1 and X2 are increased by one unit, expected change in Y would be (β1+β2).
Multiple regressions
For the least squares solution, we can find a solution only if
- Number of predictors is less than number of observations
- None of the independent variables are perfectly correlated with each other
Describe strength of association (R2) for multiple regression
coefficient of multiple determination
- Will go up as we add more explanatory terms to the model whether they are “important” or not
- Often we use “adjusted R2” – adjusts for the number of independent variables
- So, if comparing models with differing numbers of predictors, use “adjusted R2” to compare how much variation in response is explained by model.
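The usual adjustment formula, with hypothetical values (k is the number of predictors):

```python
# Hypothetical values: R-squared, number of observations, number of predictors
r2, n, k = 0.85, 30, 4

# Adjusted R2 penalises adding predictors that explain little extra variation
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
```

Adjusted R² is always at most R², and the gap widens as more predictors are added for a fixed n.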
multiple regression,
when significance testing what can we test?
Can test two different things
1.Significance of the overall regression
2.Significance of specific partial regression coefficients.
multiple regression,
when significance testing
1. Hypothesis
2. Test statistic
3. Decision Rule
4. Conclusion
H0: β1= β2= β3=…= βk=0 (no linear relationship between dependent variable and independent variables)
HA: not all slopes = 0
(at least one of the independent variables is related to sales)
Test Statistic: Found in Minitab’s “ANOVA” table
Decision Rule: Compared to an F-distribution with k, (n-k-1) degrees of freedom.
If H0 is rejected, one or more slopes are not zero. Additional tests are needed to determine which slopes are significant.
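The F statistic is the regression mean square over the error mean square; a sketch with hypothetical ANOVA quantities:

```python
from scipy import stats

# Hypothetical ANOVA quantities: regression SS, error SS, predictors, observations
ss_reg, ss_err = 120.0, 30.0
k, n = 3, 24

# F = MSR / MSE, compared to an F-distribution with k, n-k-1 d.f.
f_stat = (ss_reg / k) / (ss_err / (n - k - 1))
p_value = stats.f.sf(f_stat, k, n - k - 1)
```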
Significance of specific partial regression coefficients
- Hypothesis
- Test statistic
3. Decision rule
4. Conclusion
Decision Rule: Compared to a t-distribution with (n-k-1) degrees of freedom (i.e. residual d.f. from ANOVA table) [k is the number of predictors being fitted.]
If H0 is rejected, the slope of the ith variable is significantly different from zero. That is, once the other variables are considered, the ith predictor has a significant linear relationship with the response.
Significance of specific partial regression coefficients
1. Hypothesis
- Test statistic
- Decision rule
- Conclusion
H0: βi=0
HA: βi≠0
what assumptions are made for residuals?
Assumptions made:
Linearity: relationship between variables is linear
Independence of errors: errors are independent of one another
Normality: errors (εi) are normally distributed at each value of X; regression analysis is robust against departures from the normality assumption
Equal variance (homoscedasticity): variance of the errors (εi) is constant for all values of X – the variability of Y values is the same when X is low as when X is high
Have mean 0
what is the definition of residual
A residual (also called error term) is the difference between the observed response value Yi and the value predicted by the regression equation, ŷi
(Vertical distance between point and line/plane.)
Residuals
Error terms normally distributed
Error terms have mean 0, constant variance
Error terms are independent
Can be checked by looking at a histogram of the residuals - look for bell-shaped distribution.
Also normal probability plot – look for straight line.
For preference, use standardised residuals – have a std dev of 1.
Residuals
Error terms normally distributed
Error terms have mean 0, constant variance
Error terms are independent
Checked by using plots of
- residuals vs predicted values
- residuals vs independent variables.
Look for random scatter of points around zero.
If not, (esp res vs indep), may indicate linear regression is not appropriate – may need to transform data (see tutorial)
Residuals
Error terms normally distributed
Error terms have mean 0, constant variance
Error terms are independent
Check in previous plots; also in residuals vs time/order.
Look for random scatter of residuals.
what model does polynomial regression fit?
Treat powers of x as separate predictors:
X1 = x
X2 = x²
X3 = x³
Y = β0 + β1x + β2x² + β3x³ + ε
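A cubic fitted as a multiple regression on x, x², x³ (hypothetical noise-free data, so the coefficients are recovered essentially exactly):

```python
import numpy as np

# Hypothetical data generated exactly from y = 1 - 2x + x^3
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = x**3 - 2 * x + 1

# Design matrix: intercept column plus X1 = x, X2 = x^2, X3 = x^3
X = np.column_stack([np.ones_like(x), x, x**2, x**3])
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
```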
Polynomial regression
<!--StartFragment-->
Interaction term<!--EndFragment-->
This is needed if the level of X1 affects the relationship between X2 and Y.
purpose of regression analysis
develop model to predict values of a numerical variable, based on value of other variables
describe simple linear regression
single numerical independent variable X is used to predict the numerical dependent variable Y
the simplest relationship between two variables is known as
a linear relationship or straight-line relationship
what is the simple linear regression model and what does each symbol represent?
Yi = β0 + β1Xi + εi
β0 = Y intercept for population (mean value of Y when X= 0)
β1 = slope for population (change in Y per unit change in X)
εi = random error in Y for each observation i (vertical distance of actual value of Yi above or below the expected value of Yi on the line)
Yi = dependent variable (response variable) for observation i
Xi = independent variable (predictor/explanatory variable) for observation i
list 6 possible relationships found in scatterplots
- positive linear relationship
- negative linear relationship
- positive curvilinear relationship
- negative curvilinear relationship
- U-shaped curvilinear relationship
- No relationship
what is the simple linear regression equation (the prediction line) and why is it used?
ŷi= b0 + b1Xi
population parameters in practice are estimated
ŷi= predicted value of Y for observation i
- Xi = value of X for observation i
- b0 = sample Y intercept
- b1 = sample slope
how do you determine two regression coefficients b0 and b1?
by using least squares estimation
minimises the sum of the squared differences between the actual values Yi and predicted values Yhati using simple linear regression equation
least squares method/solution produces
the line that fits the data with the minimum amount of prediction error
provides line of best fit so SSE is minimised
what does standard error of the estimate show?
measures variability of observed Y values from the predicted Y values
standard deviation around the prediction line
when there is autocorrelation,
there is a pattern in the residuals. This can put the validity of the regression model in serious doubt because it violates the independence of errors assumption
eg. if, after plotting the residuals, they fluctuate up and down in a cyclical pattern, there is a high chance autocorrelation exists, violating the independence of errors assumption
regression coefficients in multiple regression are called…. why?
net regression coefficients. They estimate the predicted change in Y per unit change in a particular X, holding constant the effect of the other X variables
what is a dummy variable regression?
To include a categorical variable in a regression model, you use a dummy variable (converts the categorical variable to a numerical variable)
Recodes categories of a categorical variable using the numerical values 0 and 1
0 assigned to absence of a characteristic. 1 assigned to presence of characteristic
X2 = 0 if the house does not have a fireplace
X2 = 1 if the house does have a fireplace (substituted into the model)
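Dummy coding by hand, with a hypothetical list of house listings:

```python
# Hypothetical categorical data: does each house have a fireplace?
has_fireplace = ["yes", "no", "yes", "no"]

# Dummy variable: 1 = characteristic present, 0 = absent
x2 = [1 if v == "yes" else 0 for v in has_fireplace]
```

The resulting 0/1 column can be used in the regression just like any numerical predictor.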
describe interactions in multiple regression models
Interaction occurs if the effect of an independent variable on the dependent variable changes according to the value of a second independent variable. This is an interaction between the two independent variables
eg. advertising has large effect on sales of product when price of product is low
when there is interaction, what should you do?
use an interaction term (cross-product term) to model an interaction effect in a regression model. Then assess whether the interaction variable makes a significant contribution to the regression model. If significant, you cannot use the original regression model for prediction
X3 = X1 × X2
e.g. X3 = size × fireplace (coded)
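Building the cross-product column is a simple element-wise multiplication (hypothetical size and fireplace values):

```python
# Hypothetical predictors: house size (X1) and dummy-coded fireplace (X2)
size = [120.0, 150.0, 200.0]
fireplace = [1, 0, 1]

# Interaction term X3 = X1 * X2, added to the model as another predictor
interaction = [s * f for s, f in zip(size, fireplace)]
```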
When the assumptions about residuals are violated, then?
The violation of assumptions means that the regression is invalid and should not be used for prediction or further analysis.