Regression Analysis, Time Series Analysis Flashcards
How is the correlation coefficient notated?
If measured for a population, called ρ (rho)
If estimated from a sample, use r; i.e. r estimates ρ
What is true of the correlation coefficient?
- Correlation and covariance are appropriate for use with continuous variables whose distributions have the same shape (e.g. both normally distributed)
- If these assumptions are not met, r will be ‘deflated’ and will underestimate ρ.
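A minimal sketch of how r might be computed from a sample, using the usual formula r = Sxy / √(Sxx · Syy). The data values and the function name `pearson_r` are made up for illustration.

```python
import math

def pearson_r(x, y):
    """Sample correlation coefficient r for paired observations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products about the means
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]   # made-up data
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
print(round(r, 4))  # 0.7746
```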
What are the data in a regression analysis?
- One continuous response variable (called y - dependent variable, response variable)
- One or more continuous explanatory variables (called x - independent variable, explanatory variable, predictor variable, regressor variable)
What is true of εi?
Mean of εi will be zero
What does E(Y|X=x) = β0 + β1x mean?
The expected value of Y when X = x is β0 + β1x
What is the true/population regression line?
- Yi = β0 + β1xi + εi
- β0 and β1 are constants to be estimated
- εi is a random variable with mean = 0 if our line is going through the middle of our data
How is the population regression line estimated?
- ŷ = b0 + b1x
- b0 and b1 are estimated values
- ŷ is a fitted value of the response
What is a residual?
vertical distance between observed response and fitted value of response
How are residuals estimated?
ri estimates εi, the error variable
What is SSE?
The error sum of squares
SSE = Σ (i = 1 to n) (yi - ŷi)^2
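The fitted line, residuals and SSE can be sketched together as below, assuming the estimates b0 and b1 come from ordinary least squares. The data and the helper name `fit_line` are made up for illustration.

```python
def fit_line(x, y):
    """Least-squares estimates (b0, b1) for y-hat = b0 + b1*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

x = [1, 2, 3, 4, 5]   # made-up data
y = [2, 4, 5, 4, 5]

b0, b1 = fit_line(x, y)
fitted = [b0 + b1 * xi for xi in x]                 # y-hat for each xi
residuals = [yi - fi for yi, fi in zip(y, fitted)]  # ri = yi - y-hat_i
sse = sum(r ** 2 for r in residuals)                # error sum of squares
print(round(b0, 4), round(b1, 4), round(sse, 4))  # 2.2 0.6 2.4
```

With an intercept in the model, the least-squares residuals sum to zero, which matches the idea that the line goes through the middle of the data.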
What error assumptions do we make in regression analysis?
- In our fitting we assume the errors have a particular distribution - that is, ε ~ N(0, σε^2)
- Normal distribution
- Mean = 0
- Constant variance = σε^2
- Errors associated with any two y values are independent
What is sε?
- sε = standard error of the estimate
- Interpretation: standard deviation of the residuals, i.e. the standard error in predicting Y from the regression equation; best definition: standard deviation around the prediction line
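A small sketch of sε for a simple linear regression, assuming the standard formula sε = √(SSE / (n - 2)). The data and the function name are made up for illustration.

```python
import math

def standard_error_of_estimate(x, y):
    """s = sqrt(SSE / (n - 2)) for a least-squares line through (x, y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    # Two parameters (b0, b1) were estimated, hence n - 2
    return math.sqrt(sse / (n - 2))

x = [1, 2, 3, 4, 5]   # made-up data
y = [2, 4, 5, 4, 5]
s = standard_error_of_estimate(x, y)
print(round(s, 4))  # 0.8944
```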
What are the t stats in regression analysis output?
T = test statistic for the null hypothesis that the population intercept/slope = 0 against a two-sided alternative; it is compared to a t distribution with n-2 degrees of freedom. If this gives P ≈ 0, conclude the intercept/slope is not 0
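As a rough sketch, the t statistics for a simple linear regression are each estimate divided by its standard error. The standard-error formulas below are the usual textbook ones; the data are made up. The P-value would come from a t distribution with n-2 degrees of freedom, which is not computed here.

```python
import math

x = [1, 2, 3, 4, 5]   # made-up data
y = [2, 4, 5, 4, 5]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = mean_y - b1 * mean_x

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))          # standard error of the estimate

se_b1 = s / math.sqrt(sxx)                           # standard error of slope
se_b0 = s * math.sqrt(1 / n + mean_x ** 2 / sxx)     # standard error of intercept

t_slope = b1 / se_b1          # tests H0: slope = 0
t_intercept = b0 / se_b0      # tests H0: intercept = 0
print(round(t_slope, 3), round(t_intercept, 3))  # 2.121 2.345
```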
What is S in regression analysis output?
Standard Error of the Regression (S) = roughly the average distance that observed values fall from the regression line
What is R^2?
- Measures the strength of the association
- coefficient of determination
- measures proportion of total variation explained, i.e. R^2 = explained variation / total variation = SSreg / SSy = (correlation coefficient)^2
- Will be between 0 and 1; a value close to 1 indicates most of the variation in y is explained by the regression equation
What is important about R?
r = ±√(R^2); in simple linear regression the sign of r matches the sign of the slope b1
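A minimal sketch of R^2 and its link to r for a simple linear regression, assuming a least-squares fit. The data are made up, and the name `ss_total` stands in for SSy.

```python
import math

x = [1, 2, 3, 4, 5]   # made-up data
y = [2, 4, 5, 4, 5]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = mean_y - b1 * mean_x

ss_total = sum((yi - mean_y) ** 2 for yi in y)                   # SSy
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))    # unexplained
ss_reg = ss_total - sse                                          # explained

r_squared = ss_reg / ss_total
r = math.copysign(math.sqrt(r_squared), b1)   # r takes the sign of the slope
print(round(r_squared, 4), round(r, 4))  # 0.6 0.7746
```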
What is Homoscedasticity?
If variation is constant (residuals show constant spread around zero), called homoscedastic
What is Heteroscedasticity?
If variation is non-constant (residuals show varying spread around zero), called heteroscedastic
What is true about Large Standardised Residuals?
Minitab flags large standardised residuals with an “R”. About 5% of observations should be flagged; a rate near this is consistent with normally distributed residuals
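A sketch of how standardised residuals could be computed and flagged by hand for a simple linear regression. It assumes the usual formula ri / (s · √(1 - hi)) with leverage hi = 1/n + (xi - x̄)^2 / Sxx, and assumes a cut-off of |standardised residual| > 2 for flagging. The data are made up.

```python
import math

x = [1, 2, 3, 4, 5]   # made-up data
y = [2, 4, 5, 4, 5]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = mean_y - b1 * mean_x

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
s = math.sqrt(sum(r ** 2 for r in residuals) / (n - 2))

# Leverage of each observation, then the standardised residual
leverage = [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]
standardised = [r / (s * math.sqrt(1 - h)) for r, h in zip(residuals, leverage)]

# Indices of observations beyond the assumed cut-off of 2
flagged = [i for i, z in enumerate(standardised) if abs(z) > 2]
print(flagged)  # []
```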
What must be true to make predictions from a regression analysis?
- High R-sq, small std error of estimate
- All assumptions appear valid
- Predictions should only be made for values inside the observed limits
What does β1 represent in a multiple regression with 2 predictors?
β1 represents the expected change in Y when X1 is increased by one unit, but X2 is held constant or otherwise controlled
What is meant by additive effects of multiple regression?
Combined effects of X1 and X2 are additive - if both X1 and X2 are increased by one unit, expected change in Y would be ( β1 + β2 )
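A quick arithmetic check of these two cards, using hypothetical coefficient values (b0 = 1.0, b1 = 2.0, b2 = -0.5) chosen only for illustration.

```python
# Hypothetical fitted coefficients for Y = b0 + b1*X1 + b2*X2
B0, B1, B2 = 1.0, 2.0, -0.5

def predict(x1, x2):
    """Expected Y for given predictor values."""
    return B0 + B1 * x1 + B2 * x2

# Increase X1 by one unit while X2 is held constant: change equals b1
change_x1_only = predict(4, 7) - predict(3, 7)

# Increase both X1 and X2 by one unit: change equals b1 + b2
change_both = predict(4, 8) - predict(3, 7)

print(change_x1_only, change_both)  # 2.0 1.5
```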
What must be true for us to find a Least Squares solution for a multiple regression?
- Number of predictors is less than number of observations
- None of the independent variables are perfectly correlated with each other
What is true of the coefficient of multiple determination?
- Will go up as we add more explanatory terms to the model whether they are important or not
- Often we use adjusted R-sq - compensates for adding more variables, so it is lower than R-Sq when variables are not “important”
- So, if comparing models with differing numbers of predictors, use Adjusted R-Sq to compare how much variation in response is explained by model
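A minimal sketch of the adjustment, assuming the common formula adjusted R-sq = 1 - (1 - R-sq)(n - 1)/(n - k - 1), where n is the number of observations and k the number of predictors. The R-sq values below are made up to show the effect.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R-sq for n observations and k predictors."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

# Made-up example: adding a second predictor nudges R-sq from 0.60 to 0.61
one_predictor = adjusted_r_squared(0.60, n=5, k=1)
two_predictors = adjusted_r_squared(0.61, n=5, k=2)

print(round(one_predictor, 3), round(two_predictors, 3))  # 0.467 0.22
```

Here R-sq rises slightly with the extra predictor, but adjusted R-sq falls, suggesting the added variable is not pulling its weight.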
What are the rules of dummy variable regression?
- Can code any discrete variable with k categories into (k-1) distinct dummy variables
- Usually only used when variables have 2 (sometimes 3) categories/levels
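A sketch of the (k-1) coding rule. The function name `dummy_code` and the choice of the alphabetically first category as the baseline are assumptions made for this example; any category could serve as the baseline.

```python
def dummy_code(values):
    """Code a categorical variable with k levels into k-1 dummy columns.

    The first level in sorted order is the baseline (all dummies = 0).
    """
    levels = sorted(set(values))
    baseline, others = levels[0], levels[1:]
    rows = [[1 if v == level else 0 for level in others] for v in values]
    return baseline, others, rows

colours = ["red", "green", "blue", "green"]   # made-up data, k = 3 levels
baseline, dummy_names, rows = dummy_code(colours)

print(baseline)     # blue
print(dummy_names)  # ['green', 'red']
print(rows)         # [[0, 1], [1, 0], [0, 0], [1, 0]]
```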
What is a polynomial regression?
- Y = β0 + β1X + β2X^2 + β3X^3 + … + ε
- Equivalent to fitting a multiple regression where
- X1 = x
- X2 = x^2
- Xk = x^k
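The equivalence can be sketched by building the predictor columns X1, ..., Xk from a single x. The function name `polynomial_columns` is made up; the resulting columns would then be passed to whatever multiple regression routine is in use.

```python
def polynomial_columns(x, degree):
    """Return the columns [x, x^2, ..., x^degree] as lists."""
    return [[xi ** k for xi in x] for k in range(1, degree + 1)]

x = [1, 2, 3]   # made-up data
x1, x2, x3 = polynomial_columns(x, 3)

print(x1)  # [1, 2, 3]
print(x2)  # [1, 4, 9]
print(x3)  # [1, 8, 27]
```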
What is completeness and interactions terms in polynomial regression?
- Called “complete” if all lower order terms of x are included
- If only had x and x^3 would be incomplete, third order polynomial regression
- Interaction Term
- This is needed if the level of X1 affects the relationship between X2 and Y
- e.g. Second order model with interaction
- Y = β0 + β1X1 + β2X1^2 + β3X2 + β4X2^2 + β5X1X2 + ε
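A small numeric sketch of what the interaction term does in the second order model above. The coefficient values are hypothetical, chosen only so the arithmetic is easy to follow.

```python
# Hypothetical coefficients for
# Y = b0 + b1*X1 + b2*X1^2 + b3*X2 + b4*X2^2 + b5*X1*X2
B0, B1, B2, B3, B4, B5 = 1.0, 2.0, 0.5, 1.0, 0.25, 0.5

def expected_y(x1, x2):
    return (B0 + B1 * x1 + B2 * x1 ** 2
            + B3 * x2 + B4 * x2 ** 2 + B5 * x1 * x2)

# Effect of moving X2 from 1 to 2 at two different levels of X1
effect_at_x1_0 = expected_y(0, 2) - expected_y(0, 1)
effect_at_x1_3 = expected_y(3, 2) - expected_y(3, 1)

print(effect_at_x1_0, effect_at_x1_3)  # 1.75 3.25
```

The two effects differ by b5 × 3 = 1.5, so the relationship between X2 and Y depends on the level of X1. With b5 = 0 they would be equal.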
What is overparameterisation?
- Polynomial regression
- Because we’re fitting so many predictors (parameters) to so few observations, the regression may fit to data too well
- Meaning that it might not predict the population accurately
- Model doesn’t generalise
- High R-sq
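An extreme illustration, assuming made-up data: a degree-4 polynomial has five parameters, so it can pass exactly through five observations (SSE = 0, R-sq = 1). The sketch uses Lagrange interpolation to evaluate that polynomial, rather than a regression routine, and compares its prediction just outside the data with a straight line's.

```python
def interpolating_polynomial(xs, ys, x_new):
    """Evaluate the degree n-1 polynomial through all n points (Lagrange form)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x_new - xj) / (xi - xj)
        total += term
    return total

def fit_line(xs, ys):
    """Least-squares straight line, returns (b0, b1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((xi - mean_x) ** 2 for xi in xs)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(xs, ys))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1

x = [1, 2, 3, 4, 5]   # made-up data
y = [2, 4, 5, 4, 5]

# The polynomial reproduces every observation exactly
poly_sse = sum((yi - interpolating_polynomial(x, y, xi)) ** 2
               for xi, yi in zip(x, y))

b0, b1 = fit_line(x, y)
line_pred = b0 + b1 * 6                        # straight-line prediction at x = 6
poly_pred = interpolating_polynomial(x, y, 6)  # overparameterised prediction

print(round(line_pred, 2), round(poly_pred, 2))  # 5.8 17.0
```

The perfect fit to the observed y values (all between 2 and 5) comes with a prediction of 17 at x = 6, which is the failure to generalise described above.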
What are the components of a time series?
- Long term trend
- Cyclical variation
- Seasonal variation
- Random variation
What is long term trend?
- Also called secular trend
- Relatively smooth pattern or direction
- Can be linear or non-linear
What is cyclical variation?
- Wave-like pattern describing long term trend apparent over a number of years - cyclical effect
- Recurrence period over 1 year (definition)
- e.g. Business cycles
- Rare to find cyclical patterns that are consistent and predictable
What is seasonal variation?
- Cycles that occur over short repetitive calendar periods
- Duration less than one year (definition)
- “seasonal” may mean 4 seasons, or systematic patterns over a month/week/day
- e.g. restaurant demand features “seasonal” variation throughout the day, monthly traffic volume
What is random variation?
- Irregular, unpredictable changes
- Not caused by other components (trend, cyclical, seasonal variation)
- Often referred to as “noise”
- Can mask the existence of other components
- Exists in all time series
- Goal of most time series analysis is to reduce impact of random variation on forecasting or interpretation
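One common way to dampen short-term fluctuation is a moving average, sketched below. This technique is not named in the cards above; it is offered as an assumption about how the smoothing might be done. The series is made up, with a repeating pattern of length 4 on top of a slow upward trend.

```python
def moving_average(series, window):
    """Simple moving average: mean of each run of `window` consecutive values."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

# Made-up series: pattern of period 4 plus a gentle upward trend
series = [10, 12, 14, 12, 11, 13, 15, 13]

smoothed = moving_average(series, 4)
print(smoothed)  # [12.0, 12.25, 12.5, 12.75, 13.0]
```

Averaging over a full period removes the repeating pattern and leaves the steady rise of 0.25 per step. The same averaging also dampens random variation when it is present.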