Quant Methods Flashcards
Dependent and independent variables // Graph function
Yi = b0 + b1Xi + ei
- Dependent variable is Yi
- Independent variable is Xi
- Error term is εi
- Coefficients are b0 (intercept) and b1 (slope coefficient)
Scatter plot types
Correlation coefficient (ρ or r) (Formula)
Correlation standardizes covariance by dividing it by the product of the standard deviations: r = Cov(X,Y) / (Sx * Sy)
Perfect positive correlation: +1
Perfect negative correlation: -1
No correlation: 0
Covariance (Formula)
A statistical measure of the degree to which two variables move together: Cov(X,Y) = Σ(Xi - X̄)(Yi - Ȳ) / (n - 1)
(Sample) Standard Deviation Formula
Sx = [Σ(xi - x̄)² / (n - 1)]^(1/2)
Easier with calculator!!
Using calculator for Data Series to get Sx, Sy, r
- Add Data Series: [2nd] + [7]
- View Stats / Results: [2nd] + [8] > LIN [Down arrow]
Does not calculate Covariance!
BUT
Cov = rxy * Sx * Sy
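A minimal Python sketch of the same quantities, using made-up data purely for illustration (the calculator does this job in the exam):

```python
import numpy as np

# Hypothetical data series (x, y) purely for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 1.9, 3.2, 4.8, 5.0])

n = len(x)
sx = x.std(ddof=1)                                            # sample standard deviation Sx
sy = y.std(ddof=1)                                            # sample standard deviation Sy
cov_xy = ((x - x.mean()) * (y - y.mean())).sum() / (n - 1)    # sample covariance
r = cov_xy / (sx * sy)                                        # correlation standardizes covariance

# Consistency check: Cov = rxy * Sx * Sy
assert np.isclose(cov_xy, r * sx * sy)
print(sx, sy, r, cov_xy)
```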
Limitations of correlation analysis
- Correlation coefficient assumes a linear relationship (not parabolic etc.)
- Presence of outliers can be distortive
- Spurious correlation (a correlation that arises by chance, with no real link)
- Correlation does not imply causation (rain in NYC has no effect on London bus routes, although there might be a statistical correlation)
- Correlations without sound basis are suspect
Assumptions underlying simple linear regression
- Linear relationship – might need transformation to make linear
- Independent variable is not random – assume expected values of independent variable are correct
- Expected value of error term is zero
- Variance of error term is same across all observations (homoskedasticity)
- Error terms uncorrelated (no serial/auto correlation) across observations
- Error terms normally distributed
Standard error of the estimate (SEE)
Standard error of the distribution of the errors about the regression line: SEE = [SSE / (n - k - 1)]^(1/2)
The smaller the SEE, the better the fit of the estimated regression line (the tighter the points to the line)
k = # of independent variables (simple regression: 1)
Sum of squared errors (SSE)
UNEXPLAINED: Actual (yi) - Prediction (ŷi), so SSE = Σ(yi - ŷi)²
The estimated regression equation will not predict the values of y exactly, it will only estimate them
A measure of this error is SSE (ŷ is the predicted value)
The coefficient of determination (R2)
Describes the percentage variation in the dependent variable explained by movements in the independent variable
R2 is just r² (the sign of r is lost); add the sign of the slope back when recovering r
R2 = 80% = 0.8 > r = 0.8^(1/2) = 0.89, but the slope is negative (see below), so r = -0.89
ŷ (predicted) = 0.4 - 0.3x > b1 = -0.3
Alternatively: R2 = RSS / TSS (if the same, R2 = 1 > perfect fit)
R2 = 1 - SSE/ TSS (if SSE = 0, R2 = 1 > perfect fit)
Total sum of the squares (TSS)
ACTUAL (yi) - MEAN
Alternatively, TSS = RSS + SSE
Regression sum of the squares (RSS)
EXPLAINED: PREDICTION (^y) - MEAN
Difference between the estimated values for y and the mean value of y
Graphic: Relationship between TSS, RSS and SSE
Relationship between TSS, RSS and SSE
- Using SSE, TSS and RSS to measure the goodness of fit of the estimated regression equation
- The estimated regression equation would be a perfect fit if every value of the dependent variable yi happened to lie on the estimated regression line. This would result in SSE=0 and RSS=TSS
- RSS/TSS is known as the coefficient of determination and is denoted by R2: R2 = RSS / TSS = 1 - SSE / TSS
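A minimal sketch of this decomposition in Python, on hypothetical data, showing that TSS = RSS + SSE and R2 = RSS/TSS come straight from the fitted line:

```python
import numpy as np

# Hypothetical data; b0, b1 are the usual simple OLS estimates
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 1.9, 3.2, 4.8, 5.0])

b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x                       # predicted values

sse = ((y - y_hat) ** 2).sum()            # UNEXPLAINED: actual - prediction
rss = ((y_hat - y.mean()) ** 2).sum()     # EXPLAINED: prediction - mean
tss = ((y - y.mean()) ** 2).sum()         # TOTAL: actual - mean

r_squared = rss / tss                     # equivalently 1 - sse / tss
n, k = len(x), 1
see = (sse / (n - k - 1)) ** 0.5          # standard error of the estimate
print(r_squared, see, np.isclose(tss, rss + sse))
```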
Hypothesis testing on regression parameters
- Confidence interval on b0 and b1: bi ± tcrit * S.E.(bi)
- For a 90% confidence interval, 10% significance, 5% (α/2) in each tail
- More HT in Multiple Regressions
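A minimal sketch of the confidence interval calculation, with hypothetical values for b1, its standard error and the sample size:

```python
from scipy import stats

# Hypothetical regression output
b1, se_b1 = 1.20, 0.35        # slope estimate and its standard error
n, k = 30, 1                  # sample size, number of independent variables

alpha = 0.10                                     # 90% confidence -> 5% (alpha/2) in each tail
t_crit = stats.t.ppf(1 - alpha / 2, df=n - k - 1)
ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)  # b1 +/- tcrit * S.E.
print(ci)
```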
ANOVA tables
- ANOVA stands for ANalysis Of VAriance
- It is a summary table produced by statistical software such as Excel
- Using the ANOVA table, calculate the coefficient of determination
- The global test for the significance of the slope coefficient
- Use of the F-statistic
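A minimal sketch of the arithmetic behind an ANOVA table, using hypothetical sums of squares:

```python
# Hypothetical ANOVA inputs: explained (RSS) and unexplained (SSE) sums of squares
rss, sse = 80.0, 20.0
n, k = 25, 1                  # observations, independent variables
tss = rss + sse

r_squared = rss / tss                   # coefficient of determination
msr = rss / k                           # mean square regression
mse = sse / (n - k - 1)                 # mean square error
f_stat = msr / mse                      # F-statistic for the global test of significance
see = mse ** 0.5                        # standard error of the estimate
print(r_squared, f_stat, see)
```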
Prediction intervals on the dependent variable
- Range of dependent variable (Y) values for a given value of the independent variable (X) and a given level of probability
- Two sources of error: Regression line and SEE
e.g. an interval such as 20 to 40 around the point estimate
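A minimal sketch of a prediction interval, with all inputs hypothetical; s_f is the standard error of the forecast, which widens the SEE to allow for estimation error in the regression line:

```python
from scipy import stats

# Hypothetical inputs
y_hat, see, n = 30.0, 4.5, 25          # point estimate of Y, SEE, sample size
x, x_mean, s_x = 10.0, 8.0, 3.0        # forecast X, mean of X, std dev of X

# Standard error of the forecast (simple regression)
s_f = (see ** 2 * (1 + 1 / n + (x - x_mean) ** 2 / ((n - 1) * s_x ** 2))) ** 0.5
t_crit = stats.t.ppf(0.975, df=n - 2)  # 95% prediction interval
print(y_hat - t_crit * s_f, y_hat + t_crit * s_f)
```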
Limitations of regression analysis
- Parameter instability - Regression relationships can change over time
- Public knowledge of relationships - If a number of analysts identify a regression relationship that works, prices will change to reflect the inflow of funds, possibly removing the trading opportunity
- Assumption violation - If regression assumptions are violated then hypothesis test and predictions will be invalid
Multiple Regression
Assumptions
- The relationship between the dependent variable and each independent variable is linear
- The independent variables are not random and there is no multicollinearity (x:x)
- The expected value of the error term is zero
- Error term is homoskedastic (error variance constant; having the same scatter)
- No serial correlation
- Error term is normally distributed
ANOVA
Work out:
- Degrees of freedom (DF) with k = # variables ; n = sample size
- Sum of squares: 2 will be given (TSS = RSS + SSE)
Using the regression equation to estimate the value
Becomes: Ŷ = 0.163 - (0.28 x 11) + (1.15 x 18) + (0.09 x 215) = 37.13
But this is only an estimate, we will want to apply confidence intervals to this
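The point estimate above is just a plug-in calculation; a short check using the coefficients from the example:

```python
# Coefficients and X values taken from the example above
b = [0.163, -0.28, 1.15, 0.09]          # intercept and three slope coefficients
x = [11, 18, 215]                       # values of the independent variables
y_hat = b[0] + sum(bi * xi for bi, xi in zip(b[1:], x))
print(round(y_hat, 2))                  # 37.13
```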
Individual test: T-test
Testing the significance of each of the individual regression coefficients and the intercept
Tcalc: bi / S.E.
Tcrit: ≈ 2 (given in the exam)
TCalc > TCrit (in absolute value) = REJECT NULL (H0: bi = 0)
then bi not equal to 0 = SIGNIFICANT
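A minimal sketch of the individual t-test, with hypothetical values for the coefficient, its standard error and the degrees of freedom:

```python
from scipy import stats

# Hypothetical regression output
b_i, se_bi = 0.92, 0.31
n, k = 40, 3

t_calc = b_i / se_bi                             # H0: bi = 0
t_crit = stats.t.ppf(0.975, df=n - k - 1)        # 5% significance, two-tailed (roughly 2)
print(abs(t_calc) > t_crit)                      # True -> reject H0, coefficient is significant
```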
Global F-Test: Testing the validity of the whole regression
Testing to see whether or not all of the regression coefficients as a group are insignificant
FCalc = MSR / MSE = (RSS / k) / (SSE / (n - k - 1))
FCalc > FCrit = REJECT NULL: at least one slope coefficient does not equal zero
T-Test: Specified Value
Determining whether a regression coefficient is significantly different from a specified value e.g. 1
Tcalc: (bi - 1) / S.E.
Tcrit: ≈ 2 (given in the exam)
TCalc > TCrit (in absolute value) = REJECT NULL (H0: bi = 1)
then bi not equal to 1 = SIGNIFICANT (significantly different from the specified value)
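The same sketch with the null set to the specified value (here 1); all inputs are hypothetical:

```python
from scipy import stats

# Hypothetical regression output
b_i, se_bi = 1.45, 0.20
n, k = 40, 3

t_calc = (b_i - 1) / se_bi                       # H0: bi = 1
t_crit = stats.t.ppf(0.975, df=n - k - 1)
print(abs(t_calc) > t_crit)                      # True -> bi differs significantly from 1
```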
R2 Recap
“The percentage of the total variation in the dependent variable (Y) that is explained by the regression equation”
Adjusted R2
- The problem with R2 is that it will automatically increase if new independent variables are added, even if the new variable adds very little to the regression
- Adjusted R2 takes into account the number of independent variables
- It will only increase if the new independent variable pulls its weight
Example: adding a 4th variable increases R2 (which looks good), but Adjusted R2 decreases, which is worse. Prefer the option where R2 stays the same or gets worse but Adjusted R2 holds up.
Interpret rather than use formula.
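The exam emphasis is interpretation, but for completeness here is the standard formula as a small sketch (all numbers hypothetical):

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    # Penalizes additional independent variables that add little explanatory power
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# R2 rises from 0.80 to 0.81 when a 4th variable is added, but the adjustment
# shows the new variable barely pulls its weight
print(adjusted_r2(0.80, n=30, k=3))   # ~0.777
print(adjusted_r2(0.81, n=30, k=4))   # ~0.780
```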
Dummy variables in regression analysis
- Qualitative variables are important - E.g. investor confidence
- Incorporate by dummy variables - Assigned either “1” or “0”
- If you want to describe j circumstances with dummy variables you need j-1 dummy variables - E.g. month of year effect requires 11 dummy variables
Write a suitable regression equation and test significance (t-test: Tcalc = b1 / S.E.; if |Tcalc| > Tcrit = REJECT = significant), as in the sketch below
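A minimal sketch of a dummy-variable regression on hypothetical data (quarterly EPS with a Q4 dummy), using statsmodels for the fit and the t-tests:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical quarterly EPS; dummy = 1 in the fourth quarter, 0 otherwise
eps = np.array([1.1, 1.0, 1.2, 1.8, 1.2, 1.1, 1.3, 1.9])
q4 = np.array([0, 0, 0, 1, 0, 0, 0, 1])

X = sm.add_constant(q4)        # intercept plus the single dummy variable
fit = sm.OLS(eps, X).fit()
print(fit.params)              # intercept and dummy coefficient
print(fit.tvalues)             # |Tcalc| > Tcrit -> the Q4 effect is significant
```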
Homoskedasticity
Variance of the error terms is constant across all of the observed data
Heteroskedasticity
Variance of the error terms is not constant across all of the observed data
Testing for conditional heteroskedasticity: Breusch-Pagan test
Breusch-Pagan test
Testing for conditional heteroskedasticity
- Regress the squared errors against each independent variable
- Determine R2 of these regressions
- If no conditional heteroskedasticity there will not be a strong relationship
- If R2 is high there may be a strong relationship
- But also need to consider the number of observations (the test statistic is n * R2, chi-square distributed with df = number of independent variables) - see the sketch below
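A minimal sketch of the Breusch-Pagan idea on hypothetical data: fit the regression, regress the squared residuals on the independent variable(s), and compare n * R2 of that auxiliary regression with a chi-square critical value:

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

# Hypothetical data with deliberately heteroskedastic errors
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2 + 0.5 * x + rng.normal(size=100) * (1 + 0.5 * np.abs(x))

X = sm.add_constant(x)
resid = sm.OLS(y, X).fit().resid
aux = sm.OLS(resid ** 2, X).fit()              # regress squared errors on X
bp_stat = len(y) * aux.rsquared                # n * R2 of the auxiliary regression
p_value = 1 - stats.chi2.cdf(bp_stat, df=1)    # df = number of independent variables
print(bp_stat, p_value)                        # low p-value -> conditional heteroskedasticity
```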
Correcting for heteroskedasticity
How we would correct for conditional heteroskedasticity:
- Compute robust standard errors
- Modify the regression equation by using generalized least squares method
Robust standard errors correct Tcalc
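A minimal sketch of computing White-type robust standard errors with statsmodels on hypothetical data; the coefficients are unchanged, only the standard errors (and hence Tcalc) are corrected:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical heteroskedastic data
rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 1 + 0.8 * x + rng.normal(size=100) * (1 + np.abs(x))

X = sm.add_constant(x)
ols_fit = sm.OLS(y, X).fit()                    # ordinary standard errors
robust_fit = sm.OLS(y, X).fit(cov_type="HC1")   # heteroskedasticity-robust standard errors
print(ols_fit.bse, robust_fit.bse)              # compare the standard errors
print(robust_fit.tvalues)                       # corrected Tcalc
```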
Autocorrelation / Serial correlation
E:E
- The residuals of a regression are correlated across observations, so that a positive (or negative) error in one observation affects the probability that there will be a positive (or negative) error in the next observation (previous error predicts the next error; E:E)
- Effect is that standard errors may be incorrect
- Thus we may incorrectly reject/fail to reject null hypotheses about the population values
- If one or more of the independent variables is a lagged value of the dependent variable, then serial correlation causes all regression parameters to be invalid – very serious problem as you may be performing the wrong type of regression
- Detect with Durbin-Watson statistic
Durbin-Watson statistic
Detect autocorrelation
DW = 2 * (1 - r), where r is the correlation between consecutive residuals
- Obtain the critical value of the DW statistic (given in exam)
- Testing for positive autocorrelation:
- H0: No positive autocorrelation
- If DWcalc < dl, reject H0
- If DWcalc > du, do not reject H0
- If dl ≤ DWcalc ≤ du, the test is inconclusive
Example:
- DW Statistic = 1.87
- Assume the lower and upper critical values are 1.61 and 1.74
=> DWcalc (1.87) > du (1.74) => do not reject = No positive autocorrelation
=> if DWcalc was 1.65 => Inconclusive
=> if DWcalc was 1.00 => smaller than dl => REJECT => +ve Autocorrelation
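A minimal sketch of computing DW directly from regression residuals (the residual series here is hypothetical) and applying the decision rule from the example:

```python
import numpy as np

# Hypothetical residual series from a regression
resid = np.array([0.5, 0.4, -0.2, -0.3, 0.1, 0.6, -0.4, -0.1, 0.2, 0.3])

# DW = sum of squared changes in residuals / sum of squared residuals (~2 -> no autocorrelation)
dw = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

dl, du = 1.61, 1.74   # critical values as given in the example
if dw < dl:
    print(dw, "reject H0 -> positive autocorrelation")
elif dw > du:
    print(dw, "do not reject H0 -> no positive autocorrelation")
else:
    print(dw, "inconclusive")
```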
Correcting for serial correlation
- Hansen method of adjusting the standard errors of the regression coefficients upwards
- Change the regression equation so that the autocorrelation is eliminated (do something different!!!!)
Hansen adjusts for both serial correlation and heteroskedasticity. It does not eliminate serial correlation.
Multicollinearity
(X:X)
Definition
- Multicollinearity occurs when two or more independent variables (or combinations of independent variables) in a regression model are highly (but not perfectly) correlated with each other (x:x)
- Estimates of regression coefficients will be unreliable
- Cannot distinguish individual impacts of independent variables
Detection of multicollinearity
- High R2 (the equation as a whole does predict movement in y)
- Significant F-stat (at least one bi is significant)
- but low t-stats on each regression coefficient (due to overstated standard errors) - individually not significant: evidence of multicollinearity
- Can also be tested with the pairwise correlation matrix, but only when there are two independent variables (if their correlation is close to +/- 1, they are multicollinear) - see the sketch below
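A minimal sketch of the pairwise-correlation check on two hypothetical independent variables:

```python
import numpy as np

# Hypothetical independent variables; x2 is nearly a multiple of x1
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.1, 4.0, 6.2, 7.9, 10.1, 12.0])

# Off-diagonal entries close to +/-1 are the classic multicollinearity symptom
print(np.corrcoef(x1, x2))
```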
Correcting for multicollinearity
- Reformulate the regression model, leaving out variables that appear to be redundant
- Rerun the regression model
- In practice it can be difficult to determine which variables to exclude so experimentation may be necessary
Summary: violations of assumptions
Principles of model specification
- Model should be grounded in sensible economic reasoning - E.g. avoid data mining
- Functional form of variables should be appropriate - E.g. use logs of inputs if appropriate
- Model should be parsimonious i.e. achieving a lot with a little
- Model should be examined for violations of regression assumptions before being accepted
- Model should be tested ‘out of sample’, i.e. use new sample data before being accepted
The model could fail because:
- One or more important variables are omitted (forget to put a variable in)
- One or more of the regression variables may need to be transformed - E.g. using natural logs for exponential data (or rescaling from millions to thousands)
- Data from different samples is pooled, e.g. using data from different stages of a company’s growth (mixing relationships)
Models with qualitative dependent variables
NOT the same as dummy (independent) variables
Qualitative dependent variables are where dummy variables are used as dependent rather than independent variables
There are three main models:
- Probit model - Estimates the probability of a discrete outcome (e.g. that a company will go bankrupt); uses the normal distribution
- Logit model - Based on the 'logistic distribution', a simplified version of the normal distribution that was useful before computers were developed
- Discriminant analysis - Yields a linear function, similar to a regression equation, that creates an overall 'score' for the dependent variable based on the values of the independent variables. If the score is above a certain number, the dependent variable is assigned a value of '1'; otherwise, it is assigned a value of '0'
Qualitative dependent output!!
Time-series // Time-series analysis
A time series is a set of observations on a variable’s outcomes in different time periods
Models to use: Trend model (linear / log linear) & Auto Regressive (AR)
Key issues:
- How do we predict a future value based on past values?
- How do we model seasonality?
- How do we choose which models to use?
- How do we model changes in the variance of the time series over time?
Linear trend models
Probably serial correlation in the residuals - use DW to spot it (see the sketch below)
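A minimal sketch of fitting a linear trend model to a hypothetical series and checking its residuals with DW:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical time series: y_t = b0 + b1 * t + e_t
y = np.array([10.2, 10.8, 11.1, 11.9, 12.3, 13.0, 13.4, 14.1])
t = np.arange(1, len(y) + 1)

fit = sm.OLS(y, sm.add_constant(t)).fit()
resid = fit.resid
dw = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)
print(fit.params, dw)   # DW well below 2 would signal positive serial correlation
```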