Quant Flashcards
What is linear regression?
Finding the relationship between 2 variables for predictive analysis
What are SSE, SSR and SST?
For a fitted line, one must measure the deviations between the line of best fit, the data points and the mean. These 3 variables quantify that
SSR is the pRedicted deviation - it is the difference between the line of best fit and the mean of the data set
SSE is the ERROR deviation and is the difference between the line of best fit and the data point
SST is the sum of SSE and SSR - it shows the total deviation from the mean to the data point
Remember these are all SQUARED
What is the formula for r squared
R squared = SSR / SST
It shows how well explained/predictive the model is
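A minimal sketch of the three sums of squares and r squared, assuming a one-variable least squares fit. The data is made up purely for illustration.

```python
# Toy data, invented for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Least squares slope and intercept for one independent variable.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x
fitted = [intercept + slope * x for x in xs]

ssr = sum((f - mean_y) ** 2 for f in fitted)         # line of best fit vs mean
sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))  # data point vs line of best fit
sst = sum((y - mean_y) ** 2 for y in ys)             # data point vs mean

r_squared = ssr / sst
print(ssr, sse, sst, r_squared)
```

On this toy data SSR + SSE equals SST (up to rounding) and r squared comes out at about 0.6.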
Does a high or low r squared mean the relationship is stronger?
High r squared means a STRONG relationship
R squared, what are the highest and lowest numbers it could be
It is between 0 and 1
What are the degrees of freedom
Degrees of freedom are the number of observations (n) minus the number of independent variables in the model (k) minus 1.
You want the degrees of freedom high to have a good model
DF = n-k-1
As Degrees of freedom increases, R squared ________? and why
As Degrees of freedom increases, R squared tends to decrease.
Think if you only had 2 data points, the line fits them exactly and the r^2 (relationship) would be 1. Putting in more data points would DECREASE r^2. (Adding more independent variables does the opposite: it lowers the degrees of freedom and can only push r^2 up.)
Formula for Y / relationship between x and Y
Y = β0 + β1 x + error
When comparing y = beta 0 + beta 1 * x + e, which is the independent and dependent variable?
Y is dependent, x is independent
If the confidence level rises (from 90% to 99%) does the probability of rejecting the null hypothesis go up or down? Why
The probability will go….. down. The confidence interval will get wider (to ensure we are more confident we have the right number), so the hypothesised value is more likely to fall inside it.
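What is a t statistic? What is the formula?
A t test checks whether a hypothesised number could be the actual value of a statistic, based on the estimate we have, its standard error, and a critical t score.
t stat = (estimate - hypothesised value) / standard error. Reject the null if the t stat is beyond the critical t score.
Equivalently, take the estimate + and - the critical t score * standard error and check whether the hypothesised value falls inside.
The critical t score is found in the t table using the degrees of freedom: n - 2 for a regression with one independent variable (n - k - 1 in general).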
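A minimal sketch of the t statistic as it is usually written: the estimate minus the hypothesised value, divided by the standard error. The function names and the numbers are illustrative assumptions, and the critical value has to come from a t table.

```python
def t_statistic(estimate, hypothesised_value, standard_error):
    """Distance between the estimate and the hypothesised value, in standard errors."""
    return (estimate - hypothesised_value) / standard_error


def reject_null(t_stat, critical_t):
    """Two-tailed decision: reject when |t| is beyond the critical t from the table."""
    return abs(t_stat) > critical_t


# Illustrative numbers only: a slope estimate of 0.6 with a standard error of 0.25,
# tested against a hypothesised value of 0.
t_stat = t_statistic(0.6, 0.0, 0.25)

# 3.182 is the two-tailed 5% critical t for 3 degrees of freedom (read from a t table).
print(t_stat, reject_null(t_stat, 3.182))
```

Here the t stat is inside the critical value, so the null would not be rejected.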
What is the standard error
SD / square root n
OR
Epsilon (which is Y - β0 - β1 x) < the formula for Y in reverse.
(Sum of Epsilon squared / (n-2)) ^.5
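The two standard error formulas on this card can be sketched as follows. The function names are hypothetical, the data is invented, and b0 and b1 are assumed to be already estimated.

```python
import math
import statistics


def standard_error_of_mean(values):
    """Sample standard deviation divided by the square root of n."""
    return statistics.stdev(values) / math.sqrt(len(values))


def standard_error_of_estimate(xs, ys, b0, b1):
    """Square root of (sum of squared residuals / (n - 2)) for one independent variable."""
    residuals = [y - b0 - b1 * x for x, y in zip(xs, ys)]  # epsilon = Y - b0 - b1 * x
    return math.sqrt(sum(e ** 2 for e in residuals) / (len(xs) - 2))


# Toy data; b0 = 2.2 and b1 = 0.6 are the least squares estimates for it.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
print(standard_error_of_mean(ys))
print(standard_error_of_estimate(xs, ys, b0=2.2, b1=0.6))
```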
How to find SSR
It is the line of best fit - mean
How to find SSE
Value - line of best fit
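What is an f test?
It compares 2 variances (as a ratio) to check if they are statistically consistent. In regression, it compares the explained variance (MSR) with the unexplained variance (MSE)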
Confidence interval formula
= mean +- t or z score * standard error
What are the z scores for 90, 95 and 99% ?
- 1.645
- 1.96
- 2.576
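As a cross-check on this card, the two-tailed z scores can be recovered from the standard normal distribution. The helper name `two_tailed_z` is an assumption for this sketch.

```python
from statistics import NormalDist


def two_tailed_z(confidence_level):
    """Critical z for a two-tailed interval, e.g. 0.95 gives roughly 1.96."""
    tail = (1.0 - confidence_level) / 2.0
    return NormalDist().inv_cdf(1.0 - tail)


for level in (0.90, 0.95, 0.99):
    print(level, round(two_tailed_z(level), 3))
```

This gives roughly 1.645, 1.96 and 2.576 for 90%, 95% and 99%.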
Coefficient of determination is
r^2
What is correlation squared?
r^2
In the formula Y = β0 + β1 x + error, what is β0
β0 is the y intercept
Confidence interval explanation and formula
Mean + - t or z stat * standard error.
Check if the OTHER mean (be it the actual or standard mean) is within those boundaries
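A minimal sketch of the confidence interval check, assuming a z score is acceptable (a large sample); with a small sample a t score from the table would replace it. The sample values and function names are made up.

```python
import math
import statistics
from statistics import NormalDist


def confidence_interval(values, confidence_level=0.95):
    """Mean +/- z * standard error, using a two-tailed z score."""
    mean = statistics.mean(values)
    standard_error = statistics.stdev(values) / math.sqrt(len(values))
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence_level) / 2.0)
    return mean - z * standard_error, mean + z * standard_error


def contains(interval, other_mean):
    """True when the OTHER mean falls inside the boundaries."""
    low, high = interval
    return low <= other_mean <= high


# Invented sample for illustration.
sample = [9.8, 10.1, 10.4, 9.9, 10.3, 10.0, 10.2, 9.7]
interval_95 = confidence_interval(sample, 0.95)
print(interval_95, contains(interval_95, 10.0), contains(interval_95, 12.0))
```

If the other mean is outside the boundaries, the null that the two are equal is rejected at that confidence level.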
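What is the p value?
The "pathetic" value: the smallest significance level at which the null can be rejected. We want that low (below our significance level) to reject the null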
What are some key assumptions to simple linear regression
the relationship between x and y is linear
x is uncorrelated with the error terms
Expected value of the error term = 0 (so the residuals sum to 0)
the error term has a constant variance
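Formula for standard deviation with Standard error
SD = Standard error * square root n (the SD itself is the square root of (sum of squared deviations from the mean / (n-1)))
Is variance the same as SST?
Nearly. SST is the total squared deviation of Y from its mean; the sample variance of Y is SST / (n-1)
Formula for DOF
Total DOF = k + (n-k-1) = n-1 (k for the regression, n-k-1 for the error)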
MSR (mean squared regression) and MSE (mean squared Error) formulas
MSR = SSR / k, MSE = SSE / (n-k-1)
What is MSR / MSE
F stat
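The MSR, MSE and F stat formulas can be sketched as below. The inputs are illustrative, and the resulting F still has to be compared with a critical F value from a table.

```python
def f_statistic(ssr, sse, n, k):
    """Return (MSR, MSE, F) with MSR = SSR / k and MSE = SSE / (n - k - 1)."""
    msr = ssr / k
    mse = sse / (n - k - 1)
    return msr, mse, msr / mse


# Illustrative inputs: 5 observations, 1 independent variable.
msr, mse, f_stat = f_statistic(ssr=3.6, sse=2.4, n=5, k=1)
print(msr, mse, f_stat)
```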
Formula for standard error in regression
square root of (SSE / (n - k - 1))
Correlation formula, then R squared formula
Cor = Cov(x, y) / (sigma x * sigma y)
R^2 = cor^2 (with one independent variable)
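A small sketch of correlation as covariance divided by the two standard deviations, using sample (n - 1) formulas. The data is invented and the function names are assumptions.

```python
import math


def covariance(xs, ys):
    """Sample covariance with an (n - 1) denominator."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)


def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    sd_x = math.sqrt(covariance(xs, xs))
    sd_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sd_x * sd_y)


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
cor = correlation(xs, ys)
print(cor, cor ** 2)
```

On this toy data the squared correlation is about 0.6, which matches SSR / SST for a one-variable regression on the same points.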
F stat formula, what it means, and how to interpret it
F stat is testing if there is even a relationship between the y and x variables
It is MSR/MSE
Over the critical F value (from the F table, with k and n-k-1 degrees of freedom) means that there is a relationship
Calculate MSR and MSE
MSR = SSR / k, MSE = SSE / (n-k-1)
MSR/MSE = F
What does adjusted r^2 do
It adjusts the r^2 for the degrees of freedom, so that adding more independent variables does NOT automatically increase the r^2
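A minimal sketch of the usual adjusted r^2 formula, 1 - ((n - 1) / (n - k - 1)) * (1 - r^2), where n is the number of observations and k the number of independent variables. The inputs are illustrative.

```python
def adjusted_r_squared(r_squared, n, k):
    """Penalise r squared for the number of independent variables k."""
    return 1.0 - ((n - 1) / (n - k - 1)) * (1.0 - r_squared)


# Same r squared, more independent variables: the adjusted figure drops.
print(adjusted_r_squared(0.6, n=30, k=1))
print(adjusted_r_squared(0.6, n=30, k=5))
```

With a weak fit and many variables the adjusted figure can fall below zero.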
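Downfall of R^2?
It never decreases when you add independent variables, even useless ones. (Adjusted r^2 fixes this, but it is not bound by 0 and 1: it can go negative)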
What is a dummy variable? And how to incorporate into formula?
Introducing a QUALitative variable. You give the category you want a value of 1, and every alternative a value of 0. If it is months of the year, and you want only results collected in Jan, the Jan dummy has a value of 1 for Jan results and 0 otherwise. For n categories you use n-1 dummies (so 11 for months); the left-out month is the baseline
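A sketch of the month example: 12 months are encoded with 11 zero/one dummies, with one month left out as the baseline. Using Dec as the baseline is an arbitrary assumption for illustration.

```python
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def month_dummies(month, baseline="Dec"):
    """Return 11 dummies; the baseline month is the omitted category (all zeros)."""
    if month not in MONTHS:
        raise ValueError(f"unknown month: {month}")
    return {m: int(m == month) for m in MONTHS if m != baseline}


print(month_dummies("Jan"))  # the Jan dummy is 1, the other ten are 0
print(month_dummies("Dec"))  # baseline month: every dummy is 0
```

Each dummy then gets its own coefficient in the regression, measured relative to the baseline month.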
What is heteroskedasticity?
It is unequal variances: the variance of the error term is not constant across observations. Pretty much that there is a relationship between the size of the errors and the value of the independent variable. You don’t want that
What are the assumptions of multiple regression
- There is a linear relationship
- The independent variables are NOT random
- Expected value of the error = 0
- Variance of the error is constant
- Errors are not correlated
- Error is normally distributed
How does Heteroskedasticity affect the standard error, and what does this mean?
It usually makes the standard error too low, so the t stat is too high, meaning that it is EASIER to reject a null (you reject nulls that you should not).
How to detect heteroskedasticity?
Breusch Pagan Test
What is serial correlation?
That the error terms are correlated with each other over time (autocorrelation), so an error in one period helps predict the next. So if a stock goes up more than expected one day, it is more likely to do so the next day. The errors are not independent
What will serial correlation do to the t stat
Positive serial correlation understates the standard error, which increases the t stat, meaning you will reject the null too often
What is multicollinearity
Multicollinearity means that two or more independent variables are closely correlated
What will multicollinearity do to the t stat and standard error
increase standard error and reduce t stat
How do you resolve multicollinearity?
Remove a variable
The null hypothesis is the….
hypothesis being tested, usually the one you want to reject (e.g. that a coefficient equals zero)
What is autoregression?
The variable's value yesterday explains its value today
Formula for an autoregression equation
x(t) = b0 + b1 * x(t-1) + E
How do you detect if error terms are correlated?
Durbin Watson Test - you can't use this data if the error terms are correlated
I have x1, how do i get x2 using autoregression
x2 = b0+b1*X1
X1 is the same as x(t-1) when forecasting x2
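A minimal sketch of chaining AR(1) forecasts: each forecast becomes the lagged value for the next one. The coefficients b0 = 1.0 and b1 = 0.5 are made up, and the function name is an assumption.

```python
def ar1_forecast(last_value, b0, b1, steps=1):
    """Forecast `steps` periods ahead with x(t) = b0 + b1 * x(t-1)."""
    forecasts = []
    value = last_value
    for _ in range(steps):
        value = b0 + b1 * value  # the previous forecast is the new lagged value
        forecasts.append(value)
    return forecasts


# x1 = 4.0, so x2 = 1.0 + 0.5 * 4.0 and x3 = 1.0 + 0.5 * x2.
print(ar1_forecast(4.0, b0=1.0, b1=0.5, steps=2))
```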
Autoregressive correlation. How do you test for this, and what does the test mean?
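Normal t test for this one. t stat = autocorrelation of the residuals / Standard error, where the standard error is 1 / square root of the number of observations. Compare against the critical t value.
If the null (no autocorrelation) is NOT REJECTED, the data is all okay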
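A sketch of the t test on a residual autocorrelation, assuming the usual standard error of 1 / square root of the number of observations. The inputs are illustrative and the critical t still comes from the table.

```python
import math


def autocorrelation_t_stat(autocorrelation, n_observations):
    """t stat = autocorrelation / (1 / sqrt(number of observations))."""
    standard_error = 1.0 / math.sqrt(n_observations)
    return autocorrelation / standard_error


# Illustrative: a lag-1 residual autocorrelation of 0.2 over 100 observations.
print(autocorrelation_t_stat(0.2, 100))
```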
You do a t test on the serial correlation of some time series data and find out that the null is rejected, meaning that the t stat is outside the critical t value, what does this mean?
Rejected null means reject that model, the errors are autocorrelated and not good
Mean reverting level, what is the formula for this?
b0 / (1-b1)
This is what the data points should revert to
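A small sketch of the b0 / (1 - b1) level for an AR(1) model. The coefficients are invented; the check on b1 reflects that the level only exists when b1 is below 1 in absolute value.

```python
def mean_reverting_level(b0, b1):
    """Level where x(t) = b0 + b1 * x(t-1) reproduces itself."""
    if abs(b1) >= 1:
        raise ValueError("no finite mean reverting level when |b1| >= 1")
    return b0 / (1 - b1)


level = mean_reverting_level(b0=1.0, b1=0.5)
# Feeding the level back into the AR(1) equation returns the same value.
print(level, 1.0 + 0.5 * level)
```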
How do you work out which autoregression line you should use? e.g. data from 2 years ago or 3 years ago.
You use Root Mean Squared Error. Pretty much the Square root of the mean squared error of each model's out of sample forecasts - the smallest means you use that model
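A minimal sketch of comparing two candidate models by root mean squared error. The actual values and both sets of forecasts are invented, and the model labels are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Square root of the mean squared forecast error."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


actual = [1.0, 2.0, 3.0]
forecasts = {
    "model_a": [1.1, 1.9, 3.2],  # hypothetical forecasts from one model
    "model_b": [1.5, 2.5, 2.0],  # hypothetical forecasts from another
}
best = min(forecasts, key=lambda name: rmse(actual, forecasts[name]))
print(best)
```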
What is the mean reverting level of a random walk and why
There is none! It is b0 / (1-b1)
b1 is always 1, so the denominator is 1-1 = 0 and the level is undefined
Formula for a random walk and what it means
x(t) = x(t-1) + random error term.
The best guess of the next value is the value from the period before it: x(t-1) + a random error with an expected value of 0
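A short simulation sketch of a random walk, assuming normally distributed errors with mean 0 and standard deviation 1. The seed and the starting value are arbitrary.

```python
import random


def simulate_random_walk(start, steps, seed=0):
    """Each value is the previous value plus a random error term."""
    rng = random.Random(seed)  # fixed seed so the path is reproducible
    path = [start]
    for _ in range(steps):
        path.append(path[-1] + rng.gauss(0.0, 1.0))
    return path


path = simulate_random_walk(start=100.0, steps=5)
print(path)
```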
Multicollinearity, Heteroskedasticity and serial correlation, how are each one's standard errors affected?
Multi = multicollinearity = Multiple increase in standard error, so multicollinearity has a higher standard error, the other ones don’t have multi, meaning they (usually) have a lower standard error
How can a model be misfitted?
Types:
- Time-series: Serial correlation with a lagged variable, or forecasting the past
- Functional: Omitting a variable or data pooled improperly
You use the Durbin Watson test to test for what?
Autocorrelation
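A sketch of the Durbin Watson statistic computed from regression residuals: the sum of squared changes in the residuals divided by the sum of squared residuals. The residual series are invented to show the three cases.

```python
def durbin_watson(residuals):
    """Sum of squared first differences of the residuals / sum of squared residuals."""
    numerator = sum((curr - prev) ** 2 for prev, curr in zip(residuals, residuals[1:]))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


print(durbin_watson([1.0, 1.0, 1.0, 1.0]))    # errors repeat: well below 2
print(durbin_watson([1.0, -1.0, 1.0, -1.0]))  # errors flip sign: well above 2
print(durbin_watson([1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0]))  # near 2
```

A value near 2 suggests no serial correlation, a value well below 2 suggests positive serial correlation, and a value well above 2 (up to 4) suggests negative serial correlation.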
When testing for Autocorrelation in Linear and Log Linear models, what do you use? And do you use something different for AR models?
Yes. Durbin watson for Linear and Log Linear.
T test for AR models
Important, what does covariance stationary mean. What are the assumptions.
Finite expected value
Constant Variance, Constant covariance
Has a mean reverting level
No unit root problem
Important, how do you make data covariance stationary
By First differencing data. You take the difference between a period and the period prior, that is now the new data point
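First differencing can be sketched in a few lines. The series is invented and the function name is an assumption.

```python
def first_difference(series):
    """New data point = value in a period minus the value in the period prior."""
    return [current - previous for previous, current in zip(series, series[1:])]


print(first_difference([10, 12, 11, 15]))
```

The differenced series is one observation shorter than the original.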
What is first differencing data
Making data covariance stationary
What does the Durbin Watson test test for? And what is the magic number
Autocorrelation. The magic number is 2. Close to 2 = NO serial correlation, well below 2 = positive serial correlation, well above 2 (up to 4) = negative serial correlation
What is the difference between an AR1 model and AR 2 model
AR1 only has 1 lagged variable, AR2 has 2.
In an autoregression model, if b1 >= 1, what happens?
The data is NOT covariance stationary because there is NO mean reverting level. You can not use the data.
Which is better for data sets and why. Long term or short term data
Short term (yes short term). Why? Well, long term data may contain data points that span structural changes in the underlying economy or data environment. Not good to model off
Is a random walk covariance stationary?
NO.
How to make a random walk covariance stationary
First difference the data. Period t - Period t-1.
What is first differencing
Making data covariance stationary by taking the difference between 2 consecutive data points
In an AR model, how to check for autocorrelation, and how do you interpret the data
T test. If the autocorrelation t score is BELOW/within the critical t, autocorrelation is NOT present, so the model is good. If the errors are correlated, use the next AR model (AR2, AR3 etc) until the serial correlation goes away
What is the Dickey Fuller test
The unit root test (if a unit root is present we are NOT in the clear). Basically ensuring that b1 is not 1 (b1 = 1 means no mean reversion)
How do you conduct a Dickey Fuller test? And what does it test for?
Subtract x(t-1) from both sides of the equation: x(t) - x(t-1) = b0 + (b1 - 1) * x(t-1) + error, then test whether (b1 - 1) = 0. It tests if the series has a unit root; an AR model needs NO unit root to be covariance stationary.
What would adding a second lag do to an AR model? Adjust for what?
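Seasonality (when the added lag matches the seasonal period, e.g. the 4th lag for quarterly data)
If a model has ARCH(1), what does that mean?
The variance of the errors in one period depends on the variance of the errors in the period before, so variance can be predicted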
What is machine learning?
Finding patterns then applying those patterns.
What is a feature and what is a target in machine learning
A target is the y variable, the dependent variable, while the feature is the x variable
What are training, validation and test samples
Training samples help an algorithm learn a pattern or relationship
Validation samples TUNE the model
Test samples test the model on out of sample data
Which is new data, in or out of sample data
Out
What is unsupervised learning
Unsupervised learning is when a Machine learning algorithm learns the relationships between variables when they are not labelled (there is no target). It finds the patterns and relationships itself
What is supervised learning?
Supervised learning is when the dataset is labelled: the analyst supplies the target for each observation and the algorithm learns to predict it
What is a hyperparameter
It is a setting the analyst enters before training; it is something that constrains the learning process of the model
What is overfitting? What are the detriments. Is it for supervised or unsupervised models
Having too many features to describe a target, so the model fits the noise in the training data. The model can NOT explain out of sample data well.
Supervised only
Bias Error, and Variance error, what are they
Bias error means you have inputs that do not explain the changes in Y. This means the model is underfitted.
Variance error is when the model is overfitted. The model is great at explaining in sample data, but bad at out of sample
How to reduce variance error?
Holdout samples and K Fold cross validation
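A minimal sketch of how k fold cross validation splits the data: the sample is cut into k folds, and each fold takes one turn as the validation set while the rest is used for training. The function name is an assumption, and no shuffling is done here for simplicity.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


for train, validation in k_fold_indices(10, 3):
    print(train, validation)
```

The model is fitted k times, and the k validation errors are averaged to judge how it performs out of sample.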
Name the types of supervised models (5)
- Penalised model (penalty for including more variables)
- Support Vector - classification model
- K nearest neighbour - classification model - finding similarities in inputs
- CART - classification and regression tree - binary splits
- Ensemble/random forest - complex but low variance model
Types of UNsupervised models
- Principal Components - only keeping the most relevant combinations of features
- Clustering - K means clustering - putting observations into K clusters
- Hierarchical clustering - building clusters up or dividing them step by step
Neural network ML, what is it
Super complex and very effective. good for nonlinear relationships
When dealing with a random walk, if the intercept and the coefficients do not significantly differ from zero, what should you do?
Assume that they equal zero, so y = error (where y is the first differenced series)
Do random walks have unit roots
Yes
Convertible bond ratio
Market conversion price = Convertible bond price/Conversion ratio
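The formula can be sketched as a one-line function. The bond price and conversion ratio below are invented for illustration.

```python
def market_conversion_price(convertible_bond_price, conversion_ratio):
    """Price effectively paid per share when buying the stock via the convertible."""
    return convertible_bond_price / conversion_ratio


# Illustrative: a bond priced at 1,050 that converts into 25 shares.
print(market_conversion_price(1050.0, 25))
```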