Quantitative methods Flashcards
Multiple regression model assumptions
- linearity
- homoskedasticity –> variance of residuals is constant
- independence of errors –> residuals are not serially correlated
- normality –> error term is normally distributed; evaluated with a QQ plot
- independence of independent variables –> no linear relationships between independent variables
MSR
MSR = RSS/k
MSE
MSE = SSE/(n−k−1)
SST
RSS+SSE
R2
RSS/SST
or
(SST − SSE)/SST
or
(total variation − unexplained variation)/total variation
indicates how much of the variation in the dependent variable the independent variables can explain
Breusch pagan
BP statistic = n × R² (from a regression of the squared residuals on the independent variables), chi-square with k df
Adjusted R2
1-((n-1)/(n-k-1))*(1-R^2)
o measure of goodness of fit that adjusts for the number of independent variables
o adj R2<R2
o decreases when the added independent variable adds little value to regression model
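The R² and adjusted R² formulas above can be sketched in Python; the ANOVA values (RSS, SSE, n, k) are assumed for illustration only:

```python
# Hypothetical ANOVA numbers, assumed for illustration only.
RSS, SSE = 80.0, 20.0   # explained and unexplained sums of squares
n, k = 50, 3            # observations and independent variables

SST = RSS + SSE                                  # total variation
R2 = RSS / SST                                   # = (SST - SSE) / SST
adj_R2 = 1 - ((n - 1) / (n - k - 1)) * (1 - R2)  # penalizes extra variables

MSR = RSS / k            # mean square regression
MSE = SSE / (n - k - 1)  # mean square error

print(R2)       # 0.8
print(adj_R2)   # always < R2 when k > 0
```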
Cook’s D
If Cook's D > √(k/n) –> influential observation
Odds
Prob given odds
Odds= e^coefficient
Prob with odds = odds/(1+odds)
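A minimal sketch of the odds-to-probability conversion; the coefficient value is assumed:

```python
import math

# Convert a logistic-regression coefficient into odds and a probability.
# The coefficient value 0.75 is assumed for illustration.
coefficient = 0.75
odds = math.exp(coefficient)   # odds = e^coefficient
prob = odds / (1 + odds)       # P = odds / (1 + odds)

print(round(prob, 4))
```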
F statistic
((SSEr-SSEu)/q) / (SSEu/(n-k-1))
= MSR/MSE with k and n−k−1 df
H0: all slope coefficients equal zero
Reject H0 if F (test statistic) > Fc (critical value)
Tests whether at least one slope coefficient is significant
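The overall F-test can be sketched as follows; the ANOVA values and the critical value are assumed for illustration:

```python
# Overall F-test: MSR/MSE with k and n-k-1 degrees of freedom.
# All numbers below are assumed for illustration only.
RSS, SSE = 80.0, 20.0
n, k = 50, 3

MSR = RSS / k              # numerator: k df
MSE = SSE / (n - k - 1)    # denominator: n - k - 1 df
F = MSR / MSE

F_critical = 2.81          # assumed 5% critical value for (3, 46) df
reject_H0 = F > F_critical # at least one slope coefficient is nonzero

print(round(F, 2), reject_H0)
```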
Conditional Heteroskedasticity
Residual variance is related to level of independent variables
- Coefficients consistent.
- St. errors underestimated
- Type I errors
DETECTION
* Breusch–Pagan chi-square test
* p-value < 5% –> heteroskedasticity
* p-value > 5% –> no heteroskedasticity
CORRECTION
robust or White-corrected standard errors
Serial Correlation
Residuals are correlated with each other
- Coefficients consistent
- St errors underestimated
- Type I errors (positive correlation)
DETECTION
* Breusch–Godfrey (BG) F-test
* Durbin Watson (DW)
* DW < 2 –> positive serial correlation
CORRECTION
Use robust or Newey–West corrected standard errors
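A sketch of the Durbin–Watson statistic, computed on an assumed residual series:

```python
# Durbin-Watson: sum of squared first differences of residuals
# divided by the sum of squared residuals. Residuals are assumed values.
residuals = [0.5, 0.4, 0.6, 0.3, 0.5, 0.2, 0.4]

num = sum((residuals[t] - residuals[t - 1]) ** 2
          for t in range(1, len(residuals)))
den = sum(e ** 2 for e in residuals)
DW = num / den

# DW is approximately 2(1 - r): values well below 2 suggest
# positive serial correlation.
print(round(DW, 3))
```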
Multicollinearity
Two or more independent variables are highly correlated
- Coefficients are consistent (but unreliable).
- St errors are overestimated
- Type II errors
DETECTION
* Conflicting t and F-statistics
* variance inflation factors (VIF)
* VIF > 5 (or 10) signals a problem
CORRECTION
* Drop one of the correlated variables
* use a different proxy for an included independent variable
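A minimal sketch of the VIF rule, assuming an R² from regressing one independent variable on the others:

```python
# VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing variable j
# on the remaining independent variables. R²_j is assumed here.
R2_j = 0.85
VIF = 1 / (1 - R2_j)

print(round(VIF, 2))   # above 5 -> multicollinearity concern
```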
MISSPECIFICATIONS
Omission of important independent variable(s)–>May lead to serial correlation or heteroskedasticity in the residuals
Inappropriate transformation / variable form–> May lead to heteroskedasticity in the residuals
Inappropriate scaling–>May lead to heteroskedasticity in the residuals or multicollinearity
Data improperly pooled –> May lead to heteroskedasticity or serial correlation in the residuals
Solve it by running separate regressions for each subperiod
Autoregressive (AR) Model
- AR(1): only 1 lag –> dependent variable is regressed against previous values of itself
- no distinction between the dependent and independent variables (i.e., x is the only variable).
- USE t-tests to determine whether the correlations between residuals at any lag are statistically significant; if so, add one lag at a time
- if not covariance stationary –> correct with first differencing
- Ex: pattern of a currency using historical prices
- Forecast one period ahead, then feed that forecast back in –> chain rule forecasting
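Chain rule forecasting can be sketched with an AR(1) model; the coefficients and last observation are assumed:

```python
# Chain rule forecasting with an AR(1) model: x_t = b0 + b1 * x_{t-1}.
# Coefficients and the last observed value are assumed for illustration.
b0, b1 = 1.0, 0.6
x_last = 5.0

x1 = b0 + b1 * x_last   # one-period-ahead forecast
x2 = b0 + b1 * x1       # two-period-ahead: reuse the first forecast

print(x1, x2)
```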
Covariance Stationary
- Dickey–Fuller test statistic significant (reject the unit root) = covariance stationary
o Constant and finite mean. E(xt) = E(xt−1); NOTE: no growth trend in the mean
o Constant and finite variance.
o Constant and finite covariance with leading or lagged values
- To determine covariance stationarity –> Dickey–Fuller test
Mean Reversion
A time series is mean reverting if it tends towards its mean over time
Mean-reverting level = b0/(1−b1)
If b1 = 1 –> the mean-reverting level is undefined because of b0/0
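A one-line sketch of the mean-reverting level, with assumed AR(1) coefficients:

```python
# Mean-reverting level of an AR(1) model: b0 / (1 - b1).
# Coefficients are assumed for illustration.
b0, b1 = 1.0, 0.6
mean_reverting_level = b0 / (1 - b1)

print(mean_reverting_level)   # undefined when b1 = 1 (division by zero)
```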
Unit Root = Random walk
- b1 = 1 –> must first-difference the data
- Undefined mean-reverting level –> not covariance stationary
Random Walk
- random walk = value in one period equals the value in the previous period, plus a random error.
- Random walk without a drift: xt = xt−1 + εt (b0 = 0 and b1 = 1)
- Random walk with a drift (b0 ≠ 0): xt = b0 + xt−1 + εt (b1 = 1)
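A random walk with drift can be simulated and first-differenced as follows (drift and seed are assumed):

```python
import random

# A simulated random walk with drift is not covariance stationary,
# but its first difference y_t = x_t - x_{t-1} = b0 + e_t is.
# Drift b0 and the seed are assumed for illustration.
random.seed(42)
b0 = 0.1
x = [0.0]
for _ in range(100):
    x.append(b0 + x[-1] + random.gauss(0, 1))   # x_t = b0 + x_{t-1} + e_t

diffs = [x[t] - x[t - 1] for t in range(1, len(x))]  # first differencing
print(len(diffs))
```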
Seasonality
- More than 1 lag
o quarterly data = seasonal lag is 4;
o monthly data = seasonal lag is 12.
Root Mean Squared Error (RMSE)
to assess accuracy of autoregressive models.
* lower RMSE = better
* Out-of-sample forecasts
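RMSE on hypothetical out-of-sample forecasts can be sketched as:

```python
import math

# RMSE: square root of the mean squared forecast error.
# Actual and forecast values are assumed for illustration.
actual   = [2.0, 2.5, 3.0, 2.8]
forecast = [2.2, 2.4, 2.7, 3.0]

rmse = math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast))
                 / len(actual))
print(round(rmse, 4))
```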
- structural change
significant shift in the plotted data at a point in time that seems to divide the data into two distinct patterns
- Cointegration:
two time series are economically linked (same macro variables) or follow the same trend and that relationship is not expected to change
Autoregressive Conditional Heteroskedasticity (ARCH)
variance of residuals in one period is dependent on the variance of residuals in a previous period –> standard errors of the coefficients and the hypothesis tests are invalid.
- Penalized regression
Regression technique –> useful when there are many features; reduces overfitting by imposing a penalty that shrinks the coefficients of nonperforming features.
- Support vector machine
classification; separates the data into one of two possible classifiers based on a model-defined hyperplane.
- K-nearest neighbor
classification based on nearness to the observations in the training sample
- Classification and regression tree
Classification of target variables
* when there are significant nonlinear relationships among variables.
* Binary classification (categorical data)
* Provides a visual explanation
- Ensemble learning
This combines predictions from multiple models, resulting in a lower average error rate.
- Random forest
This is a variant of the classification tree whereby a large number of classification trees are trained using data bagged from the same data set; solution for overfitting
- Dimension reduction=Principal components analysis
summarizes info into smaller set of uncorrelated factors called eigenvectors.
- K-means clustering.
split observations into k non-overlapping clusters; a centroid is associated with each cluster
* Hyperparameter = parameter set before analysis begins ex. 20 groups
- Hierarchical clustering
hierarchy of clusters without any predefined number of clusters
- Neural networks
o input layer
o hidden layers (which process the input)
Nodes in the hidden layers = neurons –> each applies a summation operator (calculates a weighted average) and an activation function (a nonlinear function).
o output layer.
o Good for speech recognition and natural language processing
o Good for modelling complex interactions among many features
- Deep learning nets
many hidden layers (more than 20) useful for pattern, speech, and image recognition
- Reinforcement learning
agents learn from their own errors by maximizing a defined reward.
- precision (P)
true positives / (false positives + true positives)
- recall (R)
= true positives / (true positives + false negatives)
- accuracy
(true positives + true negatives) / (all positives and negatives)
- F1 score
(2 × P × R) / (P + R)
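The four metrics above can be computed from an assumed confusion matrix:

```python
# Classification metrics from an assumed confusion matrix.
TP, FP, TN, FN = 80, 10, 95, 15

precision = TP / (TP + FP)
recall    = TP / (TP + FN)
accuracy  = (TP + TN) / (TP + FP + TN + FN)
f1        = (2 * precision * recall) / (precision + recall)

print(round(precision, 3), round(recall, 3),
      round(accuracy, 3), round(f1, 3))
```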
Logistic model
dependent variable is binary
steps in a data analysis project
1-conceptualization of the modeling task,
2- data collection,
3- data preparation and wrangling,
4- data exploration,
5- model training.