More Quant Stuff Flashcards
Linear Regression
Linear regression is a statistical model that estimates the linear relationship between a scalar response and one or more explanatory variables (also known as the dependent and independent variables). The case of one explanatory variable is called simple linear regression; with more than one, it is called multiple linear regression.
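A minimal sketch of fitting a simple linear regression with numpy (the data here is synthetic, invented purely for illustration):

    import numpy as np

    # Synthetic data: y = 2 + 3x + noise (assumed example, not from the card)
    rng = np.random.default_rng(0)
    x = rng.normal(size=100)
    y = 2 + 3 * x + rng.normal(scale=0.5, size=100)

    # Ordinary least squares via a design matrix with an intercept column
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # [intercept, slope]
    print(beta)  # roughly [2, 3]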
Geometric Interpretation of Linear Regression
A line in 2D space, a plane in 3D space, and more generally a hyperplane when there are more explanatory variables.
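Equivalently, the OLS fitted values are the orthogonal projection of y onto the column space of the design matrix; a small sketch checking this on synthetic data:

    import numpy as np

    rng = np.random.default_rng(1)
    X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])  # intercept + 2 predictors
    y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=50)

    # Hat matrix H projects y orthogonally onto the column space of X
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    y_hat = H @ y
    residuals = y - y_hat
    print(np.allclose(X.T @ residuals, 0))  # True: residuals orthogonal to every column of X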
Under what assumptions is Linear Regression unbiased?
Linearity, no autocorrelation, multivariate normality, homoscedasticity, and no/low multicollinearity are the classical assumptions. Strictly, unbiasedness of OLS needs only linearity, exogeneity (errors with zero mean conditional on the regressors), and no perfect multicollinearity; no autocorrelation and homoscedasticity buy efficiency (BLUE, by Gauss-Markov), and normality supports exact finite-sample inference. Diagnostics for two of these are sketched below.
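A sketch of checking autocorrelation and homoscedasticity with statsmodels (synthetic data; the rules of thumb in the comments are conventions, not hard rules):

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.stattools import durbin_watson
    from statsmodels.stats.diagnostic import het_breuschpagan

    rng = np.random.default_rng(2)
    X = sm.add_constant(rng.normal(size=(200, 2)))
    y = X @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=200)
    results = sm.OLS(y, X).fit()

    print(durbin_watson(results.resid))   # near 2 suggests no autocorrelation
    lm_stat, lm_pval, f_stat, f_pval = het_breuschpagan(results.resid, X)
    print(lm_pval)                        # large p-value: no evidence against homoscedasticity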
Hypothesis testing of coefficients
- Set the hypothesis
- Set the significance level, criteria for a decision
- Compute the test statistic
- Make a decision
Can test via manual feature elimination (build a model with all the features, drop those with high p-values, and drop redundant features using correlations and VIF) or via automated techniques (e.g. RFE and regularization).
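A sketch of the manual route with statsmodels (synthetic data; the p-value reading and the VIF threshold of roughly 5-10 are the usual conventions):

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    rng = np.random.default_rng(3)
    X = sm.add_constant(rng.normal(size=(200, 3)))
    y = X @ np.array([1.0, 2.0, 0.0, -1.0]) + rng.normal(size=200)

    results = sm.OLS(y, X).fit()
    print(results.pvalues)   # drop features whose p-value exceeds the chosen level

    # VIF per non-constant column: values above ~5-10 often flag multicollinearity
    vifs = [variance_inflation_factor(X, i) for i in range(1, X.shape[1])]
    print(vifs)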
Outlier detection
Z-score/Extreme Value Analysis, Probabilistic and Statistical Modeling, Linear Regression Models, Information Theory Models, High Dimensional Outlier Detection Methods
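A minimal z-score/extreme-value sketch (synthetic data with one planted outlier; the 3-sigma cutoff is a common convention):

    import numpy as np

    rng = np.random.default_rng(4)
    x = np.append(rng.normal(size=100), [8.0])   # one planted outlier

    # Extreme value analysis: flag points more than 3 standard deviations from the mean
    z = (x - x.mean()) / x.std()
    print(np.where(np.abs(z) > 3)[0])            # index of the outlier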
Cook’s distance
Cook’s distance is the scaled change in fitted values, which is useful for identifying outliers in the X values (observations for predictor variables). Cook’s distance shows the influence of each observation on the fitted response values. An observation with Cook’s distance larger than three times the mean Cook’s distance might be an outlier.
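A sketch using statsmodels’ influence measures and the 3x-mean rule from the card (synthetic data assumed):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(5)
    X = sm.add_constant(rng.normal(size=(100, 1)))
    y = X @ np.array([1.0, 2.0]) + rng.normal(size=100)

    results = sm.OLS(y, X).fit()
    cooks_d, _ = results.get_influence().cooks_distance
    print(np.where(cooks_d > 3 * cooks_d.mean())[0])  # candidate outliers by the 3x-mean rule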
Leverage Point
A leverage point is an observation whose x-value is an outlier while its y-value lies on or near the fitted line (the y-value is not an outlier), so it goes undetected by y-based outlier statistics. Leverage is measured by the diagonal of the hat matrix H = X(XᵀX)⁻¹Xᵀ.
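A sketch with a planted high-leverage point (synthetic data; leverage computed directly from the hat matrix):

    import numpy as np

    rng = np.random.default_rng(6)
    x = np.append(rng.normal(size=50), [10.0])    # extreme x-value
    X = np.column_stack([np.ones_like(x), x])
    y = 1 + 2 * x + np.append(rng.normal(scale=0.1, size=50), [0.0])  # last point sits on the line

    # Leverage is the diagonal of the hat matrix H = X (X'X)^-1 X'
    h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
    print(h.argmax())              # 50: the planted point has by far the largest leverage
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(y[-1] - X[-1] @ beta)    # tiny residual: y-based outlier checks miss it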
p-value
The probability of obtaining test results at least as extreme as the result actually observed, under the assumption that the null hypothesis is correct. A very small p-value means that such an extreme observed outcome would be very unlikely under the null hypothesis. By convention, reject the null hypothesis when p < 0.05 (or whatever significance level was chosen).
t-statistic
The ratio of the difference between a parameter’s estimated value and its hypothesized value to its standard error: t = (β̂ − β₀) / SE(β̂).
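A small worked example with scipy (the estimate, standard error, and degrees of freedom are hypothetical numbers chosen for illustration):

    import numpy as np
    from scipy import stats

    # Hypothetical numbers: estimate 2.5, hypothesized value 0, standard error 0.8, 28 df
    t = (2.5 - 0.0) / 0.8
    p = 2 * stats.t.sf(abs(t), df=28)   # two-sided p-value
    print(t, p)                          # t ~ 3.1, p ~ 0.004: reject at the 0.05 level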
Maximum Likelihood Estimation
A method of estimating the parameters of an assumed probability distribution, given some observed data. This is achieved by maximizing a likelihood function so that, under the assumed statistical model, the observed data is most probable
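A sketch of numerical MLE for a Gaussian with scipy (synthetic data; the result should match the closed-form estimates on the next two cards):

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(7)
    data = rng.normal(loc=3.0, scale=2.0, size=500)

    # Negative log-likelihood of a Gaussian, parameterized by (mu, log_sigma)
    def nll(params):
        mu, log_sigma = params
        sigma = np.exp(log_sigma)   # optimize log(sigma) so sigma stays positive
        return 0.5 * len(data) * np.log(2 * np.pi * sigma**2) \
            + np.sum((data - mu) ** 2) / (2 * sigma**2)

    res = minimize(nll, x0=[0.0, 0.0])
    print(res.x[0], np.exp(res.x[1]))   # close to the sample mean and the 1/N-style std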
Estimated mean of a Gaussian
x̄ = Σx_i / N — the sample mean, which is also the maximum likelihood estimate
Estimated variance of a Gaussian
σ̂² = (1/(n − 1)) Σ(x_i − x̄)², the unbiased sample variance; the maximum likelihood estimate divides by n instead of n − 1
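numpy exposes both estimators through the ddof argument (a small sketch):

    import numpy as np

    rng = np.random.default_rng(8)
    x = rng.normal(size=1000)

    print(np.var(x, ddof=1))   # unbiased estimator, divides by n - 1
    print(np.var(x))           # MLE, divides by n; slightly smaller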
Multivariate Gaussian
A generalization of the one-dimensional (univariate) normal distribution to higher dimensions. One definition is that a random vector is said to be k-variate normally distributed if every linear combination of its k components has a univariate normal distribution
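A sampling sketch with numpy/scipy (the mean vector and covariance matrix are made-up values); it checks the defining property that a linear combination of the components looks univariate normal:

    import numpy as np
    from scipy import stats

    mean = np.array([0.0, 1.0])
    cov = np.array([[2.0, 0.8],
                    [0.8, 1.0]])
    rng = np.random.default_rng(9)
    samples = rng.multivariate_normal(mean, cov, size=5000)

    # Any linear combination of the components should itself be Gaussian
    combo = samples @ np.array([1.0, -2.0])
    stat, pval = stats.shapiro(combo[:500])
    print(pval)   # large p-value: consistent with normality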
If X and Y are joint Gaussians, how do you compute E(X|Y)?
E(X|Y) = E(X) + Cov(X,Y) Cov(Y)⁻¹ (Y − E(Y)). For example, with E(X) = 1, E(Y) = 0, and Cov(X,Y)/Var(Y) = 1/2, this reduces to E(X|Y) = 1 + Y/2.
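A simulation check of the formula under the example parameters above (chosen purely for illustration):

    import numpy as np

    rng = np.random.default_rng(10)
    mean = np.array([1.0, 0.0])                  # E(X) = 1, E(Y) = 0
    cov = np.array([[1.0, 0.5],
                    [0.5, 1.0]])                 # Cov(X,Y) = 1/2, Var(Y) = 1
    xy = rng.multivariate_normal(mean, cov, size=200_000)
    X, Y = xy[:, 0], xy[:, 1]

    # Theory: E(X|Y) = 1 + Y/2, so conditioning near Y = 1 should give about 1.5
    band = np.abs(Y - 1.0) < 0.05
    print(X[band].mean())   # close to 1.5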
Basic Time Series Models
Autoregressive (AR), Integrated (I), Moving Average (MA), Autoregressive Moving Average (ARMA), Autoregressive Integrated Moving Average (ARIMA), Autoregressive Fractionally Integrated Moving Average (ARFIMA); vector-valued versions add an initial V (e.g. VAR, VARMA); Autoregressive Conditional Heteroskedasticity (ARCH) and its relatives (GARCH, TARCH, EGARCH, FIGARCH, CGARCH, etc.); Markov Switching Multifractal (MSMF) for modeling volatility evolution; Hidden Markov Model (HMM). Many of them are in the sktime package.
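A minimal sketch fitting one of these with statsmodels (a simulated AR(1) series, fit as ARIMA(1, 0, 0)):

    import numpy as np
    from statsmodels.tsa.arima.model import ARIMA

    rng = np.random.default_rng(11)
    # Simulate an AR(1) series: y_t = 0.7 * y_{t-1} + noise
    e = rng.normal(size=500)
    y = np.zeros(500)
    for t in range(1, 500):
        y[t] = 0.7 * y[t - 1] + e[t]

    model = ARIMA(y, order=(1, 0, 0)).fit()   # AR(1) is ARIMA with p=1, d=0, q=0
    print(model.params)                        # AR coefficient should be near 0.7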