More Quant Stuff Flashcards
Linear Regression
In statistics, linear regression is a statistical model which estimates the linear relationship between a scalar response and one or more explanatory variables (also known as dependent and independent variables). The case of one explanatory variable is called simple linear regression; for more than one, the process is called multiple linear regression.
Geometric Interpretation of Linear Regression
A line in 2D space, a plane in 3D space, or more generally a hyperplane, depending on how many explanatory variables are in the model
Under what assumptions is Linear Regression unbiased?
Linearity, No Autocorrelation, Multivariate Normality, Homoscedasticity, No/low Multicollinearity
Hypothesis testing of coefficients
- Set the hypothesis
- Set the significance level, criteria for a decision
- Compute the test statistics
- Make a decision
Can test using manual feature elimination (e.g. build a model with all the features, drop the features that have a high p-value, drop redundant features using correlations and VIF) and automated (e.g. RFE and Regularization) techniques
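A minimal sketch (not part of the original cards) of the manual route, assuming statsmodels is available: fit OLS, read coefficient t-statistics and p-values from the summary, and compute VIFs to spot redundant features. The data and column names are made up for illustration.

```python
# Hypothetical data: x3 is nearly collinear with x1, so VIF should flag it.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "x1": rng.normal(size=200),
    "x2": rng.normal(size=200),
})
X["x3"] = X["x1"] * 0.9 + rng.normal(scale=0.1, size=200)  # nearly collinear with x1
y = 2.0 * X["x1"] - 1.0 * X["x2"] + rng.normal(size=200)

Xc = sm.add_constant(X)            # add intercept column
model = sm.OLS(y, Xc).fit()
print(model.summary())             # t-statistics and p-values for each coefficient

# VIF for each non-constant regressor; values well above 5-10 suggest multicollinearity
for i, col in enumerate(Xc.columns[1:], start=1):
    print(col, variance_inflation_factor(Xc.values, i))
```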
Outlier detection
Z-score/Extreme Value Analysis, Probabilistic and Statistical Modeling, Linear Regression Models, Information Theory Models, High Dimensional Outlier Detection Methods
Cook's distance
Cook’s distance is the scaled change in fitted values, which is useful for identifying outliers in the X values (observations for predictor variables). Cook’s distance shows the influence of each observation on the fitted response values. An observation with Cook’s distance larger than three times the mean Cook’s distance might be an outlier.
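A minimal sketch, assuming statsmodels, of computing Cook's distance for each observation and applying the three-times-the-mean rule of thumb from the card. The data is synthetic, with one influential point injected.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 1.5 * x + rng.normal(size=100)
y[0] += 8.0                                        # inject one influential observation

fit = sm.OLS(y, sm.add_constant(x)).fit()
cooks_d = fit.get_influence().cooks_distance[0]    # Cook's distance per observation
flagged = np.where(cooks_d > 3 * cooks_d.mean())[0]
print("possible outliers at indices:", flagged)
```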
Leverage Point
A leverage point is a point whose x-value is an outlier while its y-value lies on (or near) the fitted line, so the y-value is not an outlier. Such a point therefore goes undetected by y-outlier detection statistics.
p-value
The probability of obtaining test results at least as extreme as the result actually observed, under the assumption that the null hypothesis is correct. A very small p-value means that such an extreme observed outcome would be very unlikely under the null hypothesis. By convention, reject the null hypothesis at p < 0.05.
t-statistic
The ratio of the difference between a parameter's estimated value and its hypothesized value to its standard error.
Maximum Likelihood Estimation
A method of estimating the parameters of an assumed probability distribution, given some observed data. This is achieved by maximizing a likelihood function so that, under the assumed statistical model, the observed data is most probable
Estimation of the mean of a Gaussian
Σx_i/N
Variance of Gaussian
σ² = (1/(n-1))Σ(x_i - mean of x)², the unbiased sample variance (the maximum likelihood estimate divides by n rather than n - 1)
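A minimal sketch (synthetic data) showing that numerically maximizing the Gaussian log-likelihood recovers the closed-form estimates above: the sample mean, and a variance that divides by n (the MLE) rather than the n - 1 used in the sample variance on the card.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(2)
x = rng.normal(loc=3.0, scale=2.0, size=500)

def neg_log_lik(params):
    # parametrize by log(sigma) so the optimizer never tries sigma <= 0
    mu, log_sigma = params
    return -norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)).sum()

res = minimize(neg_log_lik, x0=[0.0, 0.0])
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, x.mean())                            # both ~ the sample mean
print(sigma_hat**2, x.var(ddof=0), x.var(ddof=1))  # MLE matches the 1/n variance
```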
Multivariate Gaussian
A generalization of the one-dimensional (univariate) normal distribution to higher dimensions. One definition is that a random vector is said to be k-variate normally distributed if every linear combination of its k components has a univariate normal distribution
If X and Y are joint Gaussians, how do you compute E(X|Y)?
E(X|Y) = E(X) + Cov(X,Y) Cov(Y)^-1 (Y - E(Y))
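A minimal simulation sketch checking the formula: sample (X, Y) from a made-up joint Gaussian, and compare the empirical mean of X near a fixed Y value with the formula's prediction.

```python
import numpy as np

rng = np.random.default_rng(3)
mean = np.array([1.0, 2.0])
cov = np.array([[2.0, 0.8],
                [0.8, 1.0]])
samples = rng.multivariate_normal(mean, cov, size=200_000)
X, Y = samples[:, 0], samples[:, 1]

y0 = 2.5                                   # condition on Y near y0
band = np.abs(Y - y0) < 0.05
empirical = X[band].mean()
theoretical = mean[0] + cov[0, 1] / cov[1, 1] * (y0 - mean[1])
print(empirical, theoretical)              # should be close
```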
Basic Time Series Models
Autoregressive (AR), Integrated (I), Moving-Average (MA), Autoregressive Moving Average (ARMA), Autoregressive Integrated Moving Average (ARIMA), Autoregressive Fractionally Integrated Moving Average (ARFIMA); these can use vector-valued data, with a V prefixed to the name (e.g. VAR, VARMA); Autoregressive Conditional Heteroskedasticity (ARCH) and its relatives (GARCH, TARCH, EGARCH, FIGARCH, CGARCH, etc.); Markov Switching Multifractal (MSMF) for modeling volatility evolution; Hidden Markov Model (HMM) → many of them are in the sktime package
AR(1)
An autoregressive model of order 1: X_t = Σ φ_i X_(t-i) + ε_t, summed from i = 1 to p (p = 1 here), i.e. X_t = φ_1 X_(t-1) + ε_t
MA(1)
Moving Average model of order 1:
X_t = μ + Σ θ_i ε_(t-i) + ε_t, summed from i = 1 to q (q = 1 here), i.e. X_t = μ + θ_1 ε_(t-1) + ε_t
ARMA
Given a time series of data X_t, the ARMA model is a tool for understanding and, perhaps, predicting future values in this series. The AR part involves regressing the variable on its own lagged (i.e., past) values. The MA part involves modeling the error term as a linear combination of error terms occurring contemporaneously and at various times in the past.
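A minimal sketch, assuming statsmodels, that simulates an ARMA(1, 1) process and recovers its coefficients by fitting ARIMA with order (1, 0, 1). The parameter values are made up.

```python
import numpy as np
from statsmodels.tsa.arima_process import ArmaProcess
from statsmodels.tsa.arima.model import ARIMA

ar = np.array([1, -0.6])      # AR lag polynomial 1 - 0.6L, i.e. phi_1 = 0.6
ma = np.array([1, 0.4])       # MA lag polynomial 1 + 0.4L, i.e. theta_1 = 0.4
y = ArmaProcess(ar, ma).generate_sample(nsample=1000)

fit = ARIMA(y, order=(1, 0, 1)).fit()
print(fit.params)             # includes ar.L1 and ma.L1 near 0.6 and 0.4
print(fit.forecast(steps=5))  # forecasts of the next 5 values
```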
Lagrange optimization
A strategy for finding local maxima and minima of a function f(x) subject to an equality constraint g(x) = 0: optimize the Lagrangian L(x, Λ) = f(x) + Λg(x) by setting its partial derivatives with respect to x and Λ equal to zero.
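A minimal worked example with a made-up objective and constraint (maximize f(x, y) = xy subject to x + y = 10), solving the stationarity conditions of the Lagrangian symbolically with sympy.

```python
import sympy as sp

x, y, lam = sp.symbols("x y lam")
f = x * y                 # objective
g = x + y - 10            # constraint g(x, y) = 0
L = f + lam * g           # Lagrangian

# Set all partial derivatives of L to zero and solve
sols = sp.solve([sp.diff(L, x), sp.diff(L, y), sp.diff(L, lam)], [x, y, lam])
print(sols)               # x = 5, y = 5, lam = -5: the constrained optimum
```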
Standard errors of fitted coefficients of models calculation
Calculate the residuals (observed - predicted), calculate the sum of squared residuals (SSE), compute the mean squared error MSE = SSE/(n - k - 1) where n is the number of observations and k is the number of independent variables, compute the variance-covariance matrix of the regression coefficients V = (X'X)^-1 * MSE, and take the square root of each corresponding diagonal element of V.
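A minimal numpy sketch (synthetic data) of the recipe above: residuals, SSE, MSE = SSE/(n - k - 1), V = MSE (X'X)^-1, and standard errors from the square roots of the diagonal of V.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # intercept + k regressors
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # OLS coefficients
resid = y - X @ beta_hat                       # observed - predicted
sse = resid @ resid                            # sum of squared residuals
mse = sse / (n - k - 1)                        # mean squared error
V = mse * np.linalg.inv(X.T @ X)               # variance-covariance of coefficients
std_errors = np.sqrt(np.diag(V))
print(beta_hat, std_errors)
```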
Standard errors of fitted coefficients of sample means calculation
σ / sqrt(n)
Central Limit Theorem
Under appropriate conditions, the distribution of a normalized version of the sample mean converges to a standard normal distribution. This holds even if the original variables themselves are not normally distributed.
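A minimal simulation sketch: standardized means of samples drawn from a decidedly non-normal (exponential) distribution behave approximately like a standard normal. Sample sizes are chosen arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(5)
n, trials = 50, 10_000
samples = rng.exponential(scale=1.0, size=(trials, n))   # mean 1, variance 1
z = (samples.mean(axis=1) - 1.0) / (1.0 / np.sqrt(n))    # normalized sample means

print(z.mean(), z.std())          # ~0 and ~1
print((np.abs(z) < 1.96).mean())  # ~0.95, as for a standard normal
```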
Bootstrapping
Any test or metric that uses random sampling with replacement (e.g. mimicking the sampling process), and falls under the broader class of resampling methods. Bootstrapping assigns measures of accuracy (bias, variance, confidence intervals, prediction error, etc.) to sample estimates. It’s very simple and it can check the stability of the results, but it depends heavily on the estimator used.
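A minimal sketch of bootstrapping a 95% confidence interval for a sample median; the data and the number of resamples are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(6)
data = rng.lognormal(mean=0.0, sigma=1.0, size=300)

boot_medians = np.array([
    np.median(rng.choice(data, size=data.size, replace=True))  # resample with replacement
    for _ in range(2000)
])
lo, hi = np.percentile(boot_medians, [2.5, 97.5])
print(np.median(data), (lo, hi))   # point estimate and 95% bootstrap interval
```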
Lasso Regression
Least Absolute Shrinkage and Selection Operator; a regression analysis method that performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the resulting statistical model.
Ridge Regression
A method of estimating the coefficients of multiple-regression models in scenarios where the independent variables are highly correlated.
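A minimal scikit-learn sketch contrasting the two: Lasso (L1 penalty) drives some coefficients exactly to zero, performing variable selection, while Ridge (L2 penalty) shrinks coefficients and splits weight across highly correlated features. The alpha values are arbitrary.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 5))
X[:, 4] = X[:, 0] + rng.normal(scale=0.01, size=200)   # make x4 highly correlated with x0
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=200)

print(Lasso(alpha=0.1).fit(X, y).coef_)   # sparse: some coefficients are exactly 0
print(Ridge(alpha=1.0).fit(X, y).coef_)   # dense but shrunk, weight shared across x0/x4
```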
Regression Trees
Decision tree learning where the target variable takes continuous (numerical) values. The leaves hold predicted values (typically the mean of the training observations that reach them) and branches represent conjunctions of features that lead to those predictions. (When the target takes a discrete set of values, the trees are called classification trees instead.)
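A minimal scikit-learn sketch of a regression tree, whose leaves predict the mean target of the training points that reach them; the data is synthetic.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(8)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

tree = DecisionTreeRegressor(max_depth=3).fit(X, y)
print(tree.predict([[2.0], [8.0]]))   # piecewise-constant predictions from leaf means
```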
Logistic Regression
When you have a binary set of outcomes, it models the log odds of an event as a linear combination of one or more independent variables.
k-Means Clustering
A method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster centers or cluster centroid), serving as a prototype of the cluster.
Boosting
An ensemble meta-algorithm used in supervised learning, primarily for reducing bias (and also variance); a family of machine learning algorithms that convert weak learners into strong ones.
Linearity assumption
The target and each independent variable have a linear relationship
No (or little) autocorrelation assumption
The residuals are not dependent on (correlated with) one another.
Multivariate Normality assumption
The residuals should be normally distributed, with an average of zero (on a Q-Q plot, normally distributed residuals fall along a straight line).
Homoscedasticity assumption
The variance of the error term is the same across all values of the independent variables (if you plot the residual values vs. predicted values, there is no discernible pattern).
No (or low) Multicollinearity assumption
The independent variables are not (or only weakly) correlated with one another.
Z-score
A metric that indicates how many standard deviations a data point is from the sample's mean, assuming a Gaussian distribution. This makes the z-score a parametric method.
z = (x-μ)/σ
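A minimal numpy sketch of z-score outlier flagging; the |z| > 3 threshold is the usual convention, chosen here for illustration.

```python
import numpy as np

rng = np.random.default_rng(9)
x = rng.normal(loc=10.0, scale=2.0, size=1000)
x[0] = 40.0                                # inject an outlier

z = (x - x.mean()) / x.std()               # z-score of each point
print(np.where(np.abs(z) > 3)[0])          # indices of flagged points
```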
DBSCAN
Density-Based Spatial Clustering of Applications with Noise: a clustering algorithm that can identify outliers
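A minimal scikit-learn sketch: DBSCAN labels points belonging to no dense cluster as -1, which can be treated as outliers. The eps and min_samples values are made up.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(10)
X = np.vstack([rng.normal(0, 0.3, size=(100, 2)),
               rng.normal(5, 0.3, size=(100, 2)),
               [[2.5, 2.5]]])              # one isolated point between the clusters

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(np.where(labels == -1)[0])           # noise points flagged as outliers
```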
Isolation Forest
Isolation Forest's basic principle is that outliers are few and far from the rest of the observations. To build a tree (training), the algorithm randomly picks a feature from the feature space and a random split value between that feature's minimum and maximum, and repeats this over the observations in the training set. To build the forest, an ensemble of such trees is made and their results are averaged.
For prediction, an observation is compared against the splitting value at a node; that node has two child nodes, at which further random comparisons are made. The number of splits the algorithm makes for an instance is called its "path length". As expected, outliers have shorter path lengths than the rest of the observations.
s(x, n) = 2^(-E(h(x))/c(n)), where h(x) is the path length of observation x and c(n) is the average path length of an unsuccessful search in a binary search tree of n nodes.
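A minimal scikit-learn sketch of an Isolation Forest flagging the few far-away observations; the contamination level is an assumption for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(11)
X = rng.normal(size=(500, 2))
X[:5] += 6.0                                # five far-away observations

iso = IsolationForest(n_estimators=100, contamination=0.01, random_state=0).fit(X)
print(np.where(iso.predict(X) == -1)[0])    # indices predicted to be outliers
```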