Quant Flashcards
Confidence Interval for a Predicted Y-Value
predicted Y-value ± (critical t-value)(standard error of forecast)
t test for each variable
= estimated regression coefficient / standard error of the coefficient
df = n-k-1
R square
coefficient of determination
= RSS / SST
regression sum of squares / total sum of squares
= (SST - SSE) / SST, where SSE is the sum of squared errors
=explained variation/ total variation
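A minimal sketch in Python (the SST and SSE values are hypothetical, chosen only to illustrate the relationship):

    # Hypothetical sums of squares for illustration
    SST = 100.0          # total variation
    SSE = 20.0           # unexplained variation (sum of squared errors)
    RSS = SST - SSE      # explained variation (regression sum of squares)
    r_squared = RSS / SST
    print(r_squared)     # 0.8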
SEE standard error of estimate
= square root of mean squared error (MSE)
MSE = SSE / (n - k - 1)
RSS regression sum of squares
MSR (mean regression sum of squares) = RSS / k
F test all coefficients collectively
- one-tailed test
F = MSR / MSE
(mean regression sum of squares / mean squared error)
Reject H0 if F > critical value: at least one of the coefficients is significantly different from zero, i.e., at least one of the independent variables in the regression model makes a significant contribution to the explanation of the dependent variable.
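A short Python sketch tying MSR, MSE, and the F-statistic together (the regression output values n, k, RSS, and SSE are made up; scipy is assumed available):

    from scipy.stats import f as f_dist

    n, k = 60, 3                              # observations, independent variables
    RSS, SSE = 80.0, 40.0                     # explained and unexplained variation (hypothetical)
    MSR = RSS / k
    MSE = SSE / (n - k - 1)
    F = MSR / MSE
    F_crit = f_dist.ppf(0.95, k, n - k - 1)   # one-tailed, 5% significance
    print(F, F_crit, F > F_crit)              # reject H0 if F > critical value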
conditional heteroskedasticity
residual variance related to level of independent variables
Standard errors are unreliable, but the slope coefficients are consistent and unbiased.
Detect with the Breusch-Pagan (BP) chi-square test.
Correct with White-corrected standard errors.
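A hedged sketch of the Breusch-Pagan test using statsmodels; x and y are simulated placeholder data, not a prescribed workflow:

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.diagnostic import het_breuschpagan

    rng = np.random.default_rng(0)
    x = sm.add_constant(rng.normal(size=(100, 2)))        # constant + 2 regressors
    y = x @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=100)
    resid = sm.OLS(y, x).fit().resid
    lm_stat, lm_pvalue, _, _ = het_breuschpagan(resid, x)
    print(lm_stat, lm_pvalue)   # a small p-value suggests conditional heteroskedasticity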
Serial correlation
residuals are correlated.
Too many Type I errors, but the slope coefficients are consistent and unbiased.
Detect with the Durbin-Watson test.
Correct by adjusting the standard errors with the Hansen method.
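A small sketch of the Durbin-Watson statistic computed directly from residuals (resid here is just simulated noise for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    resid = rng.normal(size=100)                         # stand-in for regression residuals
    dw = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)
    print(dw)                                            # near 2 when there is no serial correlation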
Multicollinearity
Two or more independent variables are correlated.
Too many Type II errors and the slope coefficients are unreliable.
Drop one of the correlated variables.
Both multicollinearity and serial correlation bias the standard errors of the slope coefficients.
log-linear trend model
ln(yt) = b0 + b1(t)
best for a data series that grows at a constant rate (exponential growth). If the residuals are correlated or predictable, or the mean is non-constant, a trend model is misspecified and an AR model should be used instead.
AR model
dependent variable is regressed against previous values of itself.
is correctly specified if the autocorrelations of the residuals from the model are not statistically significant at any lag.
use a t-test
if significant, the model is incorrectly specified and a lagged variable at the indicated lag should be added.
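A sketch, assuming statsmodels is available, of fitting an AR(1) model and checking residual autocorrelations; y is simulated purely for illustration:

    import numpy as np
    from statsmodels.tsa.ar_model import AutoReg
    from statsmodels.tsa.stattools import acf

    rng = np.random.default_rng(0)
    y = np.zeros(200)
    for t in range(1, 200):                    # simulate an AR(1) series with b0 = 1.0, b1 = 0.6
        y[t] = 1.0 + 0.6 * y[t - 1] + rng.normal()

    fit = AutoReg(y, lags=1).fit()
    print(fit.params)                          # estimates of b0 and b1
    print(acf(fit.resid, nlags=5))             # residual autocorrelations should be near zero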
covariance stationary
meet the following 3 conditions:
- constant and finite mean
- constant and finite variance
- constant and finite covariance with leading or lagged values
use the Dickey-Fuller test
if the AR series is not covariance stationary, correct with first differencing.
if it is, the mean-reverting level is defined and b1 must be < 1 in absolute value.
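A hedged sketch of the (augmented) Dickey-Fuller test in statsmodels; the series is a simulated random walk used only for illustration:

    import numpy as np
    from statsmodels.tsa.stattools import adfuller

    rng = np.random.default_rng(0)
    y = np.cumsum(rng.normal(size=200))        # a random walk, which has a unit root
    adf_stat, p_value, *_ = adfuller(y)
    print(adf_stat, p_value)                   # a large p-value: fail to reject a unit root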
mean reversion
b0/(1-b1)
value of the variable tends to fall when above its mean and rise when below its mean.
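For example (hypothetical values): if b0 = 1.0 and b1 = 0.6, the mean-reverting level is 1.0 / (1 - 0.6) = 2.5.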
unit root
if the value of the lag coefficient = 1
the time series has a unit root and will follow a random walk process.
a series with a unit root is not covariance stationary.
if there is a unit root, value at t = value at t-1 + a random error
the mean-reverting level b0/(1-b1) is undefined (because b1 = 1)
random walk
one for which the value in one period = the value in the previous period + a random error
with a drift: xt = b0 + xt-1 + error
without drift: xt = xt-1 + error
1st differencing
to correct an autoregressive model that is not covariance stationary
subtract the value of the time series in the immediately preceding period from the current value of the time series to define a new variable
yt = xt - xt-1 (so that b0 = b1 = 0)
the differenced series is covariance stationary
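A minimal numpy sketch of first differencing (x is a simulated random walk used only for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.cumsum(rng.normal(size=200))   # random walk: not covariance stationary
    y = np.diff(x)                        # y_t = x_t - x_(t-1); differenced series is stationary
    print(y[:5])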
seasonality
tested by calculating the autocorrelations of the error terms.
to adjust for seasonality, an additional lag of the variable (at the seasonal lag) is added to the original model.
RMSE root mean squared error
used to assess the predictive accuracy of autoregressive models
the lower the better
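A one-step numpy sketch of RMSE for out-of-sample forecasts (the actual and forecast values are made up):

    import numpy as np

    actual = np.array([1.2, 0.8, 1.5, 1.1])
    forecast = np.array([1.0, 1.0, 1.4, 1.2])
    rmse = np.sqrt(np.mean((actual - forecast) ** 2))
    print(rmse)   # lower RMSE means better predictive accuracy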
cointegration
two time series are economically linked or follow the same trend, and that relationship is not expected to change.
if cointegrated, the error term is covariance stationary and the t-test is reliable.
test the residuals for a unit root using the DF test (with Engle-Granger critical values)
if we reject the null hypothesis of a unit root, the error terms generated by the two time series are covariance stationary and the two series are cointegrated.
If both time series are covariance stationary, model is reliable.
If only the dependent variable time series or only the independent time series is covariance stationary, the model is not reliable.
If neither time series is covariance stationary, you need to check for cointegration.
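A hedged sketch of an Engle-Granger-style cointegration check with statsmodels; the two series are simulated to share a common trend purely for illustration:

    import numpy as np
    from statsmodels.tsa.stattools import coint

    rng = np.random.default_rng(0)
    trend = np.cumsum(rng.normal(size=300))          # shared stochastic trend
    series_a = trend + rng.normal(size=300)
    series_b = 0.5 * trend + rng.normal(size=300)
    t_stat, p_value, _ = coint(series_a, series_b)
    print(t_stat, p_value)   # a small p-value: reject no-cointegration, so the series are cointegrated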
ARCH autoregressive conditional heteroskedasticity
describes the condition where the variance of the residuals in one time period within a time series is dependent on the variance of the residuals in another period.
if present, the standard errors of the regression coefficients in the AR model and the hypothesis tests of these coefficients are invalid.
use generalized least squares
supervised learning
a machine learning technique in which a machine is given labelled input and output data and models the output data based on the input data.
unsupervised learning
labelled data are not provided; the algorithm uses unlabelled data to determine the structure of the data
a machine is given input data in which to identify patterns and relationships, but no output data to model.
deep learning algorithms
algorithms such as neural networks and reinforcement learning learn from their prediction errors and are used for complex tasks such as image recognition and natural language processing.
technique to identify patterns of increasing complexity and may use supervised or unsupervised learning.
deep learning nets typically have many (often > 20) hidden layers
reinforcement learning algorithms have an agent seeking to maximize a defined reward given defined constraints.
overfitting
results from having a large number of independent variables (features), resulting in an overly complex model that may have fit random noise; this improves in-sample forecasting accuracy, but not out-of-sample accuracy.
mitigate with complexity reduction
- a penalty is imposed to exclude features that are not meaningfully contributing to out-of-sample prediction accuracy
and with cross-validation
supervised learning algorithms include:
- penalized regression
- support vector machine
- k-nearest neighbor
- classification and regression tree CART
- ensemble learning (e.g., random forest)
unsupervised machine learning algorithms include:
- principal components analysis PCA
- k-means clustering
- hierarchical clustering
neural networks
comprises an input layer, hidden layers, and an output layer.
consist of nodes connected by links; learning takes place in the hidden layer nodes, each of which consists of a summation operator and an activation function.
Neural networks with many hidden layers (often more than 20) are known as deep learning nets (DLNs) and used in artificial intelligence.
deep learning nets
neural networks with many hidden layers, useful for pattern, speech, and image recognition.
reinforcement learning
seeks to learn from its own errors while maximizing a defined reward
data wrangling
data transformation and scaling
scaling (normalization & standardization)
conversion of data to a common unit of measurement
normalization scales variables between the values of 0 and 1
standardization centers the variables at a mean of 0 and a standard deviation of 1; assumes a normal distribution
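A short numpy sketch contrasting normalization (min-max) and standardization (z-score); x is arbitrary sample data chosen for illustration:

    import numpy as np

    x = np.array([2.0, 4.0, 6.0, 10.0])
    normalized = (x - x.min()) / (x.max() - x.min())      # scaled to the range [0, 1]
    standardized = (x - x.mean()) / x.std(ddof=1)         # mean 0, sample standard deviation 1
    print(normalized, standardized)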
n-grams
technique that defines a token as a sequence of words and is applied when the sequence is important
bag-of-words (BOW)
a procedure that collects all the tokens in a document
a collection of the distinct set of tokens from all the texts in a sample dataset
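A sketch, assuming a recent scikit-learn, of building a bag-of-words and adding 2-grams; the sentences are made up for illustration:

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["rates rise and stocks fall", "stocks fall when rates rise"]
    vectorizer = CountVectorizer(ngram_range=(1, 2))     # unigrams plus 2-grams
    bow = vectorizer.fit_transform(docs)                 # document-term matrix of token counts
    print(vectorizer.get_feature_names_out())
    print(bow.toarray())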
evaluate the fit of a machine learning algorithm
precision (P) = true positives / (false positives + true positives)
=tp/(fp+tp)
recall (R) = true positives / (true positives + false negatives)
=tp/(tp+fn)
accuracy = (true positives + true negatives) / (all positives and negatives)
=(tp+tn)/(all)
F1 score = (2 × P × R) / (P + R)
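A minimal sketch computing these metrics from hypothetical confusion-matrix counts:

    tp, fp, tn, fn = 40, 10, 45, 5          # hypothetical counts for illustration
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * recall / (precision + recall)
    print(precision, recall, accuracy, f1)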
standard error of estimate
=square root of (unexplained variation/(n-k-1))
sample variance of dependent variable
total variation / (n-1)
sample standard deviation = square root of (total variation / (n-1))
test statistic t=
= (estimated bi - hypothesized bi) / standard error of bi
df = n-k-1
simulation is best for
continuous risk, accommodates correlated variables
Correlation across risks can be modeled explicitly using simulation.
2 advantages of using simulation in decision making are
1) Better input estimation
2) Simulation yields a distribution for expected value rather than a point estimate.
Simulations will yield great-looking output, even when the inputs are random.
Scenario analysis
discrete, accommodates correlated variables
decision trees
discrete, sequential; does not accommodate correlated variables
structured data analysis steps:
- conceptualization of the modeling task
- data collection
- data preparation and wrangling (cleaning data)
- Data exploration
- Model training
unstructured data analysis steps:
- text problem formulation
- data collection
- text preparation and wrangling
- text exploration
- modeling
big data is characterized by
- volume (quantity, Terabyte)
- variety (data sources)
- velocity (speed, latency)
- Veracity (reliability of data source)
feature engineering
involves optimizing and improving the selected features; prevents underfitting in the training of the model.
feature selection
involves selecting a subset of tokens in the BOW, reducing feature-induced noise.
appropriate feature selection is a key factor in minimizing model overfitting.
token / tokenization
a token is a word (or character) unit; tokenization is the process of splitting a given text into separate tokens.
K-nearest neighbor (KNN).
More commonly used in classification (but sometimes in regression), this technique is used to classify an observation based on nearness to the observations in the training sample.
need to specify the hyperparameter k (the number of nearest neighbors).
Classification and regression trees (CART).
Classification trees are appropriate when the target variable is categorical, and are typically used when the target is binary
provides a visual explanation of the prediction process, compared to other algorithms that are often described as black boxes due to their opacity.
Principal component analysis (PCA).
Problems associated with too much noise often arise when the number of features in a data set (i.e., its dimension) is excessive
unsupervised machine learning algorithm that reduces highly correlated features into fewer uncorrelated composite variables by transforming the feature covariance matrix.
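A hedged scikit-learn sketch of PCA reducing the feature set to two composite variables; X is random data generated only for illustration:

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))                  # 100 observations, 5 (possibly correlated) features
    pca = PCA(n_components=2)
    components = pca.fit_transform(X)              # uncorrelated composite variables
    print(pca.explained_variance_ratio_)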
Clustering.
Given a data set, clustering is the process of grouping observations into categories based on similarities in their attributes (called cohesion).
K-means
partitions observations into a fixed number (k) of non-overlapping clusters.
unsupervised
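A brief scikit-learn sketch of k-means with k = 3; X is simulated data, and k is the hyperparameter you must specify up front:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = rng.normal(size=(150, 2))
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    print(labels[:10])      # cluster assignment (0, 1, or 2) for each observation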
Hierarchical clustering
Hierarchical clustering is an unsupervised iterative algorithm used to build a hierarchy of clusters.
In an agglomerative (or bottom-up) clustering, we start with one observation as its own cluster and add other similar observations to that group, or form another nonoverlapping cluster. A divisive (or top-down) clustering algorithm starts with one giant cluster, and then it partitions that cluster into smaller and smaller clusters.
neural networks (NNs),
(also called artificial neural networks, or ANNs) are constructed as nodes connected by links. The input layer consists of nodes with values for the features (independent variables).
values are scaled so that the information from multiple nodes is comparable and can be used to calculate a weighted average.
The nodes that follow the input variables are called neurons because they process the input information.
These neurons comprise a summation operator that collates the information (as a weighted average) and passes it on to a (typically nonlinear) activation function, to generate a value from the input values. This value is then passed forward to other neurons in subsequent hidden layers (a process called forward propagation). A related process, backward propagation, is employed to revise the weights used in the summation operator as the network learns from its errors.
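A tiny numpy sketch of what one hidden-layer neuron does: a summation operator (weighted sum) followed by a nonlinear activation function. The inputs and weights are made up:

    import numpy as np

    x = np.array([0.5, 1.5, 2.0])        # scaled input features
    w = np.array([0.2, -0.1, 0.4])       # link weights into one neuron
    bias = 0.1
    z = np.dot(w, x) + bias              # summation operator (weighted sum)
    a = 1.0 / (1.0 + np.exp(-z))         # sigmoid activation function
    print(a)                             # value passed forward to the next layer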
Deep Learning Networks (DLNs)
Deep learning networks (DLNs) are neural networks with many hidden layers (often more than 20). DLNs are often used for image, pattern, and character recognition. The last layer in a DLN calculates the expected probability of an observation belonging to a category, and the observation is assigned to the category with the highest probability. Additional applications of DLNs include credit card fraud detection, autonomous cars, natural language processing, and investment decision-making.
Reinforcement Learning (RL) algorithms
have an agent that seeks to maximize a defined reward given defined constraints. The RL agent does not rely on labeled training data, but rather learns based on immediate feedback from (millions of) trials. When applied to the ancient game of Go, DeepMind’s AlphaGo algorithm was able to beat the reigning world champion. The efficacy of RL in investment decision-making is not yet conclusive.
constraints that are introduced into simulations used in risk analysis are:
1) book value constraints,
2) earnings and cash flow constraints,
3) market value constraints.
Underfitting
describes a machine learning model that is not complex enough to describe the data it is meant to analyze.
An underfit model treats true parameters as noise and fails to identify the actual patterns and relationships.
overfit (too complex) model
will tend to identify spurious relationships in the data. Labelling of input data is related to the use of supervised or unsupervised machine learning techniques.
LASSO (least absolute shrinkage and selection operator)
is a popular type of penalized regression in which the penalty term comprises summing the absolute values of the regression coefficients.
The more included features, the larger the penalty will be. The result is that a feature needs to make a sufficient contribution to model fit to offset the penalty from including it.
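A hedged scikit-learn sketch of LASSO; alpha is the penalty strength (a hyperparameter), and the data are simulated so that only two features are informative:

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))
    y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=200)   # only the first two features matter
    model = Lasso(alpha=0.1).fit(X, y)
    print(model.coef_)    # coefficients of uninformative features are shrunk toward zero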
Curation
is ensuring the quality of data,
by adjusting for bad or missing data.
Word clouds
are a visualization technique for text data in which the size of each word indicates its frequency in the text.
Support vector machine (SVM)
is a linear classifier that aims to seek the optimal hyperplane, i.e. the one that separates the two sets of data points by the maximum margin. SVM is typically used for classification.
supervised ML
issues that might prevent a simulation from generating meaningful output include:
Ad-hoc specification (rather than specification based on sound analysis) of parameter estimates (i.e. the garbage-in, garbage-out problem),
changing correlations across inputs,
non-stationary distributions,
and real data that does not fit (pre-defined) distributions.
Data exploration encompasses
exploratory data analysis, feature selection, and feature engineering.
Stemming
is the process of converting inflected word forms into a base word.
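A small sketch of stemming with NLTK's Porter stemmer (assuming nltk is installed):

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    print([stemmer.stem(w) for w in ["connection", "connected", "connecting"]])
    # all three reduce to the base form "connect"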
Reinforcement learning algorithms
involve an agent that will perform actions that will maximize its rewards over time, taking into consideration the constraints of its environment.
unsupervised learning algorithms:
Dimension reduction
clustering
Generalization
describes the degree to which, when predicting out-of-sample, a machine learning model retains its explanatory power.
Big data is defined as data with
high volume, velocity, and variety. Big data often suffers from low veracity, because it can contain a high percentage of meaningless data.
precision (P) =
true positives / (false positives + true positives)
=tp/(fp+tp)
ratio of correctly predicted positive classes to all predicted positive classes.
recall (R)
= true positives / (true positives + false negatives)
=tp/(tp+fn)
ratio of correctly predicted positive classes to all actual positive classes.
accuracy
= (true positives + true negatives) / (all positives and negatives)
=(tp+tn)/(all)
percentage of correctly predicted classes out of total predictions.
F1 score
= (2 × P × R) / (P + R)
Out-of-sample error
equals bias error + variance error + base error.
Bias error
is the extent to which a model fits the training data.
Variance error
describes the degree to which a model’s results change in response to new data from validation and test samples.
Base error
comes from randomness in the data.
Random forest
is a collection of randomly generated classification trees from the same data set.
random forests can mitigate the problem of overfitting.
increase the signal-to-noise ratio.
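A hedged scikit-learn sketch of a random forest classifier on simulated data (n_estimators is the number of trees in the collection):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 5))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)          # simple made-up classification target
    forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    print(forest.predict(X[:5]))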
sample covariance
= sum of (x - x̄)(y - ȳ) / (n-1)
sample variance = sum of (x - x̄)^2 / (n-1) = total variation / (n-1)
correlation coefficient r
= cov(x,y) / (sx × sy)
= sum of (x - x̄)(y - ȳ)
/ square root of [sum of (x - x̄)^2 × sum of (y - ȳ)^2]
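A quick numpy check of the correlation formula; x and y are arbitrary sample data:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
    cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)
    r = cov_xy / (x.std(ddof=1) * y.std(ddof=1))
    print(r, np.corrcoef(x, y)[0, 1])    # the two values agree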
A probit model
is a model with a qualitative dependent variable, based on the normal distribution.
A logit model is
a model with a qualitative dependent variable, based on the logistic distribution.
A discriminant model
returns a qualitative dependent variable based on a linear relationship that can be used for ranking or classification into discrete states.