3 (QM) - Machine Learning Flashcards
What is machine learning (ML)?
The use of algorithms to make decisions by generalizing (or finding patterns) in a given data set. The goal is to use data to automate decision-making.
Target Variable
The dependent variable (i.e. the “Y” variable). Can be continuous, categorical, or ordinal.
Features
These are the independent variables (i.e. the “X” variables).
Training data set
The sample data set used to fit the ML model.
Hyperparameter
A model input specified by the researcher rather than learned from the data (e.g., the number of clusters, k, in k-means).
Supervised Learning
An ML algorithm uses labeled training data (inputs and outputs are identified) to model relationships in the data and achieve superior forecasting accuracy.
Unsupervised Learning
An ML algorithm is not given labeled training data. Instead, the inputs (i.e. features) are provided without labeled outputs, and the algorithm aims to determine the structure of the data.
Deep Learning
An ML algorithm used for complex tasks such as image recognition, natural language processing, and so on.
What are the three major types of deep learning algorithms?
- Reinforcement learning (RL) - An ML model that learns from its own prediction errors, aiming to maximize a defined reward.
- Neural Networks - Comprise an input layer, hidden layers (which process the input), and an output layer. The nodes in the hidden layers are called neurons, which comprise a summation operator (that calculates a weighted average) and an activation function (a nonlinear function); see the sketch after this list.
- Deep learning networks (DLN) - Neural networks with many hidden layers (20+) that are useful for pattern, speech, and image recognition.
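A minimal sketch of a single hidden-layer forward pass in NumPy; the layer sizes, random weights, and the choice of ReLU as the activation function are illustrative assumptions, not prescribed by the card above.

```python
import numpy as np

def relu(z):
    # Activation function: a nonlinear transform applied to each neuron's sum
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)
x = rng.normal(size=3)        # input layer: 3 features

# Hidden layer: each neuron applies a summation operator (weighted sum
# of the inputs) followed by the nonlinear activation function.
W1 = rng.normal(size=(4, 3))  # 4 hidden neurons, 3 inputs each
b1 = np.zeros(4)
hidden = relu(W1 @ x + b1)

# Output layer: a weighted sum of the hidden-layer outputs
W2 = rng.normal(size=(1, 4))
print(W2 @ hidden)
```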
Overfitting
Do overfit models generalize well to new data?
An issue with supervised ML that results when too many features (i.e. independent variables) are included in the data sample. The resulting overly complex model treats random noise as if it were signal, which improves in-sample forecasting accuracy but reduces the model's accuracy in forecasting out-of-sample data.
No, they do not generalize well to new data. This results in a low out-of-sample R-squared.
Bias error
The in-sample error that results from underfit models with a poor fit.
Variance error
The out-of-sample error that results from overfitted models that do not generalize well.
Base error
The residual errors due to random noise.
Training sample v. Validation sample v. Test sample
What are these three data sets used for?
What are the types of errors associated with each data set?
These data sets are used to measure how well a model generalizes. All three datasets are nonoverlapping.
- Training - Data set used to develop the model. In-sample prediction errors.
- Validation - Data set used for tuning the model. Out-of-sample prediction errors.
- Test - Data set used to evaluate the model using new data. Out-of-sample prediction errors.
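A minimal sketch of splitting one data set into the three nonoverlapping samples using scikit-learn; the 60/20/20 proportions and synthetic data are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# Split off the test sample first (20%), then split the remainder
# into training (60% of the total) and validation (20% of the total).
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```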
Learning curve
Curve that plots the accuracy rate (i.e. 1 - error rate) in the validation or test sample versus the size of the training sample.
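A minimal sketch of computing the points of a learning curve with scikit-learn's learning_curve helper; the logistic-regression model, synthetic data, and fold count are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=1000, random_state=0)

# Accuracy on held-out folds as the training sample grows
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="accuracy")

for n, acc in zip(sizes, val_scores.mean(axis=1)):
    print(f"training size {n}: validation accuracy {acc:.3f}")
```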
Accuracy Rate =
Accuracy Rate = 1 - Error Rate
What two methods do data scientists use to mitigate the problem of overfitting?
- Complexity Reduction - A penalty is imposed to exclude features that aren’t meaningfully contributing to out-of-sample prediction accuracy. The penalty value increases with the number of independent variables used by the model.
- Cross Validation - A sampling technique that estimates out-of-sample error rates directly from the validation sample. This ensures the validation sample is both large and representative of the population, just like the training sample.
What method is used to collect average in-sample and out-of-sample error rates?
K-fold cross validation. The data set is randomly divided into k equal parts.
The training sample comprises k - 1 parts, with one part left out for validation, and the model's error rate is measured on that validation part. The process is repeated k times so that each part serves as the validation sample once, and the error rates are averaged.
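A minimal sketch of k-fold cross validation with scikit-learn; the choice of k = 5, the logistic-regression model, and the synthetic data are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# k = 5: each fold serves as the validation part once, while the
# other k - 1 folds form the training sample.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0))
print(scores)         # out-of-sample accuracy for each fold
print(scores.mean())  # average out-of-sample accuracy
```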
Penalized regression
Supervised learning
An algorithm that reduces overfitting by imposing a penalty that shrinks or excludes nonperforming features.
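A minimal sketch of one penalized regression, LASSO (an L1 penalty that grows with the size of the coefficients), in scikit-learn; the penalty strength alpha and the synthetic data are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# 20 features, only 5 of which actually drive the target
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

model = Lasso(alpha=1.0).fit(X, y)

# The penalty shrinks nonperforming features' coefficients to exactly zero
print((model.coef_ != 0).sum(), "of 20 features kept")
```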
Support vector machine (SVM)
Supervised learning
A linear classification algorithm that separates the data into one of two possible classes based on the model-defined hyperplane.
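A minimal sketch of a linear SVM classifier in scikit-learn; the two-feature synthetic data set is an illustrative assumption.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=2, n_redundant=0,
                           random_state=0)

# A linear kernel fits the separating hyperplane between the two classes
clf = SVC(kernel="linear").fit(X, y)
print(clf.predict(X[:5]))  # class assignments for the first five observations
```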
K-nearest neighbor (KNN)
Supervised learning
An algorithm used to classify an observation based on its nearness to other observations in the training sample.
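A minimal sketch of KNN classification in scikit-learn; the choice of k = 5 neighbors and the synthetic data are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Each new observation is classified by majority vote of the
# 5 nearest observations in the training sample
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(knn.predict(X[:5]))
```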
Classification and regression tree (CART)
Supervised learning
An algorithm used for classifying categorical target variables when there are significant nonlinear relationships among variables.
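A minimal sketch of a classification tree in scikit-learn; the depth limit and synthetic data are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

# The tree splits on features recursively, which lets it capture
# nonlinear relationships; max_depth caps complexity to limit overfitting
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(tree.predict(X[:5]))
```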
Ensemble learning
Supervised learning
An algorithm that combines predictions from multiple models, resulting in a lower average error rate.
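A minimal sketch of ensemble learning via a majority-vote classifier in scikit-learn; the three component models are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

# Combine predictions from three different models; their majority
# vote tends to have a lower average error than any single model
ensemble = VotingClassifier(estimators=[
    ("lr", LogisticRegression(max_iter=1000)),
    ("knn", KNeighborsClassifier()),
    ("tree", DecisionTreeClassifier(random_state=0)),
]).fit(X, y)
print(ensemble.predict(X[:5]))
```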
Random forest
Supervised learning
A variant of the classification and regression tree (CART) whereby a large number of classification trees are trained using data bagged from the same data set.
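A minimal sketch of a random forest in scikit-learn; the tree count and synthetic data are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, random_state=0)

# 100 classification trees, each trained on a bootstrap (bagged)
# sample of the same data set; predictions are aggregated by vote
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(forest.predict(X[:5]))
```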
Principal components analysis (PCA)
Unsupervised learning
An algorithm that summarizes the information in a large number of correlated factors into a much smaller set of uncorrelated factors, called eigenvectors.
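A minimal sketch of PCA in scikit-learn; the synthetic data (10 correlated features built from 3 underlying drivers) is an illustrative assumption.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
drivers = rng.normal(size=(200, 3))
# 10 correlated features generated from 3 underlying drivers, plus noise
X = drivers @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(200, 10))

# Summarize the 10 correlated features into 3 uncorrelated components
pca = PCA(n_components=3).fit(X)
print(pca.explained_variance_ratio_)  # variance captured by each component
```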
K-means clustering
Unsupervised learning
An algorithm that partitions observations into “K” nonoverlapping clusters; a centroid is associated with each cluster.
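A minimal sketch of K-means in scikit-learn; the choice of K = 4 and the synthetic blob data are illustrative assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Partition the observations into K = 4 nonoverlapping clusters
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # the centroid associated with each cluster
print(km.labels_[:10])      # cluster assignments for the first ten observations
```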
Hierarchical clustering
Unsupervised learning
An algorithm that builds a hierarchy of clusters without any predefined number of clusters.
Hierarchical clustering
Agglomerative vs. divisive
A method of building a hierarchy of clusters without any predefined number of clusters.
Agglomerative - Bottom-up. Starts with each observation as its own cluster and merges similar observations into progressively larger clusters.
Divisive - Top-down. Starts with one giant cluster and then partitions that cluster into smaller and smaller clusters.
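A minimal sketch of agglomerative (bottom-up) hierarchical clustering with SciPy; the Ward linkage and synthetic data are illustrative assumptions, and a divisive (top-down) routine is not shown since common libraries implement the agglomerative approach.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=0)

# Agglomerative: start with each observation as its own cluster and
# repeatedly merge the two most similar clusters into a hierarchy
Z = linkage(X, method="ward")

# The hierarchy has no predefined number of clusters; cut it anywhere,
# here into 3 clusters
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```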