3 (QM) - Machine Learning Flashcards
What is machine learning (ML)?
The use of algorithms to make decisions by generalizing from (i.e., finding patterns in) a given data set. The goal is to use data to automate decision-making.
Target Variable
The dependent variable (i.e. the “Y” variable). Can be continuous, categorical, or ordinal.
Features
These are the independent variables (i.e. the “X” variables).
Training data set
The sample data set used to fit the ML model.
Hyperparameter
An ML model input whose value is set by the researcher rather than learned from the data (e.g., the regularization strength or the number of hidden layers).
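A minimal sketch, assuming scikit-learn and NumPy are available, contrasting a hyperparameter (set by the researcher before fitting) with parameters learned from the training data; the specific model (Ridge) and data are illustrative only.

```python
# Sketch: `alpha` is a hyperparameter chosen by the researcher;
# the coefficients are parameters learned from the training data.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                                          # features ("X" variables)
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)   # target ("Y" variable)

model = Ridge(alpha=1.0)    # alpha: hyperparameter, specified before fitting
model.fit(X, y)
print(model.coef_)          # coefficients: learned from the data
```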
Supervised Learning
An ML algorithm uses labeled training data (both inputs and outputs are identified) to model the relationships in the data and improve forecasting accuracy.
Unsupervised Learning
An ML algorithm is not given labeled training data. Instead, the inputs (i.e. features) are provided without any corresponding outputs, and the algorithm aims to determine the structure of the data.
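A minimal sketch, assuming scikit-learn and NumPy, of the difference: the supervised model is fit on labeled data (X and y), while the unsupervised model sees only X and infers structure (here, cluster membership). The choice of LinearRegression and KMeans is illustrative, not prescribed by the curriculum.

```python
# Sketch: supervised learning uses labels; unsupervised learning does not.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.2, size=200)

supervised = LinearRegression().fit(X, y)                             # labeled: learns X -> y
unsupervised = KMeans(n_clusters=2, n_init=10, random_state=1).fit(X) # unlabeled: finds structure

print(supervised.coef_)           # estimated relationship between features and target
print(unsupervised.labels_[:10])  # inferred structure (cluster assignments)
```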
Deep Learning
An ML algorithm, typically based on neural networks with many hidden layers, that is used for complex tasks such as image recognition, natural language processing, and so on.
What are the three major algorithm families associated with deep learning?
- Reinforcement learning (RL) - An ML model that learns from its own prediction errors, adjusting its behavior to maximize a defined reward.
- Neural Networks - Comprise an input layer, hidden layers (which process the input), and an output layer. The nodes in the hidden layers are called neurons; each comprises a summation operator (which calculates a weighted sum of its inputs) and an activation function (a nonlinear function). See the single-neuron sketch after this list.
- Deep learning networks (DLN) - Neural networks with many hidden layers (20+) that are useful for pattern, speech, and image recognition.
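A minimal sketch in plain NumPy of one hidden-layer neuron: a summation operator (weighted sum of the inputs plus a bias) followed by a nonlinear activation function. The sigmoid activation and the specific numbers are illustrative assumptions.

```python
# Sketch: a single neuron = summation operator + activation function.
import numpy as np

def neuron(inputs, weights, bias):
    z = np.dot(weights, inputs) + bias   # summation operator (weighted sum)
    return 1.0 / (1.0 + np.exp(-z))      # activation function (sigmoid, a nonlinear function)

x = np.array([0.5, -1.2, 3.0])           # values arriving from the input layer
w = np.array([0.8, 0.1, -0.4])           # weights (learned during training)
print(neuron(x, w, bias=0.2))            # output passed on to the next layer
```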
Overfitting
Do overfit models generalize well to new data?
An issue with supervised ML that results when the model is too complex, for example because a large number of features (i.e. independent variables) are included in the data sample. The model ends up fitting the random noise in the training data, which improves in-sample forecasting but decreases the accuracy of forecasts on out-of-sample data.
No, they do not generalize well to new data. This results in a low out-of-sample R-squared.
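A minimal sketch, assuming scikit-learn and NumPy, of overfitting: a high-degree polynomial fit to a small, noisy sample where the true relationship is linear. The in-sample R-squared is high, while the R-squared on new data is much lower. The degree, sample sizes, and noise level are arbitrary choices for illustration.

```python
# Sketch: an overly complex model fits the noise and does not generalize.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
x_train = rng.uniform(-3, 3, size=(20, 1))
y_train = x_train.ravel() + rng.normal(scale=1.0, size=20)    # true relation is linear plus noise
x_new = rng.uniform(-3, 3, size=(200, 1))
y_new = x_new.ravel() + rng.normal(scale=1.0, size=200)

overfit = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
overfit.fit(x_train, y_train)

print("in-sample R^2:     ", overfit.score(x_train, y_train))  # high: the model fits the noise
print("out-of-sample R^2: ", overfit.score(x_new, y_new))      # lower: the model does not generalize
```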
Bias error
The in-sample error that results from underfit models with a poor fit to the training data.
Variance error
The out-of-sample error that results from overfitted models that do not generalize well.
Base error
The residual error due to random noise in the data, which no model can eliminate.
Training sample vs. Validation sample vs. Test sample
What are these three data sets used for?
What are the types of errors associated with each data set?
These data sets are used to measure how well a model generalizes. All three datasets are nonoverlapping.
Training - Data set used to develop the model. In-sample prediction errors.
Validation - Data set used for tuning the model. Out-of-sample prediction errors.
Test - Data set used to evaluate the model using new data. Out-of-sample prediction errors.
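A minimal sketch, assuming scikit-learn and NumPy, of a 60/20/20 split into three nonoverlapping samples: training (fit the model), validation (tune it), and test (final out-of-sample evaluation). The split proportions are an assumption, not a prescribed rule.

```python
# Sketch: three nonoverlapping data sets via two successive splits.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 5))
y = rng.normal(size=1000)

X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 600 200 200
```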
Learning curve
Curve that plots the accuracy rate (i.e. 1 - error rate) in the validation or test sample versus the size of the training sample.
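A minimal sketch, assuming scikit-learn and NumPy, that computes the data behind a learning curve: held-out (cross-validated) accuracy at increasing training-sample sizes. The classifier, data set, and training-size grid are illustrative assumptions.

```python
# Sketch: accuracy on held-out data as the training sample grows.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="accuracy",
)

for n, acc in zip(sizes, val_scores.mean(axis=1)):
    print(f"training size {n}: held-out accuracy {acc:.3f}")
```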