3: Machine Learning Flashcards
Data are shuffled randomly and then divided into k equal subsamples.
One subsample is held out as the validation sample, and the other k-1 subsamples are used as training samples
K-fold cross validation
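The split described above can be sketched in plain Python; `k_fold_splits` is a hypothetical helper name, not from the source.

```python
import random

def k_fold_splits(data, k, seed=42):
    """Shuffle data randomly, divide into k equal subsamples; each fold
    serves once as the validation sample while the other k-1 train."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    fold_size = len(shuffled) // k
    folds = [shuffled[i * fold_size:(i + 1) * fold_size] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [x for j, f in enumerate(folds) if j != i for x in f]
        yield training, validation

# Each of the 5 folds is used exactly once for validation.
for train, val in k_fold_splits(list(range(20)), k=5):
    assert len(val) == 4 and len(train) == 16
```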
Technique of combining predictions from a number of models, with the objective of canceling out noise
Ensemble Learning
Results in: more accurate & stable predictions (vs. a single model)
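A minimal sketch of the averaging idea: each model's individual bias is its own, and combining predictions partially cancels it out. The models here are hypothetical stand-ins around a true signal of 2x.

```python
def ensemble_predict(models, x):
    """Average the predictions of several models; uncorrelated errors
    partially cancel, giving more stable predictions than any single model."""
    preds = [m(x) for m in models]
    return sum(preds) / len(preds)

# Hypothetical models: the true signal is 2*x, each model adds its own noise.
models = [lambda x: 2 * x + 0.3,
          lambda x: 2 * x - 0.2,
          lambda x: 2 * x - 0.1]
# The ensemble average lands very close to the true value 2*x.
combined = ensemble_predict(models, 5)
```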
- Nodes connected by links
- Useful in: Supervised Regression & Classification models
- Works well in presence of: nonlinearities & complex interactions among variables
- Recognizes: patterns, clusters, and classifies
Neural Networks
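The nodes-connected-by-links idea can be sketched as a tiny forward pass; the nonlinear activation is what lets the network capture nonlinearities and interactions among variables. The two-input, two-hidden-node network below is a made-up example.

```python
import math

def neuron(inputs, weights, bias):
    """One node: weighted sum of linked inputs passed through a
    nonlinear (sigmoid) activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, hidden_layer, output_node):
    """Feed inputs through one hidden layer of nodes, then an output node."""
    hidden = [neuron(x, w, b) for w, b in hidden_layer]
    w_out, b_out = output_node
    return neuron(hidden, w_out, b_out)

# Hypothetical weights: 2 inputs -> 2 hidden nodes -> 1 output.
hidden_layer = [([1.0, -1.0], 0.0), ([-1.0, 1.0], 0.0)]
output_node = ([2.0, 2.0], -2.0)
y = forward([0.5, 0.25], hidden_layer, output_node)
```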
Unsupervised neural networks with many hidden layers (often >20); with reinforcement learning, they learn from their own prediction errors
Used for: complex tasks; image, pattern, & character recognition
Deep Learning Networks
- Algorithm learns from success & mistakes
- Seeking to maximize reward and minimize punishment
- Defined constraints
Reinforcement Learning
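Learning from success & mistakes while maximizing reward can be sketched with a simple epsilon-greedy bandit; the payoff probabilities and helper name below are illustrative assumptions, not from the source.

```python
import random

def epsilon_greedy_bandit(reward_probs, episodes=2000, eps=0.1, seed=0):
    """Try actions, track each action's average reward, and increasingly
    favor the action with the highest estimated reward (exploit),
    while still exploring occasionally (the eps constraint)."""
    rng = random.Random(seed)
    n = len(reward_probs)
    counts = [0] * n
    values = [0.0] * n
    for _ in range(episodes):
        if rng.random() < eps:                         # explore
            a = rng.randrange(n)
        else:                                          # exploit best estimate
            a = max(range(n), key=lambda i: values[i])
        reward = 1.0 if rng.random() < reward_probs[a] else 0.0
        counts[a] += 1
        values[a] += (reward - values[a]) / counts[a]  # running average
    return values

# Hypothetical 3-armed bandit: arm 2 pays off most often.
values = epsilon_greedy_bandit([0.2, 0.5, 0.8])
```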
Inputs & outputs are identified for the computer, and the algorithm uses this labeled training data to model relationships
Supervised Learning
Computer is provided unlabeled data that the algorithm uses to determine the structure of the data
Unsupervised Learning
Least Absolute Shrinkage and Selection Operator (LASSO) is useful in building:
Penalized regression model
Parsimonious models, through feature reduction
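The feature-reduction effect comes from LASSO's L1 penalty, which shrinks coefficients and sets small ones exactly to zero. A minimal sketch of that soft-thresholding step (the coefficient values are made up):

```python
def soft_threshold(coef, penalty):
    """LASSO's L1 penalty shrinks each coefficient toward zero; weak
    coefficients are set exactly to zero, dropping their features and
    leaving a parsimonious model."""
    if coef > penalty:
        return coef - penalty
    if coef < -penalty:
        return coef + penalty
    return 0.0

# Hypothetical unpenalized coefficients: the two weak features are dropped.
ols = [2.5, -0.3, 0.1, -1.8]
lasso = [soft_threshold(c, penalty=0.5) for c in ols]
```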
K-Nearest Neighbors, investment applications include:
Used in: classification & regression
- predicting bankruptcy
- assigning bond rating classes
- predicting stock prices
- creating customized indices
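A classification example such as assigning a bond rating class can be sketched as a majority vote among the k nearest labeled neighbors; the two-feature bond data below is entirely hypothetical.

```python
from collections import Counter

def knn_classify(train, query, k=3):
    """Classify a new observation by majority vote among its k nearest
    labeled neighbors (Euclidean distance)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    nearest = sorted(train, key=lambda row: dist(row[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical bond features (leverage, coverage) labeled with rating class.
train = [((0.2, 8.0), "AA"), ((0.3, 7.0), "AA"), ((0.25, 7.5), "AA"),
         ((0.8, 1.5), "B"), ((0.7, 2.0), "B")]
rating = knn_classify(train, (0.22, 7.8))
```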
Random Forest investment applications include:
- factor based asset allocation
- prediction models for IPO success
Linear relationships
A penalized regression model tries to use a limited number of most important features that…
explain the variation in the dependent variable
Example: monthly returns on 100 stocks
Overfitting occurs when:
the model fits the training data too well, displaying nonlinear characteristics
Bias error: low
Variance error: high
Generalization is the degree to which the model retains its explanatory power when:
predicting out of sample
Bias error is the degree to which:
the model fits the training data
Variance error shows how much the model responds to:
new data
How to prevent overfitting:
- don’t let the model become too complex
- proper data sampling using cross validation (k-fold)
Complexity Reduction:
Dimensional Reduction
Use: PCA
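Dimension reduction with PCA projects data onto the direction of maximum variance. A pure-Python sketch for the 2-D case (the analytic angle formula applies only to a 2x2 covariance matrix; data and helper name are illustrative):

```python
import math

def pca_first_component(points):
    """Project 2-D data onto its first principal component (the direction
    of maximum variance), reducing dimensionality from 2 to 1."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    # Entries of the 2x2 covariance matrix.
    sxx = sum(x * x for x, _ in centered) / n
    syy = sum(y * y for _, y in centered) / n
    sxy = sum(x * y for x, y in centered) / n
    # Principal-axis angle (eigenvector of a 2x2 symmetric matrix).
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    ux, uy = math.cos(theta), math.sin(theta)
    return [x * ux + y * uy for x, y in centered]

# Points lying near the line y = x: one component captures nearly everything.
points = [(0, 0.1), (1, 0.9), (2, 2.1), (3, 2.9), (4, 4.1)]
scores = pca_first_component(points)
```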
With supervised data, the training data contains:
ground truth
Supervised ML algorithm
Classification focuses on sorting observations into:
distinct categories:
* pass or fail
Regression-based models use:
continuous variables
CART & Random Forests are used for:
complex & nonlinear data
Clustering unsupervised data:
K-means is used for:
complex & linear data
with a known number of clusters, k
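The k-clusters idea can be sketched with Lloyd's algorithm: alternate assigning each point to its nearest centroid and recomputing centroids. The two obvious groups below are made-up data; k = 2 is assumed known in advance.

```python
def k_means(points, k, iters=20):
    """Partition points into a known number k of clusters by alternating
    nearest-centroid assignment and centroid update (Lloyd's algorithm)."""
    centroids = points[:k]  # naive initialization from the first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        # Recompute each centroid as the mean of its cluster (keep old if empty).
        centroids = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else c
                     for cl, c in zip(clusters, centroids)]
    return centroids, clusters

# Two well-separated groups; k = 2 clusters are known in advance.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (5.0, 5.0), (5.1, 4.9), (4.9, 5.1)]
centroids, clusters = k_means(points, k=2)
```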