Machine Learning Flashcards
Hierarchical clustering is most likely used when the problem involves
Classifying unlabeled data
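A minimal sketch of how unlabeled points can be grouped, assuming the bottom-up (agglomerative) variant with single linkage on one-dimensional data. The function name, the points, and the choice of linkage are illustrative assumptions, not part of the card.

```python
def agglomerate(points, n_clusters):
    # Start with every observation in its own cluster
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the two closest members
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge the two closest clusters and repeat
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

points = [1.0, 1.2, 5.0, 5.1, 9.0]  # hypothetical unlabeled data
three_groups = agglomerate(points, 3)
two_groups = agglomerate(points, 2)
print(three_groups)
print(two_groups)
```

No labels are supplied; the groups come only from the distances between observations.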
What is supervised machine learning?
Involves training an algorithm to take a set of inputs (X variables) and find a model that best relates them to outputs (Y variables).
Training algorithm - set of inputs - find a model that relates them to outputs.
What is unsupervised machine learning?
Like supervised learning, it trains an algorithm on data, but it does not make use of labeled training data.
We give it data and expect the algorithm to make sense of it.
What is overfitting?
ML algorithms can produce overly complex models that fit the training data too well and therefore do not generalize well to new data.
The prediction model fitted to the training sample (in-sample data) is too complex.
A model that fits the training data does not work well with new data.
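A small sketch of the idea, assuming a hypothetical data-generating process (y = 2x plus noise) and two illustrative models: a straight line and a "memorizer" that returns the y of the closest training point. Neither model nor the data comes from the card.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

def make_data(xs):
    # Hypothetical process: y = 2x + Gaussian noise
    return [(x, 2.0 * x + rng.gauss(0.0, 1.0)) for x in xs]

train = make_data([i / 10 for i in range(200)])
test = make_data([i / 10 + 0.05 for i in range(200)])

def fit_line(data):
    # Ordinary least squares with a single feature
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    slope = (sum((x - mx) * (y - my) for x, y in data)
             / sum((x - mx) ** 2 for x, _ in data))
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def fit_memorizer(data):
    # Overly complex model: reproduces the nearest training observation
    return lambda x: min(data, key=lambda p: abs(p[0] - x))[1]

def mse(model, data):
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

line, memo = fit_line(train), fit_memorizer(train)
train_mse_line, test_mse_line = mse(line, train), mse(line, test)
train_mse_memo, test_mse_memo = mse(memo, train), mse(memo, test)
print("line:      train", train_mse_line, "test", test_mse_line)
print("memorizer: train", train_mse_memo, "test", test_mse_memo)
```

The memorizer has zero in-sample error because it has also learned the noise, and that is why it does worse than the simple line on the new data.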
Name Supervised ML Algorithms
Penalized regression
Support Vector Machine (SVM)
K-Nearest Neighbor (KNN)
Classification and Regression Trees (CART)
Ensemble learning
Random Forest
Name unsupervised ML Algorithms
Principal component analysis
K-means clustering
Hierarchical clustering
High Bias Error in ML
High Bias Error means the model does not fit the training data well.
High Variance Error in ML
High variance error means the model does not predict well on the test (out-of-sample) data.
Name Dimension Reduction in ML
Principal component analysis (unsupervised ML)
Penalized regression (supervised ML)
What does Penalized Regression do?
- Similar to maximizing adjusted R-squared.
- Dimension reduction
- Eliminates/minimizes overfitting
Regression coefficients are chosen to minimize the sum of squared errors, plus a penalty term that increases with the number of included features.
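A sketch of that objective, assuming an L1 (LASSO-style) penalty, a penalty strength LAMBDA of 1.0, a model without an intercept, and made-up data. All names and numbers here are illustrative assumptions.

```python
# Hypothetical toy data: two features, four observations
X = [(1.0, 1.0), (2.0, 0.0), (3.0, 1.0), (4.0, 0.0)]
y = [2.1, 4.0, 6.1, 8.0]
LAMBDA = 1.0  # assumed penalty strength (a hyperparameter)

def penalized_sse(coefs, lam=LAMBDA):
    # Sum of squared errors of a linear model without intercept
    sse = sum((yi - sum(b * x for b, x in zip(coefs, row))) ** 2
              for row, yi in zip(X, y))
    # L1 penalty: grows as more features get non-zero coefficients
    penalty = lam * sum(abs(b) for b in coefs)
    return sse + penalty

full = (2.0, 0.1)    # uses both features, fits the sample almost exactly
sparse = (2.0, 0.0)  # drops the second feature
obj_full = penalized_sse(full)
obj_sparse = penalized_sse(sparse)
print(obj_full, obj_sparse)
```

The sparse model has a slightly larger squared error but a lower penalized objective (about 2.02 versus 2.1), so the second feature is not worth keeping at this penalty strength.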
What is SVM?
Support Vector Machine
It is used for classification, regression, and outlier detection.
In its basic form, suited to classifying data that is linearly separable (not complex or non-linear).
Is a linear classifier that determines the hyperplane that optimally separates the observations into two sets of data points.
In its basic linear (hard-margin) form, does not require any hyperparameter; soft-margin and kernel variants add hyperparameters such as the penalty C.
Maximizes the probability of making a correct prediction by determining the boundary that is furthest from all observations.
Outliers that lie far from the boundary on the correct side do not affect either the support vectors or the discriminant boundary.
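A sketch of the "furthest from all observations" criterion. It does not train an SVM; it only compares two hand-picked candidate boundaries on hypothetical 2D data by their margin, the smallest distance from any observation to the boundary.

```python
import math

def margin(w, b, data):
    # Smallest signed distance from any observation to the line w.x + b = 0;
    # negative if some observation falls on the wrong side.
    norm = math.hypot(*w)
    return min(label * (w[0] * x[0] + w[1] * x[1] + b) / norm
               for x, label in data)

# Hypothetical observations: (features, class label +1 or -1)
data = [((3, 3), 1), ((4, 3), 1), ((4, 4), 1),
        ((1, 1), -1), ((1, 2), -1), ((2, 1), -1)]

margin_a = margin((1, 1), -5.0, data)  # candidate boundary x + y = 5
margin_b = margin((1, 1), -4.5, data)  # candidate boundary x + y = 4.5

# A far-away observation on the correct side leaves the margin unchanged
margin_b_outlier = margin((1, 1), -4.5, data + [((10, 10), 1)])
print(margin_a, margin_b, margin_b_outlier)
```

Boundary x + y = 4.5 is preferred because its margin is larger; the observations that set the margin, here (3, 3), (1, 2) and (2, 1), play the role of support vectors.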
What is K-Nearest Neighbor?
Classification
Classifies a new observation by finding similarities (nearness) to the existing data.
Makes no assumption about the distribution of the data.
It is non-parametric.
KNN results can be sensitive to the inclusion of irrelevant or correlated features, so it may be necessary to select features manually.
Thereby removing less relevant information.
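A minimal KNN sketch, assuming Euclidean distance, a majority vote, and k = 3. The data is made up; the second feature is deliberately irrelevant and on a large scale to show the sensitivity described above.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    # Sort the existing observations by distance to the new one
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    # Majority vote among the k nearest labels
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical data: (relevant feature, irrelevant feature), label
train_full = [((1.0, 0.0), "A"), ((2.0, 5.0), "A"), ((1.5, 3.0), "A"),
              ((8.0, 100.0), "B"), ((9.0, 95.0), "B"), ((8.5, 98.0), "B")]
query_full = (1.5, 99.0)

# Using both features, the irrelevant one dominates the distance
with_irrelevant = knn_predict(train_full, query_full)

# Manual feature selection: keep only the relevant feature
train_rel = [((x[0],), label) for x, label in train_full]
relevant_only = knn_predict(train_rel, (query_full[0],))
print(with_irrelevant, relevant_only)
```

With both features the query is assigned to "B"; after dropping the irrelevant feature it is assigned to "A", the class whose relevant feature values it actually resembles.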
What is CART?
Classification and Regression Trees
Part of supervised ML
Typically applied when the target is binary.
If the goal is regression, the prediction is the mean of the target values in the terminal node.
Makes no assumptions about the characteristics of the training data, so if left unconstrained it can potentially learn the training data perfectly.
To avoid overfitting, regularization parameters can be added, such as the maximum depth of the tree.
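A sketch of a one-feature regression tree, assuming splits are chosen to minimize the sum of squared errors and that max_depth is the only regularization parameter. The function names and data are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

def sse(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def build_tree(data, max_depth):
    # data: list of (x, y) pairs with a single feature x
    ys = [y for _, y in data]
    xs = sorted({x for x, _ in data})
    if max_depth == 0 or len(xs) < 2:
        return {"leaf": mean(ys)}  # prediction = mean of the terminal node
    best = None
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2  # candidate threshold between neighboring x values
        left = [(x, y) for x, y in data if x <= t]
        right = [(x, y) for x, y in data if x > t]
        cost = sse([y for _, y in left]) + sse([y for _, y in right])
        if best is None or cost < best[0]:
            best = (cost, t, left, right)
    _, t, left, right = best
    return {"threshold": t,
            "left": build_tree(left, max_depth - 1),
            "right": build_tree(right, max_depth - 1)}

def predict(tree, x):
    while "leaf" not in tree:
        tree = tree["left"] if x <= tree["threshold"] else tree["right"]
    return tree["leaf"]

data = [(1, 10.0), (2, 12.0), (3, 11.0), (8, 30.0), (9, 32.0), (10, 31.0)]
stump = build_tree(data, max_depth=1)  # constrained tree
deep = build_tree(data, max_depth=10)  # effectively unconstrained here
print(predict(stump, 2), predict(stump, 9))
print([predict(deep, x) for x, _ in data])
```

The depth-1 tree predicts the node means (11.0 and 31.0), while the deep tree reproduces every training value exactly, which is the perfect in-sample fit that the depth limit is meant to prevent.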
What are the 3 types of layers in a neural network?
- Input layer
- Hidden layer
- Output layer
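A sketch of one forward pass through the three layers, assuming two inputs, two hidden nodes with a ReLU activation, and one output node. The weights and biases are made-up values; in practice they would be learned from training data.

```python
def relu(z):
    # A common non-linear activation function (an assumption for this sketch)
    return max(0.0, z)

def layer(values, weights, biases, activation=None):
    # Each node takes a weighted sum of the previous layer plus a bias
    sums = [sum(w * v for w, v in zip(row, values)) + b
            for row, b in zip(weights, biases)]
    return [activation(s) for s in sums] if activation else sums

inputs = [1.0, 2.0]                             # input layer
hidden = layer(inputs,
               weights=[[0.5, -1.0], [1.0, 1.0]],
               biases=[0.0, -1.0],
               activation=relu)                 # hidden layer
output = layer(hidden,
               weights=[[1.0, 0.5]],
               biases=[0.25])                   # output layer
print(hidden, output)
```

With these values the hidden layer produces [0.0, 2.0] and the output layer produces [1.25].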
What are non-linear functions more susceptible to?
Variance error and overfitting