Machine Learning Flashcards
Hierarchical clustering is most likely used when the problem involves
Classifying unlabeled data
and when the number of categories is unknown
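A minimal sketch of agglomerative (bottom-up) hierarchical clustering, assuming one-dimensional data and single linkage. The function name `agglomerative` and the `max_distance` stopping rule are illustrative choices, not part of the cards. No number of clusters is supplied; it falls out of the data.

```python
def agglomerative(points, max_distance):
    """Bottom-up clustering of 1-D points using single linkage."""
    clusters = [[p] for p in points]  # every point starts as its own cluster
    while len(clusters) > 1:
        best = None
        # Find the two clusters whose closest members are nearest to each other.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dist = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        dist, i, j = best
        if dist > max_distance:
            break  # remaining clusters are too far apart to merge
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


clusters = agglomerative([1.0, 1.2, 1.1, 5.0, 5.3, 9.8], max_distance=1.0)
print(len(clusters))  # number of groups discovered from the data
```

Raising or lowering `max_distance` changes how many groups survive, which is the practical equivalent of cutting the dendrogram at a different height.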
What is supervised machine learning?
Involves training an algorithm to take a set of inputs (X variables) and find a model that best relates them to outputs (Y variables)
In short: a training algorithm takes a set of inputs and finds a model that relates them to the outputs.
What is unsupervised machine learning?
Same as supervised learning, but does not make use of labeled training data.
We give it data and expect the algorithm to make sense of it.
What is overfitting?
ML algorithms can produce overly complex models that fit the training data too well and therefore do not generalize well to new data.
The prediction model fitted to the training sample (in-sample data) is too complex.
A model fitted this closely to the training data does not perform well on new data.
Name Supervised ML Algorithms
Penalized regression
Support Vector Machine (SVM)
K-Nearest Neighbor (KNN)
Classification and Regression Trees (CART)
Ensemble learning
Random Forest
Name unsupervised ML Algorithms
Principal components analysis
K-means clustering
Hierarchical clustering
High Bias Error in ML
High Bias Error means the model does not fit the training data well.
High Variance Error in ML
High variance error means the model does not predict well on the test data
Name dimension reduction techniques in ML
Principal components analysis (unsupervised ML)
Penalized regression (supervised ML)
What does Penalized Regression do?
- Similar to maximizing adjusted R-squared.
- Dimension reduction
- Eliminates or minimizes overfitting
Regression coefficients are chosen to minimize the sum of the squared error, plus a penalty term that increases with the number of included features
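The objective described above can be sketched as follows. This assumes a LASSO-style L1 penalty (the sum of absolute slope coefficients scaled by a hyperparameter); the names `penalized_sse` and `lam` and the toy data are illustrative, not from the cards.

```python
def penalized_sse(X, y, intercept, coefs, lam):
    """Sum of squared errors plus an L1 (LASSO-style) penalty on the slopes."""
    sse = 0.0
    for row, target in zip(X, y):
        pred = intercept + sum(b * x for b, x in zip(coefs, row))
        sse += (target - pred) ** 2
    penalty = lam * sum(abs(b) for b in coefs)  # grows as more features get non-zero weight
    return sse + penalty


# Toy data: y depends only on the first feature.
X = [[1.0, 0.5], [2.0, 0.1], [3.0, 0.9]]
y = [2.0, 4.0, 6.0]

one_feature = penalized_sse(X, y, 0.0, [2.0, 0.0], lam=0.5)
two_features = penalized_sse(X, y, 0.0, [2.0, 0.1], lam=0.5)
print(one_feature, two_features)
```

Here the model that also uses the second feature scores worse, so the penalty pushes that coefficient towards zero, which is how penalized regression doubles as dimension reduction.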
What is SVM?
Support Vector Machine
Used for classification, regression, and outlier detection.
In its basic linear form, it suits data that can be separated linearly, i.e. data that is not complex or non-linear.
It is a linear classifier that determines the hyperplane that optimally separates the observations into two sets of data points.
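A hard-margin linear SVM does not require any hyperparameter; soft-margin and kernel variants add tuning parameters, such as a penalty for misclassified observations.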
It maximizes the probability of making a correct prediction by determining the boundary that is furthest from all observations.
Outliers that lie far from the boundary on the correct side do not affect either the support vectors or the discriminant boundary.
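A minimal sketch of the margin idea, assuming two candidate hyperplanes are already given. It does not train an SVM; it only measures how far each boundary is from its closest observation. The helper names `signed_distance` and `margin` and the toy data are hypothetical.

```python
import math


def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def margin(w, b, points, labels):
    """Smallest label-adjusted distance; positive only if every point is on its correct side."""
    return min(label * signed_distance(w, b, x) for x, label in zip(points, labels))


points = [(1.0, 1.0), (2.0, 2.0), (4.0, 4.0), (5.0, 5.0)]
labels = [-1, -1, 1, 1]

candidate_a = margin((1.0, 1.0), -6.0, points, labels)  # boundary x1 + x2 = 6
candidate_b = margin((1.0, 1.0), -5.0, points, labels)  # boundary x1 + x2 = 5

# A far-away observation on the correct side leaves the margin unchanged.
with_outlier = margin((1.0, 1.0), -6.0, points + [(10.0, 10.0)], labels + [1])
print(candidate_a, candidate_b, with_outlier)
```

Boundary A sits midway between the two closest observations of each class, (2, 2) and (4, 4), so its margin is wider than B's. Those closest observations play the role of the support vectors.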
What is K-Nearest Neighbor?
Classification
Classify new observation by finding similarities in the existing data.
Makes no assumption about the distribution of the data.
It is non-parametric.
KNN results can be sensitive to the inclusion of irrelevant or correlated features, so it may be necessary to select features manually,
thereby removing less relevant information.
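A minimal KNN sketch, assuming Euclidean distance and a simple majority vote. The function name `knn_predict`, the default `k=3`, and the toy data are illustrative. Requires Python 3.8+ for `math.dist`.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
train_labels = ["low", "low", "low", "high", "high", "high"]

prediction = knn_predict(train_points, train_labels, (2.0, 2.0), k=3)
print(prediction)
```

No model is fitted and no distribution is assumed: the prediction comes straight from the stored observations. Every feature enters the distance, which is why irrelevant features can distort the result.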
What is CART?
Classification and Regression Trees
Part of supervised ML
Typically applied when the target is binary.
If the goal is regression, the prediction would be the mean of the values of the terminal node.
Makes no assumption about the characteristics of the training data, so if left unconstrained, it can potentially learn the training data perfectly.
To avoid overfitting, regularization parameters can be added, such as the maximum depth of the tree.
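A minimal regression-tree sketch, assuming a single feature and splits chosen to minimize the sum of squared errors. The names `build_tree`, `predict`, and `max_depth` are illustrative. Leaves predict the mean of their observations, and `max_depth` acts as the regularization parameter.

```python
def mean(values):
    return sum(values) / len(values)


def sse(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def build_tree(xs, ys, max_depth):
    """Grow a one-feature regression tree; each leaf predicts the mean target."""
    if max_depth == 0 or len(set(xs)) < 2:
        return {"leaf": mean(ys)}
    best = None
    cuts = sorted(set(xs))
    for lo, hi in zip(cuts, cuts[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            best = (cost, threshold)
    threshold = best[1]
    left_pairs = [(x, y) for x, y in zip(xs, ys) if x <= threshold]
    right_pairs = [(x, y) for x, y in zip(xs, ys) if x > threshold]
    return {
        "threshold": threshold,
        "left": build_tree([x for x, _ in left_pairs], [y for _, y in left_pairs], max_depth - 1),
        "right": build_tree([x for x, _ in right_pairs], [y for _, y in right_pairs], max_depth - 1),
    }


def predict(tree, x):
    while "leaf" not in tree:
        tree = tree["left"] if x <= tree["threshold"] else tree["right"]
    return tree["leaf"]


xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]

shallow = build_tree(xs, ys, max_depth=1)  # constrained: two leaves, each a group mean
deep = build_tree(xs, ys, max_depth=5)     # effectively unconstrained on this data
print(predict(shallow, 1.0), predict(deep, 1.0))
```

On this toy data the deep tree reproduces every training value exactly, while the depth-1 tree only returns the mean of each half.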
What are the 3 types of layer in a neural network?
- Input layer
- Hidden layer
- Output layer
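A minimal forward pass through the three layer types, assuming fully connected layers with a sigmoid activation. The weights and biases are made-up numbers chosen only for illustration, and no training is shown.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(weights . inputs + bias) for each node."""
    return [
        activation(sum(w * x for w, x in zip(node_weights, inputs)) + bias)
        for node_weights, bias in zip(weights, biases)
    ]


# Hypothetical parameters: 2 inputs -> 3 hidden nodes -> 1 output node.
hidden_weights = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
hidden_biases = [0.0, 0.1, -0.1]
output_weights = [[0.3, -0.6, 0.9]]
output_biases = [0.05]

input_layer = [1.0, 2.0]  # raw feature values
hidden_layer = dense(input_layer, hidden_weights, hidden_biases, sigmoid)
output_layer = dense(hidden_layer, output_weights, output_biases, sigmoid)
print(hidden_layer, output_layer)
```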
What are non-linear functions more susceptible to?
Variance error and overfitting
What are linear functions more susceptible to?
Bias error and underfitting
The main distinction between clustering and classification algorithms is that
In clustering, the groups are determined by the data
In classification, they are determined by the analyst/researcher
What is K-Means clustering in ML?
K-means partitions observations into a fixed number, k, of non-overlapping clusters.
Each cluster is characterized by its centroid, and each observation is assigned by the algorithm to the cluster with the centroid to which that observation is closest.
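A minimal sketch of the assign-then-update loop (Lloyd's algorithm), assuming Euclidean distance and starting centroids supplied by the caller. The function name `k_means` and the toy data are illustrative. Requires Python 3.8+ for `math.dist`.

```python
import math


def k_means(points, centroids, iterations=10):
    """Assign each point to its nearest centroid, then recompute the centroids."""
    assignments = []
    for _ in range(iterations):
        assignments = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # assignments have stabilized
        centroids = new_centroids
    return centroids, assignments


points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (8.0, 8.0), (8.5, 9.0), (9.0, 8.5)]
centroids, assignments = k_means(points, centroids=[(1.0, 1.0), (9.0, 8.5)])
print(centroids, assignments)
```

The value of k is fixed by the number of starting centroids, here k = 2. The result can depend on where those starting centroids are placed.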
High bias error and high variance error are indicative of…
Underfitting
High bias error = model does not fit the training data well.
High variance error = model does not predict well on test data.
The combination of both results in an underfitted model.
Low bias error but high variance error is indicative of…
Overfitting
Bias error = model does not fit the training data well.
Variance error = Model does not predict well on test data.
What are linear models more susceptible to?
Bias Error (underfitting)
What are non-linear models more prone to?
Variance Error
(overfitting)
What is Principal Components Analysis (PCA)?
It is part of unsupervised ML
Dimension Reduction
Used to reduce highly correlated features of data into a few main uncorrelated composite variables.
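A minimal sketch of the idea, assuming two highly correlated features and using power iteration to find only the first principal component. The function names and the toy data are illustrative. Power iteration is one simple way to obtain the dominant eigenvector of the covariance matrix; it is not the only way PCA is computed.

```python
import math


def covariance_matrix(rows):
    """Sample covariance matrix of observations given as rows of features."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def first_principal_component(cov, iterations=200):
    """Power iteration: converges to the eigenvector with the largest eigenvalue."""
    vec = [1.0] * len(cov)  # assumed not orthogonal to the dominant eigenvector
    for _ in range(iterations):
        product = [sum(row[j] * vec[j] for j in range(len(vec))) for row in cov]
        norm = math.sqrt(sum(v * v for v in product))
        vec = [v / norm for v in product]
    return vec


# Two features that move almost in lockstep.
rows = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1)]

pc = first_principal_component(covariance_matrix(rows))
# Each two-feature observation collapses to one composite value (uncentered projection).
scores = [sum(x * w for x, w in zip(row, pc)) for row in rows]
print(pc, scores)
```

Because the two features are nearly identical, the component weights them almost equally, and a single composite variable carries most of the information in both.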
What are the 3 types of error in ML?
Bias error
Variance error
Base error
What is variance error in ML?
Variance error is how much the model’s results change in response to new data from validation and test samples.
Unstable models pick up noise and produce high variance, causing overfitting and ↑ out-of-sample error.
What is Bias error in ML?
Bias error is the degree to which a model fits the training data.
Algorithms with erroneous assumptions produce high bias with poor approximation, causing underfitting and ↑ in-sample error.
(Adding more training samples will not improve the model)
What is Base error in ML?
Base error is the error due to randomness in the data.
(Out-of-sample accuracy increases as the training sample size increases)
Name 2 ways to prevent overfitting in supervised machine learning
Occam’s Razor: The problem-solving principle that the simplest solution tends to be the correct one.
In supervised ML, it means preventing the algorithm from getting too complex during selection and training. This is done by limiting the number of features and by penalizing algorithms that are too complex or too flexible, constraining them to include only parameters that reduce out-of-sample error.
K-Fold Cross Validation: This strategy comes from the principle of avoiding sampling bias.
The challenge is having a large enough data set to make both training and testing possible on representative samples.
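A minimal sketch of how k-fold cross validation splits the data: the observations are divided into k folds, and each fold takes one turn as the validation sample while the other k - 1 folds are used for training. The function name `k_fold_indices` is illustrative, and the sketch assumes the observations have already been shuffled.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(10, 5))
for train, validation in folds:
    print(train, validation)
```

Every observation is used for validation exactly once and for training k - 1 times, which is how the method makes the most of a limited data set.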