Machine Learning Flashcards
What is overfitting?
Overfitting refers to a model that models the training data too well.
Overfitting happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the model's performance on new data. This means that the noise or random fluctuations in the training data are picked up and learned as concepts by the model. The problem is that these concepts do not apply to new data and negatively impact the model's ability to generalize.
Overfitting is more likely with nonparametric and nonlinear models that have more flexibility when learning a target function. As such, many nonparametric machine learning algorithms also include parameters or techniques to limit and constrain how much detail the model learns.
For example, decision trees are a nonparametric machine learning algorithm that is very flexible and is subject to overfitting training data. This problem can be addressed by pruning a tree after it has learned in order to remove some of the detail it has picked up.
What is underfitting?
Underfitting refers to a model that can neither model the training data nor generalize to new data.
An underfit machine learning model is not a suitable model, and it is easy to spot: it performs poorly even on the training data.
Underfitting is often not discussed because it is easy to detect given a good performance metric. The remedy is to move on and try alternative machine learning algorithms. Nevertheless, it provides a good contrast to the problem of overfitting.
How to detect overfitting?
K-fold cross-validation is one of the most popular techniques for assessing the accuracy of a model.
In k-fold cross-validation, the data is split into k equally sized subsets, also called "folds." One of the k folds acts as the test set, also known as the holdout set or validation set, and the remaining folds are used to train the model. This process repeats until each of the folds has acted as the holdout fold. After each evaluation, a score is retained, and when all iterations have completed, the scores are averaged to assess the performance of the overall model.
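As a rough illustration, here is a minimal scikit-learn sketch of 5-fold cross-validation (the dataset, the logistic-regression model, and k=5 are arbitrary choices for the example):

```python
# Minimal k-fold cross-validation sketch (illustrative dataset and model).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# cross_val_score splits the data into 5 folds, trains on 4, scores on the
# held-out fold, and repeats until every fold has been the holdout once.
scores = cross_val_score(model, X, y, cv=5)
print("per-fold accuracy:", scores)
print("mean accuracy:", scores.mean())
```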
How to avoid overfitting?
Early stopping:
Pause training before the model starts learning the noise in the training data. This approach risks halting the training process too soon, leading to underfitting. Finding the “sweet spot” between underfitting and overfitting is the ultimate goal here.
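One way to sketch this idea, assuming a scikit-learn gradient-boosting model (the dataset and hyperparameter values are illustrative; other libraries expose early stopping through callbacks instead):

```python
# Early-stopping sketch: stop adding trees once the internal validation
# score stops improving (all values below are illustrative assumptions).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, random_state=0)

model = GradientBoostingClassifier(
    n_estimators=500,        # upper bound on boosting iterations
    validation_fraction=0.1, # hold out 10% of the training data internally
    n_iter_no_change=10,     # stop after 10 iterations without improvement
    random_state=0,
)
model.fit(X, y)
print("iterations actually used:", model.n_estimators_)
```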
Train with more data:
This can increase the accuracy of the model by providing more opportunities to parse out the dominant relationship between the input and output variables. That said, this method is more effective when clean, relevant data is injected into the model. Otherwise, you could just continue to add more complexity to the model, causing it to overfit.
Data augmentation:
While it is better to inject clean, relevant data into your training data, sometimes noisy data is added to make a model more stable. However, this method should be done sparingly.
Feature selection:
When you build a model, you’ll have a number of parameters or features that are used to predict a given outcome, but many times, these features can be redundant to others. Feature selection is the process of identifying the most important ones within the training data and then eliminating the irrelevant or redundant ones. This is commonly mistaken for dimensionality reduction, but it is different. However, both methods help to simplify your model to establish the dominant trend in the data.
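A minimal sketch of univariate feature selection, assuming scikit-learn's SelectKBest (the synthetic dataset and the choice of k=10 are only for illustration):

```python
# Keep only the k features with the strongest univariate relationship to y.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=500, n_features=25, n_informative=5,
                           random_state=0)

selector = SelectKBest(score_func=f_classif, k=10)
X_reduced = selector.fit_transform(X, y)
print(X.shape, "->", X_reduced.shape)  # (500, 25) -> (500, 10)
```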
Regularization:
If overfitting occurs when a model is too complex, it makes sense for us to reduce the number of features. But what if we don’t know which inputs to eliminate during the feature selection process? If we don’t know which features to remove from our model, regularization methods can be particularly helpful. Regularization applies a “penalty” to the input parameters with the larger coefficients, which subsequently limits the amount of variance in the model. While there are a number of regularization methods, such as L1 regularization, Lasso regularization, and dropout, they all seek to identify and reduce the noise within the data.
Ensemble methods:
Ensemble learning methods are made up of a set of classifiers—e.g. decision trees—and their predictions are aggregated to identify the most popular result. The most well-known ensemble methods are bagging and boosting. In bagging, a random sample of data in a training set is selected with replacement—meaning that the individual data points can be chosen more than once. After several data samples are generated, these models are then trained independently, and depending on the type of task—i.e. regression or classification—the average or majority of those predictions yield a more accurate estimate. This is commonly used to reduce variance within a noisy dataset.
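A minimal bagging sketch, assuming scikit-learn's BaggingClassifier over decision trees (the dataset and the number of estimators are illustrative assumptions):

```python
# Bagging: each tree sees a bootstrap sample (drawn with replacement);
# the ensemble prediction is the majority vote across trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

bagged_trees = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=100,
    bootstrap=True,   # sample the training set with replacement
    random_state=0,
)
print("mean CV accuracy:", cross_val_score(bagged_trees, X, y, cv=5).mean())
```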
What are bias and variance? What is the trade-off between them?
- Bias: Error due to bias is the distance between a model's predictions and the true values. With this type of error, the model pays little attention to the training data, oversimplifies, and does not learn the underlying patterns. The model learns the wrong relations by not taking all the features into account.
- Variance: Variability of the model's prediction for a given data point; it tells us the spread of the predictions. With this type of error, the model pays so much attention to the training data that it memorizes it instead of learning from it. A model with high variance fails to generalize to data it has not seen before.
The bias-variance trade-off is about balancing the two and finding a sweet spot between error due to bias and error due to variance (minimize variance + bias^2).
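For squared-error loss this trade-off can be written out explicitly. A sketch of the standard decomposition, where f is the true function, \hat{f} the learned model, and \sigma^2 the irreducible noise variance:

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\Big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\Big]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```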
What is Regularization?
Regularization is the process that shrinks the coefficients towards zero. In simple words, regularization discourages learning a more complex or flexible model, in order to prevent overfitting.
Explain the difference between Lasso and Ridge regularization.
Lasso (L1) - The penalty function is defined by the sum of the absolute values of the coefficients.
Ridge (L2) - The penalty function is defined by the sum of the squares of the coefficients.
What is ElasticNet?
ElasticNet is a hybrid of Lasso and Ridge, where both the absolute-value penalty and the squared penalty are included; their mix is regulated by an additional coefficient, l1_ratio.
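A minimal comparison sketch, assuming scikit-learn's Ridge, Lasso, and ElasticNet estimators (the regression data and the alpha/l1_ratio values are illustrative assumptions):

```python
# Compare the three penalties; Lasso tends to drive some coefficients
# exactly to zero, Ridge only shrinks them, ElasticNet mixes both.
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)                    # L2: squared penalty
lasso = Lasso(alpha=1.0).fit(X, y)                    # L1: absolute penalty
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)  # blend of L1 and L2

print("zero coefficients -",
      "ridge:", (ridge.coef_ == 0).sum(),
      "lasso:", (lasso.coef_ == 0).sum(),
      "elasticnet:", (enet.coef_ == 0).sum())
```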
What is k-means algorithm?
The k-means algorithm is an iterative algorithm that tries to partition the dataset into k pre-defined, distinct, non-overlapping subgroups (clusters), where each data point belongs to only one group.
1. Select k, the number of clusters.
2. Select k random points from the data as the initial cluster centres.
3. Measure the distance from a data point to each of the k centres.
4. Assign the point to the nearest cluster based on the minimum distance.
5. Repeat steps 3-4 for all points.
6. Calculate the mean of each cluster and use it as the new cluster centre.
7. Repeat steps 3-6 until the clusters no longer change.
8. Calculate the variance of each cluster and add them up to get the total variation.
9. Repeat steps 2-8 with different random starting points as many times as you want, and keep the clustering with the lowest total variation.
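A minimal sketch of this procedure, assuming scikit-learn's KMeans (the synthetic blob data and k=3 are illustrative assumptions):

```python
# KMeans with n_init=10 repeats the whole procedure from 10 random
# initializations and keeps the run with the lowest total within-cluster
# variation (inertia).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("first 10 cluster labels:", kmeans.labels_[:10])
print("total within-cluster variation:", kmeans.inertia_)
```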
How to decide k in k-means algorithm?
Use the elbow method.
The goal is to minimize the distance between points within a cluster and maximize the distance between clusters.
Minimize the within-cluster sum of squares, WCSS (but a WCSS of zero means k = the number of samples, which is useless).
1. Try different values of k and find the total variation for each.
2. Plot the total variation against k and identify the best k from the location of the elbow.
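A minimal elbow-method sketch, assuming scikit-learn and matplotlib (the blob data and the range of candidate k values are illustrative assumptions):

```python
# Fit k-means for several k and plot WCSS (inertia) against k;
# the "elbow" of the curve suggests a reasonable k.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

ks = range(1, 11)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
        for k in ks]

plt.plot(list(ks), wcss, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("WCSS (inertia)")
plt.title("Elbow method")
plt.show()
```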
What are the pros and cons of k-means?
Pros:
- Simple to understand
- Fast to cluster
- Widely available
- Easy to implement
- Always creates a result (which is also a con)
Cons (with remedies):
- We need to pick k (remedy: elbow method)
- Sensitive to initialization (remedy: k-means++, which runs an initial algorithm to pick the most appropriate seed points)
- Sensitive to outliers (remedy: remove outliers)
- Produces only spherical solutions
What is k-nearest neighbors algorithm?
A supervised classification algorithm (it can also be used for nonlinear regression).
- Get the training data, already classified into groups.
- Given a new data point, find the classes of its k nearest points.
- Classify the new point based on the majority vote among those k points.
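A minimal KNN classification sketch, assuming scikit-learn (the iris dataset and k=5 are illustrative assumptions):

```python
# Classify each test point by majority vote among its 5 nearest training points.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))
```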
How to optimize k in k - nearest neighbors algorithm?
Select k by minimizing the error on a held-out test/validation set, for example via cross-validation.
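A minimal tuning sketch, assuming scikit-learn's GridSearchCV (the dataset and the candidate range of k are illustrative assumptions):

```python
# Cross-validate every candidate k and keep the one with the best CV score.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

search = GridSearchCV(KNeighborsClassifier(),
                      param_grid={"n_neighbors": list(range(1, 31))},
                      cv=5)
search.fit(X, y)
print("best k:", search.best_params_["n_neighbors"])
print("best CV accuracy:", search.best_score_)
```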
Explain pros and cons of KNN
Pros
Almost no assumptions
Simple and easy to implement (only need k and distance function)
A good value of k makes it robust to noise
KNN learns a non-linear decision boundary
There is no training required
Can be used as classification and regression
Cons
Inefficient (need to calculate distance to all n points to classify)
Does not perform well in high dimensions.
Does not handle categorical features well.
A low k is susceptible to outliers.
Explain the differences between KNN and K-means.
K-means: an unsupervised algorithm that groups similar data points together.
KNN: a supervised algorithm that classifies a new data point based on the classes of its k nearest neighbors.
What is linear regression?
A method of modeling a dependent variable based on an independent variable using a linear equation fit with the least-squares method. It asks: is there a significant linear relationship between the independent variable and the dependent variable?
Y = C + mX (C: intercept, m: slope)
Residual = y_real - y_pred
Also reported: correlation, standard error
TSS (total sum of squares) = Σ(y_real - mean(y_real))^2
RSS (residual sum of squares) = Σ(y_real - y_pred)^2
ESS (explained sum of squares) = Σ(y_pred - mean(y_real))^2
TSS = RSS + ESS
R^2 = 1 - RSS/TSS = ESS/TSS; anything above about 0.3 indicates a reasonable correlation
F-test for the overall significance of the model
Degrees of freedom: DF = n - k - 1 (n observations, k independent variables/coefficients)
t-test and p-value for each coefficient
Confidence interval: if zero is not included in the interval, the relationship is significant, i.e. we have a linear relationship
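A minimal sketch of fitting a linear regression and reading off these quantities, assuming statsmodels (the generated data, the true coefficients, and the noise level are illustrative assumptions):

```python
# Fit y = C + m*x by ordinary least squares and inspect R^2, the F-test,
# per-coefficient t-tests/p-values, and confidence intervals.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=100)  # C = 2, m = 3 (assumed)

X = sm.add_constant(x)       # adds the intercept column
model = sm.OLS(y, X).fit()

print("R^2:", model.rsquared)                        # 1 - RSS/TSS
print("F-stat:", model.fvalue, "p:", model.f_pvalue)
print(model.summary())       # coefficients, t-tests, p-values, conf. intervals
```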
What are the assumptions for linear regression?
- Linear relationship between the independent and dependent variables
- Residual errors (residuals) are normally distributed and independent of each other
- There is no correlation between the independent variables
- Homoscedasticity – Variance around the regression line is the same for all values of the predictor variable
What is time-series analysis?
• Use of past univariate values to predict the future.
• Univariate - only one y value changing with time (e.g. a stock price over the last 30 days).
• The interval between observations should be exactly the same.
• Components of TSA data
o Trend
o Seasonal
o White noise
o Residual
• Must be stationary
o Variance and covariance of the series are time invariant.
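A minimal sketch of splitting a series into these components, assuming statsmodels' seasonal_decompose (the synthetic monthly series, its trend, and its yearly seasonality are illustrative assumptions):

```python
# Decompose a monthly series into trend + seasonal + residual components.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

idx = pd.date_range("2020-01-01", periods=48, freq="MS")      # monthly points
values = (np.linspace(10, 20, 48)                             # upward trend
          + 2 * np.sin(2 * np.pi * np.arange(48) / 12)        # yearly cycle
          + np.random.default_rng(0).normal(0, 0.5, 48))      # noise
series = pd.Series(values, index=idx)

result = seasonal_decompose(series, model="additive", period=12)
print(result.trend.dropna().head())
print(result.seasonal.head())
print(result.resid.dropna().head())
```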
How to measure accuracy in classification?
Accuracy = # of correct/Total # of predictions
Accuracy = (TP + TN)/(TP+TN+FP+FN)
How to calculate Precision?
Precision is the fraction of predicted positives that are true positives.
Precision = TP/(TP+FP)
How to calculate Recall?
Recall is the fraction of actual positives that are correctly predicted as positive (true positives).
TP / (TP + FN)
What is F1 score?
The F1 score is the harmonic mean of recall (R) and precision (P).
F1 = 2RP/(R+P)
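A minimal sketch computing these metrics with scikit-learn (the true and predicted labels below are made-up assumptions):

```python
# Accuracy, precision, recall, and F1 from a pair of label vectors.
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print("accuracy:", accuracy_score(y_true, y_pred))    # (TP + TN) / all
print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall:", recall_score(y_true, y_pred))        # TP / (TP + FN)
print("f1:", f1_score(y_true, y_pred))                # 2RP / (R + P)
```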
What is stationary time series?
Time series are stationary if they do not have trend or seasonal effects. Summary statistics calculated on the time series, like the mean or the variance of the observations, are consistent over time.
When a time series is stationary, it can be easier to model. Statistical modeling methods assume or require the time series to be stationary to be effective.
Classical time series analysis and forecasting methods are concerned with making non-stationary time series data stationary by identifying and removing trends and removing seasonal effects.
If you have clear trend and seasonality in your time series, then model these components, remove them from observations, then train models on the residuals.
How to check time series is stationary?
3 methods:
- Look at Plots: You can review a time series plot of your data and visually check if there are any obvious trends or seasonality.
- Summary Statistics: You can review the summary statistics for your data for seasons or random partitions and check for obvious or significant differences.
- Statistical Tests: You can use statistical tests to check if the expectations of stationarity are met or have been violated.
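A minimal sketch of the statistical-test approach, assuming the augmented Dickey-Fuller test from statsmodels (the random-walk series below is an illustrative assumption):

```python
# ADF test: the null hypothesis is that the series has a unit root
# (i.e. is non-stationary); a small p-value lets us reject it.
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=200))  # a random walk is non-stationary

stat, p_value, *_ = adfuller(series)
print("ADF statistic:", stat)
print("p-value:", p_value)  # large p-value -> cannot reject non-stationarity
```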