ML - Preprocessing & Evaluation Flashcards

1
Q

What are the three types of error in a ML model? Briefly describe them.

A
  1. Bias - error caused by choosing an algorithm that cannot accurately model the signal in the data, i.e. the model is too general or was incorrectly selected. For example, selecting a simple linear regression to model highly non-linear data would result in error due to bias.
  2. Variance - error from an estimator being too specific and learning relationships that are specific to the training set but do not generalize to new samples well. Variance can come from fitting too closely to noise in the data, and models with high variance are extremely sensitive to changing inputs. Example: Creating a decision tree that splits the training set until every leaf node only contains 1 sample.
  3. Irreducible error (a.k.a. noise) - error caused by randomness or inherent noise in the data that cannot be removed through modeling. Example: inaccuracy in data collection causes irreducible error. Example: trying to predict if someone will sneeze tomorrow; even the best model can’t account for a rogue pollen grain.
2
Q

What is the bias-variance trade-off?

A

**Bias** refers to error from an estimator that is too general and does not learn relationships from a data set that would allow it to make better predictions.

**Variance** refers to error from an estimator being too specific and learning relationships that are specific to the training set but will not generalize to new records well.

In short, the bias-variance trade-off is the trade-off between underfitting and overfitting: as you decrease variance, you tend to increase bias, and as you decrease bias, you tend to increase variance.

Your goal is to create models that minimize overall error through careful model selection and tuning, ensuring a balance between bias and variance: general enough to make good predictions on new data but specific enough to pick up as much signal as possible.

3
Q

What are some naive approaches to classification that can be used as a baseline for results?

A
  1. **Predict only the most common class:** if the majority of samples have a target of 1, predict 1 for the entire validation set. This is extremely useful as a baseline for imbalanced data sets.
  2. **Predict a random class:** if you have two classes, 1 and 0, randomly select either 1 or 0 for each sample in the validation set.
  3. **Randomly draw from a distribution matching that of the target variable in the training set:** if you have two classes and 70% of training samples are A while 30% are B, randomly sample from this distribution to create predictions for your validation set.

These baseline results are good to calculate at the start, and you should include at least one when making any assertions about the efficacy of your model, e.g. “our model was 50% more accurate than the naive approach of recommending the most popular car to all customers.”
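
A minimal sketch of these baselines, assuming scikit-learn is available; its DummyClassifier implements each naive strategy (the dataset here is synthetic and illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data: ~70% of samples belong to class 0.
X, y = make_classification(n_samples=1000, weights=[0.7, 0.3], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# "most_frequent" = predict the most common class; "uniform" = random class;
# "stratified" = draw from the training target distribution.
for strategy in ["most_frequent", "uniform", "stratified"]:
    baseline = DummyClassifier(strategy=strategy, random_state=0).fit(X_train, y_train)
    print(strategy, baseline.score(X_val, y_val))  # baseline accuracy
```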

4
Q

Explain the classification metric Area Under the Curve (AUC)

A

AUC (Area Under the Curve) refers to the area under the ROC (Receiver Operating Characteristic) curve of a classification model. It measures the model’s ability to distinguish between classes.

Concretely, AUC is the probability that the model will rank a randomly chosen positive instance higher than a randomly chosen negative one (in terms of the probability it assigns to the positive class). AUC ranges from 0 to 1, where 1 indicates perfect classification and 0.5 suggests no discrimination (similar to random guessing).
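
A minimal sketch, assuming scikit-learn: AUC is computed from predicted probabilities rather than hard class labels (model and data are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_val)[:, 1]  # probability of the positive class
print("AUC:", roc_auc_score(y_val, proba))  # 0.5 ~ random guessing, 1.0 = perfect ranking
```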

5
Q

Explain the classification metric Gini?

A

Gini (also called the Gini coefficient) is a rescaling of AUC to the range -1 to 1 so that 0 represents a model that makes random predictions: Gini = 2*AUC - 1

6
Q

What’s the difference between bagging and boosting?

A

Bagging and boosting are both ensemble methods, meaning they combine many weak predictors to create a strong predictor. One key difference is that bagging builds independent models in parallel, whereas boosting builds models sequentially, at each step emphasizing the observations that were missed in previous steps.

7
Q

How can you tell if your model is underfitting your data?

A

If your training and validation error are both relatively equal and very high, then your model is most likely underfitting your training data.

8
Q

How can you tell if your model is overfitting your data?

A

If your training error is low and your validation error is high, then your model is most likely overfitting your training data.

9
Q

Name and briefly explain several evaluation metrics that are useful for classification problems.

A
  1. Accuracy - measures the percentage of the time you correctly classify samples: (true positives + true negatives) / all samples
  2. Precision - measures the percentage of predicted members that were correctly classified: true positives / (true positives + false positives)
  3. Recall - measures the percentage of true members that were correctly classified by the algorithm: true positives / (true positives + false negatives)
  4. F1 - the harmonic mean of precision and recall (you can think of it as balancing Type I and Type II error)
  5. AUC - describes the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one
  6. Gini - a scaled and centered version of AUC
  7. Log-loss - similar to accuracy but increases the penalty for incorrect classifications that are “further” away from their true class. For log-loss, lower values are better.
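
A minimal sketch computing each of these metrics with scikit-learn (model and data are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, log_loss)
from sklearn.model_selection import train_test_split

X, y = make_classification(random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = model.predict(X_val)                # hard class labels
proba = model.predict_proba(X_val)[:, 1]   # positive-class probabilities

auc = roc_auc_score(y_val, proba)
print("accuracy :", accuracy_score(y_val, pred))
print("precision:", precision_score(y_val, pred))
print("recall   :", recall_score(y_val, pred))
print("f1       :", f1_score(y_val, pred))
print("AUC      :", auc)
print("Gini     :", 2 * auc - 1)
print("log-loss :", log_loss(y_val, proba))  # lower is better
```
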
10
Q

Name and briefly explain several evaluation metrics that are useful for regression problems.

A
  1. Mean squared error (MSE) - the average of the squared error of each prediction
  2. Root mean squared error (RMSE) - square root of MSE
  3. Mean absolute error (MAE) - the average of the absolute error of each prediction
  4. Coefficient of determination (R^2) - proportion of variance in the target that is predictable from the features
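
A minimal sketch computing each regression metric with scikit-learn (model and data are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(noise=10.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

pred = LinearRegression().fit(X_train, y_train).predict(X_val)

mse = mean_squared_error(y_val, pred)
print("MSE :", mse)
print("RMSE:", np.sqrt(mse))  # same units as the target
print("MAE :", mean_absolute_error(y_val, pred))
print("R^2 :", r2_score(y_val, pred))
```
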
11
Q

When should you reduce the number of features used by your model?

A

Some instances when feature selection is necessary:

• When there is strong collinearity between features

• There are an overwhelming number of features

• There is not enough computational power to process all features

• The algorithm forces the model to use all features, even when they are not useful (most often in parametric or linear models)

• When you wish to make the model simpler for any reason, e.g. easier to explain, less computational power needed, etc

12
Q

When is feature selection unnecessary?

A

Some instances when feature selection is not necessary:

• There are relatively few features

• All features contain useful and important signal

• There is no collinearity between features

• The model will automatically select the most useful features

• The computing resources can handle processing all of the features

• Thoroughly explaining the model to a non-technical audience is not critical

13
Q

What are the three types of feature selection methods?

A

• Filter Methods - feature selection is done independent of the learning algorithm, before any modelling is done. One example is finding the correlation between every feature and the target and throwing out those that don’t meet a threshold. Easy, fast, but naive and not as performant as other methods.

• Wrapper Methods - train models on subsets of the features and use the subset that results in the best performance. Examples are stepwise or recursive feature selection. The advantage is that each feature is considered in the context of the others, but this can be computationally expensive.

• Embedded Methods - learning algorithms have built-in feature selection e.g. L1 regularization
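
A minimal sketch of one example from each family, assuming scikit-learn (thresholds and values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_features=20, n_informative=5, random_state=0)

# Filter: score each feature independently of any model (here, an ANOVA F-test).
X_filter = SelectKBest(f_classif, k=5).fit_transform(X, y)

# Wrapper: recursively drop the weakest features according to a fitted model.
X_wrapper = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit_transform(X, y)

# Embedded: L1 regularization zeroes out coefficients during training itself.
embedded = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print(X_filter.shape, X_wrapper.shape, (embedded.coef_ != 0).sum())
```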

14
Q

What are two common ways to automate hyperparameter tuning?

A
  1. Grid search - test every possible combination of pre-defined hyperparameter values and select the best one
  2. Randomized search - randomly test possible combinations of pre-defined hyperparameter values and select the best tested one
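
A minimal sketch of both approaches using scikit-learn (the model and search spaces are illustrative):

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(random_state=0)

# Grid search: every combination in the grid is tested (6 fits x 3 CV folds).
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"max_depth": [3, 5, 10], "n_estimators": [50, 100]},
    cv=3,
).fit(X, y)

# Randomized search: only n_iter randomly sampled combinations are tested.
rand = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"max_depth": randint(2, 20), "n_estimators": randint(10, 200)},
    n_iter=10,
    cv=3,
    random_state=0,
).fit(X, y)

print(grid.best_params_, rand.best_params_)
```
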
15
Q

What are the pros and cons of grid search?

A

Pros: Grid search is great when you need to fine-tune hyperparameters over a small search space automatically. For example, if you have 100 different datasets that you expect to be similar (e.g. solving the same problem repeatedly with different populations), you can use grid search to automatically fine-tune the hyperparameters for each model.

Cons: Grid search is computationally expensive and inefficient, often searching over regions of the parameter space that have very little chance of being useful, which makes it extremely slow. It’s especially slow for large search spaces since its complexity increases exponentially as more hyperparameters are optimized.

16
Q

What are the pros and cons of randomized search?

A

Pros: Randomized search does a good job finding near-optimal hyperparameters over a very large search space relatively quickly and doesn’t suffer from the same exponential scaling problem as grid search.

Cons: Randomized search does not fine-tune the results as much as grid search does since it typically does not test every possible combination of parameters.

17
Q

What are some naive feature engineering techniques that improve model efficacy?

A
  1. Summary statistics (mean, median, mode, min, max, std) for each group of similar records, e.g. all male customers between the ages of 32 and 44 would get their own set of summary stats
  2. Interactions or ratios between features, e.g. var1/var2 or var1*var2
  3. Summaries of features, e.g. the number of purchases a customer made in the last 30 days (raw features may be last 10 purchase dates)
  4. Splitting feature information manually, e.g. a customer taller than 6’ may be a critical piece of information when recommending a car vs an SUV
  5. kNN using records in the training set to produce a “kNN” feature that is fed into another model
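
A minimal pandas sketch of the first two techniques, group summary statistics and feature ratios (the column names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "customer_group": ["A", "A", "B", "B", "B"],
    "spend": [10.0, 20.0, 5.0, 7.0, 9.0],
    "visits": [2, 4, 1, 1, 3],
})

# 1. Summary statistic per group of similar records, merged back onto each row.
df["group_mean_spend"] = df.groupby("customer_group")["spend"].transform("mean")

# 2. Ratio between two raw features.
df["spend_per_visit"] = df["spend"] / df["visits"]

print(df)
```
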
18
Q

What are three methods for scaling your data?

A
  1. “Normalization” or “scaling” - general terms that refer to transforming your input data to a new scale (often a linear transformation) such as to 0 to 1, -1 to 1, 0 to 10, etc
  2. Min-Max - linear transformation of data that maps the minimum value to 0 and the maximum value to 1
  3. Standardization - transforms each feature to a normal distribution with a mean of 0 and standard deviation of 1. May also be referred to as Z-score transformation
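
A minimal sketch of min-max scaling and standardization with scikit-learn (the tiny array is illustrative):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

print(MinMaxScaler().fit_transform(X))    # each column mapped to [0, 1]
print(StandardScaler().fit_transform(X))  # each column: mean 0, std 1
```
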
19
Q

Explain one major drawback to each of the three scaling methods.

A

General normalization scaling is sensitive to outliers since the presence of outliers will compress most values and make them appear extremely close together.

Min-Max scaling is also sensitive to outliers since the presence of outliers will compress most values and make them appear extremely close together.

Standardization (or Z-score transformation) rescales to an unbounded interval which can be problematic for certain algorithms, e.g. some neural networks, that expect input values to be inside a specific range.

20
Q

When should you scale your data? Why?

A

When your algorithm will weight each input, e.g. gradient descent used by many neural nets, or use distance metrics, e.g. kNN, model performance can often be improved by normalizing, standardizing, or otherwise scaling your data so that each feature is given relatively equal weight.

It is also important when features are measured in different units, e.g. feature A is measured in inches, feature B is measured in feet, and feature C is measured in dollars, that they are scaled in a way that they are weighted and/or represented equally.

In some cases, efficacy will not change but perceived feature importance may change, e.g. coefficients in a linear regression.

Scaling your data typically does not change performance or feature importance for tree-based models since the split points will simply shift to compensate for the scaled data.

21
Q

Describe basic feature encoding for categorical variables.

A

Feature encoding involves replacing classes in a categorical variable with new values such as integers or real numbers. For example, [red, blue, green] could be encoded to [8, 5, 11].

22
Q

When should you encode your features? Why?

A

You should encode categorical features so that they can be processed by numerical algorithms, e.g. so that machine learning algorithms can learn from them.

23
Q

What are three methods for encoding categorical data?

A

• Label encoding (non-ordinal) - each category is assigned a numeric value not representing any ordering. Example: [red, blue, green] could be encoded to [8, 5, 11].

• Label encoding (ordinal) - each category is assigned a numeric value representing an ordering. Example: [small, medium, large] could be encoded to [1, 2, 3]

• One-hot encoding (a.k.a. dummy encoding) - each category is transformed into a new binary feature, with all records marked 1/True or 0/False. Example: color = [red, blue, green] could be encoded to color_red = [1, 0, 0], color_blue = [0, 1, 0], color_green = [0, 0, 1]
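
A minimal sketch of the three encodings with scikit-learn (the small example arrays are illustrative):

```python
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

# Non-ordinal label encoding: the integers carry no ordering information.
print(LabelEncoder().fit_transform(["red", "blue", "green"]))

# Ordinal encoding with an explicit category order.
sizes = [["small"], ["medium"], ["large"]]
print(OrdinalEncoder(categories=[["small", "medium", "large"]]).fit_transform(sizes))

# One-hot encoding: one binary column per category.
colors = [["red"], ["blue"], ["green"]]
print(OneHotEncoder().fit_transform(colors).toarray())
```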

24
Q

What are some common uses of decision tree algorithms?

A
  1. Classification
  2. Regression
  3. Measuring feature importance
  4. Feature selection
25
Q

What are the main hyperparameters that you can tune for decision trees?

A

Generally speaking, we have the following parameters:

max depth - maximum tree depth

min samples split - minimum number of samples for a node to be split

min samples leaf - minimum number of samples for each leaf node

max leaf nodes - the maximum number of leaf nodes in the tree

max features - maximum number of features that are evaluated for splitting at each node (only valid for algorithms that randomize features considered at each split)

Other similar hyperparameters may be derived from the above hyperparameters.

The “traditional” decision tree is greedy and looks at all features at each split point, but many modern implementations allow splitting on randomized feature subsets (as seen in sklearn), so max features may or may not be a tunable hyperparameter.
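
A minimal sketch of these hyperparameters as exposed by scikit-learn’s DecisionTreeClassifier (the values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(random_state=0)

tree = DecisionTreeClassifier(
    max_depth=5,           # maximum tree depth
    min_samples_split=10,  # minimum samples required to split a node
    min_samples_leaf=5,    # minimum samples required at each leaf node
    max_leaf_nodes=20,     # cap on the total number of leaf nodes
    max_features="sqrt",   # features evaluated at each split (randomized)
    random_state=0,
).fit(X, y)

print(tree.get_depth(), tree.get_n_leaves())
```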

26
Q

Decision Trees: Explain how each hyperparameter affects the model’s ability to learn.

A

Generally speaking…

max depth - increasing max depth decreases bias and increases variance

min samples split - increasing min samples split increases bias and decreases variance

min samples leaf - increasing min samples leaf increases bias and decreases variance

max leaf nodes - decreasing max leaf nodes increases bias and decreases variance

max features - decreasing maximum features increases bias and decreases variance

There may be instances when changing hyperparameters has no effect on the model.

27
Q

Decision Trees: What metrics are usually used to compute splits?

A

Gini impurity or entropy. Both generally produce similar results.

28
Q

Decision Trees: What is Gini impurity?

A

Gini impurity (also called the Gini index) is a measurement of how often a randomly chosen record would be incorrectly classified if it were randomly labeled according to the class distribution of the sample: Gini = 1 - Σ p_i^2, where p_i is the fraction of samples in class i.

29
Q

Decision Trees: What do high and low Gini scores mean?

A

Low Gini (near 0) = most records from the sample are in the same class

High Gini (maximum of 1 or less, depending on number of classes) = records from sample are spread evenly across classes

30
Q

Decision Trees: What is entropy?

A

Entropy is a measure of the impurity of a set of samples: Entropy = -Σ p_i * log2(p_i), where p_i is the fraction of samples in class i. It is very similar to Gini in concept, but uses a slightly different calculation.

31
Q

Decision Trees: What do high and low Entropy scores mean?

A

Low Entropy (near 0) = most records from the sample are in the same class

High Entropy (maximum of 1) = records from sample are spread evenly across classes
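
A minimal sketch computing both impurity measures for a set of class labels, matching the definitions in the preceding cards (plain NumPy, no modeling library needed):

```python
import numpy as np

def gini_impurity(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)  # 0 = pure node

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))  # 0 = pure node; 1 = even 50/50 binary split

print(gini_impurity([1, 1, 1, 1]), entropy([1, 1, 1, 1]))  # 0.0 0.0 (pure)
print(gini_impurity([1, 1, 0, 0]), entropy([1, 1, 0, 0]))  # 0.5 1.0 (evenly split)
```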

32
Q

Are decision trees parametric or non-parametric models?

A

Non-parametric. The number of model parameters is not determined before creating the model.

33
Q

What are some ways to reduce overfitting with decision trees?

A
  • Reduce maximum depth
  • Increase min samples split
  • Balance your data to prevent bias toward dominant classes
  • Increase the number of samples
  • Decrease the number of features
34
Q

How is feature importance evaluated in decision-tree-based models?

A

The features that are split on most frequently and are closest to the top of the tree, thus affecting the largest number of samples, are considered to be the most important.

35
Q

How does Random Forest differ from traditional Decision Tree algorithms?

A

Random forest is an ensemble method that uses bagged decision trees with random feature subsets chosen at each split point. It then either averages the prediction results of each tree (regression) or uses majority voting across the trees (classification) to make the final prediction.

36
Q

What hyperparameters can be tuned for a random forest that are in addition to each individual tree’s hyperparameters?

A

Random forest is essentially bagged decision trees with random feature subsets chosen at each split point, so we have 2 new hyperparameters that we can tune:

num estimators - the number of decision trees in the forest

max features - maximum number of features that are evaluated for splitting at each node
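
A minimal sketch of the two forest-level hyperparameters in scikit-learn (values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,     # number of decision trees in the forest
    max_features="sqrt",  # features evaluated at each split point
    random_state=0,
).fit(X, y)

print(forest.score(X, y))
```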

37
Q

Are random forest models prone to overfitting? Why?

A

No, random forest models are generally not prone to overfitting because the bagging and randomized feature selection tends to average out any noise in the model. Adding more trees does not cause overfitting since the randomization process continues to average out noise (more trees generally reduces overfitting in random forest).

In general, bagging algorithms are robust to overfitting.

Having said that, it is possible to overfit with random forest models if the underlying decision trees have extremely high variance (e.g. extremely high depth and low min samples split) and a large percentage of features are considered at each split point; if every tree is nearly identical, then the random forest may overfit the data.

38
Q

How does gradient boosting, aka gradient boosting machines (GBM), differ from traditional decision tree algorithms?

A

Gradient boosting involves using multiple weak predictors (decision trees) to create a strong predictor. Specifically, it includes a loss function that calculates the gradient of the error with regard to each feature and then iteratively creates new decision trees that minimize the current error. More and more trees are added to the current model to continue correcting error until improvements fall below some minimum threshold or a pre-decided number of trees have been created.

39
Q

What hyperparameters can be tuned in gradient boosting that are in addition to each individual tree’s hyperparameters?

A

The main hyperparameters that can be tuned with GBM models are:

Loss function - loss function to calculate gradient of error

Learning rate - the rate at which new trees correct/modify the existing predictor

Num estimators - the total number of trees to produce for the final predictor

Additional hyperparameters specific to the loss function

Some specific implementations, e.g. stochastic gradient boosting, may have additional hyperparameters such as subsample size (subsample size affects the randomization in stochastic variations).
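
A minimal sketch of these hyperparameters using scikit-learn’s GradientBoostingClassifier (values are illustrative; for classification the loss defaults to log-loss, and subsample < 1.0 gives stochastic gradient boosting):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(random_state=0)

gbm = GradientBoostingClassifier(
    learning_rate=0.1,  # how strongly each new tree corrects the ensemble
    n_estimators=100,   # total number of trees
    subsample=0.8,      # fraction of samples drawn for each tree (stochastic GBM)
    random_state=0,
).fit(X, y)

print(gbm.score(X, y))
```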

40
Q

How can you reduce overfitting when doing gradient boosting?

A

Reducing the learning rate or reducing the maximum number of estimators are the two easiest ways to deal with gradient boosting models that overfit the data.

With stochastic gradient boosting, reducing subsample size is an additional way to combat overfitting.

Boosting algorithms tend to be vulnerable to overfitting, so knowing how to reduce overfitting is important.

42
Q

Why would you want to use dimensionality reduction techniques to transform your data before training?

A

Dimensionality reduction can allow you to:

• Remove collinearity from the feature space

• Speed up training by reducing the number of features

• Reduce memory usage by reducing the number of features

• Identify underlying, latent features that impact multiple features in the original space

43
Q

Why would you want to avoid dimensionality reduction techniques to transform your data before training?

A

Dimensionality reduction can:

• Add extra unnecessary computation

• Make the model difficult to interpret if the latent features are not easy to understand

• Add complexity to the model pipeline

• Reduce the predictive power of the model if too much signal is lost

44
Q

Name four popular dimensionality reduction algorithms and briefly describe them.

A
  1. Principal component analysis (PCA) - uses an eigen decomposition to transform the original feature data into linearly independent eigenvectors. The most important vectors (with highest eigenvalues) are then selected to represent the features in the transformed space
  2. Non-negative matrix factorization (NMF) - can be used to reduce dimensionality for certain problem types while preserving more information than PCA
  3. Embedding techniques - various embedding techniques, e.g. finding local neighbors as done in Local Linear Embedding, can be used to reduce dimensionality
  4. Clustering or centroid techniques - each value can be described as a member of a cluster, a linear combination of clusters, or a linear combination of cluster centroids

By far the most popular is PCA and similar eigen-decomposition-based variations.

45
Q

After doing dimensionality reduction, can you transform the data back into the original feature space? How?

A

Yes and no.

Most dimensionality reduction techniques have inverse transformations, but signal is often lost when reducing dimensions, so the inverse transformation is usually only an approximation of the original data.

46
Q

How do you select the number of principal components needed for PCA?

A

Selecting the number of latent features to retain is typically done by inspecting the eigenvalue of each eigenvector. As eigenvalues decrease, the impact of the latent feature on the target variable also decreases.

This means that principal components with small eigenvalues have a small impact on the model and can be removed.

There are various rules of thumb, but one general rule is to include the most significant principal components that account for at least 95% of the variation in the features.
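
A minimal sketch, assuming scikit-learn: passing a float to PCA’s n_components keeps just enough components to explain that fraction of variance, and inverse_transform illustrates the approximate reconstruction from the previous card:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

pca = PCA(n_components=0.95)  # keep components explaining >= 95% of the variance
X_reduced = pca.fit_transform(X)
print(X_reduced.shape, pca.explained_variance_ratio_)

X_restored = pca.inverse_transform(X_reduced)  # only an approximation of X
print("reconstruction error:", np.mean((X - X_restored) ** 2))
```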

47
Q

Briefly explain how the k-nearest neighbor (kNN) algorithm works.

A

kNN makes a prediction by averaging the k neighbors nearest to a given data point.

For example, if we wanted to predict how much money a potential customer would spend at our store, we could find the 5 customers most similar to her and average their spending to make the prediction.

The average could be weighted based on similarity between data points and the similarity, aka “distance,” metric could be modified as well.

48
Q

Is kNN a parametric or non-parametric algorithm? Is it used as a classifier or regressor?

A

kNN is non-parametric and can be used as either a classifier or regressor.

49
Q

How do you select the ideal number of neighbors for kNN?

A

There is no closed-form solution for calculating k, so various heuristics are often used. It may be easiest to simply do cross validation and test several different values for k and choose the one that produces the smallest error during cross validation.

As k increases, bias tends to increase and variance decreases.
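
A minimal sketch of choosing k by cross-validation with scikit-learn (the candidate values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(random_state=0)

# Score several candidate values of k and keep the best performer.
scores = {
    k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    for k in [1, 3, 5, 7, 9, 15]
}
best_k = max(scores, key=scores.get)
print(scores, "-> best k:", best_k)
```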

50
Q

Briefly explain how the k-means clustering algorithm works.

A

k-means clustering is an unsupervised clustering algorithm that partitions observations into k clusters.

The cluster centroids are usually randomly initialized at the start (often by choosing random observations from the data) and then iteratively updated/shifted.

At each iteration, every observation is assigned to the cluster whose centroid it is nearest, and the centroids are then recalculated as the mean of their assigned observations. This repeats until the assignments stop changing (or another stopping criterion is met).

51
Q

What is one common use case for k-means clustering?

A

Customer segmentation is probably the most common use case for k-means clustering (although it has many uses in various industries).

Often, unsupervised clustering is used to identify groups of similar customers (or data points) and then another predictive model is trained on each cluster. Then, new customers are first assigned a cluster and then scored using the appropriate model.

52
Q

Why is it difficult to identify the ‘ideal’ number of clusters in a dataset using k-means clustering?

A

There is no ‘ideal’ number of clusters since increasing the number of clusters always captures more information about the features (the limiting case is k=number of observations, i.e. each observation is a ‘cluster’).

Having said that, there are various heuristics that attempt to identify the ‘optimal’ number of clusters by recognizing when increasing the number of clusters only marginally increases the information captured.

The true answer is usually driven by the application, though. If a business has the ability to create 4 different offers, then they may want to create 4 customer clusters, regardless of the data.

53
Q

What is one heuristic to select ‘k’ for k-means clustering?

A

One such method is the elbow method. In short, it attempts to identify the point at which adding additional clusters only marginally increases the variance explained by the clusters. The elbow is the point at which we begin to see diminishing returns in explained variance when increasing k.
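
A minimal sketch of the elbow heuristic with scikit-learn’s KMeans; in practice you would plot inertia (within-cluster sum of squares) against k and look for the bend (the dataset is synthetic):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(centers=4, random_state=0)

# Inertia always decreases as k grows; the "elbow" is where it starts to flatten.
for k in range(1, 9):
    inertia = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(inertia, 1))
```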

54
Q

Does k-means clustering always converge to the same clusters? How does this affect the use of k-means clustering in production models?

A

No, there is no guarantee that k-means converges to the same set of clusters, even given the same samples from the same population.

The clusters that are produced may be radically different based on the initial cluster means selected.

For this reason, it is important that the cluster definitions remain static when using k-means clustering in production to ensure that different clusters aren’t created each time during training.

55
Q

What training algorithms are appropriate for a linear regression on large data sets? Which should be avoided?

A

Appropriate: stochastic gradient descent, mini-batch gradient descent

Avoided: normal equation (too computationally complex)
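
A minimal sketch, assuming scikit-learn: SGDRegressor fits a linear regression by stochastic gradient descent, which scales to large data sets (scaling the features first matters for convergence):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100_000, noise=5.0, random_state=0)

model = make_pipeline(StandardScaler(), SGDRegressor(random_state=0)).fit(X, y)
print(model.score(X, y))
```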

56
Q

What are some reasons that gradient descent may converge slowly and how can you address them?

A

Problem: Low learning rate
Solution: Increase the learning rate gradually (avoid making it so high that you jump over minima)

Problem: Features have very dissimilar scales
Solution: Rescale features using a rescaling technique

57
Q

How do you select the right order of polynomial for polynomial regressions? What if the data is high-dimensional?

A

This is a difficult question and there is no easy way to automate this selection.

It is suggested that you inspect the data and try to choose the order of polynomial that will best fit the data without overfitting.

If the data is high-dimensional and can’t be visualized, then you can train multiple models and observe when the validation error begins to increase instead of decrease. At this point you’re probably overfitting your training data and should reduce the polynomial order to the point where validation error is minimized.
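
A minimal sketch of that model-comparison approach, assuming scikit-learn: compare cross-validated error across polynomial degrees and keep the degree where it is lowest:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_regression(n_samples=200, n_features=1, noise=10.0, random_state=0)

for degree in range(1, 6):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    mse = -cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error").mean()
    print(degree, mse)  # pick the degree with the lowest validation MSE
```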

58
Q

Is a polynomial regression non-linear?

A

No. It is a linear model that can be used to fit non-linear data.

59
Q

Name and briefly explain three regularized linear models.

A

• Ridge regression - linear regression that adds L2-norm penalty/regularization term to the cost function

• Lasso - linear regression that adds L1-norm penalty/regularization term to the cost function

• Elastic Net - linear regression that adds a mix of both L1- and L2-norm penalty terms to the cost function

60
Q

What hyperparameters can be tuned in regularized linear models? Explain how they affect model learning.

A

You can tune the weight of the regularization term (typically denoted as alpha), which affects how strongly the model shrinks its coefficients.

alpha = 0 –> the regularized model is identical to the original, unregularized model

alpha → ∞ –> the penalty term dominates and the coefficients shrink toward zero, reducing the model to a constant value
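
A minimal sketch of the three regularized models and their alpha parameter in scikit-learn (values are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge

X, y = make_regression(n_features=10, noise=5.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)                    # L2 penalty
lasso = Lasso(alpha=1.0).fit(X, y)                    # L1 penalty; can zero out coefficients
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)  # mix of L1 and L2

print((lasso.coef_ == 0).sum(), "coefficients zeroed by lasso")
```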

61
Q

When should you use no regularization vs ridge vs lasso vs elastic net?

A

Regularized models tend to outperform non-regularized linear models, so it is suggested that you at least try using ridge regression.

Lasso can be effective when you want to automatically perform feature selection in order to create a simpler model, but it can be dangerous since it may be erratic and remove features that contain useful signal.

Elastic net is a balance of ridge and lasso, and it can be used to the same effect as lasso with less erratic behavior.

62
Q

What parameters can be tuned in logistic regression models? Explain how they affect model learning.

A

Logistic regression models can be tuned using regularization techniques (commonly the L2 norm, but other norms may be used as well).
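
A minimal sketch, assuming scikit-learn, where the strength is controlled by C, the *inverse* of the regularization weight (smaller C means a stronger penalty):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(random_state=0)

weak = LogisticRegression(penalty="l2", C=10.0).fit(X, y)    # light regularization
strong = LogisticRegression(penalty="l2", C=0.01).fit(X, y)  # heavy regularization

# The heavier penalty shrinks the coefficients toward zero.
print(abs(weak.coef_).sum(), ">", abs(strong.coef_).sum())
```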

63
Q

Can gradient descent get stuck at a local minima when training a logistic regression model? Why?

A

No, because the cost function is convex.

64
Q

Can logistic regression produce a probability score along with its classification prediction?

A

Yes. Logistic regression outputs the predicted probability of class membership via the logistic (sigmoid) function; the class label is simply that probability thresholded, commonly at 0.5.

65
Q

Is logistic regression a regressor or a classifier?

A

Logistic regression is usually used as a classifier because it predicts discrete classes.

Having said that, it technically outputs a continuous value associated with each prediction.

So we see that it is actually a regression algorithm (hence the name) that can solve classification problems.

It is fair to say that it is a classifier because it is used for classification, although it is technically also a regressor.

66
Q

What parameters can you tune for SVM?

A

The hyperparameters that you can commonly tune for SVM are:

• Regularization/cost parameter

• Kernel

• Degree of polynomial (if using a polynomial kernel)

• Gamma (modifies the influence of nearby points on the support vector for Gaussian RBF kernels)

• Coef0 (influences impact of high vs low degree polynomials for polynomial or sigmoid kernels)

• Epsilon (a margin term used for SVM regressions)
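
A minimal sketch of these hyperparameters as exposed by scikit-learn’s SVC and SVR (values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC, SVR

X, y = make_classification(random_state=0)

clf = SVC(
    C=1.0,          # regularization/cost parameter
    kernel="rbf",   # "linear", "poly", "rbf", or "sigmoid"
    gamma="scale",  # influence of nearby points (rbf/poly/sigmoid kernels)
    coef0=0.0,      # independent term (poly/sigmoid kernels)
).fit(X, y)

reg = SVR(kernel="poly", degree=3, epsilon=0.1)  # epsilon: margin for SVM regression
print(clf.score(X, y))
```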

67
Q

What are some possible uses of SVM models? E.g. classification, regression, etc

A

SVM can be used for:

• linear classification

• nonlinear classification

• linear regression

• nonlinear regression

68
Q

What common kernels can you use for SVM?

A
  1. Linear
  2. Polynomial
  3. Gaussian RBF
  4. Sigmoid
69
Q

Why is it important to scale features before using SVM?

A

SVM tries to fit the widest gap between all classes, so unscaled features can cause some features to have a significantly larger or smaller impact on how the SVM split is created.

70
Q

Can SVM produce a probability score along with its classification prediction?

A

Not directly. A standard SVM only reports which side of the decision boundary a sample falls on. However, many implementations can produce calibrated probability estimates (e.g. via Platt scaling), usually at additional computational cost.