Concepts Flashcards
Machine Learning
The process of creating an algorithm that predicts an outcome from data and can improve its performance through experience.
Supervised learning algorithms are…
- Trained on labeled data
Unsupervised learning algorithms are…
- Trained on unlabeled data
Regularization is…
Any process that reduces generalization error (i.e. testing error) but not training error. It controls a model's capacity (its ability to fit a wide variety of functions) and therefore helps prevent overfitting.
Examples include:
- L1/L2 in linear/logistic regression
Hyperparameters are…
Parameters that can be used to control an algorithm’s behavior but are not learned. These should be tuned on a validation set.
Examples include:
- alpha (the "learning rate") in gradient descent for linear regression, which controls the step size of each update
Difference between regression and classification
Regression is a process to predict continuous output values.
Classification is a process to predict categorical output values.
What is a cost function?
A cost function measures the error of our predictions. It quantifies how far our predicted outcomes are from the actual outcomes; training a model means minimizing this function.
Linear regression
- When is it used?
- What is the hypothesis?
- What is the cost function?
- Are there any assumptions?
Linear regression is used to predict a continuous outcome (e.g. house prices) from one or more input variables. These input variables can be continuous or categorical, but they must be represented numerically.
The hypothesis is a linear model:
y = theta_0 + theta_1 * x; in matrix form, y = theta^T * x (where the first column of the design matrix X is all 1's)
A common cost function is Mean Squared Error (MSE). This function is parabolic in the univariate case.
Assumptions (check?):
- linearity (the output is a linear function of the inputs)
- normality (of the residuals)
- independence (of the observations/errors)
- no multicollinearity (among the input variables)
What is the Mean Squared Error (MSE) cost function for linear regression?
J = (1/2n) * sum from i=1 to n ( y_predicted_i - y_actual_i )^2
= (1/2n) * sum from i=1 to n ( theta_0 + theta_1 * x_i - y_i )^2
= (1/2n) * (X * theta - y)^T * (X * theta - y)
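A minimal sketch of this cost in code (assuming numpy arrays: a design matrix X with a leading column of 1's, a target vector y, and a parameter vector theta; names are illustrative):

```python
import numpy as np

def mse_cost(X, y, theta):
    """Vectorized MSE cost: J = (1/2n) * (X@theta - y)^T (X@theta - y)."""
    n = len(y)
    residuals = X @ theta - y
    return (residuals @ residuals) / (2 * n)
```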
There are two ways to determine the coefficients for a linear regression model. What are they?
The Mean Squared Error (MSE) cost function can be minimized using gradient descent.
In the special case of linear regression, the cost function can also be minimized analytically (via the normal equation).
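A minimal sketch of the analytical route (the normal equation), assuming a numpy design matrix X whose first column is all 1's:

```python
import numpy as np

def normal_equation(X, y):
    """Closed-form solution: theta = (X^T X)^{-1} X^T y.
    np.linalg.solve is used instead of an explicit matrix inverse for stability."""
    return np.linalg.solve(X.T @ X, X.T @ y)
```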
Explain gradient descent.
Gradient descent is an algorithm that updates the coefficients to minimize the cost function.
In the case of linear regression:
- initial values of theta are chosen
- these are updated iteratively based on the slope of the cost function; we take steps along the cost function in the direction of greatest descent
- the size of the steps is controlled by hyperparameter alpha (“learning rate”)
- this continues until a minimum is found (common stopping conditions: the change in cost per iteration falls below a tolerance, or a maximum number of iterations is reached)
The update rule (applied to all parameters simultaneously) is:
theta_j := theta_j - alpha * (partial derivative of the cost function with respect to theta_j)
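A minimal sketch of batch gradient descent for linear regression (illustrative, not production code; assumes numpy arrays X with a leading column of 1's and y):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, n_iters=1000):
    """Repeatedly update theta := theta - alpha * dJ/dtheta,
    where J is the MSE cost (1/2n) * sum((X@theta - y)^2)."""
    n, d = X.shape
    theta = np.zeros(d)                   # initial values of theta
    cost_history = []
    for _ in range(n_iters):
        residuals = X @ theta - y
        gradient = (X.T @ residuals) / n  # partial derivatives of the cost
        theta -= alpha * gradient         # step in the direction of greatest descent
        cost_history.append((residuals @ residuals) / (2 * n))
    return theta, cost_history
```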
Discuss the effect of learning rate (hyperparameter alpha) in linear regression.
Alpha controls the rate of gradient descent when minimizing a cost function for linear regression. A larger value of alpha produces a larger step size, while a smaller value of alpha produces a smaller step size. If alpha is too small, you can find the minimum very precisely, but the algorithm may take a long time to converge. If alpha is too large, you may overshoot the minimum, and the algorithm may fail to converge or even diverge.
Note that even with a fixed alpha, the steps naturally get smaller as the number of iterations increases, because the slope of the cost function (and therefore the gradient term in the update) approaches zero near the minimum.
Things to keep in mind when preparing data for Machine Learning.
Gradient descent will work best if all of the input values x are between -1 and 1, or even -0.5 and 0.5. To achieve this:
- Feature scaling: divide all input values by the range of input values to achieve a range of 1
- Mean normalization: subtract the average value of each input variable from the values of that input variable to achieve an average of 0
- Standardization: subtract the average value of each input variable from the values of that input variable and divide by the standard deviation of that input variable to achieve an average of 0 and a standard deviation of 1
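A minimal numpy sketch of the three transformations (applied column-wise; sklearn's MinMaxScaler and StandardScaler provide similar functionality, and the values here are illustrative):

```python
import numpy as np

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

scaled = X / (X.max(axis=0) - X.min(axis=0))          # feature scaling: range becomes 1
mean_normalized = X - X.mean(axis=0)                  # mean normalization: mean becomes 0
standardized = (X - X.mean(axis=0)) / X.std(axis=0)   # standardization: mean 0, std 1
```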
How might you assess how well gradient descent is working?
Plot the value of the cost function against the iteration number. The curve should be strictly decreasing; if it increases or oscillates, the learning rate is probably too large.
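A minimal matplotlib sketch, assuming a list cost_history of per-iteration cost values (e.g. as collected in the gradient descent sketch above; the values below are just a placeholder):

```python
import matplotlib.pyplot as plt

# one cost value per iteration, e.g. as returned by gradient_descent above
cost_history = [10.0, 4.0, 2.0, 1.2, 1.05, 1.01, 1.0]

plt.plot(cost_history)
plt.xlabel("iteration")
plt.ylabel("cost J(theta)")
plt.show()   # the curve should decrease and flatten out as the algorithm converges
```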
Define MAE
Mean Absolute Error - a metric for assessing the accuracy of a regression model.
It is the average of the absolute values of the residuals (the differences between actual and predicted values).
MAE = (1/n) * sum from i=1 to n ( | y_i_actual - y_i_predicted | )
Smaller values of MAE indicate better model performance.
MAE places bounds on root mean squared error (RMSE):
MAE <= RMSE <= MAE*sqrt(n)
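A minimal numpy sketch comparing MAE and RMSE (values are illustrative):

```python
import numpy as np

y_actual = np.array([3.0, 5.0, 2.5, 7.0])
y_predicted = np.array([2.5, 5.0, 4.0, 8.0])

mae = np.mean(np.abs(y_actual - y_predicted))            # mean absolute error
rmse = np.sqrt(np.mean((y_actual - y_predicted) ** 2))   # root mean squared error
# the bound MAE <= RMSE <= MAE * sqrt(n) holds for these values
```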
Explain the need for validation and test sets.
You need a validation set to tune hyperparameters without letting your model see your “test” set.
You need a test set because, once you’ve trained a model, you need to be able to assess its performance in the real world (i.e. on data it’s never seen before).
Explain regularization in the context of linear regression.
Regularization increases generalizability by penalizing non-zero coefficients. (I.e., you want to discourage having more parameters than you need in your model).
L1 (or lasso) penalizes by the sum of the absolute values of the coefficients (Manhattan distance)
L2 (or ridge) penalizes by (1/2) * the sum of the squared coefficients (the squared Euclidean norm)
Explain L1 regularization
L1 regularization, or lasso regularization, adds the following term to the cost function: J = J + lambda * (sum from 1 to n of the absolute values of the coefficients). This sum is also known as the L1 norm (Manhattan distance).
It has the effect of setting small coefficients to exactly 0, thereby performing feature selection. This improves interpretability.
Hyperparameter lambda controls how strong this penalty is.
The default lambda value in sklearn (lambda is termed “alpha” in sklearn.linear_model.Lasso) is 1.
Explain L2 regularization
L2 regularization, or ridge regularization, adds the following term to the cost function: J = J + (1/2) * lambda * (sum from 1 to n of the coefficients^2). This sum is also known as the squared L2 (Euclidean) norm.
It has the effect of shrinking small coefficients toward (but not exactly to) 0, so it does not eliminate any features. As a result, models using L2 regularization retain all of their features, which can make them larger and harder to interpret than lasso models.
Hyperparameter lambda controls how strong this penalty is.
The default lambda value in sklearn (lambda is termed "alpha" in sklearn.linear_model.Ridge) is 1.
Both L1 and L2 regularization are sensitive to feature scale, so the data should be standardized before fitting either.
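A minimal sklearn sketch (synthetic data and parameter values are illustrative) that standardizes the features before fitting lasso and ridge models:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

lasso = make_pipeline(StandardScaler(), Lasso(alpha=1.0))   # L1: some coefficients become exactly 0
ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0))   # L2: coefficients shrink toward 0

lasso.fit(X, y)
ridge.fit(X, y)
print((lasso[-1].coef_ == 0).sum(), "lasso coefficients are exactly zero")
```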
Logistic regression
- When is it used?
- What is the hypothesis?
- What is the cost function?
- Are there any assumptions?
Logistic regression is the simplest classification algorithm. It predicts the probability of a positive outcome based on a set of input features. These features can be continuous or categorical, but they must be represented numerically. In mathematical terms, it predicts p(y = 1 | x). The output is 0 or 1, depending on whether the predicted probabilities are greater or less than some threshold (usually 0.5).
The hypothesis is that the log-odds of a positive outcome is a linear combination of input features:
p(x) = e^(b + theta*x) / (1 + e^(b + theta*x))
The cost function is: Binary cross entropy (also called log loss)
Assumptions:
- binary predictions
- independence
- log odds of the output can be modeled as a linear combination of the inputs
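A minimal numpy sketch of the hypothesis (sigmoid of a linear combination) and the binary cross-entropy cost (function names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, theta):
    """p(y = 1 | x) for a design matrix X whose first column is all 1's."""
    return sigmoid(X @ theta)

def binary_cross_entropy(y, p):
    """Log loss: -(1/n) * sum( y*log(p) + (1 - y)*log(1 - p) )."""
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
```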
Define odds and log odds
odds = p(x) / (1 - p(x))
log(odds) = log( p(x) / (1 - p(x)) )
if p(x) = (e^(b + theta*x))/(1 + e^(b + theta*x)) then log(odds) = b + theta*x
How are the coefficients of logistic regression typically estimated?
The coefficients are usually found via Maximum Likelihood Estimation (MLE), which is equivalent to minimizing the binary cross-entropy (log loss) cost function.
MLE is usually implemented using a quasi-Newton method (e.g. L-BFGS).
If you’re implementing by hand, it’s easier to use gradient descent.
Explain Maximum Likelihood Estimation (MLE)
MLE is a method used to estimate the parameters of a model. It picks the parameter values such that they maximize the likelihood that the process described by the model produced the data that were actually observed.
I.e., it estimates which curve (e.g. a normal curve) was most likely responsible for generating the data points observed. In the case that we believe a normal distribution was the process that generated the data, MLE will find the values of mu and sigma that describe the curve that best fits the observed data.
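For the normal-distribution example, the maximum-likelihood estimates have a closed form: the sample mean and the (biased, 1/n) sample standard deviation. A small numpy sketch (synthetic data, values illustrative):

```python
import numpy as np

data = np.random.default_rng(0).normal(loc=5.0, scale=2.0, size=1000)

mu_mle = data.mean()           # MLE of mu is the sample mean
sigma_mle = data.std(ddof=0)   # MLE of sigma uses the 1/n (biased) estimator
print(mu_mle, sigma_mle)       # close to the true values 5.0 and 2.0
```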
Explain the decision tree learning algorithm.
Decision trees can be used on either categorical or continuous data and can predict either a categorical or a continuous output variable.
In a decision tree, a series of rules on the input features is learned; at each node, the rows of data satisfying the rule go down one branch and the rest go down the other. The rules, and which features they split on, are chosen by the algorithm; a parameter that is commonly set by the modeler, however, is the depth of the tree. It is usual to try a deeper tree to begin with, and then to scale back if the model overfits.
It is common to use a depth of 10, which results in up to 2^10 = 1024 leaf nodes at the bottom of the tree. But if each node only contains a few examples, the model will be prone to overfitting (not enough data to make generalizable conclusions). A sensible parameter to tune on a validation set in sklearn to handle this problem is max_leaf_nodes (e.g. try values between 5 and 500).
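A minimal sklearn sketch of tuning max_leaf_nodes on a validation set (synthetic data and candidate values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

for max_leaf_nodes in [5, 50, 500]:
    tree = DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes, random_state=0)
    tree.fit(X_train, y_train)
    print(max_leaf_nodes, tree.score(X_val, y_val))   # validation accuracy
```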
Explain the random forest learning algorithm.
Random forests can be used on both continuous and categorical data, and they can predict either a continuous or a categorical outcome.
In this algorithm, a number of decision trees are generated for the same data.
There are two techniques to prevent all trees from reaching the same conclusions:
- Bootstrap the data (sample with replacement) and assign a sample to each tree
- Consider only a random subset of the input features at each split (it is usual to pick the sqrt of the number of input features)
To reach a prediction, the results of all trees are averaged (in the continuous case). In the categorical case, the trees vote.
Random forests often work well with default parameters in sklearn. Params to tune: n_estimators (default 100), criterion (default 'squared_error' for regression), min_samples_split (default 2), max_depth (default None)
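A minimal sklearn sketch using the regression variant (synthetic data and parameter values are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=20, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestRegressor(
    n_estimators=100,            # number of trees
    criterion="squared_error",   # split quality measure
    min_samples_split=2,
    max_depth=None,
    max_features="sqrt",         # random subset of features considered at each split
    random_state=0,
)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))   # R^2 on held-out data
```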
When should you choose a decision tree versus a random forest?
- Decision trees are interpretable and easy to visualize.
- Decision trees are highly reproducible and perform well on large data sets because they are quick to run
- Decision trees are very prone to overfitting, especially if the tree is deep. We can limit tree depth, but this increases the risk of a biased model.
- Random forests are able to reduce overfitting while not dramatically increasing error due to bias.
- Random forests are also more robust to outliers and general variation in the data, because they are an ensemble method where multiple trees must reach consensus.
My sense is that decision trees are almost never used in practice.
When should you choose a random forest versus linear/logistic regression?
- Decision trees and random forests can outperform linear/logistic regression if the output is not well-represented by linear combinations of the input variables (tree-based methods are non-parametric and learn interactions without them having to be explicitly modeled).
- Random forests perform well in the case when the number of variables is close to or exceeds the number of observations, a regime in which linear/logistic regression breaks down.
- Random forests are more robust to outliers because they are an ensemble method.
- Random forests are generally less interpretable (harder to explain) than regression models, and they take more time and memory to run.
Explain support vector machines (SVMs).
SVMs can be used for regression or classification, but they're usually used for the latter, and they are inherently binary classifiers (multi-class problems are handled with schemes such as one-vs-rest or one-vs-one). The decision boundary is a hyperplane that separates the classes; with a non-linear kernel, this hyperplane lives in a transformed, higher-dimensional feature space and corresponds to a complex, non-linear surface in the original feature space. The goal of the algorithm is to determine the hyperplane that separates the classes as successfully as possible.
In the SVM algorithm, a hyperplane is identified that separates class A from class B while maximizing the margin, i.e. the distance from the hyperplane to the nearest points of each class (the support vectors). In practice this is posed as a constrained quadratic optimization problem (libsvm-based implementations such as sklearn's SVC use an SMO-type solver).
Kernels can be used to create non-linear boundaries. In sklearn, there are options including ‘linear’, ‘rbf’, ‘poly’, and ‘sigmoid.’ Linear is usually best when you have a large number of features (> 1000) because it helps you avoid overfitting.
What are “support vectors” in Support Vector Machines (SVMs)?
This is easiest to think about in the context of binary classification.
The support vectors are the data points closest to the decision boundary between the classes. They are the points that determine the position of the separating hyperplane and the width of the margin.
Explain the hyperparameters in the SVM algorithm.
1) Gamma is the kernel coefficient if a non-linear boundary is used (i.e. rbf, sigmoid, etc). A high value of gamma (e.g. 100) will likely result in overfitting.
2) C is a penalty parameter that controls the trade-off between correct classification and smooth boundaries.
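A minimal sklearn sketch that searches over C and gamma for an RBF kernel (synthetic data and grid values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
grid = GridSearchCV(svm, {"svc__C": [0.1, 1, 10], "svc__gamma": [0.01, 0.1, 1]}, cv=5)
grid.fit(X, y)
print(grid.best_params_)   # the C/gamma combination with the best cross-validated accuracy
```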
Is there regularization in SVM?
Yes: the C hyperparameter acts as an inverse regularization strength. A smaller C regularizes more strongly (wider margin, more tolerated misclassifications), while a larger C fits the training data more tightly.
What are the pros and cons of using the SVM algorithm?
Pros:
- It works well when there is a clear margin of separation between classes
- It is effective in high dimensional spaces, even when the number of dimensions is greater than the number of samples
- It is memory-efficient because only the support vectors are used to tune the location of the hyperplane
Cons:
- Training time can be large on large data sets
- It performs poorly on noisy data (i.e. when there is no clear separation between classes)
- SVM does not directly provide probability estimates; these must be obtained with an extra calibration step (e.g. sklearn's SVC with probability=True fits a calibration using internal cross-validation), which is expensive.
Explain the k-means algorithm.
K-means is an unsupervised learning algorithm that groups similar points together to reveal underlying patterns.
The algorithm looks for a fixed number (k) of clusters in the data, where k defines the number of centroids you want to find.
- Start with k randomly located centroids
- Calculate the distance between each data point and all k centroids
- Assign each data point to its closest centroid
- After all the data points have been assigned, update the location of each centroid to the average location of all data points assigned to it
- Stop when centroid locations are not changing much between iterations, or when a maximum number of iterations is reached
The model output is the cluster each data record belongs to.
K is a critically-important hyperparameter.
Recommendation systems often use k-means.
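A minimal numpy sketch of the loop described above (illustrative only, not a replacement for sklearn's KMeans; names are assumptions):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """A bare-bones k-means loop following the steps above."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # start at actual data points
    for _ in range(n_iters):
        # distance from every data point to all k centroids; assign each point to the closest
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # update each centroid to the average location of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):   # stop when centroids are barely changing
            break
        centroids = new_centroids
    return labels, centroids
```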
Define “centroid”
A centroid is the center of a cluster of data. Formally, it’s the average location of all the data points assigned to a cluster.
What are some considerations for the k-means algorithm?
- The choice of initial positions for the centroids is important, and a poor choice can result in the algorithm failing to stabilize or converging to a poor solution. Rather than assigning entirely random centroid locations, one option is to initialize the centroids at the locations of actual data points; sklearn's default initialization, "k-means++", does this while also spreading the initial centroids apart.
- The selection of hyperparameter k matters a lot
- Data must be normalized in order for the k-means distances to make sense. This can be done in sklearn (e.g. with StandardScaler).
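A minimal sklearn sketch that standardizes the data and uses the default "k-means++" initialization (synthetic data and values are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
X_scaled = StandardScaler().fit_transform(X)   # normalize so all features contribute comparably

kmeans = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)          # the cluster each data record belongs to
centroids = kmeans.cluster_centers_            # final centroid locations
```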