Data Science Flashcards
Two most common supervised tasks?
Classification and Regression
Four common unsupervised tasks?
Clustering, visualization, dimensionality reduction, association rule learning
What type of learning algorithm would you use to train a robot to walk on various unknown terrains?
Reinforcement Learning
Is spam detection a supervised or unsupervised learning problem?
Supervised: you feed the model many emails that are labeled spam or not spam
What is an online learning system?
A learning system that can learn incrementally, as opposed to a batch learning system. This makes it capable of adapting rapidly to changing data, well suited to autonomous systems, and able to train on very large quantities of data
What is out-of-core learning?
Out-of-core algorithms can handle vast quantities of data that cannot fit in a computer’s main memory. They chop the data into mini-batches and use online learning techniques to train on them.
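A minimal sketch of out-of-core training using scikit-learn's SGDRegressor with partial_fit; the mini-batch generator and toy data shapes here are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# Stand-in for streaming chunks of a dataset too large for main memory.
def stream_minibatches(n_batches=100, batch_size=32, n_features=5):
    rng = np.random.RandomState(42)
    true_w = np.arange(1.0, n_features + 1)
    for _ in range(n_batches):
        X = rng.randn(batch_size, n_features)
        y = X @ true_w + rng.randn(batch_size) * 0.1
        yield X, y

model = SGDRegressor(random_state=42)  # an incremental (online) learner
for X_batch, y_batch in stream_minibatches():
    model.partial_fit(X_batch, y_batch)  # learn one mini-batch at a time
```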
What type of learning algorithm relies on a similarity measure to make predictions?
An instance-based learning system learns the training data by heart; then, when given new instances, it uses a similarity measure to find the most similar learned instances and uses them to make predictions.
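For example, k-Nearest Neighbors is instance-based: it memorizes the training set and predicts from the most similar stored instances. A minimal sketch with invented toy data:

```python
from sklearn.neighbors import KNeighborsClassifier

X_train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_train = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3)  # similarity: Euclidean distance by default
knn.fit(X_train, y_train)                  # "learns by heart": stores the instances
print(knn.predict([[5.5, 5.5]]))           # -> [1]; its 3 nearest neighbors are class 1
```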
Difference between a model parameter and a learning algorithm's hyperparameter?
A model parameter is what the model uses to make predictions given a new instance (e.g., the slope of a linear model); a hyperparameter is a parameter of the learning algorithm itself (e.g., the maximum depth of a decision tree)
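To make the distinction concrete, a short sketch with invented toy data: coef_ and intercept_ are model parameters learned from the data, while max_depth is a hyperparameter fixed before training.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.1, 3.9, 6.2, 7.8])

lin = LinearRegression().fit(X, y)
print(lin.coef_, lin.intercept_)  # model parameters: learned slope and intercept

tree = DecisionTreeRegressor(max_depth=2)  # hyperparameter: chosen before fit()
tree.fit(X, y)
```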
1. What do model-based learning algorithms search for? 2. What is the most common strategy they use to succeed? 3. How do they make predictions?
1. They search for an optimal value for the model parameters such that the model will generalize well to new instances. 2. Usually by minimizing a cost function. 3. Feed new instances into the model and use the learned parameter values to compute predictions.
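A minimal sketch of all three steps, using batch gradient descent on the MSE cost of a 1-D linear model; the data and learning rate are invented for illustration:

```python
import numpy as np

# Model: y ≈ theta0 + theta1 * x. Search for the parameter values that
# minimize the MSE cost function.
X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])
theta0, theta1 = 0.0, 0.0
eta = 0.01  # learning rate

for _ in range(5000):
    error = theta0 + theta1 * X - y
    theta0 -= eta * 2 * error.mean()        # gradient of MSE w.r.t. theta0
    theta1 -= eta * 2 * (error * X).mean()  # gradient of MSE w.r.t. theta1

print(theta0 + theta1 * 5.0)  # prediction: feed a new instance into the model
```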
Five main challenges to ML?
1. Lack of data; 2. poor data quality; 3. nonrepresentative data; 4. uninformative features; 5. overfitting or underfitting the training data
Four solutions to overfitting?
1. Get more data; 2. simplify the model; 3. reduce the noise in the data; 4. regularize the model (e.g., increase the regularization hyperparameter)
What is the purpose of a test set?
To estimate the generalization error that the model will make on new instances, before the model is launched in production
The purpose of a validation set?
To compare models and tune the hyperparameters
What is a train-dev set?
Used when there is a risk of mismatch between the training data and the data used for validation
Which Linear Regression training algorithm can you use if you have a training set with millions of features?
You can’t use SVD or Normal Equation because computational complexity grows quickly with the number of features. Use Stochastic Gradient Descent or Mini-batch Gradient Descent. If memory allows you can use Batch Gradient Descent.
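A hedged sketch: make_regression stands in for a very wide dataset (a real case would have far more features), and SGDRegressor's per-step cost grows only linearly with the number of features, unlike the Normal Equation:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor

# Toy stand-in for a training set with a huge number of features.
X, y = make_regression(n_samples=1000, n_features=500, noise=0.1, random_state=42)

sgd = SGDRegressor(max_iter=1000, tol=1e-3, random_state=42)
sgd.fit(X, y)
```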
If your training set has very different scales which algorithms might suffer? What can you do about this?
The cost function will have the shape of an elongated bowl, so the Gradient Descent algorithms will take a long time to converge. (Normal Equation or SVD approach will work fine). To solve this you should scale the data first. Moreover, regularized models may converge to a suboptimal solution if the features are not scaled.
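A minimal sketch: wrapping the estimator in a pipeline with StandardScaler ensures the features are scaled before Gradient Descent ever sees them (X_train and y_train are assumed to exist):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor

# Scaling restores a roughly round cost bowl, so Gradient Descent converges faster.
model = make_pipeline(StandardScaler(), SGDRegressor(random_state=42))
# model.fit(X_train, y_train)  # X_train / y_train assumed to exist
```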
Can Gradient Descent get stuck in a local minimum when training a Logistic Regression Model?
Gradient Descent cannot get stuck in a local minimum when training a Logistic Regression model because the cost function is convex.
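For reference, the log loss that Logistic Regression minimizes, which is convex in θ:

```latex
J(\boldsymbol{\theta}) = -\frac{1}{m}\sum_{i=1}^{m}\Big[y^{(i)}\log\hat{p}^{(i)} + \big(1-y^{(i)}\big)\log\big(1-\hat{p}^{(i)}\big)\Big],
\qquad \hat{p}^{(i)} = \sigma\big(\boldsymbol{\theta}^{\mathsf{T}}\mathbf{x}^{(i)}\big)
```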
Suppose you use Batch Gradient Descent and you plot the validation error at every epoch. If you notice that the validation error consistently goes up, what is likely going on?
If the validation error consistently goes up after every epoch, then one possibility is that the learning rate is too high and the algorithm is diverging. If the training error also goes up, then this is clearly the problem and you should reduce the learning rate. However, if the training error is not going up, then your model is overfitting the training set and you should stop training.
Do all Gradient Descent algorithms lead to the same model, provided you let them run long enough?
If the optimization problem is convex (such as Linear Regression or Logistic Regression), and assuming the learning rate is not too high, then all Gradient Descent algorithms will approach the global optimum and end up producing fairly similar models. However, unless you gradually reduce the learning rate, Stochastic GD and Mini-batch GD will never truly converge; instead they will keep jumping back and forth around the global optimum. This means that even if you let them run for a very long time, these Gradient Descent algorithms will produce slightly different models.
Is it a good idea to stop Mini-batch Gradient Descent immediately when the validation error goes up?
Due to their random nature, neither Stochastic Gradient Descent nor Mini-batch Gradient Descent is guaranteed to make progress at every single training iteration. So if you immediately stop training when the validation error goes up, you may stop much too early, before the optimum is reached. A better option is to
save the model at regular intervals; then, when it has not improved for a long time (meaning it will probably never beat the record), you can revert to the best
saved model.
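A sketch of this save-and-revert early stopping with SGDRegressor and warm_start=True (so each fit() call runs one more epoch); the toy data and the patience of 50 epochs are invented:

```python
from copy import deepcopy
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(42)
w = np.array([1.0, 2.0, 3.0])
X_train = rng.randn(100, 3); y_train = X_train @ w + 0.1 * rng.randn(100)
X_val = rng.randn(50, 3);    y_val = X_val @ w + 0.1 * rng.randn(50)

sgd = SGDRegressor(max_iter=1, tol=None, warm_start=True,
                   learning_rate="constant", eta0=0.0005, random_state=42)
best_error, best_model, stale = float("inf"), None, 0

for epoch in range(1000):
    sgd.fit(X_train, y_train)  # warm_start: continues where it left off
    val_error = mean_squared_error(y_val, sgd.predict(X_val))
    if val_error < best_error:
        best_error, best_model, stale = val_error, deepcopy(sgd), 0  # save the best
    else:
        stale += 1
        if stale >= 50:  # no new record for a long time
            break

sgd = best_model  # revert to the best saved model
```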
Which Gradient Descent algorithm (among those we discussed) will reach the vicinity of the optimal solution the fastest? Which will actually converge? How can you make the others converge as well?
Stochastic Gradient Descent has the fastest training iteration since it considers only one training instance at a time, so it is generally the first to reach the vicinity
of the global optimum (or Mini-batch GD with a very small mini-batch size). However, only Batch Gradient Descent will actually converge, given enough
training time. As mentioned, Stochastic GD and Mini-batch GD will bounce around the optimum, unless you gradually reduce the learning rate.
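A minimal sketch of such a learning schedule: SGDRegressor's "invscaling" option shrinks the learning rate as eta0 / t^power_t, which lets Stochastic GD settle at the optimum instead of bouncing around it (X_train and y_train are assumed to exist):

```python
from sklearn.linear_model import SGDRegressor

# The step size decays over time, so updates near the optimum get smaller
# and the algorithm can actually converge.
sgd = SGDRegressor(learning_rate="invscaling", eta0=0.01, power_t=0.25,
                   max_iter=1000, random_state=42)
# sgd.fit(X_train, y_train)  # X_train / y_train assumed to exist
```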
Suppose you are using Polynomial Regression. You plot the learning curves and
you notice that there is a large gap between the training error and the validation
error. What is happening? What are three ways to solve this?
If the validation error is much higher than the training error, this is likely because your model is overfitting the training set. One way to try to fix this is to reduce
the polynomial degree: a model with fewer degrees of freedom is less likely to overfit. Another thing you can try is to regularize the model—for example, by adding an ℓ2 penalty (Ridge) or an ℓ1 penalty (Lasso) to the cost function. This will also reduce the degrees of freedom of the model. Lastly, you can try to increase the size of the training set.
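A sketch of the second fix: keep a high-degree polynomial but add an ℓ2 penalty via Ridge (toy data invented for illustration):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge

rng = np.random.RandomState(42)
X = rng.uniform(-3, 3, size=(30, 1))
y = 0.5 * X.ravel() ** 2 + X.ravel() + rng.randn(30)

model = make_pipeline(PolynomialFeatures(degree=10),  # deliberately too flexible
                      StandardScaler(),
                      Ridge(alpha=1.0))               # l2 penalty reins it in
model.fit(X, y)
```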
Suppose you are using Ridge Regression and you notice that the training error
and the validation error are almost equal and fairly high. Would you say that the
model suffers from high bias or high variance? Should you increase the regularization hyperparameter α or reduce it?
If both the training error and the validation error are almost equal and fairly high, the model is likely underfitting the training set, which means it has a high
bias. You should try reducing the regularization hyperparameter α.
Why would you want to use Ridge Regression instead of plain Linear Regression (i.e., without any regularization)?
A model with some regularization typically performs better than a model without any regularization, so you should generally prefer Ridge Regression over plain Linear Regression.
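A quick comparison sketch using cross-validation on invented data; with any noise in the targets, the regularized model typically scores at least as well:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=100, n_features=20, noise=10.0, random_state=42)

print(cross_val_score(LinearRegression(), X, y, cv=5).mean())
print(cross_val_score(Ridge(alpha=1.0), X, y, cv=5).mean())
```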