10 Essential Machine Learning Interview Questions Flashcards

1
Q

What is stratified cross-validation and when should we use it?

A

Cross-validation is a technique for dividing data between training and validation sets. On typical cross-validation this split is done randomly. But in stratified cross-validation, the split preserves the ratio of the categories on both the training and validation datasets.

For example, if we have a dataset with 10% of category A and 90% of category B, and we use stratified cross-validation, we will have the same proportions in training and validation. In contrast, if we use simple cross-validation, in the worst case we may find that there are no samples of category A in the validation set.

Stratified cross-validation may be applied in the following scenarios:

On a dataset with multiple categories. The smaller the dataset and the more imbalanced the categories, the more important it will be to use stratified cross-validation.
On a dataset with data of different distributions. For example, in a dataset for autonomous driving, we may have images taken during the day and at night. If we do not ensure that both types are present in training and validation, we will have generalization problems.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

Why do ensembles typically have higher scores than individual models?

A

An ensemble is the combination of multiple models to create a single prediction. The key idea for making better predictions is that the models should make different errors. That way the errors of one model will be compensated by the right guesses of the other models and thus the score of the ensemble will be higher.

We need diverse models for creating an ensemble. Diversity can be achieved by:

Using different ML algorithms. For example, you can combine logistic regression, k-nearest neighbors, and decision trees.
Using different subsets of the data for training. This is called bagging.
Giving a different weight to each of the samples of the training set. If this is done iteratively, weighting the samples according to the errors of the ensemble, it’s called boosting.
Many winning solutions to data science competitions are ensembles. However, in real-life machine learning projects, engineers need to find a balance between execution time and accuracy.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

What is regularization? Can you give some examples of regularization techniques?

A

Regularization is any technique that aims to improve the validation score, sometimes at the cost of reducing the training score.

Some regularization techniques:

L1 tries to minimize the absolute value of the parameters of the model. It produces sparse parameters.
L2 tries to minimize the square value of the parameters of the model. It produces parameters with small values.
Dropout is a technique applied to neural networks that randomly sets some of the neurons’ outputs to zero during training. This forces the network to learn better representations of the data by preventing complex interactions between the neurons: Each neuron needs to learn useful features.
Early stopping will stop training when the validation score stops improving, even when the training score may be improving. This prevents overfitting on the training dataset.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

What is the curse of dimensionality? Can you list some ways to deal with it?

A

The curse of dimensionality is when the training data has a high feature count, but the dataset does not have enough samples for a model to learn correctly from so many features. For example, a training dataset of 100 samples with 100 features will be very hard to learn from because the model will find random relations between the features and the target. However, if we had a dataset of 100k samples with 100 features, the model could probably learn the correct relationships between the features and the target.

There are different options to fight the curse of dimensionality:

Feature selection. Instead of using all the features, we can train on a smaller subset of features.
Dimensionality reduction. There are many techniques that allow to reduce the dimensionality of the features. Principal component analysis (PCA) and using autoencoders are examples of dimensionality reduction techniques.
L1 regularization. Because it produces sparse parameters, L1 helps to deal with high-dimensionality input.
Feature engineering. It’s possible to create new features that sum up multiple existing features. For example, we can get statistics such as the mean or median.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

What is an imbalanced dataset? Can you list some ways to deal with it?

A

An imbalanced dataset is one that has different proportions of target categories. For example, a dataset with medical images where we have to detect some illness will typically have many more negative samples than positive samples—say, 98% of images are without the illness and 2% of images are with the illness.

There are different options to deal with imbalanced datasets:

Oversampling or undersampling. Instead of sampling with a uniform distribution from the training dataset, we can use other distributions so the model sees a more balanced dataset.
Data augmentation. We can add data in the less frequent categories by modifying existing data in a controlled way. In the example dataset, we could flip the images with illnesses, or add noise to copies of the images in such a way that the illness remains visible.
Using appropriate metrics. In the example dataset, if we had a model that always made negative predictions, it would achieve a precision of 98%. There are other metrics such as precision, recall, and F-score that describe the accuracy of the model better when using an imbalanced dataset.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

What are some factors that explain the success and recent rise of deep learning?

A

The success of deep learning in the past decade can be explained by three main factors:

More data. The availability of massive labeled datasets allows us to train models with more parameters and achieve state-of-the-art scores. Other ML algorithms do not scale as well as deep learning when it comes to dataset size.
GPU. Training models on a GPU can reduce the training time by orders of magnitude compared to training on a CPU. Currently, cutting-edge models are trained on multiple GPUs or even on specialized hardware.
Improvements in algorithms. ReLU activation, dropout, and complex network architectures have also been very significant factors.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

What is data augmentation? Can you give some examples?

A

Data augmentation is a technique for synthesizing new data by modifying existing data in such a way that the target is not changed, or it is changed in a known way.

Computer vision is one of fields where data augmentation is very useful. There are many modifications that we can do to images:

Resize
Horizontal or vertical flip
Rotate
Add noise
Deform
Modify colors
Each problem needs a customized data augmentation pipeline. For example, on OCR, doing flips will change the text and won’t be beneficial; however, resizes and small rotations may help.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

What are convolutional networks? Where can we use them?

A

Convolutional networks are a class of neural network that use convolutional layers instead of fully connected layers. On a fully connected layer, all the output units have weights connecting to all the input units. On a convolutional layer, we have some weights that are repeated over the input.

The advantage of convolutional layers over fully connected layers is that the number of parameters is far smaller. This results in better generalization of the model. For example, if we want to learn a transformation from a 10x10 image to another 10x10 image, we will need 10,000 parameters if using a fully connected layer. If we use two convolutional layers, the first one having nine filters and the second one having one filter, with a kernel size of 3x3, we will have only 90 parameters.

Convolutional networks are applied where data has a clear dimensionality structure. Time series analysis is an example where one-dimensional convolutions are used; for images, 2D convolutions are used; and for volumetric data, 3D convolutions are used.

Computer vision has been dominated by convolutional networks since 2012 when AlexNet won the ImageNet challenge.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly