Machine Learning Flashcards
What are CNNs good for?
Image recognition: convolutional layers identify and extract local features (edges, textures, shapes) from images that can then be used for classification. More generally, CNNs work well on any data with spatial or grid-like structure. CNN architectures do typically still end with fully connected layers to perform the final classification.
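As a rough illustration, a minimal sketch of this pattern in PyTorch (layer sizes are arbitrary and assume 28x28 grayscale inputs, not tuned for any real task):

    import torch.nn as nn

    # Convolutional layers extract spatial features; a fully connected
    # head does the final classification.
    cnn = nn.Sequential(
        nn.Conv2d(1, 16, kernel_size=3, padding=1),   # 1x28x28 -> 16x28x28
        nn.ReLU(),
        nn.MaxPool2d(2),                              # 16x28x28 -> 16x14x14
        nn.Conv2d(16, 32, kernel_size=3, padding=1),  # -> 32x14x14
        nn.ReLU(),
        nn.MaxPool2d(2),                              # -> 32x7x7
        nn.Flatten(),
        nn.Linear(32 * 7 * 7, 10),                    # fully connected classifier head
    )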
What are FCNNs good for?
Fully connected neural networks are good for classification and more general-purpose tasks because they are structure agnostic, but they are inefficient for images: every pixel gets its own weights, so the parameter count explodes and spatial structure is ignored.
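For comparison, a minimal sketch of a fully connected network in PyTorch (the hidden size of 128 is an arbitrary illustration):

    import torch.nn as nn

    # Structure agnostic: the input is flattened to a plain vector.
    # For a 28x28 image, the first layer alone has 784 * 128 weights.
    fcnn = nn.Sequential(
        nn.Flatten(),
        nn.Linear(28 * 28, 128),
        nn.ReLU(),
        nn.Linear(128, 10),
    )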
What happens if you add additional layers to a neural network?
Up to a point, additional layers give you richer, more abstract feature extraction; beyond that point the extra capacity tends to overfit the training data.
How to tell if you are overfitting the training data?
You have good performance on the training data, but noticeably worse performance on held-out evaluation (validation or test) data.
How to improve performance when overfitting?
Increase regularization, gather more training examples, or reduce the number of features (i.e., the model's complexity).
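For example, one common way to increase regularization in PyTorch is to add L2 weight decay to the optimizer (a minimal sketch; the model and hyperparameter values are placeholders):

    import torch
    import torch.nn as nn

    model = nn.Linear(10, 2)   # placeholder model

    # weight_decay applies an L2 penalty to the weights on every update,
    # pushing them towards zero and discouraging overfitting.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)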
What is an epoch?
An epoch is one complete pass of the entire training dataset through the model. In a second epoch, the same data is passed through again with the updated weights in order to further improve performance.
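Schematically, a minimal training-loop sketch in PyTorch (the tiny synthetic dataset, model, and batch size are placeholders, just to show where the epoch boundary sits):

    import torch
    import torch.nn as nn

    X, y = torch.randn(100, 4), torch.randint(0, 2, (100,))
    model = nn.Linear(4, 2)
    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    for epoch in range(5):                    # 5 epochs = 5 complete passes over the data
        for i in range(0, len(X), 20):        # mini-batches of 20 examples
            batch_X, batch_y = X[i:i+20], y[i:i+20]
            optimizer.zero_grad()
            loss = loss_fn(model(batch_X), batch_y)
            loss.backward()
            optimizer.step()                  # weights updated; the next epoch reuses the same data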
What is batch normalization?
Batch normalization is a technique intended to address internal covariate shift: the distribution of a layer's inputs changes every time the earlier weights are updated, so training takes much longer because you are "chasing a moving target". In batch normalization, each input variable to a layer is normalized (per mini-batch) to zero mean and unit variance, and then rescaled by learned parameters, so the layer sees inputs with a more stable distribution.
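The core per-batch computation, sketched in NumPy (eps is the usual small constant for numerical stability; the learned scale and shift, gamma and beta, are left at their default values here):

    import numpy as np

    x = np.random.randn(32, 8) * 3.0 + 5.0    # a mini-batch: 32 examples, 8 features
    eps = 1e-5

    mean = x.mean(axis=0)                      # per-feature mean over the batch
    var = x.var(axis=0)                        # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)    # zero mean, unit variance per feature

    gamma, beta = 1.0, 0.0                     # learned scale and shift (defaults here)
    out = gamma * x_hat + beta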
Pros and cons of batch normalization?
Pros: mostly good. It leads to faster convergence, decreases the importance of the initial weights, and requires less data to generalize well. Cons: it works poorly with sufficiently small batch sizes (the batch estimates of mean and variance become noisy), and it makes test-time behavior differ from training, since at test time there is no batch to normalize over (effectively a batch of 1), so stored running statistics must be used instead.
What is regularization?
Regularization is a method for reducing overfitting. It simplifies a model by shrinking coefficient estimates towards zero, typically by adding a penalty term on the coefficients (for example their L1 or L2 norm) to the loss function.
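For instance, L2 (ridge) regularization in scikit-learn, sketched on synthetic data (alpha controls the strength of the penalty):

    import numpy as np
    from sklearn.linear_model import Ridge

    X = np.random.randn(50, 5)
    y = X @ np.array([1.0, 2.0, 0.0, 0.0, -1.0]) + 0.1 * np.random.randn(50)

    # alpha scales the L2 penalty: loss = ||y - Xw||^2 + alpha * ||w||^2
    model = Ridge(alpha=1.0).fit(X, y)
    print(model.coef_)   # coefficients are shrunk towards zero vs. plain least squares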
What does LSTM stand for, what does biLSTM mean, and what are these networks good for?
LSTM stands for long short-term memory; a biLSTM is a bi-directional LSTM. These networks are used primarily for sequential data, modeling dependencies between neighboring entries in the sequence. The bi-directional variant looks at both forward and backward relations (essentially an additional LSTM layer running in the opposite direction, with the two outputs combined). They are most useful for natural language processing and other speech tasks.
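A minimal sketch in PyTorch (the sequence length and feature/hidden sizes are arbitrary; bidirectional=True doubles the output feature dimension because the forward and backward hidden states are concatenated):

    import torch
    import torch.nn as nn

    # batch of 4 sequences, each 10 timesteps long, 16 features per timestep
    x = torch.randn(4, 10, 16)

    bilstm = nn.LSTM(input_size=16, hidden_size=32,
                     batch_first=True, bidirectional=True)
    out, (h, c) = bilstm(x)
    print(out.shape)   # torch.Size([4, 10, 64]) -- forward + backward hidden states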
What happens in cross-validation?
In k-fold cross-validation, the data is split into k subsets (folds); the model is trained on k-1 of them and tested on the remaining one, and this is repeated so that each fold serves as the test set exactly once. The scores are then averaged.
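A minimal sketch with scikit-learn (logistic regression on synthetic data; cv=5 means 5 folds):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X = np.random.randn(100, 4)
    y = (X[:, 0] + X[:, 1] > 0).astype(int)

    # Train on 4 folds, test on the 5th, rotating so every fold is the test set once.
    scores = cross_val_score(LogisticRegression(), X, y, cv=5)
    print(scores.mean())   # average held-out accuracy across the 5 folds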
Bias vs variance in ML?
Bias = the simplifying assumptions a model makes to more easily approximate the target function (e.g., a CNN's assumption of spatial locality in images).
Variance = amount that target function will change given new training data.
Too much variance: overfitting. Too much bias: underfitting. Try to find the trade-off between the two.
What is an activation function?
An activation function takes the weighted sum coming into a neuron and maps it to an output, often squashing it into a fixed range such as (0, 1) for sigmoid or (-1, 1) for tanh; ReLU instead maps negative inputs to 0 and passes positive inputs through unchanged.
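The common choices, sketched in NumPy:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))   # output in (0, 1)

    def tanh(x):
        return np.tanh(x)                 # output in (-1, 1)

    def relu(x):
        return np.maximum(0.0, x)         # output in [0, inf)

    x = np.array([-2.0, 0.0, 3.0])
    print(sigmoid(x), tanh(x), relu(x))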
Why are activation functions nonlinear?
Most real classification problems need a nonlinear decision boundary to capture the non-linear relationship between inputs and classes, and a purely linear network cannot produce one. Without nonlinear activations, multiple hidden layers simply collapse into a single layer, because a composition of linear maps is itself a linear map.
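A quick NumPy demonstration of the collapse: two linear layers with no activation between them are equivalent to one linear layer whose weight matrix is the product of the two.

    import numpy as np

    W1 = np.random.randn(5, 3)   # first "layer"
    W2 = np.random.randn(2, 5)   # second "layer"
    x = np.random.randn(3)

    two_layers = W2 @ (W1 @ x)    # no nonlinearity between layers
    one_layer = (W2 @ W1) @ x     # a single equivalent linear layer
    print(np.allclose(two_layers, one_layer))   # True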
When to use ReLU and when to use softmax?
Softmax is primarily used as the output layer of neural networks, especially for multi-class classification, because it turns the raw outputs into a probability distribution over classes. ReLU is better for hidden layers and is very computationally cheap (tanh and sigmoid are more expensive to compute).
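Softmax sketched in NumPy (the maximum is subtracted before exponentiating, a standard trick for numerical stability):

    import numpy as np

    def softmax(logits):
        shifted = logits - np.max(logits)   # numerical stability
        exps = np.exp(shifted)
        return exps / exps.sum()            # probabilities summing to 1

    print(softmax(np.array([2.0, 1.0, 0.1])))   # approx [0.66, 0.24, 0.10]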