ML Flashcards
Machine learning algorithms and tips
In a neural network, what if all the weights are initialized with the same value?
In simplest terms, if all the weights are initialized to the same value, every hidden unit receives exactly the same signal and computes the same output. Forward propagation still runs, but during backpropagation every unit in a layer gets an identical gradient, so the units update identically and never learn different features: the symmetry is never broken and the network behaves as if it had a single unit per layer.
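A minimal numpy sketch of the symmetry problem, assuming a toy one-hidden-layer network with a squared loss (the data and sizes are made up):

```python
import numpy as np

# Toy net: 3 inputs, 2 hidden units, 1 output, both hidden units initialized identically.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = rng.normal(size=(8, 1))

W1 = np.full((3, 2), 0.5)            # same weights for both hidden units
W2 = np.full((2, 1), 0.5)

h = np.tanh(X @ W1)                  # forward pass: both hidden units output the same values
y_hat = h @ W2
err = y_hat - y                      # gradient of the squared loss w.r.t. y_hat (up to a constant)

grad_W1 = X.T @ ((err @ W2.T) * (1 - h ** 2))   # backprop to the first layer

# The gradient columns (one per hidden unit) are identical, so the units stay identical forever.
print(np.allclose(grad_W1[:, 0], grad_W1[:, 1]))  # True
```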
What are the advantages of logistic regression?
Logistic regression gives you lots of ways to regularize your model, and you don't have to worry as much about your features being correlated as you do in Naive Bayes.
It has a probabilistic interpretation, unlike decision trees or SVMs, and you can update your model to take in new data (using an online gradient descent method), again unlike decision trees or SVMs.
Use it if you want a probabilistic framework (e.g., to easily adjust classification thresholds, to say when you’re unsure, or to get confidence intervals) or if you expect to receive more training data in the future that you want to be able to quickly incorporate into your model.
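A minimal scikit-learn sketch of the "quickly incorporate new data" point, assuming SGDClassifier with logistic loss as the online learner (the data is synthetic; on older scikit-learn versions the loss is named "log" rather than "log_loss"):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 5)), rng.integers(0, 2, size=100)

clf = SGDClassifier(loss="log_loss", penalty="l2")    # logistic loss + L2 regularization
clf.partial_fit(X, y, classes=np.array([0, 1]))       # initial fit

X_new, y_new = rng.normal(size=(10, 5)), rng.integers(0, 2, size=10)
clf.partial_fit(X_new, y_new)                         # fold in new data without retraining from scratch
probs = clf.predict_proba(X_new)                      # probabilistic output for thresholding / uncertainty
```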
Improve bias?
Well, improve the model or the training algorithm:
- A bigger neural net, or a random forest with boosting or bagging; try ensembling.
- Train longer or with a different minimization algorithm.
- You can also play with the features and engineer new ones. If using a linear model, you can create polynomial features (sketched below).
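A minimal scikit-learn sketch of the polynomial-features idea on made-up quadratic data:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# A plain linear model underfits this quadratic relationship (high bias);
# adding polynomial features lets the same linear model capture it.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * x[:, 0] ** 2 + rng.normal(scale=0.1, size=200)

linear = LinearRegression().fit(x, y)
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(x, y)

print(round(linear.score(x, y), 2), round(poly.score(x, y), 2))  # R^2: near 0 vs near 1
```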
What are the main ingredients that advanced methods like Adam, Adagrad, and RMSprop add to SGD with mini-batches?
Essentially:
1) Decaying the learning rate automatically.
2) Updating different parameters with different learning rates.
3) Adding momentum to avoid getting stuck in saddle points and flat areas where the gradient is almost zero.
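A minimal numpy sketch of one Adam step showing all three ingredients; the default hyperparameters below are the commonly used ones, not from the source:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on parameters theta given gradient grad at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad             # momentum: running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2        # per-parameter scale: running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                   # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)  # effective learning rate differs per parameter and decays
    return theta, m, v
```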
Nyquist theorem
The Nyquist Theorem states that in order to adequately reproduce a signal it should be periodically sampled at a rate that is 2X the highest frequency you wish to record.
Suppose the highest frequency component, in hertz, for a given analog signal is fmax. According to the Nyquist Theorem, the sampling rate must be at least 2fmax, or twice the highest analog frequency component.
Your manager has asked you to run PCA. Would you remove correlated variables first? Why?
Yes. In the presence of correlated variables, the variance explained by a particular component gets inflated.
For example: you have 3 variables in a data set, of which 2 are correlated. If you run PCA on this data set, the first principal component would exhibit twice the variance that it would exhibit with uncorrelated variables. Adding correlated variables also lets PCA put more importance on those variables, which is misleading.
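A minimal scikit-learn sketch of the inflation effect on synthetic data, where the third variable is a near-copy of the first:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
a, b, c = rng.normal(size=(3, 500))

X_uncorr = np.column_stack([a, b, c])
X_corr = np.column_stack([a, b, a + 0.01 * rng.normal(size=500)])  # third variable ~ copy of the first

for name, X in [("uncorrelated", X_uncorr), ("correlated", X_corr)]:
    ratio = PCA(n_components=1).fit(X).explained_variance_ratio_[0]
    print(name, round(ratio, 2))   # the first PC explains far more variance in the correlated case
```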
Why does regularization help with overfitting?
By keeping the coefficients/parameters of a parametric model (or the node weights of a network) small, it makes the model effectively simpler (similar to removing them) and so less prone to overfitting. In a neural network, small weights also keep activations like tanh in their near-linear regime around zero.
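A minimal scikit-learn sketch of the shrinking effect, assuming ridge (L2) regression on synthetic data:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=50)

for alpha in [0.01, 1.0, 100.0]:
    coef = Ridge(alpha=alpha).fit(X, y).coef_
    print(alpha, round(float(np.linalg.norm(coef)), 3))   # coefficient norm shrinks as the penalty grows
```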
Machine learning recipe for bias/variance.
- If training performance is poor, you have bias: reduce it to an acceptable amount first (bigger model, train longer).
- Then look at test performance: do you have variance? Address it (more data, regularization) and reiterate.
In deep learning you can often improve one without affecting the other: a bigger architecture mainly attacks bias, and more data mainly attacks variance.
Pros and cons of affinity propagation
Strengths: The user doesn’t need to specify the number of clusters (but does need to specify ‘sample preference’ and ‘damping’ hyperparameters). Weaknesses: The main disadvantage of Affinity Propagation is that it’s quite slow and memory-heavy, making it difficult to scale to larger datasets. In addition, it also assumes the true underlying clusters are globular.
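A minimal scikit-learn sketch on synthetic blobs; the 'damping' and 'preference' values here are illustrative, and a lower preference generally means fewer clusters:

```python
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

ap = AffinityPropagation(damping=0.9, preference=-50, random_state=0).fit(X)
print(len(ap.cluster_centers_indices_))   # number of clusters found; no n_clusters was specified
```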
Tips on practical use of trees?
- Decision trees tend to overfit on data with a large number of features.
- Getting the right ratio of samples to number of features is important, since a tree with few samples in high dimensional space is very likely to overfit.
- Consider performing dimensionality reduction (PCA, ICA, or Feature selection) beforehand to give your tree a better chance of finding features that are discriminative.
- Balance your dataset before training to prevent the tree from being biased toward the classes that are dominant.
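A minimal scikit-learn sketch combining two of the tips above (dimensionality reduction before the tree, and class_weight as one way to counter imbalance); the dataset is synthetic and imbalanced:

```python
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=100, weights=[0.9, 0.1], random_state=0)

model = make_pipeline(
    PCA(n_components=10),                                                           # reduce dimensionality first
    DecisionTreeClassifier(max_depth=5, class_weight="balanced", random_state=0),   # limit depth, reweight classes
)
model.fit(X, y)
```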
What is multi-task learning?
When you optimize more than one loss function, learning multiple tasks at once. It can be really powerful.
We can view multi-task learning as a form of inductive transfer. Inductive transfer can help improve a model by introducing an inductive bias, which causes a model to prefer some hypotheses over others. For instance, a common form of inductive bias is ℓ1 regularization, which leads to a preference for sparse solutions. In the case of MTL, the inductive bias is provided by the auxiliary tasks, which cause the model to prefer hypotheses that explain more than one task. This generally leads to solutions that generalize better.
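A minimal PyTorch sketch of hard parameter sharing; the module names, sizes, and the simple sum of losses are illustrative choices, not from the source:

```python
import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    """A shared trunk feeding two task-specific heads."""
    def __init__(self, in_dim=16, hidden=32):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.head_a = nn.Linear(hidden, 1)    # task A: regression
        self.head_b = nn.Linear(hidden, 2)    # task B: 2-class classification

    def forward(self, x):
        h = self.shared(x)
        return self.head_a(h), self.head_b(h)

model = MultiTaskNet()
x = torch.randn(8, 16)
y_a, y_b = torch.randn(8, 1), torch.randint(0, 2, (8,))

pred_a, pred_b = model(x)
loss = nn.functional.mse_loss(pred_a, y_a) + nn.functional.cross_entropy(pred_b, y_b)
loss.backward()   # gradients from both tasks flow into the shared trunk (the inductive bias)
```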
What is a CycleGAN?
What are bidirectional neural networks?
Contrary to unidirectional ones, you do not just get information from the previous words in the sentence but also from the following ones.
So you can distinguish between:
"He said: Teddy Roosevelt was a president"
"He said teddy bears are on sale"
From the first two words alone you cannot tell that "Teddy" is a name in the first sentence but not in the second; a bidirectional model can, because it also sees the words that come after.
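A minimal PyTorch sketch, assuming a bidirectional LSTM over token embeddings (vocabulary size and dimensions are made up):

```python
import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=1000, embedding_dim=64)
rnn = nn.LSTM(input_size=64, hidden_size=128, bidirectional=True, batch_first=True)

tokens = torch.randint(0, 1000, (1, 7))   # a toy 7-token sentence
out, _ = rnn(emb(tokens))

# Each position's output concatenates a forward (past-aware) and a backward (future-aware) state.
print(out.shape)   # torch.Size([1, 7, 256])
```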
Explain p-value?
When you conduct a hypothesis test in statistics, a p-value lets you judge the strength of your results. It is a number between 0 and 1: the probability of observing a result at least as extreme as the one measured, assuming the null hypothesis is true. The smaller the p-value, the stronger the evidence against the null hypothesis.
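A minimal SciPy sketch, assuming a two-sample t-test on synthetic data, just to show where the p-value comes from:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(loc=0.0, scale=1.0, size=100)   # control group
b = rng.normal(loc=0.3, scale=1.0, size=100)   # treatment group with a small real shift

t_stat, p_value = stats.ttest_ind(a, b)
print(round(p_value, 4))   # probability of a difference at least this large if the means were equal
```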
Explain AB Testing in great detail.
Briefly, what is a random forest?
A random forest is a meta-estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control overfitting.
In a random forest, each tree considers only a random subset of the features at each split (usually around sqrt(N)), while bagged trees consider all of the available features. This is important to diversify the weak-learner trees and make them less correlated.
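A minimal scikit-learn sketch of the distinction, assuming BaggingClassifier over full-feature trees versus RandomForestClassifier with per-split feature subsampling (synthetic data):

```python
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)      # random feature subset per split
bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)  # each tree sees all features

rf.fit(X, y)
bagged.fit(X, y)
```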
Advantages of SVM?
High accuracy, nice theoretical guarantees regarding overfitting, and with an appropriate kernel they can work well even if your data isn't linearly separable in the base feature space (unlike logistic regression). Especially popular in text classification problems, where very high-dimensional spaces are the norm.
They handle high-dimensional data well.
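A minimal scikit-learn sketch of the kernel point, on data that is not linearly separable in the original space:

```python
from sklearn.svm import SVC
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score

X, y = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=0)  # concentric circles

print(cross_val_score(SVC(kernel="linear"), X, y, cv=5).mean())  # linear kernel struggles
print(cross_val_score(SVC(kernel="rbf"), X, y, cv=5).mean())     # RBF kernel separates it easily
```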
Both being tree based algorithm, how is random forest different from Gradient boosting algorithm (GBM)?
The fundamental difference is that random forest uses a bagging technique to make predictions, while GBM uses boosting.
In bagging, the data set is split into n bootstrap samples drawn with replacement. Then, using a single learning algorithm, a model is built on each sample, and the resulting predictions are combined using voting or averaging. Bagging is done in parallel. In boosting, after the first round of predictions, the algorithm weighs misclassified examples higher so that they can be corrected in the succeeding round. This sequential process of giving higher weights to misclassified predictions continues until a stopping criterion is reached.
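A minimal scikit-learn sketch of the boosting side, assuming GradientBoostingClassifier on synthetic data (the hyperparameters are illustrative):

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Trees are fitted sequentially, each one correcting the current ensemble's errors,
# with learning_rate controlling how big each correction step is.
gbm = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05, max_depth=3, random_state=0)
gbm.fit(X, y)
```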
What can cause problems in interpreting coefficients in something like linear regression?
Collinearity or correlation among the variables. When two features are (nearly) collinear, many different coefficient combinations minimize the loss almost equally well (e.g. a large positive weight on one cancelled by a large negative weight on the other), so the individual coefficients become unstable and hard to interpret.
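A minimal scikit-learn sketch of the instability, using a made-up feature that is a near-copy of another:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 1e-3 * rng.normal(size=200)          # almost perfectly collinear with x1
y = 3 * x1 + 0.1 * rng.normal(size=200)

X = np.column_stack([x1, x2])
for seed in range(3):
    idx = np.random.default_rng(seed).choice(200, size=150, replace=False)  # a different subsample each time
    coef = LinearRegression().fit(X[idx], y[idx]).coef_
    print(np.round(coef, 1))   # individual coefficients swing wildly between subsamples
```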
What are the problems of using a standard (feedforward) neural network for sequence models?
- Inputs/outputs can have variable lengths (with a recurrent net you just apply the same cell to every input word, so lengths can vary).
- Most importantly: features learned at one position of the text are not shared across the other positions.
Also, a fully connected net over one-hot encoded inputs has a huge memory/parameter cost; just as conv-nets share parameters across image locations, RNNs share parameters across time steps, which keeps memory usage much lower.
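A small PyTorch sketch of the parameter-sharing point: the recurrent layer's size is independent of sequence length, unlike a fully connected layer over a flattened one-hot sequence (the numbers are made up):

```python
import torch.nn as nn

vocab, hidden, seq_len = 10_000, 128, 50

rnn = nn.RNN(input_size=vocab, hidden_size=hidden)   # same weights reused at every position
fc = nn.Linear(vocab * seq_len, hidden)              # tied to one fixed sequence length

def n_params(module):
    return sum(p.numel() for p in module.parameters())

print(n_params(rnn), n_params(fc))   # ~1.3M vs ~64M parameters
```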
Assumptions of Logistic Regression
Logistic regression does not make many of the key assumptions of linear regression and general linear models that are based on ordinary least squares algorithms – particularly regarding linearity, normality, homoscedasticity, and measurement level.
First, logistic regression does not require a linear relationship between the dependent and independent variables. Second, the error terms (residuals) do not need to be normally distributed. Third, homoscedasticity is not required. Finally, the dependent variable in logistic regression is not measured on an interval or ratio scale.
First, binary logistic regression requires the dependent variable to be binary.
Second, logistic regression requires the observations to be independent of each other.
Third, logistic regression requires there to be little or no multicollinearity among the independent variables.
Fourth, logistic regression assumes linearity between the independent variables and the log odds.
What are the issues with gradient descent computed on the entire batch instead of mini-batches (stochastic gradient descent)?
As we need to calculate the gradients for the whole dataset to perform just one update, batch gradient descent can be very slow and is intractable for datasets that don’t fit in memory. Batch gradient descent also doesn’t allow us to update our model online, i.e. with new examples on-the-fly.
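A minimal numpy sketch contrasting the two: each mini-batch update only needs a small slice of the data, so nothing requires the full dataset in memory at once (linear-regression example with made-up data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=10_000)

w, lr, batch = np.zeros(5), 0.01, 32
for epoch in range(5):
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch):
        b = idx[start:start + batch]
        grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)   # gradient from the mini-batch only
        w -= lr * grad                                   # many cheap updates per pass over the data

print(np.round(w, 2))   # close to the true weights
```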
What are some drawback of batch norm?
Below are a few cons of Batch Normalization:
BN calculates the batch statistics (mini-batch mean and variance) in every training iteration, so it requires larger batch sizes during training to effectively approximate the population mean and variance from the mini-batch. This makes it harder to train networks for applications such as object detection and semantic segmentation, because they generally work with high input resolutions (often as big as 1024x2048) and training with large batch sizes is not computationally feasible.
BN does not work well with RNNs. The problem is that RNNs have recurrent connections to previous timesteps and would require a separate β and γ for each timestep in the BN layer, which adds complexity and makes BN harder to use with RNNs.
Different training and test calculation: during test (or inference) time, the BN layer doesn't calculate the mean and variance from the test mini-batch but uses the fixed mean and variance calculated from the training data. This requires caution when using BN and introduces additional complexity. In PyTorch, model.eval() sets the model to evaluation mode, and the BN layer then uses the fixed mean and variance pre-calculated from the training data.
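A minimal PyTorch sketch of the train/eval difference for a BatchNorm layer:

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(4)
x = torch.randn(8, 4)

bn.train()
out_train = bn(x)    # normalizes with this batch's mean/variance (and updates the running stats)

bn.eval()
out_eval = bn(x)     # normalizes with the stored running mean/variance instead

print(torch.allclose(out_train, out_eval))   # False: same input, different output depending on the mode
```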
Describe a way to detect anomalies in a given dataset.
Anomaly detection is a technique used to identify unusual patterns that do not conform to expected behavior, called outliers. These can be rare items, events or observations which raise suspicions by differing significantly from the majority of the data.
There are 2 types:
outlier detection
The training data contains outliers which are defined as observations that are far from the others. Outlier detection estimators thus try to fit the regions where the training data is the most concentrated, ignoring the deviant observations.
novelty detection
The training data is not polluted by outliers and we are interested in detecting whether a new observation is an outlier. In this context an outlier is also called a novelty.
The simplest approach is to use simple statistical techniques and flag the data points that deviate from common statistical properties of a distribution, including mean, median, mode, and quantiles. For example, marking an anomaly when a data point deviates by a certain standard deviation from the mean.
However, in high dimensions, the statistical approach could be difficult, therefore, machine learning techniques could be used. Following are the popular methods used to detect anomalies:
Isolation Forest
One Class SVM
PCA-based Anomaly detection
FAST-MCD
Local Outlier Factor
(Explaining one of the above-mentioned methods)
Isolation Forests build a Random Forest in which each Decision Tree is grown randomly. At each node, it picks a feature randomly, then it picks a random threshold value (between the min and max value) to split the dataset in two. The dataset gradually gets chopped into pieces this way, until all instances end up isolated from the other instances. Random partitioning produces noticeably shorter paths for anomalies. Hence, when a forest of random trees collectively produces shorter path lengths for particular samples, they are highly likely to be anomalies.
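A minimal scikit-learn sketch, assuming IsolationForest on synthetic data with a few injected outliers:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(300, 2))    # dense cluster of normal points
outliers = rng.uniform(low=6.0, high=8.0, size=(10, 2))   # a few far-away points
X = np.vstack([normal, outliers])

iso = IsolationForest(n_estimators=100, contamination=0.05, random_state=0).fit(X)
labels = iso.predict(X)           # +1 = inlier, -1 = anomaly
print(int((labels == -1).sum()))  # roughly the injected outliers plus a few borderline points
```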








