ML Flashcards
Machine learning algorithms and tips
In a neural network, what if all the weights are initialized with the same value?
In simplest terms, if all the weights have the same value, every hidden unit receives exactly the same signal and computes exactly the same output. During backpropagation each unit then receives the same gradient, so after every update the weights remain identical: the symmetry is never broken and the hidden units never learn different features.
What are the advantages of logistic regression?
Lots of ways to regularize your model, and you don’t have to worry as much about your features being correlated like you do in Naive Bayes.
Logistic regression also has a probabilistic interpretation, unlike decision trees or SVMs, and you can update your model to take in new data (using an online gradient descent method), again unlike decision trees or SVMs.
Use it if you want a probabilistic framework (e.g., to easily adjust classification thresholds, to say when you’re unsure, or to get confidence intervals) or if you expect to receive more training data in the future that you want to be able to quickly incorporate into your model.
Improve bias?
Improve the model or the training algorithm:
- Use a bigger neural net, or a random forest with boosting or bagging; try ensembling.
- Train longer or with a different minimization algorithm.
- You can also play with the features and engineer new ones. If using a linear model you can create polynomial features.
What are the main ingredients that advanced methods like Adam, Adagrad and RMSprop add to SGD with mini-batches?
Essentially
1) Decaying the learning rate automatically
2) Updating different parameters with different learning rates
3) Adding momentum to avoid getting stuck in saddle points and flat regions where the gradient is almost zero
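As a rough illustration of how these ingredients combine, here is a minimal numpy sketch of a single Adam update step (hyperparameter defaults follow common practice; the surrounding training loop, gradient `g` and moment buffers `m`, `v` are assumed to be managed by the caller):

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. t is the 1-based step counter."""
    # Momentum: exponential moving average of gradients.
    m = beta1 * m + (1 - beta1) * g
    # Adaptive per-parameter scale: moving average of squared gradients.
    v = beta2 * v + (1 - beta2) * g ** 2
    # Bias correction for the zero-initialized moving averages.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Each parameter effectively gets its own decaying learning rate.
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```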
Nyquist theorem
The Nyquist Theorem states that in order to adequately reproduce a signal it should be periodically sampled at a rate that is 2X the highest frequency you wish to record.
Suppose the highest frequency component, in hertz, for a given analog signal is fmax. According to the Nyquist Theorem, the sampling rate must be at least 2·fmax, i.e. twice the highest analog frequency component.
Your manager has asked you to run PCA. Would you remove correlated variables first? Why?
YES
Yes, because in the presence of correlated variables the variance explained by a particular component gets inflated.
For example: you have 3 variables in a data set, of which 2 are correlated. If you run PCA on this data set, the first principal component would exhibit roughly twice the variance it would exhibit with uncorrelated variables. Adding correlated variables also lets PCA put more weight on those variables, which is misleading.
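A quick sketch of this effect on made-up data (the sample size and the 0.05 noise level are arbitrary assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=1000), rng.normal(size=1000)
x3 = x1 + 0.05 * rng.normal(size=1000)          # strongly correlated with x1

uncorrelated = np.column_stack([x1, x2, rng.normal(size=1000)])
correlated = np.column_stack([x1, x2, x3])

print(PCA().fit(uncorrelated).explained_variance_ratio_)  # roughly uniform
print(PCA().fit(correlated).explained_variance_ratio_)    # first PC dominates
```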
Why does regularization help with overfitting?
By keeping the coefficients/parameters of a parametric model, or the weights of a network, small, regularization makes the model simpler (almost as if those parameters were removed) and therefore less prone to overfitting. In a neural network, small weights also keep the pre-activations close to zero, where activation functions like tanh or sigmoid are nearly linear.
Machine learning recipe for bias/variance.
- If you have high bias (bad training performance), first reduce the bias to an acceptable amount. - Then look at test performance: do you have high variance? If so, address it and reiterate. In deep learning you can often reduce both, or at least improve one without affecting the other: in a neural network you can use more data (to reduce variance) and a bigger architecture (to reduce bias), and that is largely it.
Pros and cons of affinity propagation
Strengths: The user doesn’t need to specify the number of clusters (but does need to specify ‘sample preference’ and ‘damping’ hyperparameters). Weaknesses: The main disadvantage of Affinity Propagation is that it’s quite slow and memory-heavy, making it difficult to scale to larger datasets. In addition, it also assumes the true underlying clusters are globular.
Tips on practical use of trees?
- Decision trees tend to overfit on data with a large number of features.
- Getting the right ratio of samples to number of features is important, since a tree with few samples in high dimensional space is very likely to overfit.
- Consider performing dimensionality reduction (PCA, ICA, or Feature selection) beforehand to give your tree a better chance of finding features that are discriminative.
- Balance your dataset before training to prevent the tree from being biased toward the classes that are dominant.
What is multi-task learning?
Multi-task learning means optimizing more than one loss function, learning multiple tasks at once. It can be really powerful.
We can view multi-task learning as a form of inductive transfer. Inductive transfer can help improve a model by introducing an inductive bias, which causes a model to prefer some hypotheses over others. For instance, a common form of inductive bias is ℓ1 regularization, which leads to a preference for sparse solutions. In the case of MTL, the inductive bias is provided by the auxiliary tasks, which cause the model to prefer hypotheses that explain more than one task. In practice, this generally leads to solutions that generalize better.
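A minimal PyTorch sketch of hard parameter sharing, one shared trunk with two task-specific heads (the layer sizes, task types and the plain sum of the losses are illustrative assumptions):

```python
import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    def __init__(self, n_features=16, hidden=32):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
        self.head_a = nn.Linear(hidden, 1)   # e.g. a regression task
        self.head_b = nn.Linear(hidden, 3)   # e.g. a 3-class classification task

    def forward(self, x):
        h = self.shared(x)
        return self.head_a(h), self.head_b(h)

model = MultiTaskNet()
x = torch.randn(8, 16)
y_a, y_b = torch.randn(8, 1), torch.randint(0, 3, (8,))
out_a, out_b = model(x)
# The auxiliary task acts as an inductive bias through the shared trunk.
loss = nn.MSELoss()(out_a, y_a) + nn.CrossEntropyLoss()(out_b, y_b)
loss.backward()
```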
What is a CycleGAN?
What are bidirectional neural networks?
Unlike unidirectional networks, you do not just get information from the previous words in the sentence but also from the following ones.
So you can distinguish between:
"He said: Teddy Roosevelt was a president."
"He said teddy bears are on sale."
With only the first few words you cannot tell that "Teddy" is part of a name in the first sentence but not in the second; the following words disambiguate it.
Explain p-value?
When you conduct a hypothesis test in statistics, the p-value lets you determine the strength of your results. It is a number between 0 and 1: the probability of observing a result at least as extreme as the one obtained, assuming the null hypothesis is true. The smaller the p-value, the stronger the evidence against the null hypothesis.
Explain AB Testing in great detail.
Briefly, what is a random forest?
A random forest is a meta-estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve predictive accuracy and control over-fitting.
In a random forest each tree considers only a random fraction of the features at each split (usually sqrt(N)), while bagged trees use all of the available features. This is important to diversify the weak-learner trees and make them uncorrelated.
Advantages of SVM?
High accuracy, nice theoretical guarantees regarding overfitting, and with an appropriate kernel they can work well even if your data isn't linearly separable in the base feature space (unlike logistic regression). Especially popular in text classification problems where very high-dimensional spaces are the norm.
They also handle high-dimensional data well.
Both being tree-based algorithms, how is random forest different from gradient boosting (GBM)?
The fundamental difference is that random forest uses a bagging technique to make predictions, while GBM uses boosting.
In bagging, the data set is divided into n samples using randomized (bootstrap) sampling. Then a model is built on each sample using a single learning algorithm, and the resulting predictions are combined using voting or averaging. Bagging is done in parallel. In boosting, after the first round of predictions the algorithm weighs misclassified examples higher, so that they can be corrected in the succeeding round. This sequential process of giving higher weights to misclassified predictions continues until a stopping criterion is reached.
What can cause problems in interpreting coefficients in something like linear regression?
Collinearity (correlation among the variables): the loss can then be minimized equally well by very different coefficient combinations, for example by setting two correlated coefficients to zero or to any pair of nearly opposite values, so the individual coefficients lose their interpretation.
What are the problems with using a plain neural network for sequence models?
- Inputs/outputs can have different, variable lengths (with a recurrent net you just apply the same network to every input word, so lengths can vary).
- Most importantly: a plain network does not share features learned across the different positions of the text.
Also, using one-hot encodings with a fully connected net has a huge memory impact; just as conv-nets do for images, RNNs for text also help keep memory usage low.
Assumptions of Logistic Regression
Logistic regression does not make many of the key assumptions of linear regression and general linear models that are based on ordinary least squares algorithms – particularly regarding linearity, normality, homoscedasticity, and measurement level.
First, logistic regression does not require a linear relationship between the dependent and independent variables. Second, the error terms (residuals) do not need to be normally distributed. Third, homoscedasticity is not required. Finally, the dependent variable in logistic regression is not measured on an interval or ratio scale.
First, binary logistic regression requires the dependent variable to be binary
Second, logistic regression requires the observations to be independent of each other.
Third, logistic regression requires there to be little or no multicollinearity among the independent variables.
Fourth, logistic regression assumes linearity of independent variables and log odds
What are the issues with gradient descent computed on the entire batch instead of mini-batches (stochastic gradient descent)?
As we need to calculate the gradients for the whole dataset to perform just one update, batch gradient descent can be very slow and is intractable for datasets that don’t fit in memory. Batch gradient descent also doesn’t allow us to update our model online, i.e. with new examples on-the-fly.
What are some drawbacks of batch norm?
Below are a few cons of Batch Normalization.
BN calculates the batch statistics (mini-batch mean and variance) in every training iteration, therefore it requires larger batch sizes during training so that it can effectively approximate the population mean and variance from the mini-batch. This makes it harder to train networks for applications such as object detection and semantic segmentation, because they generally work with high input resolution (often as big as 1024x2048) and training with large batch sizes is not computationally feasible.
BN does not work well with RNNs. The problem is that RNNs have a recurrent connection to previous timestamps and would require a separate β and γ for each timestep in the BN layer, which adds complexity and makes it harder to use BN with RNNs.
Different training and test calculation: during test (or inference) time, the BN layer doesn't calculate the mean and variance from the test mini-batch but uses the fixed mean and variance calculated from the training data. This requires caution when using BN and introduces additional complexity. In PyTorch, model.eval() sets the model in evaluation mode, and the BN layer then uses the fixed mean and variance pre-calculated from the training data.
Describe ways to detect anomalies in a given dataset.
Anomaly detection is a technique used to identify unusual patterns that do not conform to expected behavior, called outliers. These can be rare items, events or observations which raise suspicions by differing significantly from the majority of the data.
There are 2 types:
outlier detection
The training data contains outliers which are defined as observations that are far from the others. Outlier detection estimators thus try to fit the regions where the training data is the most concentrated, ignoring the deviant observations.
novelty detection
The training data is not polluted by outliers and we are interested in detecting whether a new observation is an outlier. In this context an outlier is also called a novelty.
The simplest approach is to use simple statistical techniques and flag the data points that deviate from common statistical properties of a distribution, including mean, median, mode, and quantiles. For example, marking an anomaly when a data point deviates by a certain standard deviation from the mean.
However, in high dimensions, the statistical approach could be difficult, therefore, machine learning techniques could be used. Following are the popular methods used to detect anomalies:
Isolation Forest
One Class SVM
PCA-based Anomaly detection
FAST-MCD
Local Outlier Factor
(Explaining one of the above-mentioned methods)
Isolation Forests build a Random Forest in which each decision tree is grown randomly: at each node, it picks a feature randomly, then picks a random threshold value (between the min and max of that feature) to split the dataset in two. The dataset gradually gets chopped into pieces this way until all instances end up isolated from the other instances. Random partitioning produces noticeably shorter paths for anomalies, hence when a forest of random trees collectively produces shorter path lengths for particular samples, they are highly likely to be anomalies.
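A short scikit-learn sketch on made-up 2D data (the contamination value is an arbitrary assumption):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(200, 2)),     # normal points
               rng.uniform(-6, 6, size=(5, 2))])    # a few anomalies

iso = IsolationForest(n_estimators=100, contamination=0.03, random_state=0)
labels = iso.fit_predict(X)        # -1 = anomaly, 1 = normal
scores = iso.score_samples(X)      # lower score = more anomalous (shorter path)
print(np.where(labels == -1)[0])
```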
Why is gradient checking important?
Gradient Checking is a method to check out the derivatives in Back-propagation algorithms. Implementation of back-propagation algorithm is usually prone to bugs and errors. Therefore, it’s necessary before running the neural network on training data to check if our implementation of back-propagation is correct. Gradient checking is a way to do that. It compares the back-propagation gradients, which are obtained analytically with loss function, with numerically obtained gradient for each parameter. Therefore, it ensures that the implementation is correct and would hence, significantly increase our confidence in the correctness of our code.
By numerically checking the derivatives computed, gradient checking eliminates most of the problems that may occur as the back-propagation algorithm may have many subtle bugs. It could look like it’s working, and our cost function may end up decreasing on every iteration of gradient descent, but this may result in a neural network that has a higher level of error that could go unnoticed and give us worse performance.
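A minimal sketch of the idea, assuming you supply a `loss(params)` function and the analytic `grad(params)` from your backprop implementation (both names are placeholders):

```python
import numpy as np

def gradient_check(loss, grad, params, eps=1e-7):
    analytic = grad(params)
    numeric = np.zeros_like(params)
    for i in range(params.size):
        plus, minus = params.copy(), params.copy()
        plus.flat[i] += eps
        minus.flat[i] -= eps
        # Central-difference numerical derivative for parameter i.
        numeric.flat[i] = (loss(plus) - loss(minus)) / (2 * eps)
    # Relative difference; values around 1e-7 or smaller suggest a correct backprop.
    return (np.linalg.norm(analytic - numeric)
            / (np.linalg.norm(analytic) + np.linalg.norm(numeric)))
```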
What cross-validation technique would you use on a time series dataset?
Instead of using standard k-folds cross-validation, you have to pay attention to the fact that a time series is not randomly distributed data — it is inherently ordered by chronological order. If a pattern emerges in later time periods for example, your model may still pick up on it even if that effect doesn’t hold in earlier years!
You’ll want to do something like forward chaining where you’ll be able to model on past data then look at forward-facing data.
fold 1 : training [1], test [2]
fold 2 : training [1 2], test [3]
fold 3 : training [1 2 3], test [4]
fold 4 : training [1 2 3 4], test [5]
fold 5 : training [1 2 3 4 5], test [6]
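scikit-learn implements exactly this forward-chaining scheme as TimeSeriesSplit; a quick sketch on stand-in data:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)   # stand-in for chronologically ordered data
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    # Each fold trains on the past and tests on the next chunk, never the reverse.
    print("train:", train_idx, "test:", test_idx)
```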
How is KNN different from k-means clustering?
K-Nearest Neighbors is a supervised classification algorithm, while k-means clustering is an unsupervised clustering algorithm.
How can you evaluate multilabel classification?
You turn it into multiple binary classification problems: for each label, is this example assigned that label, yes or no? This works because the label assignments should be independent of each other.
You can then use a binary cross-entropy loss for each label.
https://machinelearningmastery.com/multi-label-classification-with-deep-learning/
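A hedged tf.keras sketch of this setup (feature count, layer size and number of labels are arbitrary): one sigmoid output per label plus binary cross-entropy treats each label as an independent yes/no decision.

```python
import tensorflow as tf

n_features, n_labels = 20, 5
model = tf.keras.Sequential([
    tf.keras.Input(shape=(n_features,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(n_labels, activation="sigmoid"),  # one probability per label
])
model.compile(optimizer="adam", loss="binary_crossentropy")
# model.fit(X_train, Y_train)   # Y_train has shape (n_samples, n_labels) of 0/1
```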
What is a Kalman filter and when is it used?
Kalman Filters are a powerful tool used to evaluate the hidden state of a system, when we only have access to measurements of the system containing inaccuracies or errors. It bases its estimation on the past prior state, and the current measurements. For example, it can be used to estimate the position of a car based on its GPS signal. The position of the car at time t is a combination of its prior estimates of position and speed at t-1
Why do transformers use layer norm instead of batch norm?
A lesser known issue of Batch Norm is how hard it is to parallelize batch-normalized models. Since there is dependence between the elements of a batch, additional synchronization is needed across devices. While this is not an issue for most vision models, which tend to be used on a small set of devices, Transformers really suffer from this problem, as they rely on large-scale setups to counter their quadratic complexity. In this regard, layer norm provides some degree of normalization while incurring no batch-wise dependence.
What is a lift analysis?
Lift analysis is used for classification tasks: you bucket your data into deciles (predicted probability 0-0.1, 0.1-0.2, and so on) according to, say, the predicted probability that users will cancel a subscription. Then you check how many users in each bin actually cancelled. If you have a good model, the high-probability bins cancel a lot compared to the average, and the opposite holds for the low bins. The ratio of the rate in a high bin to the average rate is the lift.
Can you explain what MapReduce is and how it works?
MapReduce is a data processing job that enables distributed computations to handle a huge amount of data.
It is used to split and process terabytes of data in parallel, achieving quicker results. This way it makes it easy to scale data processing over multiple computing nodes.
The processing happens using the map and reduce function.
Map takes a set of data and converts it into another set of data, where individual elements are broken down into key/value tuples (for example, computing the length of each string or counting the occurrences of each word in a text).
Reduce takes the output of a map as its input and combines those data tuples into a smaller set of tuples (for example, taking the max of a tuple of string lengths).
The most famous implementation is Apache Hadoop
See https://towardsdatascience.com/a-beginners-introduction-into-mapreduce-2c912bb5e6ac
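A toy, single-machine illustration of the map/reduce idea for word counting (plain Python standing in for the distributed framework, not Hadoop itself):

```python
from functools import reduce
from itertools import groupby

docs = ["the cat sat", "the dog sat"]

mapped = [(word, 1) for doc in docs for word in doc.split()]          # map: emit (word, 1)
mapped.sort(key=lambda kv: kv[0])                                     # shuffle/sort by key
reduced = {key: reduce(lambda acc, kv: acc + kv[1], group, 0)         # reduce: sum counts
           for key, group in groupby(mapped, key=lambda kv: kv[0])}
print(reduced)   # {'cat': 1, 'dog': 1, 'sat': 2, 'the': 2}
```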
What is the importance/role of the pooling layer?
It is common to periodically insert a Pooling layer in-between successive Conv layers in a ConvNet architecture. Its function is to progressively reduce the spatial size of the representation, to reduce the amount of parameters and computation in the network, and hence to also control overfitting. In addition to max pooling, the pooling units can also perform other functions, such as average pooling or even L2-norm pooling.
You can often discard pooling entirely and use stride and padding in the convolutions to reduce the spatial dimensions.
How does XGBoost handle the bias-variance tradeoff?
XGBoost is a Gradient boosting of decision trees.
Boosting is a greedy algorithm and can overfit a training dataset quickly.
The general idea is that each individual tree will overfit some parts of the data but will therefore underfit other parts. In boosting, though, you don't use the individual trees but rather "average" them all together, so for a particular data point (or group of points) the trees that overfit it are averaged with the trees that underfit it, and the combined average should neither overfit nor underfit, but be about right.
In particular in XGBoost
There are in general two ways that you can control overfitting in XGBoost:
The first way is to directly control model complexity.
This includes max_depth, min_child_weight and gamma.
The second way is to add randomness to make training robust to noise.
This includes subsample and colsample_bytree.
You can also reduce stepsize eta. Remember to increase num_round when you do so.
Below are some constraints that can be imposed on the construction of decision trees:
Number of trees: adding more trees generally overfits only very slowly. The advice is to keep adding trees until no further improvement is observed.
Tree depth, deeper trees are more complex trees and shorter trees are preferred. Generally, better results are seen with 4-8 levels.
Number of nodes or number of leaves, like depth, this can constrain the size of the tree, but is not constrained to a symmetrical structure if other constraints are used.
Number of observations per split imposes a minimum constraint on the amount of training data at a training node before a split can be considered
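A hedged sketch of the XGBoost knobs discussed above (every value here is illustrative, not a recommendation):

```python
import xgboost as xgb

model = xgb.XGBClassifier(
    # 1) directly control model complexity
    max_depth=4,
    min_child_weight=5,
    gamma=1.0,
    # 2) add randomness to make training robust to noise
    subsample=0.8,
    colsample_bytree=0.8,
    # smaller step size eta, compensated with more boosting rounds
    learning_rate=0.05,
    n_estimators=500,
)
# model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)])
```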
Formulate Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) techniques.
Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text (the distributional hypothesis). A matrix containing word counts per document (rows represent unique words and columns represent each document) is constructed from a large piece of text and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of rows while preserving the similarity structure among columns. Documents are then compared by taking the cosine of the angle between the two vectors (or the dot product between the normalizations of the two vectors) formed by any two columns. Values close to 1 represent very similar documents while values close to 0 represent very dissimilar documents.
In natural language processing, the latent Dirichlet allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word’s presence is attributable to one of the document’s topics. LDA is an example of a topic model and belongs to the machine learning toolbox and in wider sense to the artificial intelligence toolbox.
K-means and Gaussian mixture model: what is the difference between k-means and a mixture of Gaussians?
A mixture of Gaussians is generative and statistics-based (it gives soft, probabilistic cluster assignments), while
k-means is very effective and accurate but not based on statistics (it makes hard assignments).
They are both affected by the initial state.
Why would you do dimensionality reduction?
(1) Reduce the storage space needed.
(2) Speed up computation (for example in machine learning algorithms): fewer dimensions mean less computing, and fewer dimensions can allow the use of algorithms unfit for a large number of dimensions.
(3) Remove redundant features, for example there is no point in storing a terrain's size in both square meters and square miles (maybe the data gathering was flawed). (4) Reducing the data's dimension to 2D or 3D may allow us to plot and visualize it, maybe observe patterns, and gain insights.
(5) Too many features or too complex a model can lead to overfitting.
What is the difference between type I vs type II error?
What is the difference between sigmoid/logistic and softmax?
Softmax function vs. sigmoid function:
- Softmax is used for multi-class classification in a logistic regression model; sigmoid is used for binary classification.
- With softmax the probabilities sum to 1; with sigmoid the probabilities need not sum to 1.
- Softmax is typically used in the output layer of a neural network over mutually exclusive classes; sigmoid is used as an activation function when building neural networks (and for independent binary outputs).
- With softmax the highest value gets the highest probability at the expense of the others; with sigmoid a high value gets a high probability independently of the other outputs.
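A tiny numpy illustration of the two, on made-up logits:

```python
import numpy as np

z = np.array([2.0, 1.0, 0.1])                 # logits for 3 classes

sigmoid = 1 / (1 + np.exp(-z))                # each in (0, 1), sum is unconstrained
softmax = np.exp(z) / np.exp(z).sum()         # scores compete and sum to exactly 1

print(sigmoid, sigmoid.sum())
print(softmax, softmax.sum())
```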
What are the main parameters for ensemble method like a random forest?
The main parameters to adjust when using these methods are n_estimators and max_features.
For n_estimators: the larger the better, but also the longer it will take to compute. In addition, note that results will stop getting significantly better beyond a critical number of trees.
For max_features: the lower it is, the greater the reduction in variance, but also the greater the increase in bias. Empirically good default values are max_features=n_features for regression problems and max_features=sqrt(n_features) for classification.
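A short scikit-learn sketch of these two knobs (the values are illustrative):

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=500,       # more trees: better but slower, with diminishing returns
    max_features="sqrt",    # fewer features per split: lower variance, higher bias
    n_jobs=-1,
    random_state=0,
)
# rf.fit(X_train, y_train)
```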
If you have 4GB of RAM in your machine and you want to train your model on a 10GB dataset, how would you go about this problem? Have you ever faced this kind of problem in your machine learning/data science experience so far?
First of all you have to ask which ML model you want to train.
For Neural networks: Batch size with Numpy array will work.
Steps:
Load the data as a memory-mapped Numpy array (np.memmap): it creates a mapping of the complete dataset on disk and doesn't load the complete dataset into memory.
You can pass index to Numpy array to get required data.
Use this data to pass to Neural network.
Have small batch size.
For SVM: Partial fit will work
Steps:
Divide one big dataset in small size datasets.
Use the partial_fit method; it takes a subset of the complete dataset at a time.
Repeat step 2 for other subsets.
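A hedged sketch of the out-of-core idea: a memory-mapped array plus an incrementally trained linear model. Note that scikit-learn's kernel SVC has no partial_fit, but a linear SVM trained with SGDClassifier does; the file names, shapes and dtypes below are assumptions.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Memory-mapped arrays: data stays on disk, slices are read on demand.
X = np.memmap("features.dat", dtype="float32", mode="r", shape=(10_000_000, 50))
y = np.memmap("labels.dat", dtype="int8", mode="r", shape=(10_000_000,))

clf = SGDClassifier(loss="hinge")          # hinge loss = linear SVM trained with SGD
classes = np.unique(y[:100_000])           # partial_fit needs the class list up front
for start in range(0, X.shape[0], 100_000):
    sl = slice(start, start + 100_000)
    clf.partial_fit(np.asarray(X[sl]), np.asarray(y[sl]), classes=classes)
```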
What is the O(n^2) bottleneck in a Transformer and how can we do better?
https://chengh.medium.com/evolution-of-fast-and-efficient-transformers-ec0378257994
Transformers scale badly with the length of the sequence and better with the size of the embedding: the attention mechanism has an n^2 cost from multiplying the query and key matrices.
To address this you can use:
Segment level recurrence:
Transformer-XL
Sparse attention
Approximation
Inference Acceleration
How can you fix the computational problem with word2vec?
Using hierarchical softmax or negative sampling
Cons of SVM?
Memory-intensive, hard to interpret, and kind of annoying to run and tune
What methods do you know for outlier detection?
Gaussian discriminant analysis (GDA)
Similar to a mixture of Gaussians, in my opinion.
GDA is a method for data classification commonly used when the data can be approximated with a Normal distribution. As a first step you need a training set, i.e. a bunch of data already classified. These data are used to train your classifier and to obtain a discriminant function that tells you to which class a data point has the highest probability of belonging.
When you have your training set, you compute the mean μ and the variance σ² for each class. These two parameters, as you know, describe a Normal distribution.
Once you have computed the Normal distribution for each class, to classify a data point you compute, for each class, the probability that the point belongs to it. The class with the highest probability is chosen.
What are the different kernels functions in SVM ?
There are four types of kernels in SVM.
Linear Kernel
Polynomial kernel
Radial basis kernel
Sigmoid kernel
Tell us more about bagging and boosting!
Use an unfair coin for fair tosses
Toss the coin twice.
If the result is HT, assign X=0. If the result is TH, assign X=1.
If the result is either HH or TT, then discard the two coin tosses and go to step 1.
The probability of getting HH or TT in two tosses is
P(HH) + P(TT) = p² + q².
Therefore, the probability of finally getting HT (and thus setting X=0) is
P(HT) + (P(HH) + P(TT))·P(HT) + (P(HH) + P(TT))²·P(HT) + … = pq / (1 − p² − q²) = 1/2.
Similarly, the probability of X=1 is 0.5, and hence we get a fair result from a biased coin.
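A quick simulation of the trick (the bias p = 0.7 is an arbitrary assumption); the returned bits should come out fair:

```python
import random

def fair_bit(p=0.7):
    while True:
        a = random.random() < p      # first toss: True = heads
        b = random.random() < p      # second toss
        if a and not b:              # HT -> 0
            return 0
        if b and not a:              # TH -> 1
            return 1
        # HH or TT: discard the pair and toss again

bits = [fair_bit() for _ in range(100_000)]
print(sum(bits) / len(bits))         # close to 0.5
```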
What should you be careful to do if you do cross-validation for a classification problem?
Stratify your cross-validation samples!!
Pros and cons of DBSCAN
Strengths: DBSCAN does not assume globular clusters, and its performance is scalable. In addition, it doesn’t require every point to be assigned to a cluster, reducing the noise of the clusters (this may be a weakness, depending on your use case). Weaknesses: The user must tune the hyperparameters ‘epsilon’ and ‘min_samples,’ which define the density of clusters. DBSCAN is quite sensitive to these hyperparameters.
What is the difference between precision and recall
Precision: true positives / number of predicted positives = TP / (TP + FP). Of all the examples we predicted as positive, how many really are? Recall: of all the actually positive cases in the test set, how many did we predict as positive? TP / (TP + FN).
What is Adjusted rand index?
The adjusted Rand index, in statistics and in particular in data clustering, is a measure of the similarity between two data clusterings.
It can be used to assess a clustering approach, for example against known labels.
The (unadjusted) Rand index has a value between 0 and 1, with 0 indicating that the two data clusterings do not agree on any pair of points and 1 indicating that the clusterings are exactly the same; the adjusted version corrects for chance agreement and can even be slightly negative.
SVM vs logistic regression?
They are closely linked. For example, you can obtain an SVM-like objective by asking logistic regression to maximize the decision margin rather than to get the probabilities as close as possible. Logistic regression is better when you want probabilities rather than just decisions; an SVM technically does not give you probabilities at all. SVMs can be more scalable (in the linear case) and can produce more complex separations (with kernels).
What is a drawback of Masked language modelling using in autoencoder language model like BERT?
The artificial symbols like [MASK] used by BERT during pretraining are absent from real data at fine-tuning time, resulting in a pretrain-finetune discrepancy. Moreover, since the predicted tokens are masked in the input, BERT is not able to model the joint probability using the product rule as in AR language modeling. In other words, BERT assumes the predicted tokens are independent of each other given the unmasked tokens, which is oversimplified since high-order, long-range dependency is prevalent in natural language [9].
If the labels are known in the clustering project, how to evaluate the performance of the model?
This basically becomes a classification problem: you can use metrics such as accuracy (after matching clusters to labels) or the adjusted Rand index.
What is a particular feature of a learner that makes it very suitable for bagging?
You want an unstable learner, like a decision tree, that gives very different answers for slightly different inputs.
What is XLNet, and what is its difference with BERT?
XLNet is a BERT-like model rather than an entirely different one, but it is a promising one. In one phrase, XLNet is a generalized autoregressive pretraining method.
BERT predicts all the masked words simultaneously; XLNet predicts them sequentially, and not necessarily left to right.
BERT is a bidirectional autoencoder
XLNet is “generalized” autoregressive(AR) it uses permutation language modelling https://arxiv.org/pdf/1906.08237.pdf
Firstly, instead of using a fixed forward or backward factorization order as in conventional AR models, XLNet maximizes the expected log likelihood of a sequence w.r.t. all possible permutations of the factorization order. Thanks to the permutation operation, the context for each position can consist of tokens from both left and right. In expectation, each position learns to utilize contextual information from all positions, i.e., capturing bidirectional context.
Also, there is no data corruption (no artificial [MASK] tokens during pretraining).
What is heteroscedasticity and what are its effects on linear regression?
A linear regression model presents heteroscedasticity when the variance of the perturbations (errors) is not constant across the observations. This implies the breach of one of the basic hypotheses on which the linear regression model is based.
Recall that one of the basic assumptions of linear regression is "that errors have constant variance." Heteroscedasticity means the data you work with are heterogeneous, since they come from probability distributions with different variances.
There are two major consequences of heteroscedasticity. One is that the standard errors of the regression coefficients are estimated wrongly and the t-tests (and F test) are invalid.
The other is that OLS is an inefficient estimation technique.
What are some methods for calibrating deep classifiers?
https://scikit-learn.org/stable/modules/calibration.html
Calibration curves (also known as reliability diagrams) compare how well the probabilistic predictions of a binary classifier are calibrated. It plots the true frequency of the positive label against its predicted probability, for binned predictions. The x axis represents the average predicted probability in each bin.
Calibrating a classifier consists of fitting a regressor (called a calibrator) that maps the output of the classifier (as given by decision_function or predict_proba) to a calibrated probability in [0, 1]. Denoting the output of the classifier for a given sample by fi, the calibrator tries to predict p(yi=1|fi).
The samples that are used to fit the calibrator should not be the same samples used to fit the classifier, as this would introduce bias. This is because performance of the classifier on its training data would be better than for novel data. Using the classifier output of training data to fit the calibrator would thus result in a biased calibrator that maps to probabilities closer to 0 and 1 than it should.
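A hedged scikit-learn sketch of the recipe above (the base classifier here stands in for any model exposing decision_function or predict_proba; for deep classifiers the same held-out-data idea applies, e.g. via Platt or temperature scaling):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

base = LinearSVC()                                   # has decision_function, no probabilities
# cv=5 makes sure the calibrator is fit on folds the classifier did not train on.
calibrated = CalibratedClassifierCV(base, method="sigmoid", cv=5)
# calibrated.fit(X_train, y_train)
# calibrated.predict_proba(X_test)                   # calibrated probabilities in [0, 1]
```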
Explain AdaBoost
AdaBoost fits a sequence of weak learners (i.e., models that are only slightly better than random guessing) on repeatedly modified versions of the data.
The predictions are combined through a weighted majority vote (or sum) to produce the final prediction.
The data modifications at each so-called boosting iteration consist of applying weights to each of the training samples. Initially, those weights are all set to 1/n, so that the first step simply trains a weak learner on the original data.
For each successive iteration, the sample weights are individually modified and the learning algorithm is reapplied to the reweighted data.
At a given step, those training examples that were incorrectly predicted by the boosted model induced at the previous step have their weights increased, whereas the weights are decreased for those that were predicted correctly.
As iterations proceed, examples that are difficult to predict receive ever-increasing influence. Each subsequent weak learner is thereby forced to concentrate on the examples that are missed by the previous ones in the sequence
What are the main steps to create a stacked model?
So, assume that we want to fit a stacking ensemble composed of L weak learners. Then we have to follow these steps:
split the training data in two folds
choose L weak learners and fit them to data of the first fold
for each of the L weak learners, make predictions for observations in the second fold
fit the meta-model on the second fold, using predictions made by the weak learners as inputs
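scikit-learn's StackingClassifier implements this recipe, handling the fold splitting internally; a minimal sketch (the choice of base learners and meta-model is arbitrary, and cv=2 mirrors the two-fold description above):

```python
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier()), ("svc", SVC(probability=True))],
    final_estimator=LogisticRegression(),   # the meta-model fit on out-of-fold predictions
    cv=2,
)
# stack.fit(X_train, y_train)
```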
How does a neural network with one layer and one input and output compare to a logistic regression?
Neural networks and logistic regression are both used for classification problems. Logistic regression can be defined as the simplest form of Neural Network that results in straightforward decision boundaries whereas neural networks are a superset that includes additional complex decision boundaries to cater to more complex and large data. Logistic regression models cannot capture complex non-linear relationships w.r.t features. Meanwhile, a neural network with non-linear activation functions enables one to capture highly complex features.
What is an autoencoding language model?
Based on Masked language modelling
In comparison, AE-based pretraining does not perform explicit density estimation but instead aims to reconstruct the original data from corrupted input. A notable example is BERT [10], which has been the state-of-the-art pretraining approach. Given the input token sequence, a certain portion of tokens are replaced by a special symbol [MASK], and the model is trained to recover the original tokens from the corrupted version. Since density estimation is not part of the objective, BERT is allowed to utilize bidirectional contexts for reconstruction.
Which ML algorithms handle lots of irrelevant features well (separate signal from noise)?
DO: Naive Bayes, random forest (if the noise is not crazy), AdaBoost, and neural networks. DO NOT: KNN, linear regression, logistic regression (unless you regularize with Lasso), decision trees.
What is the universal approximation theorem? Do we really need DEEP neural network?
According to the universal approximation theorem, given enough capacity, we know that a feedforward network with a single layer is sufficient to represent any function. However, the layer might be massive and the network is prone to overfitting the data. Therefore, there is a common trend in the research community that our network architecture needs to go deeper
Could you explain how to define the number of clusters in a clustering algorithm?
The primary objective of clustering is to group together similar entities in such a way that, while entities within a group are similar to each other, the groups remain different from one another.
Generally, Within Sum of Squares is used for explaining the homogeneity within a cluster. For defining the number of clusters in a clustering algorithm, WSS is plotted for a range pertaining to a number of clusters. The resultant graph is known as the Elbow Curve.
The Elbow Curve graph contains a point after which there aren't any significant decrements in the WSS. This is known as the bending point and represents K in K-Means.
Although the aforementioned is the widely-used approach, another important approach is the Hierarchical clustering. In this approach, dendrograms are created first and then distinct groups are identified from there.
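A small sketch of the elbow method described above, using KMeans `inertia_` as the within-cluster sum of squares (the blob data is made up):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
ks = range(1, 10)
wss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, wss, marker="o")        # look for the bend ("elbow") in this curve
plt.xlabel("number of clusters k")
plt.ylabel("within-cluster sum of squares")
plt.show()
```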
What is reinforcement learning?
Reinforcement learning
Reinforcement Learning is learning what to do and how to map situations to actions. The end goal is to maximize a numerical reward signal. The learner is not told which action to take, but instead must discover which actions yield the maximum reward. Reinforcement learning is inspired by how human beings learn; it is based on a reward/penalty mechanism.
Applications:
RL is quite widely used in building AI for playing computer games, e.g. AlphaGo Zero.
In robotics and industrial automation, RL is used to enable the robot to create an efficient adaptive control system for itself, which learns from its own experience and behavior.
What are some of the differences between GRU and LSTM?
- A GRU has two gates, an LSTM has three gates.
- GRUs don't possess an internal memory (cell state) that is different from the exposed hidden state. They don't have the output gate that is present in LSTMs.
- The input and forget gates are coupled by an update gate, and the reset gate is applied directly to the previous hidden state. Thus, the responsibility of the reset gate in an LSTM is really split up into both the reset and update gates.
- We don’t apply a second nonlinearity when computing the output.
More info https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21
What is the key part of a recurrent neural network?
That is, a neural network that has a memory that influences future predictions. Each letter it predicts should affect the likelihood of the next letter it will predict too. For example, if we have said "HEL" so far, it's very likely we will say "LO" next to finish out the word "Hello".
A recurrent neural network (RNN) is a class of artificial neural network where connections between units form a directed graph along a sequence. This allows it to exhibit dynamic temporal behavior for a time sequence. Unlike feedforward neural networks, RNNs can use their internal state (memory) to process sequences of inputs.
What is the problem of having highly correlated features in log regression?
Basically, you cannot distinguish between, for example, setting both coefficients to zero or setting one very close to the negative of the other: many coefficient combinations fit equally well, so the individual coefficients lose their meaning.
What is the basic idea behind logistic regression?
It computes a linear combination of the features, like linear regression (the linearity assumption now applies to the log-odds of the target). The output is then passed through a sigmoid function, turning the linear regression into a logistic one. This makes the output non-linear, so you should not use a least-squares loss function; use the logarithmic (cross-entropy) loss instead!
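A minimal numpy sketch of that idea: linear score, sigmoid, and the log (cross-entropy) loss:

```python
import numpy as np

def predict_proba(X, w, b):
    z = X @ w + b                    # linear part
    return 1 / (1 + np.exp(-z))      # sigmoid squashes scores into (0, 1)

def log_loss(y, p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)     # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
```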
Why is naive bayes “naive”?
Because it assumes the features are (conditionally) independent given the class.
What is an autoregressive model?
Pros and Cons of char based vs word based embeddings
How is batch normalization done on testing?
You use the running statistics (mean and variance) accumulated during training; if the training and test data are not from the same distribution, you have bigger problems.
What are the main loss functions for regression tasks?
- Mean Square Error, Quadratic loss, L2 Loss
- Mean Absolute Error, L1 Loss
using the squared error is easier to solve, but using the absolute error is more robust to outliers.
A big problem in using the MAE loss (for neural nets especially) is that its gradient has the same magnitude throughout, which means the gradient will be large even for small loss values. This isn't good for learning.
If we only had to give one prediction for all the observations that try to minimize MSE, then that prediction should be the mean of all target values. But if we try to minimize MAE, that prediction would be the median of all observations.
- Huber loss, Smooth Mean Absolute Error, Log-Cosh loss
These try to get the best of both: they are almost the absolute error |y − ŷ| for large errors, but smooth and almost quadratic, (y − ŷ)², near zero.
The Huber loss has a hyperparameter (delta) to tune.
More advanced: the quantile loss.
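numpy versions of the losses discussed above (delta is the Huber hyperparameter):

```python
import numpy as np

def mse(y, p):
    return np.mean((y - p) ** 2)

def mae(y, p):
    return np.mean(np.abs(y - p))

def huber(y, p, delta=1.0):
    e = np.abs(y - p)
    quad = 0.5 * e ** 2                       # quadratic near zero
    lin = delta * (e - 0.5 * delta)           # linear for large errors
    return np.mean(np.where(e <= delta, quad, lin))
```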
What are the parameters of a CNN 2D layer?
filters: Integer, the dimensionality of the output space (i.e. the number of output filters in the convolution).
kernel_size: An integer or tuple/list of 2 integers, specifying the height and width of the 2D convolution window.
strides: An integer or tuple/list of 2 integers, specifying the strides of the convolution along the height and width. Specifying any stride value != 1 is incompatible with specifying any dilation_rate value != 1.
padding: One of "valid" or "same" (case-insensitive). "valid" means no padding; "same" pads the input so that the output has the same spatial size as the input. (For 1D temporal convolutions there is also "causal", giving dilated causal convolutions where output[t] does not depend on input[t+1:]; useful when modeling temporal data where the model should not violate the temporal order. See WaveNet: A Generative Model for Raw Audio, section 2.1.)
dilation_rate: An integer or tuple/list of 2 integers, specifying the dilation rate for dilated convolution.
activation: Activation function to use (see activations). If you don’t specify anything, no activation is applied (ie. “linear” activation: a(x) = x).
use_bias: Boolean, whether the layer uses a bias vector.
kernel_initializer: Initializer for the kernel weights matrix (see initializers).
bias_initializer: Initializer for the bias vector (see initializers).
kernel_regularizer: Regularizer function applied to the kernel weights matrix (see regularizer).
bias_regularizer: Regularizer function applied to the bias vector (see regularizer).
activity_regularizer: Regularizer function applied to the output of the layer (its “activation”). (see regularizer).
kernel_constraint: Constraint function applied to the kernel matrix (see constraints).
bias_constraint: Constraint function applied to the bias vector (see constraints).
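A hedged tf.keras example of a Conv2D layer using several of the parameters above (all values are illustrative):

```python
import tensorflow as tf

layer = tf.keras.layers.Conv2D(
    filters=32,
    kernel_size=(3, 3),
    strides=(1, 1),
    padding="same",
    dilation_rate=(1, 1),
    activation="relu",
    use_bias=True,
    kernel_initializer="glorot_uniform",
    kernel_regularizer=tf.keras.regularizers.l2(1e-4),
)
x = tf.random.normal((1, 28, 28, 1))   # (batch, height, width, channels)
print(layer(x).shape)                  # (1, 28, 28, 32) with "same" padding
```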
When using the Gaussian mixture model, how do you know it’s applicable?
A Gaussian mixture model (GMM) is a probabilistic model that assumes that the instances were generated from a mixture of several Gaussian distributions whose parameters are unknown. In this approach we describe each cluster by its centroid (mean), covariance, and the size of the cluster (weight). Therefore, based on this definition, a GMM will be applicable when we know that the data points are mixtures of a gaussian distribution and form clusters with different mean and standard deviation.
How is a decision tree pruned?
Pruning is what happens in decision trees when branches that have weak predictive power are removed in order to reduce the complexity of the model and increase the predictive accuracy of a decision tree model
Reduced-error pruning is perhaps the simplest version: starting from the leaves, replace each node with its most common class; if that doesn't decrease predictive accuracy (on a validation set), keep it pruned.
What is the assumption of error in linear regression?
It requires the errors (residuals) to be Normally distributed (you can use a Box-Cox transformation of the target to help with that).
The fourth assumption is that the errors (residuals) follow a normal distribution. However, a less widely known fact is that, as sample sizes increase, the normality assumption for the residuals is not needed.
and importantly
Homoscedasticity (That errors have constant variance)
Homoscedasticity describes a situation in which the error term (that is, the “noise” or random disturbance in the relationship between the features and the target) is the same across all values of the independent variables.
There are two major consequences of heteroscedasticity. One is that the standard errors of the regression coefficients are estimated wrongly and the t-tests (and F test) are invalid.
Are decision tree parametric?
No, they are non-parametric: just a cascade of if-then-else decisions, with no fixed functional form or number of parameters assumed in advance.
What is the AdaGrad optimization algorithm?
AdaGrad (for adaptive gradient algorithm) is a modified stochastic gradient descent algorithm with per-parameter learning rate,
Informally, this increases the learning rate for sparser parameters and decreases the learning rate for ones that are less sparse. This strategy often improves convergence performance over standard stochastic gradient descent in settings where data is sparse and sparse parameters are more informative. Examples of such applications include natural language processing and image recognition.[21] It still has a base learning rate η, but this is multiplied with the elements of the vector {G_{j,j}}, which is the diagonal of the outer-product matrix
G = Σ_{τ=1..t} g_τ g_τᵀ,
where g_τ = ∇Q_i(w) is the gradient at iteration τ. The diagonal is given by
G_{j,j} = Σ_{τ=1..t} g_{τ,j}².
This vector is updated after every iteration. The formula for an update is now
w := w − η · diag(G)^(−1/2) ∘ g,
or, written as per-parameter updates,
w_j := w_j − (η / √G_{j,j}) · g_j.
Each G_{j,j} gives rise to a scaling factor for the learning rate that applies to the single parameter w_j. Since the denominator in this factor, √G_{j,j} = √(Σ_{τ=1..t} g_{τ,j}²), is the ℓ2 norm of the previous derivatives, extreme parameter updates get dampened, while parameters that get few or small updates receive higher learning rates.[19]
While designed for convex problems, AdaGrad has been successfully applied to non-convex optimization.[23]
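A minimal numpy sketch of the per-parameter AdaGrad update described above:

```python
import numpy as np

def adagrad_step(w, g, G, lr=0.01, eps=1e-8):
    """One AdaGrad update; G accumulates the squared gradients (diagonal of the outer-product matrix)."""
    G = G + g ** 2
    # Frequently/strongly updated parameters get smaller steps, rare ones get larger steps.
    w = w - lr * g / (np.sqrt(G) + eps)
    return w, G
```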
What are the main differences between bagging and boosting?
Bagging:
- parallel ensemble: each model is built independently
- aim to decrease variance, not bias
- suitable for high variance low bias models (complex models)
- an example of a bagging method is random forest, which develop fully grown trees (note that RF modifies the grown procedure to reduce the correlation between trees)
Boosting:
- sequential ensemble: try to add new models that do well where previous models lack
- aim to decrease bias, not variance
- suitable for low variance high bias models
- an example of a tree based method is gradient boosting
How should you deal with unbalanced classes?
- Collect more data.
- Use an appropriate evaluation metric (accuracy is misleading on imbalanced data).
- You can resample to rebalance your classes: either undersample the dominant class or oversample the minority one (plain oversampling makes it easy to overfit). You can also do synthetic minority oversampling (SMOTE), creating new fake minority samples to reduce that overfitting.
- You can use an algorithm that deals well with imbalance, or use class weights. XGBoost, for example, can resample its bags so that they have balanced classes.
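Two hedged options in code: class weights in scikit-learn, and SMOTE-style synthetic oversampling from the imbalanced-learn package (assuming it is installed; X_train/y_train are placeholders):

```python
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE

clf = LogisticRegression(class_weight="balanced")      # reweight instead of resampling
# clf.fit(X_train, y_train)

smote = SMOTE(random_state=0)                          # synthetic minority oversampling
# X_res, y_res = smote.fit_resample(X_train, y_train)
```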