ML Flashcards

Machine learning algorithms and tips

1
Q

In a neural network, what if all the weights are initialized with the same value?

A

In simplest terms, if all the weights are identical, every hidden unit in a layer receives exactly the same signal. Forward propagation still runs, but during backpropagation every unit also receives exactly the same gradient, so the weights stay identical after every update: the symmetry is never broken, and the layer effectively behaves as if it had a single unit.
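A minimal NumPy sketch (a hypothetical one-hidden-layer network, not from the card) illustrating the symmetry problem: with identical initial weights, every hidden unit computes the same activation and receives the same gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))          # 8 samples, 4 features
y = rng.normal(size=(8, 1))

W1 = np.full((4, 3), 0.5)            # all hidden weights identical
W2 = np.full((3, 1), 0.5)

h = np.tanh(X @ W1)                  # every hidden column is identical
y_hat = h @ W2

# Backprop for a squared-error loss
d_out = y_hat - y
dW2 = h.T @ d_out
d_h = (d_out @ W2.T) * (1 - h**2)
dW1 = X.T @ d_h

print(np.allclose(h[:, 0], h[:, 1]))       # True: identical activations
print(np.allclose(dW1[:, 0], dW1[:, 1]))   # True: identical gradients
```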

2
Q

What are the advantages of logistic regression?

A

Lots of ways to regularize your model, and you don't have to worry as much about your features being correlated, like you do in Naive Bayes.

Logistic regression models also have a probabilistic interpretation, unlike decision trees or SVMs, and you can update your model to take in new data (using an online gradient descent method), again unlike decision trees or SVMs.

Use it if you want a probabilistic framework (e.g., to easily adjust classification thresholds, to say when you’re unsure, or to get confidence intervals) or if you expect to receive more training data in the future that you want to be able to quickly incorporate into your model.

3
Q

How do you improve (reduce) bias?

A

By improving the model or the training algorithm:

  • Use a bigger neural net, or a random forest with boosting or bagging; try ensembling.
  • Train longer, or with a different minimization algorithm.
  • You can also play with the features and engineer new ones. If using a linear model, you can create polynomial features.
4
Q

What are the main ingredients that advanced methods like Adam, AdaGrad, and RMSProp add to SGD with mini-batches?

A

Essentially

1) Decaying the learning rate automatically
2) Updating different parameters with different learning rates
3) Adding momentum to avoid getting stuck at saddle points and in flat areas where the gradient is almost zero

5
Q

Nyquist theorem

A

The Nyquist Theorem states that in order to adequately reproduce a signal it should be periodically sampled at a rate that is 2X the highest frequency you wish to record.

Suppose the highest frequency component, in hertz, for a given analog signal is fmax. According to the Nyquist Theorem, the sampling rate must be at least 2fmax, or twice the highest analog frequency component.

6
Q

Your manager has asked you to run PCA. Would you remove correlated variables first? Why?

A

YES

Yes, because in the presence of correlated variables the variance explained by a particular principal component gets inflated.

For example: you have 3 variables in a data set, of which 2 are correlated. If you run PCA on this data set, the first principal component would exhibit roughly twice the variance it would exhibit with uncorrelated variables. Also, adding correlated variables lets PCA put more importance on those variables, which is misleading.
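A small sklearn sketch (synthetic data, assumed setup) showing how a correlated pair of variables inflates the share of variance explained by the first principal component:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
a = rng.normal(size=1000)
b = rng.normal(size=1000)
X_uncorr = np.column_stack([a, b, rng.normal(size=1000)])
X_corr = np.column_stack([a, a + 0.05 * rng.normal(size=1000), b])  # first two correlated

for name, X in [("uncorrelated", X_uncorr), ("correlated", X_corr)]:
    ratio = PCA(n_components=3).fit(X).explained_variance_ratio_
    print(name, ratio.round(2))   # first component's share is inflated in the correlated case
```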

7
Q

Why does regularization help with overfitting?

A

By keeping the coefficients/parameters of a parametric model, or the weights of the nodes in a network, small, it makes the model simpler (similar to removing those terms) and therefore less prone to overfitting. In a neural network, small weights also keep the activations in the near-linear regime of the activation function around zero.

8
Q

Machine learning recipe for bias/variance.

A

If you have high bias (the model trains badly), first reduce the bias to an acceptable amount. Then look at the validation/test performance: do you have high variance? Address it, then reiterate. In deep learning you can often reduce both, or in general improve one without affecting the other: in a neural network you can use more data (for variance) and a bigger architecture (for bias), and that is largely it.

9
Q

Pros and cons of affinity propagation

A

Strengths: The user doesn’t need to specify the number of clusters (but does need to specify ‘sample preference’ and ‘damping’ hyperparameters). Weaknesses: The main disadvantage of Affinity Propagation is that it’s quite slow and memory-heavy, making it difficult to scale to larger datasets. In addition, it also assumes the true underlying clusters are globular.

10
Q

Tips on practical use of trees?

A
  • Decision trees tend to overfit on data with a large number of features.
  • Getting the right ratio of samples to number of features is important, since a tree with few samples in high dimensional space is very likely to overfit.
  • Consider performing dimensionality reduction (PCA, ICA, or Feature selection) beforehand to give your tree a better chance of finding features that are discriminative.
  • Balance your dataset before training to prevent the tree from being biased toward the classes that are dominant.
11
Q

What is multi-task learning?

A

When you optimize more than one loss function, learning multiple tasks at once. It can be really powerful.

We can view multi-task learning as a form of inductive transfer. Inductive transfer can help improve a model by introducing an inductive bias, which causes a model to prefer some hypotheses over others. For instance, a common form of inductive bias is ℓ1 regularization, which leads to a preference for sparse solutions. In the case of MTL, the inductive bias is provided by the auxiliary tasks, which cause the model to prefer hypotheses that explain more than one task. As we will see shortly, this generally leads to solutions that generalize better.

12
Q

What is a CycleGAN?

A
13
Q

What are bidirectional neural networks?

A

Contrary to unidirectional ones, you do not just get information from the previous words in the sentence but also from the following ones.

So you can distinguish between:

He said: "Teddy Roosevelt was a president."

He said: "Teddy bears are on sale."

From the first three words alone you cannot tell whether "Teddy" is part of a name, as in the first sentence, or not, as in the second.

14
Q

Explain p-value?

A

When you conduct a hypothesis test in statistics, a p-value allows you to determine the strength of your results. It is a number between 0 and 1: the probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true. The smaller the p-value, the stronger the evidence against the null hypothesis.

15
Q

Explain AB Testing in great detail.

A
16
Q

Briefly, what is a random forest?

A

A random forest is a meta-estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.

In a random forest each split considers only a random subset of the features (usually sqrt(N) of them), while bagged trees use all of the available features. This is important to diversify the weak learner trees and make them less correlated.

17
Q

Advantages of SVM?

A

High accuracy, nice theoretical guarantees regarding overfitting, and with an appropriate kernel they can work well even if your data isn't linearly separable in the base feature space (different from logistic regression). They are especially popular in text classification problems, where very high-dimensional spaces are the norm.

Handle high dimensional data well

18
Q

Both being tree-based algorithms, how is random forest different from gradient boosting (GBM)?

A

The fundamental difference is that random forest uses the bagging technique to make predictions, while GBM uses boosting.

In bagging, the data set is divided into n samples using randomized sampling (with replacement). Then, using a single learning algorithm, a model is built on each sample. The resulting predictions are combined using voting or averaging. Bagging is done in parallel. In boosting, after the first round of predictions, the algorithm weighs misclassified examples higher, so that they can be corrected in the succeeding round. This sequential process of giving higher weights to misclassified examples continues until a stopping criterion is reached.

19
Q

What can cause problems in interpreting coefficients in something like linear regression?

A

Collinearity or correlation among variables: the loss can then be minimized equally well by setting both coefficients to zero or by giving them large values of opposite sign, so the individual coefficients lose their interpretation.

20
Q

What are the problems of using a standard (feedforward) neural network for sequence models?

A
  • Inputs/outputs can have variable length (with a recurrent net you just apply the same network to each input word, so the length can vary).
  • Most importantly: they do not share features learned across the different positions of the text.

Also, using one-hot encoded inputs in a fully connected network has a huge memory impact; just as conv-nets do for images, RNNs for text help keep memory usage down by sharing parameters.

21
Q

Assumptions of Logistic Regression

A

Logistic regression does not make many of the key assumptions of linear regression and general linear models that are based on ordinary least squares algorithms – particularly regarding linearity, normality, homoscedasticity, and measurement level.

First, logistic regression does not require a linear relationship between the dependent and independent variables. Second, the error terms (residuals) do not need to be normally distributed. Third, homoscedasticity is not required. Finally, the dependent variable in logistic regression is not measured on an interval or ratio scale.

First, binary logistic regression requires the dependent variable to be binary

Second, logistic regression requires the observations to be independent of each other.

Third, logistic regression requires there to be little or no multicollinearity among the independent variables.

Fourth, logistic regression assumes linearity of independent variables and log odds

22
Q

What are the issues with gradient descent computed on the entire batch instead of mini-batches (stochastic gradient descent)?

A

As we need to calculate the gradients for the whole dataset to perform just one update, batch gradient descent can be very slow and is intractable for datasets that don’t fit in memory. Batch gradient descent also doesn’t allow us to update our model online, i.e. with new examples on-the-fly.

23
Q

What are some drawbacks of batch norm?

A

Below are a few cons of Batch Normalization.

BN calculates the batch statistics (mini-batch mean and variance) in every training iteration, and therefore requires larger batch sizes during training so that it can effectively approximate the population mean and variance from the mini-batch. This makes BN harder to use when training networks for applications such as object detection or semantic segmentation, because they generally work with high input resolution (often as big as 1024x2048) and training with larger batch sizes is not computationally feasible.

BN does not work well with RNNs. The problem is that RNNs have a recurrent connection to previous timestamps and would require a separate β and γ for each timestep in the BN layer, which adds complexity and makes it harder to use BN with RNNs.

Different training and test calculation: during test (or inference) time, the BN layer doesn't calculate the mean and variance from the test mini-batch, but uses the fixed mean and variance calculated from the training data. This requires caution while using BN and introduces additional complexity. In PyTorch, model.eval() sets the model in evaluation mode, and the BN layer then uses the fixed mean and variance pre-calculated from the training data.

24
Q

Describe a way to detect anomalies in a given dataset.

A

Anomaly detection is a technique used to identify unusual patterns that do not conform to expected behavior, called outliers. These can be rare items, events or observations which raise suspicions by differing significantly from the majority of the data.

There are 2 types:

outlier detection

The training data contains outliers which are defined as observations that are far from the others. Outlier detection estimators thus try to fit the regions where the training data is the most concentrated, ignoring the deviant observations.

novelty detection

The training data is not polluted by outliers and we are interested in detecting whether a new observation is an outlier. In this context an outlier is also called a novelty.

The simplest approach is to use simple statistical techniques and flag the data points that deviate from common statistical properties of a distribution, including mean, median, mode, and quantiles. For example, marking an anomaly when a data point deviates by a certain standard deviation from the mean.

However, in high dimensions, the statistical approach could be difficult, therefore, machine learning techniques could be used. Following are the popular methods used to detect anomalies:

Isolation Forest

One Class SVM

PCA-based Anomaly detection

FAST-MCD

Local Outlier Factor

(Explaining one of the above-mentioned methods)

Isolation Forests build a random forest in which each decision tree is grown randomly. At each node, it picks a feature randomly, then it picks a random threshold value (between the min and max value) to split the dataset in two. The dataset gradually gets chopped into pieces this way, until all instances end up isolated from the other instances. Random partitioning produces noticeably shorter paths for anomalies. Hence, when a forest of random trees collectively produces shorter path lengths for particular samples, they are highly likely to be anomalies.
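A minimal sklearn sketch of the Isolation Forest approach on synthetic data (the contamination value here is just an assumption for the example):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
X_with_outliers = np.vstack([X, [[6, 6], [-7, 5]]])   # two obvious anomalies appended

iso = IsolationForest(contamination=0.01, random_state=0).fit(X_with_outliers)
labels = iso.predict(X_with_outliers)   # +1 = inlier, -1 = anomaly
print(np.where(labels == -1)[0])        # indices flagged as anomalies
```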

25
Q

Why is gradient checking important?

A

Gradient checking is a method to check the derivatives used in back-propagation. Implementations of the back-propagation algorithm are prone to bugs and errors. Therefore, before running the neural network on training data, it is worth checking that our implementation of back-propagation is correct. Gradient checking compares the back-propagation gradients, obtained analytically from the loss function, with numerically estimated gradients for each parameter. It therefore ensures that the implementation is correct and significantly increases our confidence in the correctness of our code.

By numerically checking the derivatives computed, gradient checking eliminates most of the problems that may occur as the back-propagation algorithm may have many subtle bugs. It could look like it’s working, and our cost function may end up decreasing on every iteration of gradient descent, but this may result in a neural network that has a higher level of error that could go unnoticed and give us worse performance.
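A small NumPy sketch of gradient checking on a toy linear model (a stand-in for a full back-propagation implementation): the analytic gradient is compared against a centered finite-difference estimate.

```python
import numpy as np

def loss(w, X, y):
    # Simple squared-error loss for a linear model (stand-in for any differentiable loss)
    return 0.5 * np.sum((X @ w - y) ** 2)

def analytic_grad(w, X, y):
    return X.T @ (X @ w - y)

def numeric_grad(w, X, y, eps=1e-6):
    g = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        g[i] = (loss(w_plus, X, y) - loss(w_minus, X, y)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
X, y, w = rng.normal(size=(20, 3)), rng.normal(size=20), rng.normal(size=3)
ga, gn = analytic_grad(w, X, y), numeric_grad(w, X, y)
rel_err = np.linalg.norm(ga - gn) / (np.linalg.norm(ga) + np.linalg.norm(gn))
print(rel_err)   # should be ~1e-9 or smaller if the analytic gradient is correct
```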

26
Q

What cross-validation technique would you use on a time series dataset?

A

Instead of using standard k-folds cross-validation, you have to pay attention to the fact that a time series is not randomly distributed data — it is inherently ordered by chronological order. If a pattern emerges in later time periods for example, your model may still pick up on it even if that effect doesn’t hold in earlier years!

You’ll want to do something like forward chaining where you’ll be able to model on past data then look at forward-facing data.

fold 1 : training [1], test [2]

fold 2 : training [1 2], test [3]

fold 3 : training [1 2 3], test [4]

fold 4 : training [1 2 3 4], test [5]

fold 5 : training [1 2 3 4 5], test [6]
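The same forward-chaining folds can be produced with scikit-learn's TimeSeriesSplit; a small sketch:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(6).reshape(-1, 1)   # six chronologically ordered samples
for fold, (train_idx, test_idx) in enumerate(TimeSeriesSplit(n_splits=5).split(X), 1):
    print(f"fold {fold}: training {train_idx + 1}, test {test_idx + 1}")
# Reproduces the expanding-window folds listed above.
```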

27
Q

How is KNN different from k-means clustering?

A

K-Nearest Neighbors is a supervised classification algorithm, while k-means clustering is an unsupervised clustering algorithm.

28
Q

How can you evaluate multilabel classification?

A

You turn this into multiple binary classification problems: for each label, is this example assigned that label, yes or no? This works because the label assignments should be independent of each other.

There you can use a binary cross-entropy loss (one sigmoid output per label).

https://machinelearningmastery.com/multi-label-classification-with-deep-learning/
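A minimal Keras sketch along the lines of the linked article (synthetic data; the layer sizes are arbitrary assumptions): one sigmoid output per label, trained with binary cross-entropy.

```python
import tensorflow as tf
from sklearn.datasets import make_multilabel_classification

# Synthetic multilabel data: 3 labels per example, each 0 or 1
X, y = make_multilabel_classification(n_samples=1000, n_features=10, n_classes=3, random_state=0)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(3, activation="sigmoid"),   # one independent probability per label
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["binary_accuracy"])
model.fit(X, y, epochs=5, verbose=0)
print(model.evaluate(X, y, verbose=0))
```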

29
Q

What is a Kalman filter and when is it used?

A

Kalman filters are a powerful tool used to estimate the hidden state of a system when we only have access to measurements of the system containing inaccuracies or errors. The estimate is based on the prior state and the current measurements. For example, a Kalman filter can be used to estimate the position of a car based on its GPS signal: the position of the car at time t is a combination of its prior estimates of position and speed at t−1 and the new measurement.

30
Q

Why do transformers use layer norm instead of batch norm?

A

A less known issue of batch norm is how hard it is to parallelize batch-normalized models. Since there is a dependence between elements of the batch, additional synchronization across devices is needed. While this is not an issue for most vision models, which tend to be trained on a small set of devices, transformers really suffer from this problem, as they rely on large-scale setups to counter their quadratic complexity. In this regard, layer norm provides some degree of normalization while incurring no batch-wise dependence.

31
Q

What is a lift analysis?

A

Lift analysis is used for classification tasks: you bin your data into deciles (predicted probability 0-0.1, 0.1-0.2, and so on) according to, say, the predicted probability that users will cancel a subscription. Then you check how many of them actually did cancel in each bin. If you have a good model, the high-probability bins will contain far more cancellations than average, and the opposite for the low bins. The ratio of the rate in a bin to the average rate is the lift.

32
Q

Can you explain what MapReduce is and how it works?

A

MapReduce is a data processing job that enables distributed computations to handle a huge amount of data.

It is used to split and process terabytes of data in parallel, achieving quicker results. This way it makes it easy to scale data processing over multiple computing nodes.

The processing happens using the map and reduce function.

Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (like counting the length of a string, or the number of occurrences of a word in a text).

Whereas reduce takes the output from a map as an input and combines those data tuples into a smaller set of tuples (for example, taking the maximum of a set of string lengths).

The most famous implementation is Apache Hadoop

See https://towardsdatascience.com/a-beginners-introduction-into-mapreduce-2c912bb5e6ac

33
Q

What is the importance/role of the pooling layer?

A

It is common to periodically insert a pooling layer in between successive conv layers in a ConvNet architecture. Its function is to progressively reduce the spatial size of the representation, to reduce the number of parameters and the computation in the network, and hence also to control overfitting. In addition to max pooling, the pooling units can also perform other functions, such as average pooling or even L2-norm pooling.

You can often discard pooling and use stride and padding instead to reduce the dimension.

34
Q

How is XGBoost handling bias-variance tradeoff?

A

XGBoost is a Gradient boosting of decision trees.

Boosting is a greedy algorithm and can overfit a training dataset quickly.

The general idea is that each individual tree will overfit some parts of the data but underfit other parts. In boosting you don't use the individual trees, but rather "average" them all together, so for a particular data point (or group of points) the trees that overfit those points are averaged with the trees that underfit them, and the combined average should neither overfit nor underfit, but be about right.

In particular in XGBoost

There are in general two ways that you can control overfitting in XGBoost:

The first way is to directly control model complexity.

This includes max_depth, min_child_weight and gamma.

The second way is to add randomness to make training robust to noise.

This includes subsample and colsample_bytree.

You can also reduce stepsize eta. Remember to increase num_round when you do so.

Below are some constraints that can be imposed on the construction of decision trees:

Number of trees, generally adding more trees to the model can be very slow to overfit. The advice is to keep adding trees until no further improvement is observed.

Tree depth, deeper trees are more complex trees and shorter trees are preferred. Generally, better results are seen with 4-8 levels.

Number of nodes or number of leaves, like depth, this can constrain the size of the tree, but is not constrained to a symmetrical structure if other constraints are used.

Number of observations per split imposes a minimum constraint on the amount of training data at a training node before a split can be considered

35
Q

Formulate Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) techniques.

A

Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text (the distributional hypothesis). A matrix containing word counts per document (rows represent unique words and columns represent each document) is constructed from a large piece of text and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of rows while preserving the similarity structure among columns. Documents are then compared by taking the cosine of the angle between the two vectors (or the dot product between the normalizations of the two vectors) formed by any two columns. Values close to 1 represent very similar documents while values close to 0 represent very dissimilar documents.

In natural language processing, the latent Dirichlet allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word’s presence is attributable to one of the document’s topics. LDA is an example of a topic model and belongs to the machine learning toolbox and in wider sense to the artificial intelligence toolbox.

36
Q

K-means and Gaussian mixture model: what is the difference between K-means and a mixture of Gaussians?

A

A mixture of Gaussians is a generative, statistics-based model that gives soft, probabilistic cluster assignments, while

K-means is very effective and fast but not based on statistics: it gives hard assignments.

Both are affected by the initialization.

37
Q

Why would you do dimensionality reduction?

A

(1) Reduce the storage space needed.
(2) Speed up computation (for example in machine learning algorithms); fewer dimensions mean less computing, and fewer dimensions can also allow the use of algorithms unfit for a large number of dimensions.
(3) Remove redundant features, for example there is no point in storing a terrain's size in both square meters and square miles (maybe the data gathering was flawed).
(4) Reducing the data's dimension to 2D or 3D may allow us to plot and visualize it, perhaps observe patterns, and gain insights.
(5) Too many features or too complex a model can lead to overfitting.

38
Q

What is the difference between type I vs type II error?

A
39
Q

What is the difference between sigmoid/logistic and softmax?

A

Softmax function vs. sigmoid function:

1. Softmax: used for multi-class classification in a logistic regression model. Sigmoid: used for binary classification in a logistic regression model.

2. Softmax: the probabilities sum to 1. Sigmoid: the probabilities need not sum to 1.

3. Softmax: typically used in the output layer of a neural network. Sigmoid: used as an activation function throughout a neural network.

4. Softmax: the highest value gets a higher probability than all the other values. Sigmoid: a high value gets a high probability, but not necessarily the highest.

40
Q

What are the main parameters for an ensemble method like a random forest?

A

The main parameters to adjust when using these methods are n_estimators and max_features.

For n_estimators: the larger the better, but also the longer it will take to compute. In addition, note that results will stop getting significantly better beyond a critical number of trees.

For max_features: the lower it is, the greater the reduction of variance, but also the greater the increase in bias. Empirically good default values are max_features=n_features for regression problems and max_features=sqrt(n_features) for classification tasks.
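A short sklearn sketch (synthetic data) of how these two parameters are typically explored:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=30, random_state=0)
for n_estimators in (10, 100, 500):
    rf = RandomForestClassifier(n_estimators=n_estimators, max_features="sqrt", random_state=0)
    # Accuracy usually improves with more trees, then plateaus
    print(n_estimators, cross_val_score(rf, X, y, cv=5).mean().round(3))
```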

41
Q

If you have 4GB of RAM in your machine and you want to train your model on a 10GB dataset, how would you go about this problem? Have you ever faced this kind of problem in your machine learning/data science experience so far?

A

First of all you have to ask which ML model you want to train.

For neural networks: batching with a memory-mapped NumPy array will work.

Steps:

Load the data as a memory-mapped NumPy array (np.memmap): it creates a mapping of the complete dataset on disk and doesn't load it all into memory.

Pass indices into the array to get the required data.

Feed this data to the neural network.

Use a small batch size.

For a (linear) SVM: incremental learning with partial_fit will work.

Steps:

Divide the one big dataset into small datasets.

Use the partial_fit method (e.g., sklearn's SGDClassifier with hinge loss), which only needs a subset of the complete dataset at a time.

Repeat step 2 for the other subsets.
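A sketch of the incremental-learning route, assuming the features and labels already live on disk in the hypothetical files features.dat and labels.dat (np.memmap plus sklearn's SGDClassifier, whose hinge loss gives a linear SVM):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Hypothetical on-disk arrays: ~10GB of float32 features plus int8 labels
n_samples, n_features = 25_000_000, 100
X = np.memmap("features.dat", dtype="float32", mode="r", shape=(n_samples, n_features))
y = np.memmap("labels.dat", dtype="int8", mode="r", shape=(n_samples,))

clf = SGDClassifier(loss="hinge")       # linear SVM trained incrementally
classes = np.array([0, 1])
batch = 10_000
for start in range(0, n_samples, batch):
    stop = start + batch
    # Only this slice is pulled into RAM at a time
    clf.partial_fit(X[start:stop], y[start:stop], classes=classes)
```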

42
Q

What is the O(n^2) bottleneck in a Transformer and how can we do better?

A

https://chengh.medium.com/evolution-of-fast-and-efficient-transformers-ec0378257994

Transformers scale badly with the length of the sequence (and better with the size of the embedding): they have an n^2 cost from multiplying the query and key matrices in the attention mechanism, where n is the sequence length.

To solve this you can use:

Segment level recurrence:

Transformer-XL

Sparse attention

Approximation

Inference Acceleration

43
Q

How can you fix the computational problem with word2vec?

A

Using hierarchical softmax or negative sampling (both avoid computing the full softmax over the entire vocabulary).

44
Q

Cons of SVM?

A

Memory-intensive, hard to interpret, and kind of annoying to run and tune

45
Q

What methods do you know for outlier detection?

A
46
Q

Gaussian discriminant analysis (GDA)

A

Similar to a mixture of Gaussians, in my opinion.

GDA is a method for data classification commonly used when the data can be approximated with a normal distribution. As a first step, you need a training set, i.e. a bunch of data already classified. These data are used to train your classifier and to obtain a discriminant function that tells you to which class a data point has the highest probability of belonging.

When you have your training set, for each class you compute the mean μ and the variance σ². These two parameters, as you know, describe a normal distribution.

Once you have computed the normal distribution for each class, to classify a data point you compute, for each class, the probability that the point belongs to it. The class with the highest probability is chosen.

47
Q

What are the different kernels functions in SVM ?

A

There are four types of kernels in SVM.

Linear Kernel

Polynomial kernel

Radial basis kernel

Sigmoid kernel

48
Q

Tell us more about bagging and boosting!

A
49
Q

Use an unfair coin for fair tosses

A

Toss the coin twice.

If the result is HT, assign X = 0. If the result is TH, assign X = 1.

If the result is either HH or TT, then discard the two coin tosses and go to step 1.

The probability of getting HH or TT in two tosses is

P(HH) + P(TT) = p^2 + q^2.   (1)

Therefore, the probability of eventually getting HT (and thus setting X = 0) is

P(HT) + (P(HH) + P(TT)) P(HT) + (P(HH) + P(TT))^2 P(HT) + … = pq / (1 − p^2 − q^2) = 1/2.   (2)

Similarly, the probability of X = 1 is 0.5, and hence we get a fair result from a biased coin.
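A quick Python simulation of this trick (the bias p = 0.7 is an arbitrary choice):

```python
import random

def biased_flip(p=0.7):
    # Biased coin: heads ('H') with probability p
    return 'H' if random.random() < p else 'T'

def fair_flip(p=0.7):
    # Von Neumann trick: keep tossing pairs until the two results differ
    while True:
        a, b = biased_flip(p), biased_flip(p)
        if a != b:
            return 0 if (a, b) == ('H', 'T') else 1

flips = [fair_flip() for _ in range(100_000)]
print(sum(flips) / len(flips))   # close to 0.5 despite the biased coin
```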

50
Q

What should you be careful to do if you do cross-validation for a classification problem?

A

Stratify your cross-validation samples!!
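For example, with scikit-learn's StratifiedKFold each fold keeps the class proportions (toy data):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 16 + [1] * 4)          # imbalanced labels

for train_idx, test_idx in StratifiedKFold(n_splits=4, shuffle=True, random_state=0).split(X, y):
    # Each test fold keeps the 4:1 class ratio instead of possibly missing class 1 entirely
    print(np.bincount(y[test_idx]))
```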

51
Q

Pros and cons of DBSCAN

A

Strengths: DBSCAN does not assume globular clusters, and its performance is scalable. In addition, it doesn’t require every point to be assigned to a cluster, reducing the noise of the clusters (this may be a weakness, depending on your use case). Weaknesses: The user must tune the hyperparameters ‘epsilon’ and ‘min_samples,’ which define the density of clusters. DBSCAN is quite sensitive to these hyperparameters.

52
Q

What is the difference between precision and recall

A

Precision: true positives / number of predicted positives = TP / (TP + FP). Of all the cases we predicted as positive, how many really are positive? Recall: of all the actual positives in the test set, how many did we predict as positive? TP / (TP + FN).

53
Q

What is the adjusted Rand index?

A

In statistics, and in particular in data clustering, it is a measure of the similarity between two data clusterings.

It can be used to assess a clustering approach when true labels are available.

The Rand index has a value between 0 and 1, with 0 indicating that the two data clusterings do not agree on any pair of points and 1 indicating that the data clusterings are exactly the same. The adjusted Rand index additionally corrects for chance, so that random labelings score close to 0 (it can even be negative), while 1 still means identical clusterings.

54
Q

SVM vs logistic regression?

A

They are closely linked. For example, you can get something SVM-like by asking logistic regression to maximize the margin of its decision (rather than to get the probabilities as close as possible to the labels). LR is better when you want probabilities rather than just decisions; SVMs technically do not give you probabilities at all. SVMs can be more scalable and, with kernels, can produce more complex separations.

55
Q

What is a drawback of masked language modelling as used in autoencoder language models like BERT?

A

The artificial symbols like [MASK] used by BERT during pretraining are absent from real data at finetuning time, resulting in a pretrain-finetune discrepancy. Moreover, since the predicted tokens are masked in the input, BERT is not able to model the joint probability using the product rule as in AR language modeling. In other words, BERT assumes the predicted tokens are independent of each other given the unmasked tokens, which is oversimplified, as high-order, long-range dependency is prevalent in natural language [9].

56
Q

If the labels are known in the clustering project, how to evaluate the performance of the model?

A

This basically becomes a classification problem, so you can evaluate it with standard classification metrics.

57
Q

What is a particular feature of a learner that makes it very suitable for bagging?

A

You want an unstable learner like a tree that gives very different answers given slightly different input

58
Q

What is XLNet and how does it differ from BERT?

A

XLNet is a BERT-like model rather than an entirely different one, but a promising one. In one phrase, XLNet is a generalized autoregressive pretraining method.

BERT predicts all the masked words simultaneously; XLNet does it sequentially, and not necessarily left to right.

BERT is a bidirectional autoencoder

XLNet is "generalized" autoregressive (AR): it uses permutation language modelling (https://arxiv.org/pdf/1906.08237.pdf).

Firstly, instead of using a fixed forward or backward factorization order as in conventional AR models, XLNet maximizes the expected log likelihood of a sequence w.r.t. all possible permutations of the factorization order. Thanks to the permutation operation, the context for each position can consist of tokens from both left and right. In expectation, each position learns to utilize contextual information from all positions, i.e., capturing bidirectional context.

Also, there is no data corruption (no [MASK] tokens in pretraining).

59
Q

What is heteroscedasticity and what are its effects on linear regression?

A

A linear regression model presents heteroscedasticity when the variance of the errors is not constant across the observations. This breaches one of the basic hypotheses on which the linear regression model is based.

Recall that one of the basic assumptions of linear regression is that the errors have constant variance. Under heteroscedasticity the data are heterogeneous, since they come from probability distributions with different variances.

There are two major consequences of heteroscedasticity. One is that the standard errors of the regression coefficients are estimated wrongly and the t-tests (and F test) are invalid.

The other is that OLS is an inefficient estimation technique.

60
Q

What are some methods for calibrating deep classifiers?

A

https://scikit-learn.org/stable/modules/calibration.html

Calibration curves (also known as reliability diagrams) compare how well the probabilistic predictions of a binary classifier are calibrated. It plots the true frequency of the positive label against its predicted probability, for binned predictions. The x axis represents the average predicted probability in each bin.

Calibrating a classifier consists of fitting a regressor (called a calibrator) that maps the output of the classifier (as given by decision_function or predict_proba) to a calibrated probability in [0, 1]. Denoting the output of the classifier for a given sample by fi, the calibrator tries to predict p(yi=1|fi).

The samples that are used to fit the calibrator should not be the same samples used to fit the classifier, as this would introduce bias. This is because performance of the classifier on its training data would be better than for novel data. Using the classifier output of training data to fit the calibrator would thus result in a biased calibrator that maps to probabilities closer to 0 and 1 than it should.
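A small sklearn sketch (synthetic data) of post-hoc calibration with CalibratedClassifierCV; Gaussian Naive Bayes is used here only because it tends to be poorly calibrated out of the box:

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=5000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

raw = GaussianNB().fit(X_train, y_train)
calibrated = CalibratedClassifierCV(GaussianNB(), method="isotonic", cv=5).fit(X_train, y_train)

for name, model in [("raw", raw), ("calibrated", calibrated)]:
    frac_pos, mean_pred = calibration_curve(y_test, model.predict_proba(X_test)[:, 1], n_bins=10)
    # Mean gap between predicted probability and observed frequency (smaller = better calibrated)
    print(name, round(float(abs(frac_pos - mean_pred).mean()), 3))
```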

61
Q

Explain AdaBoost

A

AdaBoost fits a sequence of weak learners (i.e., models that are only slightly better than random guessing) on repeatedly modified versions of the data.

The predictions are combined through a weighted majority vote (or sum) to produce the final prediction.

The data modifications at each so-called boosting iteration consist of applying weights to each of the training samples. Initially, those weights are all set to 1/n, so that the first step simply trains a weak learner on the original data.

For each successive iteration, the sample weights are individually modified and the learning algorithm is reapplied to the reweighted data.

At a given step, those training examples that were incorrectly predicted by the boosted model induced at the previous step have their weights increased, whereas the weights are decreased for those that were predicted correctly.

As iterations proceed, examples that are difficult to predict receive ever-increasing influence. Each subsequent weak learner is thereby forced to concentrate on the examples that are missed by the previous ones in the sequence

62
Q

What are the main steps to create a stacked model?

A

So, assume that we want to fit a stacking ensemble composed of L weak learners. Then we have to follow the steps thereafter:

split the training data in two folds

choose L weak learners and fit them to data of the first fold

for each of the L weak learners, make predictions for observations in the second fold

fit the meta-model on the second fold, using predictions made by the weak learners as inputs
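In scikit-learn these steps are wrapped up by StackingClassifier (which uses out-of-fold predictions via cv rather than a single hold-out fold); a minimal sketch with assumed base learners:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)), ("svc", SVC(probability=True))],
    final_estimator=LogisticRegression(),   # the meta-model
    cv=5,                                   # out-of-fold predictions feed the meta-model
)
print(stack.fit(X_train, y_train).score(X_test, y_test))
```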

63
Q

How does a neural network with one layer and one input and output compare to a logistic regression?

A

Neural networks and logistic regression are both used for classification problems. Logistic regression can be defined as the simplest form of Neural Network that results in straightforward decision boundaries whereas neural networks are a superset that includes additional complex decision boundaries to cater to more complex and large data. Logistic regression models cannot capture complex non-linear relationships w.r.t features. Meanwhile, a neural network with non-linear activation functions enables one to capture highly complex features.

64
Q

What is an autoencoder language model?

A

It is based on masked language modelling.

In comparison, AE-based pretraining does not perform explicit density estimation but instead aims to reconstruct the original data from a corrupted input. A notable example is BERT [10], which has been the state-of-the-art pretraining approach. Given the input token sequence, a certain portion of the tokens are replaced by a special symbol [MASK], and the model is trained to recover the original tokens from the corrupted version. Since density estimation is not part of the objective, BERT is allowed to utilize bidirectional context for the reconstruction.

65
Q

Which ML algorithms handle lots of irrelevant features well (separate signal from noise)?

A

Do: Naive Bayes, random forests (if the noise is not crazy), AdaBoost, and neural networks. Do not: KNN, linear/logistic regression (unless you regularize with Lasso), decision trees.

66
Q

What is the universal approximation theorem? Do we really need DEEP neural network?

A

According to the universal approximation theorem, given enough capacity, a feedforward network with a single hidden layer is sufficient to represent (approximate) any continuous function. However, the layer might have to be massive, and the network is prone to overfitting the data. Therefore, there is a common trend in the research community to make network architectures go deeper.

67
Q

Could you explain how to define the number of clusters in a clustering algorithm?

A

The primary objective of clustering is to group together similar entities in such a way that, while entities within a group are similar to each other, the groups remain different from one another.

Generally, the Within-cluster Sum of Squares (WSS) is used for measuring the homogeneity within a cluster. To define the number of clusters, WSS is plotted for a range of candidate numbers of clusters. The resulting graph is known as the elbow curve.

The elbow curve contains a point after which the WSS stops decreasing appreciably. This bending point (the elbow) gives K in K-means.

Although the above is the most widely used approach, another important approach is hierarchical clustering: dendrograms are created first, and distinct groups are then identified from them.
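A minimal sklearn sketch of the elbow approach on synthetic blobs (4 true clusters assumed):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
for k in range(1, 9):
    wss = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_   # within-cluster sum of squares
    print(k, round(wss, 1))   # the drop flattens after k = 4 (the elbow)
```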

68
Q

What is reinforcement learning ?

add more

A

Reinforcement learning

Reinforcement learning is learning what to do and how to map situations to actions. The end result is to maximize a numerical reward signal. The learner is not told which action to take, but instead must discover which actions yield the maximum reward. Reinforcement learning is inspired by how human beings learn; it is based on a reward/penalty mechanism.

Applications:

RL is quite widely used in building AI for playing computer games, e.g. AlphaGo Zero.

In robotics and industrial automation, RL is used to enable a robot to create an efficient adaptive control system for itself, which learns from its own experience and behavior.

69
Q

What are some of the differences between GRU and LSTM?

A
  • A GRU has two gates, an LSTM has three gates.
  • GRUs don't possess an internal memory (cell state) that is separate from the exposed hidden state. They don't have the output gate that is present in LSTMs.
  • The input and forget gates are coupled into an update gate, and the reset gate is applied directly to the previous hidden state. Thus, that gating responsibility of the LSTM is split up between the reset and update gates.
  • GRUs don't apply a second nonlinearity when computing the output.

More info https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21

70
Q

What is the key part of a recurrent neural network?

A

It is a neural network that has a memory that influences future predictions. That's because each letter it predicts should also affect the likelihood of the next letter it will predict. For example, if we have said "HEL" so far, it's very likely we will say "LO" next to finish the word "Hello".

A recurrent neural network (RNN) is a class of artificial neural network where connections between units form a directed graph along a sequence. This allows it to exhibit dynamic temporal behavior for a time sequence. Unlike feedforward neural networks, RNNs can use their internal state (memory) to process sequences of inputs.

71
Q

What is the problem of having highly correlated features in log regression?

A

Basically, you cannot distinguish between, for example, setting both coefficients to zero and setting one coefficient very large and the other close to its negative: many combinations give the same fit, so the individual coefficients become unstable and hard to interpret.

72
Q

What is the basic idea behind logistic regression?

A

It is a linear model: a linear combination of the features is passed through a sigmoid function, turning a linear regression into a logistic one that outputs a probability. The sigmoid makes the model non-linear in its parameters, so you should not use a least-squares loss function; use the logarithmic (cross-entropy) loss instead.

73
Q

Why is naive bayes “naive”?

A

Because it assumes the features are (conditionally) independent given the class.

74
Q

What is an autoregressive model?

A
75
Q

Pros and Cons of char based vs word based embeddings

A
76
Q

How is batch normalization done on testing?

A

You use the (running) statistics computed during training; if train and test do not come from the same distribution, you have bigger problems.

77
Q

What are the main loss functions for regression tasks?

A
  1. Mean Square Error, Quadratic Loss, L2 Loss
  2. Mean Absolute Error, L1 Loss

Using the squared error is easier to optimize, but the absolute error is more robust to outliers.

A big problem with the MAE loss (for neural nets especially) is that its gradient has the same magnitude throughout, which means the gradient stays large even for small loss values. This isn't good for learning.

If we only had to give one prediction for all the observations and we tried to minimize MSE, that prediction should be the mean of all target values; if we tried to minimize MAE, it would be the median of all observations.

  3. Huber Loss, Smooth Mean Absolute Error, Log-Cosh Loss

These try to get the best of both: they behave almost like the absolute error |y − ŷ| for large errors, but are smooth and almost quadratic, (y − ŷ)^2, near zero.

Huber loss has a hyperparameter (delta) to tune.

More advanced: quantile loss.
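A small NumPy sketch of the three losses (the delta value for Huber is an arbitrary choice):

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def huber(y, y_hat, delta=1.0):
    err = y - y_hat
    quadratic = 0.5 * err ** 2                      # used where |err| <= delta
    linear = delta * (np.abs(err) - 0.5 * delta)    # used where |err| > delta
    return np.mean(np.where(np.abs(err) <= delta, quadratic, linear))

y = np.array([1.0, 2.0, 3.0, 100.0])     # last value is an outlier
y_hat = np.array([1.1, 1.9, 3.2, 3.0])
print(mse(y, y_hat), mae(y, y_hat), huber(y, y_hat))   # MSE blows up; MAE and Huber stay moderate
```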

78
Q

What are the parameters of a CNN 2D layer?

A

filters: Integer, the dimensionality of the output space (i.e. the number of output filters in the convolution).

kernel_size: An integer or tuple/list of 2 integers, specifying the height and width of the 2D convolution window (a single integer means the same value for both dimensions).

strides: An integer or tuple/list of 2 integers, specifying the strides of the convolution along the height and width. Specifying any stride value != 1 is incompatible with specifying any dilation_rate value != 1.

padding: One of "valid" or "same" (case-insensitive). "valid" means no padding; "same" pads the input so that the output has the same spatial size as the input. (1D temporal convolutions additionally offer "causal", dilated convolutions where output[t] does not depend on input[t+1:], useful when modeling temporal data where the model should not violate the temporal order; see WaveNet: A Generative Model for Raw Audio, section 2.1.)

dilation_rate: An integer or tuple/list of 2 integers, specifying the dilation rate for dilated convolution.

activation: Activation function to use (see activations). If you don’t specify anything, no activation is applied (ie. “linear” activation: a(x) = x).

use_bias: Boolean, whether the layer uses a bias vector.

kernel_initializer: Initializer for the kernel weights matrix (see initializers).

bias_initializer: Initializer for the bias vector (see initializers).

kernel_regularizer: Regularizer function applied to the kernel weights matrix (see regularizer).

bias_regularizer: Regularizer function applied to the bias vector (see regularizer).

activity_regularizer: Regularizer function applied to the output of the layer (its “activation”). (see regularizer).

kernel_constraint: Constraint function applied to the kernel matrix (see constraints).

bias_constraint: Constraint function applied to the bias vector (see constraints).
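The descriptions above follow the Keras convolution layers; a minimal tf.keras sketch (arbitrary filter counts and input shape, chosen just for illustration) wiring several of these parameters together:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(
        filters=32,                # number of output feature maps
        kernel_size=(3, 3),        # height and width of the convolution window
        strides=(1, 1),
        padding="same",            # keep the spatial size of the input
        dilation_rate=(1, 1),
        activation="relu",
        use_bias=True,
        kernel_initializer="he_normal",
        kernel_regularizer=tf.keras.regularizers.l2(1e-4),
        input_shape=(64, 64, 3),
    ),
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
])
model.summary()
```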

79
Q

When using the Gaussian mixture model, how do you know it’s applicable?

A

A Gaussian mixture model (GMM) is a probabilistic model that assumes that the instances were generated from a mixture of several Gaussian distributions whose parameters are unknown. In this approach we describe each cluster by its centroid (mean), covariance, and the size of the cluster (weight). Therefore, based on this definition, a GMM will be applicable when we know that the data points are mixtures of a gaussian distribution and form clusters with different mean and standard deviation.

80
Q

How is a decision tree pruned?

A

Pruning is what happens in decision trees when branches that have weak predictive power are removed in order to reduce the complexity of the model and increase the predictive accuracy of a decision tree model

Reduced-error pruning is perhaps the simplest version: starting from the leaves, replace each node with its most popular class; if this doesn't decrease predictive accuracy (on a validation set), keep the node pruned.

81
Q

What are the assumptions on the errors in linear regression?

A

It requires the errors (residuals) to be normally distributed (you can use a Box-Cox transformation of the target variable to help achieve that).

One assumption is therefore that the errors (residuals) follow a normal distribution. However, a less widely known fact is that, as sample sizes increase, the normality assumption for the residuals is not needed.

and importantly

Homoscedasticity (That errors have constant variance)

Homoscedasticity describes a situation in which the error term (that is, the “noise” or random disturbance in the relationship between the features and the target) is the same across all values of the independent variables.

There are two major consequences of heteroscedasticity. One is that the standard errors of the regression coefficients are estimated wrongly and the t-tests (and F test) are invalid.

82
Q

Are decision trees parametric?

A

No, they are non-parametric: just a cascade of if-then-else decisions, with no fixed number of parameters.

83
Q

What is the AdaGrad optimization algorithm?

A

AdaGrad (for adaptive gradient algorithm) is a modified stochastic gradient descent algorithm with per-parameter learning rate,

Informally, this increases the learning rate for sparser parameters and decreases the learning rate for ones that are less sparse. This strategy often improves convergence performance over standard stochastic gradient descent in settings where data is sparse and sparse parameters are more informative. Examples of such applications include natural language processing and image recognition.[21] It still has a base learning rate η, but this is multiplied with the elements of a vector (G_{j,j}), which is the diagonal of the outer product matrix

G = \sum_{\tau=1}^{t} g_\tau g_\tau^T

where g_\tau = \nabla Q_i(w) is the gradient at iteration \tau. The diagonal is given by

G_{j,j} = \sum_{\tau=1}^{t} g_{\tau,j}^2.

This vector is updated after every iteration. The formula for an update is now

w := w - \eta \, \mathrm{diag}(G)^{-1/2} \circ g

or, written as per-parameter updates,

w_j := w_j - \frac{\eta}{\sqrt{G_{j,j}}} g_j.

Each G_{i,i} gives rise to a scaling factor for the learning rate that applies to a single parameter w_i. Since the denominator in this factor, \sqrt{G_{i,i}} = \sqrt{\sum_{\tau=1}^{t} g_{\tau,i}^2}, is the ℓ2 norm of the previous derivatives, extreme parameter updates get dampened, while parameters that get few or small updates receive higher learning rates.[19]

While designed for convex problems, AdaGrad has been successfully applied to non-convex optimization.[23]
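A tiny NumPy sketch of the per-parameter update above on a toy quadratic (the learning rate is an arbitrary choice):

```python
import numpy as np

def adagrad_update(w, grad, G, lr=1.0, eps=1e-8):
    # G accumulates the sum of squared gradients, one entry per parameter
    G += grad ** 2
    w -= lr * grad / (np.sqrt(G) + eps)
    return w, G

# Toy problem: minimize f(w) = 0.5 * ||w||^2, whose gradient is simply w
w = np.array([5.0, -3.0])
G = np.zeros_like(w)
for _ in range(200):
    grad = w.copy()
    w, G = adagrad_update(w, grad, G)
print(w)   # both parameters have shrunk toward the minimum at [0, 0]
```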

84
Q

What are the main differences between bagging and boosting?

A

Bagging:

  • parallel ensemble: each model is built independently
  • aim to decrease variance, not bias
  • suitable for high variance low bias models (complex models)
  • an example of a bagging method is random forest, which develop fully grown trees (note that RF modifies the grown procedure to reduce the correlation between trees)

Boosting:

  • sequential ensemble: try to add new models that do well where previous models lack
  • aim to decrease bias, not variance
  • suitable for low variance high bias models
  • an example of a tree based method is gradient boosting
85
Q

How should you deal with unbalanced classes?

A
  • Collect more data.
  • Use an appropriate evaluation metric.
  • Resample to rebalance your classes: either undersample the dominant class or oversample (repeat) the minority one (oversampling makes it easy to overfit). You can also do synthetic minority oversampling, creating new artificial minority samples to avoid overfitting.
  • Use an algorithm that deals well with imbalance, or use class weights. XGBoost, for example, can resample its bags so that they have balanced classes.
86
Q

What is a Box Cox Transformation?

A

The dependent variable of a regression analysis might not satisfy one or more assumptions of an ordinary least squares regression: the residuals could curve as the prediction increases, or follow a skewed distribution. In such scenarios it is necessary to transform the response variable so that the data meet the required assumptions. A Box-Cox transformation is a statistical technique to transform a non-normal dependent variable into a normal shape. Since most statistical techniques assume normality, applying a Box-Cox transformation means that you can run a broader number of tests. The transformation is named after statisticians George Box and Sir David Roxbee Cox, who collaborated on a 1964 paper and developed the technique.

87
Q

What are some clustering algorithms?

A
  • K-means is the simplest: assign each point to the nearest of the K cluster centroids, then recompute the centroids and repeat.
  • Mixture of Gaussians
  • Hierarchical clustering, either divisive (splitting) or agglomerative (grouping)
  • Affinity propagation
  • DBSCAN, based on the density of points
88
Q

Which ML algorithms return calibrated probabilities?

A

KNN and logistic regression do. Definitely not Naive Bayes. Possibly random forests and neural networks.

89
Q

What are 2 typical issues when training a RNN?

A

Exploding Gradients

This problem can be easily solved if you truncate or squash the gradients, also known as gradient clipping (it is also easier to diagnose: you see NaNs appearing).

Vanishing Gradients (they also happen in deep CNNs; RNNs are effectively very deep in time)

Related to not being able to learn from distant words: their gradients go to zero and drive the overall gradient to zero.

This was a major problem in the 1990s and much harder to solve than exploding gradients. Fortunately, it was largely solved through the concept of the LSTM.

90
Q

Disadvantages of decision trees?

A

They can create over-complex trees, which leads to overfitting.

Pruning, a minimum number of samples required at a leaf node, or a maximum depth of the tree are necessary; or use an ensemble.

  • Decision trees can be unstable because small variations in the data might result in a completely different tree being generated. Again, use an ensemble.

The problem of learning an optimal decision tree is known to be NP-complete, so practical (greedy) algorithms cannot guarantee to return the globally optimal decision tree.

There are concepts that are hard to learn because decision trees do not express them easily, such as XOR, parity or multiplexer problems. Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the dataset prior to fitting with the decision tree.

91
Q

What is Ensemble Learning?

A

An ensemble is a method of combining a diverse set of learners to improve the stability and predictive power of the model. Two types of ensemble learning methods are:

Bagging

Bagging trains similar learners on random subsamples of the population and combines (averages) their predictions, which mainly reduces variance.

Boosting

Boosting is an iterative method which adjusts the weight of each observation depending on the last classification. Boosting decreases the bias error and helps you build strong predictive models.

92
Q

How do you generate a synthetic sample for unbalanced classes?

A

A simple approach is to randomly sample the attributes from instances in the minority class.

You could use a method like Naive Bayes that can sample each attribute independently when run in reverse. You will have more and different data, but the non-linear relationships between the attributes may not be preserved.

There are also systematic algorithms that you can use to generate synthetic samples, like SMOTE, the Synthetic Minority Over-sampling Technique, which interpolates between a minority-class sample and its nearest minority-class neighbors.
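A minimal sketch using the imbalanced-learn package (assumed installed) on synthetic data:

```python
# Requires the imbalanced-learn package (pip install imbalanced-learn)
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
print("before:", Counter(y))

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))   # minority class oversampled with synthetic points
```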

93
Q

What are some extensions/variants of dropout?

A

https://towardsdatascience.com/12-main-dropout-methods-mathematical-and-visual-explanation-58cdc2112293

94
Q

What is the formula of the error that explains the bias-variance problem and trade-off?

A

E[(y - f_hat(x))^2] = Bias(f_hat(x))^2 + Var(f_hat(x)) + sigma^2

where Bias(f_hat(x)) = E[f_hat(x)] - f(x), Var(f_hat(x)) = E[(f_hat(x) - E[f_hat(x)])^2], and sigma^2 is the irreducible noise.

95
Q

What distribution gives you the probability of getting h heads after n tosses?

A

The binomial distribution:

P(h) = C(n, h) * r^h * (1 - r)^(n - h)

where C(n, h) = n! / (h! (n - h)!) is the binomial coefficient and r is the probability of heads.

96
Q

Why should we use Batch Normalization?

A
97
Q

What algorithms automatically learn feature interactions?

A

Do: neural networks, random forests, decision trees, AdaBoost. Do not: linear/logistic regression, KNN, Naive Bayes.

98
Q

Preprocess data steps:

A
  • Normalize your input. Some ML algorithms actually need this, and neural networks also train faster if you do it.
  • In a neural network, initialize your weights properly so that the gradients do not explode: usually mean zero and std = sqrt(2 / (input units)) (see the sketch below).
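
A tiny NumPy sketch of this kind of initialization (often called He initialization; the layer sizes are made up):

```python
import numpy as np

def he_init(n_in, n_out, rng):
    """Weights with mean 0 and std sqrt(2 / n_in), as described above."""
    return rng.normal(loc=0.0, scale=np.sqrt(2.0 / n_in), size=(n_in, n_out))

rng = np.random.default_rng(0)
W1 = he_init(784, 128, rng)   # e.g. the first layer of a hypothetical MNIST-sized net
print(W1.std())               # roughly sqrt(2 / 784) ~= 0.05
```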
99
Q

Drawback of weight normalization?

A

Weight Normalization speeds up the training similar to batch normalization and unlike BN, it is applicable to RNNs as well. But the training of deep networks with Weight Normalization is significantly less stable compared to Batch Normalization and hence it is not widely used in practice.

https://towardsdatascience.com/different-normalization-layers-in-deep-learning-1a7214ff71d6

100
Q

In NN what kind of regularization technique you have?

A
  • Dropout or L2 regularization on the weights W of your neurons.
  • More data, or data augmentation if you cannot collect more.
  • Early stopping, even if it is a little risky: you stop before ||w|| gets too big, but it also makes the bias and variance problems no longer orthogonal.
101
Q

What model does Word 2 vec uses?

A

Skip-gram and CBOW (continuous bag of words).

102
Q

Tell us more about batch normalization?

A

Batch normalization (also known as batch norm) is a method used to make artificial neural networks faster and more stable through normalization of the layers’ inputs by re-centering and re-scaling.

We compute these statistics per mini-batch, since we train with SGD.

We normalize each layer's inputs by adjusting and scaling the activations. Just as normalizing the inputs to the first layer speeds up learning, doing it for every layer does the same.

We can use higher learning rates because batch normalization makes sure that there’s no activation that’s gone really high or really low

It reduces overfitting because it has a slight regularization effect: similar to dropout, it adds some noise to each hidden layer's activations.
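
A minimal PyTorch sketch (layer sizes are arbitrary) showing where batch norm typically sits, between a linear layer and its non-linearity:

```python
import torch
from torch import nn

# Hypothetical fully connected classifier with batch norm after each linear layer.
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.BatchNorm1d(64),   # re-centers and re-scales activations over the mini-batch
    nn.ReLU(),
    nn.Linear(64, 64),
    nn.BatchNorm1d(64),
    nn.ReLU(),
    nn.Linear(64, 2),
)

x = torch.randn(32, 20)   # a mini-batch of 32 examples with 20 features
logits = model(x)
```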

103
Q

What’s the “kernel trick” and how is it useful?

A

The kernel trick uses kernel functions that enable learning in higher-dimensional spaces without explicitly calculating the coordinates of points in that space: instead, kernel functions compute the inner products between the images of all pairs of data points in the feature space.

104
Q

What are some applications of sequence models in ML?

A

Speech recognition

Sound–>word

Music generation

integer–>music

Sentiment classification

Words –> stars of review etc

DNA sequencing, translation, video activity recognition, named entity recognition (finding names in text), etc.

105
Q

What are some common ways of initializing the weights of a neural network?

A

If you initialize them all to zero, every hidden unit gets the same updates and the symmetry is never broken, so do not do that. Weights are usually initialized at random, being careful that they are neither too large nor too small.

Some newer initialization schemes:

They are basically random values scaled by the square root of the size of the layer (e.g. Xavier or He initialization).

106
Q

What is layer normalization?

A

Layer Normalization(LN)

Inspired by the results of Batch Normalization, Geoffrey Hinton et al. proposed Layer Normalization, which normalizes the activations along the feature direction instead of the mini-batch direction. This overcomes the cons of BN by removing the dependency on batches and makes it easier to apply to RNNs as well.

In essence, Layer Normalization normalizes each feature of the activations to zero mean and unit variance.

107
Q

What is one-hot encoding?

A

Encode categorical integer features using a one-hot aka one-of-K scheme.

The input to this transformer should be a matrix of integers, denoting the values taken on by categorical (discrete) features. The output will be a sparse matrix where each column corresponds to one possible value of one feature. It is assumed that input features take on values in the range [0, n_values).

108
Q

What is the difference between label encoding and one hot encoding?

A

One-hot encoding increases the dimensionality of a data set; label encoding does not.

Say we have a variable ‘color’ with 3 levels: Red, Blue and Green. One-hot encoding the ‘color’ variable will generate three new variables, Color.Red, Color.Blue and Color.Green, containing 0/1 values.

In label encoding, the levels of a categorical variable get encoded as integers (0, 1, ...), so no new variable is created. Label encoding is mainly used for binary or ordinal variables.
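
A small sketch of the two encodings with pandas and scikit-learn (the toy 'color' column mirrors the example above):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"color": ["Red", "Blue", "Green", "Blue"]})

# One-hot: three new 0/1 columns, one per level.
one_hot = pd.get_dummies(df["color"], prefix="Color")

# Label encoding: a single integer column (best reserved for binary/ordinal variables).
labels = LabelEncoder().fit_transform(df["color"])

print(one_hot)
print(labels)
```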

109
Q

What algorithm perform well on small observation sizes?

A

Do: logistic and linear regression, Naive Bayes. Do not: KNN, decision trees, neural networks, AdaBoost, etc.

110
Q

How do you combine precision and recall in a single metric?

A

Basically, you want neither of them to be extreme. The F1 score is 2*P*R / (P + R): a (harmonic) mean that is dragged down when either value is very small.

111
Q

Why is it not a great idea to use least squares for classification?

A

Least squares has a lot of problems with outliers, and the loss does not flatten out for extreme predictions. You do not want to penalize a value that is "too high" on the correct side, since with a threshold those predictions are going to be set to one anyway.

112
Q

How do you deal with multicollinearity? What are signs of its presence and how do you solve it?

A

Sign of presence:

A regression coefficient is not significant even though, theoretically, that variable should be highly correlated with Y.

When you add or delete an X variable, the regression coefficients change dramatically.

You see a negative regression coefficient when your response should increase along with X, and vice versa.

Your X variables have high pairwise correlations.

Dealing with:

Remove highly correlated predictors from the model (use a correlation matrix to find them).

Use Partial Least Squares Regression (PLS) or Principal Components Analysis, regression methods that cut the number of predictors to a smaller set of uncorrelated components.

113
Q

What are the typical steps to do speech recognition?

A

You start with the sound (assuming the assistant, e.g. Alexa, is already triggered): the amplitude as a function of time. Sampling at 16 kHz (16,000 samples per second) is enough to cover the frequency range of human speech.

You group the sampled audio into 20-millisecond-long chunks.

You Fourier-transform each chunk and pass the resulting spectrogram slices into a recurrent neural network.

This will give you the probable letter or spaces for each of those intervals.

HHHHEE____LL__LLLL0000__

Remove repeating characters and spaces

Lower-probability candidates such as AULLO or HULLO also appear, so you then compare the candidates against natural-language text databases to pick the most plausible transcription.

114
Q

Suppose you toss a fair coin 400 times. What is the

probability that you get at least 220 heads? Round your answer to the nearest percent.

A

The trick is to view each toss as a random variable that returns 1 if a head is tossed and 0 if a tail is tossed. Then each such random variable has expected value 1/2 and variance 1/4. So your Z-variable (for using the central limit theorem) will be:

(220-200)/(sqrt(400*(1/4))) = 20/10 = 2

So we’ve reduced the question to asking what’s the probability that Z takes a value bigger than 2. Recall that on the standard normal, the probability that z takes values between -2 and 2 is about 95%, so the probability that it takes values less than 2 is about 97.5% (it’s actually more like 97.7% but just estimating). So the probability that we are bigger than 2 is a little less than 2.5%, which after rounding to the nearest percent gives us 2%
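
A quick SciPy sketch to check the approximation against the exact binomial tail (numbers as in the question):

```python
from scipy import stats

# Exact tail: P(X >= 220) for X ~ Binomial(n=400, p=0.5).
p_exact = stats.binom.sf(219, n=400, p=0.5)   # sf(k) = P(X > k) = P(X >= k + 1)

# Normal approximation used above: Z = (220 - 200) / sqrt(400 * 1/4) = 2.
p_normal = stats.norm.sf(2.0)

print(f"exact: {p_exact:.4f}, normal approximation: {p_normal:.4f}")
```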

115
Q

What is a naive bayes method?

A

Simply, you assume all the features are independent given the class, so P(x|y) = P(x1|y) * P(x2|y) * ..., and therefore P(y|x) is proportional to P(y) * prod_i P(xi|y). You then pick the class that maximizes this. Different Naive Bayes classifiers assume different distributions for P(xi|y).

116
Q

What algorithms have a lot of problems with outliers?

A

All the boosting algorithms, for example: they may put extra weight on outliers and try to fit them at all costs instead of discarding them.

117
Q

Describe popular classes of methods for zero-shot/few-shot/meta learning.

A
118
Q

How should we recommend items to (new) users?

basics of recommendation systems

A

There are 2 big families of recommender systems: collaborative and content-based methods.

Collaborative methods are based solely on the past interactions recorded between users and items.

These interactions are stored in “user-item interactions matrix”.

Collaborative filtering algorithms are divided into two sub-categories, memory-based and model-based approaches:

memory –> nearest neighbours search

model –> generative model.

The more interactions, the better they get, but they cannot be used at the very beginning (the cold-start problem: you can suggest random or popular items at the beginning).

Three classical collaborative filtering approaches exist: two memory-based methods (user-user and item-item) and one model-based approach (matrix factorisation).

Content-based approaches use additional information about users and/or items: features of the users (sex, age) are used to predict items (with their own features, such as actors or genre for a movie).

The recommendation problem is then cast either as a classification problem (predict whether a user “likes” an item or not) or as a regression problem (predict the rating given by a user to an item).

119
Q

What is ELMo?

A

It is a deep contextualized word representation that models both complex characteristics of word use (e.g., syntax and semantics), and how these uses vary across linguistic contexts.

Note this model is character-based (via a CNN). It was one of the first to introduce context, compared to GloVe or Word2Vec.

These word vectors are learned functions of internal states of a deep biLM(bidirectional language model), which is pre trained on large text corpus.

The task was language modelling

see https://jalammar.github.io/illustrated-bert/

120
Q

what are some useful metrics for classification problems apart from accuracy? why can accuracy be a problem?

A

Accuracy is a problem for unbalanced classes. Other metrics are:

  • Confusion Matrix: A breakdown of predictions into a table showing correct predictions (the diagonal) and the types of incorrect predictions made (what classes incorrect predictions were assigned).
  • Precision: A measure of a classifier's exactness.
  • Recall: A measure of a classifier's completeness.
  • F1 Score (or F-score): A weighted average of precision and recall.
  • Kappa (or Cohen’s kappa): Classification accuracy normalized by the imbalance of the classes in the data.
  • ROC Curves: Like precision and recall, accuracy is divided into sensitivity and specificity and models can be chosen based on the balance thresholds of these values.
121
Q

What ML algorithm need feature rescaling?

A

KNN, linear and logistic regression (if regularized), and neural networks need it. Decision trees and random forests are fine without it (and are also robust against outliers).

122
Q

How many topic modeling techniques do you know of? Explain them briefly.

A

Latent Semantic Analysis (LSA):

Latent Semantic Analysis tries to use the context around the words to find hidden concepts. It does that by generating a document-term matrix, where each cell has a TF-IDF score which assigns a weight to every term in the document. Using a technique known as Singular Value Decomposition (SVD), the dimensions of the matrix are reduced to the number of desired topics. The resultant matrices, after decomposition, give us vectors for every document and term in our data that can then be used to find similar words and similar documents using cosine similarity.
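
A compact scikit-learn sketch of LSA on a made-up toy corpus (TF-IDF followed by truncated SVD):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "dogs chase cats", "stock markets fell today"]  # toy corpus

doc_term = TfidfVectorizer().fit_transform(docs)   # document-term matrix with TF-IDF weights

lsa = TruncatedSVD(n_components=2)                 # number of desired "topics"
doc_topic = lsa.fit_transform(doc_term)            # dense topic vectors, one row per document
```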

Probabilistic Latent Semantic Analysis(PLSA):

Probabilistic Latent Semantic Analysis is a technique used to model information under a probabilistic framework instead of SVD. It creates a model P(D,W) such that for any document d and word w, P(d,w) corresponds to that entry in the document-term matrix.

Latent Dirichlet Allocation (LDA):

Latent Dirichlet Allocation is a technique that automatically discovers the topics that documents contain. LDA represents documents as mixtures of topics that spit out words with certain probabilities. It assumes that each document is a mixture of various topics and every topic is a mixture of various words. Assuming this, LDA then tries to backtrack from the documents to find a set of topics that are likely to have generated the collection. It maps all the documents to the topics in such a way that the words in each document are mostly captured by those imaginary topics.

123
Q

Is it necessary to use activation functions in neural networks?

A

Activation functions are essential to learn and model complex data and its relationships. These functions add non-linearity to the network. If there is no activation function, then the input signal is mapped to the output using a linear function, which is just a polynomial of degree one. Why is that a problem? Linear functions are not able to capture complex functional mappings of the data. However, this is possible with non-linear functions, which can be composed to model essentially any real-world data.

124
Q

Deep Learning’s Most Important Ideas - A Brief Historical Review

A

https://dennybritz.com/blog/deep-learning-most-important-ideas/

126
Q

Hidden Markov Model. Briefly what they are and some of their applications.

A

Hidden Markov Model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process with unobserved (i.e. hidden) states.

Hidden Markov models are especially known for their application in reinforcement learning and temporal pattern recognition such as speech, handwriting, gesture recognition, part-of-speech tagging, musical score following, partial discharges and bioinformatics.

The idea is to compute the probability of a sequence of observations, which are related to hidden states you cannot observe; the model is specified by transition and emission probabilities.

127
Q

If X and Y are independent random variables, can you write down the formula for Var(XY)?

A
128
Q

How would you evaluate your ML parameters?

A

cross validation

129
Q

Reduce variance or overfitting?

A
  • Get more data if possible.
  • Use regularization, dropout, and similar techniques.
  • Reduce the number of features with some kind of feature selection.
  • Change algorithm; some are less prone to overfitting.

Label smoothing is another regularization technique: it adds noise to the labels. It is used for multiclass classification with a softmax output in particular.

130
Q

list activation function and describe them?

A

softmax:

The softmax function is a generalization of the logistic activation function, used for multiclass classification: softmax(z)_j = e^(z_j) / sum_i e^(z_i).

For [1.2, 0.9, 0.75], applying the softmax function gives [0.42, 0.31, 0.27], which we can use as the probabilities of belonging to each class.

relu:

The most used right now: zero from -inf to 0 (if not leaky) and linear after that. Note it is unbounded for large positive inputs.

The main advantage of using the ReLU function over other activation functions is that it does not activate all the neurons at the same time

tanh

The tanh is bounded between -1 and 1. It is like the logistic sigmoid but better: negative inputs are mapped strongly negative and zero inputs are mapped near zero. The tanh function is mainly used for classification between two classes. It solves the problem of the outputs all having the same sign; all other properties are the same as those of the sigmoid.

sigmoid:

Or logistic: simply 1 / (1 + e^(-z)). It lies between 0 and 1. It is differentiable, but it can get you stuck during training since the gradient is close to zero at the saturated ends.

linear:

this is simply the identity
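
A NumPy sketch of the activations listed above (the example input reproduces the softmax numbers in the answer):

```python
import numpy as np

def softmax(z):
    z = z - z.max()                 # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# tanh is available directly as np.tanh; "linear" is just the identity.
print(softmax(np.array([1.2, 0.9, 0.75])))   # ~[0.42, 0.31, 0.27], as above
```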

131
Q

What are different normalization layers?

A

https://towardsdatascience.com/different-normalization-layers-in-deep-learning-1a7214ff71d6

Batch Normalization:

you need batches large enough to compute the statistics.

It does not work well with RNNs: because of the recurrent connections, you would need different normalization statistics for each time step.

Weight Normalization

Layer Normalization

Group Normalization

Weight Standardization

132
Q

What is an example of a data set with a non-Gaussian distribution?

A

Some important distributions are:

Bernoulli/Poisson/gamma/beta,

many of them are categorical data, and those with only two categories are Bernoulli. Some are multivariate numerical and presumably approximately normal in each coordinate but different coordinates are not independent.

For large n, many distributions are approximately normal, such as the gamma distribution and the beta distribution.

I’ll give you a few variables you might consider measuring:

Waiting time. (Usually). For instance if you are queuing to check in.

Number of faults (in some unit of measurement). For instance the number of typos in 1 sheet of paper (A4).

Same type of distribution (probably): The number of accidents at a specific crossroads between the hours of 06.00 am to 09.00 am.

The number of times you throw a 6, when you throw a limited number of times.

The number of children having a particular (rare) disease, given a distribution of genes coding for this within the parents.

133
Q

Activation functions: What is the improvement of Relu over tanh? What are some of the problem ?

Which activation function improve on that?

A

Derivatives are now big (==1) so the gradient does not vanish.

However, the gradient is zero for negative inputs (which can result in dead neurons).

This is why leaky relu is introduced

134
Q

Simple Nyquist theorem?

A

By the Nyquist theorem, we know that we can use math to perfectly reconstruct the original sound wave from the spaced-out samples, as long as we sample at least twice as fast as the highest frequency we want to record.

135
Q

The assumptions of linear regression.

A

Linear regression is an analysis that assesses whether one or more predictor variables explain the dependent (criterion) variable. The regression has five key assumptions:

Linear relationship: between the outcome variable and the independent variables.

Multivariate normality: Multiple regression assumes that the residuals are normally distributed.

No or little multicollinearity: the independent variables are not highly correlated with each other.

No auto-correlation: no autocorrelation in the data. Autocorrelation occurs when the residuals are not independent from each other. For instance, this typically occurs in stock prices, where the price is not independent from the previous price.

Homoscedasticity: this assumption states that the variance of the error terms is similar across the values of the independent variables.

136
Q

EM algorithm

A

What makes finding the parameters of a gaussian mixture hard is the fact that we do not know which gaussian a point belongs to. This is a latent variable.

So it is hard to compute the maximum likelihood estimation directly.

So, with mu_k and sigma_k (and mixing weights pi_k) as the parameters of the mixture components, we can iterate:

Initialization: Get an initial estimate for the parameters theta_0 (e.g. all the mu_k, sigma_k^2 and pi_k variables). In many cases, this can just be a random initialization.

Expectation step: Assuming the parameters theta_{t-1} from the previous step are fixed, compute the expected values of the latent variables (or, more often, a function of the expected values of the latent variables).

Maximization step: Given the values you computed in the last step (essentially known values for the latent variables), estimate new values for theta_t that maximize a variant of the likelihood function.

Exit condition: If the likelihood of the observations has not changed much, exit; otherwise, go back to the expectation step.

137
Q

Pros and cons of hierarchical

A

Strengths: The main advantage of hierarchical clustering is that the clusters are not assumed to be globular. In addition, it scales well to larger datasets. Weaknesses: Much like K-Means, the user must choose the number of clusters (i.e. the level of the hierarchy to “keep” after the algorithm completes).

138
Q

How is the binary loss defined?

A

also known as log-loss. It is defined as

-log P(yt|yp) = -(yt log(yp) + (1 - yt) log(1 - yp))

This is the loss function used in (multinomial) logistic regression and extensions of it such as neural networks, defined as the negative log-likelihood of the true labels given a probabilistic classifier’s predictions.

For the multiclass case, just generalize:

-log(P) = -sum_i( yt_i * log(yp_i) )
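
A small NumPy sketch of the binary version (clipping the probabilities is a common numerical safeguard, not part of the definition):

```python
import numpy as np

def binary_log_loss(y_true, y_prob, eps=1e-15):
    """-(y log(p) + (1 - y) log(1 - p)), averaged over samples."""
    p = np.clip(y_prob, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

print(binary_log_loss(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.6])))
```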

139
Q

What is the probability meaning of area under the curve?

A

When using normalized units, the area under the curve (often referred to as simply the AUC) is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one (assuming ‘positive’ ranks higher than ‘negative’).

140
Q

What is one draw back of stacking (in terms of data?)

A

Data that have been used to train the weak learners should not be reused to train the meta-model. Thus, an obvious drawback of splitting our dataset in two parts is that we only have half of the data to train the base models and half to train the meta-model. To overcome this limitation we can, however, follow some kind of “k-fold cross-training” approach (similar to what is done in k-fold cross-validation).

141
Q

When is using accuracy (or MSE) as an evaluation metric a bad idea?

A

When you have very unbalanced classes. A 1% error sounds good, but if only 0.5% of the data points have cancer and the rest do not, you can beat that error rate by simply always predicting “no cancer”.

142
Q

What is pruning in Decision Tree ?

A

When we remove sub-nodes of a decision node, this process is called pruning; it is the opposite of splitting.

143
Q

SVM, how does kernels change the data ?

A
144
Q

What is stratified k-fold or sampling?

A

Stratification keeps the subsample of the population similar to the total population.

You are basically asking the model to take the training and test set such that the class proportion is same as of the whole dataset, which is the right thing to do. If your classes are balanced then a shuffle (no stratification needed here) can basically guarantee a fair test and train split.
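
A scikit-learn sketch (the imbalanced toy dataset is made up):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    # each test fold preserves (approximately) the overall 90/10 class proportion
    print(round(y[test_idx].mean(), 2))
```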

145
Q

Why is feature selection important?

A

Benefits of performing feature selection before modeling your data:

· Reduces Overfitting: Less redundant data means less opportunity to make decisions based on noise.

· Improves Accuracy: Less misleading data means modeling accuracy improves.

· Reduces Training Time: less data reduces algorithm complexity, so algorithms train faster.

146
Q

Describe stochastic gradient descent

A
147
Q

What are the 3 main ensemble methods? Do they use heterogeneous or homogeneous learners?

A

bagging, that often considers homogeneous weak learners, learns them independently from each other in parallel and combines them following some kind of deterministic averaging process

boosting, which often considers homogeneous weak learners, learns them sequentially in a very adaptive way (each base model depends on the previous ones) and combines them following a deterministic strategy

stacking, that often considers heterogeneous weak learners, learns them in parallel and combines them by training a meta-model to output a prediction based on the different weak models predictions

148
Q

Naive Bayes cons

A

not very high accuracy

Relies on independence assumption.

Requires removing correlated features, because they are effectively voted twice in the model, which can over-inflate their importance.

If a categorical variable has a category in test data set which was not observed in training data set, then the model will assign a zero probability. It will not be able to make a prediction. This is often known as “Zero Frequency”

149
Q

Pros and cons of K-means

A

Strengths: K-Means is hands-down the most popular clustering algorithm because it’s fast, simple, and surprisingly flexible if you pre-process your data and engineer useful features.

Weaknesses: The user must specify the number of clusters, which won’t always be easy to do. In addition, if the true underlying clusters in your data are not globular, then K-Means will produce poor clusters.

150
Q

What are some problems with training word2vec?

A

Well, it does not have context

If you see it as a multiclass classification problem:

the softmax over the whole vocabulary is computationally expensive.

Also, you have to sample the context words to predict carefully (frequent words are usually down-weighted).

The solution is negative sampling: turn it into a binary classifier (is this the right word or not?) trained with the true context word and k random negative words.

151
Q

What is and what is the best method for for image matching?

A

You will probably use a convnet to extract features for the image and then measure some kind of similarity.

The feature-based approach relies on the extraction of image features, i.e. shapes, textures, colors, to match in the target image or frame. This approach is currently achieved by using neural networks and deep learning classifiers such as VGG, AlexNet, ResNet. Deep convolutional neural networks process the image by passing it through different hidden layers, and at each layer produce a vector with classification information about the image. These vectors are extracted from the network and are used as the features of the image. Feature extraction using deep neural networks is extremely effective and thus is the standard in state-of-the-art template matching algorithms.

152
Q

What does it mean to have a well-calibrated neural network?

A
153
Q

What are important ways of preprocessing data?

A
  • Standardization, or mean removal and variance scaling
  • Scaling features to a range

If you have outliers you should be careful to use some robust algorithm.

Also centering sparse data would destroy the sparseness structure in the data, and thus rarely is a sensible thing to do. However, it can make sense to scale sparse inputs, especially if features are on different scales.

  • Generate polynomial features
  • Encode categorical data
  • Drop NaNs or impute missing values (mean, median, etc.)
154
Q

What is ULMFiT ?

A

ULMFiT stands for Universal Language Model Fine-tuning; it is an architecture and transfer learning method that can be applied to NLP tasks.

This model is word based

ULMFit is unidirectional not bidirectional

Transfer learning is really the big addition of this model.

Discriminative Fine-Tuning

“As different layers capture different types of information, they should be fine-tuned to different extents.”¹ Thus, for each layer, a different learning rate is used.

155
Q

What is selection bias?

What are the different types?

A

Selection bias is the bias introduced by the selection of individuals, groups or data for analysis in such a way that proper randomization is not achieved, thereby ensuring that the sample obtained is not representative of the population intended to be analyzed

There are different types

Sampling bias is systematic error due to a non-random sample of a population,[2] causing some members of the population to be less likely to be included than others, resulting in a biased sample,

Time interval

Early termination of a trial at a time when its results support the desired conclusion

Data

Partitioning (dividing) data with knowledge of the contents of the partitions, and then analyzing them with tests designed for blindly chosen partitions.

Observer selection

Philosopher Nick Bostrom has argued that data are filtered not only by study design and measurement, but by the necessary precondition that there has to be someone doing a study.

156
Q

Describe a class imbalance problem of classification and how you would approach it

A
157
Q

Character level generation sequence model architecture?

A
158
Q

What is Entropy and Information gain in Decision tree algorithm ?

A

The core algorithm for building decision trees is called ID3. ID3 uses Entropy and Information Gain to construct a decision tree.

Entropy

A decision tree is built top-down from a root node and involves partitioning the data into homogeneous subsets. ID3 uses entropy to check the homogeneity of a sample: if the sample is completely homogeneous the entropy is zero, and if the sample is equally divided it has entropy of one.

Information Gain

The Information Gain is based on the decrease in entropy after a dataset is split on an attribute. Constructing a decision tree is all about finding the attributes that return the highest information gain.

159
Q

What are SVM bird eye view?

A

An SVM finds the decision boundary that provides the maximum separating margin for a linearly separable dataset (or in a transformed feature space if you use the kernel trick). That is, of all possible decision boundaries that could be chosen to separate the dataset for classification, it chooses the one that is most distant from the points of both classes nearest to that boundary.

160
Q

What is an AUC in a ROC curve? why is that so useful?

A

It lets you visualize how well a classifier is doing across all the possible probability thresholds. It can be used whenever your classifier returns a probability (or score), even if the probabilities are not well calibrated. Once you have maximized the AUC, you can then choose the threshold according to your business needs.

161
Q

Cons of neural network

A

not suitable as general-purpose algorithms because they require a very large amount of data. In fact, they are usually outperformed by tree ensembles for classical machine learning problems.

162
Q

Activation functions: What is the problem with sigmoid?

A

The derivatives are too small: they can kill the gradient during backpropagation.

They are not zero-centered, so the gradient updates for a layer's weights tend to all share the same sign, which causes inefficient zig-zag dynamics around zero.

For these reasons it is rarely used in hidden layers these days.

163
Q

Which on is better GRU or LSTM?

A

there isn’t a clear winner. In many tasks both architectures yield comparable performance and tuning hyperparameters like layer size is probably more important than picking the ideal architecture. GRUs have fewer parameters (U and W are smaller) and thus may train a bit faster or need less data to generalize. On the other hand, if you have enough data, the greater expressive power of LSTMs may lead to better results.

164
Q

What are the advantages of decision tree?

A

Simple to understand and to interpret. Trees can be visualised.

Requires little data preparation.

Other techniques often require data normalization, dummy variables need to be created and blank values to be removed.

Able to handle both numerical and categorical data.

Other techniques are usually specialised in analysing datasets that have only one type of variable. Able to handle multi-output problems.

The cost of using the tree (i.e., predicting data) is logarithmic in the number of data points used to train the tree.

165
Q

What are the main loss functions for classifications?

A

1) Log loss:

-sum_i( yt_i * log(yp_i) )

2) Hinge loss (with labels yt in {-1, 1} and raw score yp):

max(0, 1 - yt*yp)

3) Exponential loss:

e^(-beta * yt * yp)

It penalizes incorrect predictions more than Hinge loss and has a larger gradient.

Logarithmic loss leads to better probability estimation at the cost of accuracy

Hinge loss leads to better accuracy and some sparsity at the cost of much less sensitivity regarding probabilities

166
Q

Random forest pseudo code?

A
  • Randomly select “k” features from total “m” features.
  • Where k << m
  • Among the “k” features, calculate the node “d” using the best split point.
  • Split the node into daughter nodes using the best split.
  • Repeat 1 to 3 steps until “l” number of nodes has been reached.
  • Build forest by repeating steps 1 to 4 for “n” number times to create “n” number of trees.
  • Take the test features and use the rules of each randomly created decision tree to predict the outcome, and store each predicted outcome (target).
  • Calculate the votes for each predicted target.
  • Consider the most-voted predicted target as the final prediction from the random forest algorithm (see the sketch below).
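
In practice you would rarely code this by hand; a scikit-learn sketch of the same idea (parameter values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# n_estimators = number of trees; max_features plays the role of the "k << m" random subset.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X, y)

predictions = forest.predict(X[:5])   # majority vote over the individual trees
```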
167
Q

What activation function would you use in the final layer of a NN, for regression and for classification?

A

Linear for regression

Softmax for classification

168
Q

What is a resnet?

A

The residual blocks in a ResNet learn residual functions. A block can easily learn the identity, which basically allows very deep networks to avoid vanishing gradients and degradation.

If y = F(x), it is hard to make F(x) learn the identity F(x) = x; it is easier to learn F(x) = 0 and use y = F(x) + x.

When deeper plain networks start converging, a degradation problem is exposed: as the network depth increases, accuracy gets saturated and then degrades rapidly.

Instead of learning a direct mapping x -> y with a function H(x) (a few stacked non-linear layers), define the residual function F(x) = H(x) - x, which can be reframed as H(x) = F(x) + x, where F(x) is the stacked non-linear layers and x is the identity (input = output).
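
A simplified PyTorch sketch of a residual block implementing y = F(x) + x (channel counts and layer choices are illustrative, not the exact configuration from the ResNet paper):

```python
import torch
from torch import nn

class ResidualBlock(nn.Module):
    """A simplified residual block: output = relu(F(x) + x), with an identity shortcut."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)   # add the skip connection, then the non-linearity

block = ResidualBlock(64)
y = block(torch.randn(1, 64, 32, 32))
```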

169
Q

Testing a fair coin?

A

Some useful information: for a Bernoulli variable the variance is p(1-p), so the standard error on the mean goes as sqrt(p(1-p)/n). You can then compare the estimate p_hat = heads / (heads + tails) with 0.5, given that error (assuming approximate normality).

170
Q

Difference and pros and cons of generative vs discriminative models?

A

Generative models allow you to make explicit claims about the process that underlies a dataset.

After fitting a generative model, you can also run them forward to generate synthetic data sets.

However, if the relationships expressed by your generative model only approximate the true underlying generative process that created your data, discriminative models will typically outperform in terms of classification error rates

A discriminative model is going to attempt to optimize the prediction of y from x, whereas a generative model will attempt to optimize the joint prediction of x and y. Because of this, discriminative models outperform generative models at conditional prediction tasks (logistic regression models tend to outperform Naive Bayes models with the same number of parameters).

There is one case in which you can’t use a discriminative model at all: if you don’t have labeled data.

171
Q

What is the main parameter for LSTM layers?

A

As for a GRU, you can choose the number of hidden units in the hidden state a.

It has the usual trade-off: too big and you overfit; too small and you get high bias.

172
Q

What are some feature selection algorithms?

A

Filter based: We specify some metric and based on that filter features. An example of such a metric could be correlation/chi-square variance threshold.

Wrapper-based: Wrapper methods consider the selection of a set of features as a search problem. Example: Recursive Feature Elimination

Embedded: Embedded methods use algorithms that have built-in feature selection methods. For instance, Lasso and RF have their own feature selection methods.

First, note that some algorithms already do feature selection: regularized models and random forests do that.

  • Variance threshold (unsupervised; you need to normalize first): this removes features with low variance, independently of their correlation with the target.
  • Correlation threshold, to avoid redundant features.
  • Genetic algorithms: complex but powerful; useful when you have a huge number of dimensions.
173
Q

In what part of the NN would you use leaky relu?

A

Leaky ReLU should only be used in hidden layers. In the final layer you want a classification or regression output, so there we use softmax or linear.

174
Q

Activation functions: What is the problem with tanh? How does it improve on sigmoid?

A

Improvement:

It is now centered around zero.

Problem: the derivatives are still small at the saturated ends, so gradients can still vanish.

175
Q

List dimensionality reduction algorithm?

A
  • PCA: simple basic algebra. They are all independent and ordered by the explained variance. (you need to normalize). PCA is a versatile technique that works well in practice. In addition, PCA offers several variations and extensions (i.e. kernel PCA, sparse PCA, etc.) to tackle specific roadblocks.

  • LDA (linear discriminant analysis, supervised): same idea as PCA, but now you want to maximize the separability between classes.

  • Autoencoders: neural networks trained to reconstruct their original inputs; since the input is also the target, they are unsupervised.
  • Manifold learning (Isomap, MDS, spectral embedding, t-SNE, which is used a lot, etc.).

176
Q

What are the 2 main differences between bagging boosting and stacking?

A

First stacking often considers heterogeneous weak learners (different learning algorithms are combined) whereas bagging and boosting consider mainly homogeneous weak learners.

Second, stacking learns to combine the base models using a meta-model whereas bagging and boosting combine weak learners following deterministic algorithms.

177
Q

How do you define the recall?

A

the recall is the True positive rate or sensitivity

TP/(TP+FN)

178
Q

How deep can you go when stacking RNN?

A

Not more than 3-4 layers. The temporal dimension makes it really computationally expensive.

179
Q

What is neural architecture search?

A
180
Q

What is an autoregressive model?

A

AR language modeling seeks to estimate the probability distribution of a text corpus with an autoregressive model [7, 27, 28]. Specifically, given a text sequence x = (x1, ..., xT), AR language modeling factorizes the likelihood into a forward product p(x) = prod_{t=1..T} p(x_t | x_{<t}). A parametric model (e.g. a neural network) is trained to model each conditional distribution. Since an AR language model is only trained to encode a uni-directional context (either forward or backward), it is not effective at modeling deep bidirectional contexts. On the contrary, downstream language understanding tasks often require bidirectional context information. This results in a gap between AR language modeling and effective pretraining.

181
Q

What is p-value?

A

In statistical hypothesis testing, the p-value (or probability value, or asymptotic significance) is the probability, for a given statistical model and when the null hypothesis is true, that the statistical summary (such as the sample mean difference between two compared groups) would be the same as or of greater magnitude than the actual observed result.[1]

182
Q

How can you use SVM on data that are not linearly separable?

A

kernel trick

183
Q

How should we compute the confidence interval of a predicted target in a linear regression model.

A
184
Q

What is importance sampling?

A
185
Q

Gaussian mixture model when to use it?

A

It is a clustering algorithm that, unlike k-means, is based on a statistical distribution, so it can also be used to generate new data. Of course, it assumes the data are drawn from a mixture of Gaussian distributions.

186
Q

How do you represent words for machine learning?

A

The basic idea is a bag of words: you have a dictionary of all the words and you represent each word as a vector with a 1 in the right position (one-hot encoding). The dictionary can be as large as 50K to 1M words.

187
Q

Can you use decision trees to do feature importance?

A

YES

The relative rank (i.e. depth) of a feature used as a decision node in a tree can be used to assess the relative importance of that feature with respect to the predictability of the target variable.

Features used at the top of the tree contribute to the final prediction decision of a larger fraction of the input samples.

In a forest, the expected fraction of the samples a feature contributes to can thus be used as an estimate of the relative importance of that feature.

188
Q

Describe different tokenization schemes used in NLP?

A

https://huggingface.co/docs/transformers/tokenizer_summary

Subword tokenization algorithms rely on the principle that frequently used words should not be split into smaller subwords, but rare words should be decomposed into meaningful subwords.

BPE (GPT-X) relies on a pre-tokenizer that splits the training data into words. Pretokenization can be as simple as space tokenization, e.g. GPT-2, Roberta. More advanced pre-tokenization include rule-based tokenization, e.g. XLM, FlauBERT which uses Moses for most languages, or GPT which uses Spacy and ftfy, to count the frequency of each word in the training corpus.

After pre-tokenization, a set of unique words has been created and the frequency with which each word occurred in the training data has been determined. Next, BPE creates a base vocabulary consisting of all symbols that occur in the set of unique words and learns merge rules to form a new symbol from two symbols of the base vocabulary. Start from single characters and grow from there. Byte-level BPE is a trick to avoid a base vocabulary of all Unicode characters by using the bytes that represent them instead.

WordPiece

Very similar to BPE, but it does not merge based on raw frequency; it merges based on likelihood.

WordPiece is slightly different to BPE in that it evaluates what it loses by merging two symbols to ensure it’s worth it.

SentencePiece (treats the input as a raw character stream, so it does not require pre-tokenization into words)

189
Q

What is a single node neural network equal too?

A

Logistic regression, if you use the sigmoid as the activation function. Note that logistic regression has a convex loss function, but a multi-layer neural network does not.

190
Q

What is Power Analysis?

A

The power of a binary hypothesis test is the probability that the test rejects the null hypothesis (H0) when a specific alternative hypothesis (H1) is true. The statistical power ranges from 0 to 1, and as statistical power increases, the probability of making a type II error (wrongly failing to reject the null hypothesis) decreases.

Think about p-values

191
Q

What is the advantage of non-parametric models?

A

You don’t have to worry about outliers or whether the data is linearly separable (e.g., decision trees easily take care of cases where you have class A at the low end of some feature x, class B in the mid-range of feature x, and A again at the high end).

192
Q

Gaussian mixture model what it is?

A

A Gaussian mixture model is a probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. One can think of mixture models as generalizing k-means clustering to incorporate information about the covariance structure of the data as well as the centers of the latent Gaussians.

It uses the expectation-maximization (EM) algorithm for fitting mixture-of-Gaussian models

The BIC criterion can be used to select the number of components in a Gaussian Mixture in an efficient way. In theory, it recovers the true number of components only in the asymptotic regime

The main difficulty in learning Gaussian mixture models from unlabeled data is that one usually doesn't know which points came from which latent component (if one has access to this information, it gets very easy to fit a separate Gaussian distribution to each set of points).

Expectation-maximization is a well-founded statistical algorithm to get around this problem by an iterative process. First one assumes random components (randomly centered on data points, learned from k-means, or even just normally distributed around the origin) and computes for each point a probability of being generated by each component of the model. Then, one tweaks the parameters to maximize the likelihood of the data given those assignments. Repeating this process is guaranteed to always converge to a local optimum.
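
A scikit-learn sketch tying these pieces together (two made-up Gaussian blobs; BIC-based model selection as mentioned above):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(200, 2)), rng.normal(5, 1, size=(200, 2))])  # two blobs

# Pick the number of components with BIC.
bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X) for k in range(1, 5)}
best_k = min(bics, key=bics.get)

gmm = GaussianMixture(n_components=best_k, random_state=0).fit(X)  # fitted via EM
labels = gmm.predict(X)             # hard assignments (predict_proba gives soft ones)
samples, _ = gmm.sample(50)         # generative: draw new synthetic points
```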

193
Q

What are the most common ML optimizers?

A

SGD

Stochastic gradient descent optimizer. You can use momentum, learning rate decay etc. Choosing a proper learning rate can be difficult.

Additionally, the same learning rate applies to all parameter updates, and plain SGD can get stuck in local minima or saddle points.

RMSprop

Adagrad

Adagrad is an optimizer with parameter-specific learning rates, which are adapted relative to how frequently a parameter gets updated during training. The more updates a parameter receives, the smaller the updates.

194
Q

Should we present the same mini batches in the same order to the optimization algorithm?

A

It is very often better to shuffle to avoid biases; the mini-batches should be reshuffled at every training epoch.

The only caveat is if you are trying to solve a simpler problem first and then trying to present harder cases to your optimizer.

195
Q

Naive Bayes pros

A
  • Computationally fast and simple to implement.
  • Works well with high dimensions.
  • Can make probabilistic predictions. It handles irrelevant features well (they just contribute the same factor P(irrelevant feature | class) to every class).
  • Also, it is a generative model, so it is easier to deal with missing values.
196
Q

What can be another way to look at imbalance classes problems?

A

Treat the problem as an anomaly detection problem.

197
Q

How would you search for the hyperparameters of a network?

A

A search consists of:

  • an estimator (regressor or classifier such as sklearn.svm.SVC());
  • a parameter space;
  • a method for searching or sampling candidates;
  • a cross-validation scheme; and
  • a score function.

You can search in different ways:

  • simple grid search
  • randomized search (Monte Carlo, etc.)
198
Q

Is logistic regression linear?

A

Almost: the final sigmoid function introduces a little non-linearity in the predicted probability, but the decision boundary itself is linear.

199
Q

What are the different types of RNN? many to ….

A
200
Q

Think about Hidden Markov Model. What are the training data? what are the hyperparameters?

A
201
Q

What are the disadvantages of bidirectional RNNs?

A

You need the complete input sequence before making a prediction; for example, you need the speaker to be done talking before processing the sentence.

202
Q

Time series forecast:

Given a time series predict how it continues for the next 7 month ?

A
  • Look at the time series first for pathological cases.
  • Naive: predict a constant equal to the last value, y_{t+1} = y_t, or the average of the whole dataset.
  • Moving average: choose a time window p to average over. You can also do this weighted.
  • Exponential smoothing: y_{t+1} = a*y_t + a(1-a)*y_{t-1} + a(1-a)^2*y_{t-2} + ...; older observations have exponentially decreasing influence.

All of the above do not work well with data with high variation.

Holt’s linear trend method: an extension of exponential smoothing that takes the trend into account. You fit a level and a trend, then combine them.

Holt-Winters method: this adds seasonality; it is one of the best of the classical methods (see the sketch below).

ARIMA (Autoregressive Integrated Moving Average): while exponential smoothing models are based on a description of trend and seasonality in the data, ARIMA models aim to describe the autocorrelations in the data.

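A sketch with statsmodels, which is assumed to be installed (the monthly series below is synthetic):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Synthetic monthly series with a trend and yearly seasonality.
idx = pd.date_range("2018-01-01", periods=48, freq="MS")
values = 100 + np.arange(48) + 10 * np.sin(np.arange(48) * 2 * np.pi / 12)
series = pd.Series(values, index=idx)

# Holt-Winters: level + trend + seasonality.
model = ExponentialSmoothing(series, trend="add", seasonal="add", seasonal_periods=12).fit()
forecast = model.forecast(7)   # the next 7 months
```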

203
Q

What are LSTM?

A

Long Short-Term Memory (LSTM) networks are an extension for recurrent neural networks.

The units of an LSTM are used as building units for the layers of a RNN, which is then often called an LSTM network.

LSTMs enable RNNs to remember their inputs over a long period of time. This is because LSTMs keep their information in a memory cell that is much like the memory of a computer: the LSTM can read, write and delete information from it.

204
Q

What are some classical time series algorithm?

A
  • Autoregression (AR): the autoregression method models the next step in the sequence as a linear function of the observations at prior time steps, so x_t = c + sum_i( phi_i * x_{t-i} ) + eps_t.
  • Moving average (MA) and autoregressive moving average (ARMA).

to be completed with https://machinelearningmastery.com/time-series-forecasting-methods-in-python-cheat-sheet/

205
Q

How do the 3 main ensemble methods impact the variance and bias of the final model? What is their aim?

A

Very roughly, we can say that bagging will mainly focus at getting an ensemble model with less variance than its components whereas boosting and stacking will mainly try to produce strong models less biased than their components (even if variance can also be reduced).

206
Q

What is Label Smoothing?

A

Label smoothing is a regularization technique that introduces noise in the labels. This accounts for the fact that datasets may contain mistakes, so maximizing the likelihood log p(y|x) directly can be harmful. Assume that, for a small constant epsilon, the training-set label y is correct with probability 1 - epsilon and incorrect otherwise. Label smoothing regularizes a model based on a softmax with k output values by replacing the hard 0 and 1 classification targets with targets of epsilon / (k - 1) and 1 - epsilon respectively.

https://towardsdatascience.com/what-is-label-smoothing-108debd7ef06
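
A tiny NumPy sketch of the target replacement described above (the epsilon value is a hypothetical choice):

```python
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    """Replace 1 with 1 - eps and 0 with eps / (k - 1) in one-hot targets."""
    k = one_hot.shape[-1]
    return one_hot * (1 - eps) + (1 - one_hot) * (eps / (k - 1))

y = np.eye(3)[[0, 2]]      # two one-hot labels over k = 3 classes
print(smooth_labels(y))    # each row becomes [0.9, 0.05, 0.05] in some order
```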

207
Q

What is a mixture of experts?

A
208
Q

What is the lottery ticket hypothesis?

A
209
Q

How should you use cross-validation while oversampling your data (for imbalanced class problems)?

A

Before, just like feature selection. Keep in mind that over-sampling takes observed rare samples and applies bootstrapping to generate new random data based on a distribution function. If cross-validation is applied after over-sampling, what we are basically doing is overfitting our model to a specific artificial bootstrapping result. That is why cross-validation should always be done before over-sampling the data, just as feature selection should be. Only by resampling the data repeatedly within each fold can randomness be introduced into the dataset to make sure there won't be an overfitting problem.

210
Q

Explain and think about a deep RNN architecture.

A
211
Q

What is gradient clipping, why it is used, what are its cons and the 2 types of gradient clipping that exist?

A

Gradient clipping was used to avoid exploding gradients. Now it is less used in favor of batch normalization and even more layer normalization.

Clipping by value: you simply restrict each gradient component to a range. The problem is that this changes the direction of the gradient, because you are not rescaling all the components together.

The other approach is gradient clipping by norm (rescaling the whole gradient vector when its norm exceeds a threshold); however, you then risk very small gradients and tiny updates.

212
Q

How does Latent Dirichlet Allocation (LDA) work?

A

It is a generative probabilistic model which describes each document as a mixture of topics and each topic as a distribution of words. LDA generalizes Probabilistic Latent Semantic Analysis (PLSA) [2] by adding a Dirichlet prior distribution over the document-topic and topic-word distributions.

LDA and PLSA discretize the continuous topic space into t topics and model documents as mixtures of those t topics. These models assume the number of topics t to be known. The discretization of topics is necessary to model the relationship between documents and words.

213
Q

What is a major weakness of Latent Dirichlet Allocation (LDA)?

A

This is one of the greatest weaknesses of these models: the number of topics t, or a way to estimate it, is rarely known, especially for very large or unfamiliar datasets.