ML Flashcards

Machine learning algorithms and tips

1
Q

In a neural network, what if all the weights are initialized with the same value?

A

In simplest terms, if all the weights are identical, every hidden unit in a layer receives exactly the same signal. Forward propagation still runs, but during backpropagation every unit also receives exactly the same gradient, so the weights stay identical after every update: the symmetry is never broken, and the layer effectively behaves as if it had a single unit.
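A minimal NumPy sketch (a hypothetical one-hidden-layer network, not from the card) illustrating the symmetry problem: with identical initial weights, every hidden unit computes the same activation and receives the same gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))          # 8 samples, 4 features
y = rng.normal(size=(8, 1))

W1 = np.full((4, 3), 0.5)            # all hidden weights identical
W2 = np.full((3, 1), 0.5)

h = np.tanh(X @ W1)                  # every hidden column is identical
y_hat = h @ W2

# Backprop for a squared-error loss
d_out = y_hat - y
dW2 = h.T @ d_out
d_h = (d_out @ W2.T) * (1 - h**2)
dW1 = X.T @ d_h

print(np.allclose(h[:, 0], h[:, 1]))       # True: identical activations
print(np.allclose(dW1[:, 0], dW1[:, 1]))   # True: identical gradients
```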

2
Q

What are the advantages of logistic regression?

A

Lots of ways to regularize your model, and you don't have to worry as much about your features being correlated, like you do in Naive Bayes.

Logistic regression models also have a probabilistic interpretation, unlike decision trees or SVMs, and you can update your model to take in new data (using an online gradient descent method), again unlike decision trees or SVMs.

Use it if you want a probabilistic framework (e.g., to easily adjust classification thresholds, to say when you’re unsure, or to get confidence intervals) or if you expect to receive more training data in the future that you want to be able to quickly incorporate into your model.

3
Q

How do you improve (reduce) bias?

A

By improving the model or the training algorithm:

  • Use a bigger neural net, or a random forest with boosting or bagging; try ensembling.
  • Train longer, or with a different minimization algorithm.
  • You can also play with the features and engineer new ones. If using a linear model, you can create polynomial features.
4
Q

What are the main ingredients that advanced methods like Adam, AdaGrad, and RMSProp add to SGD with mini-batches?

A

Essentially

1) Decaying the learning rate automatically
2) Updating different parameters with different learning rates
3) Adding momentum to avoid getting stuck at saddle points and in flat areas where the gradient is almost zero

5
Q

Nyquist theorem

A

The Nyquist Theorem states that in order to adequately reproduce a signal it should be periodically sampled at a rate that is 2X the highest frequency you wish to record.

Suppose the highest frequency component, in hertz, for a given analog signal is fmax. According to the Nyquist Theorem, the sampling rate must be at least 2fmax, or twice the highest analog frequency component.

6
Q

Your manager has asked you to run PCA. Would you remove correlated variables first? Why?

A

YES

Yes, because in the presence of correlated variables the variance explained by a particular principal component gets inflated.

For example: you have 3 variables in a data set, of which 2 are correlated. If you run PCA on this data set, the first principal component would exhibit roughly twice the variance it would exhibit with uncorrelated variables. Also, adding correlated variables lets PCA put more importance on those variables, which is misleading.
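A small sklearn sketch (synthetic data, assumed setup) showing how a correlated pair of variables inflates the share of variance explained by the first principal component:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
a = rng.normal(size=1000)
b = rng.normal(size=1000)
X_uncorr = np.column_stack([a, b, rng.normal(size=1000)])
X_corr = np.column_stack([a, a + 0.05 * rng.normal(size=1000), b])  # first two correlated

for name, X in [("uncorrelated", X_uncorr), ("correlated", X_corr)]:
    ratio = PCA(n_components=3).fit(X).explained_variance_ratio_
    print(name, ratio.round(2))   # first component's share is inflated in the correlated case
```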

7
Q

Why does regularization help with overfitting?

A

By keeping the coefficients/parameters of a parametric model, or the weights of the nodes in a network, small, it makes the model simpler (similar to removing those terms) and therefore less prone to overfitting. In a neural network, small weights also keep the activations in the near-linear regime of the activation function around zero.

8
Q

Machine learning recipe for bias/variance.

A

If you have high bias (the model trains badly), first reduce the bias to an acceptable amount. Then look at the validation/test performance: do you have high variance? Address it, then reiterate. In deep learning you can often reduce both, or in general improve one without affecting the other: in a neural network you can use more data (for variance) and a bigger architecture (for bias), and that is largely it.

9
Q

Pros and cons of affinity propagation

A

Strengths: The user doesn’t need to specify the number of clusters (but does need to specify ‘sample preference’ and ‘damping’ hyperparameters). Weaknesses: The main disadvantage of Affinity Propagation is that it’s quite slow and memory-heavy, making it difficult to scale to larger datasets. In addition, it also assumes the true underlying clusters are globular.

10
Q

Tips on practical use of trees?

A
  • Decision trees tend to overfit on data with a large number of features.
  • Getting the right ratio of samples to number of features is important, since a tree with few samples in high dimensional space is very likely to overfit.
  • Consider performing dimensionality reduction (PCA, ICA, or Feature selection) beforehand to give your tree a better chance of finding features that are discriminative.
  • Balance your dataset before training to prevent the tree from being biased toward the classes that are dominant.
11
Q

What is multi-task learning?

A

When you optimize more than one loss function, learning multiple tasks at once. It can be really powerful.

We can view multi-task learning as a form of inductive transfer. Inductive transfer can help improve a model by introducing an inductive bias, which causes a model to prefer some hypotheses over others. For instance, a common form of inductive bias is ℓ1 regularization, which leads to a preference for sparse solutions. In the case of MTL, the inductive bias is provided by the auxiliary tasks, which cause the model to prefer hypotheses that explain more than one task. As we will see shortly, this generally leads to solutions that generalize better.

12
Q

What is a CycleGAN?

A
13
Q

What are bidirectional neural networks?

A

Contrary to unidirectional ones, you do not just get information from the previous words in the sentence but also from the following ones.

So you can distinguish between:

He said: "Teddy Roosevelt was a president."

He said: "Teddy bears are on sale."

From the first three words alone you cannot tell whether "Teddy" is part of a name, as in the first sentence, or not, as in the second.

14
Q

Explain p-value?

A

When you conduct a hypothesis test in statistics, a p-value allows you to determine the strength of your results. It is a number between 0 and 1: the probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true. The smaller the p-value, the stronger the evidence against the null hypothesis.

15
Q

Explain AB Testing in great detail.

A
16
Q

Briefly, what is a random forest?

A

A random forest is a meta-estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.

In a random forest each split considers only a random subset of the features (usually sqrt(N) of them), while bagged trees use all of the available features. This is important to diversify the weak learner trees and make them less correlated.

17
Q

Advantages of SVM?

A

High accuracy, nice theoretical guarantees regarding overfitting, and with an appropriate kernel they can work well even if your data isn't linearly separable in the base feature space (different from logistic regression). They are especially popular in text classification problems, where very high-dimensional spaces are the norm.

Handle high dimensional data well

18
Q

Both being tree-based algorithms, how is random forest different from gradient boosting (GBM)?

A

The fundamental difference is that random forest uses the bagging technique to make predictions, while GBM uses boosting.

In bagging, the data set is divided into n samples using randomized sampling (with replacement). Then, using a single learning algorithm, a model is built on each sample. The resulting predictions are combined using voting or averaging. Bagging is done in parallel. In boosting, after the first round of predictions, the algorithm weighs misclassified examples higher, so that they can be corrected in the succeeding round. This sequential process of giving higher weights to misclassified examples continues until a stopping criterion is reached.

19
Q

What can cause problems in interpreting coefficients in something like linear regression?

A

Collinearity or correlation among variables: the loss can then be minimized equally well by setting both coefficients to zero or by giving them large values of opposite sign, so the individual coefficients lose their interpretation.

20
Q

What are the problems of using a standard (feedforward) neural network for sequence models?

A
  • Inputs/outputs can have variable length (with a recurrent net you just apply the same network to each input word, so the length can vary).
  • Most importantly: they do not share features learned across the different positions of the text.

Also, using one-hot encoded inputs in a fully connected network has a huge memory impact; just as conv-nets do for images, RNNs for text help keep memory usage down by sharing parameters.

21
Q

Assumptions of Logistic Regression

A

Logistic regression does not make many of the key assumptions of linear regression and general linear models that are based on ordinary least squares algorithms – particularly regarding linearity, normality, homoscedasticity, and measurement level.

First, logistic regression does not require a linear relationship between the dependent and independent variables. Second, the error terms (residuals) do not need to be normally distributed. Third, homoscedasticity is not required. Finally, the dependent variable in logistic regression is not measured on an interval or ratio scale.

First, binary logistic regression requires the dependent variable to be binary

Second, logistic regression requires the observations to be independent of each other.

Third, logistic regression requires there to be little or no multicollinearity among the independent variables.

Fourth, logistic regression assumes linearity of independent variables and log odds

22
Q

What are the issues with gradient descent computed on the entire batch instead of mini-batches (stochastic gradient descent)?

A

As we need to calculate the gradients for the whole dataset to perform just one update, batch gradient descent can be very slow and is intractable for datasets that don’t fit in memory. Batch gradient descent also doesn’t allow us to update our model online, i.e. with new examples on-the-fly.

23
Q

What are some drawbacks of batch norm?

A

Below are a few cons of Batch Normalization.

BN calculates the batch statistics (mini-batch mean and variance) in every training iteration, and therefore requires larger batch sizes during training so that it can effectively approximate the population mean and variance from the mini-batch. This makes BN harder to use when training networks for applications such as object detection or semantic segmentation, because they generally work with high input resolution (often as big as 1024x2048) and training with larger batch sizes is not computationally feasible.

BN does not work well with RNNs. The problem is that RNNs have a recurrent connection to previous timestamps and would require a separate β and γ for each timestep in the BN layer, which adds complexity and makes it harder to use BN with RNNs.

Different training and test calculation: during test (or inference) time, the BN layer doesn't calculate the mean and variance from the test mini-batch, but uses the fixed mean and variance calculated from the training data. This requires caution while using BN and introduces additional complexity. In PyTorch, model.eval() sets the model in evaluation mode, and the BN layer then uses the fixed mean and variance pre-calculated from the training data.

24
Q

Describe a way to detect anomalies in a given dataset.

A

Anomaly detection is a technique used to identify unusual patterns that do not conform to expected behavior, called outliers. These can be rare items, events or observations which raise suspicions by differing significantly from the majority of the data.

There are 2 types:

outlier detection

The training data contains outliers which are defined as observations that are far from the others. Outlier detection estimators thus try to fit the regions where the training data is the most concentrated, ignoring the deviant observations.

novelty detection

The training data is not polluted by outliers and we are interested in detecting whether a new observation is an outlier. In this context an outlier is also called a novelty.

The simplest approach is to use simple statistical techniques and flag the data points that deviate from common statistical properties of a distribution, including mean, median, mode, and quantiles. For example, marking an anomaly when a data point deviates by a certain standard deviation from the mean.

However, in high dimensions, the statistical approach could be difficult, therefore, machine learning techniques could be used. Following are the popular methods used to detect anomalies:

Isolation Forest

One Class SVM

PCA-based Anomaly detection

FAST-MCD

Local Outlier Factor

(Explaining one of the above-mentioned methods)

Isolation Forests build a random forest in which each decision tree is grown randomly. At each node, it picks a feature randomly, then it picks a random threshold value (between the min and max value) to split the dataset in two. The dataset gradually gets chopped into pieces this way, until all instances end up isolated from the other instances. Random partitioning produces noticeably shorter paths for anomalies. Hence, when a forest of random trees collectively produces shorter path lengths for particular samples, they are highly likely to be anomalies.
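A minimal sklearn sketch of the Isolation Forest approach on synthetic data (the contamination value here is just an assumption for the example):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
X_with_outliers = np.vstack([X, [[6, 6], [-7, 5]]])   # two obvious anomalies appended

iso = IsolationForest(contamination=0.01, random_state=0).fit(X_with_outliers)
labels = iso.predict(X_with_outliers)   # +1 = inlier, -1 = anomaly
print(np.where(labels == -1)[0])        # indices flagged as anomalies
```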

25
Q

Why is gradient checking important?

A

Gradient checking is a method to check the derivatives used in back-propagation. Implementations of the back-propagation algorithm are prone to bugs and errors. Therefore, before running the neural network on training data, it is worth checking that our implementation of back-propagation is correct. Gradient checking compares the back-propagation gradients, obtained analytically from the loss function, with numerically estimated gradients for each parameter. It therefore ensures that the implementation is correct and significantly increases our confidence in the correctness of our code.

By numerically checking the derivatives computed, gradient checking eliminates most of the problems that may occur as the back-propagation algorithm may have many subtle bugs. It could look like it’s working, and our cost function may end up decreasing on every iteration of gradient descent, but this may result in a neural network that has a higher level of error that could go unnoticed and give us worse performance.
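A small NumPy sketch of gradient checking on a toy linear model (a stand-in for a full back-propagation implementation): the analytic gradient is compared against a centered finite-difference estimate.

```python
import numpy as np

def loss(w, X, y):
    # Simple squared-error loss for a linear model (stand-in for any differentiable loss)
    return 0.5 * np.sum((X @ w - y) ** 2)

def analytic_grad(w, X, y):
    return X.T @ (X @ w - y)

def numeric_grad(w, X, y, eps=1e-6):
    g = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        g[i] = (loss(w_plus, X, y) - loss(w_minus, X, y)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
X, y, w = rng.normal(size=(20, 3)), rng.normal(size=20), rng.normal(size=3)
ga, gn = analytic_grad(w, X, y), numeric_grad(w, X, y)
rel_err = np.linalg.norm(ga - gn) / (np.linalg.norm(ga) + np.linalg.norm(gn))
print(rel_err)   # should be ~1e-9 or smaller if the analytic gradient is correct
```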

26
Q

What cross-validation technique would you use on a time series dataset?

A

Instead of using standard k-folds cross-validation, you have to pay attention to the fact that a time series is not randomly distributed data — it is inherently ordered by chronological order. If a pattern emerges in later time periods for example, your model may still pick up on it even if that effect doesn’t hold in earlier years!

You’ll want to do something like forward chaining where you’ll be able to model on past data then look at forward-facing data.

fold 1 : training [1], test [2]

fold 2 : training [1 2], test [3]

fold 3 : training [1 2 3], test [4]

fold 4 : training [1 2 3 4], test [5]

fold 5 : training [1 2 3 4 5], test [6]
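The same forward-chaining folds can be produced with scikit-learn's TimeSeriesSplit; a small sketch:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(6).reshape(-1, 1)   # six chronologically ordered samples
for fold, (train_idx, test_idx) in enumerate(TimeSeriesSplit(n_splits=5).split(X), 1):
    print(f"fold {fold}: training {train_idx + 1}, test {test_idx + 1}")
# Reproduces the expanding-window folds listed above.
```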

27
Q

How is KNN different from k-means clustering?

A

K-Nearest Neighbors is a supervised classification algorithm, while k-means clustering is an unsupervised clustering algorithm.

28
Q

How can you evaluate multilabel classification?

A

You turn this into multiple binary classification problems: for each label, is this example assigned that label, yes or no? This works because the label assignments should be independent of each other.

There you can use a binary cross-entropy loss (one sigmoid output per label).

https://machinelearningmastery.com/multi-label-classification-with-deep-learning/
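A minimal Keras sketch along the lines of the linked article (synthetic data; the layer sizes are arbitrary assumptions): one sigmoid output per label, trained with binary cross-entropy.

```python
import tensorflow as tf
from sklearn.datasets import make_multilabel_classification

# Synthetic multilabel data: 3 labels per example, each 0 or 1
X, y = make_multilabel_classification(n_samples=1000, n_features=10, n_classes=3, random_state=0)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(3, activation="sigmoid"),   # one independent probability per label
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["binary_accuracy"])
model.fit(X, y, epochs=5, verbose=0)
print(model.evaluate(X, y, verbose=0))
```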

29
Q

What is a Kalman filter and when is it used?

A

Kalman filters are a powerful tool used to estimate the hidden state of a system when we only have access to measurements of the system containing inaccuracies or errors. The estimate is based on the prior state and the current measurements. For example, a Kalman filter can be used to estimate the position of a car based on its GPS signal: the position of the car at time t is a combination of its prior estimates of position and speed at t−1 and the new measurement.

30
Q

Why do transformers use layer norm instead of batch norm?

A

A less known issue of batch norm is how hard it is to parallelize batch-normalized models. Since there is a dependence between elements of the batch, additional synchronization across devices is needed. While this is not an issue for most vision models, which tend to be trained on a small set of devices, transformers really suffer from this problem, as they rely on large-scale setups to counter their quadratic complexity. In this regard, layer norm provides some degree of normalization while incurring no batch-wise dependence.

31
Q

What is a lift analysis?

A

Lift analysis is used for classification tasks: you bin your data into deciles (predicted probability 0-0.1, 0.1-0.2, and so on) according to, say, the predicted probability that users will cancel a subscription. Then you check how many of them actually did cancel in each bin. If you have a good model, the high-probability bins will contain far more cancellations than average, and the opposite for the low bins. The ratio of the rate in a bin to the average rate is the lift.

32
Q

Can you explain what MapReduce is and how it works?

A

MapReduce is a data processing job that enables distributed computations to handle a huge amount of data.

It is used to split and process terabytes of data in parallel, achieving quicker results. This way it makes it easy to scale data processing over multiple computing nodes.

The processing happens using the map and reduce function.

Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (like counting the length of a string, or the number of occurrences of a word in a text).

Whereas reduce takes the output from a map as an input and combines those data tuples into a smaller set of tuples (for example, taking the maximum of a set of string lengths).

The most famous implementation is Apache Hadoop

See https://towardsdatascience.com/a-beginners-introduction-into-mapreduce-2c912bb5e6ac

33
Q

What is the importance/role of the pooling layer?

A

It is common to periodically insert a pooling layer in between successive conv layers in a ConvNet architecture. Its function is to progressively reduce the spatial size of the representation, to reduce the number of parameters and the computation in the network, and hence also to control overfitting. In addition to max pooling, the pooling units can also perform other functions, such as average pooling or even L2-norm pooling.

You can often discard pooling and use stride and padding instead to reduce the dimension.

34
Q

How is XGBoost handling bias-variance tradeoff?

A

XGBoost is a Gradient boosting of decision trees.

Boosting is a greedy algorithm and can overfit a training dataset quickly.

The general idea is that each individual tree will overfit some parts of the data but underfit other parts. In boosting you don't use the individual trees, but rather "average" them all together, so for a particular data point (or group of points) the trees that overfit those points are averaged with the trees that underfit them, and the combined average should neither overfit nor underfit, but be about right.

In particular in XGBoost

There are in general two ways that you can control overfitting in XGBoost:

The first way is to directly control model complexity.

This includes max_depth, min_child_weight and gamma.

The second way is to add randomness to make training robust to noise.

This includes subsample and colsample_bytree.

You can also reduce stepsize eta. Remember to increase num_round when you do so.

Below are some constraints that can be imposed on the construction of decision trees:

Number of trees, generally adding more trees to the model can be very slow to overfit. The advice is to keep adding trees until no further improvement is observed.

Tree depth, deeper trees are more complex trees and shorter trees are preferred. Generally, better results are seen with 4-8 levels.

Number of nodes or number of leaves, like depth, this can constrain the size of the tree, but is not constrained to a symmetrical structure if other constraints are used.

Number of observations per split imposes a minimum constraint on the amount of training data at a training node before a split can be considered

35
Q

Formulate Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) techniques.

A

Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text (the distributional hypothesis). A matrix containing word counts per document (rows represent unique words and columns represent each document) is constructed from a large piece of text and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of rows while preserving the similarity structure among columns. Documents are then compared by taking the cosine of the angle between the two vectors (or the dot product between the normalizations of the two vectors) formed by any two columns. Values close to 1 represent very similar documents while values close to 0 represent very dissimilar documents.

In natural language processing, the latent Dirichlet allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word’s presence is attributable to one of the document’s topics. LDA is an example of a topic model and belongs to the machine learning toolbox and in wider sense to the artificial intelligence toolbox.

36
Q

K-means and Gaussian mixture model: what is the difference between K-means and a mixture of Gaussians?

A

A mixture of Gaussians is a generative, statistics-based model that gives soft, probabilistic cluster assignments, while

K-means is very effective and fast but not based on statistics: it gives hard assignments.

Both are affected by the initialization.

37
Q

Why would you do dimensionality reduction?

A

(1) Reduce the storage space needed.
(2) Speed up computation (for example in machine learning algorithms); fewer dimensions mean less computing, and fewer dimensions can also allow the use of algorithms unfit for a large number of dimensions.
(3) Remove redundant features, for example there is no point in storing a terrain's size in both square meters and square miles (maybe the data gathering was flawed).
(4) Reducing the data's dimension to 2D or 3D may allow us to plot and visualize it, perhaps observe patterns, and gain insights.
(5) Too many features or too complex a model can lead to overfitting.

38
Q

What is the difference between type I vs type II error?

A
39
Q

What is the difference between sigmoid/logistic and softmax?

A

Softmax function vs. sigmoid function:

1. Softmax: used for multi-class classification in a logistic regression model. Sigmoid: used for binary classification in a logistic regression model.

2. Softmax: the probabilities sum to 1. Sigmoid: the probabilities need not sum to 1.

3. Softmax: typically used in the output layer of a neural network. Sigmoid: used as an activation function throughout a neural network.

4. Softmax: the highest value gets a higher probability than all the other values. Sigmoid: a high value gets a high probability, but not necessarily the highest.

40
Q

What are the main parameters for an ensemble method like a random forest?

A

The main parameters to adjust when using these methods are n_estimators and max_features.

For n_estimators: the larger the better, but also the longer it will take to compute. In addition, note that results will stop getting significantly better beyond a critical number of trees.

For max_features: the lower it is, the greater the reduction of variance, but also the greater the increase in bias. Empirically good default values are max_features=n_features for regression problems and max_features=sqrt(n_features) for classification tasks.
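A short sklearn sketch (synthetic data) of how these two parameters are typically explored:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=30, random_state=0)
for n_estimators in (10, 100, 500):
    rf = RandomForestClassifier(n_estimators=n_estimators, max_features="sqrt", random_state=0)
    # Accuracy usually improves with more trees, then plateaus
    print(n_estimators, cross_val_score(rf, X, y, cv=5).mean().round(3))
```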

41
Q

If you have 4GB of RAM in your machine and you want to train your model on a 10GB dataset, how would you go about this problem? Have you ever faced this kind of problem in your machine learning/data science experience so far?

A

First of all you have to ask which ML model you want to train.

For neural networks: batching with a memory-mapped NumPy array will work.

Steps:

Load the data as a memory-mapped NumPy array (np.memmap): it creates a mapping of the complete dataset on disk and doesn't load it all into memory.

Pass indices into the array to get the required data.

Feed this data to the neural network.

Use a small batch size.

For a (linear) SVM: incremental learning with partial_fit will work.

Steps:

Divide the one big dataset into small datasets.

Use the partial_fit method (e.g., sklearn's SGDClassifier with hinge loss), which only needs a subset of the complete dataset at a time.

Repeat step 2 for the other subsets.
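A sketch of the incremental-learning route, assuming the features and labels already live on disk in the hypothetical files features.dat and labels.dat (np.memmap plus sklearn's SGDClassifier, whose hinge loss gives a linear SVM):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Hypothetical on-disk arrays: ~10GB of float32 features plus int8 labels
n_samples, n_features = 25_000_000, 100
X = np.memmap("features.dat", dtype="float32", mode="r", shape=(n_samples, n_features))
y = np.memmap("labels.dat", dtype="int8", mode="r", shape=(n_samples,))

clf = SGDClassifier(loss="hinge")       # linear SVM trained incrementally
classes = np.array([0, 1])
batch = 10_000
for start in range(0, n_samples, batch):
    stop = start + batch
    # Only this slice is pulled into RAM at a time
    clf.partial_fit(X[start:stop], y[start:stop], classes=classes)
```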

42
Q

What is the O(n^2) bottleneck in a Transformer and how can we do better?

A

https://chengh.medium.com/evolution-of-fast-and-efficient-transformers-ec0378257994

Transformers scale badly with the length of the sequence (and better with the size of the embedding): they have an n^2 cost from multiplying the query and key matrices in the attention mechanism, where n is the sequence length.

To solve this you can use:

Segment level recurrence:

Transformer-XL

Sparse attention

Approximation

Inference Acceleration

43
Q

How can you fix the computational problem with word2vec?

A

Using hierarchical softmax or negative sampling (both avoid computing the full softmax over the entire vocabulary).

44
Q

Cons of SVM?

A

Memory-intensive, hard to interpret, and kind of annoying to run and tune

45
Q

What methods do you know for outlier detection?

A
46
Q

Gaussian discriminant analysis (GDA)

A

Similar to a mixture of Gaussians, in my opinion.

GDA is a method for data classification commonly used when the data can be approximated with a normal distribution. As a first step, you need a training set, i.e. a bunch of data already classified. These data are used to train your classifier and to obtain a discriminant function that tells you to which class a data point has the highest probability of belonging.

When you have your training set, for each class you compute the mean μ and the variance σ². These two parameters, as you know, describe a normal distribution.

Once you have computed the normal distribution for each class, to classify a data point you compute, for each class, the probability that the point belongs to it. The class with the highest probability is chosen.

47
Q

What are the different kernels functions in SVM ?

A

There are four types of kernels in SVM.

Linear Kernel

Polynomial kernel

Radial basis kernel

Sigmoid kernel

48
Q

Tell us more about bagging and boosting!

A
49
Q

Use an unfair coin for fair tosses

A

Toss the coin twice.

If the result is HT, assign X = 0. If the result is TH, assign X = 1.

If the result is either HH or TT, then discard the two coin tosses and go to step 1.

The probability of getting HH or TT in two tosses is

P(HH) + P(TT) = p^2 + q^2.   (1)

Therefore, the probability of eventually getting HT (and thus setting X = 0) is

P(HT) + (P(HH) + P(TT)) P(HT) + (P(HH) + P(TT))^2 P(HT) + … = pq / (1 − p^2 − q^2) = 1/2.   (2)

Similarly, the probability of X = 1 is 0.5, and hence we get a fair result from a biased coin.
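A quick Python simulation of this trick (the bias p = 0.7 is an arbitrary choice):

```python
import random

def biased_flip(p=0.7):
    # Biased coin: heads ('H') with probability p
    return 'H' if random.random() < p else 'T'

def fair_flip(p=0.7):
    # Von Neumann trick: keep tossing pairs until the two results differ
    while True:
        a, b = biased_flip(p), biased_flip(p)
        if a != b:
            return 0 if (a, b) == ('H', 'T') else 1

flips = [fair_flip() for _ in range(100_000)]
print(sum(flips) / len(flips))   # close to 0.5 despite the biased coin
```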

50
Q

What should you be careful to do if you do cross-validation for a classification problem?

A

Stratify your cross-validation samples!!
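For example, with scikit-learn's StratifiedKFold each fold keeps the class proportions (toy data):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 16 + [1] * 4)          # imbalanced labels

for train_idx, test_idx in StratifiedKFold(n_splits=4, shuffle=True, random_state=0).split(X, y):
    # Each test fold keeps the 4:1 class ratio instead of possibly missing class 1 entirely
    print(np.bincount(y[test_idx]))
```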

51
Q

Pros and cons of DBSCAN

A

Strengths: DBSCAN does not assume globular clusters, and its performance is scalable. In addition, it doesn’t require every point to be assigned to a cluster, reducing the noise of the clusters (this may be a weakness, depending on your use case). Weaknesses: The user must tune the hyperparameters ‘epsilon’ and ‘min_samples,’ which define the density of clusters. DBSCAN is quite sensitive to these hyperparameters.

52
Q

What is the difference between precision and recall

A

Precision: true positives / number of predicted positives = TP / (TP + FP). Of all the cases we predicted as positive, how many really are positive? Recall: of all the actual positives in the test set, how many did we predict as positive? TP / (TP + FN).

53
Q

What is the adjusted Rand index?

A

In statistics, and in particular in data clustering, it is a measure of the similarity between two data clusterings.

It can be used to assess a clustering approach when true labels are available.

The Rand index has a value between 0 and 1, with 0 indicating that the two data clusterings do not agree on any pair of points and 1 indicating that the data clusterings are exactly the same. The adjusted Rand index additionally corrects for chance, so that random labelings score close to 0 (it can even be negative), while 1 still means identical clusterings.

54
Q

SVM vs logistic regression?

A

They are closely linked. For example, you can get something SVM-like by asking logistic regression to maximize the margin of its decision (rather than to get the probabilities as close as possible to the labels). LR is better when you want probabilities rather than just decisions; SVMs technically do not give you probabilities at all. SVMs can be more scalable and, with kernels, can produce more complex separations.

55
Q

What is a drawback of masked language modelling as used in autoencoder language models like BERT?

A

The artificial symbols like [MASK] used by BERT during pretraining are absent from real data at finetuning time, resulting in a pretrain-finetune discrepancy. Moreover, since the predicted tokens are masked in the input, BERT is not able to model the joint probability using the product rule as in AR language modeling. In other words, BERT assumes the predicted tokens are independent of each other given the unmasked tokens, which is oversimplified, as high-order, long-range dependency is prevalent in natural language [9].

56
Q

If the labels are known in the clustering project, how to evaluate the performance of the model?

A

This basically becomes a classification problem, so you can evaluate it with standard classification metrics.

57
Q

What is a particular feature of a learner that makes it very suitable for bagging?

A

You want an unstable learner like a tree that gives very different answers given slightly different input

58
Q

What is XLNet and how does it differ from BERT?

A

XLNet is a BERT-like model rather than an entirely different one, but a promising one. In one phrase, XLNet is a generalized autoregressive pretraining method.

BERT predicts all the masked words simultaneously; XLNet does it sequentially, and not necessarily left to right.

BERT is a bidirectional autoencoder

XLNet is "generalized" autoregressive (AR): it uses permutation language modelling (https://arxiv.org/pdf/1906.08237.pdf).

Firstly, instead of using a fixed forward or backward factorization order as in conventional AR models, XLNet maximizes the expected log likelihood of a sequence w.r.t. all possible permutations of the factorization order. Thanks to the permutation operation, the context for each position can consist of tokens from both left and right. In expectation, each position learns to utilize contextual information from all positions, i.e., capturing bidirectional context.

Also, there is no data corruption (no [MASK] tokens in pretraining).

59
Q

What is heteroscedasticity and what are its effects on linear regression?

A

A linear regression model presents heteroscedasticity when the variance of the errors is not constant across the observations. This breaches one of the basic hypotheses on which the linear regression model is based.

Recall that one of the basic assumptions of linear regression is that the errors have constant variance. Under heteroscedasticity the data are heterogeneous, since they come from probability distributions with different variances.

There are two major consequences of heteroscedasticity. One is that the standard errors of the regression coefficients are estimated wrongly and the t-tests (and F test) are invalid.

The other is that OLS is an inefficient estimation technique.

60
Q

What are some methods for calibrating deep classifiers?

A

https://scikit-learn.org/stable/modules/calibration.html

Calibration curves (also known as reliability diagrams) compare how well the probabilistic predictions of a binary classifier are calibrated. It plots the true frequency of the positive label against its predicted probability, for binned predictions. The x axis represents the average predicted probability in each bin.

Calibrating a classifier consists of fitting a regressor (called a calibrator) that maps the output of the classifier (as given by decision_function or predict_proba) to a calibrated probability in [0, 1]. Denoting the output of the classifier for a given sample by fi, the calibrator tries to predict p(yi=1|fi).

The samples that are used to fit the calibrator should not be the same samples used to fit the classifier, as this would introduce bias. This is because performance of the classifier on its training data would be better than for novel data. Using the classifier output of training data to fit the calibrator would thus result in a biased calibrator that maps to probabilities closer to 0 and 1 than it should.
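A small sklearn sketch (synthetic data) of post-hoc calibration with CalibratedClassifierCV; Gaussian Naive Bayes is used here only because it tends to be poorly calibrated out of the box:

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=5000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

raw = GaussianNB().fit(X_train, y_train)
calibrated = CalibratedClassifierCV(GaussianNB(), method="isotonic", cv=5).fit(X_train, y_train)

for name, model in [("raw", raw), ("calibrated", calibrated)]:
    frac_pos, mean_pred = calibration_curve(y_test, model.predict_proba(X_test)[:, 1], n_bins=10)
    # Mean gap between predicted probability and observed frequency (smaller = better calibrated)
    print(name, round(float(abs(frac_pos - mean_pred).mean()), 3))
```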

61
Q

Explain AdaBoost

A

AdaBoost fits a sequence of weak learners (i.e., models that are only slightly better than random guessing) on repeatedly modified versions of the data.

The predictions are combined through a weighted majority vote (or sum) to produce the final prediction.

The data modifications at each so-called boosting iteration consist of applying weights to each of the training samples. Initially, those weights are all set to 1/n, so that the first step simply trains a weak learner on the original data.

For each successive iteration, the sample weights are individually modified and the learning algorithm is reapplied to the reweighted data.

At a given step, those training examples that were incorrectly predicted by the boosted model induced at the previous step have their weights increased, whereas the weights are decreased for those that were predicted correctly.

As iterations proceed, examples that are difficult to predict receive ever-increasing influence. Each subsequent weak learner is thereby forced to concentrate on the examples that are missed by the previous ones in the sequence

62
Q

What are the main steps to create a stacked model?

A

So, assume that we want to fit a stacking ensemble composed of L weak learners. Then we have to follow the steps thereafter:

split the training data in two folds

choose L weak learners and fit them to data of the first fold

for each of the L weak learners, make predictions for observations in the second fold

fit the meta-model on the second fold, using predictions made by the weak learners as inputs
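In scikit-learn these steps are wrapped up by StackingClassifier (which uses out-of-fold predictions via cv rather than a single hold-out fold); a minimal sketch with assumed base learners:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)), ("svc", SVC(probability=True))],
    final_estimator=LogisticRegression(),   # the meta-model
    cv=5,                                   # out-of-fold predictions feed the meta-model
)
print(stack.fit(X_train, y_train).score(X_test, y_test))
```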

63
Q

How does a neural network with one layer and one input and output compare to a logistic regression?

A

Neural networks and logistic regression are both used for classification problems. Logistic regression can be defined as the simplest form of Neural Network that results in straightforward decision boundaries whereas neural networks are a superset that includes additional complex decision boundaries to cater to more complex and large data. Logistic regression models cannot capture complex non-linear relationships w.r.t features. Meanwhile, a neural network with non-linear activation functions enables one to capture highly complex features.

64
Q

What is an autoencoder language model?

A

It is based on masked language modelling.

In comparison, AE-based pretraining does not perform explicit density estimation but instead aims to reconstruct the original data from a corrupted input. A notable example is BERT [10], which has been the state-of-the-art pretraining approach. Given the input token sequence, a certain portion of the tokens are replaced by a special symbol [MASK], and the model is trained to recover the original tokens from the corrupted version. Since density estimation is not part of the objective, BERT is allowed to utilize bidirectional context for the reconstruction.

65
Q

Which ML algorithms handle lots of irrelevant features well (separate signal from noise)?

A

Do: Naive Bayes, random forests (if the noise is not crazy), AdaBoost, and neural networks. Do not: KNN, linear/logistic regression (unless you regularize with Lasso), decision trees.

66
Q

What is the universal approximation theorem? Do we really need DEEP neural network?

A

According to the universal approximation theorem, given enough capacity, a feedforward network with a single hidden layer is sufficient to represent (approximate) any continuous function. However, the layer might have to be massive, and the network is prone to overfitting the data. Therefore, there is a common trend in the research community to make network architectures go deeper.

67
Q

Could you explain how to define the number of clusters in a clustering algorithm?

A

The primary objective of clustering is to group together similar entities in such a way that, while entities within a group are similar to each other, the groups remain different from one another.

Generally, the Within-cluster Sum of Squares (WSS) is used for measuring the homogeneity within a cluster. To define the number of clusters, WSS is plotted for a range of candidate numbers of clusters. The resulting graph is known as the elbow curve.

The elbow curve contains a point after which the WSS stops decreasing appreciably. This bending point (the elbow) gives K in K-means.

Although the above is the most widely used approach, another important approach is hierarchical clustering: dendrograms are created first, and distinct groups are then identified from them.
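A minimal sklearn sketch of the elbow approach on synthetic blobs (4 true clusters assumed):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
for k in range(1, 9):
    wss = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_   # within-cluster sum of squares
    print(k, round(wss, 1))   # the drop flattens after k = 4 (the elbow)
```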

68
Q

What is reinforcement learning ?

add more

A

Reinforcement learning

Reinforcement learning is learning what to do and how to map situations to actions. The end result is to maximize a numerical reward signal. The learner is not told which action to take, but instead must discover which actions yield the maximum reward. Reinforcement learning is inspired by how human beings learn; it is based on a reward/penalty mechanism.

Applications:

RL is quite widely used in building AI for playing computer games, e.g. AlphaGo Zero.

In robotics and industrial automation, RL is used to enable a robot to create an efficient adaptive control system for itself, which learns from its own experience and behavior.

69
Q

What are some of the differences between GRU and LSTM?

A
  • A GRU has two gates, an LSTM has three gates.
  • GRUs don't possess an internal memory (cell state) that is separate from the exposed hidden state. They don't have the output gate that is present in LSTMs.
  • The input and forget gates are coupled into an update gate, and the reset gate is applied directly to the previous hidden state. Thus, that gating responsibility of the LSTM is split up between the reset and update gates.
  • GRUs don't apply a second nonlinearity when computing the output.

More info https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21

70
Q

What is the key part of a recurrent neural network?

A

It is a neural network that has a memory that influences future predictions. That's because each letter it predicts should also affect the likelihood of the next letter it will predict. For example, if we have said "HEL" so far, it's very likely we will say "LO" next to finish the word "Hello".

A recurrent neural network (RNN) is a class of artificial neural network where connections between units form a directed graph along a sequence. This allows it to exhibit dynamic temporal behavior for a time sequence. Unlike feedforward neural networks, RNNs can use their internal state (memory) to process sequences of inputs.

71
Q

What is the problem of having highly correlated features in log regression?

A

Basically, you cannot distinguish between, for example, setting both coefficients to zero and setting one coefficient very large and the other close to its negative: many combinations give the same fit, so the individual coefficients become unstable and hard to interpret.

72
Q

What is the basic idea behind logistic regression?

A

It is a linear model: a linear combination of the features is passed through a sigmoid function, turning a linear regression into a logistic one that outputs a probability. The sigmoid makes the model non-linear in its parameters, so you should not use a least-squares loss function; use the logarithmic (cross-entropy) loss instead.

73
Q

Why is naive bayes “naive”?

A

Because it assumes the features are (conditionally) independent given the class.

74
Q

What is an autoregressive model?

A
75
Q

Pros and Cons of char based vs word based embeddings

A
76
Q

How is batch normalization done on testing?

A

You use the (running) statistics computed during training; if train and test do not come from the same distribution, you have bigger problems.

77
Q

What are the main loss functions for regression tasks?

A
  1. Mean Square Error, Quadratic Loss, L2 Loss
  2. Mean Absolute Error, L1 Loss

Using the squared error is easier to optimize, but the absolute error is more robust to outliers.

A big problem with the MAE loss (for neural nets especially) is that its gradient has the same magnitude throughout, which means the gradient stays large even for small loss values. This isn't good for learning.

If we only had to give one prediction for all the observations and we tried to minimize MSE, that prediction should be the mean of all target values; if we tried to minimize MAE, it would be the median of all observations.

  3. Huber Loss, Smooth Mean Absolute Error, Log-Cosh Loss

These try to get the best of both: they behave almost like the absolute error |y − ŷ| for large errors, but are smooth and almost quadratic, (y − ŷ)^2, near zero.

Huber loss has a hyperparameter (delta) to tune.

More advanced: quantile loss.
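A small NumPy sketch of the three losses (the delta value for Huber is an arbitrary choice):

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def huber(y, y_hat, delta=1.0):
    err = y - y_hat
    quadratic = 0.5 * err ** 2                      # used where |err| <= delta
    linear = delta * (np.abs(err) - 0.5 * delta)    # used where |err| > delta
    return np.mean(np.where(np.abs(err) <= delta, quadratic, linear))

y = np.array([1.0, 2.0, 3.0, 100.0])     # last value is an outlier
y_hat = np.array([1.1, 1.9, 3.2, 3.0])
print(mse(y, y_hat), mae(y, y_hat), huber(y, y_hat))   # MSE blows up; MAE and Huber stay moderate
```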

78
Q

What are the parameters of a CNN 2D layer?

A

filters: Integer, the dimensionality of the output space (i.e. the number of output filters in the convolution).

kernel_size: An integer or tuple/list of 2 integers, specifying the height and width of the 2D convolution window (a single integer means the same value for both dimensions).

strides: An integer or tuple/list of 2 integers, specifying the strides of the convolution along the height and width. Specifying any stride value != 1 is incompatible with specifying any dilation_rate value != 1.

padding: One of "valid" or "same" (case-insensitive). "valid" means no padding; "same" pads the input so that the output has the same spatial size as the input. (1D temporal convolutions additionally offer "causal", dilated convolutions where output[t] does not depend on input[t+1:], useful when modeling temporal data where the model should not violate the temporal order; see WaveNet: A Generative Model for Raw Audio, section 2.1.)

dilation_rate: An integer or tuple/list of 2 integers, specifying the dilation rate for dilated convolution.

activation: Activation function to use (see activations). If you don’t specify anything, no activation is applied (ie. “linear” activation: a(x) = x).

use_bias: Boolean, whether the layer uses a bias vector.

kernel_initializer: Initializer for the kernel weights matrix (see initializers).

bias_initializer: Initializer for the bias vector (see initializers).

kernel_regularizer: Regularizer function applied to the kernel weights matrix (see regularizer).

bias_regularizer: Regularizer function applied to the bias vector (see regularizer).

activity_regularizer: Regularizer function applied to the output of the layer (its “activation”). (see regularizer).

kernel_constraint: Constraint function applied to the kernel matrix (see constraints).

bias_constraint: Constraint function applied to the bias vector (see constraints).
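The descriptions above follow the Keras convolution layers; a minimal tf.keras sketch (arbitrary filter counts and input shape, chosen just for illustration) wiring several of these parameters together:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(
        filters=32,                # number of output feature maps
        kernel_size=(3, 3),        # height and width of the convolution window
        strides=(1, 1),
        padding="same",            # keep the spatial size of the input
        dilation_rate=(1, 1),
        activation="relu",
        use_bias=True,
        kernel_initializer="he_normal",
        kernel_regularizer=tf.keras.regularizers.l2(1e-4),
        input_shape=(64, 64, 3),
    ),
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
])
model.summary()
```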

79
Q

When using the Gaussian mixture model, how do you know it’s applicable?

A

A Gaussian mixture model (GMM) is a probabilistic model that assumes that the instances were generated from a mixture of several Gaussian distributions whose parameters are unknown. In this approach we describe each cluster by its centroid (mean), covariance, and the size of the cluster (weight). Therefore, based on this definition, a GMM will be applicable when we know that the data points are mixtures of a gaussian distribution and form clusters with different mean and standard deviation.

80
Q

How is a decision tree pruned?

A

Pruning is what happens in decision trees when branches that have weak predictive power are removed in order to reduce the complexity of the model and increase the predictive accuracy of a decision tree model

Reduced-error pruning is perhaps the simplest version: starting from the leaves, replace each node with its most popular class; if this doesn't decrease predictive accuracy (on a validation set), keep the node pruned.

81
Q

What are the assumptions on the errors in linear regression?

A

It requires the errors (residuals) to be normally distributed (you can use a Box-Cox transformation of the target variable to help achieve that).

One assumption is therefore that the errors (residuals) follow a normal distribution. However, a less widely known fact is that, as sample sizes increase, the normality assumption for the residuals is not needed.

and importantly

Homoscedasticity (That errors have constant variance)

Homoscedasticity describes a situation in which the error term (that is, the “noise” or random disturbance in the relationship between the features and the target) is the same across all values of the independent variables.

There are two major consequences of heteroscedasticity. One is that the standard errors of the regression coefficients are estimated wrongly and the t-tests (and F test) are invalid.

82
Q

Are decision trees parametric?

A

No, they are non-parametric: just a cascade of if-then-else decisions, with no fixed number of parameters.

83
Q

What is the AdaGrad optimization algorithm?

A

AdaGrad (for adaptive gradient algorithm) is a modified stochastic gradient descent algorithm with per-parameter learning rate,

Informally, this increases the learning rate for sparser parameters and decreases the learning rate for ones that are less sparse. This strategy often improves convergence performance over standard stochastic gradient descent in settings where data is sparse and sparse parameters are more informative. Examples of such applications include natural language processing and image recognition.[21] It still has a base learning rate η, but this is multiplied with the elements of a vector (G_{j,j}), which is the diagonal of the outer product matrix

G = \sum_{\tau=1}^{t} g_\tau g_\tau^T

where g_\tau = \nabla Q_i(w) is the gradient at iteration \tau. The diagonal is given by

G_{j,j} = \sum_{\tau=1}^{t} g_{\tau,j}^2.

This vector is updated after every iteration. The formula for an update is now

w := w - \eta \, \mathrm{diag}(G)^{-1/2} \circ g

or, written as per-parameter updates,

w_j := w_j - \frac{\eta}{\sqrt{G_{j,j}}} g_j.

Each G_{i,i} gives rise to a scaling factor for the learning rate that applies to a single parameter w_i. Since the denominator in this factor, \sqrt{G_{i,i}} = \sqrt{\sum_{\tau=1}^{t} g_{\tau,i}^2}, is the ℓ2 norm of the previous derivatives, extreme parameter updates get dampened, while parameters that get few or small updates receive higher learning rates.[19]

While designed for convex problems, AdaGrad has been successfully applied to non-convex optimization.[23]
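A tiny NumPy sketch of the per-parameter update above on a toy quadratic (the learning rate is an arbitrary choice):

```python
import numpy as np

def adagrad_update(w, grad, G, lr=1.0, eps=1e-8):
    # G accumulates the sum of squared gradients, one entry per parameter
    G += grad ** 2
    w -= lr * grad / (np.sqrt(G) + eps)
    return w, G

# Toy problem: minimize f(w) = 0.5 * ||w||^2, whose gradient is simply w
w = np.array([5.0, -3.0])
G = np.zeros_like(w)
for _ in range(200):
    grad = w.copy()
    w, G = adagrad_update(w, grad, G)
print(w)   # both parameters have shrunk toward the minimum at [0, 0]
```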

84
Q

What are the main differences between bagging and boosting?

A

Bagging:

  • parallel ensemble: each model is built independently
  • aim to decrease variance, not bias
  • suitable for high variance low bias models (complex models)
  • an example of a bagging method is random forest, which develop fully grown trees (note that RF modifies the grown procedure to reduce the correlation between trees)

Boosting:

  • sequential ensemble: try to add new models that do well where previous models lack
  • aim to decrease bias, not variance
  • suitable for low variance high bias models
  • an example of a tree based method is gradient boosting
85
Q

How should you deal with unbalanced classes?

A
  • Collect more data.
  • Use an appropriate evaluation metric.
  • Resample to rebalance your classes: either undersample the dominant class or oversample (repeat) the minority one (oversampling makes it easy to overfit). You can also do synthetic minority oversampling, creating new artificial minority samples to avoid overfitting.
  • Use an algorithm that deals well with imbalance, or use class weights. XGBoost, for example, can resample its bags so that they have balanced classes.
86
Q

What is a Box Cox Transformation?

A

The dependent variable of a regression analysis might not satisfy one or more assumptions of an ordinary least squares regression: the residuals could curve as the prediction increases, or follow a skewed distribution. In such scenarios it is necessary to transform the response variable so that the data meet the required assumptions. A Box-Cox transformation is a statistical technique to transform a non-normal dependent variable into a normal shape. Since most statistical techniques assume normality, applying a Box-Cox transformation means that you can run a broader number of tests. The transformation is named after statisticians George Box and Sir David Roxbee Cox, who collaborated on a 1964 paper and developed the technique.

87
Q

What are some clustering algorithms?

A
  • K-means is the simplest: assign each point to the nearest of the K cluster centroids, then recompute the centroids and repeat.
  • Mixture of Gaussians
  • Hierarchical clustering, either divisive (splitting) or agglomerative (grouping)
  • Affinity propagation
  • DBSCAN, based on the density of points
88
Q

Which ML algorithms return calibrated probabilities?

A

KNN and logistic regression do. Definitely not Naive Bayes. Possibly random forests and neural networks.

89
Q

What are 2 typical issues when training a RNN?

A

Exploding Gradients

This problem can be easily solved if you truncate or squash the gradients, also known as gradient clipping (it is also easier to diagnose: you see NaNs appearing).

Vanishing Gradients (they also happen in deep CNNs; RNNs are effectively very deep in time)

Related to not being able to learn from distant words: their gradients go to zero and drive the overall gradient to zero.

This was a major problem in the 1990s and much harder to solve than exploding gradients. Fortunately, it was largely solved through the concept of the LSTM.

90
Q

Disadvantages of decision trees?

A

They can create over-complex trees, which leads to overfitting.

Pruning, a minimum number of samples required at a leaf node, or a maximum depth of the tree are necessary; or use an ensemble.

  • Decision trees can be unstable because small variations in the data might result in a completely different tree being generated. Again, use an ensemble.

The problem of learning an optimal decision tree is known to be NP-complete, so practical (greedy) algorithms cannot guarantee to return the globally optimal decision tree.

There are concepts that are hard to learn because decision trees do not express them easily, such as XOR, parity or multiplexer problems. Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the dataset prior to fitting with the decision tree.

91
Q

What is Ensemble Learning?

A

An ensemble is a method of combining a diverse set of learners to improve the stability and predictive power of the model. Two types of ensemble learning methods are:

Bagging

Bagging trains similar learners on random subsamples of the population and combines (averages) their predictions, which mainly reduces variance.

Boosting

Boosting is an iterative method which adjusts the weight of each observation depending on the last classification. Boosting decreases the bias error and helps you build strong predictive models.

92
Q

How do you generate a synthetic sample for unbalanced classes?

A

A simple approach is to randomly sample the attributes from instances in the minority class.

You could use a method like Naive Bayes that can sample each attribute independently when run in reverse. You will have more and different data, but the non-linear relationships between the attributes may not be preserved.

There are also systematic algorithms that you can use to generate synthetic samples, like SMOTE, the Synthetic Minority Over-sampling Technique, which interpolates between a minority-class sample and its nearest minority-class neighbors.
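A minimal sketch using the imbalanced-learn package (assumed installed) on synthetic data:

```python
# Requires the imbalanced-learn package (pip install imbalanced-learn)
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
print("before:", Counter(y))

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))   # minority class oversampled with synthetic points
```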

93
Q

What are some extensions/variants of dropout?

A

https://towardsdatascience.com/12-main-dropout-methods-mathematical-and-visual-explanation-58cdc2112293

94
Q

What is the formula of the error that explains the bias-variance problem and trade-off?

A

E[(y - f_hat(x))^2] = Bias(f_hat(x))^2 + Var(f_hat(x)) + sigma^2

where Bias(f_hat(x)) = E[f_hat(x)] - f(x), Var(f_hat(x)) = E[(f_hat(x) - E[f_hat(x)])^2], and sigma^2 is the irreducible noise.

95
Q

What distribution gives you the probability of getting h heads after n tosses?

A

The binomial distribution:

P(h) = C(n, h) * r^h * (1 - r)^(n - h)

where C(n, h) = n! / (h! (n - h)!) is the binomial coefficient and r is the probability of heads.

96
Q

Why should we use Batch Normalization?

A
97
Q

What algorithms automatically learn feature interactions?

A

Do: neural networks, random forests, decision trees, AdaBoost. Do not: linear/logistic regression, KNN, Naive Bayes.

98
Q

Preprocess data steps:

A
  • Normalize your input. Some ML algorithms actually need this, and neural networks also train faster if you do it.
  • In a neural network, initialize your weights properly so that the gradients do not explode: usually mean zero and std = sqrt(2 / (input units)) (see the sketch below).
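
A tiny NumPy sketch of this kind of initialization (often called He initialization; the layer sizes are made up):

```python
import numpy as np

def he_init(n_in, n_out, rng):
    """Weights with mean 0 and std sqrt(2 / n_in), as described above."""
    return rng.normal(loc=0.0, scale=np.sqrt(2.0 / n_in), size=(n_in, n_out))

rng = np.random.default_rng(0)
W1 = he_init(784, 128, rng)   # e.g. the first layer of a hypothetical MNIST-sized net
print(W1.std())               # roughly sqrt(2 / 784) ~= 0.05
```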
99
Q

Drawback of weight normalization?

A

Weight Normalization speeds up the training similar to batch normalization and unlike BN, it is applicable to RNNs as well. But the training of deep networks with Weight Normalization is significantly less stable compared to Batch Normalization and hence it is not widely used in practice.

https://towardsdatascience.com/different-normalization-layers-in-deep-learning-1a7214ff71d6

100
Q

In NN what kind of regularization technique you have?

A
  • Dropout or L2 regularization on the weights W of your neurons.
  • More data, or data augmentation if you cannot collect more.
  • Early stopping, even if it is a little risky: you stop before ||w|| gets too big, but it also makes the bias and variance problems no longer orthogonal.
101
Q

What model does Word 2 vec uses?

A

Skip-gram and CBOW (continuous bag of words).

102
Q

Tell us more about batch normalization?

A

Batch normalization (also known as batch norm) is a method used to make artificial neural networks faster and more stable through normalization of the layers’ inputs by re-centering and re-scaling.

We compute these statistics per mini-batch, since we train with SGD.

We normalize each layer's inputs by adjusting and scaling the activations. Just as normalizing the inputs to the first layer speeds up learning, doing it for every layer does the same.

We can use higher learning rates because batch normalization makes sure that there’s no activation that’s gone really high or really low

It reduces overfitting because it has a slight regularization effect: similar to dropout, it adds some noise to each hidden layer's activations.
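
A minimal PyTorch sketch (layer sizes are arbitrary) showing where batch norm typically sits, between a linear layer and its non-linearity:

```python
import torch
from torch import nn

# Hypothetical fully connected classifier with batch norm after each linear layer.
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.BatchNorm1d(64),   # re-centers and re-scales activations over the mini-batch
    nn.ReLU(),
    nn.Linear(64, 64),
    nn.BatchNorm1d(64),
    nn.ReLU(),
    nn.Linear(64, 2),
)

x = torch.randn(32, 20)   # a mini-batch of 32 examples with 20 features
logits = model(x)
```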

103
Q

What’s the “kernel trick” and how is it useful?

A

The kernel trick uses kernel functions that enable learning in higher-dimensional spaces without explicitly calculating the coordinates of points in that space: instead, kernel functions compute the inner products between the images of all pairs of data points in the feature space.

104
Q

What are some applications of sequence models in ML?

A

Speech recognition

Sound–>word

Music generation

integer–>music

Sentiment classification

Words –> stars of review etc

DNA sequencing, translation, video activity recognition, named entity recognition (finding names in text), etc.

105
Q

What are some common ways of initializing the weights of a neural network?

A

If you initialize them all to zero, every hidden unit gets the same updates and the symmetry is never broken, so do not do that. Weights are usually initialized at random, being careful that they are neither too large nor too small.

Some newer initialization schemes:

They are basically random values scaled by the square root of the size of the layer (e.g. Xavier or He initialization).

106
Q

What is layer normalization?

A

Layer Normalization(LN)

Inspired by the results of Batch Normalization, Geoffrey Hinton et al. proposed Layer Normalization, which normalizes the activations along the feature direction instead of the mini-batch direction. This overcomes the cons of BN by removing the dependency on batches and makes it easier to apply to RNNs as well.

In essence, Layer Normalization normalizes each feature of the activations to zero mean and unit variance.

107
Q

What is one-hot encoding?

A

Encode categorical integer features using a one-hot aka one-of-K scheme.

The input to this transformer should be a matrix of integers, denoting the values taken on by categorical (discrete) features. The output will be a sparse matrix where each column corresponds to one possible value of one feature. It is assumed that input features take on values in the range [0, n_values).

108
Q

What is the difference between label encoding and one hot encoding?

A

One-hot encoding increases the dimensionality of a data set; label encoding does not.

Say we have a variable ‘color’ with 3 levels: Red, Blue and Green. One-hot encoding the ‘color’ variable will generate three new variables, Color.Red, Color.Blue and Color.Green, containing 0/1 values.

In label encoding, the levels of a categorical variable get encoded as integers (0, 1, ...), so no new variable is created. Label encoding is mainly used for binary or ordinal variables.
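
A small sketch of the two encodings with pandas and scikit-learn (the toy 'color' column mirrors the example above):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"color": ["Red", "Blue", "Green", "Blue"]})

# One-hot: three new 0/1 columns, one per level.
one_hot = pd.get_dummies(df["color"], prefix="Color")

# Label encoding: a single integer column (best reserved for binary/ordinal variables).
labels = LabelEncoder().fit_transform(df["color"])

print(one_hot)
print(labels)
```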

109
Q

What algorithm perform well on small observation sizes?

A

Do: logistic and linear regression, Naive Bayes. Do not: KNN, decision trees, neural networks, AdaBoost, etc.

110
Q

How do you combine precision and recall in a single metric?

A

Basically, you want neither of them to be extreme. The F1 score is 2*P*R / (P + R): a (harmonic) mean that is dragged down when either value is very small.

111
Q

Why is it not a great idea to use least squares for classification?

A

Least squares has a lot of problems with outliers, and the loss does not flatten out for extreme predictions. You do not want to penalize a value that is "too high" on the correct side, since with a threshold those predictions are going to be set to one anyway.

112
Q

How do you deal with multicollinearity? What are signs of its presence and how do you solve it?

A

Sign of presence:

A regression coefficient is not significant even though, theoretically, that variable should be highly correlated with Y.

When you add or delete an X variable, the regression coefficients change dramatically.

You see a negative regression coefficient when your response should increase along with X, and vice versa.

Your X variables have high pairwise correlations.

Dealing with:

Remove highly correlated predictors from the model (use a correlation matrix to find them).

Use Partial Least Squares Regression (PLS) or Principal Components Analysis, regression methods that cut the number of predictors to a smaller set of uncorrelated components.

113
Q

What are the typical steps to do speech recognition?

A

You start with the sound (assuming the assistant, e.g. Alexa, is already triggered): the amplitude as a function of time. Sampling at 16 kHz (16,000 samples per second) is enough to cover the frequency range of human speech.

You group the sampled audio into 20-millisecond-long chunks.

You Fourier-transform each chunk and pass the resulting spectrogram slices into a recurrent neural network.

This will give you the probable letter or spaces for each of those intervals.

HHHHEE____LL__LLLL0000__

Remove repeating characters and spaces

Lower-probability candidates such as AULLO or HULLO also appear, so you then compare the candidates against natural-language text databases to pick the most plausible transcription.

114
Q

Suppose you toss a fair coin 400 times. What is the

probability that you get at least 220 heads? Round your answer to the nearest percent.

A

The trick is to view each toss as a random variable that returns 1 if a head is tossed and 0 if a tail is tossed. Then each such random variable has expected value 1/2 and variance 1/4. So your Z-variable (for using the central limit theorem) will be:

(220-200)/(sqrt(400*(1/4))) = 20/10 = 2

So we’ve reduced the question to asking what’s the probability that Z takes a value bigger than 2. Recall that on the standard normal, the probability that z takes values between -2 and 2 is about 95%, so the probability that it takes values less than 2 is about 97.5% (it’s actually more like 97.7% but just estimating). So the probability that we are bigger than 2 is a little less than 2.5%, which after rounding to the nearest percent gives us 2%
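
A quick SciPy sketch to check the approximation against the exact binomial tail (numbers as in the question):

```python
from scipy import stats

# Exact tail: P(X >= 220) for X ~ Binomial(n=400, p=0.5).
p_exact = stats.binom.sf(219, n=400, p=0.5)   # sf(k) = P(X > k) = P(X >= k + 1)

# Normal approximation used above: Z = (220 - 200) / sqrt(400 * 1/4) = 2.
p_normal = stats.norm.sf(2.0)

print(f"exact: {p_exact:.4f}, normal approximation: {p_normal:.4f}")
```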

115
Q

What is a naive bayes method?

A

Simply, you assume all the features are independent given the class, so P(x|y) = P(x1|y) * P(x2|y) * ..., and therefore P(y|x) is proportional to P(y) * prod_i P(xi|y). You then pick the class that maximizes this. Different Naive Bayes classifiers assume different distributions for P(xi|y).

116
Q

What algorithms have a lot of problems with outliers?

A

All the boosting algorithms, for example: they may put extra weight on outliers and try to fit them at all costs instead of discarding them.

117
Q

Describe popular classes of methods for zero-shot/few-shot/meta learning.

A
118
Q

How should we recommend items to (new) users?

basics of recommendation systems

A

There are 2 big families of recommender systems: collaborative and content-based methods.

Collaborative methods are based solely on the past interactions recorded between users and items.

These interactions are stored in “user-item interactions matrix”.

Collaborative filtering algorithms are divided into two sub-categories, memory-based and model-based approaches:

memory –> nearest neighbours search

model –> generative model.

The more interactions, the better they get, but they cannot be used at the very beginning (the cold-start problem: you can suggest random or popular items at the beginning).

Three classical collaborative filtering approaches exist: two memory-based methods (user-user and item-item) and one model-based approach (matrix factorisation).

Content-based approaches use additional information about users and/or items: features of the users (sex, age) are used to predict items (with their own features, such as actors or genre for a movie).

The recommendation problem is then cast either as a classification problem (predict whether a user “likes” an item or not) or as a regression problem (predict the rating given by a user to an item).

119
Q

What is ELMo?

A

It is a deep contextualized word representation that models both complex characteristics of word use (e.g., syntax and semantics), and how these uses vary across linguistic contexts.

Note this model is character-based (via a CNN). It was one of the first to introduce context, compared to GloVe or Word2Vec.

These word vectors are learned functions of internal states of a deep biLM(bidirectional language model), which is pre trained on large text corpus.

The task was language modelling

see https://jalammar.github.io/illustrated-bert/

120
Q

what are some useful metrics for classification problems apart from accuracy? why can accuracy be a problem?

A

Accuracy is a problem for unbalanced classes. Other metrics are:

  • Confusion Matrix: A breakdown of predictions into a table showing correct predictions (the diagonal) and the types of incorrect predictions made (what classes incorrect predictions were assigned).
  • Precision: A measure of a classifier's exactness.
  • Recall: A measure of a classifier's completeness.
  • F1 Score (or F-score): A weighted average of precision and recall.
  • Kappa (or Cohen’s kappa): Classification accuracy normalized by the imbalance of the classes in the data.
  • ROC Curves: Like precision and recall, accuracy is divided into sensitivity and specificity and models can be chosen based on the balance thresholds of these values.
121
Q

What ML algorithm need feature rescaling?

A

KNN, linear and logistic regression (if regularized), and neural networks need it. Decision trees and random forests are fine without it (and are also robust against outliers).

122
Q

How many topic modeling techniques do you know of? Explain them briefly.

A

Latent Semantic Analysis (LSA):

Latent Semantic Analysis tries to use the context around the words to find hidden concepts. It does that by generating a document-term matrix, where each cell has a TF-IDF score which assigns a weight to every term in the document. Using a technique known as Singular Value Decomposition (SVD), the dimensions of the matrix are reduced to the number of desired topics. The resultant matrices, after decomposition, give us vectors for every document and term in our data that can then be used to find similar words and similar documents using cosine similarity.
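
A compact scikit-learn sketch of LSA on a made-up toy corpus (TF-IDF followed by truncated SVD):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "dogs chase cats", "stock markets fell today"]  # toy corpus

doc_term = TfidfVectorizer().fit_transform(docs)   # document-term matrix with TF-IDF weights

lsa = TruncatedSVD(n_components=2)                 # number of desired "topics"
doc_topic = lsa.fit_transform(doc_term)            # dense topic vectors, one row per document
```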

Probabilistic Latent Semantic Analysis(PLSA):

Probabilistic Latent Semantic Analysis is a technique used to model information under a probabilistic framework instead of SVD. It creates a model P(D,W) such that for any document d and word w, P(d,w) corresponds to that entry in the document-term matrix.

Latent Dirichlet Allocation (LDA):

Latent Dirichlet Allocation is a technique that automatically discovers the topics that documents contain. LDA represents documents as mixtures of topics that spit out words with certain probabilities. It assumes that each document is a mixture of various topics and every topic is a mixture of various words. Assuming this, LDA then tries to backtrack from the documents to find a set of topics that are likely to have generated the collection. It maps all the documents to the topics in such a way that the words in each document are mostly captured by those imaginary topics.

123
Q

Is it necessary to use activation functions in neural networks?

A

Activation functions are essential to learn and model complex data and its relationships. These functions add non-linearity to the network. If there is no activation function, then the input signal is mapped to the output using a linear function, which is just a polynomial of degree one. Why is that a problem? Linear functions are not able to capture complex functional mappings of the data. However, this is possible with non-linear functions, which can be composed to model essentially any real-world data.

124
Q

Deep Learning’s Most Important Ideas - A Brief Historical Review

A

https://dennybritz.com/blog/deep-learning-most-important-ideas/

126
Q

Hidden Markov Model. Briefly what they are and some of their applications.

A

Hidden Markov Model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process with unobserved (i.e. hidden) states.

Hidden Markov models are especially known for their application in reinforcement learning and temporal pattern recognition such as speech, handwriting, gesture recognition, part-of-speech tagging, musical score following, partial discharges and bioinformatics.

The idea is to compute the probability of a sequence of observations, which are related to hidden states you cannot observe; the model is specified by transition and emission probabilities.

127
Q

If X and Y are independent random variables, can you write down the formula for Var(XY)?

A
128
Q

How would you evaluate your ML parameters?

A

cross validation

129
Q

Reduce variance or overfitting?

A
  • Get more data if possible.
  • Use regularization, dropout, and similar techniques.
  • Reduce the number of features with some kind of feature selection.
  • Change algorithm; some are less prone to overfitting.

Label smoothing is another regularization technique: it adds noise to the labels. It is used for multiclass classification with a softmax output in particular.

130
Q

list activation function and describe them?

A

softmax:

The softmax function is a generalization of the logistic activation function, used for multiclass classification: softmax(z)_j = e^(z_j) / sum_i e^(z_i).

For [1.2, 0.9, 0.75], applying the softmax function gives [0.42, 0.31, 0.27], which we can use as the probabilities of belonging to each class.

relu:

The most used right now: zero from -inf to 0 (if not leaky) and linear after that. Note it is unbounded for large positive inputs.

The main advantage of using the ReLU function over other activation functions is that it does not activate all the neurons at the same time

tanh

The tanh is bounded between -1 and 1. It is like the logistic sigmoid but better: negative inputs are mapped strongly negative and zero inputs are mapped near zero. The tanh function is mainly used for classification between two classes. It solves the problem of the outputs all having the same sign; all other properties are the same as those of the sigmoid.

sigmoid:

Or logistic: simply 1 / (1 + e^(-z)). It lies between 0 and 1. It is differentiable, but it can get you stuck during training since the gradient is close to zero at the saturated ends.

linear:

this is simply the identity
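
A NumPy sketch of the activations listed above (the example input reproduces the softmax numbers in the answer):

```python
import numpy as np

def softmax(z):
    z = z - z.max()                 # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# tanh is available directly as np.tanh; "linear" is just the identity.
print(softmax(np.array([1.2, 0.9, 0.75])))   # ~[0.42, 0.31, 0.27], as above
```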

131
Q

What are different normalization layers?

A

https://towardsdatascience.com/different-normalization-layers-in-deep-learning-1a7214ff71d6

Batch Normalization:

you need batches large enough to compute the statistics.

It does not work well with RNNs: because of the recurrent connections, you would need different normalization statistics for each time step.

Weight Normalization

Layer Normalization

Group Normalization

Weight Standardization

132
Q

What is an example of a data set with a non-Gaussian distribution?

A

Some important distributions are:

Bernoulli/Poisson/gamma/beta,

many of them are categorical data, and those with only two categories are Bernoulli. Some are multivariate numerical and presumably approximately normal in each coordinate but different coordinates are not independent.

For large n, many distributions are approximately normal, such as the gamma distribution and the beta distribution.

I’ll give you a few variables you might consider measuring:

Waiting time. (Usually). For instance if you are queuing to check in.

Number of faults (in some unit of measurement). For instance the number of typos in 1 sheet of paper (A4).

Same type of distribution (probably): The number of accidents at a specific crossroads between the hours of 06.00 am to 09.00 am.

The number of times you throw a 6, when you throw a limited number of times.

The number of children having a particular (rare) disease, given a distribution of genes coding for this within the parents.

133
Q

Activation functions: What is the improvement of Relu over tanh? What are some of the problem ?

Which activation function improve on that?

A

Derivatives are now big (==1) so the gradient does not vanish.

However, the gradient is zero for negative inputs (which can result in dead neurons).

This is why leaky relu is introduced

134
Q

Simple Nyquist theorem?

A

By the Nyquist theorem, we know that we can use math to perfectly reconstruct the original sound wave from the spaced-out samples, as long as we sample at least twice as fast as the highest frequency we want to record.

135
Q

The assumptions of linear regression.

A

Linear regression is an analysis that assesses whether one or more predictor variables explain the dependent (criterion) variable. The regression has five key assumptions:

Linear relationship: between the outcome variable and the independent variables.

Multivariate normality: Multiple regression assumes that the residuals are normally distributed.

No or little multicollinearity: the independent variables are not highly correlated with each other.

No auto-correlation: no autocorrelation in the data. Autocorrelation occurs when the residuals are not independent from each other. For instance, this typically occurs in stock prices, where the price is not independent from the previous price.

Homoscedasticity: this assumption states that the variance of the error terms is similar across the values of the independent variables.

136
Q

EM algorithm

A

What makes finding the parameters of a gaussian mixture hard is the fact that we do not know which gaussian a point belongs to. This is a latent variable.

So it is hard to compute the maximum likelihood estimation directly.

So, with mu_k and sigma_k (and mixing weights pi_k) as the parameters of the mixture components, we can iterate:

Initialization: Get an initial estimate for the parameters theta_0 (e.g. all the mu_k, sigma_k^2 and pi_k variables). In many cases, this can just be a random initialization.

Expectation step: Assuming the parameters theta_{t-1} from the previous step are fixed, compute the expected values of the latent variables (or, more often, a function of the expected values of the latent variables).

Maximization step: Given the values you computed in the last step (essentially known values for the latent variables), estimate new values for theta_t that maximize a variant of the likelihood function.

Exit condition: If the likelihood of the observations has not changed much, exit; otherwise, go back to the expectation step.

137
Q

Pros and cons of hierarchical

A

Strengths: The main advantage of hierarchical clustering is that the clusters are not assumed to be globular. In addition, it scales well to larger datasets. Weaknesses: Much like K-Means, the user must choose the number of clusters (i.e. the level of the hierarchy to “keep” after the algorithm completes).

138
Q

How is the binary loss defined?

A

also known as log-loss. It is defined as

-log P(yt|yp) = -(yt log(yp) + (1 - yt) log(1 - yp))

This is the loss function used in (multinomial) logistic regression and extensions of it such as neural networks, defined as the negative log-likelihood of the true labels given a probabilistic classifier’s predictions.

For the multiclass case, just generalize:

-log(P) = -sum_i( yt_i * log(yp_i) )
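
A small NumPy sketch of the binary version (clipping the probabilities is a common numerical safeguard, not part of the definition):

```python
import numpy as np

def binary_log_loss(y_true, y_prob, eps=1e-15):
    """-(y log(p) + (1 - y) log(1 - p)), averaged over samples."""
    p = np.clip(y_prob, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

print(binary_log_loss(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.6])))
```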

139
Q

What is the probability meaning of area under the curve?

A

When using normalized units, the area under the curve (often referred to as simply the AUC) is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one (assuming ‘positive’ ranks higher than ‘negative’).

140
Q

What is one draw back of stacking (in terms of data?)

A

Data that have been used to train the weak learners should not be reused to train the meta-model. Thus, an obvious drawback of splitting our dataset in two parts is that we only have half of the data to train the base models and half to train the meta-model. To overcome this limitation we can, however, follow some kind of “k-fold cross-training” approach (similar to what is done in k-fold cross-validation).

141
Q

When is using accuracy (or MSE) as an evaluation metric a bad idea?

A

When you have very unbalanced classes. A 1% error sounds good, but if only 0.5% of the data points have cancer and the rest do not, you can beat that error rate by simply always predicting “no cancer”.

142
Q

What is pruning in Decision Tree ?

A

When we remove sub-nodes of a decision node, this process is called pruning; it is the opposite of splitting.

143
Q

SVM, how does kernels change the data ?

A
144
Q

What is stratified k-fold or sampling?

A

Stratification keeps the subsample of the population similar to the total population.

You are basically asking the model to take the training and test set such that the class proportion is same as of the whole dataset, which is the right thing to do. If your classes are balanced then a shuffle (no stratification needed here) can basically guarantee a fair test and train split.
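
A scikit-learn sketch (the imbalanced toy dataset is made up):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    # each test fold preserves (approximately) the overall 90/10 class proportion
    print(round(y[test_idx].mean(), 2))
```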

145
Q

Why is feature selection important?

A

Benefits of performing feature selection before modeling your data:

· Reduces Overfitting: Less redundant data means less opportunity to make decisions based on noise.

· Improves Accuracy: Less misleading data means modeling accuracy improves.

· Reduces Training Time: less data reduces algorithm complexity, so algorithms train faster.

146
Q

Describe stochastic gradient descent

A
147
Q

What are the 3 main ensemble methods? Do they use heterogeneous or homogeneous learners?

A

bagging, that often considers homogeneous weak learners, learns them independently from each other in parallel and combines them following some kind of deterministic averaging process

boosting, which often considers homogeneous weak learners, learns them sequentially in a very adaptive way (each base model depends on the previous ones) and combines them following a deterministic strategy

stacking, that often considers heterogeneous weak learners, learns them in parallel and combines them by training a meta-model to output a prediction based on the different weak models predictions

148
Q

Naive Bayes cons

A

not very high accuracy

Relies on independence assumption.

Requires removing correlated features, because they are effectively voted twice in the model, which can over-inflate their importance.

If a categorical variable has a category in test data set which was not observed in training data set, then the model will assign a zero probability. It will not be able to make a prediction. This is often known as “Zero Frequency”

149
Q

Pros and cons of K-means

A

Strengths: K-Means is hands-down the most popular clustering algorithm because it’s fast, simple, and surprisingly flexible if you pre-process your data and engineer useful features.

Weaknesses: The user must specify the number of clusters, which won’t always be easy to do. In addition, if the true underlying clusters in your data are not globular, then K-Means will produce poor clusters.

150
Q

What are some problems with training word2vec?

A

Well, it does not have context

If you see it as a multiclass classification problem:

the softmax over the whole vocabulary is computationally expensive.

Also, you have to sample the context words to predict carefully (frequent words are usually down-weighted).

The solution is negative sampling: turn it into a binary classifier (is this the right word or not?) trained with the true context word and k random negative words.

151
Q

What is and what is the best method for for image matching?

A

You will probably use a convnet to extract features for the image and then measure some kind of similarity.

The feature-based approach relies on the extraction of image features, i.e. shapes, textures, colors, to match in the target image or frame. This approach is currently achieved by using neural networks and deep learning classifiers such as VGG, AlexNet, ResNet. Deep convolutional neural networks process the image by passing it through different hidden layers, and at each layer produce a vector with classification information about the image. These vectors are extracted from the network and are used as the features of the image. Feature extraction using deep neural networks is extremely effective and thus is the standard in state-of-the-art template matching algorithms.

152
Q

What does it mean to have a well-calibrated neural network?

A
153
Q

What are important ways of preprocessing data?

A
  • Standardization, or mean removal and variance scaling
  • Scaling features to a range

If you have outliers you should be careful to use some robust algorithm.

Also centering sparse data would destroy the sparseness structure in the data, and thus rarely is a sensible thing to do. However, it can make sense to scale sparse inputs, especially if features are on different scales.

  • Generate polynomial features
  • Encode categorical data
  • Drop NaNs or impute missing values (mean, median, etc.)
154
Q

What is ULMFiT ?

A

ULMFiT stands for Universal Language Model Fine-tuning; it is an architecture and transfer learning method that can be applied to NLP tasks.

This model is word based

ULMFit is unidirectional not bidirectional

Transfer learning is really the big addition of this model.

Discriminative Fine-Tuning

“As different layers capture different types of information, they should be fine-tuned to different extents.”¹ Thus, for each layer, a different learning rate is used.

155
Q

What is selection bias?

What are the different types?

A

Selection bias is the bias introduced by the selection of individuals, groups or data for analysis in such a way that proper randomization is not achieved, thereby ensuring that the sample obtained is not representative of the population intended to be analyzed

There are different types

Sampling bias is systematic error due to a non-random sample of a population,[2] causing some members of the population to be less likely to be included than others, resulting in a biased sample,

Time interval

Early termination of a trial at a time when its results support the desired conclusion

Data

Partitioning (dividing) data with knowledge of the contents of the partitions, and then analyzing them with tests designed for blindly chosen partitions.

Observer selection

Philosopher Nick Bostrom has argued that data are filtered not only by study design and measurement, but by the necessary precondition that there has to be someone doing a study.

156
Q

Describe a class imbalance problem of classification and how you would approach it

A
157
Q

Character level generation sequence model architecture?

A
158
Q

What is Entropy and Information gain in Decision tree algorithm ?

A

The core algorithm for building decision trees is called ID3. ID3 uses Entropy and Information Gain to construct a decision tree.

Entropy

A decision tree is built top-down from a root node and involves partitioning the data into homogeneous subsets. ID3 uses entropy to check the homogeneity of a sample: if the sample is completely homogeneous the entropy is zero, and if the sample is equally divided it has entropy of one.

Information Gain

The Information Gain is based on the decrease in entropy after a dataset is split on an attribute. Constructing a decision tree is all about finding the attributes that return the highest information gain.

159
Q

What are SVM bird eye view?

A

An SVM finds the decision boundary that provides the maximum separating margin for a linearly separable dataset (or in a transformed feature space if you use the kernel trick). That is, of all possible decision boundaries that could be chosen to separate the dataset for classification, it chooses the one that is most distant from the points of both classes nearest to that boundary.

160
Q

What is an AUC in a ROC curve? why is that so useful?

A

It lets you visualize how well a classifier is doing across all the possible probability thresholds. It can be used whenever your classifier returns a probability (or score), even if the probabilities are not well calibrated. Once you have maximized the AUC, you can then choose the threshold according to your business needs.

161
Q

Cons of neural network

A

not suitable as general-purpose algorithms because they require a very large amount of data. In fact, they are usually outperformed by tree ensembles for classical machine learning problems.

162
Q

Activation functions: What is the problem with sigmoid?

A

The derivatives are too small: they can kill the gradient during backpropagation.

They are not zero-centered, so the gradient updates for a layer's weights tend to all share the same sign, which causes inefficient zig-zag dynamics around zero.

For these reasons it is rarely used in hidden layers these days.

163
Q

Which on is better GRU or LSTM?

A

there isn’t a clear winner. In many tasks both architectures yield comparable performance and tuning hyperparameters like layer size is probably more important than picking the ideal architecture. GRUs have fewer parameters (U and W are smaller) and thus may train a bit faster or need less data to generalize. On the other hand, if you have enough data, the greater expressive power of LSTMs may lead to better results.

164
Q

What are the advantages of decision tree?

A

Simple to understand and to interpret. Trees can be visualised.

Requires little data preparation.

Other techniques often require data normalization, dummy variables need to be created and blank values to be removed.

Able to handle both numerical and categorical data.

Other techniques are usually specialised in analysing datasets that have only one type of variable. Able to handle multi-output problems.

The cost of using the tree (i.e., predicting data) is logarithmic in the number of data points used to train the tree.

165
Q

What are the main loss functions for classifications?

A

1) Log loss:

-sum_i( yt_i * log(yp_i) )

2) Hinge loss (with labels yt in {-1, 1} and raw score yp):

max(0, 1 - yt*yp)

3) Exponential loss:

e^(-beta * yt * yp)

It penalizes incorrect predictions more than Hinge loss and has a larger gradient.

Logarithmic loss leads to better probability estimation at the cost of accuracy

Hinge loss leads to better accuracy and some sparsity at the cost of much less sensitivity regarding probabilities

166
Q

Random forest pseudo code?

A
  • Randomly select “k” features from total “m” features.
  • Where k << m
  • Among the “k” features, calculate the node “d” using the best split point.
  • Split the node into daughter nodes using the best split.
  • Repeat 1 to 3 steps until “l” number of nodes has been reached.
  • Build forest by repeating steps 1 to 4 for “n” number times to create “n” number of trees.
  • Take the test features and use the rules of each randomly created decision tree to predict the outcome, and store each predicted outcome (target).
  • Calculate the votes for each predicted target.
  • Consider the most-voted predicted target as the final prediction from the random forest algorithm (see the sketch below).
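
In practice you would rarely code this by hand; a scikit-learn sketch of the same idea (parameter values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# n_estimators = number of trees; max_features plays the role of the "k << m" random subset.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X, y)

predictions = forest.predict(X[:5])   # majority vote over the individual trees
```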
167
Q

What activation function would you use in the final layer of a NN, for regression and for classification?

A

Linear for regression

Softmax for classification

168
Q

What is a resnet?

A

The residual blocks in a ResNet learn residual functions. A block can easily learn the identity, which basically allows very deep networks to avoid vanishing gradients and degradation.

If y = F(x), it is hard to make F(x) learn the identity F(x) = x; it is easier to learn F(x) = 0 and use y = F(x) + x.

When deeper plain networks start converging, a degradation problem is exposed: as the network depth increases, accuracy gets saturated and then degrades rapidly.

Instead of learning a direct mapping x -> y with a function H(x) (a few stacked non-linear layers), define the residual function F(x) = H(x) - x, which can be reframed as H(x) = F(x) + x, where F(x) is the stacked non-linear layers and x is the identity (input = output).
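
A simplified PyTorch sketch of a residual block implementing y = F(x) + x (channel counts and layer choices are illustrative, not the exact configuration from the ResNet paper):

```python
import torch
from torch import nn

class ResidualBlock(nn.Module):
    """A simplified residual block: output = relu(F(x) + x), with an identity shortcut."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)   # add the skip connection, then the non-linearity

block = ResidualBlock(64)
y = block(torch.randn(1, 64, 32, 32))
```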

169
Q

Testing a fair coin?

A

Some useful information: for a Bernoulli variable the variance is p(1-p), so the standard error on the mean goes as sqrt(p(1-p)/n). You can then compare the estimate p_hat = heads / (heads + tails) with 0.5, given that error (assuming approximate normality).

170
Q

Difference and pros and cons of generative vs discriminative models?

A

Generative models allow you to make explicit claims about the process that underlies a dataset.

After fitting a generative model, you can also run them forward to generate synthetic data sets.

However, if the relationships expressed by your generative model only approximate the true underlying generative process that created your data, discriminative models will typically outperform in terms of classification error rates

A discriminative model is going to attempt to optimize the prediction of y from x, whereas a generative model will attempt to optimize the joint prediction of x and y. Because of this, discriminative models outperform generative models at conditional prediction tasks (logistic regression models tend to outperform Naive Bayes models with the same number of parameters).

There is one case in which you can’t use a discriminative model at all: if you don’t have labeled data.

171
Q

What is the main parameter for LSTM layers?

A

As for a GRU, you can choose the number of hidden units in the hidden state a.

It has the usual trade-off: too big and you overfit; too small and you get high bias.

172
Q

What are some feature selection algorithms?

A

Filter based: We specify some metric and based on that filter features. An example of such a metric could be correlation/chi-square variance threshold.

Wrapper-based: Wrapper methods consider the selection of a set of features as a search problem. Example: Recursive Feature Elimination

Embedded: Embedded methods use algorithms that have built-in feature selection methods. For instance, Lasso and RF have their own feature selection methods.

First, note that some algorithms already do feature selection: regularized models and random forests do that.

  • Variance threshold (unsupervised; you need to normalize first): this removes features with low variance, independently of their correlation with the target.
  • Correlation threshold, to avoid redundant features.
  • Genetic algorithms: complex but powerful; useful when you have a huge number of dimensions.
173
Q

In what part of the NN would you use leaky relu?

A

Leaky ReLU should only be used in hidden layers. In the final layer you want a classification or regression output, so there we use softmax or linear.

174
Q

Activation functions: What is the problem with tanh? How does it improve on sigmoid?

A

Improvement:

It is now centered around zero.

Problem: the derivatives are still small at the saturated ends, so gradients can still vanish.

175
Q

List dimensionality reduction algorithm?

A
  • PCA: simple basic algebra. They are all independent and ordered by the explained variance. (you need to normalize). PCA is a versatile technique that works well in practice. In addition, PCA offers several variations and extensions (i.e. kernel PCA, sparse PCA, etc.) to tackle specific roadblocks.

  • LDA (linear discriminant analysis, supervised): same idea as PCA, but now you want to maximize the separability between classes.

  • Autoencoders: neural networks trained to reconstruct their original inputs; since the input is also the target, they are unsupervised.
  • Manifold learning (Isomap, MDS, spectral embedding, t-SNE, which is used a lot, etc.).

176
Q

What are the 2 main differences between bagging boosting and stacking?

A

First stacking often considers heterogeneous weak learners (different learning algorithms are combined) whereas bagging and boosting consider mainly homogeneous weak learners.

Second, stacking learns to combine the base models using a meta-model whereas bagging and boosting combine weak learners following deterministic algorithms.

177
Q

How do you define the recall?

A

the recall is the True positive rate or sensitivity

TP/(TP+FN)

178
Q

How deep can you go when stacking RNN?

A

Not more than 3-4 layers. The temporal dimension makes it really computationally expensive.

179
Q

What is neural architecture search?

A
180
Q

What is an autoregressive model?

A

AR language modeling seeks to estimate the probability distribution of a text corpus with an autoregressive model [7, 27, 28]. Specifically, given a text sequence x = (x1, ..., xT), AR language modeling factorizes the likelihood into a forward product p(x) = prod_{t=1..T} p(x_t | x_{<t}). A parametric model (e.g. a neural network) is trained to model each conditional distribution. Since an AR language model is only trained to encode a uni-directional context (either forward or backward), it is not effective at modeling deep bidirectional contexts. On the contrary, downstream language understanding tasks often require bidirectional context information. This results in a gap between AR language modeling and effective pretraining.

181
Q

What is p-value?

A

In statistical hypothesis testing, the p-value (or probability value, or asymptotic significance) is the probability, for a given statistical model and when the null hypothesis is true, that the statistical summary (such as the sample mean difference between two compared groups) would be the same as or of greater magnitude than the actual observed result.[1]

182
Q

How can you use SVM on data that are not linearly separable?

A

kernel trick

183
Q

How should we compute the confidence interval of a predicted target in a linear regression model.

A
184
Q

What is importance sampling?

A
185
Q

Gaussian mixture model when to use it?

A

It is a clustering algorithm that, unlike k-means, is based on a statistical distribution, so it can also be used to generate new data. Of course, it assumes the data are drawn from a mixture of Gaussian distributions.

186
Q

How do you represent words for machine learning?

A

The basic idea is a bag of words: you have a dictionary of all the words and you represent each word as a vector with a 1 in the right position (one-hot encoding). The dictionary can be as large as 50K to 1M words.

187
Q

Can you use decision trees to do feature importance?

A

YES

The relative rank (i.e. depth) of a feature used as a decision node in a tree can be used to assess the relative importance of that feature with respect to the predictability of the target variable.

Features used at the top of the tree contribute to the final prediction decision of a larger fraction of the input samples.

In a forest, the expected fraction of the samples a feature contributes to can thus be used as an estimate of the relative importance of that feature.

188
Q

Describe different tokenization schemes used in NLP?

A

https://huggingface.co/docs/transformers/tokenizer_summary

Subword tokenization algorithms rely on the principle that frequently used words should not be split into smaller subwords, but rare words should be decomposed into meaningful subwords.

BPE (GPT-X) relies on a pre-tokenizer that splits the training data into words. Pretokenization can be as simple as space tokenization, e.g. GPT-2, Roberta. More advanced pre-tokenization include rule-based tokenization, e.g. XLM, FlauBERT which uses Moses for most languages, or GPT which uses Spacy and ftfy, to count the frequency of each word in the training corpus.

After pre-tokenization, a set of unique words has been created and the frequency with which each word occurred in the training data has been determined. Next, BPE creates a base vocabulary consisting of all symbols that occur in the set of unique words and learns merge rules to form a new symbol from two symbols of the base vocabulary. Start from single characters and grow from there. Byte-level BPE is a trick to avoid a base vocabulary of all Unicode characters by using the bytes that represent them instead.

WordPiece

Very similar to BPE, but it does not merge based on raw frequency; it merges based on likelihood.

WordPiece is slightly different to BPE in that it evaluates what it loses by merging two symbols to ensure it’s worth it.

SentencePiece (treats the input as a raw character stream, so it does not require pre-tokenization into words)

189
Q

What is a single node neural network equal too?

A

Logistic regression, if you use the sigmoid as the activation function. Note that logistic regression has a convex loss function, but a multi-layer neural network does not.

190
Q

What is Power Analysis?

A

The power of a binary hypothesis test is the probability that the test rejects the null hypothesis (H0) when a specific alternative hypothesis (H1) is true. The statistical power ranges from 0 to 1, and as statistical power increases, the probability of making a type II error (wrongly failing to reject the null hypothesis) decreases.

Think about p-values

191
Q

What is the advantage of non-parametric models?

A

You don’t have to worry about outliers or whether the data is linearly separable (e.g., decision trees easily take care of cases where you have class A at the low end of some feature x, class B in the mid-range of feature x, and A again at the high end).

192
Q

Gaussian mixture model what it is?

A

A Gaussian mixture model is a probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. One can think of mixture models as generalizing k-means clustering to incorporate information about the covariance structure of the data as well as the centers of the latent Gaussians.

It uses the expectation-maximization (EM) algorithm for fitting mixture-of-Gaussian models

The BIC criterion can be used to select the number of components in a Gaussian Mixture in an efficient way. In theory, it recovers the true number of components only in the asymptotic regime

The main difficulty in learning Gaussian mixture models from unlabeled data is that one usually doesn't know which points came from which latent component (if one has access to this information, it gets very easy to fit a separate Gaussian distribution to each set of points).

Expectation-maximization is a well-founded statistical algorithm to get around this problem by an iterative process. First one assumes random components (randomly centered on data points, learned from k-means, or even just normally distributed around the origin) and computes for each point a probability of being generated by each component of the model. Then, one tweaks the parameters to maximize the likelihood of the data given those assignments. Repeating this process is guaranteed to always converge to a local optimum.
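
A scikit-learn sketch tying these pieces together (two made-up Gaussian blobs; BIC-based model selection as mentioned above):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(200, 2)), rng.normal(5, 1, size=(200, 2))])  # two blobs

# Pick the number of components with BIC.
bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X) for k in range(1, 5)}
best_k = min(bics, key=bics.get)

gmm = GaussianMixture(n_components=best_k, random_state=0).fit(X)  # fitted via EM
labels = gmm.predict(X)             # hard assignments (predict_proba gives soft ones)
samples, _ = gmm.sample(50)         # generative: draw new synthetic points
```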

193
Q

What are the most common ML optimizers?

A

SGD

Stochastic gradient descent optimizer. You can use momentum, learning rate decay etc. Choosing a proper learning rate can be difficult.

Additionally, the same learning rate applies to all parameter updates, and plain SGD can get stuck in local minima or saddle points.

RMSprop

Adagrad

Adagrad is an optimizer with parameter-specific learning rates, which are adapted relative to how frequently a parameter gets updated during training. The more updates a parameter receives, the smaller the updates.

194
Q

Should we present the same mini batches in the same order to the optimization algorithm?

A

It is very often better to shuffle to avoid biases; the mini-batches should be reshuffled at every training epoch.

The only caveat is if you are trying to solve a simpler problem first and then trying to present harder cases to your optimizer.

195
Q

Naive Bayes pros

A
  • Computationally fast and simple to implement.
  • Works well with high dimensions.
  • Can make probabilistic predictions. It handles irrelevant features well (they just contribute the same factor P(irrelevant feature | class) to every class).
  • Also, it is a generative model, so it is easier to deal with missing values.
196
Q

What can be another way to look at imbalance classes problems?

A

Treat the problem as an anomaly detection problem.

197
Q

How would you search for the hyperparameters of a network?

A

A search consists of:

  • an estimator (regressor or classifier such as sklearn.svm.SVC());
  • a parameter space;
  • a method for searching or sampling candidates;
  • a cross-validation scheme; and
  • a score function.

You can search in different ways:

  • simple grid search
  • randomized search (Monte Carlo, etc.)
198
Q

Is logistic regression linear?

A

Almost: the final sigmoid function introduces a little non-linearity in the predicted probability, but the decision boundary itself is linear.

199
Q

What are the different types of RNN? many to ….

A
200
Q

Think about Hidden Markov Model. What are the training data? what are the hyperparameters?

A
201
Q

What are the disadvantages of bidirectional RNNs?

A

You need the complete input sequence before making a prediction; for example, you need the speaker to be done talking before processing the sentence.

202
Q

Time series forecast:

Given a time series predict how it continues for the next 7 month ?

A
  • Look at the time series first for pathological cases.
  • Naive: predict a constant equal to the last value, y_{t+1} = y_t, or the average of the whole dataset.
  • Moving average: choose a time window p to average over. You can also do this weighted.
  • Exponential smoothing: y_{t+1} = a*y_t + a(1-a)*y_{t-1} + a(1-a)^2*y_{t-2} + ...; older observations have exponentially decreasing influence.

All of the above do not work well with data with high variation.

Holt’s linear trend method: an extension of exponential smoothing that takes the trend into account. You fit a level and a trend, then combine them.

Holt-Winters method: this adds seasonality; it is one of the best of the classical methods (see the sketch below).

ARIMA (Autoregressive Integrated Moving Average): while exponential smoothing models are based on a description of trend and seasonality in the data, ARIMA models aim to describe the autocorrelations in the data.

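A sketch with statsmodels, which is assumed to be installed (the monthly series below is synthetic):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Synthetic monthly series with a trend and yearly seasonality.
idx = pd.date_range("2018-01-01", periods=48, freq="MS")
values = 100 + np.arange(48) + 10 * np.sin(np.arange(48) * 2 * np.pi / 12)
series = pd.Series(values, index=idx)

# Holt-Winters: level + trend + seasonality.
model = ExponentialSmoothing(series, trend="add", seasonal="add", seasonal_periods=12).fit()
forecast = model.forecast(7)   # the next 7 months
```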

203
Q

What are LSTM?

A

Long Short-Term Memory (LSTM) networks are an extension for recurrent neural networks.

The units of an LSTM are used as building units for the layers of a RNN, which is then often called an LSTM network.

LSTMs enable RNNs to remember their inputs over a long period of time. This is because LSTMs keep their information in a memory cell that is much like the memory of a computer: the LSTM can read, write and delete information from it.

204
Q

What are some classical time series algorithm?

A
  • Autoregression (AR): the autoregression method models the next step in the sequence as a linear function of the observations at prior time steps, so x_t = c + sum_i( phi_i * x_{t-i} ) + eps_t.
  • Moving average (MA) and autoregressive moving average (ARMA).

to be completed with https://machinelearningmastery.com/time-series-forecasting-methods-in-python-cheat-sheet/

205
Q

How do the 3 main ensemble methods impact the variance and bias of the final model? What is their aim?

A

Very roughly, we can say that bagging will mainly focus at getting an ensemble model with less variance than its components whereas boosting and stacking will mainly try to produce strong models less biased than their components (even if variance can also be reduced).

206
Q

What is Label Smoothing?

A

Label smoothing is a regularization technique that introduces noise in the labels. This accounts for the fact that datasets may contain mistakes, so maximizing the likelihood log p(y|x) directly can be harmful. Assume that, for a small constant epsilon, the training-set label y is correct with probability 1 - epsilon and incorrect otherwise. Label smoothing regularizes a model based on a softmax with k output values by replacing the hard 0 and 1 classification targets with targets of epsilon / (k - 1) and 1 - epsilon respectively.

https://towardsdatascience.com/what-is-label-smoothing-108debd7ef06
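
A tiny NumPy sketch of the target replacement described above (the epsilon value is a hypothetical choice):

```python
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    """Replace 1 with 1 - eps and 0 with eps / (k - 1) in one-hot targets."""
    k = one_hot.shape[-1]
    return one_hot * (1 - eps) + (1 - one_hot) * (eps / (k - 1))

y = np.eye(3)[[0, 2]]      # two one-hot labels over k = 3 classes
print(smooth_labels(y))    # each row becomes [0.9, 0.05, 0.05] in some order
```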

207
Q

What is a mixture of experts?

A
208
Q

What is the lottery ticket hypothesis?

A
209
Q

How should you use cross-validation while oversampling your data (for imbalanced class problems)?

A

Before, just like feature selection. Keep in mind that over-sampling takes observed rare samples and applies bootstrapping to generate new random data based on a distribution function. If cross-validation is applied after over-sampling, what we are basically doing is overfitting our model to a specific artificial bootstrapping result. That is why cross-validation should always be done before over-sampling the data, just as feature selection should be. Only by resampling the data repeatedly within each fold can randomness be introduced into the dataset to make sure there won't be an overfitting problem.

210
Q

Explain and think about a deep RNN architecture.

A
211
Q

What is gradient clipping, why it is used, what are its cons and the 2 types of gradient clipping that exist?

A

Gradient clipping was used to avoid exploding gradients. Now it is less used in favor of batch normalization and even more layer normalization.

Clipping by value: you simply restrict each gradient component to a range. The problem is that this changes the direction of the gradient, because you are not rescaling all the components together.

The other approach is gradient clipping by norm (rescaling the whole gradient vector when its norm exceeds a threshold); however, you then risk very small gradients and tiny updates.

212
Q

How does Latent Dirichlet Allocation (LDA) work?

A

It is a generative probabilistic model which describes each document as a mixture of topics and each topic as a distribution of words. LDA generalizes Probabilistic Latent Semantic Analysis (PLSA) [2] by adding a Dirichlet prior distribution over the document-topic and topic-word distributions.

LDA and PLSA discretize the continuous topic space into t topics and model documents as mixtures of those t topics. These models assume the number of topics t to be known. The discretization of topics is necessary to model the relationship between documents and words.

213
Q

What is a major weakness of Latent Dirichlet Allocation (LDA)?

A

This is one of the greatest weaknesses of these models: the number of topics t, or a way to estimate it, is rarely known, especially for very large or unfamiliar datasets.