Machine Learning Flashcards

Question

What are weights in neural networks?

Answer 1

Weights represent the strength of a connection between two units. If the weight from node 1 to node 2 has greater magnitude, it means that neuron 1 has greater influence over neuron 2. A weight scales the importance of the input value: a weight near zero means the input has little effect on the output. The more an input influences the cost function, the more weight it will be given.

Answer 2

A cost function is a way of telling the computer how right or wrong it was. You add up the squares of the differences between each of the output activations and the values that you want them to have. The cost is small if the model confidently classifies correctly. The average cost over the training examples is a measure of how well a neural network performs. Gradient descent can be used to find a low cost: find the direction of steepest descent, take a step down, and repeat until a minimum has been reached.
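
As a minimal sketch of the two ideas above, the following computes a sum-of-squares cost and runs plain gradient descent on a toy one-parameter cost. The function names, the toy cost (w - 3)^2, the learning rate and the step count are all assumptions chosen for illustration.

```python
def squared_error_cost(outputs, targets):
    """Sum of squared differences between actual and desired outputs."""
    return sum((o - t) ** 2 for o, t in zip(outputs, targets))


def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly take a small step against the gradient of a one-parameter cost."""
    x = start
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x


# Hypothetical toy cost: cost(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w_min = gradient_descent(lambda w: 2 * (w - 3), start=0.0)
```

Here w_min ends up very close to 3, the minimum of the toy cost.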

Answer 3

In hard voting (also known as majority voting), every individual classifier votes for a class, and the majority wins.
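
A minimal sketch of hard voting, assuming each classifier has already produced a class label (the function name is hypothetical):

```python
from collections import Counter


def hard_vote(predictions):
    """Return the class label predicted by the most classifiers."""
    # On a tie, the label that was counted first wins.
    return Counter(predictions).most_common(1)[0][0]
```

For example, hard_vote(["cat", "dog", "cat"]) picks "cat".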

Answer 4

In soft voting, every individual classifier provides a probability value that a specific data point belongs to a particular target class. The predictions are weighted by the classifier's importance and summed up. Then the target label with the greatest sum of weighted probabilities wins the vote.
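
A minimal sketch of soft voting, assuming each classifier returns a dict of class probabilities and that the optional per-classifier weights are supplied by the caller (names are hypothetical):

```python
def soft_vote(probabilities, weights=None):
    """Pick the label with the greatest weighted sum of predicted probabilities.

    probabilities: one dict per classifier, mapping class label -> probability.
    weights: optional importance of each classifier (defaults to equal weights).
    """
    if weights is None:
        weights = [1.0] * len(probabilities)
    totals = {}
    for probs, weight in zip(probabilities, weights):
        for label, p in probs.items():
            totals[label] = totals.get(label, 0.0) + weight * p
    return max(totals, key=totals.get)
```

Note that one very confident classifier can outvote two lukewarm ones, which is the main difference from hard voting.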

Answer 5

Bootstrapping refers to a resampling method that consists of repeatedly drawing samples, with replacement, from the data to form other smaller datasets, called bootstrap samples. For example, let’s say we have a set of observations: [2, 4, 32, 8, 16]. If we want each bootstrap sample to contain n observations, the following are valid samples: - n=3: [32, 4, 4], [8, 16, 2], [2, 2, 2]…
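
A minimal sketch of drawing one bootstrap sample with the standard library. The function name and the optional rng argument (used only to make results repeatable) are assumptions.

```python
import random


def bootstrap_sample(data, n, rng=random):
    """Draw n observations from data with replacement."""
    return [rng.choice(data) for _ in range(n)]


observations = [2, 4, 32, 8, 16]
sample = bootstrap_sample(observations, 3, random.Random(0))
```

Because sampling is with replacement, the same observation can appear more than once in a sample.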

Answer 6

Bagging means bootstrap + aggregating, and it is an ensemble method in which we first bootstrap our data and then train one model for each bootstrap sample. After that, we aggregate the models with equal weights. When sampling is done without replacement, the method is called pasting.

Answer 7

It is an ensemble method that can combine several weak learners into a strong learner. The idea is to train predictors sequentially, each trying to correct its predecessor.

Answer 8

As with normal boosting, it trains predictors sequentially, each trying to correct its predecessor. However, instead of tweaking the instance weights at every iteration, this method tries to fit the new predictor to the residual (remaining) errors made by the previous predictor.
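
The residual-fitting idea can be sketched in plain Python with one-split regression stumps as the weak learners. This is a simplified illustration under stated assumptions, not a library implementation: the function names, the learning rate and the choice to start from a prediction of zero are all hypothetical.

```python
def fit_stump(xs, ys):
    """Fit a one-split regression stump: predict the mean on each side of a threshold."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((y - left_mean) ** 2 for y in left)
                 + sum((y - right_mean) ** 2 for y in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Fit each new stump to the residual errors left by the previous ones."""
    models = []
    residuals = list(ys)  # starting prediction is 0, so residuals start as ys
    for _ in range(n_rounds):
        stump = fit_stump(xs, residuals)
        models.append(stump)
        residuals = [r - learning_rate * stump(x) for x, r in zip(xs, residuals)]
    return lambda x: sum(learning_rate * m(x) for m in models)
```

The final prediction is the sum of all the stumps' scaled outputs, so every round removes part of the error that the earlier rounds left behind.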

Answer 9

Given multiple machine learning models that are skillful on a problem, but in different ways, how do you choose which model to use (trust)? The approach to this question is to use another machine learning model that learns when to use or trust each model in the ensemble. This model is called a "blender".

Answer 10

Bias is the difference between the average prediction of our model and the correct value which we are trying to predict. A model with high bias pays very little attention to the training data and oversimplifies the problem. This leads to high error on both training and test data.

Answer 11

Variance is the variability of the model's prediction for a given data point, a value which tells us the spread of our predictions. A model with high variance pays a lot of attention to the training data and does not generalize to data which it hasn’t seen before. As a result, such models perform very well on training data but have high error rates on test data.

Answer 12

If our model is too simple and has very few parameters, then it may have high bias and low variance. On the other hand, if our model has a large number of parameters, then it’s going to have high variance and low bias. So we need to find a good balance without overfitting or underfitting the data. This tradeoff in complexity is why there is a tradeoff between bias and variance. An algorithm can’t be more complex and less complex at the same time.

Answer 13

Regularization is a technique that constrains (regularizes) or shrinks the coefficient estimates (often the weights) towards zero. In other words, this technique discourages learning a more complex or flexible model, so as to avoid the risk of overfitting, by simplifying the model.
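
One common way to shrink weights towards zero is to add an L2 penalty to the cost (as in ridge regression). The sketch below assumes a mean-squared-error cost and a hypothetical strength parameter alpha; other penalties (such as L1) follow the same pattern.

```python
def ridge_cost(predictions, targets, weights, alpha=0.1):
    """Mean squared error plus an L2 penalty on the weights."""
    mse = sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)
    penalty = alpha * sum(w ** 2 for w in weights)
    return mse + penalty
```

With the same predictions, a model with larger weights gets a higher cost, so training is pushed towards smaller weights.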

Answer 14

A matrix that shows the numbers of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN), which can be used to calculate precision, recall and F1.
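
A minimal sketch of counting the four cells for a binary problem, assuming the labels are given as two equal-length lists and that 1 marks the positive class (the function name is hypothetical):

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, TN, FN) for a binary classification problem."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        else:
            if actual == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn
```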

Answer 15

It is the accuracy of the positive predictions, and is calculated as TP / (TP + FP).

Answer 16

It is the ratio of positive instances that the model correctly detects (the true positive rate), and is calculated as TP / (TP + FN).

Answer 17

The F1 score is a single measure of both recall and precision. It takes their harmonic mean, meaning it gives more weight to low values (skewed data). Therefore both recall and precision have to be high to get a good F1 score.
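
A minimal sketch of precision, recall and F1 computed from confusion-matrix counts. The function names are hypothetical, and the code assumes the denominators are non-zero.

```python
def precision(tp, fp):
    """Accuracy of the positive predictions."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of the actual positives that were detected."""
    return tp / (tp + fn)


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)
```

With tp=8, fp=2, fn=8, precision is 0.8 and recall is 0.5. The F1 score (8/13, about 0.615) is lower than the plain average of 0.65 because the harmonic mean is pulled towards the lower value.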

Answer 18

It uses a decision function to compute a score for each instance, and based on a given threshold it decides which class the instance belongs to. A random forest doesn't use a decision function; it uses predict_proba() instead.

Answer 19

One strategy (for example, with the 10 digits) is to train 10 binary classifiers, one per class, and pick the class whose classifier gives the highest score. This is One-versus-All (OvA). One-versus-One (OvO) instead trains a classifier for every pair of classes and runs them against each other; the class with the most "won" duels "wins". (OvA is often preferred.)
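
The two strategies can be sketched as follows. The function names are hypothetical, and the trained binary classifiers are stood in for by plain scores and a duel function supplied by the caller.

```python
from collections import Counter
from itertools import combinations


def ova_predict(scores):
    """scores: dict mapping each class to the score of its 'class vs the rest' classifier."""
    return max(scores, key=scores.get)


def ovo_predict(classes, duel):
    """duel(a, b) returns the winner of the a-vs-b classifier; the most wins takes it."""
    wins = Counter(duel(a, b) for a, b in combinations(classes, 2))
    return wins.most_common(1)[0][0]


# For the 10 digits, OvA needs 10 classifiers and OvO needs one per pair of classes.
n_ovo_classifiers = len(list(combinations(range(10), 2)))
```

For N classes, OvO needs N * (N - 1) / 2 classifiers, which is 45 for the digits.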

Answer 20

By taking the gradient (the vector of partial derivatives) of the cost function with respect to all the parameters. This is done to see how much the cost function changes when each parameter is adjusted just a little bit.
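
The "adjust each parameter a little bit" idea can be sketched with a finite-difference approximation. This is only an illustration with a hypothetical function name and step size; in practice gradients are usually computed analytically or by automatic differentiation.

```python
def numerical_gradient(cost, params, eps=1e-6):
    """Approximate each partial derivative by nudging one parameter at a time."""
    base = cost(params)
    grads = []
    for i in range(len(params)):
        bumped = list(params)
        bumped[i] += eps
        grads.append((cost(bumped) - base) / eps)
    return grads
```

For the toy cost p[0]**2 + 3 * p[1] at the point [2, 5], this gives approximately [4, 3].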

Answer 21

By gradually reducing the learning rate as it gets closer to the global minimum.
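
One simple way to do this is a learning schedule that shrinks the learning rate as training progresses. The inverse-decay formula and its constants below are assumptions chosen for illustration; many other schedules exist.

```python
def learning_rate(step, initial=0.1, decay=0.01):
    """Start at `initial` and shrink smoothly as the step count grows."""
    return initial / (1 + decay * step)
```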

Answer 22

To see whether the model is overfitting or underfitting the data. It is based on the performance on both the training set and the validation set as a function of the training set size (computed by training on subsets of different sizes, for example with cross-validation).

Answer 23

When the model is based on wrong assumptions, such as assuming the data is linear when it is in fact polynomial. Low complexity = high bias.

Answer 24

When the model is very sensitive to small variations in the data. This is often seen in models with many degrees of freedom, and will in most cases lead to overfitting. High complexity = high variance.

Answer 25

When there is a lot of noise in the data. Handled by cleaning up the data set.

Answer 26

To train and combine multiple binary classes in one classifier.

Answer 27

They are very prone to overfitting, and it is therefore important to regularize them (max_depth, max_leaf_nodes). Furthermore, they are sensitive to the orientation of the data, as decision trees prefer to split horizontally or vertically (perpendicular to an axis). This can be handled by applying PCA, which often gives a better orientation of the data. They are also very sensitive to new data, as it can easily change the features and thresholds the model splits on.

Answer 28

It combines many decision trees (500, for example) and uses bagging or pasting. Furthermore, to avoid overfitting, it randomly picks a subset of the features to train each tree on, so that the trees are as diverse as possible (higher bias, lower variance).

Answer 29

It's a way to avoid overfitting. It is done by programming the algorithm to stop iterating when the validation error reaches its minimum.
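
A minimal sketch of the stopping rule, assuming the validation error is recorded once per epoch. The function name and the patience parameter (how many epochs without improvement to tolerate) are hypothetical.

```python
def early_stopping_epoch(validation_errors, patience=2):
    """Return the epoch with the lowest validation error seen before stopping."""
    best_error = float("inf")
    best_epoch = -1
    for epoch, error in enumerate(validation_errors):
        if error < best_error:
            best_error, best_epoch = error, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch
```

In a real training loop you would also keep a copy of the model from the best epoch and roll back to it after stopping.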

Answer 30

A convolution is the simple application of a filter to an input that results in an activation. Repeated application of the same filter to an input results in a map of activations called a feature map, indicating the locations and strength of a detected feature in an input, such as an image. Simple filters could detect horizontal edges, and others might detect circles. The role of the ConvNet is to reduce the images into a form which is easier to process, without losing features which are critical for getting a good prediction.
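
A minimal sketch of sliding one filter over an image stored as a list of rows. It assumes no padding and a stride of 1, and, as is usual in deep learning, it does not flip the kernel. The function name and the example filter are hypothetical.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image and return the resulting feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# A simple horizontal-edge filter applied to an image that is dark on top, bright below.
image = [[0, 0, 0], [0, 0, 0], [1, 1, 1], [1, 1, 1]]
edge_filter = [[1, 1], [-1, -1]]
feature_map = convolve2d(image, edge_filter)
```

The feature map is non-zero only in the row where the filter straddles the dark-to-bright boundary, which is the location of the horizontal edge.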

Answer 31

RNNs are good at handling sequential data and are widely used in NLP. A normal feedforward neural network has an input layer, hidden layers and an output layer. An RNN has a loop within the hidden layer, which allows information to flow from one step to the next; this information is called the 'hidden state'. However, RNNs suffer from short-term memory. For longer sequences, LSTMs and GRUs can be used, which take longer-term memory into account.
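
The hidden-state loop can be sketched with a single recurrent unit. Scalar inputs and weights, the tanh activation and the specific weight values are assumptions made to keep the example small.

```python
import math


def rnn_step(x, h_prev, w_x, w_h, b):
    """Combine the current input with the previous hidden state."""
    return math.tanh(w_x * x + w_h * h_prev + b)


def run_rnn(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Feed a sequence through the unit, carrying the hidden state forward."""
    h = 0.0
    states = []
    for x in inputs:
        h = rnn_step(x, h, w_x, w_h, b)
        states.append(h)
    return states
```

For the input sequence [1.0, 0.0, 0.0], the hidden state stays positive after the first step but shrinks at every step, which illustrates both the memory and its short-term nature.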

Answer 32

In simple terms, after each forward pass through a network, backpropagation performs a backward pass while adjusting the model’s parameters (weights and biases) to reduce the cost function. The level of adjustment is determined by the gradients of the cost function with respect to those parameters.

Answer 33

Activation functions are mathematical equations that determine the output of a neural network. The function is attached to each neuron in the network, and determines whether it should be activated (“fired”) or not, based on whether each neuron’s input is relevant for the model’s prediction. ReLU is the default activation function for most NN's as it is easier to train and often achieves better performance.
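
Two common activation functions, written for a single input value as a minimal sketch (the sigmoid is included only for comparison with ReLU):

```python
import math


def relu(z):
    """Pass positive values through unchanged and output 0 otherwise."""
    return max(0.0, z)


def sigmoid(z):
    """Squash any value into the range (0, 1)."""
    return 1 / (1 + math.exp(-z))
```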

Answer 34

To group similar instances together into clusters. It's great for segmentation (including image segmentation), search engines, semi-supervised learning, etc.

Answer 35

The objective is to learn what "Normal" data looks like, and then use that to detect abnormal instances - like defective items.

Answer 36

The task of estimating the probability density function (PDF) of the random process that generated the dataset. It is commonly used for anomaly detection.

Answer 37

Clustering is a fast way to assign labels to unlabeled data. K-Means is one of the most popular clustering algorithms. K-means stores $k$ centroids that it uses to define clusters. A point is considered to be in a particular cluster if it is closer to that cluster's centroid than to any other centroid.
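
A minimal sketch of the K-means loop: assign each point to its nearest centroid, then move each centroid to the mean of its points. The function names are hypothetical, the starting centroids are supplied by the caller, and a fixed number of iterations replaces a proper convergence check.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign_clusters(points, centroids):
    """Label each point with the index of its closest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
        for p in points
    ]


def update_centroids(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for c, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == c]
        if not members:
            new_centroids.append(old)  # keep an empty cluster's centroid in place
            continue
        new_centroids.append(tuple(sum(coord) / len(members) for coord in zip(*members)))
    return new_centroids


def kmeans(points, centroids, iterations=10):
    for _ in range(iterations):
        labels = assign_clusters(points, centroids)
        centroids = update_centroids(points, labels, centroids)
    return assign_clusters(points, centroids), centroids
```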

Answer 38

The root node is the top node. Internal nodes have arrows pointing to them and away from them. Leaf nodes only have arrows pointing to them.

Answer 39

A perceptron is a simple type of neural network that multiplies inputs by weights to calculate a weighted sum, which is then passed through a step function. A multilayer perceptron (MLP) has multiple layers (hidden layers). Instead of a step function, a logistic function is used, because training relies on gradient descent and the step function is flat, so there is no gradient to follow. Backpropagation is then used to update the weights to reduce the error of each neuron, and thereby get a better output.
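
A minimal sketch of a single perceptron with a step function. The hand-picked weights and bias that make it behave like a logical AND are assumptions for illustration, not learned values.

```python
def perceptron_output(inputs, weights, bias):
    """Weighted sum of the inputs followed by a step function."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if weighted_sum >= 0 else 0


# Hypothetical weights that implement AND for two binary inputs.
AND_WEIGHTS, AND_BIAS = [1.0, 1.0], -1.5
```

The larger a weight's magnitude, the more its input contributes to the weighted sum and therefore to the output.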

Answer 40

A type of NN that is very good for complex machine learning tasks involving thousands of features and millions of data samples. It uses convolutional layers (each conv. layer applies different filters that slide over the picture and detect patterns). As the data goes through the conv. layers, small patterns are combined into bigger and bigger ones until the model can "recognize" the correct output. It also uses pooling layers to reduce the size of the image (max pooling, etc.).
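
Max pooling can be sketched as keeping only the largest value in each block of the feature map. The function name is hypothetical, and the sketch assumes non-overlapping square blocks (stride equal to the block size).

```python
def max_pool2d(feature_map, size=2):
    """Shrink a feature map by keeping the maximum of each size x size block."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(feature_map[i + di][j + dj]
                for di in range(size) for dj in range(size))
            for j in range(0, cols - size + 1, size)
        ]
        for i in range(0, rows - size + 1, size)
    ]
```

A 4x4 feature map becomes 2x2, so each pooling layer with size 2 halves the width and the height.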

Answer 41

The most popular is "Gini" impurity. For a yes/no node it is calculated as 1 - (the probability of "yes")^2 - (the probability of "no")^2. The weighted average over all the leaf nodes is then calculated.
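
A minimal sketch of the calculation, assuming each leaf is described by its class counts, for example [number of "yes", number of "no"] (the function names are hypothetical):

```python
def gini(counts):
    """Gini impurity of one node: 1 minus the sum of squared class probabilities."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)


def weighted_gini(leaves):
    """Average the leaves' impurities, weighted by how many samples each leaf holds."""
    total = sum(sum(leaf) for leaf in leaves)
    return sum(sum(leaf) / total * gini(leaf) for leaf in leaves)
```

A pure leaf such as [10, 0] has impurity 0, while an evenly mixed leaf such as [5, 5] has impurity 0.5.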

Answer 42

A type of NN that is very good for complex machine learning tasks involving thousands of features and millions of data samples. It uses convolutional layers (which convolve over receptive fields with a given stride) and applies several different filters in each layer to spot patterns in the pictures/data. As the data goes through the conv. layers, small patterns are combined into bigger and bigger ones until the model can "recognize" the correct output.

Answer 43

TensorFlow is a powerful open source software library for numerical computation, particularly well suited and fine-tuned for large-scale Machine Learning. Its basic principle is simple: you first define in Python a graph of computations to perform (for example, the one in Figure 9-1), and then TensorFlow takes that graph and runs it efficiently using optimized C++ code.

(67 cards)