Machine Learning Flashcards

1
Q

What are the different types of machine learning?

A

Supervised learning (Classification and regression)
Unsupervised learning (Clustering and dimensionality reduction)
Reinforcement learning
(And semi-supervised learning)

2
Q

What is machine learning?

A

The science of programming computers to learn from data

3
Q

What are some of the benefits of machine learning compared to “old school” programming?

A

Take a spam filter, for example. With old school programming you have to hard-code all the “spam” words and patterns for the program to find them. With ML the model finds these patterns by itself and, when presented with new data, can automatically pick up new patterns (spammers keep changing their mails to get past the spam filters and get more clicks/views).

ML is also very good at finding patterns in large data sets with many different features (high complexity).

4
Q

What is the difference between online and batch learning?

A

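Online learning is when the model learns incrementally (“on-the-go”) by being continuously fed new data, one instance or one small mini-batch at a time.
Batch learning is the opposite: the model is trained offline on all the available data at once, and it only learns from new data when it is retrained on the full data set (old and new).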

5
Q

What is the difference between a classifier and a regressor?

A

A classifier predicts which class a sample belongs to, whereas a regressor predicts a numeric target value.

6
Q

Is unsupervised learning important to read up on?

A

hmmm, look at the slides

7
Q

What is semi supervised learning?

A

When only a small part of the data is labeled. An ML model can then cluster the data, and if at least one of the instances in a cluster is labeled, the model can assign that label to the rest of the cluster (e.g. an online photo album).

8
Q

What is reinforcement learning?

A

It is a learning system (an agent) that is rewarded or penalized for the actions it takes in the environment it is put into. It thereby learns the best possible strategy, called a policy, for the different situations it can be presented with.

Examples:
AlphaGo
Robots learning to walk

9
Q

What is learning rate?

A

The degree of adaptation to new data. High learning rate = high degree of adaptation.
A high learning rate also means that the model will quickly forget old data (and be more sensitive to noisy data).

10
Q

Why is high generalization more important than good performance measures on the training data?

A

Good performance measures on the training data indicate how well the model fits the data it has already seen, but in the end what matters is that the model performs well on new instances (a high level of generalization).

11
Q

What is instance based learning when talking about approaches to generalization?

A

When the model uses a “measure of similarity”: it memorizes the training examples and compares new instances to them. So if a new email has a lot of words in common with a known spam mail, the new mail will be predicted to be spam (e.g. kNN).

12
Q

What is model based learning when talking about approaches to generalization?

A

When a model is built from a set of samples and that model is then used to make predictions on new instances (e.g. a linear regression model fitted to the training data).

13
Q

What is the difference between a fitness/utility function and a cost function?

A

A utility/fitness function measures how good the model is, whereas a cost function measures how bad it is.
A cost function typically measures the distance between the model’s predictions and the training examples. The objective is then to minimize this distance.

14
Q

What is sampling bias?

A

It is when the data is not representative of the whole population because of the way the sample was collected.

Do we have sampling bias?

15
Q

What is overfitting? And when does it occur?

A

When the model fits the training data too closely (including its noise) and then isn’t capable of generalizing to new data (high variance).

It happens when the model is too complex relative to the amount and noisiness of the training data.

16
Q

What is regularization?

A

Constraining the model to make it simpler (and thereby reduce the risk of overfitting). It is done by using a simpler model (fewer degrees of freedom) or by constraining the model through its regularization hyperparameters.

17
Q

What is a hyperparameter?

A

It is a parameter of the learning algorithm (not of the model). It is set before training and stays constant during training. Tuning it can help create a better model, for example by constraining the model so that it does not overfit.

18
Q

What is low training error and high generalization error?

A

Low training error = the model performs very well on the training data
High generalization error = the model performs poorly on new data (e.g. the test set). Low training error combined with high generalization error means the model is overfitting.

19
Q

Why should the test set only be used in the end?

A

Because if you evaluate on the test set multiple times, you will end up adapting the model and its hyperparameters to that particular set. The measured error then becomes too optimistic, and the model will not perform as well when presented with new data.

20
Q

What is the difference between model parameters and hyperparameters?

A

In summary, model parameters are estimated from the data automatically during training, whereas hyperparameters are set manually before training and steer the process that estimates the model parameters.

21
Q

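What are RMSE and MAE, and what are the differences?

A

They are performance measures for regression. RMSE (root mean squared error) is the square root of the average squared error; it roughly corresponds to the standard deviation of the errors the system makes in its predictions. MAE (mean absolute error) is the average of the absolute errors.
Because RMSE squares the errors, it is more sensitive to outliers, so MAE is sometimes preferred when there are many outliers.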
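
A minimal sketch of the two measures in plain Python, to show how a single outlier affects them differently. The target and prediction values are made up for illustration.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared difference."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def mae(y_true, y_pred):
    """Mean absolute error: mean of the absolute differences."""
    absolute = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return sum(absolute) / len(absolute)

# Hypothetical data: three perfect predictions and one outlier (error of 8).
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [3.0, 5.0, 7.0, 1.0]

rmse_value = rmse(y_true, y_pred)  # 4.0: squaring lets the outlier dominate
mae_value = mae(y_true, y_pred)    # 2.0: the outlier counts only linearly
```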

22
Q

What is an ensemble and ensemble learning?

A

An ensemble is a group of predictors (such as classifiers or regressors) whose predictions are aggregated. Ensemble learning is the technique of training and combining such a group. A random forest is an ensemble of decision trees.

23
Q

What is stochastic gradient descent?

A

Instead of carefully calculating the best next step using the full dataset (batch gradient descent), stochastic gradient descent calculates each step from a single randomly picked training instance (or, in the mini-batch variant, from a small random subset of the data). This speeds up each step significantly. Imagine gradient descent as a man who carefully calculates each step, which takes a long time, and stochastic gradient descent as a semi-drunk man who stumbles a bit more but is way faster.
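
A minimal sketch of the idea, assuming a one-feature linear model and a squared-error cost. It updates the parameters from one randomly ordered instance at a time (a mini-batch variant would average the gradient over a few instances instead). The function name, data, learning rate and epoch count are all illustrative choices.

```python
import random

def sgd_linear_regression(xs, ys, learning_rate=0.02, n_epochs=500, seed=0):
    """Fit y = w * x + b by taking one gradient step per training instance."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(n_epochs):
        rng.shuffle(indices)  # visit the instances in a new random order
        for i in indices:
            error = (w * xs[i] + b) - ys[i]
            # Gradient of (prediction - target)^2 for this single instance.
            w -= learning_rate * 2 * error * xs[i]
            b -= learning_rate * 2 * error
    return w, b

# Hypothetical noise-free data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]
w, b = sgd_linear_regression(xs, ys)  # w ends up close to 2, b close to 1
```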

24
Q

What is a perceptron (Or neuron)?

A
The Perceptron is one of the simplest ANN architectures.
It is based on an artificial neuron called a linear threshold unit (LTU): the inputs and output are numbers (instead of binary on/off values) and each input connection is associated with a weight. The LTU computes a weighted sum of its inputs (z = w_1 x_1 + w_2 x_2 + ⋯ + w_n x_n = w^T · x), then applies a step function to that sum and outputs the result: h_w(x) = step(z) = step(w^T · x).
A single LTU can be used for simple linear binary classification. It computes a linear combination of the inputs, and if the result exceeds a threshold it outputs the positive class; otherwise it outputs the negative class.
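
A small sketch of a single LTU in plain Python. The bias term is an assumption added here: it plays the role of the (negated) threshold, so the step function can compare against zero. The weights are hand-picked to implement a logical AND, not learned.

```python
def step(z):
    """Step function: 1 when z is non-negative, otherwise 0."""
    return 1 if z >= 0 else 0

def ltu_predict(weights, bias, inputs):
    """Weighted sum of the inputs plus a bias, passed through the step function."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return step(z)

# Hypothetical weights: the unit fires only when both inputs are 1.
and_weights, and_bias = [1.0, 1.0], -1.5
and_outputs = [
    ltu_predict(and_weights, and_bias, [a, b])
    for a in (0, 1)
    for b in (0, 1)
]  # inputs (0,0), (0,1), (1,0), (1,1) give [0, 0, 0, 1]
```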
25
Q

What are weights in neural networks?

A

Weights represent the strength of a connection between two units. If the weight from neuron 1 to neuron 2 has a greater magnitude, neuron 1 has a greater influence over neuron 2. A weight scales the importance of an input value: a weight close to zero brings the importance of that input down. During training, each weight is adjusted according to how much it influences the cost function.

26
Q

What is a cost function?

A

A cost function is a way of telling the computer how right or wrong it was. For one training example, you add up the squares of the differences between each output activation and the value that you want it to have. The result is small if the model confidently classifies correctly, and large otherwise. The average cost over all training examples is a measure of how well a neural network performs. Gradient descent can be used to find a low cost: find the direction of steepest descent, take a step down, and repeat until a minimum has been found.

27
Q

What is hard voting?

A

In hard voting (also known as majority voting), every individual classifier votes for a class, and the majority wins.
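
As a sketch, hard voting over the predictions of three hypothetical classifiers can be written in a few lines. The labels are invented, and ties are resolved in favor of the label seen first.

```python
from collections import Counter

def hard_vote(predictions):
    """Return the class label that received the most votes."""
    return Counter(predictions).most_common(1)[0][0]

# One predicted label per classifier for the same sample.
votes = ["spam", "ham", "spam"]
winner = hard_vote(votes)  # "spam" wins 2 to 1
```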

28
Q

What is soft voting?

A

In soft voting, every individual classifier provides a probability value that a specific data point belongs to a particular target class. The predictions are weighted by the classifier’s importance and summed up. Then the target label with the greatest sum of weighted probabilities wins the vote.
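
A minimal sketch of soft voting, assuming each classifier reports its class probabilities as a dictionary. The probabilities are made up, and equal weights are used when none are given. Note that a majority vote on the same three classifiers would pick "ham" (two of the three favor it), while the summed probabilities favor "spam".

```python
def soft_vote(probabilities, weights=None):
    """Sum the weighted class probabilities and return the best label."""
    if weights is None:
        weights = [1.0] * len(probabilities)
    totals = {}
    for probs, weight in zip(probabilities, weights):
        for label, p in probs.items():
            totals[label] = totals.get(label, 0.0) + weight * p
    return max(totals, key=totals.get), totals

# Hypothetical outputs of three classifiers for the same sample.
probabilities = [
    {"spam": 0.90, "ham": 0.10},
    {"spam": 0.40, "ham": 0.60},
    {"spam": 0.45, "ham": 0.55},
]
label, totals = soft_vote(probabilities)  # spam sums to 1.75, ham to 1.25
```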

29
Q

What is bootstrapping?

A

Bootstrapping refers to a resampling method that consists of repeatedly drawing samples from the data, with replacement, to form new (often smaller) datasets called bootstrap samples.

For example, let’s say we have a set of observations: [2, 4, 32, 8, 16]. If we want each bootstrap sample to contain n observations, the following are valid samples:
- n=3: [32, 4, 4], [8, 16, 2], [2, 2, 2]…
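
A sketch of drawing such samples with the standard library. `random.choices` samples with replacement, so the same observation can appear more than once in a sample. The seed and the number of samples are arbitrary.

```python
import random

def bootstrap_sample(data, n, rng):
    """Draw n observations from data with replacement."""
    return rng.choices(data, k=n)

rng = random.Random(42)  # fixed seed so the run is repeatable
observations = [2, 4, 32, 8, 16]
samples = [bootstrap_sample(observations, 3, rng) for _ in range(4)]
```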

30
Q

What is bagging and pasting?

A

Bagging is short for bootstrap aggregating. It is an ensemble method in which we first bootstrap our data and train one model on each bootstrap sample. After that, we aggregate the models’ predictions with equal weights. When the sampling is done without replacement, the method is called pasting.

31
Q

What is boosting?

A

It is an ensemble method that combines several weak learners into a strong learner. The idea is to train predictors sequentially, each trying to correct its predecessor.

32
Q

What is gradient boosting?

A

Like other boosting methods, it trains predictors sequentially, each trying to correct its predecessor. However, instead of tweaking the instance weights at every iteration (as AdaBoost does), this method fits the new predictor to the residual errors made by the previous predictor.
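
A rough sketch of the residual-fitting idea for regression, assuming a squared-error loss, a learning rate of 1, and one-split regression trees ("stumps") as the weak learners. The helper names and data are hypothetical, and real implementations add shrinkage and deeper trees.

```python
def fit_stump(xs, targets):
    """Fit a one-split regression tree: one threshold, one mean per side."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((t - left_mean) ** 2 for t in left)
        sse += sum((t - right_mean) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def gradient_boost(xs, ys, n_stumps=3):
    """Train stumps sequentially, each one on the residuals left so far."""
    stumps, residuals = [], list(ys)
    for _ in range(n_stumps):
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        residuals = [r - stump(x) for x, r in zip(xs, residuals)]
    # The ensemble prediction is the sum of all stump predictions.
    return lambda x: sum(s(x) for s in stumps)

# Hypothetical one-dimensional training data.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 3.0, 3.1, 6.0, 6.2]
model = gradient_boost(xs, ys, n_stumps=3)
```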

33
Q

What is stacking?

A

Given multiple machine learning models that are skillful on a problem, but in different ways, how do you choose which model to use (trust)?

The approach is to train another machine learning model that learns when to use or trust each model in the ensemble. This final model is called a “blender” (or meta-learner).

34
Q

What is Bias (In variance and bias)?

A

Bias is the difference between the average prediction of our model and the correct value which we are trying to predict. A model with high bias pays very little attention to the training data and oversimplifies the problem. It leads to high error on both training and test data.

35
Q

What is variance?

A

Variance is the variability of the model’s prediction for a given data point, i.e. how much the prediction would spread if the model were trained on different training sets. A model with high variance pays a lot of attention to the training data and does not generalize to data it hasn’t seen before. As a result, such models perform very well on training data but have high error rates on test data.

36
Q

Why is there a bias-variance tradeoff?

A

If our model is too simple and has very few parameters, it may have high bias and low variance. On the other hand, if our model has a large number of parameters, it is likely to have high variance and low bias. So we need to find a good balance that neither overfits nor underfits the data.

This tradeoff in complexity is why there is a tradeoff between bias and variance. An algorithm can’t be more complex and less complex at the same time.

37
Q

What is Regularization?

A

Regularization is a technique that constrains (regularizes) or shrinks the coefficient estimates (often the weights) towards zero. In other words, this technique discourages learning a more complex or flexible model, so as to reduce the risk of overfitting by simplifying the model.

38
Q

What is a confusion matrix?

A

A matrix that shows the numbers of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN), which can be used to calculate precision, recall and the F1 score.

39
Q

What is precision a measure for, and how is it calculated?

A

It is the accuracy of the positive predictions, and is calculated: TP / (TP + FP)

40
Q

What is recall a measure for, and how is it calculated?

A

It is the ratio of actual positive instances that the model correctly detects (the true positive rate), and is calculated: TP / (TP + FN)

41
Q

What is the f1-score a measure for?

A

The F1 score combines recall and precision into a single measure by taking their harmonic mean, which gives more weight to low values (useful with skewed data). Therefore both recall and precision have to be high to get a good F1 score.
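
A short worked example of precision = TP / (TP + FP), recall = TP / (TP + FN) and their harmonic mean, using made-up confusion-matrix counts. It shows that the F1 score lands below the plain average when one of the two values is low.

```python
def precision(tp, fp):
    """Share of positive predictions that are correct."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Share of actual positives that were detected."""
    return tp / (tp + fn)

def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# Hypothetical counts taken from a confusion matrix.
tp, fp, fn = 8, 2, 8
p = precision(tp, fp)  # 0.8
r = recall(tp, fn)     # 0.5
f1 = f1_score(p, r)    # about 0.615, below the arithmetic mean of 0.65
```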

42
Q

How does a classifier determine whether a sample belongs to one class or the other?

A

It uses a decision function to compute a score for the sample, and compares that score with a given threshold to decide which class the sample belongs to. In scikit-learn, a random forest classifier has no decision_function(); it provides predict_proba() instead.

43
Q

What are the different strategies for training multiclass classifiers? (Hint: OvA and OvO)

A

One strategy (using digits as the example) is to train 10 binary classifiers, one per digit; the class whose classifier outputs the highest score is picked (one-versus-all, OvA).
One-versus-one (OvO) trains a binary classifier for every pair of classes (45 classifiers for 10 digits) and lets them duel. The class with the most “won” duels “wins”.
(OvA is often preferred)

44
Q

How is the gradient calculated when using gradient descent?

A

By taking the gradient (partial derivative) of the cost function with respect to all the parameters. This is done to see how much the cost function changes when adjusting the parameters just a little bit.
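
To illustrate "how much the cost function changes when adjusting the parameters just a little bit", the sketch below approximates each partial derivative with a central finite difference. This is only an illustration: in practice the gradient is usually derived analytically or by automatic differentiation. The cost function is a made-up quadratic.

```python
def numerical_gradient(cost, params, eps=1e-6):
    """Approximate each partial derivative by nudging one parameter at a time."""
    gradient = []
    for i in range(len(params)):
        up = params[:i] + [params[i] + eps] + params[i + 1:]
        down = params[:i] + [params[i] - eps] + params[i + 1:]
        gradient.append((cost(up) - cost(down)) / (2 * eps))
    return gradient

def example_cost(params):
    """Hypothetical cost: theta0^2 + 3 * theta1^2."""
    theta0, theta1 = params
    return theta0 ** 2 + 3 * theta1 ** 2

# The analytic gradient is [2 * theta0, 6 * theta1], i.e. [2, 12] at (1, 2).
grad = numerical_gradient(example_cost, [1.0, 2.0])
```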

45
Q

How can SGD hyperparameters be tweaked so that it to a higher degree gets close to the global minimum?

A

By gradually reducing the learning rate (using a learning schedule), so that the steps get smaller as it gets closer to the global minimum.
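
One simple way to do this is a learning schedule, i.e. a function that returns a smaller learning rate as the step count grows. The sketch below uses an inverse-time decay; the constants `t0` and `t1` are arbitrary illustrative values, not recommended settings.

```python
def learning_schedule(t, t0=5.0, t1=50.0):
    """Learning rate that shrinks as the step count t increases."""
    return t0 / (t + t1)

# The rate starts at 0.1 and has halved by step 50.
rates = [learning_schedule(t) for t in (0, 50, 450)]
```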

46
Q

What are learning curves used for?

A

To see whether the model is overfitting or underfitting the data. They plot the model’s performance on both the training set and the validation set as a function of the training set size (the model is trained several times on different-sized subsets of the training set).

47
Q

When does high bias occur?

A

When the model is based on wrong assumptions, e.g. assuming that the data is linear when it is in fact polynomial.
Low complexity = high bias

48
Q

When does high variance occur?

A

When the model is very sensitive to small variations in the data. This is often seen in models with many degrees of freedom, and in most cases it leads to overfitting.
High complexity = high variance

49
Q

What is irreducible error?

A

The part of the error that is due to the noise in the data itself. The only way to reduce it is to clean up the data set.

50
Q

What can softmax regression be used for?

A

To handle multiple classes directly in one classifier, without having to train and combine multiple binary classifiers. It outputs a probability for each class.
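
A minimal sketch of the softmax function itself, which turns one score per class into probabilities that sum to 1. The scores are made up; subtracting the maximum score is a common trick to avoid overflow and does not change the result.

```python
import math

def softmax(scores):
    """Convert per-class scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three classes; the first class scores highest.
probs = softmax([2.0, 1.0, 0.1])
predicted_class = probs.index(max(probs))
```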

51
Q

What are some of the negative sides about decision trees?

A

They are very prone to overfitting, and it is therefore important to regularize them (max_depth, max_leaf_nodes).

Furthermore, they are sensitive to the orientation of the data, as decision trees split perpendicular to an axis (horizontal or vertical boundaries). This can be handled by applying PCA, which often gives a better orientation of the data.

They are also very sensitive to small variations in the training data, as new data can easily change the way the model splits the features/thresholds.

52
Q

What are the benefits of scikit-learn’s random forest classifier?

A

It combines many decision trees (e.g. 500) and also uses bagging/pasting. Furthermore, to avoid overfitting, it only considers a random subset of the features when growing each tree, so the trees diversify as much as possible (higher bias, lower variance).

53
Q

What is early stopping?

A

It’s a way to avoid overfitting. It is done by stopping the training as soon as the validation error reaches its minimum.
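
A sketch of the stopping rule, applied to a list of validation errors recorded once per epoch. The `patience` parameter (how many epochs without improvement to tolerate) is an assumption added here, since in practice the error curve is noisy and the true minimum is only known in hindsight. The error values are invented.

```python
def early_stopping_epoch(val_errors, patience=2):
    """Return the epoch with the lowest validation error seen before stopping."""
    best_epoch, best_error = 0, float("inf")
    for epoch, error in enumerate(val_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch, best_error

# Hypothetical validation errors: they fall, then rise as overfitting sets in.
val_errors = [0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.55]
best_epoch, best_error = early_stopping_epoch(val_errors)  # epoch 3, error 0.40
```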

54
Q

What are Convolutional Neural Networks (CNNs)?

A

A convolution is the simple application of a filter to an input that results in an activation. Repeated application of the same filter across an input results in a map of activations called a feature map, indicating the locations and strength of a detected feature in an input, such as an image. Simple filters could detect horizontal edges, and others might detect circles.

The role of the ConvNet is to reduce the images into a form which is easier to process, without losing features which are critical for getting a good prediction.
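
A small sketch of sliding one filter over a tiny image to produce a feature map, in plain Python. It uses stride 1 and no padding, and (as in most deep learning libraries) does not flip the filter. The image and the hand-made filter, which responds to a dark-to-bright horizontal edge, are illustrative; in a real CNN the filter values are learned.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image and sum the element-wise products."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    return [
        [
            sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(k_rows)
                for dj in range(k_cols)
            )
            for j in range(out_cols)
        ]
        for i in range(out_rows)
    ]

# Hypothetical 4x4 image: dark (0) top half, bright (1) bottom half.
image = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
]
# Filter that fires where a dark row sits directly above a bright row.
horizontal_edge = [
    [-1, -1],
    [1, 1],
]
feature_map = convolve2d(image, horizontal_edge)
# feature_map == [[0, 0, 0], [2, 2, 2], [0, 0, 0]]: only the edge row lights up
```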

55
Q

What are Recurrent Neural Networks (RNNs)?

A

RNNs are good at handling sequential data and are therefore good at NLP. A normal feedforward neural network has an input layer, a hidden layer and an output layer. An RNN has a loop within the hidden layer, which allows information to flow from one step to the next; this carried-over information is called the “hidden state”. However, RNNs suffer from short-term memory. For longer sequences, LSTMs and GRUs can be used, which take longer-term memory into account.

56
Q

What is backpropagation on a feedforward neural network?

A

In simple terms, after each forward pass through a network, backpropagation performs a backward pass while adjusting the model’s parameters (weights and biases) to reduce the cost function.
The level of adjustment is determined by the gradients of the cost function with respect to those parameters.

57
Q

What is a Neural Network Activation Function?

A

Activation functions are mathematical equations that determine the output of a neural network. The function is attached to each neuron in the network, and determines whether it should be activated (“fired”) or not, based on whether each neuron’s input is relevant for the model’s prediction.

ReLU is the default activation function for most NN’s as it is easier to train and often achieves better performance.
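
For reference, a sketch of two common activation functions in plain Python: ReLU, and the logistic (sigmoid) function. The input values are arbitrary.

```python
import math

def relu(z):
    """ReLU: pass positive inputs through, output 0 for negative inputs."""
    return max(0.0, z)

def sigmoid(z):
    """Logistic function: squash any input into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

relu_outputs = [relu(z) for z in (-2.0, 0.0, 3.0)]  # [0.0, 0.0, 3.0]
sigmoid_at_zero = sigmoid(0.0)                      # 0.5
```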

58
Q

What is the goal of clustering?

A

To group similar instances together into clusters. It is great for segmentation, search engines, image segmentation, semi-supervised learning, etc.

59
Q

What is anomaly detection?

A

The objective is to learn what “Normal” data looks like, and then use that to detect abnormal instances - like defective items.

60
Q

What is density estimation?

A

The task of estimating the probability density function (PDF) of the random process that generated the dataset. It is commonly used for anomaly detection.

61
Q

What is K-means?

A

A fast way to cluster unlabeled data, so that each instance gets a cluster label.

K-Means is one of the most popular “clustering” algorithms. K-means stores $k$ centroids that it uses to define clusters. A point is considered to be in a particular cluster if it is closer to that cluster’s centroid than any other centroid.
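
A minimal sketch of the two alternating steps in plain Python: assign each point to its nearest centroid, then move each centroid to the mean of its points. The starting centroids are chosen by hand here (real implementations pick them more carefully), the iteration count is fixed, and the data is made up.

```python
def assign_clusters(points, centroids):
    """Label each point with the index of its closest centroid."""
    def squared_distance(p, c):
        return sum((a - b) ** 2 for a, b in zip(p, c))
    return [
        min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
        for p in points
    ]

def update_centroids(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    updated = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if members:
            updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
        else:
            updated.append(old)  # keep an empty cluster's centroid in place
    return updated

def k_means(points, centroids, n_iterations=10):
    """Alternate the assignment and update steps a fixed number of times."""
    for _ in range(n_iterations):
        labels = assign_clusters(points, centroids)
        centroids = update_centroids(points, labels, centroids)
    return assign_clusters(points, centroids), centroids

# Hypothetical 2D points forming two obvious groups, with k = 2.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```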

62
Q

What are the “root node”, “internal nodes” and “leaf nodes”?

A

The root node is the top node. Internal nodes have arrows pointing to them and away from them. Leaf nodes only have arrows pointing to them.

63
Q

What is a perceptron and a multi layer perceptron?

A

A perceptron is a simple type of neural network that multiplies its inputs by weights and passes the weighted sum through a step function.
A multi-layer perceptron (MLP) has multiple layers (including hidden layers). Instead of a step function, a logistic function is used (because training relies on gradient descent, and the step function is flat, so there is no gradient to follow). Backpropagation is then used to work out how each weight contributes to the error, and the weights are updated to reduce that error and thereby get a better output.

64
Q

What is CNN?

A

A type of NN that is very good for complex machine learning tasks involving thousands of features and millions of data samples.
It uses convolutional layers (each conv. layer applies different filters that convolve over the picture and detect patterns).
As the data goes through the conv. layers, small patterns are combined into bigger and bigger ones until the model can “recognize” the correct output.

It also uses pooling layers to reduce the size of the image (max pooling etc.)
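
A sketch of max pooling in plain Python, assuming non-overlapping 2x2 windows (window size equal to the stride). Each window is replaced by its largest value, which halves the width and height. The feature map values are invented.

```python
def max_pool2d(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size window."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(
                feature_map[i + di][j + dj]
                for di in range(size)
                for dj in range(size)
            )
            for j in range(0, cols - size + 1, size)
        ]
        for i in range(0, rows - size + 1, size)
    ]

# Hypothetical 4x4 feature map, reduced to 2x2.
activations = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
pooled = max_pool2d(activations)  # [[4, 2], [2, 8]]
```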

65
Q

How to measure impurity in decision trees?

A

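The most popular measure is Gini impurity. For a node with two classes it is calculated as: Gini = 1 - (the probability of “yes”)^2 - (the probability of “no”)^2.
The impurity of a split is then the weighted average of the Gini impurities of the resulting leaf nodes (weighted by the number of samples in each leaf).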
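
A short worked example of the Gini calculation, taking the class counts in each node as input. The counts are made up; each leaf is weighted by its share of the samples.

```python
def gini(counts):
    """Gini impurity of a node: 1 minus the sum of squared class shares."""
    total = sum(counts)
    return 1.0 - sum((count / total) ** 2 for count in counts)

def weighted_gini(leaves):
    """Average the leaf impurities, weighted by the samples in each leaf."""
    n_samples = sum(sum(leaf) for leaf in leaves)
    return sum(sum(leaf) / n_samples * gini(leaf) for leaf in leaves)

mixed = gini([5, 5])   # 0.5: a 50/50 node is as impure as two classes get
pure = gini([10, 0])   # 0.0: a node with a single class is pure

# Hypothetical split into two leaves with (yes, no) counts.
split_impurity = weighted_gini([[8, 2], [1, 9]])  # 0.5 * 0.32 + 0.5 * 0.18 = 0.25
```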

66
Q

What is CNN?

A

A type of NN that is very good for complex machine learning tasks involving thousands of features and millions of data samples.
It uses convolutional layers (which convolve over the input with receptive fields and a given stride) and applies several different filters in each layer to spot patterns in the pictures/data.
As the data goes through the conv. layers, small patterns are combined into bigger and bigger ones until the model can “recognize” the correct output.

67
Q

What is Tensorflow?

A

TensorFlow is a powerful open source software library for numerical computation, particularly well suited and fine-tuned for large-scale Machine Learning. Its basic principle is simple: you first define in Python a graph of computations to perform (for example, the one in Figure 9-1), and then TensorFlow takes that graph and runs it efficiently using optimized C++ code.