Machine Learning Flashcards
Consider the following table:
Salary | Years Experience | Age
30000  | 0                | 22
22000  | 5                | 28
45000  | 3                | 50
If salary is the output, what is the value of:
1. y^(2)
2. x_1^(2)
3. x^(1)
4. x_2^(3)
1. 22000
2. 5
3. (0, 22)
4. 50
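As a quick check of the notation, here is a minimal sketch using the table above; it assumes zero-based numpy indexing, so example i in the superscript becomes index i-1 in code:
```python
import numpy as np

# Rows are examples; columns are the features (Years Experience, Age).
X = np.array([[0, 22],
              [5, 28],
              [3, 50]])
y = np.array([30000, 22000, 45000])  # Salary is the output

print(y[1])     # y^(2)   -> 22000
print(X[1, 0])  # x_1^(2) -> 5
print(X[0])     # x^(1)   -> [ 0 22]
print(X[2, 1])  # x_2^(3) -> 50
```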
Why do we use linear models so often in machine learning?
- They are powerful
- They are simple, and hence:
  - Easy to interpret
  - Easy to implement
What is Regression?
What is OLS regression?
The process of estimating the relationship between input variables and an output variable.
Ordinary Least Squares. It fits a linear model by minimising the sum of squared residuals and, under the standard assumptions, provides the minimum-variance mean-unbiased estimate of the model parameters.
What is the loss function for OLS regression?
The squared error (averaging it over the dataset gives the mean squared error):
L(y, y^) = (y - y^)^2
What would be the constrained Empirical Risk Minimiser for linear regression?
The linear function which minimises 1/N sum_{i=1}^{N} (actual value - predicted value)^2, i.e. the mean squared error over the training set, with the search constrained to linear functions.
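A minimal sketch of this empirical risk for a linear model; the weight vector w and bias b below are arbitrary placeholder values, not fitted parameters:
```python
import numpy as np

def empirical_risk(w, b, X, y):
    """Mean squared error of the linear model y^ = Xw + b over the dataset."""
    y_hat = X @ w + b
    return np.mean((y - y_hat) ** 2)

# Example with arbitrary (not fitted) parameters:
X = np.array([[0, 22], [5, 28], [3, 50]], dtype=float)
y = np.array([30000, 22000, 45000], dtype=float)
print(empirical_risk(np.array([1000.0, 500.0]), 10000.0, X, y))
```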
What is gradient descent?
What is the function that it works on called?
What is required of the input function?
How does it work?
What is the equation for gradient descent?
A way to find the values which will minimise a function.
The objective function
Gradient descent converges on the global minimum if J is convex.
The way it works is to guess an answer and then incrementally move closer to the right one by moving in the direction of the negative gradient.
x ← x − α ∇J(x)
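A minimal gradient descent sketch for a one-dimensional convex function, assuming a hand-coded gradient; the step size alpha, tolerance and iteration cap are illustrative choices:
```python
import numpy as np

def gradient_descent(grad_J, x0, alpha=0.1, tol=1e-8, max_iters=10_000):
    """Repeatedly step in the direction of the negative gradient."""
    x = x0
    for _ in range(max_iters):
        step = alpha * grad_J(x)
        x = x - step
        if abs(step) < tol:      # stop once the steps become very small
            break
    return x

# J(x) = (x - 3)^2 is convex, so this converges to the global minimum at x = 3.
print(gradient_descent(lambda x: 2 * (x - 3), x0=0.0))
```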
What is the definition of a convex function?
A function which always lies below its chords (equivalently, above its tangents).
It can be thought of as bowl-shaped.
What is needed for gradient descent to work well?
When is gradient descent stopped?
• The step has to be the right size.
o Too big, and gradient descent will diverge, meaning it will never find the minimum
o Too small, and it will take too long
• The algorithm has to be stopped at some point
o Either because it gets close enough (the steps become very small)
o Or because you repeat the update a set number of times
What is logistic regression and how does it differ to linear regression?
Logistic regression uses a linear model to perform classification by passing the linear output through a sigmoid function; unlike linear regression, it predicts a probability of class membership rather than a continuous value.
What is a sigmoid function and what is the equation for it?
This function maps the real numbers to the interval (0, 1).
σ(z) = sigmoid(z) = 1 / (1+e^-z)
What is the log loss function?
Where does it come from?
This is another loss function, used in logistic regression, which takes the form:
L(y, y^) = -( y log(y^) + (1-y) log(1-y^) )
It comes from the likelihood function. The likelihood function measures how well a set of predicted probabilities explains the observed data; minimising the log loss is equivalent to maximising the likelihood.
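A sketch of the sigmoid and the log loss; the small epsilon clip is added here only to avoid log(0) in the illustration:
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, y_hat, eps=1e-12):
    y_hat = np.clip(y_hat, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y = np.array([1, 0, 1])
y_hat = sigmoid(np.array([2.0, -1.0, 0.5]))  # predicted probabilities
print(log_loss(y, y_hat))
```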
What is feature engineering?
How do you know what to modify?
This is the process of choosing and transforming the features you feed into a machine learning model in order to get the best predictions out of it.
• Use intuition
• Use domain knowledge - what are you looking at?
• Play with the data, can you get a linear looking function out of it?
How would you use feature engineering to get a linear function to approximate a non-linear function?
You could create new features that are functions of the data.
For example, you might start with your features being x_1 and x_2, but then add log(x_1) and x_2^2.
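A minimal sketch of adding those engineered features, assuming x_1 is strictly positive so that log(x_1) is defined:
```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [10.0, 5.0]])          # columns: x1, x2
x1, x2 = X[:, 0], X[:, 1]

# Stack the original features with log(x1) and x2^2 as extra columns.
X_engineered = np.column_stack([x1, x2, np.log(x1), x2 ** 2])
print(X_engineered)
```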
What is polynomial regression?
What is the main issue with it?
This is where feature engineering is used to allow linear regression to approximate polynomial functions.
You might start with your data being x, but end up with x^2, x^4, x^7.
The main issue is that you do not know how the function will behave outside of your dataset, so they frequently make odd predictions.
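A short illustration of that extrapolation problem, using numpy's polyfit as a stand-in for the feature-engineering approach described above; the degree and data are arbitrary choices:
```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(20)

coeffs = np.polyfit(x, y, deg=7)      # fit a degree-7 polynomial
print(np.polyval(coeffs, 0.5))        # inside the data range: reasonable
print(np.polyval(coeffs, 2.0))        # outside the data range: often wildly off
```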
Describe one-hot encoding. Why might you use it?
One-hot encoding changes a categorical variable to a set of binary datapoints.
For example, rather than a single categorical feature with values dog / cat / mouse,
you might have three binary features: IsDog, IsCat, IsMouse.
Why do this? Well then you can look for a linear function that has the one-hot features as inputs.
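A minimal one-hot encoding sketch written out by hand (a library such as pandas would also do this, but the manual version makes the idea explicit):
```python
import numpy as np

animals = ["dog", "cat", "mouse", "cat", "dog"]
categories = ["dog", "cat", "mouse"]             # IsDog, IsCat, IsMouse

one_hot = np.array([[1 if a == c else 0 for c in categories] for a in animals])
print(one_hot)
```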
How can linear regression approximate a piecewise linear function?
It's possible to look for two separate gradients, with one of the gradients applying only past a certain point.
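One common way to realise this (an assumption here, not spelled out on the card) is to add a hinge feature max(0, x − c), so the second gradient only switches on past the breakpoint c:
```python
import numpy as np

x = np.linspace(0, 10, 50)
y = np.where(x < 5, 2 * x, 10 + 5 * (x - 5))    # piecewise linear data, breakpoint at 5

c = 5.0
X = np.column_stack([np.ones_like(x), x, np.maximum(0, x - c)])  # intercept, x, hinge
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # roughly [0, 2, 3]: slope 2 below c, slope 2 + 3 = 5 above c
```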
Describe Stochastic/ Mini- Batch Gradient Descent
One issue with gradient descent is that computing the sum over all N datapoints in the dataset to find the gradient can take a very long time.
• The solution is to use a subset of size n of the dataset to approximate the gradient
When n = 1 you have stochastic gradient descent; when 1 < n < N you have mini-batch gradient descent.
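A sketch of mini-batch gradient descent for linear regression under the squared-error loss; the batch size n and learning rate are illustrative, and setting n = 1 gives the stochastic version:
```python
import numpy as np

def minibatch_gd(X, y, n=2, alpha=0.01, epochs=200, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        idx = rng.permutation(len(y))
        for start in range(0, len(y), n):
            batch = idx[start:start + n]
            Xb, yb = X[batch], y[batch]
            grad = 2 * Xb.T @ (Xb @ w - yb) / len(batch)  # gradient on the mini-batch
            w -= alpha * grad
    return w

X = np.column_stack([np.ones(100), np.linspace(0, 1, 100)])
y = 3 + 2 * X[:, 1]
print(minibatch_gd(X, y))   # approximately [3, 2]
```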
What is feature selection, and why do we care what it is?
What are the four main types of feature selection that we care about?
This is the process of selecting which features to use.
• We may have a large number of available features
• We want to reduce the amount of computing power we use
• We want to increase the predictive power of the model
• We want to be careful of including too many features and overfitting
Coefficient Comparison
Correlation Comparison
Best Subset Selection
Forward Subset Selection
Explain coefficient comparison and correlation comparison.
These are types of feature selection.
Coefficient comparison compares the magnitudes of the coefficients in a linear function. Only coefficients above a certain size are selected.
The data MUST be normalised first, so that coefficients are not penalised purely because of the scale of their feature (e.g. metres vs centimetres).
Correlation comparison is the same as coefficient comparison except that it is the correlations which are compared.
This only captures linear relationships, since correlation measures only the linear association between a feature and the output.
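A minimal correlation-comparison sketch on synthetic data, assuming an arbitrary absolute-correlation threshold of 0.5:
```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 4))                 # four candidate features
y = 3 * X[:, 0] - 3 * X[:, 2] + 0.1 * rng.standard_normal(200)

# Correlation of each feature with the output.
corrs = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
selected = np.where(np.abs(corrs) > 0.5)[0]
print(corrs, selected)                            # expect features 0 and 2 to be kept
```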
How does best subset selection work and what does it do?
What is its biggest downside?
How many combinations are there?
This is a type of feature selection.
The model is found for every single possible combination of the input features.
The model with the lowest risk is then selected.
The biggest downside is that this is very computationally intensive and takes an incredibly long time to complete.
For p features there are 2^p combinations.
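A best-subset-selection sketch that fits an OLS model for every non-empty subset; the OLS fit and the train/test split used to measure risk are illustrative assumptions:
```python
import numpy as np
from itertools import combinations

def fit_ols(X, y):
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def best_subset(X_train, y_train, X_test, y_test):
    p = X_train.shape[1]
    best = (np.inf, None)
    # Try every non-empty subset of the p features: 2^p - 1 candidates.
    for k in range(1, p + 1):
        for subset in combinations(range(p), k):
            w = fit_ols(X_train[:, subset], y_train)
            risk = np.mean((y_test - X_test[:, subset] @ w) ** 2)
            best = min(best, (risk, subset))
    return best

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 4))
y = 2 * X[:, 1] + 0.1 * rng.standard_normal(100)
print(best_subset(X[:70], y[:70], X[70:], y[70:]))   # likely selects feature 1
```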
How does forward subset selection work and what does it do?
What is its biggest downside?
This is a type of feature selection.
This is a greedy algorithm which is used to find the best combination of features which should be used.
- Start off with a constant (intercept-only) model.
- Consider every remaining predictor and, in turn, compare the result of adding each one to the model.
- Select the one which has the lowest loss.
- Repeat.
- Finally, we compare every version of the model with a test set and select the one which minimises the risk.
The downside is that it is only an approximation of the best model, since it does not actually consider all possible permutations of the features.
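A greedy forward-selection sketch along the same lines; for brevity it starts from an empty model rather than a constant, and the OLS fit and train/test split are again illustrative assumptions:
```python
import numpy as np

def forward_selection(X_train, y_train, X_test, y_test):
    p = X_train.shape[1]
    selected, remaining, history = [], list(range(p)), []
    while remaining:
        # Try adding each remaining feature in turn and keep the one with lowest loss.
        def train_loss(j):
            cols = selected + [j]
            w, *_ = np.linalg.lstsq(X_train[:, cols], y_train, rcond=None)
            return np.mean((y_train - X_train[:, cols] @ w) ** 2)
        best_j = min(remaining, key=train_loss)
        selected.append(best_j)
        remaining.remove(best_j)
        w, *_ = np.linalg.lstsq(X_train[:, selected], y_train, rcond=None)
        history.append((list(selected), np.mean((y_test - X_test[:, selected] @ w) ** 2)))
    # Finally, pick the version of the model with the lowest test risk.
    return min(history, key=lambda item: item[1])

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 5))
y = 2 * X[:, 0] - 3 * X[:, 3] + 0.1 * rng.standard_normal(100)
print(forward_selection(X[:70], y[:70], X[70:], y[70:]))
```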
What is a meta-algorithm?
An algorithm used to optimise or configure other machine learning algorithms.
What are the two main sampling methods which we have learnt about?
How do they work?
Why might you use one over the other?
What if there are multiple variables to sample over?
Random sampling and Stratified sampling.
Random sampling takes a random sample of the data, with the downside being that it may not create a representative dataset
Stratified sampling first splits the data into homogeneous subgroups, and then takes a sample from each of those. This creates a much more representative dataset.
If there are multiple variables to sample over, then another column can be created that represents a combination of the variables. This new column can then be stratified.
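A sketch contrasting random and stratified sampling; scikit-learn's train_test_split with its stratify argument is used here as one convenient option:
```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
labels = rng.choice(["A", "B", "C"], size=1000, p=[0.7, 0.2, 0.1])
X = rng.standard_normal((1000, 2))

# Random sampling: class proportions in the sample can drift from 70/20/10.
_, X_rand, _, y_rand = train_test_split(X, labels, test_size=0.1, random_state=0)

# Stratified sampling: proportions in the sample match the full dataset.
_, X_strat, _, y_strat = train_test_split(X, labels, test_size=0.1,
                                          random_state=0, stratify=labels)
print(np.unique(y_rand, return_counts=True))
print(np.unique(y_strat, return_counts=True))
```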
What were the three main methods of model evaluation that we were taught?
1 Finding the expected loss (the risk) on a test set
2 Tuning the hyperparameters with a validation set and then finding the expected loss on a test set
3 k-fold cross validation
What is a validation set used for?
What does the normal train/validate/test split look like?
It is used to tune the hyperparameters.
Anything from 60/20/20 to 98/1/1 (the latter is used only if the dataset is very big)
How does k-fold cross validation work?
What is the normal range of values for k?
What is leave-one-out k-fold cross validation?
- Divide the group of data into k subgroups (sometimes called folds).
- Train the model on all the data except for one subgroup.
- Evaluate the model on the one subgroup.
- Repeat for every subgroup
It is then possible to find the mean and standard deviation of the evaluation results across all the subgroups.
We can then repeat this whole process using a different set of hyperparameters.
Once the optimal hyperparameters are found, the model can be retrained on all of the training data.
The number k varies, but is usually between 3 and 10.
If k = N (where N is the number of datapoints that you have), then you have leave-one-out k-fold cross validation.
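A hand-rolled k-fold cross-validation sketch, using OLS as the model purely for illustration:
```python
import numpy as np

def k_fold_cv(X, y, k=5, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    risks = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        risks.append(np.mean((y[test] - X[test] @ w) ** 2))
    return np.mean(risks), np.std(risks)

X = np.column_stack([np.ones(100), np.linspace(0, 1, 100)])
y = 3 + 2 * X[:, 1] + 0.1 * np.random.default_rng(1).standard_normal(100)
print(k_fold_cv(X, y, k=5))
```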
The metrics which are used to evaluate the predictions differ between regression and classification.
What are all of the different metrics that we have learnt?
Can you explain them?
Regression:
Mean Absolute Error: This is simply the average of the distance of the prediction from the actual value.
Mean squared error: This is the average of the square of the difference between the predictions and the actual data.
R² value: This measures how much better the model is than simply predicting the mean for every prediction.
Classification:
Accuracy: The accuracy is the percentage of predictions that are correct.
Log Loss: The log loss is a loss function for predicted probabilities; minimising it is equivalent to maximising the likelihood function.
True Positive Rate (TPR) and True Negative Rate (TNR): The proportion of positives that were predicted correctly and the proportion of negatives that were predicted correctly.
ROC, ROCAUC, Brier Score, Calibration curves
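Minimal hand-written implementations of the regression metrics and of accuracy, for illustration:
```python
import numpy as np

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def r2(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)   # error of always predicting the mean
    return 1 - ss_res / ss_tot

def accuracy(y, y_pred):
    return np.mean(y == y_pred)

y = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.5, 5.5, 7.0])
print(mae(y, y_hat), mse(y, y_hat), r2(y, y_hat))
print(accuracy(np.array([1, 0, 1, 1]), np.array([1, 0, 0, 1])))
```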
Explain a confusion matrix and all of the possible outcomes.
Give examples of situations where you would want to optimise for a specific sector of the confusion matrix.
A confusion matrix shows predictions against the actual values. This shows you what type of errors are being made. E.g., a false positive is when the actual value is negative, but you predict positive.
Sometimes you really don’t want false positives.
E.g. You don’t want a spam filter to delete important emails.
Sometimes you really don’t want false negatives.
E.g. You don’t want a medical test to miss a cancer diagnosis.
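A small sketch that tallies the four cells of a binary confusion matrix directly; the labels here are made up for illustration:
```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))
tn = np.sum((y_true == 0) & (y_pred == 0))
fp = np.sum((y_true == 0) & (y_pred == 1))   # e.g. a spam filter deleting a real email
fn = np.sum((y_true == 1) & (y_pred == 0))   # e.g. a medical test missing a diagnosis
print(np.array([[tp, fn],
                [fp, tn]]))
```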
What is an ROC curve, and what does it show?
To convert a prediction to a classification you need to define a cut-off point. We normally use c=0.5, so that any prediction above 0.5 is classified as positive, and any below is classified as negative.
It is possible to vary the cut-off point to achieve a better prediction.
An ROC curve plots all of the different possible values of TPR against 1 - TNR (the false positive rate) as c varies.
Depending on which type of errors you care more about avoiding, you can vary the c value to achieve different results.
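A sketch of computing ROC points by sweeping the cut-off c over predicted probabilities; the scores and the grid of cut-offs are illustrative, and plotting is left out to keep it short:
```python
import numpy as np

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.6])  # predicted probabilities

for c in np.linspace(0, 1, 6):
    y_pred = (scores >= c).astype(int)
    tpr = np.sum((y_pred == 1) & (y_true == 1)) / np.sum(y_true == 1)
    fpr = np.sum((y_pred == 1) & (y_true == 0)) / np.sum(y_true == 0)   # 1 - TNR
    print(f"c={c:.1f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```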