Machine Learning Flashcards

1
Q

Consider the following table:

Salary   Years Experience   Age
30000    0                  22
22000    5                  28
45000    3                  50

If salary is the output, what is the value of:
1      y^(2)
2     x_1^(2)
3     x^(1)
4     x_2^(3)
A

1 22000
2 5
3 0, 22
4 50

2
Q

Why do we use linear models so often in machine learning?

A
  • They are powerful
  • They are simple, and hence
    • Easy to interpret
    • Easy to implement
3
Q

What is Regression?

What is OLS regression?

A

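The process of estimating the relationship between input variables and a continuous output.

Ordinary Least Squares. It fits a linear model by minimising the sum of squared residuals. Under the Gauss-Markov assumptions it gives the minimum-variance linear unbiased estimate of the coefficients.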

4
Q

What is the loss function for OLS regression?

A

The squared error for a single prediction:

L(y, y^) = (y - y^)^2

Averaged over the dataset, this gives the mean squared error.

5
Q

What would be the constrained Empirical Risk Minimiser for linear regression?

A

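The linear function f which minimises the mean squared error on the training data (the constraint is that only linear functions are considered):

argmin over linear f of 1/N sum_{i=1}^N (actual value_i - f(x_i))^2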

6
Q

What is gradient descent?

What is the function that it works on called?

What is required of the input function?

How does it work?

What is the equation for gradient descent?

A

A way to find the values which will minimise a function.

The objective function

J must be differentiable. Gradient descent converges on the global minimum if J is convex (and the step size is suitable).

The way it works is to guess an answer and then incrementally move closer to the right one by moving in the direction of the negative gradient.

x <- x - alpha * nabla J(x), where alpha is the step size
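
As a rough illustration of the update rule, here is a minimal sketch in Python. The objective J(x) = (x - 3)^2, the starting point, and the values of alpha, max_iters and tol are all made up for the example.

```python
def gradient_descent(grad, x0, alpha=0.1, max_iters=1000, tol=1e-8):
    """Minimise a 1-D function, given a function `grad` that returns its gradient."""
    x = x0
    for _ in range(max_iters):          # stop after a set number of iterations...
        step = alpha * grad(x)
        x = x - step                    # x <- x - alpha * grad J(x)
        if abs(step) < tol:             # ...or once the steps become tiny
            break
    return x

# Illustrative convex objective: J(x) = (x - 3)^2, so grad J(x) = 2 * (x - 3).
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(round(x_min, 4))
```

Because this J is convex, the loop ends up very close to the global minimum at x = 3.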

7
Q

What is the definition of a convex function?

A

A function which always lies on or below its chords, and on or above its tangents.

It can be thought of as bowl-shaped.

8
Q

What is needed for gradient descent to work well?

When is gradient descent stopped?

A

• The step has to be the right size.
o Too big, and the algorithm can diverge, meaning it will never find the minimum
o Too small and it will take too long
• The algorithm has to be stopped at some point
o Either because it gets close enough (the steps are small)
o Or you run it for a set number of iterations

9
Q

What is logistic regression and how does it differ from linear regression?

A

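Logistic regression is the process of using a linear model to perform classification by employing a sigmoid function. Linear regression outputs the value of the linear model directly as a continuous prediction, whereas logistic regression passes it through the sigmoid to get a probability between 0 and 1.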

10
Q

What is a sigmoid function and what is the equation for it?

A

This function maps the real numbers to the interval (0, 1).

σ(z) = sigmoid(z) = 1 / (1+e^-z)
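
A minimal sketch of the formula above, plus how it might sit on top of a linear model. The names predict_proba, weights and bias are hypothetical, chosen only for this example.

```python
import math

def sigmoid(z):
    """Map any real number z to the interval (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For very negative z, exp(-z) would overflow, so use an equivalent form.
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(weights, bias, x):
    """Logistic regression: a linear model passed through the sigmoid."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(z)

print(sigmoid(0))   # 0.5: z = 0 is the midpoint of the curve
```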

11
Q

What is the log loss function?

Where does it come from?

A

This is another loss function, used in logistic regression, which takes the form:
L(y, y^) = -( y log(y^) + (1-y) log(1-y^) )

It comes from the likelihood function. The likelihood measures how probable the observed data is under the predicted probabilities, so maximising it finds the probabilities that best explain a set of data. Minimising the log loss is equivalent to maximising the likelihood.
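
The link between the two can be checked numerically. The sketch below assumes labels of 0 or 1; the example probabilities and the eps clipping value are made up for illustration.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean log loss over a dataset of 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)   # clip so that log(0) never occurs
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def likelihood(y_true, y_prob):
    """Probability of seeing exactly these labels under the predicted probabilities."""
    prod = 1.0
    for y, p in zip(y_true, y_prob):
        prod *= p if y == 1 else (1 - p)
    return prod

y_true = [1, 0, 1]
y_prob = [0.9, 0.2, 0.6]
# The mean log loss is the negative log of the likelihood, divided by N.
print(log_loss(y_true, y_prob), -math.log(likelihood(y_true, y_prob)) / len(y_true))
```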

12
Q

What is feature engineering?

How do you know what to modify?

A

This is the process of choosing and transforming the features you feed into a machine learning model in order to get the best prediction out of it.
• Use intuition
• Use domain knowledge: what are you looking at?
• Play with the data: can you get a linear-looking function out of it?

13
Q

How would you use feature engineering to get a linear function to approximate a non-linear function?

A

You could create new features that are functions of the data.
For example, you might start with your features being x1 and x2, but then add log(x1) and x2**2 as extra features.
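
A small sketch of this idea using numpy. The target function below is invented for the example (and has no noise), so the recovered coefficients are only meant to show that a linear fit on engineered features can capture a non-linear relationship.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(1.0, 10.0, size=200)
x2 = rng.uniform(-3.0, 3.0, size=200)

# Hypothetical non-linear target, made up for illustration.
y = 2.0 * np.log(x1) + 0.5 * x2**2 + 1.0

# Design matrix: intercept, the original features, then the engineered ones.
X = np.column_stack([np.ones_like(x1), x1, x2, np.log(x1), x2**2])

# Ordinary least squares is still linear in the (new) features.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(coef, 3))   # approximately [1, 0, 0, 2, 0.5]
```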

14
Q

What is polynomial regression?

What is the main issue with it?

A

This is where feature engineering is used to allow linear regression to approximate polynomial functions.

You might start with your data being:
x

But end with:

x^2, x^4, x^7

The main issue is that polynomials can behave wildly outside the range of your dataset, so the model frequently makes odd predictions when it has to extrapolate.

15
Q

Describe one-hot encoding. Why might you use it?

A

One-hot encoding changes a categorical variable into a set of binary features.

For example, rather than one feature with the values dog, cat, mouse
you might have three features: IsDog, IsCat, IsMouse

Why do this? Well then you can look for a linear function that has the one-hot features as inputs.
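
A minimal sketch of the encoding in plain Python. The helper name one_hot is made up for this example; in practice a library routine would normally be used instead.

```python
def one_hot(values, categories):
    """Turn each categorical value into a row of 0/1 flags, one per category."""
    return [[1 if v == c else 0 for c in categories] for v in values]

animals = ["dog", "cat", "mouse", "cat"]
categories = ["dog", "cat", "mouse"]      # columns: IsDog, IsCat, IsMouse
encoded = one_hot(animals, categories)

for animal, row in zip(animals, encoded):
    print(animal, row)
```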

16
Q

How can linear regression approximate a piecewise linear function?

A

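You can add a feature that is zero up to a certain point c and grows linearly after it, e.g. max(0, x - c). The model then learns two gradients: one that applies everywhere, and an extra one that applies only past that point.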
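
One way to sketch this is with a "hinge" feature. The breakpoint (knot = 5) is assumed to be known here, and the data is made up and noise-free.

```python
import numpy as np

knot = 5.0                                  # assumed known breakpoint
x = np.linspace(0.0, 10.0, 101)

# Made-up piecewise linear data: gradient 1 before the knot, gradient 3 after it.
y = 2.0 + 1.0 * x + 2.0 * np.maximum(0.0, x - knot)

# Features: intercept, x, and a hinge that is zero until x passes the knot.
X = np.column_stack([np.ones_like(x), x, np.maximum(0.0, x - knot)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, gradient_before, extra_gradient_after = coef
print(np.round(coef, 3))   # approximately [2, 1, 2]
```

The gradient past the knot is the sum of the last two coefficients.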

17
Q

Describe Stochastic / Mini-Batch Gradient Descent

A

One issue with gradient descent is that summing over all N datapoints in the dataset to find the gradient can take a very long time.
• The solution is to use a random subset of the dataset (of size n, a mini-batch) to approximate the gradient at each step

When n = 1 you have stochastic gradient descent
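
A rough sketch of the idea for a one-parameter model y = w * x with a squared error loss. The data, alpha, the number of epochs and the seed are all made up for the example.

```python
import random

def minibatch_sgd(xs, ys, batch_size, alpha=0.1, epochs=200, seed=0):
    """Fit y = w * x by gradient descent on mini-batches of the data."""
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)                      # visit the data in a random order
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error over this batch only.
            grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= alpha * grad
    return w

xs = [0.5, 1.0, 1.5, 2.0]
ys = [3.0 * x for x in xs]                        # true w is 3

w_sgd = minibatch_sgd(xs, ys, batch_size=1)       # stochastic gradient descent
w_mini = minibatch_sgd(xs, ys, batch_size=2)      # mini-batch gradient descent
w_full = minibatch_sgd(xs, ys, batch_size=4)      # ordinary (full-batch) gradient descent
print(w_sgd, w_mini, w_full)
```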

18
Q

What is feature selection, and why do we care what it is?

What are the four main types of feature selection that we care about?

A

This is the process of selecting which features to use.
• We may have a large number of available features
• We want to reduce the amount of computing power we use
• We want to increase the predictive power of the model
• We want to be careful of including too many features and overfitting

Coefficient Comparison
Correlation Comparison
Best Subset Selection
Forward Subset Selection

19
Q

Explain coefficient comparison and correlation comparison.

A

These are types of feature selection.

Coefficient comparison compares the magnitudes of the coefficients in a linear function. Only features whose coefficients are above a certain size are selected.
The data MUST be normalised first, so that coefficients are not affected by the units of their data (e.g. m vs cm).

Correlation comparison is the same as coefficient comparison except that it is the correlations between each feature and the output which are compared.
This is only suitable for linear relationships, since correlation only measures the strength of a linear relationship.

20
Q

How does best subset selection work and what does it do?

What is its biggest downside?

How many combinations are there?

A

This is a type of feature selection.

The model is found for every single possible combination of the input features.
The model with the lowest risk is then selected.

The biggest downside is that this is very computationally intensive and takes an incredibly long time to complete.
For p features there are 2^p combinations.
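
The count can be checked by enumerating the subsets directly. The feature names below are hypothetical.

```python
from itertools import combinations

def all_subsets(features):
    """Every possible combination of the features, including the empty one."""
    subsets = []
    for k in range(len(features) + 1):
        subsets.extend(combinations(features, k))
    return subsets

features = ["years_experience", "age", "education"]   # hypothetical names
subsets = all_subsets(features)
print(len(subsets))   # 2**3 = 8, counting the empty (constant-only) model
```

With only 20 features this already means fitting over a million models, which is why the greedy alternatives exist.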

21
Q

How does forward subset selection work and what does it do?

What is its biggest downside?

A

This is a type of feature selection.

This is a greedy algorithm which is used to find a good combination of features to use.

  1. Start off with a model that just predicts some constant
  2. Consider every predictor not yet in your model and compare the results of adding each one to the model in turn.
  3. Select the one which gives the lowest loss.
  4. Repeat until all of the predictors have been added.
  5. Finally, we compare every version of the model on a test set and select the one which minimises the risk.

The downside is that it is only an approximation of the best model, since it does not actually consider all possible combinations of the features.

22
Q

What is a meta-algorithm?

A

An algorithm used to optimise the machine learning algorithms.

23
Q

What are the two main sampling methods which we have learnt about?
How do they work?
Why might you use one over the other?
What if there are multiple variables to sample over?

A

Random sampling and Stratified sampling.

Random sampling takes a random sample of the data, with the downside being that it may not create a representative dataset.

Stratified sampling first splits the data into homogeneous subgroups, and then takes a sample from each of those. This creates a much more representative dataset.

If there are multiple variables to sample over, then another column can be created that represents a combination of the variables. This new column can then be stratified.

24
Q

What were the three main methods of model evaluation that we were taught?

A

1 Finding the expected loss (the risk) on a test set
2 Tuning the hyperparameters with a validation set and then finding the expected loss on a test set
3 k-fold cross validation

25
Q

What is a validation set used for?

What does the normal train/validate/test split look like?

A

It is used to tune the hyperparameters.

Anything from 60/20/20 to 98/1/1 (the latter is used only if the dataset is very big)

26
Q

How does k-fold cross validation work?

What is the normal range of values for k?

What is leave-one-out k-fold cross validation?

A
  1. Divide the group of data into k subgroups (sometimes called folds).
  2. Train the model on all the data except for one subgroup.
  3. Evaluate the model on that one subgroup.
  4. Repeat for every subgroup
    It is then possible to find the mean and standard deviation of the scores across all the subgroups.
    We can then repeat this whole process using a different set of hyperparameters.

**Once the optimal hyperparameters are found, the model can be retrained on all of the training data.**

The number k varies, but is usually between 3 and 10.
If k = N (where N is the number of datapoints that you have), then you have leave-one-out cross validation.
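
The splitting step can be sketched as below. The helper name k_fold_indices is made up; it only produces the index splits, and in practice the data would usually be shuffled first.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n))
    # Spread any remainder over the first few folds so sizes differ by at most 1.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size

folds = list(k_fold_indices(n=10, k=5))
for train, val in folds:
    print("train:", train, "validate:", val)
```

Calling k_fold_indices(n, n) gives the leave-one-out case, where each fold holds a single datapoint.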

27
Q

The metrics which are used to evaluate the predictions differ in regression and classification.

What are all of the different metrics that we have learnt?
Can you explain them?

A

Regression:
Mean Absolute Error: This is simply the average absolute difference between the predictions and the actual values.
Mean Squared Error: This is the average of the square of the difference between the predictions and the actual values.
R2 Value: This measures how well the model compares to just predicting the mean for all predictions (1 is perfect, 0 is no better than the mean).

Classification:
Accuracy: The accuracy is the percentage of predictions that are correct.
Log Loss: The log loss is a loss function for predicted probabilities. Minimising it is equivalent to maximising the likelihood function.

True Positive Rate (TPR) and True Negative Rate (TNR): The proportion of positives that were predicted correctly and the proportion of negatives that were predicted correctly.

ROC, ROCAUC, Brier Score, Calibration curves
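
The three regression metrics can be written out by hand as a sketch (sklearn.metrics provides equivalents). The example values are made up.

```python
def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_score(y_true, y_pred):
    """1 - (model's squared error) / (squared error of always predicting the mean)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
print(mean_absolute_error(y_true, y_pred))   # 0.5
print(mean_squared_error(y_true, y_pred))    # 0.5
print(r2_score(y_true, y_pred))              # 0.9
```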

28
Q

Explain a confusion matrix and all of the possible outcomes.

Give examples of situations where you would want to optimise for a specific sector of the confusion matrix.

A

A confusion matrix shows predictions against the actual values. This shows you what type of errors are being made. E.g., a false positive is when the actual value is negative, but you predict positive.

Sometimes you really don’t want false positives.
E.g. You don’t want a spam filter to delete important emails.
Sometimes you really don’t want false negatives.
E.g. You don’t want a medical test to miss a cancer diagnosis.
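
A minimal sketch of counting the four outcomes for binary labels (1 = positive, 0 = negative). The labels and predictions are made up.

```python
def confusion_counts(y_true, y_pred):
    """Count each cell of the 2x2 confusion matrix."""
    pairs = list(zip(y_true, y_pred))
    return {
        "TP": sum(1 for t, p in pairs if t == 1 and p == 1),  # correctly predicted positive
        "TN": sum(1 for t, p in pairs if t == 0 and p == 0),  # correctly predicted negative
        "FP": sum(1 for t, p in pairs if t == 0 and p == 1),  # predicted positive, actually negative
        "FN": sum(1 for t, p in pairs if t == 1 and p == 0),  # predicted negative, actually positive
    }

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
counts = confusion_counts(y_true, y_pred)
tpr = counts["TP"] / (counts["TP"] + counts["FN"])   # true positive rate
tnr = counts["TN"] / (counts["TN"] + counts["FP"])   # true negative rate
print(counts, tpr, tnr)
```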

29
Q

What is an ROC curve, and what does it show?

A

To convert a prediction to a classification you need to define a cut-off point. We normally use c=0.5, so that any prediction above 0.5 is classified as positive, and any below is classified as negative.

It is possible to vary the cut-off point to trade off the two types of error.

An ROC curve plots the TPR against the false positive rate (1 - TNR) for every possible value of c.

Depending on which type of errors you care more about avoiding, you can vary the c value to achieve different results.
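
The points on the curve can be sketched by sweeping the cut-off. This sketch treats a prediction as positive when it is greater than or equal to c (an assumption about ties); the labels, probabilities and thresholds are made up.

```python
def roc_points(y_true, y_prob, thresholds):
    """Return one (false positive rate, true positive rate) point per cut-off c."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = []
    for c in thresholds:
        tp = sum(1 for t, p in zip(y_true, y_prob) if t == 1 and p >= c)
        fp = sum(1 for t, p in zip(y_true, y_prob) if t == 0 and p >= c)
        points.append((fp / negatives, tp / positives))   # (1 - TNR, TPR)
    return points

y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]
points = roc_points(y_true, y_prob, thresholds=[0.0, 0.3, 0.5, 1.0])
print(points)
```

Lowering c moves towards the top-right of the plot (everything predicted positive), and raising it moves towards the bottom-left.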

30
Q

What is ROC-AUC?

Give examples of what values correspond to what outcomes.

A

This is the area under the curve of a ROC graph. The area is a way to measure the predictive ability of the model.
• ROC AUC = 0.5 : Your model is no better than a random guess.
• ROC AUC = 1 : Your model is perfect.

32
Q

What is the Brier score and how is it calculated?

What is calibration, and what is a calibration curve? What is it used for?

A

The Brier score uses the exact same formula as the mean squared error, except that it is applied to the predicted probabilities in classification instead of to the predictions in regression.

This can be used as a way to gauge how well ‘calibrated’ our probabilities are.

Calibration is how well a predicted probability maps to the actual probability. If we took 10 predictions that are P=0.1 then we would expect, on average, only one of them to be positive.

A calibration curve is a plot of the calibration. The probabilities will need to be binned.

The ideal is for the probabilities to map perfectly, so that a straight line is achieved.
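
A sketch of both ideas in plain Python. The equal-width binning and the choice of 5 bins are assumptions made for the example (sklearn.calibration.calibration_curve does a similar job).

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

def calibration_bins(y_true, y_prob, n_bins=5):
    """Return (mean predicted probability, proportion of positives) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)   # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    curve = []
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            frac_pos = sum(y for _, y in members) / len(members)
            curve.append((mean_p, frac_pos))
    return curve

# Ten predictions of P = 0.1, of which exactly one is positive: well calibrated.
y_true = [0] * 9 + [1]
y_prob = [0.1] * 10
curve = calibration_bins(y_true, y_prob)
print(brier_score(y_true, y_prob), curve)
```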

33
Q

What is regularisation?
What are the two main types of regularisation?
What does it optimise for?
Why is it used?

A
  • Regularisation is trading off some approximation error for better estimation error.
  • This allows us to shrink the hypothesis space.
  • There are two main types of regularisation: Ivanov and Tikhonov

Ivanov: Here we decide that the complexity must be below some value (that we set, so it’s a hyperparameter).
Tikhonov: Here we add in a penalty for the complexity of a function, so a function can be more complex but the penalty will increase the objective that we minimise.

Regularisation balances the complexity of a function against how well it fits the training data.

34
Q
Explain Lasso and Ridge Regression.
How do they differ?
Explain the norms that are used.
Why are these types of regression used?
What MUST you do to your data in order to use Lasso or Ridge regression?
A

Ridge regression is linear regression that uses a type of Tikhonov regularisation where the penalty is the (squared) L2-norm of the coefficient vector.
The L2-norm is a measure of the size of a vector, and is found by summing the squares of the coefficients and taking the square root of the sum.

Lasso regression is linear regression that uses a type of Tikhonov regularisation where the penalty is the L1-norm of the coefficient vector.
The L1-norm is a measure of the size of a vector, and is found by summing the absolute values of the coefficients.

Lasso regression produces sparse solutions (ones where many betas go to exactly 0).

  • Lasso and Ridge Regression penalise functions with large betas, and lasso also tends to reduce the number of non-zero betas
  • This means we tend towards smaller coefficients
  • This makes the function simpler
  • But also less sensitive to changes in its inputs

We must standardise the features (mean = 0, s.d. = 1), so that the penalty does not depend on the scale of an input.
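
The difference in behaviour can be seen in the simplest possible case: one feature and no intercept, where both problems have a closed-form answer. The objectives written in the docstrings are one common parameterisation and are an assumption of this sketch; the data and penalty values are made up.

```python
def ridge_coefficient(x, y, lam):
    """Minimise sum((y - b*x)^2) + lam * b^2 for a single feature, no intercept."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)            # shrinks b towards 0, but never exactly to 0

def lasso_coefficient(x, y, lam):
    """Minimise sum((y - b*x)^2) + lam * |b| for a single feature, no intercept."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    # Soft-thresholding: a small enough correlation is set to exactly 0.
    shrunk = max(abs(sxy) - lam / 2.0, 0.0)
    sign = 1.0 if sxy >= 0 else -1.0
    return sign * shrunk / sxx

x = [-1.0, 0.0, 1.0]
y = [-0.2, 0.0, 0.2]                    # unpenalised least squares gives b = 0.2
print(ridge_coefficient(x, y, lam=2.0))   # about 0.1: smaller, but not zero
print(lasso_coefficient(x, y, lam=2.0))   # 0.0: the coefficient is removed entirely
```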

35
Q

What does KNN stand for?
What is k?
How does it work?
What are its disadvantages?

A

KNN is K-Nearest Neighbours.

k is a hyperparameter that sets the number of neighbours which are checked to determine a new datapoint’s value.

It works by assigning a value to a datapoint based on other data near to that point.

Mathematically, we define N_k(x), which is the set of the k nearest neighbours of a datapoint x.
The predicted value is then the average of the values in the nearest neighbour set (or the most common class, for classification).

  • Lower k can lead to more complex decision boundaries, but can lead to overfitting
  • All of the training data needs to be stored
  • It is computationally expensive to find the k nearest neighbours of a new datapoint
  • As the number of dimensions increases, the notion of ‘distance’ makes less and less sense
  • It is also very sensitive to the scale of the data
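
A minimal sketch of KNN regression, assuming Euclidean distance (math.dist needs Python 3.8 or later). The training points are made up.

```python
import math

def knn_predict(train_X, train_y, x, k):
    """Average the targets of the k training points closest to x."""
    # Brute force: compute the distance to every stored training point.
    distances = sorted((math.dist(xi, x), yi) for xi, yi in zip(train_X, train_y))
    neighbours = [yi for _, yi in distances[:k]]     # the values in N_k(x)
    return sum(neighbours) / k

train_X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
train_y = [1.0, 2.0, 3.0, 10.0]
print(knn_predict(train_X, train_y, (0.1, 0.1), k=1))   # 1.0
print(knn_predict(train_X, train_y, (0.1, 0.1), k=3))   # 2.0
```

Note how every prediction needs the whole training set, which is the storage and search cost mentioned above.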
36
Q

What does CART stand for?
How do they work?
What are the names of the nodes?
What are its pros and cons?

A

CART stands for Classification and Regression Trees.

Trees work by splitting up the feature space into regions.
If a datapoint falls in a region, then its prediction is set to that region’s value.

The starting point is the root node, the end points are leaf nodes, and all other points are branch nodes.

  • Trees can easily approximate non-linear functions
  • The predictive performance is not sensitive to the scale of the data
  • But outside the range of the training data, you just predict a constant
37
Q

Recursive binary splitting is one of the methods used to optimise regression trees. How does it work?
How is the objective function minimised, and how many splits will need to be checked?

How is the recursion stopped?

A

We could split up a feature space fairly easily with intuition, but it is more difficult to achieve with machine learning.

Recursive Binary Splitting is a greedy algorithm to find the splitting points for a binary tree. It finds:
• The best feature to split, j.
• The best place to split that feature, s.

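  1. Define two regions, a left region and a right region:
     R_left(j, s) = {x : x_j < s},  R_right(j, s) = {x : x_j >= s}
  2. The objective function is the sum of the MSE in each region, where each region predicts the mean of its datapoints:
     J(j, s) = MSE(R_left(j, s)) + MSE(R_right(j, s))
  3. Minimise the objective function with respect to j and s.
    The function is recursive, because once it defines a region, it repeats the process for the new region it created.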

The objective function can’t be differentiated
• So we try splits between datapoints
• For p dimensions we will need to check up to p(N-1) splits, where N is the number of datapoints

• If the recursion is never stopped, then each point will end up as a leaf node
• So we stop the recursion in a number of ways:
o Set a minimum for the number of datapoints in each leaf node
o Limit the depth of the tree
o Stop splitting when the improvement from splitting gets too small
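
A single step of the search (finding the best j and s for one region) might look like the sketch below. It uses the sum of squared errors in each region as the objective, which is a common choice and an assumption here; dividing each term by the region size would give the MSE version. The data is made up.

```python
def sse(values):
    """Sum of squared errors around the region's mean (0 for an empty region)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(X, y):
    """Try every feature j and every split point s; return the best (j, s, objective)."""
    best = None
    for j in range(len(X[0])):
        column = sorted(set(row[j] for row in X))
        # Candidate split points lie midway between consecutive observed values.
        for lo, hi in zip(column, column[1:]):
            s = (lo + hi) / 2
            left = [yi for row, yi in zip(X, y) if row[j] < s]
            right = [yi for row, yi in zip(X, y) if row[j] >= s]
            objective = sse(left) + sse(right)
            if best is None or objective < best[2]:
                best = (j, s, objective)
    return best

X = [(1, 10), (2, 20), (3, 10), (4, 20)]
y = [1.0, 1.0, 5.0, 5.0]
print(best_split(X, y))   # (0, 2.5, 0.0): split feature 0 between 2 and 3
```

The full algorithm would call best_split again on the left and right regions until one of the stopping rules above applies.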

38
Q

What is Multiclass Classification?

How does it work?

A
  • Normally we have a binary classification, either in class 1 or class 2
  • It is possible to have multiclass classification, where there would be any number of classes

The solution is to convert the multiclass classification to a set of binary classifiers, and the class we predict that it is in is just the class that is most likely.

If our classes were cat - dog – mouse, then we could have:
z_cat = probability that it is a cat / probability that it is not a cat
z_dog = probability that it is a dog / probability that it is not a dog
z_mouse = probability that it is a mouse / probability that it is not a mouse

39
Q

Explain Model Averaging.

A

This is the process of fitting many different models on many different datasets.

There are B datasets and therefore B models.

The predictions of these models are then averaged at each point.

The point of this is to reduce variance (overfitting) without increasing bias (underfitting).

If the B models are independent, the variance of the averaged prediction is 1/B of the variance of a single model (theoretically).

40
Q

What is bootstrapping, and why is it used?

A

Bootstrapping is a way to generate a load of extra datasets from one dataset.

All you do is take random samples with replacement from your dataset to create new datasets.

This is commonly used in conjunction with model averaging to reduce variance
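
Sampling with replacement is short enough to sketch directly. The function name and the seed are made up for the example.

```python
import random

def bootstrap_samples(data, n_datasets, seed=0):
    """Create n_datasets new datasets, each drawn with replacement from data."""
    rng = random.Random(seed)
    n = len(data)
    return [[data[rng.randrange(n)] for _ in range(n)] for _ in range(n_datasets)]

data = [10, 20, 30, 40, 50]
samples = bootstrap_samples(data, n_datasets=3)
for sample in samples:
    print(sample)   # same size as the original; some points repeat, some are missing
```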

41
Q

Explain Bagging.

A

Bagging (bootstrap aggregating) is just a combination of model averaging and bootstrapping. This is a way to easily reduce variance, essentially for free.

  1. We can create a load of datasets from our one initial dataset
  2. We can create a prediction function for each dataset
  3. We can then find the average prediction at a given point
  4. This averaged prediction will have less variance than the prediction of a single model
  • However, the theoretical improvement of 1/B doesn’t actually apply because the bootstrapped datasets are not really independent
  • But there is still a load of improvement for free!
42
Q

Why is bagging commonly used to create bagged trees?

A

Tree models usually suffer from overfitting and high variance, so their output is sensitive to small changes in the training data.

A bagged tree is when bagging is applied to trees to reduce the variance.

The only downside is that it makes the output harder to interpret than that of a single tree.

43
Q

Explain the Random Forest model and how it works.

A

A random forest is a bagged tree (a tree that performs bagging to reduce variance) with one difference: we only split by a random subset of predictors at each step of the recursive binary splitting algorithm.

We might have:

x1, x2, x3, x4, x5, x6, x7, x8, x9, x10

But at one iteration we might only look for splits on:

x1,x2,x3

Why would you do this? It makes the trees less correlated with one another, which reduces the variance of the averaged prediction even further.
(The number of predictors that are randomly chosen at each iteration is a hyperparameter)

44
Q

Describe the concept of feature importance.

A

We can define how important each feature in a tree is. This is the amount of predictive improvement that we see by splitting on that feature.

So when you split a feature, how much does the mean squared error decrease? Sum all of those decreases over all the splits of that feature.

You can find the importance of a feature over many trees, and find a more accurate value.

45
Q

What are basis functions?

A

A basis function is just some function of your input features.

For example, one of the basis functions we have used is in the gravity dataset. We defined a function of the input features
G = m1*m2 / r^2
We then added this basis function as a separate feature and used linear regression to fit it.

Basis functions are useful in that they can be applied to gradient boosting to improve a machine learning algorithm.

46
Q

How do adaptive basis functions work?

How are they found?

A

A basis function is a function of a feature or a set of features which is used as a new feature to a machine learning algorithm.

Adaptive basis functions are basis functions which are learned from the data.

These are usually denoted h_m.

We first need to define the function space from which we will choose our basis functions.

We then need to find the basis functions which minimise the risk.

It is possible to find these functions using gradient descent in some situations, e.g. neural networks.

The basis functions can then be added to the prediction function to improve its predictive capacity.

47
Q

Very simply, what is gradient boosting and what is it used for?

A
  • Gradient boosting is a method used to fit adaptive basis function models (as in, to find the h_m).
  • The point of gradient boosting is to improve a model by repeatedly adding small improvements to it.
48
Q

Explain how Forward Stagewise Additive Modelling (FSAM) works.

A

This is a greedy algorithm that is used to fit adaptive basis function models. Here’s how it works:

  1. Start off by predicting a constant everywhere, f_0.
  2. Find the adaptive basis function h_m and coefficient b_m which, when added to the current model, minimise the loss.

This can be done with functional gradient descent.
For ERM that is:
(b_m, h_m) = argmin SUM_i L(y_i, f_(m-1)(x_i) + b_m h_m(x_i))
and then f_m = f_(m-1) + b_m h_m

  3. Repeat this M times (M is a hyperparameter)
  4. Return f_M
49
Q

What is L2 boosting?

A

This is Forward Stagewise Additive Modelling where the loss function that we use is the mean squared error.
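
With a squared error loss, the negative gradient at each point is just the residual, so each round fits a new basis function to the residuals. The sketch below uses one-split regression stumps as the basis functions and a fixed shrinkage of 0.5 in place of a fitted coefficient; both are assumptions made for this example, as is the data.

```python
def fit_stump(xs, targets):
    """Best single-split regression stump on 1-D inputs (squared error)."""
    best = None
    points = sorted(set(xs))
    for lo, hi in zip(points, points[1:]):
        s = (lo + hi) / 2
        left = [t for x, t in zip(xs, targets) if x < s]
        right = [t for x, t in zip(xs, targets) if x >= s]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        err = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or err < best[0]:
            best = (err, s, left_mean, right_mean)
    _, s, left_mean, right_mean = best
    return lambda x: left_mean if x < s else right_mean

def l2_boost(xs, ys, n_rounds=50, shrinkage=0.5):
    f0 = sum(ys) / len(ys)                     # start by predicting a constant
    stumps = []
    preds = [f0] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]   # what is still unexplained
        h = fit_stump(xs, residuals)                     # new basis function h_m
        stumps.append(h)
        preds = [p + shrinkage * h(x) for p, x in zip(preds, xs)]
    return lambda x: f0 + shrinkage * sum(h(x) for h in stumps)

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.0, 4.0, 4.0, 9.0, 9.0]
model = l2_boost(xs, ys)
mse_constant = mse(ys, [sum(ys) / len(ys)] * len(ys))
mse_boosted = mse(ys, [model(x) for x in xs])
print(mse_constant, mse_boosted)
```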

50
Q

Explain functional gradient descent.

A

This is what we use to find the best h_m to add in gradient boosting.

This is like gradient descent, but instead of finding the best parameters for a specific function, you are finding the best function.

So we find the gradient of the loss with respect to every prediction that our current function makes.

These gradients tell you which direction to nudge your predictions at each point.

What we do is find an h_m that moves in the direction of that negative gradient.

We can then add that to our previous model.

The coefficient beta sets how far we nudge our function in that direction. Smaller is usually better.

51
Q

Why is the empirical risk minimiser not good enough to guarantee finding the best function?

A

Because it will not necessarily generalise well, i.e. it could overfit.

The space of functions considered may also be too large, i.e. it needs to be constrained.

52
Q

Why do we evaluate a model on a test set?

A

Because it helps to show how well the model will generalise.

It helps you to detect overfitting to your training data.

53
Q

Why is there usually a trade-off between bias and variance?

Can you give an example of a function with high bias and one with high variance?

A

Bias can be reduced by expanding the set of considered functions, but in doing so, it becomes harder to estimate the right function within that (now larger) function space.

If you fit non-linear data with a linear function, you’ll get high bias.

But if you fit a simple dataset with an extremely deep tree, you will get high variance.

54
Q

What can you do to your features to make gradient descent faster?
How does this work?

A

You can re-scale your features.

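This works because when the features are on a similar scale, the objective function changes at a similar rate in every direction, so one step size suits all of the coefficients.
Whereas features that have massively different scales mean that the step size has to be very small so as not to overshoot along the most sensitive direction, which makes the descent slow.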

55
Q

Does feature engineering affect your hypothesis space?

A

Yes!

Different features mean that the set of prediction functions you can choose from changes as well.

56
Q

How does one-hot-encoding differ from multiclass classification?

A

One-hot encoding converts a categorical feature into a set of binary features.

For example, if you had mouse-cat-dog, you can’t feed the string ‘mouse’ into an ML algorithm, so that’s why we instead make the following binary features:

IsCat, IsDog, IsMouse

Multiclass classification is predicting which of several classes a datapoint belongs to. One-hot encoding changes how a feature is represented, whereas multiclass classification is a type of prediction problem.

57
Q

How do you perform stratified sampling on a continuous variable?

A

You have to ‘bucket’ the variable so that it becomes a categorical variable. That’s the only way.

58
Q

What is a precision recall curve?

A

Precision-Recall is a useful measure of success of prediction when the classes are very imbalanced.

Precision is the number of true positives divided by the sum of the true positives and false positives. It describes how good a model is at predicting the positive class.

Recall is calculated as the number of true positives divided by the sum of the true positives and the false negatives. Recall is the same as sensitivity.

A precision-recall curve is a plot of the precision (y-axis) and the recall (x-axis) for different thresholds, much like the ROC curve.

PR Curves are only concerned with the correct prediction of the minority class, class 1, and hence are useful in cases where there is an imbalance in the observations between the two classes. Specifically, there are many examples of no event (class 0) and only a few examples of an event (class 1).
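
The two definitions can be sketched for a single threshold as below. The imbalanced example data is made up, and returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
def precision_recall(y_true, y_pred):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Imbalanced data: eight examples of class 0 and two of class 1.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
precision, recall = precision_recall(y_true, y_pred)
accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
print(precision, recall, accuracy)   # 0.5 0.5 0.8
```

The accuracy looks healthy because of all the easy negatives, while precision and recall show that only half of the positive class is being handled correctly.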

59
Q

What are the axes on the calibration curve?

A

Proportion of Positive Observations (y-axis)

vs

Predicted Probability, binned (x-axis)

60
Q

Name some variables that can be used to measure the complexity of a function.

A
In Regression:
- The number of features
- The sizes of the coefficient vectors
     - L1 Norm
     - L2 Norm 
In Trees:
- The number of observations in a leaf node
- The depth of the tree
In KNN:
- The value of k
61
Q

If I limit the depth of a tree, what kind of regularisation is that? Why?

A

Ivanov, because you are saying the complexity must be below some value.

62
Q

We know what the L1 and L2 norms are, what is the Lq norm?

A

This is:

(|beta1|^q + |beta2|^q + …)^(1/q)
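
A short sketch of the formula, which reduces to the L1 and L2 norms for q = 1 and q = 2. The example vector is made up.

```python
def lq_norm(betas, q):
    """(|beta_1|^q + |beta_2|^q + ...)^(1/q)"""
    return sum(abs(b) ** q for b in betas) ** (1.0 / q)

betas = [3.0, -4.0]
print(lq_norm(betas, 1))   # 7.0: the L1 norm (sum of absolute values)
print(lq_norm(betas, 2))   # 5.0: the L2 norm (Euclidean length)
```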

63
Q

Explain Excess Risk Decomposition.

A

The excess risk is the difference in risk between your function and the Bayes function.
Excess risk decomposition splits this excess risk into estimation error and approximation error.

64
Q

What is a p-value?

A

A p-value is the probability of seeing a difference at least as large as the one observed just by random chance, assuming that there is no real effect (i.e. that the null hypothesis is true).

The lower the p-value, the greater the statistical significance of the observed difference.

P-values are calculated based on the assumed or known probability distribution of the specific statistic being tested.

P-values are calculated from the deviation between the observed value and a chosen reference value, given the probability distribution of the statistic, with a greater difference between the two values corresponding to a lower p-value.

65
Q

In machine learning models, what are parameters, and how do they differ from hyperparameters?

A

A parameter is a variable which is learnt by a machine learning model from the data and which determines the predictions.

A hyperparameter is a variable which is set before the model runs and which determines the way in which the predictions are found.

66
Q

What do you do if two of your variables have a significant collinearity? Why?
Can you give an example of two features which may have a collinearity?

A

You may either use feature selection to choose only one of them to include.

Or you may use feature engineering to combine the two features into one.

You do this because collinearity makes the coefficients unstable and hard to interpret, which obscures the predictive capacity that each feature provides.

An example might be salary and disposable income.

67
Q

Describe Excess Risk Decomposition.

A

Excess risk is the risk of your model in comparison with the risk that the Bayes model would have. This, in combination with constraining the function space, brings about the concepts of estimation error and approximation error.

68
Q

What is CI/ CD?

A

Continuous integration and either continuous delivery or continuous deployment.

Code changes are automatically built, tested and delivered (or deployed) shortly after they are committed.

The aim is to increase early defect discovery, increase productivity, and provide faster release cycles.

69
Q

What would learning curves look like for underfitting and overfitting?

A

As the training set size increases:

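Underfitting: The training error rises quickly and then levels off at a high value, and the test error ends up close to it, so both errors are high with only a small gap between them.
Overfitting: The training error stays low and increases slightly, while the test error stays much higher and reduces slightly, leaving a large gap between them.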

70
Q

Name all of the imports we use in SKLEARN.

A

Common possible imports:

from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score, mean_squared_error, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.calibration import calibration_curve
from sklearn import tree

#Plotting
import matplotlib.pyplot as plt
#Data Processing
import pandas as pd
import numpy as np
71
Q

What code would be used to read from a csv or excel file to create a DataFrame?
What about splitting one dataset into a test & train split?

A

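import pandas as pd
from sklearn.model_selection import train_test_split

EXCEL:
df_train = pd.read_excel(path, sheet_name=Excel_tab_name)
df_test = pd.read_excel(different_path, sheet_name=Excel_tab_name)

CSV:
df_train = pd.read_csv(path)
df_test = pd.read_csv(different_path)

TEST/TRAIN SPLIT:
# Hold out 20% of the rows for testing; random_state makes the split repeatable
df_train, df_test = train_test_split(df, test_size=0.2, random_state=123)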

72
Q

What code is used to convert Dataframes to inputs and then run an ML model?

A
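X_train = df_train[features_list]       # input features
y_train = df_train[target_name]         # target column
model   = ModelType()                   # e.g. LinearRegression()
model.fit(X_train, y_train)             # learn the parameters from the training data
X_test, y_test = df_test[features_list], df_test[target_name]
metric(y_test, model.predict(X_test))   # e.g. mean_squared_error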
73
Q

What code is used to test a model?

A
X_test = df_test[features_list]
y_test = df_test[target_name]
df_test['NewDataPred'] = model.predict(X_test)   # store predictions next to the actual values

metric(y_test, df_test['NewDataPred'])           # e.g. mean_squared_error or accuracy_score

74
Q

What can you do to reduce overfitting?

A

Contract your hypothesis space.

  • Add in regularisation
  • Feature selection
  • Bagging
  • Reduce the number of features
75
Q

What can you do to reduce underfitting?

A

Expand your hypothesis space.

  • Relax regularisation
  • More features (feature engineering)