Machine Learning Flashcards
Consider the following table:
Salary | Years Experience | Age
30000  | 0                | 22
22000  | 5                | 28
45000  | 3                | 50
If salary is the output, what is the value of:
1. y^(2)
2. x_1^(2)
3. x^(1)
4. x_2^(3)
1. 22000
2. 5
3. (0, 22)
4. 50
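As a quick check of the notation, here is a minimal sketch using the table above; it assumes zero-based numpy indexing, so example i in the superscript becomes index i-1 in code:
```python
import numpy as np

# Rows are examples; columns are the features (Years Experience, Age).
X = np.array([[0, 22],
              [5, 28],
              [3, 50]])
y = np.array([30000, 22000, 45000])  # Salary is the output

print(y[1])     # y^(2)   -> 22000
print(X[1, 0])  # x_1^(2) -> 5
print(X[0])     # x^(1)   -> [ 0 22]
print(X[2, 1])  # x_2^(3) -> 50
```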
Why do we use linear models so often in machine learning?
- They are powerful
- They are simple, and hence:
  - Easy to interpret
  - Easy to implement
What is Regression?
What is OLS regression?
The process of estimating the relationship between input variables and an output variable.
Ordinary Least Squares. It fits a linear model by minimising the sum of squared residuals and, under the standard assumptions, provides the minimum-variance mean-unbiased estimate of the model parameters.
What is the loss function for OLS regression?
The squared error (averaging it over the dataset gives the mean squared error):
L(y, y^) = (y - y^)^2
What would be the constrained Empirical Risk Minimiser for linear regression?
The linear function which minimises 1/N sum_{i=1}^{N} (actual value - predicted value)^2, i.e. the mean squared error over the training set, with the search constrained to linear functions.
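A minimal sketch of this empirical risk for a linear model; the weight vector w and bias b below are arbitrary placeholder values, not fitted parameters:
```python
import numpy as np

def empirical_risk(w, b, X, y):
    """Mean squared error of the linear model y^ = Xw + b over the dataset."""
    y_hat = X @ w + b
    return np.mean((y - y_hat) ** 2)

# Example with arbitrary (not fitted) parameters:
X = np.array([[0, 22], [5, 28], [3, 50]], dtype=float)
y = np.array([30000, 22000, 45000], dtype=float)
print(empirical_risk(np.array([1000.0, 500.0]), 10000.0, X, y))
```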
What is gradient descent?
What is the function that it works on called?
What is required of the input function?
How does it work?
What is the equation for gradient descent?
A way to find the values which will minimise a function.
The objective function
Gradient descent converges on the global minimum if J is convex.
The way it works is to guess an answer and then incrementally move closer to the right one by moving in the direction of the negative gradient.
x ← x − α ∇J(x)
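A minimal gradient descent sketch for a one-dimensional convex function, assuming a hand-coded gradient; the step size alpha, tolerance and iteration cap are illustrative choices:
```python
import numpy as np

def gradient_descent(grad_J, x0, alpha=0.1, tol=1e-8, max_iters=10_000):
    """Repeatedly step in the direction of the negative gradient."""
    x = x0
    for _ in range(max_iters):
        step = alpha * grad_J(x)
        x = x - step
        if abs(step) < tol:      # stop once the steps become very small
            break
    return x

# J(x) = (x - 3)^2 is convex, so this converges to the global minimum at x = 3.
print(gradient_descent(lambda x: 2 * (x - 3), x0=0.0))
```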
What is the definition of a convex function?
A function which always lies below its chords (equivalently, above its tangents).
It can be thought of as bowl-shaped.
What is needed for gradient descent to work well?
When is gradient descent stopped?
• The step has to be the right size.
o Too big, and gradient descent will diverge, meaning it will never find the minimum
o Too small, and it will take too long
• The algorithm has to be stopped at some point
o Either because it gets close enough (the steps become very small)
o Or because you repeat the update a set number of times
What is logistic regression and how does it differ to linear regression?
Logistic regression uses a linear model to perform classification by passing the linear output through a sigmoid function; unlike linear regression, it predicts a probability of class membership rather than a continuous value.
What is a sigmoid function and what is the equation for it?
This function maps the real numbers to the interval (0, 1).
σ(z) = sigmoid(z) = 1 / (1+e^-z)
What is the log loss function?
Where does it come from?
This is another loss function, used in logistic regression, which takes the form:
L(y, y^) = -( y log(y^) + (1-y) log(1-y^) )
It comes from the likelihood function. The likelihood function measures how well a set of predicted probabilities explains the observed data; minimising the log loss is equivalent to maximising the likelihood.
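A sketch of the sigmoid and the log loss; the small epsilon clip is added here only to avoid log(0) in the illustration:
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, y_hat, eps=1e-12):
    y_hat = np.clip(y_hat, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y = np.array([1, 0, 1])
y_hat = sigmoid(np.array([2.0, -1.0, 0.5]))  # predicted probabilities
print(log_loss(y, y_hat))
```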
What is feature engineering?
How do you know what to modify?
This is the process of choosing and transforming the features you feed into a machine learning model in order to get the best predictions out of it.
• Use intuition
• Use domain knowledge - what are you looking at?
• Play with the data, can you get a linear looking function out of it?
How would you use feature engineering to get a linear function to approximate a non-linear function?
You could create new features that are functions of the data.
For example, you might start with your features being x_1 and x_2, but then add log(x_1) and x_2^2.
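A minimal sketch of adding those engineered features, assuming x_1 is strictly positive so that log(x_1) is defined:
```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [10.0, 5.0]])          # columns: x1, x2
x1, x2 = X[:, 0], X[:, 1]

# Stack the original features with log(x1) and x2^2 as extra columns.
X_engineered = np.column_stack([x1, x2, np.log(x1), x2 ** 2])
print(X_engineered)
```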
What is polynomial regression?
What is the main issue with it?
This is where feature engineering is used to allow linear regression to approximate polynomial functions.
You might start with your data being x, but end up with x^2, x^4, x^7.
The main issue is that you do not know how the function will behave outside of your dataset, so they frequently make odd predictions.
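A short illustration of that extrapolation problem, using numpy's polyfit as a stand-in for the feature-engineering approach described above; the degree and data are arbitrary choices:
```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(20)

coeffs = np.polyfit(x, y, deg=7)      # fit a degree-7 polynomial
print(np.polyval(coeffs, 0.5))        # inside the data range: reasonable
print(np.polyval(coeffs, 2.0))        # outside the data range: often wildly off
```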
Describe one-hot encoding. Why might you use it?
One-hot encoding changes a categorical variable to a set of binary datapoints.
For example, rather than a single categorical feature with values dog / cat / mouse,
you might have three binary features: IsDog, IsCat, IsMouse.
Why do this? Well then you can look for a linear function that has the one-hot features as inputs.
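A minimal one-hot encoding sketch written out by hand (a library such as pandas would also do this, but the manual version makes the idea explicit):
```python
import numpy as np

animals = ["dog", "cat", "mouse", "cat", "dog"]
categories = ["dog", "cat", "mouse"]             # IsDog, IsCat, IsMouse

one_hot = np.array([[1 if a == c else 0 for c in categories] for a in animals])
print(one_hot)
```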
How can linear regression approximate a piecewise linear function?
It's possible to look for two separate gradients, with one of the gradients applying only past a certain point.
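One common way to realise this (an assumption here, not spelled out on the card) is to add a hinge feature max(0, x − c), so the second gradient only switches on past the breakpoint c:
```python
import numpy as np

x = np.linspace(0, 10, 50)
y = np.where(x < 5, 2 * x, 10 + 5 * (x - 5))    # piecewise linear data, breakpoint at 5

c = 5.0
X = np.column_stack([np.ones_like(x), x, np.maximum(0, x - c)])  # intercept, x, hinge
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # roughly [0, 2, 3]: slope 2 below c, slope 2 + 3 = 5 above c
```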
Describe Stochastic/ Mini- Batch Gradient Descent
One issue with gradient descent is that computing the sum over all N datapoints in the dataset to find the gradient can take a very long time.
• The solution is to use a subset of size n of the dataset to approximate the gradient
When n = 1 you have stochastic gradient descent; when 1 < n < N you have mini-batch gradient descent.
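A sketch of mini-batch gradient descent for linear regression under the squared-error loss; the batch size n and learning rate are illustrative, and setting n = 1 gives the stochastic version:
```python
import numpy as np

def minibatch_gd(X, y, n=2, alpha=0.01, epochs=200, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        idx = rng.permutation(len(y))
        for start in range(0, len(y), n):
            batch = idx[start:start + n]
            Xb, yb = X[batch], y[batch]
            grad = 2 * Xb.T @ (Xb @ w - yb) / len(batch)  # gradient on the mini-batch
            w -= alpha * grad
    return w

X = np.column_stack([np.ones(100), np.linspace(0, 1, 100)])
y = 3 + 2 * X[:, 1]
print(minibatch_gd(X, y))   # approximately [3, 2]
```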
What is feature selection, and why do we care what it is?
What are the four main types of feature selection that we care about?
This is the process of selecting which features to use.
• We may have a large number of available features
• We want to reduce the amount of computing power we use
• We want to increase the predictive power of the model
• We want to be careful of including too many features and overfitting
Coefficient Comparison
Correlation Comparison
Best Subset Selection
Forward Subset Selection
Explain coefficient comparison and correlation comparison.
These are types of feature selection.
Coefficient comparison compares the magnitudes of the coefficients in a linear function. Only coefficients above a certain size are selected.
The data MUST be normalised first, so that coefficients are not penalised purely because of the scale of their feature (e.g. metres vs centimetres).
Correlation comparison is the same as coefficient comparison except that it is the correlations which are compared.
This only captures linear relationships, since correlation measures only the linear association between a feature and the output.
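A minimal correlation-comparison sketch on synthetic data, assuming an arbitrary absolute-correlation threshold of 0.5:
```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 4))                 # four candidate features
y = 3 * X[:, 0] - 3 * X[:, 2] + 0.1 * rng.standard_normal(200)

# Correlation of each feature with the output.
corrs = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
selected = np.where(np.abs(corrs) > 0.5)[0]
print(corrs, selected)                            # expect features 0 and 2 to be kept
```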
How does best subset selection work and what does it do?
What is its biggest downside?
How many combinations are there?
This is a type of feature selection.
The model is found for every single possible combination of the input features.
The model with the lowest risk is then selected.
The biggest downside is that this is very computationally intensive and takes an incredibly long time to complete.
For p features there are 2^p combinations.
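A best-subset-selection sketch that fits an OLS model for every non-empty subset; the OLS fit and the train/test split used to measure risk are illustrative assumptions:
```python
import numpy as np
from itertools import combinations

def fit_ols(X, y):
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def best_subset(X_train, y_train, X_test, y_test):
    p = X_train.shape[1]
    best = (np.inf, None)
    # Try every non-empty subset of the p features: 2^p - 1 candidates.
    for k in range(1, p + 1):
        for subset in combinations(range(p), k):
            w = fit_ols(X_train[:, subset], y_train)
            risk = np.mean((y_test - X_test[:, subset] @ w) ** 2)
            best = min(best, (risk, subset))
    return best

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 4))
y = 2 * X[:, 1] + 0.1 * rng.standard_normal(100)
print(best_subset(X[:70], y[:70], X[70:], y[70:]))   # likely selects feature 1
```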
How does forward subset selection work and what does it do?
What is its biggest downside?
This is a type of feature selection.
This is a greedy algorithm which is used to find the best combination of features which should be used.
- Start off with a constant (intercept-only) model.
- Consider every remaining predictor and, in turn, compare the result of adding each one to the model.
- Select the one which has the lowest loss.
- Repeat.
- Finally, we compare every version of the model with a test set and select the one which minimises the risk.
The downside is that it is only an approximation of the best model, since it does not actually consider all possible permutations of the features.
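A greedy forward-selection sketch along the same lines; for brevity it starts from an empty model rather than a constant, and the OLS fit and train/test split are again illustrative assumptions:
```python
import numpy as np

def forward_selection(X_train, y_train, X_test, y_test):
    p = X_train.shape[1]
    selected, remaining, history = [], list(range(p)), []
    while remaining:
        # Try adding each remaining feature in turn and keep the one with lowest loss.
        def train_loss(j):
            cols = selected + [j]
            w, *_ = np.linalg.lstsq(X_train[:, cols], y_train, rcond=None)
            return np.mean((y_train - X_train[:, cols] @ w) ** 2)
        best_j = min(remaining, key=train_loss)
        selected.append(best_j)
        remaining.remove(best_j)
        w, *_ = np.linalg.lstsq(X_train[:, selected], y_train, rcond=None)
        history.append((list(selected), np.mean((y_test - X_test[:, selected] @ w) ** 2)))
    # Finally, pick the version of the model with the lowest test risk.
    return min(history, key=lambda item: item[1])

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 5))
y = 2 * X[:, 0] - 3 * X[:, 3] + 0.1 * rng.standard_normal(100)
print(forward_selection(X[:70], y[:70], X[70:], y[70:]))
```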
What is a meta-algorithm?
An algorithm used to optimise or configure other machine learning algorithms.
What are the two main sampling methods which we have learnt about?
How do they work?
Why might you use one over the other?
What if there are multiple variables to sample over?
Random sampling and Stratified sampling.
Random sampling takes a random sample of the data, with the downside being that it may not create a representative dataset
Stratified sampling first splits the data into homogeneous subgroups, and then takes a sample from each of those. This creates a much more representative dataset.
If there are multiple variables to sample over, then another column can be created that represents a combination of the variables. This new column can then be stratified.
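A sketch contrasting random and stratified sampling; scikit-learn's train_test_split with its stratify argument is used here as one convenient option:
```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
labels = rng.choice(["A", "B", "C"], size=1000, p=[0.7, 0.2, 0.1])
X = rng.standard_normal((1000, 2))

# Random sampling: class proportions in the sample can drift from 70/20/10.
_, X_rand, _, y_rand = train_test_split(X, labels, test_size=0.1, random_state=0)

# Stratified sampling: proportions in the sample match the full dataset.
_, X_strat, _, y_strat = train_test_split(X, labels, test_size=0.1,
                                          random_state=0, stratify=labels)
print(np.unique(y_rand, return_counts=True))
print(np.unique(y_strat, return_counts=True))
```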
What were the three main methods of model evaluation that we were taught?
1 Finding the expected loss (the risk) on a test set
2 Tuning the hyperparameters with a validation set and then finding the expected loss on a test set
3 k-fold cross validation
What is a validation set used for?
What does the normal train/validate/test split look like?
It is used to tune the hyperparameters.
Anything from 60/20/20 to 98/1/1 (the latter is used only if the dataset is very big)
How does k-fold cross validation work?
What is the normal range of values for k?
What is leave-one-out k-fold cross validation?
- Divide the group of data into k subgroups (sometimes called folds).
- Train the model on all the data except for one subgroup.
- Evaluate the model on the one subgroup.
- Repeat for every subgroup
It is then possible to find the mean and standard deviation of the evaluation results across all the subgroups.
We can then repeat this whole process using a different set of hyperparameters.
Once the optimal hyperparameters are found, the model can be retrained on all of the training data.
The number k varies, but is usually between 3 and 10.
If k = N (where N is the number of datapoints that you have), then you have leave-one-out k-fold cross validation.
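A hand-rolled k-fold cross-validation sketch, using OLS as the model purely for illustration:
```python
import numpy as np

def k_fold_cv(X, y, k=5, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    risks = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        risks.append(np.mean((y[test] - X[test] @ w) ** 2))
    return np.mean(risks), np.std(risks)

X = np.column_stack([np.ones(100), np.linspace(0, 1, 100)])
y = 3 + 2 * X[:, 1] + 0.1 * np.random.default_rng(1).standard_normal(100)
print(k_fold_cv(X, y, k=5))
```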
The metrics which are used to evaluate the predictions differ between regression and classification.
What are all of the different metrics that we have learnt?
Can you explain them?
Regression:
Mean Absolute Error: This is simply the average of the distance of the prediction from the actual value.
Mean squared error: This is the average of the square of the difference between the predictions and the actual data.
R² value: This measures how much better the model is than simply predicting the mean for every prediction.
Classification:
Accuracy: The accuracy is the percentage of predictions that are correct.
Log Loss: The log loss is a loss function for predicted probabilities; minimising it is equivalent to maximising the likelihood function.
True Positive Rate (TPR) and True Negative Rate (TNR): The proportion of positives that were predicted correctly and the proportion of negatives that were predicted correctly.
ROC, ROCAUC, Brier Score, Calibration curves
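Minimal hand-written implementations of the regression metrics and of accuracy, for illustration:
```python
import numpy as np

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def r2(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)   # error of always predicting the mean
    return 1 - ss_res / ss_tot

def accuracy(y, y_pred):
    return np.mean(y == y_pred)

y = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.5, 5.5, 7.0])
print(mae(y, y_hat), mse(y, y_hat), r2(y, y_hat))
print(accuracy(np.array([1, 0, 1, 1]), np.array([1, 0, 0, 1])))
```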
Explain a confusion matrix and all of the possible outcomes.
Give examples of situations where you would want to optimise for a specific sector of the confusion matrix.
A confusion matrix shows predictions against the actual values. This shows you what type of errors are being made. E.g., a false positive is when the actual value is negative, but you predict positive.
Sometimes you really don’t want false positives.
E.g. You don’t want a spam filter to delete important emails.
Sometimes you really don’t want false negatives.
E.g. You don’t want a medical test to miss a cancer diagnosis.
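A small sketch that tallies the four cells of a binary confusion matrix directly; the labels here are made up for illustration:
```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))
tn = np.sum((y_true == 0) & (y_pred == 0))
fp = np.sum((y_true == 0) & (y_pred == 1))   # e.g. a spam filter deleting a real email
fn = np.sum((y_true == 1) & (y_pred == 0))   # e.g. a medical test missing a diagnosis
print(np.array([[tp, fn],
                [fp, tn]]))
```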
What is an ROC curve, and what does it show?
To convert a prediction to a classification you need to define a cut-off point. We normally use c=0.5, so that any prediction above 0.5 is classified as positive, and any below is classified as negative.
It is possible to vary the cut-off point to achieve a better prediction.
An ROC curve plots all of the different possible values of TPR against 1 - TNR (the false positive rate) as c varies.
Depending on which type of errors you care more about avoiding, you can vary the c value to achieve different results.
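A sketch of computing ROC points by sweeping the cut-off c over predicted probabilities; the scores and the grid of cut-offs are illustrative, and plotting is left out to keep it short:
```python
import numpy as np

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.6])  # predicted probabilities

for c in np.linspace(0, 1, 6):
    y_pred = (scores >= c).astype(int)
    tpr = np.sum((y_pred == 1) & (y_true == 1)) / np.sum(y_true == 1)
    fpr = np.sum((y_pred == 1) & (y_true == 0)) / np.sum(y_true == 0)   # 1 - TNR
    print(f"c={c:.1f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```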