Machine Learning Methods Flashcards

1
Q

As you learnt in Section 1, machine learning is a subset of AI whereby a model is given large amounts of data so it can learn to spot patterns and categorise future, unseen data. The flowchart below shows the four types of machine learning. Note that semi-supervised learning (which combines supervised and unsupervised learning techniques) and reinforcement learning will be covered in advanced modules.

A
2
Q

In this section, you’ll learn about machine learning in more detail by focusing on three key areas:

1 Supervised and unsupervised learning

2 Transfer learning and active learning

3 Bias and variance

A
3
Q

Supervised and unsupervised learning are two types of machine learning whereby algorithms are trained to uncover patterns and relationships within data. In this lesson, you will learn about these two types of machine learning as well as gain an understanding of parametric and non-parametric models (which can be used in both supervised and unsupervised learning).

A
Supervised learning

Supervised learning is learning from labelled data to predict outcomes for new data.

An example of this would be giving an AI model chest radiographs labelled as either showing abnormalities or not. The model would learn from this labelled data and would then try to predict abnormalities in new, previously unseen radiographs.

Supervised learning is typically used for classification and regression tasks. Classification is where data is categorised into predefined classes or categories based on characteristics. The chest radiographs example above is an example of a classification task. Regression is where the model makes predictions on continuous numerical values. An example of this in a healthcare setting could be a model that takes patient data (such as their age, weight, and habits), and predicts their blood glucose levels to help create a more personalised patient plan without the need for more invasive methods.
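As a minimal illustration of these two task types (a sketch only, assuming scikit-learn and NumPy are available; the data and feature meanings are hypothetical stand-ins rather than the course's examples):

```python
# Sketch of supervised learning: one classifier, one regressor, both trained on labelled data.
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Classification: predict an "abnormal" (1) vs "normal" (0) label from two features.
X_cls = rng.normal(size=(200, 2))
y_cls = (X_cls[:, 0] + X_cls[:, 1] > 0).astype(int)        # toy labelling rule
X_tr, X_te, y_tr, y_te = train_test_split(X_cls, y_cls, random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)
print("classification accuracy:", clf.score(X_te, y_te))

# Regression: predict a continuous value (e.g. blood glucose) from two patient features.
X_reg = rng.normal(size=(200, 2))
y_reg = 5.0 + 0.8 * X_reg[:, 0] - 0.3 * X_reg[:, 1] + rng.normal(scale=0.1, size=200)
reg = LinearRegression().fit(X_reg, y_reg)
print("regression R^2:", reg.score(X_reg, y_reg))
```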

The downside of supervised learning is that patterns can be missed because of an abundance of data, limitations in the model’s performance, and/or bias in human thinking processes. An abundance of data arises when large numbers of features (the machine learning term for variables) are used. This can be a problem for supervised learning because the model may fail to find the patterns amongst the ‘noise’. Noisy data contain a large amount of meaningless additional information (technically, they have a low signal-to-noise ratio). This can be addressed in several ways, such as identifying and excluding uninformative features or increasing the number of observations in the dataset.

The model may also be incapable of dealing with complex data, which can lead to mistakes. Furthermore, if any of the data fed into the model are incorrect, this will be reflected in the model’s output.

4
Q

A note on K-Nearest Neighbours (KNN)

The K-Nearest Neighbours (KNN) algorithm is another example of supervised learning and can be used for both classification and regression tasks. The algorithm is given labelled data: a category or class label for classification tasks and a numerical value for regression tasks.

The ‘K’ parameter is defined before applying the algorithm. It represents how many of the nearest points will be considered in the algorithm. For example, if you wanted to consider the five closest points, then K would equal five.

The KNN algorithm is then able to calculate the distance between any new point and existing points in the dataset. The ‘nearest’ points are those with the smallest distance to the new point. The algorithm would therefore be able to work out the closest, in this case, five points.

By working out the closest five points, the algorithm would then be able to apply a category or numerical value to the new data point.

Put simply, in the example shown below, we have arbitrarily selected five as the number of closest labelled points used to identify the new unlabelled point. Three of group A are amongst the five closest points to the new unlabelled point, and two of group B are amongst the five closest points to the new unlabelled point. Therefore, voting from the five means the new unlabelled point belongs to group A.
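As a rough sketch of this distance-and-voting idea (NumPy assumed; the points and groups below are made up for illustration):

```python
# KNN classification by majority vote among the K nearest labelled points.
import numpy as np

def knn_predict(X_train, y_train, new_point, k=5):
    """Classify new_point by majority vote among its k nearest labelled neighbours."""
    distances = np.linalg.norm(X_train - new_point, axis=1)   # Euclidean distance to every point
    nearest = np.argsort(distances)[:k]                       # indices of the k closest points
    votes = y_train[nearest]
    labels, counts = np.unique(votes, return_counts=True)
    return labels[np.argmax(counts)]                          # the most common label wins

# Toy data: group "A" clustered near the origin, group "B" near (5, 5).
X = np.array([[1, 1], [2, 1], [1, 2], [0, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y = np.array(["A", "A", "A", "A", "B", "B", "B"])
print(knn_predict(X, y, np.array([3.0, 3.0])))  # -> "A" (three of the five nearest points are group A)
```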


A

Pros and cons of using the KNN algorithm

No specific value of K is guaranteed to make the model perfect so it’s important to consider the pros and cons of this method.


Pros
Simple to implement.
Robust to noisy training data.
Effective for large datasets.

Cons
Finding the best K can be time-consuming.
Computational cost is high as it requires finding the distance between all data points.
Increased risk of the algorithm capturing noise and irrelevant information from the training data when applied to large, high-dimensional feature spaces such as radiomics data.

5
Q
Unsupervised learning

Unsupervised learning is learning from unlabelled data to identify patterns and relationships.

This branch of machine learning can be used by clinicians and researchers to find hidden patterns, structures and/or relationships within complex medical datasets. This can lead to improved diagnosis, treatment, and comprehension of diseases and conditions.

In medicine, unsupervised learning is mostly used for clustering and dimensionality reduction tasks. Learn more about both below.

A
6
Q

Clustering technique: Instead of using labelled data for the chest radiograph example, we could use the clustering technique. This would involve the unsupervised learning model grouping similar chest radiographs together based on their patterns or features, without any labelling of the data. The model would learn to spot patterns in the chest radiographs and would be able to group the images based on those features. This could help to identify types or stages of tumours, thus helping radiologists and oncologists with diagnosis and treatment.
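A minimal clustering sketch, assuming scikit-learn is available; the feature matrix here is random and merely stands in for features extracted from radiographs:

```python
# K-means clustering: groups emerge purely from similarity, no labels supplied.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 16))          # 300 hypothetical images, 16 features each

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])              # cluster assignment for the first 10 images
```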

A

Dimensionality reduction: Dimensionality reduction is all about the model reducing complexity whilst holding onto the most pertinent information in a dataset. An example of this might be a model that reduces the complexity of a dataset of MRI brain images. Instead of having to interpret complex brain images with a large amount of voxel-based data, a clinician would instead be able to interpret less complex images with fewer dimensions and only the most important features shown, thus making the clinician’s job easier.
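A minimal dimensionality-reduction sketch using PCA (scikit-learn assumed; the voxel-style data here are synthetic):

```python
# PCA: compress many correlated features into a few components that retain most variance.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5000))             # 100 hypothetical scans, 5,000 voxel features each

pca = PCA(n_components=10).fit(X)
X_reduced = pca.transform(X)                 # each scan is now summarised by 10 components
print(X_reduced.shape)                       # (100, 10)
print(pca.explained_variance_ratio_.sum())   # fraction of the variance retained
```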

7
Q
Parametric and non-parametric models

Parametric models make specific assumptions about underlying data distributions whereas non-parametric models make minimal assumptions about data and allow for more flexibility.

Understanding the difference between parametric and non-parametric data is fundamental in guiding the selection of machine learning algorithms. It also influences the preprocessing steps, model validation strategies, and interpretation of results.

To enable computers to perform human tasks, we need to train them. This is accomplished by building models (computer programs that recognise patterns in the training data), which can then be used to make predictions in other data. Models can be either parametric or non-parametric.

A
8
Q

Parametric models have fixed formulas where we need to find certain numerical values.

An example is linear regression, which is a statistical method used to model the relationship between one or more independent variables (predictors) and a dependent variable (outcome) by fitting a straight line to the observed data.

With linear regression, we assume a simple equation: y = ax + b, where the predictive parameters (the values to be found) are a and b, which are estimated from training samples: x (the input data, or features) and y (the target variable).
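As a small illustration of how few moving parts such a parametric model has (NumPy assumed; the data are synthetic, generated with a = 2 and b = 1 so the fit can be checked):

```python
# Fitting the parametric model y = ax + b by least squares: the whole model is two numbers.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)                      # input feature
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=100)   # target with noise

a, b = np.polyfit(x, y, deg=1)                        # least-squares estimates of a and b
print(f"a ~ {a:.2f}, b ~ {b:.2f}")
```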

This model makes a “linear assumption”: it assumes the relationship between the input data and the target variable takes a fixed, straight-line form. If, for example, we wanted to use age to predict the likelihood of head injury, the model would assume the change in risk is equal for every one-year increase in age, despite the fact that the true underlying risk follows a curve, with greater risk at the extremes of age.

Parametric models are easy to understand and interpret. They are also fast to train and do not require much training data.

However, they are limited by their assumed mathematical form, which does not fit all scenarios; this can lead to a poor ‘model fit’ and weak predictive performance, meaning parametric models may not work well for some problems.

Parametric methods assume the data follow a normal distribution, or another mathematically tractable frequency distribution closely related to it.

A

Non-parametric models are those which do not make such assumptions about mathematical form. They are much more flexible as a result, which means they can adapt to various datasets more easily than parametric models.

They can often result in better predictive performance but can require more data and a longer time to train. This type of model is also at a higher risk of becoming too complex, which means it may not perform well on new, unseen datasets.

However, this improved performance is not universal; non-parametric models often perform worse on data that are normally distributed. This shows why understanding the data you are using and its patterns is crucial in designing the right approach to developing models.

An example of a non-parametric model is the K-Nearest Neighbours (KNN) algorithm.

9
Q

Transfer learning is taking a model built for a specific task and adapting it for another.

Transfer learning models use knowledge learned from previous tasks performed on similar data. Instead of training a model from scratch, transfer learning methods mean that the model repurposes this pre-existing knowledge to tackle the new task.

This technique is helpful when access to data is limited, because the model can reach good performance more quickly and with less data. It can focus on learning task-specific features rather than starting from scratch and learning general features first.

This approach is particularly useful in medical imaging research where training datasets can often be limited in size.

An example of this would be to train an algorithm on a vast collection of general photographs and images before doing further training, or fine-tuning, on clinical image data. The first step helps the model to learn about shapes, lines and edges before it is trained on how these patterns relate to the task of interpreting medical images.
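A minimal fine-tuning sketch, assuming PyTorch and torchvision are available (the exact weights argument varies between torchvision versions, and the two-class clinical task is hypothetical):

```python
# Transfer learning: reuse a network pre-trained on general photographs, retrain only the final layer.
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights="DEFAULT")   # backbone pre-trained on ImageNet photographs

# Freeze the pre-trained layers so the general shape/line/edge features are kept as-is.
for param in model.parameters():
    param.requires_grad = False

# Replace the final layer with a new one for the clinical task (e.g. abnormal vs normal),
# then fine-tune this layer (and optionally deeper layers later) on the medical images.
model.fc = nn.Linear(model.fc.in_features, 2)
```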

A

Active learning is building supervised learning models whilst minimising annotation effort.

The technique involves the model selecting the most informative data points, which are then labelled (usually by a human). Essentially, instead of passively taking in labelled data, the model queries the data it judges to be most informative. This means that the amount of labelled data required for training is much lower than in a standard supervised paradigm.

Active learning is most helpful when labelling data is expensive or time-consuming as the model can focus on the most relevant data points for learning.

However, it is important to note that active learning runs the risk of the model becoming overwhelmed by uninformative labels, because not all labelled data are guaranteed to contain useful information for model development. If the model’s selection criteria for the ‘most informative’ data points are not effective, or if the human labelling these data points introduces errors or bias, the model’s output will be negatively affected. These are important points to consider when dealing with active learning models.
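A minimal uncertainty-sampling sketch of this query loop (scikit-learn assumed; the data are synthetic and the ‘oracle’ simply looks labels up rather than asking a human annotator):

```python
# Active learning by uncertainty sampling: label only the points the model is least sure about.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y_true = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)    # hidden "ground truth" labels

# Seed set: a handful of labelled examples from each class.
pos = np.where(y_true == 1)[0][:5]
neg = np.where(y_true == 0)[0][:5]
labelled = list(np.concatenate([pos, neg]))
unlabelled = [i for i in range(500) if i not in labelled]

for _ in range(20):                                    # 20 querying rounds
    model = LogisticRegression().fit(X[labelled], y_true[labelled])
    probs = model.predict_proba(X[unlabelled])[:, 1]
    most_uncertain = unlabelled[int(np.argmin(np.abs(probs - 0.5)))]  # closest to 50/50
    labelled.append(most_uncertain)                    # "ask the human" for this label only
    unlabelled.remove(most_uncertain)

print("labelled examples used:", len(labelled))
print("accuracy on the remaining unlabelled pool:", model.score(X[unlabelled], y_true[unlabelled]))
```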

10
Q

Bias and variance are important to consider when working with machine learning.

A
11
Q

Bias relates to the goodness of fit of the model on training data. In other words, bias refers to how well the model represents (“fits”) the patterns present in the training data. When there is a high level of bias, the model isn’t doing a good job of fitting the data and is consistently producing incorrect outputs.

A

Variance determines the risk of poor model predictions on testing data. When there is a high level of variance, the model’s output is inconsistent when it is given new datasets, which leads to poor predictions.

12
Q
Bias and variance trade-off

The purpose of model development is to find the optimal trade-off between bias and variance so that both bias and variance are minimised.

See the images below for a visual representation of the bias and variance trade-off.1
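A minimal sketch of the trade-off (NumPy and scikit-learn assumed; the data are synthetic): a straight line underfits the curved signal (high bias), a very high-degree polynomial chases the noise (high variance), and a moderate degree typically sits in between.

```python
# Under- and overfitting: compare polynomial models of increasing complexity on held-out data.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 40)).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + rng.normal(scale=0.2, size=40)   # noisy curved signal
x_test = np.linspace(0, 1, 200).reshape(-1, 1)
y_test = np.sin(2 * np.pi * x_test).ravel()

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(x, y)
    err = np.mean((model.predict(x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: test error {err:.3f}")
```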

A
13
Q
Regularisation

Regularisation in machine learning is used to stop models from becoming too complicated and from recording every tiny detail in the data, by introducing penalties into the model for complexity. This helps achieve the desired bias-variance trade-off by controlling the model’s complexity, ensuring it captures enough of the data’s structure to identify underlying patterns without overfitting or underfitting.

Regularisation is used in slightly different ways depending on the specific scenario, by amending or removing coefficients. Coefficients are the numbers that multiply the predictor values. So, for example, if you were using a model to analyse imaging data to predict the severity of brain tumours based on characteristics observed in the images, there would be many characteristics extracted, such as shape, size, location, and so on. These characteristics would each be assigned a coefficient, which would determine how strongly each one influences the predicted severity of the brain tumour. If the coefficient for tumour size was positive, it would mean that larger tumour sizes tend to be associated with more severe conditions. If the coefficient for a particular tumour location was negative, it would mean that tumours in that location tend to be associated with less severe conditions.

A

Regression is a type of statistical technique used in data science to analyse the relationship between variables. It typically involves finding the line of best fit for a given set of data points. LASSO and Ridge Regression are the two most popular regularisation techniques used in regression analysis for better model performance. Both techniques help to reduce overfitting and improve prediction accuracy. They do this by reducing the model’s complexity, introducing shrinkage or adding a penalty to large coefficients. While both methods aim to reduce coefficients’ magnitudes, they differ in how they do so: LASSO uses L1 regularisation while Ridge uses L2 regularisation.

14
Q

Using L2 regularisation, Ridge Regression shrinks the coefficients of less significant features close to zero, but not exactly to zero. By doing so, it reduces the model’s complexity while still preserving its interpretability.

A

Unlike Ridge Regression, LASSO Regression uses L1 regularisation and can force the coefficients of less significant features to be exactly zero. It is therefore useful for identifying unimportant features that can be dropped from the model. As a result, LASSO Regression performs regularisation and feature selection simultaneously.
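A minimal sketch contrasting the two penalties (scikit-learn assumed; the data are synthetic, with only the first two of ten features carrying signal):

```python
# Ridge keeps every feature with a small weight; LASSO drives the noise features to exactly zero.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("ridge coefficients:", np.round(ridge.coef_, 3))   # small but non-zero everywhere
print("lasso coefficients:", np.round(lasso.coef_, 3))   # irrelevant features set to exactly 0
```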

15
Q

Ridge Regression (L2 regularisation)

In linear regression, the goal is to find the best-fitting line that minimises the sum of squared differences between the observed and predicted values (ordinary least squares). However, when there are highly correlated variables, linear regression may become unstable and provide unreliable estimates. Multicollinearity exists when two or more of the predictors in a regression model are moderately or highly correlated with one another.

Ridge regression is a variation of linear regression, specifically designed to address multicollinearity in the dataset. Ridge regression introduces a regularisation term (λ) that penalises large coefficients, helping to stabilise the model and prevent overfitting.

Bias–variance trade-off

The superiority of ridge regression compared to the method of least squares arises from the inherent trade-off between variance and bias. In machine learning terms, ridge regression amounts to adding bias into a model for the sake of decreasing that model’s variance.

A

LASSO (L1 regularisation)

LASSO stands for Least Absolute Shrinkage and Selection Operator. The key difference between Ridge and LASSO Regression lies in the regularisation term. In LASSO, the regularisation term is the absolute value of the coefficients, which tends to force some coefficients to become exactly zero. This feature of LASSO makes it a valuable tool for feature selection, as it automatically identifies and excludes irrelevant variables from the model.
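For reference, the two penalised objectives can be written in their standard textbook forms (this formulation is a general one, not taken from the course text): both minimise the squared prediction error plus a penalty of strength λ, differing only in whether the penalty is on squared coefficients (Ridge, L2) or absolute coefficients (LASSO, L1).

```latex
% Ridge (L2):
\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2
% LASSO (L1):
\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|
```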

Just like Ridge Regression, LASSO regression also trades off an increase in bias with a decrease in variance.

Comparing the two methods:

(1) Regularisation Type:

Ridge Regression uses L2 regularisation, which penalises the sum of squared coefficients.

LASSO Regression uses L1 regularisation, which penalises the sum of absolute values of coefficients.

(2) Feature Selection:

Ridge Regression can shrink coefficients towards zero but doesn’t force them to be exactly zero.

LASSO Regression can set some coefficients to exactly zero, effectively performing feature selection.

(3) Suitability:

Ridge Regression is suitable when you want to prevent multicollinearity and maintain all features in the model.

LASSO Regression is suitable when you want to perform feature selection and retain only the most relevant variables.

In practice, LASSO and Ridge Regression are often used together!

16
Q

Ground truth is the accurate and objective data or information that serves as the benchmark for training and evaluating machine learning models, ensuring that a model’s performance aligns with information that is empirically known to be true. It is the standard against which the model’s predictions or classifications are compared to measure its accuracy and effectiveness.

AI models ‘learn’ using training data with labelled ground truths.

A
17
Q
Ground truth in radiology

When it comes to radiology in particular, the robustness of the ground truth and training data are vitally important in understanding AI model decisions.

For radiological ground truth labels, it is important to consider:

How many readers were there? What expertise and training did the readers have?

What’s the intraobserver and interobserver variability?

Ask yourself: what is the aim and ‘use case’ of the AI model? Once you have determined the answer to this question, you can choose your ground truth and labels accordingly.

Remember: Issues can arise where there is diagnostic uncertainty. This is commonly the case where a diagnosis is a clinical one rather than one based on an objective test.

A
18
Q

Underfitting 🡪 The model is not complex enough to capture underlying patterns in data.

Underfitting leads to 🡪 high bias.

Overfitting 🡪 The model is too complex. It captures noise or random fluctuations in the training data.

Overfitting leads to 🡪 high variance.

A

Regularisation is used to stop models from becoming too complicated.

Coefficients are the numbers that multiply predictor values; regularisation works by shrinking or removing these coefficients.

LASSO and Ridge Regression are two popular regularisation techniques which help to reduce overfitting and improve prediction accuracy.

19
Q

Section 2: Machine Learning Methods. The key learning points for this section are recapped below:

Supervised and unsupervised learning are two types of machine learning whereby algorithms are trained to uncover patterns and relationships within data.

Supervised learning is learning from labelled data to predict outcomes for new data.

Unsupervised learning is learning from unlabelled data to identify patterns and relationships.

Parametric models make specific assumptions about underlying data distributions whereas non-parametric models make minimal assumptions about data and allow for more flexibility.

Transfer learning is taking a model built for a specific task and adapting it for another.

Active learning is a machine learning approach that aims to minimise annotation effort by having the model select the most informative data points for labelling.

Bias relates to the goodness of fit of the model on training data. In other words, bias refers to how well the model represents (‘fits’) the patterns present in the training data.

Variance determines the risk of poor model predictions on testing data. When there is a high level of variance, the model’s output is inconsistent when it is given new datasets, which leads to poor predictions.

The purpose of model development is to find the optimal trade-off between bias and variance so that both bias and variance are minimised.

Ground truth is the accurate and objective data or information that serves as the benchmark for training and evaluating machine learning models to ensure that the model’s performance aligns with information that is empirically known to be true.

A