Data science in medicine: machine learning Flashcards
What is machine learning and what is it based on?
- The construction of approximate, generalizing (predictive) models by learning from examples, for problems for which no full physical model is known (yet).
- Machine learning algorithms build a mathematical model based on sample data, known as ‘training data’, in order to make predictions or decisions without being explicitly programmed to perform the task.
Machine learning will find structure in data by:
- clustering
- outlier/anomaly detection
- dimensionality reduction, selecting useful features
- regression
- classification
What are all these processes aimed at?
They are all aimed at generalisation → making a prediction for data you have not yet seen.
What is clustering in regard to machine learning?
Identification of ‘natural groups’ within a population, e.g. among a population of apples → identifying red and green apples.
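As an illustration, the apple example can be sketched with a very small k-means routine on a single feature. This is only a minimal sketch: the function name `kmeans_1d`, the "redness" feature and all values are made up for the example, and k-means is just one of many clustering methods.

```python
def kmeans_1d(values, k=2, iterations=20):
    """Tiny k-means for a single feature (requires k >= 2).

    Returns (centres, labels), where labels[i] is the cluster of values[i].
    """
    ordered = sorted(values)
    # Start with centres spread evenly over the sorted values.
    centres = [ordered[(i * (len(ordered) - 1)) // (k - 1)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centre.
        labels = [min(range(k), key=lambda c: abs(v - centres[c])) for v in values]
        # Update step: each centre moves to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centres[c] = sum(members) / len(members)
    return centres, labels


# Hypothetical "redness" scores (0 = green, 1 = red) for ten apples.
redness = [0.91, 0.12, 0.88, 0.15, 0.95, 0.08, 0.85, 0.20, 0.90, 0.10]
centres, labels = kmeans_1d(redness)
print(centres)  # one centre near the green apples, one near the red apples
print(labels)   # cluster index per apple; no labels were given in advance
```

Note that the algorithm never sees the words "red" or "green": it only finds two groups of similar values.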
What is outlier detection in regard to machine learning?
Identifying unusual objects (outliers) within a population, e.g. among a population of apples → identifying one pear.
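One simple way to sketch this is a robust score based on the median: an object is flagged when its measurement lies far from the median compared with the typical spread. The function name `mad_outliers`, the height/width feature and the data are assumptions for illustration; the constant 0.6745 and the threshold 3.5 follow a common convention for this "modified z-score", not anything stated in these cards.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds the threshold."""
    med = median(values)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread at all, so nothing can be ranked as unusual
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]


# Hypothetical height/width ratios: apples are roughly round, the pear is elongated.
ratios = [1.02, 0.98, 1.00, 1.05, 0.97, 1.01, 1.45, 0.99]
print(mad_outliers(ratios))  # index of the object that does not fit the population
```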
What is dimensionality reduction in regard to machine learning?
Dimensionality reduction means describing objects with fewer features; feature selection is one common way of doing this.
Finding predictive measurements to identify subgroups within a population. For example, identifying the characteristics that are specific to red apples and the different characteristics that are specific to green apples.
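A minimal sketch of feature selection, assuming two known subgroups and two hypothetical features: each feature is scored by how far apart the group means are relative to the spread within the groups (a Fisher-style score), and the features are ranked by that score. The feature names and numbers are invented for the example.

```python
from statistics import mean, pvariance


def fisher_score(a, b):
    """Between-group separation divided by within-group spread for one feature."""
    return (mean(a) - mean(b)) ** 2 / (pvariance(a) + pvariance(b))


# Hypothetical measurements for red and green apples.
red = {"redness": [0.9, 0.8, 0.95, 0.85], "weight_g": [150, 170, 160, 180]}
green = {"redness": [0.1, 0.2, 0.15, 0.05], "weight_g": [155, 175, 165, 170]}

scores = {name: fisher_score(red[name], green[name]) for name in red}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # most predictive feature first
```

Here redness separates the subgroups well, while weight barely differs between them and could be dropped.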
What is regression in regard to machine learning?
Predicting real-valued outputs, e.g. predicting the prices of different types of apples from their characteristics.
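The simplest instance is fitting a straight line by least squares. The sketch below assumes a single feature (weight) and made-up prices that happen to lie exactly on a line; `fit_line` is a hypothetical helper name.

```python
from statistics import mean


def fit_line(x, y):
    """Least-squares fit of y = slope * x + intercept."""
    mx, my = mean(x), mean(y)
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    intercept = my - slope * mx
    return slope, intercept


# Hypothetical training data: apple weight in grams and price in euros.
weights = [100, 150, 200, 250]
prices = [0.30, 0.40, 0.50, 0.60]

slope, intercept = fit_line(weights, prices)
# Predict the price of an apple whose weight was not in the training data.
predicted = slope * 180 + intercept
print(round(predicted, 2))
```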
What is classification in regard to machine learning?
Distinguishing different groups within a population, e.g. distinguishing apples from pears.
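Unlike clustering, classification learns from examples that already carry a label. A minimal sketch, assuming a nearest-centroid rule (one of many possible classifiers) and two invented features: each class is summarised by the mean of its training examples, and a new object gets the label of the closest mean.

```python
from math import dist


def train_centroids(samples):
    """samples: list of (feature_tuple, label). Returns label -> mean feature tuple."""
    grouped = {}
    for features, label in samples:
        grouped.setdefault(label, []).append(features)
    return {label: tuple(sum(col) / len(rows) for col in zip(*rows))
            for label, rows in grouped.items()}


def classify(centroids, features):
    """Assign the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


# Hypothetical features: (weight in units of 100 g, height/width ratio).
training = [
    ((1.5, 1.00), "apple"), ((1.7, 0.95), "apple"), ((1.6, 1.05), "apple"),
    ((1.8, 1.40), "pear"), ((2.0, 1.50), "pear"), ((1.9, 1.45), "pear"),
]
centroids = train_centroids(training)

# Two objects that were not part of the training data.
print(classify(centroids, (1.65, 1.0)))
print(classify(centroids, (1.95, 1.5)))
```

Classifying objects that were not in the training data is the generalisation step that all these methods aim at.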
Example of machine learning in gene expression diagnostics
In this example, gene expression data from different patients are collected to diagnose childhood leukemia or to predict relapse.
- Clustering is used to identify similar subtypes of the disease in the group of patients.
- Clustering is also used to identify ‘clusters’ of genes with similar ‘disruptive’ processes.
- Outlier detection is used to identify technical errors and rare patients or rare genetic backgrounds.
- Dimensionality reduction is used to identify potential biomarkers that can predict the disease.
- Regression is used to predict the survival time of the patients.
- Classification is used to predict e.g. whether metastasis will occur.
What is important to consider when applying machine learning, e.g. to diagnose patients with a certain disease from their genetic data?
That machine learning algorithms operate on a mathematical representation of the objects.
Thus: to implement machine learning, we have to find a mathematical representation of the objects. Objects are usually represented by features (i.e. sets of useful measurements obtained from some sensors).
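As a small sketch of what such a representation can look like, the code below turns a fruit described by its weight and colour into a list of numbers. The class `Fruit`, the colour list and the one-hot encoding of colour are assumptions chosen for illustration; other encodings are possible.

```python
from dataclasses import dataclass


@dataclass
class Fruit:
    weight_g: float
    colour: str  # one of the names in COLOURS


COLOURS = ["red", "green", "yellow"]  # hypothetical fixed colour vocabulary


def to_feature_vector(fruit):
    """Weight as a number, followed by a one-hot encoding of the colour."""
    one_hot = [1.0 if fruit.colour == c else 0.0 for c in COLOURS]
    return [fruit.weight_g] + one_hot


print(to_feature_vector(Fruit(150.0, "green")))  # [150.0, 0.0, 1.0, 0.0]
```

Once every object is a list of numbers of the same length, the same algorithms can be applied to apples, patients or gene expression profiles.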
Imagine a dataset that contains information about the object (apple or pear), weight and colour.
- What would the dataset look like after clustering by machine learning?
- What would the dataset look like after regression by machine learning?
- What would the dataset look like after classification by machine learning?
- Clustering → Based on the characteristics weight and colour only (no labels), machine learning identifies two natural groups within the population, here the apples and the pears.
- Regression → Based on the characteristics object, weight and colour, machine learning predicts the price of the apples and pears.
- Classification → Based on the given characteristics, each object is labeled as an apple or a pear.
Look at the picture: to which dataset (left or right) can machine learning be applied more easily, and why?
The left dataset, because:
- simple
- knowledge is present
- a few good features
- almost separable classes (classification) or a linear relation (regression)
Not the right dataset, because:
- complex
- lack of knowledge
- many poor features
- overlapping classes (classification) or highly non-linear relation (regression)
Explain what a vector and vector space is.
- Vector → mathematical object characterized by size (magnitude) and direction (e.g. velocity and force, both characterized by size and direction)
- Vector space → a set of vectors that can be added together and multiplied (i.e. scaled) by numbers called scalars, where the result is again a vector in the set.
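The two operations that define a vector space can be sketched on plain lists of numbers. The feature vectors below (weight in grams, redness score) and the helper names `add` and `scale` are invented for the example.

```python
def add(u, v):
    """Vector addition: add the components pairwise."""
    return [a + b for a, b in zip(u, v)]


def scale(c, v):
    """Scalar multiplication: multiply every component by the scalar c."""
    return [c * a for a in v]


# Hypothetical feature vectors: [weight in grams, redness score].
apple = [150.0, 0.5]
pear = [190.0, 0.25]

print(add(apple, pear))              # [340.0, 0.75]
print(scale(2, apple))               # [300.0, 1.0]
print(scale(0.5, add(apple, pear)))  # [170.0, 0.375], the midpoint of the two
```

Both results are again vectors of the same length, which is what allows means and distances between objects to be computed.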