Bioinformatics Lecture 6 Flashcards
use of machine learning in biomedicine
most importantly personalised medicine
problem of datasets in biomedicine
many features measured per sample
but not many samples
-> increases risk of overfitting
unsupervised learning
mainly look at clusters
uses unlabelled data
finds patterns in data
problem unsupervised learning in biology
different causes of same outcome
or vice versa
k-means clustering algorithm
assigns each point to the nearest of k cluster centres, then recomputes each centre as the mean of its points
stops when the assignment of points no longer changes
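A minimal sketch of the loop described on this card, using only the Python standard library. The function name `kmeans`, the toy points and the fixed starting centres are assumptions made for illustration; in practice the starting centres are usually drawn at random, which is why repeated runs can give different results.

```python
import math

def kmeans(points, initial_centres, max_iter=100):
    """Alternate between assigning points and updating centres."""
    centres = list(initial_centres)  # copy so the caller's list is untouched
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centre.
        new_assignment = [
            min(range(len(centres)), key=lambda j: math.dist(p, centres[j]))
            for p in points
        ]
        # Stopping rule: no point changed cluster since the last pass.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Update step: move each centre to the mean of its members.
        for j in range(len(centres)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its old centre
                centres[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignment, centres

# Toy data (hypothetical): two visibly separate groups, k = 2.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centres = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(labels, centres)
```

Here k is passed in implicitly as the number of starting centres, so the caller still has to choose it.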
advantages k-means clustering algorithm
simple and fast
always converges (though possibly only to a local optimum)
easy to understand
disadvantages k-means clustering algorithm
need to choose k manually
is non-deterministic: the result depends on the initial placement of the cluster centres
supervised learning
uses labelled data
either classification or prediction (regression)
classification use
finding discrete groups
regression use
predicting continuous traits
nearest mean classifier
assigns a new sample to the group whose mean is nearest
problem nearest mean classifier
some observations are more similar to the other group than the one that they are assigned to
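A small sketch of the nearest mean classifier and of the problem on the card above. The helper names `group_means` and `nearest_mean` and the toy samples are assumptions for illustration. One sample labelled "A" is deliberately placed closer to the mean of group "B" than to the mean of its own group.

```python
import math

def group_means(samples, labels):
    """Mean vector of each labelled group."""
    means = {}
    for label in set(labels):
        members = [s for s, l in zip(samples, labels) if l == label]
        means[label] = tuple(sum(c) / len(members) for c in zip(*members))
    return means

def nearest_mean(sample, means):
    """Assign a sample to the group whose mean is closest."""
    return min(means, key=lambda label: math.dist(sample, means[label]))

# Toy data (hypothetical): (6.0, 1.0) is labelled "A" but sits near group "B".
samples = [(1.0, 1.0), (2.0, 1.0), (6.0, 1.0), (7.0, 1.0), (8.0, 1.0), (9.0, 1.0)]
labels = ["A", "A", "A", "B", "B", "B"]
means = group_means(samples, labels)

new_sample_group = nearest_mean((2.5, 1.0), means)  # a new, unlabelled sample
outlier_group = nearest_mean((6.0, 1.0), means)     # a training sample labelled "A"
print(new_sample_group, outlier_group)
```

With these numbers the group means are (3, 1) for "A" and (8, 1) for "B", so the outlying "A" sample is assigned to "B".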
nearest neighbour classifier
assigns a new sample the label of the nearest observation
or sometimes the majority label among the nearest k observations
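A sketch of the nearest neighbour rule with a majority vote over the k closest observations. The function name `knn_predict` and the toy data are assumptions; Euclidean distance is one common choice, not the only one.

```python
import math
from collections import Counter

def knn_predict(sample, train_samples, train_labels, k=1):
    """Label a sample by majority vote among its k nearest training points."""
    ranked = sorted(
        zip(train_samples, train_labels),
        key=lambda pair: math.dist(sample, pair[0]),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy data (hypothetical).
train_samples = [(1.0, 1.0), (2.0, 1.0), (6.0, 1.0), (7.0, 1.0), (8.0, 1.0), (9.0, 1.0)]
train_labels = ["A", "A", "A", "B", "B", "B"]

query = (5.8, 1.0)
with_k1 = knn_predict(query, train_samples, train_labels, k=1)
with_k3 = knn_predict(query, train_samples, train_labels, k=3)
print(with_k1, with_k3)
```

The same query gets a different label for k = 1 and k = 3 here, because the single nearest point is an "A" while two of the three nearest points are "B".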
overfitting
partly because so many features are measured for each sample
the more complicated the curve, the more likely it is that a new observation doesn’t fit at all
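A toy illustration of the point about complicated curves. It compares a straight line fitted by least squares with a degree-4 polynomial that passes through all five training points exactly. All numbers are invented for the example, and the held-out observation (3.5, 3.5) is an assumption chosen to follow the rough trend y = x.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx

def interpolate(xs, ys, x):
    """Value at x of the polynomial through every training point (Lagrange form)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        weight = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                weight *= (x - xj) / (xi - xj)
        total += yi * weight
    return total

# Noisy training data scattered around y = x (hypothetical).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.0, 1.5, 1.5, 3.5, 3.5]
new_x, new_y = 3.5, 3.5  # held-out observation, not used for fitting

slope, intercept = fit_line(xs, ys)
line_error = abs(slope * new_x + intercept - new_y)
curve_error = abs(interpolate(xs, ys, new_x) - new_y)
print(line_error, curve_error)
```

The polynomial has zero error on the training points, yet it misses the held-out observation by more than the simple line does.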
rule of thumb against overfitting
you need about ten times more samples than features
solutions to overfitting
feature selection
either by dimensionality reduction or hand-picking
dimensionality reduction
PCA
e.g. if height and shape vary together there is effectively only one dimension
if not, there are more
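A sketch of the idea behind PCA for just two features, using the closed-form eigenvalues of a 2x2 covariance matrix. The function name `first_component_share` and both toy data sets are assumptions. The function returns the share of total variance captured by the first principal component.

```python
import math

def first_component_share(xs, ys):
    """Fraction of total variance explained by the first principal component."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # Eigenvalues of the symmetric covariance matrix [[sxx, sxy], [sxy, syy]].
    half_trace = (sxx + syy) / 2
    spread = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    largest, smallest = half_trace + spread, half_trace - spread
    return largest / (largest + smallest)

# Two features that vary together perfectly (hypothetical values).
together = first_component_share([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 4.0, 6.0, 8.0, 10.0])
# Two features with zero covariance (hypothetical values).
independent = first_component_share([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 1.0, 3.0, 1.0, 2.0])
print(together, independent)
```

When the two features vary together perfectly, one component carries all of the variance, so the data are effectively one-dimensional. When they do not, the second component carries a real share as well.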
hand-picking features
based on biological knowledge
e.g. with TCGA (The Cancer Genome Atlas)
disadvantage deep learning
big black box
hard to interpret how the model reaches its predictions
cross validation
train model
then test it on data it was not trained on
gold standard for assessing how well a model fits
risk of cross validation with k models
chance finding of a model that doesn’t actually work
validation methods
leave one out cross validation
n-fold cross validation
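A sketch of both validation methods on one toy data set, using a 1-nearest-neighbour classifier as the model. All names and data are assumptions for illustration. Leave-one-out is treated as the special case where the number of folds equals the number of samples.

```python
import math

def fold_indices(n_samples, n_folds):
    """Split sample indices into n_folds disjoint test folds."""
    return [list(range(i, n_samples, n_folds)) for i in range(n_folds)]

def nearest_neighbour(features, train):
    """Label of the closest training pair; train holds (features, label) pairs."""
    return min(train, key=lambda pair: math.dist(features, pair[0]))[1]

def cross_validate(data, n_folds):
    """Accuracy when every sample is predicted by a model that never saw it."""
    correct = 0
    for test_idx in fold_indices(len(data), n_folds):
        held_out = set(test_idx)
        train = [d for i, d in enumerate(data) if i not in held_out]
        for i in test_idx:
            features, label = data[i]
            correct += nearest_neighbour(features, train) == label
    return correct / len(data)

# Toy labelled data (hypothetical).
data = [
    ((1.0, 1.0), "A"), ((2.0, 1.0), "A"), ((6.0, 1.0), "A"),
    ((7.2, 1.0), "B"), ((8.0, 1.0), "B"), ((9.0, 1.0), "B"),
]

leave_one_out = cross_validate(data, n_folds=len(data))
three_fold = cross_validate(data, n_folds=3)
print(leave_one_out, three_fold)
```

Scoring the same classifier on its own training data would give a perfect result, because every point is its own nearest neighbour. The held-out estimates are lower and more honest.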