Bioinformatics Lecture 6 Flashcards
use of machine learning in biomedicine
most importantly personalised medicine
problem of datasets in biomedicine
a lot of data per sample (many features)
but not many samples
-> increases risk of overfitting
unsupervised learning
mainly look at clusters
uses unlabelled data
finds patterns in data
problem unsupervised learning in biology
different causes of same outcome
or vice versa
k-means clustering algorithm
groups the data into k clusters by repeatedly assigning each point to the nearest cluster centre and recomputing the centres
stops when there is no change in assignment of points anymore
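A minimal sketch of the loop described on this card, assuming squared Euclidean distance and initial centres passed in by hand so that the run is reproducible (in practice they are usually chosen at random). The function name `kmeans` and the toy points are illustrative, not from the lecture.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, init_centres, max_iter=100):
    """Alternate between assigning points and moving centres."""
    centres = list(init_centres)
    k = len(centres)
    assignment = None
    for _ in range(max_iter):
        # Assign each point to the index of its nearest centre.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centres[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster: stop
        assignment = new_assignment
        # Move each centre to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centre if a cluster ends up empty
                centres[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centres, assignment

# Toy data (made up): two visibly separated groups of three points.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
centres, labels = kmeans(points, init_centres=[(1.0, 1.0), (5.0, 5.0)])
print(labels)
```

Starting from other initial centres can lead to a different final clustering, which is why the result is usually described as depending on the initialisation.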
advantages k-means clustering algorithm
simple and fast
always terminates, though not necessarily at the best clustering
easy to understand
disadvantages k-means clustering algorithm
need to choose k manually
is non-deterministic: the result depends on the initial placement of the cluster centres
supervised learning
uses labelled data
either classification or prediction (regression)
classification use
finding discrete groups
regression use
predicting continuous traits
nearest mean classifier
assigns a new sample to the group whose mean is nearest
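A short sketch of a nearest mean classifier, assuming squared Euclidean distance. The function name `nearest_mean_classify`, the group labels and the feature values are all hypothetical toy choices.

```python
def nearest_mean_classify(train, new_sample):
    """train maps a group label to a list of feature tuples."""
    def mean(rows):
        return tuple(sum(col) / len(rows) for col in zip(*rows))

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # One mean vector per group, computed from the labelled samples.
    means = {label: mean(rows) for label, rows in train.items()}
    # The new sample gets the label of the closest group mean.
    return min(means, key=lambda label: sq_dist(new_sample, means[label]))

# Toy labelled data (made up for illustration).
train = {
    "healthy": [(1.0, 2.0), (1.5, 1.8), (1.2, 2.2)],
    "disease": [(4.0, 4.5), (4.2, 3.9), (3.8, 4.1)],
}
prediction = nearest_mean_classify(train, (1.4, 2.1))
print(prediction)
```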
problem nearest mean classifier
some observations are more similar to the other group than the one that they are assigned to
nearest neighbour classifier
what is the nearest observation
or sometimes nearest k observations
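A small sketch of a nearest neighbour classifier with a majority vote over the k nearest observations, assuming squared Euclidean distance. The name `knn_classify` and the toy data are illustrative only; one "A" observation is deliberately placed inside the "B" group.

```python
from collections import Counter

def knn_classify(train, new_sample, k=1):
    """train is a list of (features, label) pairs."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Sort the labelled observations by distance and keep the k closest.
    nearest = sorted(train, key=lambda pair: sq_dist(pair[0], new_sample))[:k]
    # Majority vote among the k nearest labels.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data (made up): the third "A" point lies among the "B" points.
train = [
    ((1.0, 1.0), "A"), ((1.2, 0.9), "A"), ((3.0, 3.0), "A"),
    ((3.2, 3.1), "B"), ((2.9, 3.3), "B"), ((3.1, 2.8), "B"),
]
new_sample = (3.0, 2.9)
label_k1 = knn_classify(train, new_sample, k=1)
label_k3 = knn_classify(train, new_sample, k=3)
print(label_k1, label_k3)
```

With k = 1 the single stray "A" observation decides the label; with k = 3 the two surrounding "B" observations outvote it, which is one reason to look at several neighbours.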
overfitting
partly because there is so much data about each sample
the more complicated the curve, the more likely it is that a new observation doesn’t fit at all
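A toy illustration of the point about complicated curves, assuming five made-up noisy observations that roughly follow y = x. A straight line is fitted by least squares and compared with a degree-4 polynomial that passes exactly through every training point (Lagrange interpolation). All names and numbers are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for a straight line."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

def interpolate(xs, ys, x_new):
    """Evaluate the polynomial through all (x, y) points at x_new."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x_new - xj) / (xi - xj)
        total += term
    return total

# Made-up training data: y is roughly x plus a little noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 0.9, 2.2, 2.8, 4.1]

slope, intercept = fit_line(xs, ys)
x_new = 5.0  # a new observation; the underlying trend suggests y near 5
line_pred = slope * x_new + intercept
poly_pred = interpolate(xs, ys, x_new)
print(line_pred, poly_pred)
```

The polynomial reproduces the training points exactly, yet at x = 5 it predicts about 10.1, while the simple line predicts about 5.0. The more flexible curve has fitted the noise.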
rule of thumb against overfitting
you need about ten times more samples than features