week 4 - unsupervised learning Flashcards
How does unsupervised k-means clustering work?
It finds clusters in a dataset that has no labels.
First, initialise the cluster centres randomly.
Then assign each data point to the nearest centre, recalculate each centre as the mean of its cluster, and reassign the data points to the nearest centre.
Repeat until a convergence criterion is met (e.g. the centre shifts become very small).
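A minimal NumPy sketch of these steps (the function name, parameters, and defaults are my own; this is illustrative, not an optimised implementation):

```python
import numpy as np

def kmeans(X, k, n_iters=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: initialise cluster centres randomly (here: k distinct data points).
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each data point to the nearest centre.
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recalculate each centre as the mean of its cluster.
        new_centres = centres.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centre if a cluster empties out
                new_centres[j] = members.mean(axis=0)
        # Step 4: stop once the centre shifts become very small.
        if np.linalg.norm(new_centres - centres) < tol:
            break
        centres = new_centres
    return labels, centres

# Example usage on toy data:
labels, centres = kmeans(np.random.default_rng(1).normal(size=(200, 2)), k=3)
```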
How can we figure out whether a k-means clustering solution is good?
We can use the average silhouette coefficient.
This coefficient is based on intra- and inter-cluster distances.
It measures whether the data points within a cluster are close to the other points in that cluster but far away from points outside the cluster.
Each point's silhouette coefficient can indicate whether it has been assigned to the wrong cluster: a value near -1 means probably misassigned, near 0 means on a cluster boundary, and near 1 means well clustered.
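A quick sketch of checking this with scikit-learn's silhouette_score and silhouette_samples (the toy data here is just random noise standing in for a real dataset):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

X = np.random.default_rng(0).normal(size=(200, 2))  # placeholder data

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Average silhouette coefficient for the whole clustering solution.
print(silhouette_score(X, labels))

# Per-point coefficients: near -1 = probably misassigned,
# near 0 = on a cluster boundary, near 1 = well clustered.
per_point = silhouette_samples(X, labels)
print((per_point < 0).sum(), "points look misassigned")
```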
Why might we want to decompose highly correlated variables?
Because we might be able to capture a similar amount of information with just one dimension.
So dimensionality reduction can mitigate the curse of dimensionality whilst retaining most of the information.
What is the difference between regression and PCA?
Regression minimises the vertical errors (residuals) when predicting y from x.
PCA fits a line that minimises the perpendicular distances between the points and the line. Rather than predicting y from x, this line captures the variance in the data.
Also note that in higher dimensions, PCA still fits a line (the first principal component) through the data, whereas multiple regression fits a plane (or hyperplane).
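A small sketch contrasting the two fits on the same synthetic data (the variable names and noise level are my own): the regression slope minimises vertical residuals, while the first principal component minimises perpendicular distances, so the two lines generally differ.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 0.8 * x + rng.normal(scale=0.5, size=500)
X = np.column_stack([x, y])

# Regression: minimise vertical errors when predicting y from x.
reg_slope = np.polyfit(x, y, deg=1)[0]

# PCA: the first principal component is the top eigenvector of the
# covariance matrix, which minimises perpendicular distances to the line.
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
pc1 = eigvecs[:, eigvals.argmax()]
pca_slope = pc1[1] / pc1[0]

print(reg_slope, pca_slope)  # two different slopes for the same data
```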
How does PCA capture sources of variance?
PCA fits a line that minimises the sum of squared perpendicular distances between the points and the line. The direction of this line captures the biggest source of variance in the data, which can be compressed into one dimension by projecting each data point onto the line.
Then, for a second component, we fit a line orthogonal to the first that captures the most remaining variance. This is because all the remaining variance lies orthogonal to the first line.
The number of orthogonal lines, i.e. components, equals the number of dimensions in the data.
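A short sketch of these ideas using scikit-learn's PCA on synthetic correlated data (the toy data and names are my own): the fitted components come out orthogonal, and projecting onto the first one gives the 1-D compression.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Two correlated variables: most variance lies along one direction.
x = rng.normal(size=300)
X = np.column_stack([x, 0.9 * x + rng.normal(scale=0.3, size=300)])

pca = PCA()                 # as many components as dimensions (here, 2)
scores = pca.fit_transform(X)

print(pca.components_ @ pca.components_.T)  # ~identity: components are orthogonal
print(pca.explained_variance_ratio_)        # first component captures most variance
one_d = scores[:, 0]                        # projection onto the first component
```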
How do we determine the best number of PCA components?
We can create a scree plot, which plots the explained variance and cumulative explained variance against the number of components. You can then keep the number of components corresponding to the elbow (inflection point) of the scree plot.
We can also optimise the number of components as a hyperparameter (when PCA is combined with supervised learning).
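A minimal scree-plot sketch, assuming scikit-learn and matplotlib (the random data is a placeholder for a real dataset):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(200, 10))  # placeholder data

pca = PCA().fit(X)
var = pca.explained_variance_ratio_

plt.plot(range(1, len(var) + 1), var, "o-", label="explained variance")
plt.plot(range(1, len(var) + 1), np.cumsum(var), "s--", label="cumulative")
plt.xlabel("number of components")
plt.ylabel("proportion of variance explained")
plt.legend()
plt.show()  # keep the components up to the elbow of the curve
```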
Is PCA supervised or unsupervised? And can it be combined with the other type?
PCA is unsupervised but it is often combined with supervised learning
What is the main goal of PCA?
To reduce the dimensionality of the data to address the curse of dimensionality.
This is particularly helpful when we have correlated variables, as we don't lose much important information by keeping only the leading principal components.
When combined with supervised learning, it can also reduce overfitting to noise, thus making the supervised model more powerful.
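One common way to combine the two, sketched with a scikit-learn Pipeline on the built-in digits dataset: treat the number of components as a hyperparameter and tune it by cross-validation (the candidate values here are arbitrary):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = load_digits(return_X_y=True)

pipe = Pipeline([("pca", PCA()), ("clf", LogisticRegression(max_iter=1000))])
# Treat the number of components as a hyperparameter, chosen by cross-validation.
search = GridSearchCV(pipe, {"pca__n_components": [5, 10, 20, 40]}, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```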
What can combining PCA with clustering do?
You can cluster the principal component scores of functional neuroimaging data to obtain particular brain 'states'.
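A hedged sketch of that pipeline shape (the data here is random noise with a hypothetical timepoints-by-voxels shape; a real analysis would load preprocessed fMRI data and choose the numbers of components and states more carefully):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Hypothetical shape: timepoints x voxels, with noise standing in for real data.
data = np.random.default_rng(0).normal(size=(500, 2000))

components = PCA(n_components=10).fit_transform(data)  # reduce voxels to 10 components
states = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(components)
# `states` assigns each timepoint to one of 4 recurring patterns ("brain states").
```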
What are alternatives to PCA?
ICA
Hierarchical clustering
Gaussian mixture modelling
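For reference, all three alternatives have scikit-learn implementations; a minimal sketch on toy data (the parameter choices are arbitrary):

```python
import numpy as np
from sklearn.decomposition import FastICA
from sklearn.cluster import AgglomerativeClustering
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(300, 5))  # placeholder data

sources = FastICA(n_components=3, random_state=0).fit_transform(X)            # ICA
tree_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)            # hierarchical
soft_labels = GaussianMixture(n_components=3, random_state=0).fit_predict(X)  # GMM
```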