Unsupervised Learning Flashcards

1
Q

unsupervised learning

A

the learning algorithm works out the patterns in unlabelled data by itself, for the purpose of compact representation (dimensionality reduction) and of categorisation and analysis (clustering)

2
Q

dimensionality reduction

A

transform data into a lower-dimensional latent space representation of that data

3
Q

latent space

A

reduced dimensional space in which essential features of input data are encoded

4
Q

latent space in autoencoder

A

compressed representation of input data

5
Q

encoding

A

process of transforming/compressing data into latent representation

6
Q

decoding

A

process of recovering/reconstructing data from its compressed latent representation

7
Q

autoencoder

A

MLP (fully connected network) whose first hidden layer(s) constitute an encoder, the output of which is the latent representation; the layer(s) after the latent space constitute the decoder

8
Q

how does dimensionality reduction happen with autoencoders

A

encoding input data into a lower-dimensional space by setting the number of neurons in the latent (bottleneck) layer to be smaller than the number of input neurons

9
Q

how is an autoencoder network trained

A

network is trained end to end to minimise the reconstruction error, typically MSE (mean squared difference between input and reconstructed output)
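
A minimal sketch of end-to-end training, assuming NumPy, a single tanh hidden layer as the latent space, and plain gradient descent. The name `train_autoencoder` and all hyperparameter values are hypothetical choices for illustration, not a reference implementation.

```python
import numpy as np

def train_autoencoder(X, latent_dim, lr=0.05, epochs=1000, seed=0):
    """Fit a one-hidden-layer autoencoder by gradient descent on the MSE."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    # The latent layer has fewer neurons than the input (the bottleneck).
    W_enc = rng.normal(scale=0.1, size=(n_features, latent_dim))
    b_enc = np.zeros(latent_dim)
    W_dec = rng.normal(scale=0.1, size=(latent_dim, n_features))
    b_dec = np.zeros(n_features)
    losses = []
    for _ in range(epochs):
        Z = np.tanh(X @ W_enc + b_enc)   # encoding: input -> latent space
        X_hat = Z @ W_dec + b_dec        # decoding: latent space -> reconstruction
        err = X_hat - X
        losses.append(float(np.mean(err ** 2)))      # MSE for this epoch
        # Backpropagate the MSE through the decoder, then the encoder.
        g_out = 2.0 * err / err.size
        g_pre = (g_out @ W_dec.T) * (1.0 - Z ** 2)   # tanh derivative
        W_dec -= lr * (Z.T @ g_out)
        b_dec -= lr * g_out.sum(axis=0)
        W_enc -= lr * (X.T @ g_pre)
        b_enc -= lr * g_pre.sum(axis=0)

    def encode(data):
        return np.tanh(data @ W_enc + b_enc)

    def decode(codes):
        return codes @ W_dec + b_dec

    return encode, decode, losses
```

Calling `encode, decode, losses = train_autoencoder(X, latent_dim=2)` on an N x D array gives an N x 2 latent code from `encode(X)`, and `losses` records the MSE per epoch.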

10
Q

usages of autoencoders

A

data compression (lossy; loosely analogous to compressing a file into a zip)
noise reduction

11
Q

autoencoder architecture (picture)

A

number of input and output neurons is the same
hourglass shape
more hidden layers allow the network to learn more complex features
neuron count is reduced in the encoder to find the compressed latent data (important patterns), then expanded in the decoder to reconstruct the input

12
Q

PCA stands for

A

principal component analysis

13
Q

what is PCA and what is it used for

A

used for dimensionality reduction while preserving as much variance as possible in the data

It transforms the original dataset into a new coordinate system whose axes (called principal components) are ordered by the amount of variance (information) they capture in the data

14
Q

how to compute the principal components with PCA

A

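given an N x D data matrix X
compute the mean value of each attribute and subtract it from X
compute the D x D covariance matrix C
compute the eigenvalues and eigenvectors of C and sort them by decreasing eigenvalue
with E the matrix whose rows are the sorted eigenvectors, the first K rows of E give the K most important components of X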
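
The steps above can be sketched as follows, assuming NumPy. The function name `pca` and its return values are hypothetical choices for illustration, and `E` follows the card's convention of storing one component per row.

```python
import numpy as np

def pca(X, k):
    """Return the top-k principal components, their variances, and the projected data."""
    mean = X.mean(axis=0)                   # mean value of each attribute
    Xc = X - mean                           # centre the data
    C = np.cov(Xc, rowvar=False)            # D x D covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)    # eigh: C is symmetric; ascending order
    order = np.argsort(eigvals)[::-1]       # sort by decreasing eigenvalue
    eigvals = eigvals[order]
    E = eigvecs[:, order].T                 # rows of E are the sorted eigenvectors
    components = E[:k]                      # first K rows: K most important components
    projected = Xc @ components.T           # N x K reduced representation
    return components, eigvals[:k], projected
```

For data visualisation, `pca(X, 2)[2]` gives an N x 2 array that can be plotted directly.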

15
Q

applications of PCA

A

data visualisation (visualise high-dimensional data in 2D or 3D)
noise reduction (retains only the most significant principal components)
feature reduction

16
Q

AE vs PCA

A

AE
- can handle non-linear data
- more flexible but longer to train
PCA
- can only handle linear projections

17
Q

latent variable

A

variable that is not directly observed but is inferred from other variables that are observed
like intelligence which is inferred from say responses on tests

18
Q

K-Means algorithm

A

unsupervised learning algorithm designed to partition data into K distinct clusters
minimises within-cluster variance, aiming to group similar data points together

19
Q

4 steps of the K-Means algorithm

A

1) initialisation: choose the number of clusters K and randomly select K initial cluster centres
2) assignment: each data point is assigned to the nearest cluster centre, measured by Euclidean distance
3) update: the cluster centres are recomputed as the mean of the points in each cluster
4) repeat 2 and 3 until the assignment to clusters doesn’t change
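
The four steps can be sketched as follows, assuming NumPy and initial centres drawn from the data points themselves (one common choice, not the only one). The function name `kmeans` and the `max_iter` safety cap are hypothetical additions.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Partition the rows of X into k clusters; returns (centres, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick K distinct data points as the initial cluster centres.
    centres = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each point to the nearest centre (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop once the assignment no longer changes.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: move each centre to the mean of its assigned points.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:            # keep the old centre if a cluster is empty
                centres[j] = members.mean(axis=0)
    return centres, labels
```

Because the result depends on the random initial centres, different `seed` values can give different clusterings.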

20
Q

K-Means algorithm is prone to

A

prone to finding local minima

21
Q

how do you find correct # of clusters

A

run the algorithm for different choices of K and use a heuristic to judge the quality of each clustering
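
One possible heuristic, sketched under assumptions: compute the within-cluster sum of squared distances for each K and look for the point where it stops dropping sharply (often called the elbow method). The function name `kmeans_inertia` and the use of several random restarts to reduce the effect of local minima are hypothetical choices, assuming NumPy.

```python
import numpy as np

def kmeans_inertia(X, k, n_restarts=5, max_iter=100, seed=0):
    """Lowest within-cluster sum of squares found over several random restarts."""
    rng = np.random.default_rng(seed)
    best = np.inf
    for _restart in range(n_restarts):
        centres = X[rng.choice(len(X), size=k, replace=False)].astype(float)
        for _step in range(max_iter):
            dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Recompute centres; an empty cluster keeps its previous centre.
            new_centres = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
                for j in range(k)
            ])
            if np.allclose(new_centres, centres):
                break
            centres = new_centres
        # Squared distance of every point to its nearest final centre.
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        best = min(best, float(np.sum(dists.min(axis=1) ** 2)))
    return best
```

Evaluating `{k: kmeans_inertia(X, k) for k in range(1, 8)}` gives one score per K to compare.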

22
Q

have __ sets of data to mitigate under/overfitting

A

training data to train the model
validation data to tune model hyperparameters
test data to test the performance of the model

23
Q

instead of a single validation set you could have

A

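repeated (k-fold) cross validation
The dataset is divided into k subsets (or folds).
The model is trained on k-1 folds and validated on the remaining one, rotating so that each fold is used for validation once.
This generates an approximate estimate of how well the classifier will do on unseen data.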
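
A minimal sketch of the fold rotation, assuming NumPy. The generator name `k_fold_indices` is hypothetical, and training and scoring the model on each split is left to the caller.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs, one per fold."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)     # shuffle before splitting
    folds = np.array_split(indices, k)       # k roughly equal folds
    for i in range(k):
        val_idx = folds[i]                   # one fold held out for validation
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx
```

Averaging the validation score over the k splits gives the estimate of performance on unseen data.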

24
Q

evaluation metrics for both classification and regression

A

classification: accuracy, ROC curve (illustrates the trade-off between the true positive rate (sensitivity) and the false positive rate)
regression: MSE and MAE
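
A minimal sketch of these metrics, assuming NumPy and binary labels encoded as 0/1. The function names are hypothetical, and `tpr_fpr` computes a single point of the ROC curve at one threshold; sweeping the threshold traces the full curve.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def tpr_fpr(y_true, scores, threshold):
    """True positive rate and false positive rate at a given score threshold."""
    y_true = np.asarray(y_true)
    predicted_pos = np.asarray(scores) >= threshold
    pos = y_true == 1
    neg = ~pos
    tpr = np.sum(predicted_pos & pos) / np.sum(pos)
    fpr = np.sum(predicted_pos & neg) / np.sum(neg)
    return float(tpr), float(fpr)

def mse(y_true, y_pred):
    """Mean squared error."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))
```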