PCA Flashcards
1
Q
Why do we need data reduction techniques?
A
- Because of mammoth datasets – these datasets contain huge numbers of observations and variables
- Often only a few important features contribute most of the useful information in the dataset
2
Q
What does data reduction allow?
A
- Data reduction allows us to extract the necessary information from a huge array of data and remove redundant data
- Removes noise
3
Q
2 types of data reduction techniques
A
- Dimensionality reduction – reduce the number of input variables in a dataset
- Numerosity reduction – reduce data volume by using suitable forms of data representation
4
Q
Examples of dimensionality reduction
A
- Wavelet transform
- Attribute subset selection
- Principal component analysis
5
Q
Examples of numerosity reduction
A
- Histogram
- Sampling
- Clustering
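A minimal NumPy sketch of two of these ideas; the data, bin count, and sample size are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=10, size=100_000)   # large 1-D dataset

# Histogram: summarise 100,000 values with 20 bin counts
counts, edges = np.histogram(data, bins=20)

# Sampling: keep a random 1% subset instead of every observation
sample = rng.choice(data, size=1_000, replace=False)

print(counts.sum(), sample.shape)   # 100000 (1000,)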
6
Q
Wavelet transform
A
- Can be applied to ECG signals
- Helps convert the ECG signal into a form that makes it much easier for QRS peak-finding algorithms to detect the peaks (see the sketch below)
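A hedged sketch of a multilevel wavelet decomposition using the PyWavelets library; the synthetic signal, the 'db4' wavelet, and the decomposition level are assumptions for illustration, not a prescribed ECG pipeline:

import numpy as np
import pywt  # PyWavelets

t = np.linspace(0, 1, 1024)
signal = np.sin(2 * np.pi * 5 * t) + 0.3 * np.random.randn(1024)  # stand-in for an ECG trace

# Multilevel discrete wavelet decomposition: one approximation band plus
# several detail bands where sharp features (like QRS peaks) stand out
coeffs = pywt.wavedec(signal, 'db4', level=4)
for i, band in enumerate(coeffs):
    print(f"band {i}: {len(band)} coefficients")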
7
Q
Attribute subset selection
A
- Find a minimum set of attributes that yields the same result as the full set
- Reduces cost, because there are fewer variables
- Makes pattern recognition easier
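A minimal sketch of attribute subset selection using scikit-learn's SelectKBest; the iris dataset and the choice k=2 are illustrative assumptions:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)           # 4 attributes
selector = SelectKBest(score_func=f_classif, k=2)
X_reduced = selector.fit_transform(X, y)    # keep the 2 most informative

print(X.shape, "->", X_reduced.shape)       # (150, 4) -> (150, 2)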
8
Q
Principal component analysis
A
- A variable reduction technique
- Reduces a larger set of variables into a smaller set of 'artificial variables', called principal components, that account for most of the variance in the original variables
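A minimal sketch of this with scikit-learn's PCA; the dataset and n_components=2 are illustrative choices:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)            # 4 original variables
pca = PCA(n_components=2)
scores = pca.fit_transform(X)                # 2 'artificial variables'

print(scores.shape)                          # (150, 2)
print(pca.explained_variance_ratio_.sum())   # share of variance retained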
9
Q
PCA can be used to solve 3 major problems:
A
- Removing unrelated variables
- Reducing redundancy in a set of variables
- Removing multicollinearity
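To see the multicollinearity point concretely, a sketch with synthetic, nearly collinear data (the data itself is an assumption for illustration):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=500)   # nearly collinear with x1
X = np.column_stack([x1, x2])

scores = PCA(n_components=2).fit_transform(X)
print(np.corrcoef(X, rowvar=False).round(2))       # large off-diagonal entries
print(np.corrcoef(scores, rowvar=False).round(2))  # ~identity: no multicollinearity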
10
Q
Assumptions for PCA
A
- You have multiple continuous variables
- There are linear relationships between the variables
- There are no significant outliers
- The sample size is large enough for PCA to produce a reliable result
11
Q
Steps of principal component analysis:
A
- Calculate the covariance matrix
- Compute the eigenvectors and eigenvalues of the covariance matrix to identify the principal components
- Choose which components to keep (the ones with high eigenvalues)
- Reorient the data from the original axes onto the axes represented by the retained principal components, i.e. project the data onto their eigenvectors (see the sketch below)
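A step-by-step NumPy sketch of these four stages; the random data and the choice to keep k=1 component are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
Xc = X - X.mean(axis=0)              # centre the data before projecting

cov = np.cov(Xc, rowvar=False)       # 1. covariance matrix
vals, vecs = np.linalg.eigh(cov)     # 2. eigenvalues and eigenvectors

order = np.argsort(vals)[::-1]       # 3. rank by eigenvalue, keep the top k
k = 1
top = vecs[:, order[:k]]

scores = Xc @ top                    # 4. reorient: project onto the components
print(scores.shape)                  # (200, 1)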
12
Q
Covariance matrix
A
- A square matrix that shows the covariance between each pair of variables
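A minimal NumPy sketch; the two-variable data is made up for illustration:

import numpy as np

X = np.array([[2.0, 4.1],
              [3.0, 6.2],
              [4.0, 7.9],
              [5.0, 10.1]])          # columns = variables

cov = np.cov(X, rowvar=False)        # 2x2 square matrix
print(cov)   # diagonal: variances; off-diagonal: covariance of the pair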
13
Q
Principal component
A
- To perform a PCA you first find the axis of greatest variance, which is the line of best fit through the data
- This line is the first principal component
- You then project the data points onto the first principal component
- Before, each person was represented by two variables (lung function and oxygen); now each is represented by a single principal component score
- The second principal component accounts for the next highest variance (see the sketch below)
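A sketch of this projection in NumPy, mirroring the two-variable example above; the generated values are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(2)
lung = rng.normal(size=100)                       # stand-in for lung function
oxygen = 0.8 * lung + 0.2 * rng.normal(size=100)  # stand-in for oxygen
X = np.column_stack([lung, oxygen])
Xc = X - X.mean(axis=0)

cov = np.cov(Xc, rowvar=False)
vals, vecs = np.linalg.eigh(cov)
pc1 = vecs[:, np.argmax(vals)]       # axis of greatest variance (line of best fit)

scores = Xc @ pc1                    # each person: one number instead of two
print(scores.shape)                  # (100,)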
14
Q
Eigenvalues and eigenvectors
A
- A matrix represents a linear transformation, meaning it contains a set of rules for moving data points around
- One type of linear transformation is shearing, where data points are sheared by multiplying them by the matrix
- An eigenvector corresponds to a direction that the transformation leaves unchanged
- An eigenvalue tells you how far along that line a data point has moved, so it corresponds to distance
- For a covariance matrix, each eigenvalue represents the total amount of variance that can be explained by the corresponding principal component
- Ranking the eigenvectors by their eigenvalues, highest to lowest, gives the principal components in order of significance (see the sketch below)
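A short NumPy sketch of eigenvectors and eigenvalues for a shear matrix; the specific matrix is an illustrative assumption:

import numpy as np

shear = np.array([[1.0, 1.0],
                  [0.0, 1.0]])       # shears points horizontally

vals, vecs = np.linalg.eig(shear)
print(vals)        # both 1: no stretching along the eigenvector direction
print(vecs[:, 0])  # ~[1, 0]: the x-axis direction is unchanged by the shear

# For a covariance matrix, each eigenvalue is the variance explained along
# its eigenvector; sorting high-to-low orders the principal components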
15
Q
Rotations
A
- The goal of rotation is to improve the interpretability of the factor solution by reaching a simple structure
- Rotation makes the results of a PCA easier to interpret
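A hedged sketch of a varimax rotation using scikit-learn's FactorAnalysis (rotation='varimax' requires scikit-learn >= 0.24; the dataset and number of factors are illustrative assumptions):

from sklearn.datasets import load_iris
from sklearn.decomposition import FactorAnalysis

X, _ = load_iris(return_X_y=True)
fa = FactorAnalysis(n_components=2, rotation='varimax').fit(X)

# After rotation each variable tends to load strongly on one factor,
# which is the 'simple structure' that aids interpretation
print(fa.components_.round(2))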