Lecture 7: Chemometrics Part 2 Flashcards
What does PCA do if the first principal component doesn’t sufficiently describe the data?
- If the first PC is not sufficient to describe the spread of data, the calculation is repeated to find PC2
- The process is continued until all the variability within the dataset has been accounted for and modelled
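The "repeat the calculation" idea can be sketched as follows. This is only a minimal illustration, assuming mean-centred data and using NumPy's SVD to find each direction; the data, the helper name first_pc and the variable names are all hypothetical and not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 50 samples, 4 variables with decreasing spread
X = rng.normal(size=(50, 4)) @ np.diag([5.0, 2.0, 1.0, 0.2])
Xc = X - X.mean(axis=0)  # mean-centre each variable

def first_pc(data):
    """Unit direction of largest variation in mean-centred data."""
    _, _, vt = np.linalg.svd(data, full_matrices=False)
    return vt[0]

residual = Xc.copy()
components = []
for _ in range(Xc.shape[1]):
    p = first_pc(residual)                # direction of largest remaining variation
    t = residual @ p                      # scores along that direction
    residual = residual - np.outer(t, p)  # remove what this PC explains
    components.append(p)

components = np.array(components)
```

Once as many PCs as there are variables have been extracted, nothing is left in residual, which is the "all the variability has been accounted for" end point. In practice you would stop much earlier.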
PC2 is…
PC2 is orthogonal (at right angles) to PC1 and points in the direction of the next largest variation within the data set.
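A quick numerical check of the orthogonality, as a sketch only: it assumes the PCs are taken as eigenvectors of the covariance matrix, and the two-variable data set is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: two correlated variables, 200 samples
x = rng.normal(size=200)
X = np.column_stack([x, 0.5 * x + 0.3 * rng.normal(size=200)])
Xc = X - X.mean(axis=0)

cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]       # re-order: largest variance first
eigvals = eigvals[order]
pc1 = eigvecs[:, order[0]]
pc2 = eigvecs[:, order[1]]

dot = float(pc1 @ pc2)  # a dot product of zero means the directions are orthogonal
```

Here eigvals[0] is the variance along PC1 and eigvals[1] the smaller variance along PC2.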
What does PC2 look at?
The residual variation
Data =
Data = Model + Error
Data = S
Data = Structure + Noise
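In conventional matrix notation the same idea is often written as below. The symbols are assumptions not used in the lecture: X is the mean-centred data matrix, T the scores, P the loadings and E the residuals.

```latex
X \;=\; \underbrace{T P^{\mathsf{T}}}_{\text{structure (model)}} \;+\; \underbrace{E}_{\text{noise (error)}}
```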
Structure = ?
The bits that are important, the bits that tell you where the sample groupings are
Noise
Could be random noise from the instrument, fluctuations in the instrument, temperature changes, etc.
What is residual variance?
Residual variance is the variation the model leaves unexplained, ideally just random noise; it is not useful or helpful for modelling the data.
What happens if you model noise?
When you use the model for classification, you are trying to force an unknown sample into a model that contains a lot of noise, which makes the classification less reliable.
What is the ideal model in PCA?
We want to create a model that captures as much of the information in the dataset as possible, in as few PCs as possible
What do the PCs describe?
The PCs will describe the structure (explained variance)
How do you determine the optimum number of PCs in PCA?
The optimum number of principal components is determined by inspecting the loadings plot, the scores plot and the explained variance plot.
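The numbers behind an explained variance plot can be computed as in the sketch below. The simulated data (two structure factors plus a little noise) and the 95% cut-off are illustrative assumptions only; the lecture recommends inspecting the plots rather than applying a fixed threshold.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical data: 2 underlying structure factors + small noise, 6 variables
true_scores = rng.normal(size=(100, 2)) * np.array([4.0, 2.0])
true_loadings = rng.normal(size=(2, 6))
X = true_scores @ true_loadings + 0.1 * rng.normal(size=(100, 6))
Xc = X - X.mean(axis=0)

# Squared singular values are proportional to the variance along each PC
s = np.linalg.svd(Xc, compute_uv=False)
explained = s**2 / np.sum(s**2)   # fraction of variance per PC
cumulative = np.cumsum(explained)  # running total, what the plot shows

threshold = 0.95  # assumed cut-off, for illustration only
n_pcs = int(np.searchsorted(cumulative, threshold) + 1)
```

Plotting cumulative against the PC number gives the explained variance plot; the point where the curve flattens suggests that the later PCs are mostly modelling noise.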
How is a score generated for PCs?
- Each PC is a linear combination of the original variables
- The distance along each PC from the mean gives a sample its “score”
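As a minimal sketch of both points, the scores on PC1 can be computed by projecting the mean-centred samples onto the PC1 direction. The five-sample data set and the variable names are hypothetical.

```python
import numpy as np

# Hypothetical data: 5 samples, 2 variables
X = np.array([[2.0, 1.9],
              [4.0, 4.2],
              [6.0, 5.8],
              [8.0, 8.1],
              [10.0, 10.0]])
mean = X.mean(axis=0)
Xc = X - mean  # measure everything from the mean

_, _, vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = vt[0]  # loadings: the weights of the linear combination of variables

# Score = signed distance along PC1 from the mean, one value per sample
scores_pc1 = Xc @ pc1
```

Samples on opposite sides of the mean get scores of opposite sign, and the scores on any one PC sum to zero because the data were mean-centred.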
What does the distance between each sample and the PC give you?
The distance between each sample and the principal component gives you the residuals (residual variation)
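The residual for a one-PC model can be sketched as the perpendicular distance from each sample to the PC1 line. The data and names below are hypothetical, and the data are assumed to be mean-centred.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical data: 30 samples, 3 variables
X = rng.normal(size=(30, 3)) @ np.diag([3.0, 1.0, 0.3])
Xc = X - X.mean(axis=0)

_, _, vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = vt[0]

along_pc1 = Xc @ pc1                              # position along PC1
residual_vectors = Xc - np.outer(along_pc1, pc1)  # part of each sample PC1 does not explain
residual_dist = np.linalg.norm(residual_vectors, axis=1)  # perpendicular distance to PC1

dist_from_mean = np.linalg.norm(Xc, axis=1)
```

By Pythagoras, dist_from_mean squared equals along_pc1 squared plus residual_dist squared for every sample, so whatever is not captured along the PC ends up in the residual.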
What does the distance between each sample and the mean of all the samples give you?
- The distance between each sample and the mean of all the samples, measured along the PC, gives you your score (this score can be positive or negative)
- Each sample will have a different score for each principle component.
Where is the majority of the information found in PCA?
The majority of the information is in the first few PCs, particularly if you have a good data set.