Lecture 7: Chemometrics Part 2 Flashcards
What does PCA do if the first principal component doesn’t sufficiently describe the data?
- If the first PC is not sufficient to describe the spread of data, the calculation is repeated to find PC2
- The process is continued until all the variability within the dataset has been accounted for and modelled
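A minimal sketch of this repeated calculation, assuming mean-centred data and using NumPy's SVD to find the direction of largest variance in whatever is left over at each step. The dataset and the names `residual`, `loadings`, and `explained` are hypothetical and only illustrate the idea; real software may use a different algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical dataset: 20 samples x 4 variables with unequal spread
X = rng.normal(size=(20, 4)) * np.array([5.0, 2.0, 1.0, 0.5])

residual = X - X.mean(axis=0)        # mean-centre the data
total_ss = np.sum(residual**2)       # total variability to be explained

loadings = []    # direction of each PC
explained = []   # fraction of total variability captured by each PC

for _ in range(X.shape[1]):          # at most one PC per variable
    if np.sum(residual**2) <= 1e-10 * total_ss:
        break                        # nothing left to model
    # Direction of largest variance in what has not yet been modelled
    _, _, Vt = np.linalg.svd(residual, full_matrices=False)
    p = Vt[0]
    t = residual @ p                 # scores along this PC
    loadings.append(p)
    explained.append(np.sum(t**2) / total_ss)
    residual = residual - np.outer(t, p)   # remove this PC, then repeat

print(len(loadings), sum(explained))
```

Each pass explains less than the one before, and the fractions in `explained` add up to 1 once every PC has been extracted.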
PC2 is…
PC2 is orthogonal (at right angles) to PC1 and lies along the direction of the next largest variation within the data set.
What does PC2 look at?
The residual variation
Data =
Data = Model + Error
Data = S
Data = Structure + Noise
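As a rough illustration of Data = Structure + Noise, the sketch below builds a simulated dataset from two underlying components plus a small amount of random noise, then models it with two PCs. All values and names (`true_scores`, `true_loadings`, `k`) are assumptions made up for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
# Simulated data: two real sources of variation (structure) plus noise
true_scores = rng.normal(size=(50, 2)) * 3.0
true_loadings = rng.normal(size=(2, 6))
noise = 0.05 * rng.normal(size=(50, 6))
X = true_scores @ true_loadings + noise

Xc = X - X.mean(axis=0)                      # mean-centre
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2                                        # number of PCs kept in the model
T = U[:, :k] * s[:k]                         # scores
P = Vt[:k]                                   # loadings
model = T @ P                                # structure described by the PCs
error = Xc - model                           # residual (unmodelled) part

residual_fraction = np.sum(error**2) / np.sum(Xc**2)
print(residual_fraction)
```

Here `model + error` reproduces the centred data exactly, and the residual fraction is small because only noise is left outside the two-PC model.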
Structure = ?
The bits that are important, the bits that tell you where the sample groupings are
Noise
Variation that is not related to the samples themselves, e.g. random instrument noise, instrument fluctuations (drift), temperature changes, etc.
What is residual variance?
Residual variance is the variation left over after the model has been fitted. It is random noise and is not useful for modelling the data.
What happens if you model noise?
When you use the model for classification, you end up trying to force an unknown sample into a model that contains a lot of noise, which makes the classification less reliable.
What is the ideal model in PCA?
We want to create a model that captures as much of the information in the dataset as possible, in as few PCs as possible
What do the PCs describe?
The PCs will describe the structure (explained variance)
How do you determine the optimum number of PCs in PCA?
The optimum number of principal components is determined by inspecting the loadings plot, the scores plot and the explained variance plot.
How is a score generated for PCs?
- Each PC is a linear combination of the original variables
- The distance along each PC from the mean gives a sample its “score”
What does the distance between each sample and PC give you?
The distance between each sample and the principal component gives you the residuals (residual variation)
What does the distance between each sample and the mean of all the samples give you?
- The distance between each sample and the mean of all the samples, measured along the PC, gives you its score (this score can be positive or negative)
- Each sample will have a different score for each principle component.
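The two distances described above can be sketched for a single PC as follows. This assumes mean-centred data and a unit-length PC1 direction taken from NumPy's SVD; the dataset is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical dataset: 25 samples x 3 variables
X = rng.normal(size=(25, 3)) * np.array([3.0, 1.0, 0.5])

Xc = X - X.mean(axis=0)                      # put the mean at the origin
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Vt[0]                                  # unit vector along PC1

# Score: signed distance from the mean, measured along PC1
scores = Xc @ pc1

# Residual: perpendicular distance from each sample to the PC1 line
projection = np.outer(scores, pc1)
residuals = np.linalg.norm(Xc - projection, axis=1)

# Straight-line distance from each sample to the mean
dist_from_mean = np.linalg.norm(Xc, axis=1)

print(scores[:3], residuals[:3])
```

For every sample the score and the residual form the two shorter sides of a right-angled triangle, so `dist_from_mean**2` equals `scores**2 + residuals**2`. Repeating the projection with `Vt[1]` would give each sample a different score on PC2.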
Where is the majority of the information found in PCA?
The majority of the information is in the first few PCs, particularly if you have a good data set.
What does an explained variance / scree plot tell you?
An explained variance (or scree) plot shows how much of the variance each PC accounts for, which reveals the optimum number of PCs
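The numbers behind an explained variance plot can be computed as in the sketch below. The simulated data has two real sources of variation plus noise. The 95% cut-off is an arbitrary assumption for illustration only; in practice the choice is made by inspecting the plots.

```python
import numpy as np

rng = np.random.default_rng(6)
# Simulated data: two underlying components plus small random noise
T_true = rng.normal(size=(40, 2)) * np.array([4.0, 3.0])
Q, _ = np.linalg.qr(rng.normal(size=(6, 2)))     # two orthonormal directions
X = T_true @ Q.T + 0.1 * rng.normal(size=(40, 6))

Xc = X - X.mean(axis=0)
_, s, _ = np.linalg.svd(Xc, full_matrices=False)

explained = s**2 / np.sum(s**2)      # fraction of variance per PC
cumulative = np.cumsum(explained)    # running total, as plotted

threshold = 0.95                     # assumed cut-off, not a fixed rule
n_pcs = int(np.searchsorted(cumulative, threshold)) + 1

for i, (e, c) in enumerate(zip(explained, cumulative), start=1):
    print(f"PC{i}: {e:.3f} explained, {c:.3f} cumulative")
print("PCs needed:", n_pcs)
```

The explained variance falls away sharply after the second PC, so the later PCs are mostly describing noise.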
What will happen if your samples have lots of variance?
If you have lots of samples with lots of variance between them, you’ll need more PCs to accurately account for the variation within the data set.
What is a scores plot?
- The scores plot is a map of the samples
- Samples closest together are the most similar
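A small sketch of why the scores plot works as a map: two made-up groups of samples are projected onto PC1 and PC2, and distances are then measured in that score space. The group values and the helper `score_distance` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
# Two hypothetical groups of 5 samples, measured on 4 variables
group_a = np.array([1.0, 1.0, 1.0, 1.0]) + 0.1 * rng.normal(size=(5, 4))
group_b = np.array([3.0, 3.0, 1.0, 1.0]) + 0.1 * rng.normal(size=(5, 4))
X = np.vstack([group_a, group_b])

Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt[:2].T               # (PC1, PC2) coordinates of each sample

def score_distance(i, j):
    """Distance between samples i and j on the PC1 vs PC2 scores plot."""
    return float(np.linalg.norm(scores[i] - scores[j]))

within = score_distance(0, 1)        # two samples from group A
between = score_distance(0, 5)       # one from group A, one from group B
print(within, between)
```

Samples from the same group sit close together on the map, while samples from different groups sit far apart.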
What happens to the score of a sample depending on the PC used?
Each sample will have a different score for every PC, so the map will change depending on the PCs chosen.
What is an advantage of a scores plot?
You can separate data points that would overlap on other PCs
Why is PCA better than CA?
It tells you why samples group together.
What are loadings?
Loadings are the link between the scores plot and the chemistry of the samples
What is a loadings plot?
- A loadings plot is a map of variables
- Loadings plots are the link between the scores plot and the chemistry of the samples
- Loadings plots relate directly to the spectra, and therefore back to the chemical composition, because the spectra give you the molecular fingerprint
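To illustrate how loadings point back to the variables, the sketch below simulates simple spectra in which one peak grows with concentration, then checks which variable has the largest loading on PC1. The wavenumbers, peak shape and noise level are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(5)
# Hypothetical spectral variables (wavenumbers) and a single peak shape
wavenumbers = np.array([1000, 1200, 1400, 1600, 1800])
peak = np.array([0.0, 0.2, 1.0, 0.2, 0.0])       # peak centred at 1400
baseline = np.full(5, 0.5)

# 30 simulated spectra: the peak height varies with concentration
concentration = rng.uniform(0.0, 1.0, size=30)
spectra = baseline + np.outer(concentration, peak)
spectra = spectra + 0.01 * rng.normal(size=(30, 5))

Xc = spectra - spectra.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1_loadings = Vt[0]                 # one loading per variable

# Variable contributing most to PC1 (the sign of a loading is arbitrary)
top = int(np.argmax(np.abs(pc1_loadings)))
print(wavenumbers[top], np.round(pc1_loadings, 2))
```

The largest loading falls on the variable where the peak sits, which is how a loadings plot ties the sample groupings seen in the scores plot back to a feature in the spectra.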