Chemometrics Flashcards
Define the term principal component regression
- regresses Y on the principal component scores of X rather than on X itself, and so finds relationships between two matrices (X and Y)
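A minimal sketch of principal component regression, assuming NumPy and simulated data. The array sizes, the choice of k = 3 components and all variable names are illustrative assumptions, not part of the cards.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(40, 8))                          # hypothetical X: 40 samples, 8 variables
y = X @ rng.normal(size=8) + 0.1 * rng.normal(size=40)  # hypothetical response

# Mean-centre X and y before modelling.
x_mean, y_mean = X.mean(axis=0), y.mean()
Xc = X - x_mean

# PCA of X via the singular value decomposition.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 3                      # number of principal components kept (assumed)
P = Vt[:k].T               # loadings, 8 x k
T = Xc @ P                 # scores, 40 x k

# Regress y on the scores instead of on X itself.
q, *_ = np.linalg.lstsq(T, y - y_mean, rcond=None)

b = P @ q                            # coefficients expressed in the original variables
y_hat = (X - x_mean) @ b + y_mean    # fitted values
```

Keeping only a few components is what separates this from ordinary least squares on X: the regression sees k score columns instead of all 8 variables.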
What are latent variables
- underlying factors that are not directly measured but that define the important variation/information in the data
Define the term variance
- a measure of the spread of the data
- the square of the standard deviation
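A small check of the relationship between variance and standard deviation, using Python's standard `statistics` module. The measurement values are made up for illustration, and the population (divide by n) versions are used here as an assumption.

```python
import statistics

# Hypothetical measurements, chosen only for illustration.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

std_dev = statistics.pstdev(values)       # population standard deviation
variance = statistics.pvariance(values)   # population variance

# The variance equals the standard deviation squared.
print(std_dev, variance, std_dev ** 2)
```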
What are covariance and correlation
- measures of how two variables vary together
- formally, correlation is normalised covariance (the covariance divided by the product of the two standard deviations)
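A short sketch of correlation as normalised covariance, assuming NumPy. The paired values are invented for illustration.

```python
import numpy as np

# Hypothetical paired measurements, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Sample covariance between x and y (off-diagonal of the 2 x 2 covariance matrix).
cov_xy = np.cov(x, y, ddof=1)[0, 1]

# Normalise by both standard deviations to get the correlation.
corr_manual = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

# NumPy's built-in correlation coefficient, for comparison.
corr_numpy = np.corrcoef(x, y)[0, 1]

print(cov_xy, corr_manual, corr_numpy)
```

Rescaling y (say, changing its units) changes the covariance but leaves the correlation unchanged, which is the point of the normalisation.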
What is the matrix equation and what do the various components represent
- X = TP^T + E
- T – scores (samples)
- P^T – loadings (variables)
- E – residuals of unfitted data or noise
- TP^T – contains the structure or information in the data
- X – the raw data table
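One way to obtain T, P and E is through the singular value decomposition. The sketch below assumes NumPy, simulated data, mean-centring of X before the decomposition and k = 2 components; all of these are illustrative choices rather than something the cards specify.

```python
import numpy as np

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(20, 5))      # hypothetical data: 20 samples, 5 variables
X = X_raw - X_raw.mean(axis=0)        # mean-centred data (assumed preprocessing)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2                    # number of components kept (assumed)
T = U[:, :k] * s[:k]     # scores, one row per sample (20 x 2)
P = Vt[:k].T             # loadings, one row per variable (5 x 2)
E = X - T @ P.T          # residuals: what the k components leave unexplained

# The model equation holds by construction: X = T P^T + E.
print(np.allclose(X, T @ P.T + E))
```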
What are PCs
- latent variables, each describing a new direction in n-dimensional space that explains a large amount of the variation in the data
- usually a small number of PCs (far fewer than the original variables) is enough to capture the relevant information in the data
What are outliers
- samples that are numerically distant from the rest of the data
What are score plots
- scores for two PCs plotted against each other, e.g. PC1 vs. PC3
- provide information about the objects/samples
What are loading plots
- loadings for two PCs plotted against each other, e.g. PC1 vs. PC3
- provide information about the variables
What is a biplot
- an overlay of scores and loadings
- information about the relationship between variables and objects
What are some of the reasons that outliers occur
- measurement error
- labelling error
- noise
- unique/extreme sample (which may be interesting or unwanted)
What are some ways of spotting and detecting outliers
- score plot
- residuals
- Hotelling's T^2
- validation
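As an illustration of the Hotelling's T^2 item, the sketch below computes T^2 per sample from PCA scores as the sum of squared scores divided by the score variances. It assumes NumPy, simulated data with one deliberately extreme sample, and k = 2 components; no statistical limit is computed here.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))        # hypothetical data: 30 samples, 4 variables
X[0] = [8.0, 8.0, 8.0, 8.0]         # hypothetical extreme sample planted in row 0
n = X.shape[0]

Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2                               # number of components kept (assumed)
T = U[:, :k] * s[:k]                # scores

# Hotelling's T^2: squared scores scaled by the variance of each component.
score_var = T.var(axis=0, ddof=1)
t2 = np.sum(T ** 2 / score_var, axis=1)

print(int(np.argmax(t2)))           # index of the sample with the largest T^2
```

A sample with a large T^2 is extreme within the model; whether it is also outside the model is judged from its residual.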
What are the two kinds of outliers?
1) extreme but in model
2) outside the model
How are outliers found from scores
- by inspecting score plots
- useful because peculiar behaviour can be easily spotted
How are outliers spotted using residuals
- samples or variables that are not consistent with the model, i.e. that follow a different pattern than the model describes, get high residuals
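A minimal sketch of the residual check, assuming NumPy and simulated calibration data built from two underlying factors. The helper name `q_residual` and both test samples are hypothetical; the "odd" sample is made by adding a direction that the two-component model does not span.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical calibration data: 25 samples, 6 variables, two factors plus small noise.
scores_true = rng.normal(size=(25, 2))
loadings_true = rng.normal(size=(2, 6))
X_cal = scores_true @ loadings_true + 0.05 * rng.normal(size=(25, 6))

# Two-component PCA model of the calibration data.
mean = X_cal.mean(axis=0)
U, s, Vt = np.linalg.svd(X_cal - mean, full_matrices=False)
P = Vt[:2].T

def q_residual(x):
    """Sum of squared residuals after projecting x onto the model."""
    xc = x - mean
    e = xc - xc @ P @ P.T
    return float(e @ e)

# A sample that follows the calibration pattern.
x_ok = np.array([0.5, -1.0]) @ loadings_true
# The same sample plus a pattern orthogonal to the model loadings.
x_odd = x_ok + 3.0 * Vt[-1]

print(q_residual(x_ok), q_residual(x_odd))
```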
What is a model
- a summary of your best knowledge of a system at the time of investigation
- a good model can be used to predict future events with confidence
What is validation
- means testing whether a model is general or only descriptive of the calibration samples
- can be tested by using only part of the samples to build the model and predicting the rest
Describe the 3 types of validation
1) statistical: fit
- confidence intervals
- not used in chemometrics
2) empirical I: fit and prediction
- cross validation or test set validation
- often used in chemometrics
3) empirical II: fit to real life
- external information, interview validation
(Primarily PCA)
How is cross validation performed
- by systematically leaving some of the original samples out of the model and seeing how well they are predicted
- the simplest method is leave-one-out: with n samples, n models are built, each leaving a different sample out
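Leave-one-out cross validation can be sketched as a loop that builds one model per left-out sample. The example assumes NumPy, simulated data and an ordinary least-squares model standing in for whatever model is being validated; `rmsecv` is an illustrative name for the root mean square error of cross validation.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20
X = rng.normal(size=(n, 3))                                     # hypothetical predictors
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=n)   # hypothetical response

errors = []
for i in range(n):
    keep = np.arange(n) != i                     # leave sample i out
    A_train = np.column_stack([np.ones(n - 1), X[keep]])
    coef, *_ = np.linalg.lstsq(A_train, y[keep], rcond=None)

    # Predict the left-out sample with the model built without it.
    y_pred = np.concatenate([[1.0], X[i]]) @ coef
    errors.append(y[i] - y_pred)

# n models were built; summarise their prediction errors.
rmsecv = float(np.sqrt(np.mean(np.square(errors))))
print(rmsecv)
```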
What are the benefits of preprocessing
- models with lower prediction error
- simpler models that are more robust and/or easier to interpret
What is mean-centring
- the mean value of each variable is subtracted from that variable's values
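In NumPy terms, mean-centring is a column-wise subtraction. The small data table below is made up for illustration.

```python
import numpy as np

# Hypothetical data table: 3 samples (rows), 2 variables (columns).
X = np.array([[1.0, 200.0],
              [2.0, 220.0],
              [3.0, 240.0]])

# Subtract each column's mean from that column.
X_centred = X - X.mean(axis=0)

print(X_centred)
# [[ -1. -20.]
#  [  0.   0.]
#  [  1.  20.]]
```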
What is scaling and what two categories does it fall under
- multiplying (some) variables by a factor so that their variances get similar numerical values
- autoscaling and block-scaling are the two types of scaling
What is autoscaling
- each column is divided by its standard deviation (usually after mean-centring) so that every variable gets unit variance
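A short sketch of autoscaling, assuming NumPy, a made-up data table, the sample standard deviation (ddof=1) and mean-centring before the division, as is common practice.

```python
import numpy as np

# Hypothetical data table: 3 samples (rows), 2 variables on very different scales.
X = np.array([[1.0, 200.0],
              [2.0, 220.0],
              [3.0, 240.0]])

# Mean-centre each column, then divide by its standard deviation.
X_auto = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

print(X_auto)
# [[-1. -1.]
#  [ 0.  0.]
#  [ 1.  1.]]
```

After autoscaling both columns have variance 1, so neither dominates because of its units.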
What is block-scaling
- giving blocks of data equal or modified importance
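One common way to give blocks equal importance, shown here as an assumption rather than the only definition, is to autoscale every variable and then divide each block by the square root of its number of variables. The sketch uses NumPy and simulated blocks; the helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
block_a = rng.normal(size=(15, 2))     # hypothetical block: 2 process variables
block_b = rng.normal(size=(15, 50))    # hypothetical block: 50 spectral variables

def autoscale(X):
    """Mean-centre each column and scale it to unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def block_scale(X):
    """Autoscale, then shrink the block so its total variance is 1."""
    return autoscale(X) / np.sqrt(X.shape[1])

X = np.hstack([block_scale(block_a), block_scale(block_b)])

# Total variance contributed by each block after scaling.
var_a = X[:, :2].var(axis=0, ddof=1).sum()
var_b = X[:, 2:].var(axis=0, ddof=1).sum()
print(var_a, var_b)
```

Without the block factor, the 50-variable block would carry 25 times the total variance of the 2-variable block.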
What are the consequences of mean-centring
- allows the PCs to capture the relevant variation in the data
- without mean-centring, PC1 commonly describes only the offset of the data from zero (the mean) rather than the variation between samples
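The effect on PC1 can be seen in a small simulation. The sketch assumes NumPy and made-up two-variable data with a large offset from zero and its real variation along one known direction; `first_loading` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical data: large offset from zero, real variation along `direction`.
direction = np.array([1.0, -1.0]) / np.sqrt(2)
t = rng.normal(size=50)
X = np.array([100.0, 50.0]) + np.outer(t, direction) + 0.05 * rng.normal(size=(50, 2))

def first_loading(M):
    """Loading vector of the first principal component of M."""
    _, _, Vt = np.linalg.svd(M, full_matrices=False)
    return Vt[0]

p_raw = first_loading(X)                        # no mean-centring
p_centred = first_loading(X - X.mean(axis=0))   # with mean-centring

mean_dir = X.mean(axis=0) / np.linalg.norm(X.mean(axis=0))

# Absolute cosines: values near 1 mean the two directions coincide.
print(abs(p_raw @ mean_dir))        # uncentred PC1 follows the mean
print(abs(p_centred @ direction))   # centred PC1 follows the real variation
```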
What are the consequences of correctly performed scaling
- gives each variable equal or comparable importance
- if scaling is not done, variables with high numerical values may be given inappropriate importance
Define the term principal component analysis
- describes the correlation structure of X
- uncovers the underlying variation in the data (X)
- provides a summary showing how observations are related and whether there are any deviating observations or groups of observations in the data