Lecture 6&7 - Chemometrics Flashcards
what is chemometrics
a multivariate statistical analysis that is computationally intensive and is applied to chemical systems or processes to find patterns and trends in the data
what does multivariate mean
analyses multiple variables
what does computationally intensive mean
the need for a computer and algorithms
name 6 things chemometrics can help us do
- reduce complex datasets
- identify and quantify sample groups
- optimise our experimental parameters
- identify covariance and pick out important variables
- give reproducible measures of data
- visualise the data (a picture is better)
what is meant by covariance
which parts of the data are associated/not independent
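A minimal sketch of how covariance can be calculated for two variables. The peak intensities are hypothetical values invented for illustration, and the function name `covariance` is not taken from the lecture.

```python
def covariance(xs, ys):
    """Sample covariance of two equal-length sequences (n - 1 denominator)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical peak intensities measured on five samples
peak_a = [1.0, 2.0, 3.0, 4.0, 5.0]
peak_b = [2.0, 4.0, 6.0, 8.0, 10.0]  # rises whenever peak_a rises
peak_c = [3.0, 1.0, 3.0, 1.0, 3.0]   # no linear association with peak_a

cov_ab = covariance(peak_a, peak_b)  # large positive value: associated
cov_ac = covariance(peak_a, peak_c)  # approximately zero: no linear association
print(cov_ab, cov_ac)
```

A covariance far from zero suggests the two variables are not independent; a value near zero suggests no linear association.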
what effect does chemometrics have on subjectivity
reduces it in the analysis of data but does not eliminate it completely, as humans still interpret the data
what can chemometrics reveal that may not have been noticed
underlying/ not obvious trends between variables
what is the main aim of chemometrics
to maximise output and quality with minimal cost
chemometrics has been used in forensic science since 2009 - what benefits has this provided
improved efficiency in forensic workflow
better use of resources for forensic purposes
how is chemometrics applicable to forensic science (6)
provides a statistical framework
replaces terms such as ‘unique’ and ‘match’ that are used without statistical support (reducing subjectivity)
can counteract bias
quicker than manual data interpretation
don’t need an expert but someone does need to be able to interpret the output data
can predict trace behaviour
what can be identified using multivariate analysis that may not be seen in univariate analysis
outliers
what is univariate analysis
analysis that only considers one variable
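A minimal sketch of why an outlier can be invisible to univariate analysis. It assumes the Mahalanobis distance as the multivariate measure, which the lecture does not name, and all data values are hypothetical.

```python
import math

def mahalanobis_2d(point, xs, ys):
    """Distance of `point` from the centre of (xs, ys), scaled by their covariance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    syy = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - mean_x, point[1] - mean_y
    # d^2 = [dx dy] * inverse(covariance matrix) * [dx dy]^T, written out for 2x2
    d_squared = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(d_squared)

# Hypothetical reference samples: two strongly correlated variables
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 5.8, 7.2, 7.9]

typical = (3.0, 3.0)  # follows the trend
suspect = (3.0, 6.0)  # each value is within the normal range, but the pair breaks the trend

d_typical = mahalanobis_2d(typical, xs, ys)
d_suspect = mahalanobis_2d(suspect, xs, ys)
print(round(d_typical, 2), round(d_suspect, 2))
```

Looking at either variable alone, the suspect sample sits inside the observed range. Only when both variables are considered together does its distance from the rest of the data become large.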
give an example of where multivariate analysis may be beneficial in forensic analysis and the variables that could be considered
pollen dispersion
considering time of year and weather
fingermarks
considering how much the person sweats and the weather conditions
what are the 4 broad categories of chemometrics
experiment design (DOE)
exploratory data analysis (EDA - what is the data showing me)
classification
regression
what does the DOE (design of experiment) affect in forensic science
evidence collection, storage, analysis instrument selection and optimisation
what can the DOE part of chemometrics be used to streamline in forensics in the future
efficiency, quality and reproducibility by establishing optimised workflows
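One common DOE approach is a full factorial design, where every combination of factor levels is run once. The lecture does not say which design it has in mind, so this is only an illustrative sketch; the factor names and levels are hypothetical instrument settings.

```python
from itertools import product

# Hypothetical instrument settings to optimise (names and levels are illustrative)
factors = {
    "laser_power_mW": [10, 50],
    "exposure_s": [1, 5, 10],
    "accumulations": [1, 3],
}

names = list(factors)
# Full factorial design: one run for every combination of levels
runs = [dict(zip(names, levels)) for levels in product(*factors.values())]

for run_number, settings in enumerate(runs, start=1):
    print(run_number, settings)
```

With 2 x 3 x 2 levels this gives 12 runs, which can then be compared to find the best-performing settings.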
what does the regression part of chemometrics involve
a version of calibration curves based on a linear y = mx + c relationship, which maps the effect of multiple independent variables
we can make educated predictions based on the curve
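A minimal sketch of the simplest case, a calibration line with one independent variable fitted by least squares. Multivariate regression extends the same idea with one coefficient per variable. The calibration data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares slope m and intercept c for y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c

# Hypothetical calibration standards: concentration (x) against instrument response (y)
concentration = [0.0, 1.0, 2.0, 3.0, 4.0]
response = [0.1, 2.1, 4.1, 6.1, 8.1]

m, c = fit_line(concentration, response)

# Prediction: rearrange y = m*x + c to estimate the concentration of an unknown
unknown_response = 5.1
predicted_concentration = (unknown_response - c) / m
print(m, c, predicted_concentration)
```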
What happens in the EDA part of chemometrics
data is reduced by an algorithm that looks at how variables correlate
identifies groupings of samples in complex datasets
visualises trends
making the data more manageable
EDA is an unsupervised technique - what does this mean
it explores the data without any prior assumptions or knowledge of the samples (reducing human bias and subjectivity)
what is a supervised technique
the building of classification rules for grouping samples together, using samples whose groups are already known - done following EDA
name the two most commonly used EDA techniques
CA = cluster analysis
PCA = principal component analysis
name 3 features of cluster analysis (CA)
unsupervised technique
samples are grouped into clusters based on a measure of similarity (a calculated distance)
the output is a dendrogram
what are the two types of cluster analysis
ACA - agglomerative cluster analysis
HCA - hierarchical cluster analysis
what is agglomerative cluster analysis (ACA)
each sample starts as its own cluster and the most similar clusters are merged step by step (bottom-up)
what is hierarchical cluster analysis (HCA)
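all samples start in one cluster, which is split step by step into individual samples (top-down; this approach is more commonly called divisive clustering, and ‘hierarchical’ usually covers both directions)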
in cluster analysis which samples are the most similar
the ones joined closest to zero on the dendrogram’s distance axis
how are the number of clusters decided in cluster analysis
the analyst decides on the stopping rules - where the subjectivity is introduced
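A minimal sketch of bottom-up clustering with a stopping rule. It assumes Euclidean distance and single linkage (distance between the nearest members of two clusters), neither of which is specified in the lecture. The sample coordinates and the `stop_distance` parameter are hypothetical.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def single_linkage(points, stop_distance):
    """Merge the two closest clusters repeatedly until the closest pair is
    further apart than stop_distance (the analyst's stopping rule)."""
    clusters = [[i] for i in range(len(points))]  # every sample starts alone
    merges = []  # (distance, members) for each merge, as read off a dendrogram

    def cluster_distance(a, b):
        return min(euclidean(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = cluster_distance(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        d, p, q = best
        if d > stop_distance:
            break
        merged = clusters[p] + clusters[q]
        merges.append((d, sorted(merged)))
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)] + [merged]
    return [sorted(c) for c in clusters], merges

# Hypothetical samples described by two measured variables
points = [(1.0, 1.0), (1.2, 1.1), (5.0, 5.0), (5.1, 5.3), (9.0, 1.0)]

clusters_strict, merges_strict = single_linkage(points, stop_distance=1.0)
clusters_loose, merges_loose = single_linkage(points, stop_distance=6.0)
print(sorted(clusters_strict))
print(sorted(clusters_loose))
```

The same data give three clusters with the strict rule and a single cluster with the loose rule, which shows how the choice of stopping rule introduces subjectivity.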
what is a limitation of cluster analysis
the visualisation makes relationships clear but it cannot tell us WHY these groupings occur
what happens in PCA analysis
an algorithm is applied as an unsupervised technique
it assesses all variables in a dataset and decides which are relevant and correlated
in PCA when the algorithm finds a correlation what is this defined as
a Principal Component (PC)
what does the PC describe in PCA analysis
the largest variation between samples with PC1 being given the highest priority
For ‘good’ data how many principal components would you expect to see in PCA
more than 1
multiple PCs are identified by the algorithm until all the variability within the dataset has been modelled
what is data comprised of
structure and noise or model and error
where data = spectra and chromatograms
structure/model = useful info (explained variance)
noise/error = not useful info to data interpretation (instrument noise, lab temp fluctuation) (residual variance)
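In the matrix notation commonly used for PCA (an assumption here, since the cards give no formula), the split into structure and noise can be written as below. X is the mean-centred data matrix (samples by variables), T holds the scores, P holds the loadings and E holds the residuals.

```latex
X \;=\; \underbrace{T P^{\mathsf{T}}}_{\text{structure / model}} \;+\; \underbrace{E}_{\text{noise / error}}
```

The explained variance comes from the first term and the residual variance from the second.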
how are principal components represented in PCA
a straight line = a linear combination of the original variables
in PCA what assigns a sample a ‘score’
the distance along the PC line from the mean (can be positive or negative)
each sample has a different score for each principal component
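A minimal sketch of how PCs and scores can be calculated when there are only two variables, using the closed-form solution for a 2x2 covariance matrix. Real chemometric software handles many variables at once; the data and the function name `pca_two_variables` are hypothetical.

```python
import math

def pca_two_variables(xs, ys):
    """PCA for exactly two variables via the eigen-decomposition of their covariance matrix."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dx = [x - mean_x for x in xs]  # mean-centre each variable
    dy = [y - mean_y for y in ys]
    sxx = sum(a * a for a in dx) / (n - 1)
    syy = sum(b * b for b in dy) / (n - 1)
    sxy = sum(a * b for a, b in zip(dx, dy)) / (n - 1)

    # Variance captured by each PC = eigenvalues of [[sxx, sxy], [sxy, syy]]
    half_trace = (sxx + syy) / 2
    root = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    variances = (half_trace + root, half_trace - root)

    # PC1 points along the direction of greatest variance; PC2 is at right angles to it
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)
    pc1 = (math.cos(angle), math.sin(angle))
    pc2 = (-math.sin(angle), math.cos(angle))

    # Score = signed distance of each sample from the mean along the PC line
    scores_pc1 = [a * pc1[0] + b * pc1[1] for a, b in zip(dx, dy)]
    scores_pc2 = [a * pc2[0] + b * pc2[1] for a, b in zip(dx, dy)]
    return variances, (pc1, pc2), (scores_pc1, scores_pc2)

# Hypothetical measurements of two variables on five samples
var_1 = [1.0, 2.0, 3.0, 4.0, 5.0]
var_2 = [2.1, 3.9, 6.2, 7.8, 10.0]

variances, loadings, scores = pca_two_variables(var_1, var_2)
explained_pc1 = 100 * variances[0] / sum(variances)
print(round(explained_pc1, 2), [round(s, 2) for s in scores[0]])
```

Each PC is a linear combination of the two original variables (the pair of weights in `loadings`), and each sample receives one positive or negative score per PC.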
how many PCs can a model have in PCA
at most as many as the number of original variables
what type of plot reveals the optimum number of PCs for a given number of variables in PCA
an explained variance or scree plot
the % on the y axis of this plot is the percentage of the variance in the dataset that is explained
keep adding PCs until each new PC adds less than 1% variance, as at that point you are likely to be modelling noise
this is a poor method of deciding how many PCs to use
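A minimal sketch of the numbers behind an explained variance plot and the 1% rule of thumb. The per-PC variances are hypothetical and the function names are illustrative.

```python
def explained_variance_percent(pc_variances):
    """Percentage of the total variance explained by each PC."""
    total = sum(pc_variances)
    return [100.0 * v / total for v in pc_variances]

def count_pcs(percentages, threshold=1.0):
    """Count leading PCs that each add at least `threshold` percent of variance."""
    count = 0
    for pct in percentages:
        if pct < threshold:
            break
        count += 1
    return count

# Hypothetical variance captured by PC1..PC6, largest first
pc_variances = [6.0, 2.5, 1.0, 0.4, 0.06, 0.04]

percentages = explained_variance_percent(pc_variances)
cumulative = [sum(percentages[: i + 1]) for i in range(len(percentages))]
n_pcs = count_pcs(percentages)
print([round(p, 1) for p in percentages], n_pcs)
```

Here PC5 is the first to add less than 1%, so the rule keeps four PCs, which together explain about 99% of the variance.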
what is a better method to use than scree/explained variance plots
scores plots
what is a scores plot
a map of the samples where each data point is a sample and similar samples form clusters
samples closer to each other are more similar and vice versa
the x and y axes correspond to two PCs; any pair of PCs can be chosen
plotting different PCs against each other gives different clusters
how can you tell by looking at scores plots that a PC is modelling noise and not useful information
if plotting PC1 vs PC4 and then PC1 vs PC5 gives very similar sample clusters then it is likely that PC5 is just modelling noise and isn’t adding any value to the data
what is a benefit of PCA over CA
PCA can tell you why samples cluster but CA can’t
scores plots however do not always reveal immediate trends so what type of plot can be used instead
skittles plots
what is a skittles plot
again this is plotting PCs against each other e.g PC1 vs PC2
samples are then grouped according to predefined categories e.g creams, liquids, loose powders, pressed powders and mousses
what is the benefit of 3D scores plots
for even better data visualisation
different combinations of PCs can be plotted for better groupings
what is a PCA loading plot
the link between a scores plot and the chemistry of the samples - tells us why certain samples group together (why PCA is better than CA)
(scores plots = map of samples
loadings = map of variables)
y axis = loading
x axis = raman shift (example)
the loading value can be +ve or -ve
what do loading values represent
The weight of a particular variable - e.g in slide 19 lecture 7 the blue line tells you that anything grouped into PC1 will have the long peak at the start in the spectrum
name one type of supervised chemometric technique
LDA = linear discriminant analysis
what happens in LDA = linear discriminant analysis
PCs are used to create classification rules
classifying unknown samples
LDA mathematically works similarly to PCA but what are PCs and scores called instead
PCs = discriminant functions
scores = discriminant values
why are loadings also calculated in LDA
loadings are also calculated like in PCA to maximise separation between known sample groupings
briefly explain how PCA and LDA differ
PCA
- looks for most variation between samples
- applies loading weights where variation is found
plots PC1 vs PC2 (or any PC number)
LDA
- tries to minimise variation within sample groups to maximise the separation of groupings by prioritising the loading values
plots class 1 vs class 2
in LDA what are samples in the same group given
similar discriminant values (the LDA equivalent of scores)
those in different groups are given different discriminant values
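A minimal sketch of a two-group linear discriminant for two variables, assuming Fisher’s formulation (weights = inverse of the pooled within-group scatter multiplied by the difference between group means). The lecture does not give the maths, so the function names, group labels and data are all hypothetical.

```python
def mean_point(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def scatter(points, centre):
    """Within-group scatter: sums of squared and cross deviations from the group mean."""
    sxx = sum((p[0] - centre[0]) ** 2 for p in points)
    syy = sum((p[1] - centre[1]) ** 2 for p in points)
    sxy = sum((p[0] - centre[0]) * (p[1] - centre[1]) for p in points)
    return sxx, sxy, syy

def fit_discriminant(group_a, group_b):
    mean_a, mean_b = mean_point(group_a), mean_point(group_b)
    axx, axy, ayy = scatter(group_a, mean_a)
    bxx, bxy, byy = scatter(group_b, mean_b)
    sxx, sxy, syy = axx + bxx, axy + bxy, ayy + byy  # pooled within-group scatter
    det = sxx * syy - sxy * sxy
    dx, dy = mean_a[0] - mean_b[0], mean_a[1] - mean_b[1]
    # weights = inverse(pooled scatter) * (mean_a - mean_b), written out for 2x2
    weights = ((syy * dx - sxy * dy) / det, (-sxy * dx + sxx * dy) / det)
    return weights, mean_a, mean_b

def discriminant_value(point, weights):
    return point[0] * weights[0] + point[1] * weights[1]

def classify(point, weights, mean_a, mean_b):
    # Group A always has the higher mean discriminant value with these weights
    midpoint = (discriminant_value(mean_a, weights) + discriminant_value(mean_b, weights)) / 2
    return "A" if discriminant_value(point, weights) > midpoint else "B"

# Hypothetical known samples from two groups, each described by two variables
group_a = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.4), (1.2, 2.2)]
group_b = [(4.0, 5.0), (4.5, 5.5), (5.0, 4.8), (4.2, 5.2)]

weights, mean_a, mean_b = fit_discriminant(group_a, group_b)
values_a = [discriminant_value(p, weights) for p in group_a]
values_b = [discriminant_value(p, weights) for p in group_b]

unknown_1 = classify((1.3, 2.0), weights, mean_a, mean_b)
unknown_2 = classify((4.6, 5.0), weights, mean_a, mean_b)
print(unknown_1, unknown_2)
```

Samples in the same group end up with similar discriminant values, and an unknown sample is assigned to whichever group its value falls closest to.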
name 4 limitations to using chemometrics for trace evidence analysis
- can’t make up for poor data
- data preprocessing can be important but less is more (overprocessing data can cause issues if you are trying to force the data to show what it doesn’t)
- sample size is better if bigger but this is hard in trace analysis
- the samples used must accurately reflect the population
what are things that can happen during sample collection and analysis that chemometrics does not take into account
contamination
degradation
replicates taken
controlled sample collection
reproducibility
what is chemometrics not a substitute for
human data interpretation
the user must be able to understand and explain the methods to the judge and jury