Chemometrics Flashcards
Define chemometrics
computationally intensive, multivariate (many variables) statistical analysis that is specifically applied to chemical systems or processes
What six things can chemometrics do?
1 - primarily to reduce complex datasets
- spectrometers give out lots of data which takes too long to go through manually
2 - identify and quantify sample groupings
- can see which samples are similar or dissimilar through a quantitative value
3 - optimise experimental parameters
4 - isolate important variables and identify covariance
- some peaks are not important, as some may be common to all samples (e.g. a flat baseline region)
- covariance: which parts of the data are reliant on each other - which variables are not independent and tend to change together
5 - provide reproducible measures of data
- anyone can take the raw data from a spectrum, put it into a chemometrics package, and run it on any instrument with the same technique and parameters - and get the same results
- removes subjectivity
6 - allows for better visualisation of data
- easier to present a clear picture when standing up as an expert witness
Briefly explain the history of chemometrics
- it is not new (new in forensics but not in general)
- originally there was scepticism (was seen that if you needed algorithms then you must have bad data collected = poor science involved)
- chemometrics was routinely used in industry for process optimisation and quality control e.g. food and pharmaceutical industries
- aim is to maximise output and quality with minimal cost
- growing use in chemical engineering, biomedical sciences and materials science
- since 2009 - emerging use in forensic science (improved efficiency in forensic workflow and better quality of forensic provision)
What did the national academy of sciences (NAS) report published in 2009 say?
- forensic science is a mess and needs sorting out
- a need for statistical framework - hence chemometrics
- want replacement of unique/indistinguishable/match with numerical probabilities in Q vs K comparisons
- need standard terminology across all disciplines of forensic science
- analysts not knowing the full circumstances of what happened at scenes helps counteract contextual bias
- chemometrics is quicker than manual data interpretation (cost efficient)
- chemometrics can help use models to predict trace behaviour (background, transfer, persistence and activity level) - model how a trace would be expected to transfer/persist in an environment given certain factors
- chemometrics does not negate need for expert - eliminates lot of subjectivity but still need human to interpret final result
Describe the difference between univariate and multivariate
- multivariate means many variables
- univariate means one variable (e.g. melting/boiling points)
Are the spectra in forensics univariate or multivariate?
- in forensics they are multivariate
- a univariate approach is too simplistic for complex data; it does not take covariance into account
- for example, some faults in process data can only be detected when MVA is applied
- we must also consider transfer, background, persistence and activity level
Describe three situations where MVA might be beneficial
1 - considering pollen as a form of TE
- likelihood of finding pollen will be much higher at certain times of year
2 - when someone puts a fingermark down, how sweaty they are relates to temperature
- on a hot day there will be more sebum in the print, leaving a more patent rather than latent mark
3 - titanium dioxide pigment is used in makeup in two ways: active ingredient (sunscreen) layer or interference effect (shimmer/sheen)
- if a spectrum shows it contains both titanium dioxide and mica = covariant, which points in the direction of an interference pigment
- if a spectrum shows it contains titanium dioxide and zinc oxide (another sunscreen ingredient blocking a different UV range) but no mica - this would indicate SPF makeup with broad-spectrum coverage
What are the four categories of chemometrics?
1 - design of experiments (helps you to design better experiments more effectively to get maximum amount of data out of it)
2 - exploratory data analysis (what is data showing me, how do samples compare, similar/dissimilar)
3 - classification (building models for things like transfer and persistence, often using groupings identified during EDA)
4 - regression
Explain design of experiments/DOE (what will it improve in future of FS, what will it effect, how does it work)
- relates to experimental set up
- will be used in future to streamline FS provision - improving efficiency, quality and reproducibility (how many experiments can be done in one day; ensuring the data and the way of analysing it are scientifically robust; getting the same answer with a different analyst, on a different day, on a different machine; getting the same interpretation from a different analyst)
- DOE will affect evidence collection, storage, instrument selection, parameter optimisation etc.
- in a typical DOE diagram, dots in the four corners represent the measurements actually taken, and DOE interpolates the response between these measured parameter settings (see the sketch below)
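To make the corner-point idea concrete, here is a minimal sketch in Python of a two-factor, two-level design; the factor names ("power", "time") and response values are invented for illustration, not taken from the flashcards. The four corner measurements are fitted with a simple linear model, which can then interpolate the expected response anywhere inside the parameter space.

```python
# A minimal sketch of a two-factor design of experiments (DOE).
# The factors and responses below are hypothetical illustration values.
import numpy as np

# corner-point design: each row is (power, time), coded -1/+1 for low/high
design = np.array([[-1, -1],
                   [-1,  1],
                   [ 1, -1],
                   [ 1,  1]], dtype=float)

# hypothetical measured responses (e.g. signal-to-noise) at each corner
response = np.array([12.0, 15.0, 18.0, 25.0])

# fit response = b0 + b1*power + b2*time + b3*power*time by least squares
X = np.column_stack([np.ones(4), design[:, 0], design[:, 1],
                     design[:, 0] * design[:, 1]])
coeffs, *_ = np.linalg.lstsq(X, response, rcond=None)

# interpolate the expected response at the centre point (0, 0)
centre = np.array([1.0, 0.0, 0.0, 0.0])
print("predicted centre-point response:", centre @ coeffs)
```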
Explain regression analysis (what is it, how does it work, what does it allow for, give an example)
- chemometric version of a calibration curve based on y = mx + c linear relationship but this time multivariate
- it maps the effect of multiple independent variables (predictors) upon dependent variable (response)
- allows prediction of quantitative sample properties (puts numbers on things)
- for example
- ink deposition on paper, and someone asks how the ink would change over time
- using regression analysis, if day 6 has been observed, the model can suggest what the ink would have looked like on day 5 (see the sketch below)
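As an illustration only, the sketch below uses partial least squares (one common multivariate regression method) to map many spectral variables onto one quantitative property; the spectra and "ink age" values are invented for the example, not real data.

```python
# A minimal sketch of multivariate regression with partial least squares (PLS).
# All data here are simulated for illustration; real spectra would be preprocessed.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
n_samples, n_wavelengths = 30, 200

# invented predictor block: each row is one spectrum
spectra = rng.normal(size=(n_samples, n_wavelengths))
# invented response: ink age depends on a few wavelength regions plus noise
age_days = 3 * spectra[:, 10] - 2 * spectra[:, 50] + rng.normal(scale=0.1, size=n_samples)

# PLS maps many correlated independent variables onto the dependent variable
model = PLSRegression(n_components=2)
model.fit(spectra, age_days)

# predict the quantitative property for a new, unseen spectrum
new_spectrum = rng.normal(size=(1, n_wavelengths))
print("predicted age (days):", model.predict(new_spectrum).ravel()[0])
```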
Explain exploratory data analysis/EDA (what is it, how does it work, what three things does it allow for, what are two most commonly used EDA techniques)
- dimensionality reduction (data mining)
- it reduces data that has many variables (e.g. Raman spectra) into just a few measures called principal components
- pattern recognition technique - identify groupings and patterns
- helps visualise trends that may have otherwise gone unnoticed
- determination of sample similarity in complex data and gives this a number
- cluster analysis (CA)
- principal component analysis (PCA)
Other than presence/absence of peaks, what else can be a sign that samples are similar or dissimilar?
- it is not always the presence/absence of peaks that is the tell-tale sign of whether samples are similar or dissimilar - sometimes it is the relationship between peaks
Describe the difference between an unsupervised and a supervised technique?
unsupervised - exploring the data without any prior assumptions or knowledge of the samples
supervised - building classification rules for known sample groupings
describe cluster analysis (supervised/unsupervised, what is it, two types, what is the output, why is it not entirely objective, what does analyst have to decide, 3 positives and 1 negative)
- unsupervised
- samples grouped into clusters based on calculated distance (measure of their similarity)
- agglomerative (bottom-up) - individual samples are progressively merged into clusters
- divisive (top-down) - one large cluster is progressively split down to individual samples
- they are opposite approaches
- output is dendrogram
- there are different ways of calculating distances and linking criteria and this is a decision a human needs to make (introduces subjectivity)
- analyst decides on stopping rules to determine number of clusters arbitrarily (must state stopping rule)
- good initial technique as it simplifies complex data (it is a dimensionality reduction technique)
- not limited to quantitative data (e.g. can use counts alongside categorical data such as animal types)
- visualisation of relationships
- however can only tell you there are groupings but not why
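A minimal sketch of agglomerative cluster analysis on invented data, assuming scipy is available; the Euclidean distance, Ward linkage and two-cluster stopping rule are example analyst choices, not prescribed ones.

```python
# A minimal sketch of agglomerative (hierarchical) cluster analysis.
# Each row is one sample described by a few invented variables.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
# two invented groups of samples offset from each other
group_a = rng.normal(loc=0.0, size=(5, 4))
group_b = rng.normal(loc=5.0, size=(5, 4))
samples = np.vstack([group_a, group_b])

# build the hierarchy: Ward linkage on Euclidean distances (one possible choice)
tree = linkage(samples, method="ward")

# stopping rule chosen by the analyst: cut the tree into 2 clusters
labels = fcluster(tree, t=2, criterion="maxclust")
print("cluster assignments:", labels)

# scipy.cluster.hierarchy.dendrogram(tree) would draw the dendrogram output
```

Note that the output only shows which samples group together, not why they group; that is the limitation mentioned above.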
describe principal component analysis (supervised/unsupervised, better/worse than CA, what is process)
- unsupervised
- generally considered superior to cluster analysis (it can indicate why samples group, not just that they do)
- assesses all variables within a dataset e.g. spectrum and then decides which are relevant
- it then determines which variables are correlated
- where the algorithm finds correlation, defines it as a principal component (PC)
- PC that describes largest variation between samples will be given PC1
- if first PC not sufficient to describe spread of data (it isn’t), then calculation repeated to find PC2 (at right angle to PC1) - looks at residual variance to find next amount of variation
- process continued until all variability within dataset has been accounted for and modelled
- stop once any remaining PCs would only be modelling noise (see the sketch below)
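A minimal sketch of PCA on invented, correlated data, assuming scikit-learn is available; the explained variance ratio shows how PC1 captures the largest variation, PC2 the next largest, and where later PCs start describing only noise.

```python
# A minimal sketch of principal component analysis (PCA).
# The "spectra" are simulated from two underlying factors plus noise.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
n_samples, n_variables = 20, 100

# invented data: two hidden factors generate many correlated variables plus noise
factors = rng.normal(size=(n_samples, 2))
loadings = rng.normal(size=(2, n_variables))
spectra = factors @ loadings + rng.normal(scale=0.1, size=(n_samples, n_variables))

pca = PCA(n_components=5)
scores = pca.fit_transform(spectra)   # sample coordinates on each PC

# PC1 describes the largest variation between samples, PC2 (orthogonal to PC1)
# the next largest; later ratios shrink towards the noise floor
print("explained variance ratio:", np.round(pca.explained_variance_ratio_, 3))
print("first sample on PC1/PC2:", np.round(scores[0, :2], 3))
```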