Chemometrics Flashcards

1
Q

Define chemometrics

A

computationally intensive, multivariate (many variables) statistical analysis that is specifically applied to chemical systems or processes

2
Q

What six things can chemometrics do?

A

1 - primarily to reduce complex datasets
- spectrometers give out lots of data which would take too long to go through manually

2 - identify and quantify sample groupings
- can see which samples are similar/dissimilar via a quantitative value

3 - optimise experimental parameters

4 - isolate important variables and identify covariance
- some peaks are not important as they may be common to all samples (flat baseline region)
- covariance: which parts of the data are reliant on each other, i.e. which are not independent/tend to happen together

5 - provide reproducible measures of data
- anyone can take the raw data from a spectrum, put it into a chemometrics package and run it with the same technique and parameters on any instrument - and get the same results
- removes subjectivity

6 - allow for better visualisation of data
- easier to show a picture when standing up as an expert witness

3
Q

Briefly explain the history of chemometrics

A
  • it is not new (new in forensics but not in general)
  • originally there was scepticism (it was assumed that if you needed algorithms you must have collected bad data = poor science involved)
  • chemometrics was routinely used in industry for process optimisation and quality control e.g. the food and pharmaceutical industries
  • the aim is to maximise output and quality at minimal cost
  • growing use in chemical engineering, biomedical sciences and materials science
  • since 2009 - emerging use in forensic science (improved efficiency in the forensic workflow and better quality of forensic provision)
4
Q

What did the National Academy of Sciences (NAS) report published in 2009 say?

A
  • forensic science is a mess and needs sorting out
  • a need for statistical framework - hence chemometrics
  • want replacement of unique/indistinguishable/match with numerical probabilities in Q vs K comparisons
  • need standard terminology across all disciplines of forensic science
  • analysts not knowing what happened at the scene will help counteract bias
  • chemometrics is quicker than manual data interpretation (cost efficient)
  • chemometrics can help use models to predict trace behaviour (background, transfer, persistence and activity level) - model how a trace would be expected to transfer/persist in an environment given certain factors
  • chemometrics does not negate the need for an expert - it eliminates a lot of subjectivity but a human is still needed to interpret the final result
5
Q

Describe the difference between univariate and multivariate

A
  • multivariate means many variables
  • univariate means one variable (e.g. melting/boiling points)
6
Q

What are the spectra in forensics (univariate or multivariate)?

A
  • in forensics, spectra are multivariate
  • a univariate approach is too simplistic for complex data: it doesn't take covariance into account
  • for example, some faults can only be detected when MVA is applied
  • we must consider transfer, background, persistence and activity level
7
Q

Describe three situations where MVA might be beneficial

A

1 - considering pollen as a form of TE (trace evidence)
- likelihood of finding pollen will be much higher at certain times of year

2 - when someone puts a fingermark down, how sweaty they are relates to temperature
- on a hot day there will be more sebum in the print, leaving a more patent rather than latent print

3 - titanium dioxide pigment is used in makeup in two ways: as an active ingredient (sunscreen) layer or for an interference effect (shimmer/sheen)
- if a spectrum shows the sample contains both titanium dioxide and mica, they are covariant and point in the direction of an interference pigment
- if a spectrum shows titanium dioxide and no mica but zinc oxide (another sunscreen blocking a different UV range), this would indicate SPF makeup with broad-spectrum coverage

8
Q

What are the four categories of chemometrics?

A

1 - design of experiments (helps you to design experiments more effectively so you get the maximum amount of data out of them)

2 - exploratory data analysis (what is the data showing me, how do samples compare, similar/dissimilar)

3 - classification (models for things like transfer/persistence, using the model created in EDA)

4 - regression (predicts quantitative sample properties)

9
Q

Explain design of experiments/DOE (what will it improve in future of FS, what will it affect, how does it work)

A
  • relates to experimental set up
  • will be used in future to streamline FS provision - improving efficiency, quality and reproducibility (how many experiments can be done in one day; ensuring the data and the way it is analysed are scientifically robust; getting the same answer with a different analyst on a different day on a different machine; getting the same interpretation from a different analyst)
  • DOE will affect evidence collection, storage, instrument selection, parameter optimisation etc.
  • a typical DOE image shows dots in the 4 corners (the measurements we have taken) and DOE interpolates between these measured parameters
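A minimal sketch of the "4 corners" idea as a 2-factor, 2-level full factorial design; the factor names and levels (laser power, exposure time) are hypothetical examples, not from the source:

```python
# Hypothetical 2-factor, 2-level full factorial design: the four
# level combinations are the "4 corners" of the parameter space.
from itertools import product

laser_power = [10, 50]     # mW - low and high levels (hypothetical)
exposure_time = [1, 5]     # s  - low and high levels (hypothetical)

# each combination of levels is one experimental run
for power, time in product(laser_power, exposure_time):
    print(f"run: power = {power} mW, exposure = {time} s")
# a model fitted to these corner measurements is what lets DOE
# interpolate between the measured parameter settings
```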
10
Q

Explain regression analysis (what is it, how does it work, what does it allow for, give an example)

A
  • the chemometric version of a calibration curve, based on the y = mx + c linear relationship but this time multivariate
  • it maps the effect of multiple independent variables (predictors) upon a dependent variable (response)
  • allows prediction of quantitative sample properties (puts numbers on things)
  • for example, ink deposited on paper, where someone asks how the ink would change over time
  • using regression analysis, if we have seen day 6 we can suggest what the ink would have looked like on day 5
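A minimal regression sketch with scikit-learn, assuming made-up spectral predictors and a hypothetical ink-age response:

```python
# Hypothetical multivariate calibration: several predictors (e.g.
# spectral variables) mapped onto one response (e.g. ink age in days).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))                   # 20 samples x 5 predictors
true_m = np.array([2.0, 0.0, -1.0, 0.5, 0.0])  # made-up coefficients
y = X @ true_m + 3.0 + rng.normal(scale=0.1, size=20)  # made-up response

model = LinearRegression().fit(X, y)  # multivariate analogue of y = mx + c
print(model.coef_, model.intercept_)  # one "m" per predictor plus one "c"
print(model.predict(X[:1]))           # predict a sample's property
```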
11
Q

Explain exploratory data analysis/EDA (what is it, how does it work, what three things does it allow for, what are two most commonly used EDA techniques)

A
  • dimensionality reduction (data mining)
  • it reduces data that has many variables, e.g. Raman spectra, into just a few measures called principal components
  • pattern recognition technique - identifies groupings and patterns
  • helps visualise trends that may otherwise have gone unnoticed
  • determines sample similarity in complex data and gives this a number
  • cluster analysis (CA)
  • principal component analysis (PCA)
12
Q

Other than presence/absence of peaks, what else can be a sign of whether a sample is similar or dissimilar?

A

not always presence/absence of peaks that is the tell-tale as to whether a sample is similar or dissimilar – sometimes it is the relationship between peaks

13
Q

Describe the difference between an unsupervised and a supervised technique?

A

unsupervised - exploring the data without any prior assumptions or knowledge of the samples

supervised - building classification rules for known sample groupings

14
Q

describe cluster analysis (supervised/unsupervised, what is it, two types, what is the output, why is it not entirely objective, what does analyst have to decide, 3 positives and 1 negative)

A
  • unsupervised
  • samples grouped into clusters based on a calculated distance (a measure of their similarity)
  • agglomerative - merges individual samples up into clusters
  • divisive - splits one cluster down into individual samples
  • they are opposites (both are forms of hierarchical clustering)
  • output is a dendrogram
  • there are different ways of calculating distances and linking criteria, and this is a decision a human needs to make (introduces subjectivity)
  • the analyst decides on stopping rules to determine the number of clusters arbitrarily (must state the stopping rule)
  • good initial technique as it simplifies complex data (it is a dimensionality reduction technique)
  • not limited to quantitative data (can use e.g. numbers and types of animals)
  • visualisation of relationships
  • however, it can only tell you there are groupings, not why
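A minimal sketch of agglomerative clustering on hypothetical spectra with SciPy; the distance metric, linkage criterion and stopping rule shown are illustrative analyst choices, not fixed defaults:

```python
# Hypothetical spectra clustered bottom-up; the output is a dendrogram.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

rng = np.random.default_rng(0)
spectra = np.vstack([rng.normal(0, 1, (5, 100)),   # pretend group A
                     rng.normal(5, 1, (5, 100))])  # pretend group B

# distance metric + linkage criterion = the analyst's (subjective) choice
Z = linkage(spectra, method="ward", metric="euclidean")

# stopping rule (must be stated): cut the tree into 2 clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)       # cluster membership per sample
dendrogram(Z)       # plots the tree (requires matplotlib to display)
```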
15
Q

describe principal component analysis (supervised/unsupervised, better/worse than CA, what is process)

A
  • unsupervised
  • superior to cluster analysis
  • assesses all variables within a dataset, e.g. a spectrum, and then decides which are relevant
  • it then determines which variables are correlated
  • where the algorithm finds correlation, it defines it as a principal component (PC)
  • the PC that describes the largest variation between samples is labelled PC1
  • if the first PC is not sufficient to describe the spread of the data (it isn't), the calculation is repeated to find PC2 (at right angles to PC1) - it looks at the residual variance to find the next amount of variation
  • the process continues until all variability within the dataset has been accounted for and modelled
  • stop when modelling noise
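A minimal PCA sketch on hypothetical spectra with scikit-learn, showing PCs ordered by the variation they describe:

```python
# Hypothetical spectra reduced to a few principal components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
spectra = rng.normal(size=(10, 200))     # 10 samples x 200 wavenumbers

pca = PCA(n_components=5).fit(spectra)   # first 5 PCs, mutually orthogonal
print(pca.explained_variance_ratio_)     # PC1 describes the most variation
scores = pca.transform(spectra)          # each sample's score on each PC
print(scores.shape)                      # (10, 5)
```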
16
Q

Why is it not good if data can be described by 1 PC?

A

there is not enough variation in the data

17
Q

What is the aim of our principal component analysis model?

A

to create a model that captures as much info in the dataset in as few PCs as possible

this is counterintuitive, as we have just said more PCs = more variation described

18
Q

What 2 components are data (spectra, chromatograms etc) comprised of?

What does PC describe?

What info is left over?

What is ideal model based on these two components?

When can non-ideal model occur?

A

structure and noise

  • PC describes structure (explained variance)
  • the important bits - tell you where the sample groupings are
  • whatever info is left over is random noise (residual variance) from instruments in the lab, temperature fluctuations etc.
  • this stuff isn't useful when modelling the data
  • the ideal model has structure and no noise
  • a non-ideal model can occur when using more and more PCs: the model starts to include noise, and using it for classification becomes tricky as an unknown sample is forced into a model with loads of noise
19
Q

What is each PC?

A

Each PC is a linear combination of the original variables (wavenumbers from spectral output)

20
Q

Define score

A
  • the distance along each PC from the mean to the sample (where mean = the mean of all samples)
  • can be positive or negative
  • each sample will have a different score for each PC; we stop using further PCs once all we are modelling is noise
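A minimal sketch of this definition under scikit-learn's conventions: the score is the mean-centred sample projected onto each PC's loading vector:

```python
# Score = distance along a PC from the mean to the sample.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 200))            # hypothetical spectra
pca = PCA(n_components=3).fit(X)

centred = X - X.mean(axis=0)              # distance from the mean...
scores = centred @ pca.components_.T      # ...projected along each PC
assert np.allclose(scores, pca.transform(X))  # matches sklearn's scores
print(scores[0])                          # sample 0's score on PC1-PC3
```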
21
Q

What is the relationship between the number of PCs and the number of original variables?

A

the model can have as many PCs as original variables

22
Q

Where is the majority of the good information?

In what scenario will we need more PCs?

A
  • in the first few PCs (especially if we have a good robust dataset)
  • if we have vastly different samples and lots of variance, then we need more PCs to accurately account for all the info and variance in the set
23
Q

What helps us determine the number of optimum PCs that we want to retain in a model?

How is this done?

What are two better ways to do this?

A
  • an explained variance (or scree) plot
  • the > 1 % variance rule: keep all PCs up to the point where the next PC adds less than 1 % of explained variance
  • inspection of the scores pattern, or checking the loadings
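A minimal sketch of applying the > 1 % rule to a fitted PCA's explained variance (hypothetical data; the cut-off logic is one reading of the rule):

```python
# Retain PCs until the next PC explains less than 1 % of the variance.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 200))            # hypothetical spectra
pca = PCA().fit(X)

explained = pca.explained_variance_ratio_ * 100   # % variance per PC
# ratios are sorted in decreasing order, so count PCs adding >= 1 %
n_keep = int((explained >= 1.0).sum())
print(f"retain {n_keep} PCs")
# plotting `explained` against PC number gives the scree plot
```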
24
Q

What is a scores plot?

How is this used to help determine the optimum PCs we want to retain in the model?

A
  • a scores plot maps the samples: each data point is one sample and similar samples cluster together, but we can choose which PCs to map
  • the scores plot will change depending on which PCs are mapped
  • for example, if plotting PC1/4 and PC1/5 shows the same pattern, we know not to include PC5, as it isn't showing anything useful
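A minimal sketch of that comparison (PC1 vs PC4 against PC1 vs PC5) on hypothetical data:

```python
# Two scores plots over different PC pairs: if they show the same
# pattern, the higher PC is not adding anything useful.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 200))                  # hypothetical spectra
scores = PCA(n_components=5).fit_transform(X)   # one point per sample

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.scatter(scores[:, 0], scores[:, 3]); ax1.set_title("PC1 vs PC4")
ax2.scatter(scores[:, 0], scores[:, 4]); ax2.set_title("PC1 vs PC5")
plt.show()
```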
25
Q

What are skittle plots?

A
  • when scores plots do not look nice
  • they resemble a dropped bag of skittles, hence the name
  • they do not always reveal immediate trends
  • samples can be grouped on skittle plots according to predefined categories
26
Q

What version of scores plots can be used for increased visualisation?

What is good and bad for these?

How are these helpful?

A
  • 3D scores plots
  • can look messy, but good if the data behaves itself
  • plot different combinations of PCs for enhanced discrimination
  • each 3D scatter plot uses different PCs - it is quite subtle, but the placement of a sample changes depending upon which PCs we are comparing, because each sample has a different score for each PC
27
Q

What are PCA loadings?

What do loading plots relate directly back to?

What does magnitude of loading indicate?

A
  • loadings are the link between the scores plot and the chemistry of the samples
  • a scores plot is a map of samples (tells you which samples are grouping) but a loadings plot is a map of variables
  • loadings can be positive or negative
  • loading plots relate directly to the spectra, and therefore directly back to the chemical composition, because the spectra give you your molecular fingerprint
  • magnitude/value of loading = the weighting for a particular variable
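A minimal sketch of pulling loadings out of a fitted scikit-learn PCA to see which variables carry the largest weightings; the wavenumber axis is a hypothetical stand-in:

```python
# Loadings map each PC back onto the original variables.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
wavenumbers = np.linspace(400, 1800, 200)    # hypothetical Raman axis
X = rng.normal(size=(10, 200))
pca = PCA(n_components=2).fit(X)

pc1_loadings = pca.components_[0]            # positive or negative weights
top = np.argsort(np.abs(pc1_loadings))[-3:]  # largest magnitudes
print(wavenumbers[top])    # variables driving the PC1 sample groupings
```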
28
Q

What do loading plots explain?

A

why samples are grouping

29
Q

For something that is below zero on the loadings plot, where should you expect to see it on a scores plot?

A
  • below 0 on loadings plot means it is negatively correlated with a PC which means that we will find it on negative side of scores plot
  • this is because loading plots and scores plots are directly comparable - can overlay them to see this
30
Q

What is LDA?

supervised/unsupervised?

how does it work?

A
  • linear discriminant analysis (LDA)
  • supervised chemometric technique (need to have knowledge of samples and have some idea of sample groupings)
  • LDA uses PCs to create classification rules/model
  • classification runs off back of PCA
  • when setting up LDA algorithm can use PCA scores
31
Q

why are loadings calculated in LDA?

how are loadings calculated in LDA?

what are PCs and scores called in LDA?

A
  • loadings in LDA are calculated specifically to maximise the separation between known groups
  • loadings are tweaked to make the difference between groups as large as possible whilst simultaneously making the difference between samples of the same known group as small as possible
  • the aim is to push the known groups so far apart that, when it comes to classification, an unknown sample cannot fall into the middle and be ambiguous - this is called tweaking

PCs = discriminant functions
scores = discriminant values

32
Q

How does LDA work in comparison to PCA?

A
  • PCA looks for the most variation between samples and applies loading weight wherever variation is found
  • where variation is found, it identifies what it is and assigns it a large or small weighting
  • but what if that variation is found not between samples of different groups, but among samples that should be in the same group?
  • LDA tries to minimise variation within groups and maximise separation between groups by prioritising loadings
  • samples belonging to the same group are given similar discriminant values, and those from different groups, different discriminant values
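A minimal PCA-to-LDA sketch with scikit-learn, assuming hypothetical spectra with known group labels (the supervised part):

```python
# LDA builds classification rules from PCA scores plus known groupings.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (10, 200)),    # hypothetical group 0
               rng.normal(2, 1, (10, 200))])   # hypothetical group 1
y = np.array([0] * 10 + [1] * 10)              # known sample groupings

scores = PCA(n_components=5).fit_transform(X)      # unsupervised EDA step
lda = LinearDiscriminantAnalysis().fit(scores, y)  # supervised step
print(lda.predict(scores[:1]))       # classify a sample into a group
print(lda.transform(scores)[:3])     # discriminant values per sample
```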
33
Q

Describe the caveats of chemometrics?

A
  • we do not want a model that is so specific it cannot cope with new unknown samples
  • chemometrics cannot compensate for bad data - it will not fix rubbish data
  • pre-processing is important, but less is more: adjust the spectral output to minimise random noise (baseline correction, smoothing and normalisation) BUT do not overprocess the data, as this will overfit it, turning it into something you want it to be so that it loses what it should be (see the sketch after this list)
  • need a good enough sample size to give a good dataset and reflect the population (although this is difficult in trace analysis)
  • was sample collection controlled and reproducible?
  • sample contamination or degradation
  • appropriate analytical method
  • correct parameters (include replicates, e.g. analyse the same thing 5 times)
  • beware of cognitive bias (chemometrics can only reduce subjectivity, not remove it); chemometrics is not a substitute for human interpretation (years of experience/exposure to different cases)
  • validation and test sets are important - make sure the model is robust and not adversely affected by outliers
  • success in research does not equal success in forensic casework
  • the practitioner must be able to communicate findings to the judge and jury, therefore they must understand the methods
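A minimal sketch of the baseline correction / smoothing / normalisation step on a made-up spectrum; the linear baseline fit and vector normalisation are illustrative choices, not the only options:

```python
# Light-touch pre-processing: remove baseline, smooth, normalise.
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(0)
x = np.arange(200)
spectrum = rng.normal(size=200) + 0.02 * x      # made-up sloping baseline

baseline = np.polyval(np.polyfit(x, spectrum, 1), x)
corrected = spectrum - baseline                 # baseline correction
smoothed = savgol_filter(corrected, window_length=11, polyorder=3)
normalised = smoothed / np.linalg.norm(smoothed)  # vector normalisation
print(normalised[:5])
# "less is more": every extra step risks overfitting the data
```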
34
Q

What are the benefits of using PCA over visual comparison?

A
  • PCA is a dimensionality reduction technique
  • allows trends to be noticed and visualised that can be missed by visual comparison
  • pattern recognition technique - finds patterns that are not obvious by eye
  • it is not always the absence/presence of peaks that determines whether a sample is similar/dissimilar - it can be covariance between peaks, which cannot be seen visually
  • quantitatively determines sample similarity - objective, as it is a number
  • unsupervised - does not need any prior knowledge of the data
  • more time efficient