Chemometrics Flashcards
Define chemometrics
computationally intensive, multivariate (many variables) statistical analysis that is specifically applied to chemical systems or processes
What six things can chemometrics do?
1 - primarily to reduce complex datasets
- spectrometers give out lots of data which takes too long to go through manually
2 - identify and quantify sample groupings
- can see which samples are similar or dissimilar through a quantitative value
3 - optimise experimental parameters
4 - isolate important variables and identify covariance
- some peaks are not important, as some may be common to all samples (e.g. a flat baseline region)
- covariance: which parts of the data are reliant on each other - which variables are not independent and tend to change together
5 - provide reproducible measures of data
- anyone can take the raw data from a spectrum, put it into a chemometrics package, and run it on any instrument with the same technique and parameters - and get the same results
- removes subjectivity
6 - allows for better visualisation of data
- easier to present a clear picture when standing up as an expert witness
Briefly explain the history of chemometrics
- it is not new (new in forensics but not in general)
- originally there was scepticism (was seen that if you needed algorithms then you must have bad data collected = poor science involved)
- chemometrics was routinely used in industry for process optimisation and quality control e.g. food and pharmaceutical industries
- aim is to maximise output and quality with minimal cost
- growing use in chemical engineering, biomedical sciences and materials science
- since 2009 - emerging use in forensic science (improved efficiency in forensic workflow and better quality of forensic provision)
What did the national academy of sciences (NAS) report published in 2009 say?
- forensic science is a mess and needs sorting out
- a need for statistical framework - hence chemometrics
- want replacement of unique/indistinguishable/match with numerical probabilities in Q vs K comparisons
- need standard terminology across all disciplines of forensic science
- analysts not knowing the full circumstances of what happened at scenes helps counteract contextual bias
- chemometrics is quicker than manual data interpretation (cost efficient)
- chemometrics can help use models to predict trace behaviour (background, transfer, persistence and activity level) - model how a trace would be expected to transfer/persist in an environment given certain factors
- chemometrics does not negate need for expert - eliminates lot of subjectivity but still need human to interpret final result
Describe the difference between univariate and multivariate
- multivariate means many variables
- univariate means one variable (e.g. melting/boiling points)
Are the spectra in forensics univariate or multivariate?
- in forensics they are multivariate
- a univariate approach is too simplistic for complex data; it does not take covariance into account
- for example, some faults in process data can only be detected when MVA is applied
- we must also consider transfer, background, persistence and activity level
Describe three situations where MVA might be beneficial
1 - considering pollen as a form of TE
- likelihood of finding pollen will be much higher at certain times of year
2 - when someone puts a fingermark down, how sweaty they are relates to temperature
- on a hot day there will be more sebum in the print, leaving a more patent rather than latent mark
3 - titanium dioxide pigment is used in makeup in two ways: active ingredient (sunscreen) layer or interference effect (shimmer/sheen)
- if a spectrum shows it contains both titanium dioxide and mica = covariant, which points in the direction of an interference pigment
- if a spectrum shows it contains titanium dioxide and zinc oxide (another sunscreen ingredient blocking a different UV range) but no mica - this would indicate SPF makeup with broad-spectrum coverage
What are the four categories of chemometrics?
1 - design of experiments (helps you to design better experiments more effectively to get maximum amount of data out of it)
2 - exploratory data analysis (what is data showing me, how do samples compare, similar/dissimilar)
3 - classification (building models for things like transfer and persistence, often using groupings identified during EDA)
4 - regression
Explain design of experiments/DOE (what will it improve in future of FS, what will it effect, how does it work)
- relates to experimental set up
- will be used in future to streamline FS provision - improving efficiency, quality and reproducibility (how many experiments can be done in one day; ensuring the data and the way of analysing it are scientifically robust; getting the same answer with a different analyst, on a different day, on a different machine; getting the same interpretation from a different analyst)
- DOE will affect evidence collection, storage, instrument selection, parameter optimisation etc.
- in a typical DOE diagram, dots in the four corners represent the measurements actually taken, and DOE interpolates the response between these measured parameter settings (see the sketch below)
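To make the corner-point idea concrete, here is a minimal sketch in Python of a two-factor, two-level design; the factor names ("power", "time") and response values are invented for illustration, not taken from the flashcards. The four corner measurements are fitted with a simple linear model, which can then interpolate the expected response anywhere inside the parameter space.

```python
# A minimal sketch of a two-factor design of experiments (DOE).
# The factors and responses below are hypothetical illustration values.
import numpy as np

# corner-point design: each row is (power, time), coded -1/+1 for low/high
design = np.array([[-1, -1],
                   [-1,  1],
                   [ 1, -1],
                   [ 1,  1]], dtype=float)

# hypothetical measured responses (e.g. signal-to-noise) at each corner
response = np.array([12.0, 15.0, 18.0, 25.0])

# fit response = b0 + b1*power + b2*time + b3*power*time by least squares
X = np.column_stack([np.ones(4), design[:, 0], design[:, 1],
                     design[:, 0] * design[:, 1]])
coeffs, *_ = np.linalg.lstsq(X, response, rcond=None)

# interpolate the expected response at the centre point (0, 0)
centre = np.array([1.0, 0.0, 0.0, 0.0])
print("predicted centre-point response:", centre @ coeffs)
```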
Explain regression analysis (what is it, how does it work, what does it allow for, give an example)
- chemometric version of a calibration curve based on y = mx + c linear relationship but this time multivariate
- it maps the effect of multiple independent variables (predictors) upon dependent variable (response)
- allows prediction of quantitative sample properties (puts numbers on things)
- for example
- ink deposition on paper, and someone asks how the ink would change over time
- using regression analysis, if day 6 has been observed, the model can suggest what the ink would have looked like on day 5 (see the sketch below)
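As an illustration only, the sketch below uses partial least squares (one common multivariate regression method) to map many spectral variables onto one quantitative property; the spectra and "ink age" values are invented for the example, not real data.

```python
# A minimal sketch of multivariate regression with partial least squares (PLS).
# All data here are simulated for illustration; real spectra would be preprocessed.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
n_samples, n_wavelengths = 30, 200

# invented predictor block: each row is one spectrum
spectra = rng.normal(size=(n_samples, n_wavelengths))
# invented response: ink age depends on a few wavelength regions plus noise
age_days = 3 * spectra[:, 10] - 2 * spectra[:, 50] + rng.normal(scale=0.1, size=n_samples)

# PLS maps many correlated independent variables onto the dependent variable
model = PLSRegression(n_components=2)
model.fit(spectra, age_days)

# predict the quantitative property for a new, unseen spectrum
new_spectrum = rng.normal(size=(1, n_wavelengths))
print("predicted age (days):", model.predict(new_spectrum).ravel()[0])
```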
Explain exploratory data analysis/EDA (what is it, how does it work, what three things does it allow for, what are two most commonly used EDA techniques)
- dimensionality reduction (data mining)
- it reduces data that has many variables (e.g. Raman spectra) into just a few measures called principal components
- pattern recognition technique - identify groupings and patterns
- helps visualise trends that may have otherwise gone unnoticed
- determination of sample similarity in complex data and gives this a number
- cluster analysis (CA)
- principal component analysis (PCA)
Other than presence/absence of peaks, what else can be a sign that samples are similar or dissimilar?
- it is not always the presence/absence of peaks that is the tell-tale sign of whether samples are similar or dissimilar - sometimes it is the relationship between peaks
Describe the difference between an unsupervised and a supervised technique?
unsupervised - exploring the data without any prior assumptions or knowledge of the samples
supervised - building classification rules for known sample groupings
describe cluster analysis (supervised/unsupervised, what is it, two types, what is the output, why is it not entirely objective, what does analyst have to decide, 3 positives and 1 negative)
- unsupervised
- samples grouped into clusters based on calculated distance (measure of their similarity)
- agglomerative (bottom-up) - individual samples are progressively merged into clusters
- divisive (top-down) - one large cluster is progressively split down to individual samples
- they are opposite approaches
- output is dendrogram
- there are different ways of calculating distances and linking criteria and this is a decision a human needs to make (introduces subjectivity)
- analyst decides on stopping rules to determine number of clusters arbitrarily (must state stopping rule)
- good initial technique as it simplifies complex data (it is a dimensionality reduction technique)
- not limited to quantitative data (e.g. can use counts alongside categorical data such as animal types)
- visualisation of relationships
- however can only tell you there are groupings but not why
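A minimal sketch of agglomerative cluster analysis on invented data, assuming scipy is available; the Euclidean distance, Ward linkage and two-cluster stopping rule are example analyst choices, not prescribed ones.

```python
# A minimal sketch of agglomerative (hierarchical) cluster analysis.
# Each row is one sample described by a few invented variables.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
# two invented groups of samples offset from each other
group_a = rng.normal(loc=0.0, size=(5, 4))
group_b = rng.normal(loc=5.0, size=(5, 4))
samples = np.vstack([group_a, group_b])

# build the hierarchy: Ward linkage on Euclidean distances (one possible choice)
tree = linkage(samples, method="ward")

# stopping rule chosen by the analyst: cut the tree into 2 clusters
labels = fcluster(tree, t=2, criterion="maxclust")
print("cluster assignments:", labels)

# scipy.cluster.hierarchy.dendrogram(tree) would draw the dendrogram output
```

Note that the output only shows which samples group together, not why they group; that is the limitation mentioned above.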
describe principal component analysis (supervised/unsupervised, better/worse than CA, what is process)
- unsupervised
- generally considered superior to cluster analysis (it can indicate why samples group, not just that they do)
- assesses all variables within a dataset e.g. spectrum and then decides which are relevant
- it then determines which variables are correlated
- where the algorithm finds correlation, defines it as a principal component (PC)
- PC that describes largest variation between samples will be given PC1
- if first PC not sufficient to describe spread of data (it isn’t), then calculation repeated to find PC2 (at right angle to PC1) - looks at residual variance to find next amount of variation
- process continued until all variability within dataset has been accounted for and modelled
- stop once any remaining PCs would only be modelling noise (see the sketch below)
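A minimal sketch of PCA on invented, correlated data, assuming scikit-learn is available; the explained variance ratio shows how PC1 captures the largest variation, PC2 the next largest, and where later PCs start describing only noise.

```python
# A minimal sketch of principal component analysis (PCA).
# The "spectra" are simulated from two underlying factors plus noise.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
n_samples, n_variables = 20, 100

# invented data: two hidden factors generate many correlated variables plus noise
factors = rng.normal(size=(n_samples, 2))
loadings = rng.normal(size=(2, n_variables))
spectra = factors @ loadings + rng.normal(scale=0.1, size=(n_samples, n_variables))

pca = PCA(n_components=5)
scores = pca.fit_transform(spectra)   # sample coordinates on each PC

# PC1 describes the largest variation between samples, PC2 (orthogonal to PC1)
# the next largest; later ratios shrink towards the noise floor
print("explained variance ratio:", np.round(pca.explained_variance_ratio_, 3))
print("first sample on PC1/PC2:", np.round(scores[0, :2], 3))
```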