Lecture 7: Chemometrics Part 2 Flashcards

1
Q

What does PCA do if the first principal component doesn’t sufficiently describe the data?

A
  • If the first PC is not sufficient to describe the spread of data, the calculation is repeated to find PC2
  • The process is continued until all the variability within the dataset has been accounted for and modelled
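
The repeat-until-explained procedure above can be sketched in plain Python. This is only an illustration, not the algorithm of any particular chemometrics package. It assumes a made-up data set of five samples and three variables, finds PC1 by power iteration, subtracts what PC1 explains, and then looks for PC2 in the residual variation. The function names (`centre`, `leading_direction`, `remove_direction`) are hypothetical.

```python
import math

def centre(rows):
    """Subtract each column mean so every variable is centred on zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def leading_direction(rows, iterations=200):
    """Power iteration on X^T X: the unit direction of largest variation."""
    p = len(rows[0])
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        scores = [dot(row, v) for row in rows]
        w = [sum(s * row[j] for s, row in zip(scores, rows)) for j in range(p)]
        norm = math.sqrt(dot(w, w))
        if norm == 0.0:
            break  # no variation left to model
        v = [c / norm for c in w]
    return v

def remove_direction(rows, v):
    """Subtract what direction v explains, leaving the residual variation."""
    return [[x - dot(row, v) * c for x, c in zip(row, v)] for row in rows]

def sum_of_squares(rows, v):
    """Variation of the samples along direction v."""
    return sum(dot(row, v) ** 2 for row in rows)

# Made-up data: variables 1 and 2 rise together, variable 3 is mostly noise.
data = [[2.0, 1.9, 0.5], [4.0, 4.2, 0.1], [6.0, 5.8, 0.9],
        [8.0, 8.1, 0.4], [10.0, 10.0, 0.6]]

X = centre(data)
pc1 = leading_direction(X)
residual = remove_direction(X, pc1)   # what PC1 did not describe
pc2 = leading_direction(residual)     # next-largest variation

total = sum(dot(row, row) for row in X)
print("PC1 explains", round(100 * sum_of_squares(X, pc1) / total, 1), "%")
print("PC2 explains", round(100 * sum_of_squares(X, pc2) / total, 1), "%")
print("PC1 . PC2 =", round(dot(pc1, pc2), 6))
```

Because PC2 is found only in what PC1 left behind, the two directions come out orthogonal, which is why the final dot product is effectively zero.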
2
Q

PC2 is…

A

PC2 is orthogonal to PC1 and lies along the direction of the next-largest variation within the data set.

3
Q

What does PC2 look at?

A

The residual variation

4
Q

Data =

A

Data = Model + Error

5
Q

Data = S

A

Data = Structure + Noise

6
Q

Structure = ?

A

The bits that are important, the bits that tell you where the sample groupings are

7
Q

Noise

A

Could be random noise from instrument, could be fluctuations in instrument, temperature changes, etc

8
Q

What is residual variance?

A

Residual variance is random noise; it is not useful for modelling the data.

9
Q

What happens if you model noise?

A

When you use the model for classification, you end up trying to force an unknown sample into a model that contains a lot of noise.

10
Q

What is the ideal model in PCA?

A

We want to create a model that captures as much of the information in the dataset as possible, in as few PCs as possible

11
Q

What do the PCs describe?

A

The PCs will describe the structure (explained variance)

12
Q

How do you determine the optimum number of PCs in PCA?

A

The optimum number of principal components is determined by inspecting the loadings plot, the scores plot and the explained variance plot.

13
Q

How is a score generated for PCs?

A
  • Each PC is a linear combination of the original variables
  • The distance along each PC from the mean gives a sample its “score”
14
Q

What does the distance between each sample and PC give you?

A

Distance between each sample and the principal component gives you the residuals (residual variation)

15
Q

What does the distance between each sample and the mean of all the samples give you?

A
  • Distance between each sample and the mean of all the samples gives you your score (this score can be positive or negative)
  • Each sample will have a different score for each principal component.
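
The two distances described in these cards (along the PC from the mean, and from the sample to the PC) can be sketched for one PC in two variables. This is a toy illustration under stated assumptions: the four samples are made up, and the loading vector `pc1` is supplied by hand rather than fitted (it happens to be the first PC of this particular toy set). The helper name `describe` is hypothetical.

```python
import math

# Made-up samples measured on two variables.
samples = [[16.0, 28.0], [4.0, 12.0], [8.0, 21.5], [12.0, 18.5]]
pc1 = [0.6, 0.8]  # assumed unit-length loading vector, not fitted here

n = len(samples)
mean = [sum(col) / n for col in zip(*samples)]

def describe(sample):
    """Return (score, residual distance) of one sample for PC1."""
    centred = [x - m for x, m in zip(sample, mean)]
    # Score: signed distance from the mean, measured along PC1.
    score = sum(c * p for c, p in zip(centred, pc1))
    # Model: the part of the centred sample that PC1 describes.
    model = [score * p for p in pc1]
    # Error: what is left over (data = model + error).
    error = [c - m for c, m in zip(centred, model)]
    # Residual: distance from the sample to the PC1 line.
    residual = math.sqrt(sum(e * e for e in error))
    return score, residual

results = [describe(s) for s in samples]
for sample, (score, residual) in zip(samples, results):
    print(sample, "score:", round(score, 3), "residual:", round(residual, 3))
```

Samples lying on the PC1 line get a large positive or negative score and a residual near zero; samples lying off the line get a score near zero and a larger residual.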
16
Q

Where is the majority of the information found in PCA?

A

The majority of the information is in the first few PCs, particularly if you have a good data set.

17
Q

What does an explained variance (scree) plot tell you?

A

An explained variance (or scree) plot reveals the optimum number of PCs
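
One way to read an explained variance plot numerically is to accumulate the fraction of variance each PC captures and stop once a chosen target is reached. The sketch below assumes made-up variances for five PCs and an illustrative 90% target. Neither the numbers nor the cut-off come from the lecture, and `components_needed` is a hypothetical helper.

```python
# Made-up variance captured by each PC, largest first.
variances = [6.2, 2.4, 0.9, 0.3, 0.2]

def explained_variance(values):
    """Fraction of the total variance captured by each PC."""
    total = sum(values)
    return [v / total for v in values]

def components_needed(values, target=0.90):
    """Smallest number of PCs whose cumulative explained variance reaches target."""
    cumulative = 0.0
    for count, fraction in enumerate(explained_variance(values), start=1):
        cumulative += fraction
        if cumulative >= target:
            return count
    return len(values)

for number, fraction in enumerate(explained_variance(variances), start=1):
    print("PC", number, "explains", round(100 * fraction, 1), "%")
print("PCs needed for 90%:", components_needed(variances))
```

A fixed percentage is only one possible rule. In practice the plot is usually inspected for the point where extra PCs stop adding much.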

18
Q

What will happen if your samples have lots of variance?

A

If you have lots of samples with lots of variance, you’ll need more PCs to accurately account for the variation within the data set.

19
Q

What is a scores plot?

A
  • The scores plot is a map of the samples
  • Samples closest together are the most similar
20
Q

What happens to the score of a sample depending on the PC used?

A

Each sample will have a different score for every PC, so the map will change depending on the PCs chosen.

21
Q

What is an advantage of a scores plot?

A

You can separate data points that would overlap on other PCs

22
Q

Why is PCA better than CA?

A

It tells you why samples group together.

23
Q

What are loadings?

A

Loadings are the link between the scores plot and the chemistry of the samples

24
Q

What is a loadings plot?

A
  • A loadings plot is a map of variables
  • Loadings plots are the link between the scores plot and the chemistry of the samples
  • Loadings plots relate directly to the spectra, and so back to the chemical composition, because the spectra give you your molecular fingerprint
25
Q

What do loadings explain?

A

They explain why samples group together

26
Q

For something that is negatively correlated to a PC, where should you expect to see it on a scores plot?

A

Anything that is negatively correlated with a PC will be found on the negative side of the scores plot for that PC, because the loadings and scores are directly comparable.

27
Q

What does the magnitude of the loadings indicate?

A

The weighting of the variables

28
Q

What type of technique is linear discriminant analysis?

A

Supervised

29
Q

How does Linear Discriminant Analysis (LDA) create classification rules?

A

It uses principal components

30
Q

Why are loadings calculated in Linear Discriminant Analysis (LDA)?

A

Loadings are calculated to maximise separation between known groups

31
Q

What are the principal components called in linear discriminant analysis?

A

Discriminant functions

32
Q

What does LDA ask you at the start?

A

If you want to use PCA scores and if so how many scores.

33
Q

What does LDA do?

A

It tries to make the known groups as different from one another as possible, so that the groups are far enough apart for new unknowns to be assigned to one of them

34
Q

What is the Fisher ratio? (LDA)

A

Loadings are tweaked so that the difference between groups is as large as possible, whilst simultaneously making the difference between samples within a known group as small as possible.
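
For a single variable and two groups, one common textbook form of the Fisher ratio is the squared difference between the group means divided by the sum of the group variances. The sketch below uses that form as an assumption; the lecture may define it slightly differently, and the numbers are made up.

```python
from statistics import mean, variance

def fisher_ratio(group_a, group_b):
    """Between-group separation divided by within-group spread (one variable)."""
    between = (mean(group_a) - mean(group_b)) ** 2
    within = variance(group_a) + variance(group_b)
    return between / within

# Tight groups that sit far apart give a large ratio.
well_separated = fisher_ratio([1.0, 2.0, 3.0], [7.0, 8.0, 9.0])
# Spread-out groups that overlap give a small ratio.
overlapping = fisher_ratio([1.0, 4.0, 7.0], [3.0, 6.0, 9.0])

print("well separated:", round(well_separated, 3))
print("overlapping:", round(overlapping, 3))
```

A larger ratio means the variable (or discriminant function) separates the known groups more cleanly.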

35
Q

What does LDA try to minimise?

A

LDA tries to minimise variation within groups and maximise separation between groups by prioritising loadings

36
Q

What discriminant values are given to samples belonging to the same and different groups?

A

Samples belonging to the same group are given similar discriminant values and those from different groups are given different discriminant values
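
How discriminant values separate groups can be sketched for the simplest case: two known groups described by two variables (for example, two PCA scores). The code assumes the standard two-class Fisher discriminant, with the direction taken as the inverse pooled within-group scatter multiplied by the difference in group means. The data, the function names and the nearest-mean classification rule are illustrative assumptions, not the lecture’s software.

```python
def mean_vector(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def scatter(rows, m):
    """Elements (sxx, sxy, syy) of the 2x2 within-group scatter matrix."""
    sxx = sum((r[0] - m[0]) ** 2 for r in rows)
    sxy = sum((r[0] - m[0]) * (r[1] - m[1]) for r in rows)
    syy = sum((r[1] - m[1]) ** 2 for r in rows)
    return sxx, sxy, syy

def lda_direction(a, b):
    """Discriminant weights: inverse pooled scatter times mean difference."""
    ma, mb = mean_vector(a), mean_vector(b)
    axx, axy, ayy = scatter(a, ma)
    bxx, bxy, byy = scatter(b, mb)
    sxx, sxy, syy = axx + bxx, axy + bxy, ayy + byy
    det = sxx * syy - sxy * sxy
    dx, dy = ma[0] - mb[0], ma[1] - mb[1]
    return [(syy * dx - sxy * dy) / det, (sxx * dy - sxy * dx) / det]

def discriminant_value(sample, w):
    return sample[0] * w[0] + sample[1] * w[1]

# Made-up training samples from two known groups.
group_a = [[1.0, 2.0], [2.0, 3.5], [3.0, 2.5]]
group_b = [[6.0, 7.0], [7.0, 8.5], [8.0, 7.5]]

w = lda_direction(group_a, group_b)
values_a = [discriminant_value(s, w) for s in group_a]
values_b = [discriminant_value(s, w) for s in group_b]
centre_a = sum(values_a) / len(values_a)
centre_b = sum(values_b) / len(values_b)

def classify(sample):
    """Assign an unknown to the group with the nearer mean discriminant value."""
    value = discriminant_value(sample, w)
    return "A" if abs(value - centre_a) <= abs(value - centre_b) else "B"

print("group A values:", [round(v, 2) for v in values_a])
print("group B values:", [round(v, 2) for v in values_b])
print("unknown [2.5, 3.0] ->", classify([2.5, 3.0]))
```

Samples from the same group receive similar values and the two groups do not overlap on the discriminant axis, so an unknown can be assigned by where its value falls.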

37
Q

What are the drawbacks of chemometrics?

A
  • It doesn’t fix data
  • Preprocessing is needed
  • Requires large sample size which is difficult for trace evidence
  • Samples need to reflect population
  • Sample collection has to be controlled and reproducible
  • Not a substitute for human interpretation
  • Communicating it to a jury can be difficult
38
Q

What do you have to be careful of when applying spectral preprocessing in chemometrics?

A

Over-processing will overfit your data: it forces the data to be what you want it to be, so it loses what it should be.

39
Q

What is spectral preprocessing?

A
  • Changing spectral output so you’re minimising random noise
    e.g. baseline correction, smoothing, detrending, derivatives, etc.
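
Three of the preprocessing steps named above can be sketched in a few lines each. These are deliberately simple stand-ins (an offset subtraction, a moving average and a point-to-point difference) applied to a made-up spectrum. Real chemometrics software typically uses more refined versions, and the function names here are hypothetical.

```python
def baseline_offset(spectrum):
    """Shift the spectrum so its lowest point sits at zero."""
    lowest = min(spectrum)
    return [y - lowest for y in spectrum]

def moving_average(spectrum, window=3):
    """Smooth with a centred moving average; the window shrinks at the edges."""
    half = window // 2
    smoothed = []
    for i in range(len(spectrum)):
        neighbours = spectrum[max(0, i - half): i + half + 1]
        smoothed.append(sum(neighbours) / len(neighbours))
    return smoothed

def first_derivative(spectrum):
    """Difference between neighbouring points; a constant offset cancels out."""
    return [b - a for a, b in zip(spectrum, spectrum[1:])]

# Made-up intensities: a flat, slightly noisy baseline with one peak.
raw = [5.0, 5.2, 4.9, 9.0, 5.1, 5.0, 4.8]

corrected = baseline_offset(raw)
smoothed = moving_average(raw)
slope = first_derivative(raw)

print("baseline corrected:", [round(y, 2) for y in corrected])
print("smoothed:", [round(y, 2) for y in smoothed])
print("first derivative:", [round(y, 2) for y in slope])
```

Note that the smoothed peak is lower than the raw one: smoothing reduces noise but also flattens real features, so a wider window removes more of the signal along with the noise.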
40
Q

What bias do you need to be aware of in chemometrics?

A

Cognitive bias