Statistical methods for chemical analysis Flashcards
What are statistical methods for chemical analysis?
- Data
- Distributions
- Associations
- Graphical methods
- Hypothesis testing
- Averages
- Power
What are the different data types, and what does each one mean?
* **Nominal/categorical**
o Data that you can put into a named category
- E.g., alive or dead
* **Ordinal**
o Data that you can order (and categorise)
- E.g., mild/moderate/severe
* **Interval/ratio**
o Data that has a measurement (and that you can order and categorise)
- Interval: differences between measurements are equal, e.g., time, temperature
- Ratio: has a true zero, so values cannot be negative, e.g., height, weight, percentage, concentration
What are the 5 rules for significant figures in measurements?
- All non-zero numbers **are significant**
e.g., 563 has 3 sig. fig.
- All zeros between non-zero numbers **are significant**
e.g., 24006 has 5 sig. fig.; 2.404 has 4 sig. fig.
- Leading zeros are **not significant**
e.g., 0.0063 has 2 sig. fig.
- Trailing zeros after a number are **not significant**
e.g., 420 has 2 sig. fig.
- Unless there is a decimal point before the trailing zeros
e.g., 420.0 has 4 sig. fig.
Data is based on measurements that are uncertain.
Not all digits have meaning (are significant), and only those digits derived from a measurement should be written down. For instance, trailing zeros, if written, must have meaning.
When adding/subtracting and multiplying/dividing, what must you ensure about significant figures?
* **Adding and subtracting:** the answer must reflect the reliability of the least precise number
o 2.2 + 2.66 = 4.9 (calculator 4.86; rounded to the fewest decimal places, as 2.2 is only precise to 1 d.p.)
* **Multiplication and division:** report with the least number of significant figures
* 14 is not the same as 14.0: same value, but different meanings about its trustworthiness
o 2.5 x 3.42 = 8.6 (calculator 8.55; 2 s.f. in answer)
o 3.10 x 4.520 = 14.0 (calculator 14.012; 3 s.f. in answer)
o 5.042 x 20 = 100 (calculator 100.84; note 1 s.f. in answer)
o 5.042 x 20.0 = 101 (calculator 100.84; note 3 s.f. in answer)
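The worked examples above can be checked with a minimal Python sketch. The helper name `round_sig` is hypothetical, values are entered as strings so that `Decimal` keeps them exact, and ties are rounded half up, which is an assumption because the cards do not state a tie-breaking rule.

```python
from decimal import Decimal, ROUND_HALF_UP

def round_sig(value: Decimal, sig: int) -> Decimal:
    """Round value to `sig` significant figures (ties half up: an assumption)."""
    if value == 0:
        return value
    # adjusted() is the power of ten of the leading digit, e.g. 1 for 14.012
    exponent = value.adjusted() - sig + 1
    return value.quantize(Decimal(1).scaleb(exponent), rounding=ROUND_HALF_UP)

# Multiplication: keep the fewest significant figures found in the inputs
product_2sf = round_sig(Decimal("2.5") * Decimal("3.42"), 2)      # calculator 8.55
product_3sf = round_sig(Decimal("3.10") * Decimal("4.520"), 3)    # calculator 14.012
product_1sf = round_sig(Decimal("5.042") * Decimal("20"), 1)      # calculator 100.84

# Addition: keep the fewest decimal places found in the inputs
total_1dp = (Decimal("2.2") + Decimal("2.66")).quantize(
    Decimal("0.1"), rounding=ROUND_HALF_UP
)

for result in (product_2sf, product_3sf, product_1sf, total_1dp):
    print(f"{result:f}")
```

The number of significant figures is passed in by hand here; deciding it automatically from the inputs would need the rules listed in the earlier card.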
How would you round when the digit to be dropped is:
* Less than 5
* Greater than 5
* 5
* Exactly 5 (followed only by zeros)
What's the relationship between accuracy and precision?
There isn’t one
What is accuracy?
A measure of the average difference between the experimental value and the true value
Differences are due to systematic errors
The true value must be known
Every measurement has an associated uncertainty
What is precision?
How close measurements are to each other
Differences are due to random errors
The distribution of the random measurements is **Gaussian or normal**
How is the normal distribution of data described?
By its standard deviation
When is a histogram used?
For normally distributed data; a large sample size is better, as it leads to a bell-shaped curve
Sometimes a histogram can show skewed data
What kind of graphical method is used to compare groups and distributions?
Box and whisker plot
What are the different averages and what is each one?
What is the equation for calculating the mean?
What is the mean?
How is it calculated?
How is it represented in a box and whisker plot?
What is the mode and what type of thing do you have to look out for?
If a statistical test is carried out and it gets that p<0.05, what does this mean?
p < 0.05 means that, if the two data sets came from the same distribution, a difference this large would be seen less than 5% of the time, which suggests that they are different sets
Draw a box and whisker plot. What does each aspect of it represent, and how would the minimum and maximum otherwise be written?
For standard deviations:
* What kind of data is it used for?
* What is it a measure of?
* What does it describe?
What is the equation for the standard deviation and variance?
What is the **95% reference range**?
**Standard deviation**
o Gives an indication of spread
o 95% of observations lie within mean +/- 2 sd (actually 1.96 sd)
o 95% reference (normal) range: expect 95% of the samples in the data set to be within this range
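As a rough illustration of the mean +/- 1.96 sd rule, the sketch below computes a 95% reference range with Python's standard library. The function name `reference_range` and the readings are hypothetical, and the sample standard deviation (n - 1 denominator) is assumed.

```python
from statistics import mean, stdev

def reference_range(values, z=1.96):
    """Return (low, high) = mean -/+ z * sample standard deviation."""
    centre = mean(values)
    spread = stdev(values)  # sample sd, n - 1 in the denominator
    return centre - z * spread, centre + z * spread

# Hypothetical replicate readings, for illustration only
readings = [4.8, 5.0, 5.2]
low, high = reference_range(readings)
print(f"95% reference range: {low:.3f} to {high:.3f}")
```

The range is only meaningful if the data are approximately normally distributed.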
What is standard error?
The standard deviation of the means of repeated representative samples is known as the standard error
Graphical representation of a standard error
What is the 95% confidence interval?
Standard error is a standard deviation
o Of means, rather than data observations
o 95% of means lie within the mean (of means) +/- 2 se (**95% confidence interval**)
What is the equation for calculating standard error?
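The answer side of this card is not in the text. The usual textbook estimate is the sample standard deviation divided by the square root of the sample size, which the sketch below assumes. The function names and the data are hypothetical, and the multiplier 1.96 follows the normal approximation (a t multiplier would be more appropriate for small samples).

```python
from math import sqrt
from statistics import mean, stdev

def standard_error(values):
    """Standard error of the mean: sample sd / sqrt(n)."""
    return stdev(values) / sqrt(len(values))

def confidence_interval_95(values, z=1.96):
    """Approximate 95% confidence interval for the mean (normal approximation)."""
    centre = mean(values)
    se = standard_error(values)
    return centre - z * se, centre + z * se

# Hypothetical measurements, for illustration only
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(standard_error(data))
print(confidence_interval_95(data))
```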
Error bars, deciding when to use them in your data
What is a normal distribution data set?
Data based on continuous distributions follow a mathematical distribution- usually a normal distribution
What do parametric tests rely on?
Parametric tests rely on the data being normally distributed- plot your data
What can you use if your data is not normally distributed?
If your data is not normally distributed, you may be able to transform it mathematically (e.g., log the data values, plot, and test for normality), or use a non-parametric test
What does the central limit theorem suggest?
Central limit theorem suggests that you can usually use parametric tests if you have a large sample size (>30)
When should a non-parametric test be used?
Non-parametric tests do not assume a particular distribution/normal distribution. Use these if your data is better represented by a median than a mean
Parametric tests normally assume that the variances in the sets of data are homogenous (homoscedastic). What can be done to support this?
o Use an F test to check
o **If in doubt, use a non-parametric test**
What are F tests?
- F test looks to see if the ratio of the variances falls outside an expected level
- Depends on the degrees of freedom (n - 1) in each group and the variance (s²)
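A minimal sketch of the F statistic described above: the ratio of the two sample variances, with n - 1 degrees of freedom for each group. Putting the larger variance on top is a common convention and is assumed here; the function name and data are hypothetical. The result would then be compared with a critical value from F tables, which is not done in this sketch.

```python
from statistics import variance

def f_statistic(group_a, group_b):
    """Return (F, df_numerator, df_denominator) for two samples."""
    var_a, var_b = variance(group_a), variance(group_b)  # sample variances s^2
    # Assumed convention: larger variance in the numerator, so F >= 1
    if var_a >= var_b:
        return var_a / var_b, len(group_a) - 1, len(group_b) - 1
    return var_b / var_a, len(group_b) - 1, len(group_a) - 1

# Hypothetical data, for illustration only
f, df_num, df_den = f_statistic([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(f, df_num, df_den)
```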
When doing hypothesis tests, what is the first thing to consider?
Need to consider whether the data is independent (unpaired) or dependent (paired)
o Patients given treatment vs patients given placebo
- 2 sets of independent data
o Patients measured at baseline and then after treatment
- 1 set of data (the differences), which is often normally distributed even if the original data was not
What is a null hypothesis?
- The null hypothesis H0 assumes that there will be no observed difference because of an experiment
- The statistical test aims to look for evidence against the null hypothesis- a result that is so different from this distribution that we believe it has not occurred just by chance
- For example, if a result falls into the extremes of the distribution we might be prepared to reject the null hypothesis
- If the result does not fall into the extremes of the distribution we cannot reject the null hypothesis, but that does not mean that we accept the null hypothesis
What's the alternative hypothesis?
- The alternative hypothesis H1 assumes that there will be an observed difference as a result of an experiment
- If what we see is not representative of the null distribution, then we reject the null and accept the alternative hypothesis
- p < 0.05: less than a 5% chance of a measurement this extreme arising from the null hypothesis distribution
- Result falls outside the 95% confidence interval
What is the general equation for a test statistic, and what are some examples of statistical tests which can be done?
- All statistics tests involve calculating a test statistic
- Test statistic is compared with a particular distribution
- E.g., F test, T test, Chi squared test etc.
Deciding what statistical test to use…
For a t-test (or Student's t-test), what does the distribution describe, and what are t-tests used to compare?
- The t distribution describes sample data from the normal distribution
- As the amount of data increases, the t distribution approaches the normal distribution
- T-tests are used to compare two sets of normally distributed data
What are the 3 different forms of t-tests and the equations for each?
**3 different forms of t-test**
o **Independent samples** t-test compares means of 2 different groups
o **Paired samples** t-test compares means from the same group at different times
o **One sample** t-test compares the mean of a group against the known mean
How do you calculate the degrees of freedom for multiple data sets?
Calculating degrees of freedom of samples: (number in sample A + number in sample B) - number of different data sets
Degrees of freedom = n - 1 **(for one data set)**
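The equations for the three forms of t-test are not in the text of the cards. The sketch below uses the standard textbook forms, with the independent-samples version assuming equal variances (pooled variance), and returns the degrees of freedom according to the rule above. Function names and data are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev, variance

def one_sample_t(sample, known_mean):
    """t = (sample mean - known mean) / (s / sqrt(n)), df = n - 1."""
    n = len(sample)
    t = (mean(sample) - known_mean) / (stdev(sample) / sqrt(n))
    return t, n - 1

def paired_t(before, after):
    """One-sample t-test on the paired differences against zero."""
    differences = [a - b for b, a in zip(before, after)]
    return one_sample_t(differences, 0)

def independent_t(group_a, group_b):
    """Pooled-variance t-test (assumes equal variances), df = nA + nB - 2."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    t = (mean(group_a) - mean(group_b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return t, df

# Hypothetical data, for illustration only
print(one_sample_t([5.1, 4.9, 5.0, 5.2, 4.8], 5.0))
print(paired_t([10, 12, 9, 11], [12, 14, 10, 14]))
print(independent_t([1, 2, 3, 4, 5], [3, 4, 5, 6, 7]))
```

Each t value would then be compared with the t distribution for the stated degrees of freedom to obtain a p value.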
One-tailed or two-tailed tests
What percentage do they lie in, in the normal distribution curve?
If there are more than 2 groups to test, what is used?
ANOVA
What is ANOVA?
What is it used to compare?
- ANOVA (analysis of variance)
- Used to compare multiple groups in a single test; an extension of the t-test
What are the different types of ANOVA test you can have and what is each one used for?
* One-way ANOVA: compares the means of 3 or more groups defined by a single independent variable
* MANOVA: tests the effect of one or more independent variables on two or more dependent variables
o E.g., repeated measures over time in treated and placebo groups
* **Null:** all sample means are identical
* **Alternate:** at least one sample mean is significantly different
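To make the one-way case concrete, the sketch below computes the ANOVA F statistic as the between-group mean square divided by the within-group mean square. This is the standard textbook calculation rather than anything stated on the cards; the function name and data are hypothetical.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups."""
    all_values = [v for group in groups for v in group]
    grand_mean = mean(all_values)
    # Variation of the group means about the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of each observation about its own group mean
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Hypothetical data for three groups, for illustration only
print(one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]]))
```

A large F suggests that at least one group mean differs; which one differs is not identified by this statistic alone.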
When the term ‘power’ is used in stats, what is this describing and what is a good level of power?
Power- How many samples do I need to test?
* Do I have enough power?
o Is my sample size large enough to detect a significant difference where a difference truly exists (although the truth is not known to you)?
* Questions to ask
o What power do I need? Do I want to be 80% sure (80% power) that I will detect a difference in my test, if one really exists, or 90% sure?
o Power = 1 - Beta
o The higher the power, the more samples I will need
What level of significance do I want to set?
If we decide that something that occurs less than 5% of the time in an experiment is unlikely to be due to chance, then we set the p value at what, and alpha becomes what?
If we feel we need to be more certain that this is not a chance event, then we should set the p and alpha values to what?
If we decide that something that occurs less than 5% of the time in an experiment is unlikely to be due to chance, then we set the p value at **p < 0.05**
- **Alpha = 0.05**
If we feel we need to be more certain that this is not a chance event, then we should set **p < 0.01**
- **Alpha = 0.01**
**The lower the p value set, the more samples we will need to detect a difference where one truly exists**
When is power the greatest?
When the variability is reduced
In what different ways can power be described?
- Power is the probability of rejecting the null hypothesis when in fact the null hypothesis is false
- Power is the probability of making a correct decision (to reject the null hypothesis) when the null hypothesis is false
- Power is the probability that a test of significance will pick up on an effect that is present
- Power is the probability that a test of significance will detect a deviation from the null hypothesis, should such a deviation exist
- Power is the probability of avoiding a type II error (a false negative)
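The links between power, alpha, variability, and sample size can be explored with a rough sketch. It assumes a two-tailed comparison of two group means and uses a common normal-approximation formula for the sample size per group, n = 2((z for alpha + z for power) x sd / difference)². The formula is not taken from the cards, and the function name and inputs are hypothetical.

```python
from math import ceil
from statistics import NormalDist

def samples_per_group(difference, sd, alpha=0.05, power=0.80):
    """Approximate n per group to detect `difference` between two means."""
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)  # two-tailed significance
    z_power = standard_normal.inv_cdf(power)          # power = 1 - beta
    return ceil(2 * ((z_alpha + z_power) * sd / difference) ** 2)

# Hypothetical scenario: detect a difference equal to one standard deviation
print(samples_per_group(1.0, 1.0))               # 80% power, alpha 0.05
print(samples_per_group(1.0, 1.0, power=0.90))   # higher power
print(samples_per_group(1.0, 1.0, alpha=0.01))   # stricter alpha
print(samples_per_group(1.0, 0.5))               # reduced variability
```

Under these assumptions, raising the power or lowering alpha increases the required sample size, while reducing the variability decreases it.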
What is the equation for calculating power?
With power there are type I and type II errors, what are each?
With associations how is it decided what statistical test to use?
For categorical data the chi-squared test can be used, what is the equation for this test?
Associations- observed data
Associations- expected data
Associations- calculations
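The observed, expected, and calculation steps for a chi-squared test can be sketched as follows. Expected counts are taken as row total x column total / grand total, and the statistic is the sum of (observed - expected)² / expected. These are the standard forms rather than text from the cards, and the function name and table are hypothetical.

```python
def chi_squared(observed):
    """Return (chi2, degrees_of_freedom, expected) for a table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    # Expected count for each cell if there were no association
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df, expected

# Hypothetical 2 x 2 table of counts, for illustration only
chi2, df, expected = chi_squared([[10, 20], [20, 10]])
print(chi2, df, expected)
```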
The associations flow chart when looking at relationships
How is correlation measured and what are the different types?
Associations- plotting data
Associations method comparison…
What is linear regression and what is used to fit the line?
- Line is fitted using the **least squares method**
- Minimises the sum of squares of the residuals (the vertical difference of a point from a fitted line)
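A minimal sketch of a least squares fit for a straight line y = slope x + intercept, using the closed-form solution that minimises the sum of squared vertical residuals. The function name and data are hypothetical.

```python
def least_squares(x, y):
    """Return (slope, intercept) of the least squares line through (x, y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products about the means
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical points lying exactly on y = 2x + 1
slope, intercept = least_squares([1, 2, 3, 4], [3, 5, 7, 9])
# Residual: vertical distance of each point from the fitted line
residuals = [yi - (slope * xi + intercept) for xi, yi in zip([1, 2, 3, 4], [3, 5, 7, 9])]
print(slope, intercept, residuals)
```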
Predictions from associations
With associations there are r and R squared; what is each of these?
- **r is the correlation coefficient**
o Indicates the strength of the relationship between two variables
o Ranges from -1 to +1, where 0 is no correlation
- **R squared is the coefficient of determination**
o Indicates how well the x variable can be used to predict the variable on the y axis
o Ranges from 0 (poor predictor) to 1 (excellent predictor)
o R squared = 0.8 implies that the x (independent) variable explains 80% of the variation seen in the y (outcome) variable
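As an illustration, the sketch below computes Pearson's r from the sums of squares and takes R squared as r², which holds for a straight-line fit with a single x variable. The function name and data are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_yy = sum((yi - mean_y) ** 2 for yi in y)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return s_xy / sqrt(s_xx * s_yy)

# Hypothetical data, for illustration only
r = pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
r_squared = r ** 2  # valid for simple linear regression with one x variable
print(r, r_squared)
```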
In HPLC the principles of linear regression are used to predict the concentration of an analyte based on a standard curve
In terms of validation of the method, it is also important to determine the **limit of detection (LOD)** and the **limit of quantification (LOQ)**, and this can be done easily in Excel
What is the LOD and LOQ?
o LOD is the lowest amount of analyte that can be detected
o LOQ is the lowest amount of analyte that can be quantified with reasonable accuracy and precision
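The cards do not give formulas for LOD and LOQ. The sketch below assumes the widely used calibration-curve convention LOD = 3.3 x sd / slope and LOQ = 10 x sd / slope, where sd is the residual standard deviation of the standard curve. The calibration data, function names, and units are hypothetical, and the same fit is used to predict a concentration from a measured response.

```python
from math import sqrt

def fit_line(x, y):
    """Least squares slope and intercept for y against x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    return slope, mean_y - slope * mean_x

def lod_loq(conc, response):
    """Assumed convention: LOD = 3.3 * sd / slope, LOQ = 10 * sd / slope."""
    slope, intercept = fit_line(conc, response)
    ss_res = sum((r - (slope * c + intercept)) ** 2 for c, r in zip(conc, response))
    residual_sd = sqrt(ss_res / (len(conc) - 2))  # n - 2: two fitted parameters
    return 3.3 * residual_sd / slope, 10 * residual_sd / slope

def predict_concentration(measured_response, slope, intercept):
    """Read a concentration off the standard curve."""
    return (measured_response - intercept) / slope

# Hypothetical standard curve: concentration vs peak area
conc = [0, 5, 10, 20, 40]
area = [1.0, 51.5, 99.0, 202.0, 400.5]
slope, intercept = fit_line(conc, area)
lod, loq = lod_loq(conc, area)
print(lod, loq, predict_concentration(150.0, slope, intercept))
```

Other conventions exist (for example, signal-to-noise based limits), so the multipliers here are a choice rather than the only definition.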
Graphing the data-HPLC analysis of caffeine
Part 2
Part 3
Part 4