CHEMOMETRICS Flashcards
When do we use permutation tests or bootstrapping
when the observed data is sampled from an unknown or mixed distribution
when sample sizes are low
when outliers are a problem
when the distribution is too complex to estimate
Note: these are alternatives to parametric approaches
How do permutation tests work? On what basis?
If testing whether groups A and B are different, assume that if A and B are the same then the group labels don't matter
Steps:
1) calculate the observed test statistic (can be any statistic - t, ANOVA F, etc.) - called t0
2) place all observations in a single group
3) randomly assign them to groups of equal size
4) calculate the new test stat
5) repeat - for every possible placement into groups
6) arrange all the test stats in ascending order - this is an empirical distribution based on the data
7) if t0 falls outside the middle 95% of the empirical distribution then reject the null hypothesis
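The steps above can be sketched in Python (the course examples use R, but the logic is identical; the data and the difference-of-means statistic are illustrative assumptions):

```python
import random
from statistics import mean

def permutation_test(a, b, n_perm=10_000, seed=0):
    """Approximate two-sided permutation test on the difference of means."""
    rng = random.Random(seed)
    observed = mean(a) - mean(b)          # t0: the observed test statistic
    pooled = list(a) + list(b)            # pool all observations into one group
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)               # random relabelling of the pooled data
        perm_a = pooled[:len(a)]
        perm_b = pooled[len(a):]
        stat = mean(perm_a) - mean(perm_b)
        if abs(stat) >= abs(observed):    # as or more extreme than t0
            count += 1
    return count / n_perm                 # empirical two-sided p value

p = permutation_test([1.1, 1.3, 1.2, 1.4], [2.0, 2.2, 2.1, 2.3])
```

Because it samples permutations rather than enumerating all of them, this is an approximate (Monte Carlo) test, not an exact one.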
What is an exact test vs approximate test in permutations?
an exact test uses all possible combinations, whereas an approximate test randomly samples only some of them
What is Bootstrapping
Generates an empirical distribution, but by resampling the original sample with replacement - basically make a bunch of data sets with the same # of samples using those original values, which is the equivalent of running the experiment a bunch of times - this way we can see where the statistic really lies instead of having just one set
(again can do with any stat)
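A minimal bootstrap sketch in Python (illustrative data; the percentile interval shown is one of several bootstrap CI methods):

```python
import random
from statistics import mean

def bootstrap_ci(sample, stat=mean, n_boot=5000, alpha=0.05, seed=1):
    """Percentile bootstrap confidence interval for any statistic."""
    rng = random.Random(seed)
    n = len(sample)
    boots = sorted(
        stat([rng.choice(sample) for _ in range(n)])  # resample with replacement
        for _ in range(n_boot)
    )
    lo = boots[int((alpha / 2) * n_boot)]
    hi = boots[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

data = [9.8, 10.1, 10.3, 9.9, 10.0, 10.2, 9.7, 10.4]
low, high = bootstrap_ci(data)
```

Swapping `stat` for the median, stdev, or any other function shows why "can do with any stat" holds.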
What is Jackknifing
It's a means to estimate variance by subsampling (leaving samples out of the set one at a time)
What is K fold cross validation
used to validate a predictive model - splits the data into K subsets, each held out in turn as a validation set to test on
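The fold bookkeeping can be sketched as below (index handling only - model fitting and validation would happen inside a loop over the splits):

```python
def kfold_indices(n, k):
    """Yield (train, validation) index lists for K-fold cross-validation."""
    # distribute n samples over k folds as evenly as possible
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    for i, val in enumerate(folds):
        # every fold except fold i forms the training set
        train = [idx for j, f in enumerate(folds) if j != i for idx in f]
        yield train, val

splits = list(kfold_indices(10, 5))
```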
What is a time series?
longitudinal data sets - over time - plot the data (what happened) but also try to predict what happens next (forecast)
What are the steps in time series analysis
1) visualize data
2) smooth/clean the data
3) decomposition - e.g. seasonal data (monthly or quarterly) can be decomposed into a trend component (change in level over time) and a seasonal component
4) show the irregular component (what's left after removing trend and seasonal)
What are trends people see in time series
They see additive trend (increase over time)
Additive seasonal (it goes up and down with the seasons - almost sinusoidal)
and multiplicative trend (the seasonal swings get larger/wider over time)
How are things smoothed in time series
moving average - average the points next to you - k = how many points in the window
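A centered simple moving average can be sketched as (toy series; k assumed odd):

```python
def moving_average(series, k=3):
    """Centered simple moving average with an odd window size k."""
    half = k // 2
    out = []
    # the first and last `half` points have no full window and are dropped
    for i in range(half, len(series) - half):
        window = series[i - half:i + half + 1]
        out.append(sum(window) / k)
    return out

smoothed = moving_average([2, 4, 6, 8, 10, 12], k=3)
```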
Exponential forecasting models
single - a series with constant level and irregular component (no trend or seasonal)
Double (holt) - exponential- series with a level and a trend
Triple (Holt Winters) exponential- series with level, trend and seasonal
Types of Error
I - alpha rejection of true null hypothesis (false positive)
II - beta - non-rejection of a false null hypothesis (false negative)
What is LOD
lowest amount of analyte in sample that can be detected WITHIN a specific confidence level
is LOD agreed upon?
no - typically a signal-to-noise (s/n) relationship is used
Draw curves for signal to noise and blank and what shades represent what
Used for LOD determination - want the signal to be about 3x the stdev of the blank
The blank has its own distribution - we want the lowest signal we analyze to sit above it, but how much overlap between the two distributions is acceptable?
Ideally about 5% overlap, which takes roughly 3.3 stdev of separation - the portion of the sample distribution that falls below the decision threshold is the BETA rate - false negatives
and the portion of the blank distribution that falls above the threshold is the alpha rate - false positives
So a separation of about 3.3 stdev of the blank is used to achieve ~5% type I and type II error (type I comes from the blank distribution, type II from the sample distribution)
LOQ vs LOD
10x the noise (vs 3.3x for LOD)
Calculate LOD or LOQ from signal to noise
need to use it with another method to verify
it's the blank mean + either 3 (LOD) or 10 (LOQ) * stdev
with a linear cal curve it's 3.3 or 10 * stdev / b
where b is the slope of the linear regression
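The calibration-curve formulas above as a one-liner sketch (the sd_blank and slope values are made up for illustration):

```python
def lod_loq(sd_blank, slope):
    """LOD = 3.3 * s / b and LOQ = 10 * s / b for a linear calibration with slope b."""
    lod = 3.3 * sd_blank / slope
    loq = 10 * sd_blank / slope
    return lod, loq

# hypothetical blank stdev of 0.02 signal units, calibration slope 0.5 signal/conc
lod, loq = lod_loq(sd_blank=0.02, slope=0.5)
```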
What are selectivity and specificity
selectivity - ability of a method to determine the analyte in a complex matrix without interference
Specificity - confirms the method's ability to assess the analyte in the presence of any other components that might be present (including matrix)
so specificity is selectivity taken to its extreme
Accuracy vs precision
accuracy - trueness or bias - a measure of systematic error compared to a reference
Precision -closeness of repeated individual measurements under specified conditions
How to run accuracy and precision tests
against standard material - want accuracy within and between runs (bias) - use a low and a high QC
Precision - use % CV
ROBUST what is it
capacity of a method to be unaffected by natural variation - test over a range of parameters
UNCERTAINTY
significant sources must be identified and tabulated
2 types
A and B
A is random
B is systematic
examples - user skill, sampling, environment, instrument, etc
Stability
use QC samples - store at room temp, 4 C etc and test against fresh
HOW DO WE HANDLE NON DETECTS
Exclude or delete from the data set (worst option)
Substitute (0, 1/2 LOD, LOD etc)
Left- and right-censored indicate whether the unknown is below or above the measurable range
What is survival analysis and NADA
- how long will it be before an event occurs (e.g. medical survival times)
NADA is Nondetects And Data Analysis
What is fit for purpose
ensures the analytical method meets certain criteria of reliability and can perform - gives us confidence, shows it's reproducible and repeatable
(so as a list
reproducible, broad coverage - sensitivity and selectivity, linearity, precision, stable)
What are some key points to precision testing
sample should be stable and homogeneous (representative of what's being tested)
Should be applied to the whole sample preparation method analysis procedure
2 factors - precision estimate and design of precision experiment
What are the types of precision estimates
1) REPEATABILITY -within batch or intra assay - one analyst on the same equipment over a short time period
2) Intermediate precision - made in a single lab but under variable conditions - different days, analysts, equipment etc - within-lab reproducibility
3) Reproducibility - different labs - different equipment (interlab)
Types of precision experiments
Simple replication - repeated measurements on a suitable sample - want 6-15 reps
Nested design - used when you can't generate enough reps with simple replication (not feasible) - basically each batch has different params - so can be inter-lab, intra-lab etc
PRECISION limits - what are they and how to calc
repeatability limit: r = t * sqrt(2) * sr
a 95% confidence interval for the difference between two results obtained under repeatability conditions
reproducibility limit: R = t * sqrt(2) * sR
t is the two-tailed Student's t for the confidence level and DOF
They are calculated by multiplying the repeatability standard deviation (sr) or the reproducibility standard deviation (sR) by 2.8 respectively. The factor 2.8 is derived from 1.96 (95% of the population is within 1.96 standard deviations of the mean) times the square root of 2.
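The limit calculation as a sketch (t = 1.96 assumes the large-DOF 95% case, which gives the 2.8 factor mentioned above; the sd value is illustrative):

```python
import math

def precision_limit(sd, t=1.96):
    """Repeatability/reproducibility limit: t * sqrt(2) * sd (~= 2.8 * sd at 95%)."""
    return t * math.sqrt(2) * sd

r = precision_limit(0.15)   # repeatability limit from a hypothetical s_r = 0.15
```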
How to statistically evaluate precision estimates
F test
What is bias and how calculated and how evaluated
difference from the true value - so just mean minus accepted value - can be expressed as a %
t test statistic
Ruggedness study - how evaluated/set up
PLACKETT-BURMAN -
7 parameters to study - you pick them (e.g. extraction time), each with two levels (e.g. 30 min extraction vs 10 min)
to investigate an effect - take the difference between the average of results with the parameter at the normal level and the average at the alternate level
Measurement of uncertainty - what is and how tested
dispersion of the values possible for a measurement - e.g. stdev
can be propagated
ROC curves
Receiver operating characteristic - evaluates the prediction accuracy of a classifier model
a ROC curve is a plot of TP rate against FP rate across all classification thresholds
- tradeoff between sensitivity and specificity - the same picture as the LOD vs blank curves
formulas:
TPR = TP / all actual positives (TP + FN)
FPR = FP / all actual negatives (FP + TN)
we often summarize with AUC - area under the curve
AUC ranges from 0-1 - a model with 100% wrong predictions has an AUC of 0 and one with 100% right predictions has an AUC of 1
https://en.wikipedia.org/wiki/Receiver_operating_characteristic
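AUC can be computed without drawing the curve, via the equivalent rank (Mann-Whitney) formulation: the probability that a randomly chosen positive scores above a randomly chosen negative. A sketch with made-up scores:

```python
def roc_auc(scores, labels):
    """AUC as P(score of a positive > score of a negative); ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))

auc = roc_auc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0])   # perfectly ranked example
```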
What is metrology
formal system to enable informed decisions through data assessment - puts levels of confidence on what we're doing - a reliable network of measurements used to confidently make assessments about concentration
3 fundamentals of metrology
1) TRACEABILITY - to SI - translates units to results - go from standards to where you are (higher-order standards down to CRMs)
2) UNCERTAINTY (of measurement) - using the spread in results to make claims, not individual points - the dist
3) VALIDATION (of methods etc)
What is QA and exmaples
the planned and systematic activities implemented in a quality system so that quality
requirements for a service will be fulfilled; quality assurance occurs before the data is collected
eg suitable lab environment, educated staff, training, documented and validated methods, preventative actions, etc
What is QC
Quality Control: the observation techniques and activities used to evaluate and report quality,
quality control occurs during and after data is collected
examples: blanks, spiked samples, controls, reference materials etc
Example of system suitabiltiy testing
small number of standards - acquire data for accuracy and precision - not on bio samples; m/z accuracy, RT and peak shape are assessed
Dispersion ratio - what is it
compares the stdev of the pooled QC sample to that of the test samples - the D-ratio (MAD of QC / MAD of samples)
a D-ratio of 0% means technical variance is 0 - perfect measurement, all changes are due to biological causes
a D-ratio of 100% means all variance is due to noise - no bio info
Whats pooled QC
generate a single QC sample that can be distributed evenly throughout the analytical batch
Batch design and pooled QC - what can you do
basically run the pooled QC throughout the batch to see time-based variance
Main uses of reference materials
examine skills of analysts
controls
precision and accuracy
accreditation
measurement uncertainty estimation
How to score profiency testing results
2 steps - specify the assigned value
and set the standard deviation
ASSIGNED VALUE - can be known (CRM), a REFERENCE value (determined by one lab), or determined from the consensus of participating labs
STDEV - set by the scheme organizer - by prescription, from the results of a reproducibility experiment, or from a general model (e.g. the Horwitz function)
What is Z and Q in proficiency testing
Z score is what we think - (value minus assigned value) divided by stdev
|Z| less than 2 is pretty good - between 2 and 3 is questionable
Q score - alternative to Z - takes no account of stdev - the dist of Q is then centered on 0 - relies on an EXTERNAL PRESCRIPTION of acceptability
WHat is a YOUDEN plot
scatter plot - plots results from multiple labs on one graph to show
whether labs agree, outliers, inconsistencies etc
x and y each represent one of the reported values (e.g. concentration of analytes A and B)
draw lines parallel to the x and y axes - where points fall relative to them indicates various things about the results - e.g. random error vs systematic error
SHEWHART plot
sequential plot of observations from a QC material analyzed in succession - mean QC for each run vs measurement # (y axis shows the mean)
General princiiples of experimental design
Research method where you manipulate independent variables and observe the dependent variable
things to do:
arrange experiments for cancellation or comparison - to control bias
plan to do replication or independent uncertainty estimates (precision)
Need statistical analysis or approach
Experimental designs - list 4
Simple replication - series of observations on a single test material
Linear calibration design - observations at a range of levels (of some quantitative factor)
Nested - has levels of factors unique to each level above it
Factorial - has factors and levels that are not wholly distinct - e.g. one group can get one factor, another can get another, and another group can get both factors at once
Why do we randomize in experimental design
to minimize nuisance effects - unwanted effects that influence the results - e.g. so results aren't affected by run/sampling order
What is blocking
Basically have all replicates/groups of test materials subject to the same nuisance effects - e.g. with sets a, b and c we can run them separately or run all in the same trial so they are subject to the same effects
Sampling theory (define randomization, representation and composite)
Randomization - every member of the population has an equal chance of selection
Representation - sample enough of a population to draw inference on the total population
Composite - reduce effort by combining individuals to make a subset
List different sampling strategies
Simple - everything has an equal chance (easy but not great for long continuous sequences; also doesn't reflect subgroups in the population)
Stratified - divide the population into segments and randomly sample each segment - good because it minimizes variance further - can capture unique pockets
Systematic - first select a random start m, then sample at a fixed interval - simple and easy - covers everything regularly - but cannot deal with variation that matches the interval - will miss it
4 quantities of power analysis
sample size
significance level (alpha - probability of making a type I error)
Power - one minus the probability of making a type II error (probability of finding an effect that is there)
effect size - magnitude of the effect under the alternative hypothesis
How do you determine how many participants are needed for a study
power.t.test in R (similar functions in the pwr package) - uses sig level, power, effect size etc
versions exist for various tests - ANOVA, linear regression, chi-squared etc - need means, common error variance etc
What is proportionality constant k
basically the signal from the instrument = the concentration * this factor (S = kC)
What is single point cal
basically just using this proportionality factor with a single standard (S = k*C) - which by default forces the line through 0
Sensitivity from calibration curve
sensitivity is the slope b - the capability of responding reliably across changes in analyte concentration
What is r in cal curve -
It's the Pearson correlation coefficient - describes the relationship of response and concentration - ranges from 1 to -1
R^2 measures how close the data fit the linear model - 99% means 99% of the variability in our response is accounted for by changes in concentration
How to evaluate matrix effect
take the sample matrix - extract it and spike the analyte in - compare to a normal standard solution: (response in matrix / response in solvent) - 1 - a negative value means suppression
OR do spiked recovery - compare unspiked matrix to spiked matrix (same matrix)
this is (C spiked - C unspiked) / C added x 100
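Both calculations as a sketch (all response and concentration values are made up for illustration):

```python
def matrix_effect_pct(resp_matrix_spike, resp_solvent_std):
    """Matrix effect %: (response in extracted matrix / response in solvent - 1) * 100."""
    return (resp_matrix_spike / resp_solvent_std - 1) * 100

def spike_recovery_pct(conc_spiked, conc_unspiked, conc_added):
    """Spiked recovery %: (spiked - unspiked) / added * 100."""
    return (conc_spiked - conc_unspiked) / conc_added * 100

me = matrix_effect_pct(80.0, 100.0)        # negative => ion suppression
rec = spike_recovery_pct(14.5, 5.0, 10.0)  # recovery of a 10-unit spike
```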
Types of blanks -
method blank - unspiked sample
reagent blank - just solvent
field blank - unspiked sample that goes on the trip (trip blank is the same but unopened)
Weighted regression
error in a measurement is proportional to concentration, so larger concentrations carry more error - we give higher weights to the points where the error bars are smallest
Methods of standard addition
make the cal curve in the sample itself - spike increasing amounts of standard into the sample
ISTD
structural analogue, or stable isotope labeled
Isotope dilution
basically do consecutive dilutions to make the internal standard series - same ISTD in all samples
Multi LDR
basically if one linear range is not large enough - make 2 curves
OMICS quantitation - how
no cal curve, so use: response IS / conc IS = response target / conc target
What is a neural net
series of algorithms designed to recognize underlying relationships in a large data set (input layer, hidden layer, output layer)
What is machine learning
a computer program that improves its performance on a task through experience
4 ingredients of machine learning
1) data
2) a model that specifies how input data relate to output
3) a loss function - shows how well model performs
4) optimization algorithm - so it can improve the model and minimize the loss function
What is overfitting (also underfitting)
your model matches the training set too closely and isn't generalizable (underfitting is when it's too loose - wrong assumptions made)
What is supervised machine learning and what are common types
You tell it to develop a model based on input AND output
eg classification or regression
What is a decision tree
binary splits on predictor variables to create a tree and classify observations into one of two groups (repetitively)
this way we can choose the predictor that best splits the two groups - want HOMOGENEITY in each group maximized (i.e. the groupings make sense)
What is a conditional inference tree
splits based on significance tests
What is Random Forest
Ensemble learning approach - combines multiple learners (decision trees) to improve classification rates
What are support vector machines
SVMs are SUPERVISED machine learning - for classification and regression - seek the optimal hyperplane separating two classes in multidimensional space
What is a confusion matrix
a matrix of basically :
True Neg False Pos
False Neg True Pos
Stats from a confusion matrix
Sensitivity - TP / (total actual positives (FN+TP))
Specificity - TN / (total actual negatives (TN+FP))
False positive rate - FP / (TN+FP)
Precision = TP / (TP+FP)
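The four rates as a sketch (the counts are made up for illustration):

```python
def confusion_stats(tp, fp, tn, fn):
    """Common rates derived from a 2x2 confusion matrix."""
    return {
        "sensitivity": tp / (tp + fn),   # true positive rate (recall)
        "specificity": tn / (tn + fp),   # true negative rate
        "fpr":         fp / (fp + tn),   # false positive rate = 1 - specificity
        "precision":   tp / (tp + fp),
    }

stats = confusion_stats(tp=40, fp=10, tn=45, fn=5)
```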
What is PLS-DA
supervised pattern recognition - partial least squares discriminant analysis - asks if groups are different and which features explain the difference
How do distance based clustering methods work
1- calculate the centroid of each group
2- distance from each point to centroid of each group is calculated
3- sample assigned to group of closest centroid
Clustering - is it supervised or un - and describe it
unsupervised - a data reduction technique - exactly what it sounds like - cluster your observations
2 types of clustering - bottom up (agglomerative - start with each point separate and merge) and top down (divisive - start with one cluster and split)
How to normalize for clustering
scale
standardize to a mean of 0 and sd of 1
divide by max
Common steps for clustering
normalize
screen for outliers
calculate distances
What is a dendrogram
hierarchical clustering displayed as a tree - clade-like
pros and cons of dendrograms
finds compact clusters, but sensitive to outliers - need to pick an interpretation that makes the clustering make sense
What are the different linkage types
single, complete, average , centroid
How to interpret dendrograms
height indicates the order joined - read from the bottom up - height reflects distance
What is k means clustering
select k centroids - assign each data point to the closest centroid - recalculate the centroids as the average of all data points in a cluster
reassign each data point to the closest centroid - repeat those two steps until observations are no longer reassigned
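A plain k-means sketch on 2-D points following those steps (toy data; real implementations such as R's kmeans() typically add multiple random starts):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain k-means on 2-D points: assign to nearest centroid, then update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)                # pick k starting centroids
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                             # assign to nearest centroid
            d = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[d.index(min(d))].append(p)
        new = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centroids[i]                   # keep old centroid if empty
            for i, c in enumerate(clusters)
        ]
        if new == centroids:                         # assignments have settled
            break
        centroids = new
    return centroids, clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(pts, k=2)
```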
Partitioning around medoids
K-means is based on means so it's susceptible to outliers
PAM is like k-means but uses a medoid (an actual observation) instead of the mean
List the variable types
Continuous - numeric across any set of numbers
Ordinal - categorical but can be ranked eg grades
nominal -are categorical and cant be ranked
counts- are non negative integers (come from counting not ranking)
How do you test for a sig relationship between two nominal (categorical) variables?
Chi-squared - again interpret the p value (the p value is the probability of obtaining the sampled result under independence, so less than 0.05 means less than a 5% chance this is a false positive - low chance they are independent)
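The chi-squared statistic itself is simple to compute from a contingency table (the 2x2 counts are made up; the p value would come from the chi-squared distribution with (r-1)(c-1) DOF):

```python
def chi_square_stat(table):
    """Chi-squared statistic for an r x c contingency table of observed counts."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    grand = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / grand   # expected under independence
            stat += (obs - exp) ** 2 / exp
    return stat

chi2 = chi_square_stat([[30, 10], [20, 40]])
# df = (2-1)*(2-1) = 1; the 95% critical value for df=1 is 3.841
```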
Chi square limitations
should be used when total observations are greater than 50 and individual expected frequencies are no fewer than 5 - so for BIG samples
What is Fisher's exact test for
nominal/categorical variables with small sample sizes
What is Cochran-Mantel-Haenszel
tests whether 2 nominal variables are conditionally independent in each stratum of a 3rd variable
What is measure of association
for nominal variables - if you have a significant result from an independence test you can then test the strength of that relationship - e.g. after chi-squared
What is a mosaic plot
visualize data sets with 2 or more categorical variables - colors, shadings, size etc are all used to demonstrate things
What are generalized linear models vs logistic regression
linear models but for response variables whose distribution isn't normal
often the variable is categorical, like binary or different groupings (group A, group B)
or OUTCOME variables that are counts with a limited range, such as traffic accidents - often not normally distributed
LOGISTIC REGRESSION is used when the response is BINARY
Overdispersion what is it
when the observed variance is larger than expected, leading to inaccurate significance testing
can test with the deviance in R - if the ratio is close to 1 there is no overdispersion
What is a poisson regression
used when the response variable is a count of events - y is the response and x the predictor variable
interpret the results: the coefficients are on the log scale - e.g. if an x gets an estimate of 0.022, a 1-unit increase in x is associated with a 0.022 increase in the log mean of y
What is PCA and what is it used for
Unsupervised multivariate technique (encompasses simultaneous observation and analysis of more than one outcome) for high-dimensional data - used to identify patterns
every feature is used to calculate the principal components (so it's a dimension-reducing approach to summarize large data)
What type of data can be analyzed with PCA
multidimensional data sets (e.g. 2 groups, 3 reps each - biological reps, technical reps, profile analysis etc)
How is PCA done -on a base level
looks at the variability of a feature or variable across samples - and does that for each variable
plot all observations and draw the line of best fit - the one that maximizes the variance captured (equivalently, minimizes the perpendicular distances) - PC2 is made perpendicular to PC1 but calculated the same way - keep going until you stop
How do the PC’s in PCA compare
PC 1 is the most important and captures the most info
What is an issue with PCA
you give up some accuracy, since you are using less data by reducing it down (parsimony - want to explain the data with the fewest qualifiers)
How are PC’s calculated
based on the variance (which places it on the plot) and its magnitude (how much it influences the PC) - e.g. if looking at genes, those with the greatest variability have the greatest impact on the PCs
What do you need to consider before doing PCA
Scaling - to make variables and their magnitudes of influence comparable - e.g. log transformation, mean centering etc
Overall: TRANSFORM - CENTER - SCALE
transform: typically log
center: subtract the mean from each value; scale: divide by the stdev
What is a SCREE plot
Line graph showing the proportion of variability each PC accounts for
generally has an elbow shape, since the first one or two PCs capture most of the variance, with the first showing the most (the first 3 should cover ~80% or it's not a great PCA - maybe do something else)
What is a PCA scores plot
scores calculated for each PC plotted against each other (generally just PC1 and PC2 - kind of like a corrgram) - e.g. plot PC1 on the x and PC2 on the y to compare samples
What is a PCA LOADING plot
shows which features most greatly influence the PC scores (a point for each feature) - the farther from the origin, the greater the influence
What is a PCA biplot
combination of scores plot and loading plot (essentially superimposed on each other)
What are some outlier tests and what do they do
Dixon (Q) - single outlier - for small data sets
Grubbs - single outlier - assumes an otherwise normal distribution
Iglewicz-Hoaglin - robust test for multiple outliers - 2-sided - based on modified z-scores
What makes a method non-parametric vs robust
non-parametric methods use the median (no distribution assumed)
robust methods are based on the idea that the sample population is in fact NORMAL but has significant outliers
When to use non parametric
small data sets
non-normal or unknown distributions
categorical
What is the wilcoxon signed rank
the non-parametric version of the two-sample paired t test
What is the mann whitney U
the non-parametric version of the two-sample independent t test
What is kruskal wallis
non parametric ANOVA (one way)
What is SPearman rank correlation
non parametric pearson correlation
Local Regression LOWESS and LOESS
non-parametric local regression - fits a smooth curve through a scatter of points
What is regression used for
describe a relationship - give an equation
What is a residual in regression
signed difference between observed and fitted value
What is the correlation coefficient
degree of linear assoc between x and y variables
What is second order polynomial regression
DOF = n-3 and 3 params: a, b and c (y = cx^2 + bx + a)
Scatter Plot matrix
Plot linear relationship between a whole bunch of variables
Bonferroni-adjusted p value - what does it mean
adjust the p value based on the # of tests being done
What is hat statistic
average hat value is p/n - high values flag high-leverage points or outliers
Covariance what is
tells you how 2 data sets change together in tandem
Correlation
Tells you the degree to which a change in one variable is associated with a change in another
Covariacne vs correlation
covariance is affected by a change of scale
covariance keeps units
- both are 0 when the variables are independent
correlation is unitless and describes the degree to which 2 variables move in tandem
Assumptions for correlation
Normal dist
What is a corrgram
shows a bunch of variables against each other
like a scatter plot matrix but showing correlations
correlation does it equal causation?
no - causation means it causes it directly
What is ANOVA used for
analysis of variance - used when we have variance associated with 2 or more groups
tests whether the population means of the groups are all equal or not
data are grouped by a factor like dose
How do we calculate variance in anova (or what types are there)
there's within-group variance (e.g. within each analyst)
and between-group variance
What are assumptions for ANOVA
independence of observations
normality of residuals
homoscedasticity
ANova null and alt hypo
null: all means are the same - alt: at least one differs
What test do we get from ANOVA
we get F - compare F calc to F crit - if F calc is less than F crit there is no sig difference
What is a post hoc test
e.g. Tukey's HSD - tells you which groups are different
What is aconfounding factor
a variable that could also explain group differences in the dependent variable - we are not interested in it - it's a nuisance variable - we want to remove it
What type of anova deals with confounding factors?
ANCOVA - add your nuisance as a covariate
What is ANOVA with MULTIPLE dependant variables
MANOVA - multivariate analysis of variance
what is MANCOVA
multivariate with covariate
What are ANCOVA assumptions
Linearity between the covariate and the outcome variable at each level of the independent variable (so basically the covariate must in fact affect the dependent variable at each independent variable level)
Homogeneity of regression slopes - the slopes of the covariate against the outcome variable should be the same across groups (so basically no interaction between the independent variable and the covariate - same effect across all independent levels)
Outcome variable normally distributed
Homoscedasticity
What is 2 factor ANOVA
ANOVA but with subjects assigned to groups that are a cross-classification of two independent variables' levels
e.g. for TOEFL scores as the outcome
can initially have one independent variable - educational level (3 groups)
but then add another - learning style (4 groups)
2 way or factor ANOVA - assumptions
1) Dependent variable continuous
2) Both independent variables should have >= 2 levels
3) Independence of observations
4) Dependent variable normally distributed for each combo of independent variables
5) Homoscedasticity
6) Balanced design
What are 2 way ANOVA hypotheses
no difference in means of factor A
no difference in means of factor B
no interaction between A and B
What is 2 way factorial anova?
grouped again by 2 independent variables with cross-classification between the two (e.g. medicine type and also dose - so 2 med types and then 3 dosages for each med type)
Interaction plot
2 way anova plot
repeated measures vs replication
repeated measures are different measurements on the same subject; replication uses separate replicates
what is MANOVA
multivariate analysis of variance - means we have 2+ dependent variables (can combine with others: MANCOVA, 2-way MANOVA etc)
MANOVA assumptions
independent observations
no outliers for outcomes
multivariate normality
no multicollinearity (dependent variables can't be highly correlated)
linearity between all outcome variables for each group
homogeneity of variances
What is a Mahalanobis plot
if the data follow a multivariate normal dist, the points should fall on the line
How do you post hoc Manova
univariate one way anova for each outcome
or TUKEY
what are the 5 descriptive stats:
Frequency/counts (mode)
central tendency/location (mean)
dispersion (stdev)
position (quartiles - medians etc)
shape of observations (skew and kurtosis)
Steps for sig testing
state null hypo
state alternate hypo
check dist
select test
choose sig level
calc stat
obtain crit value and compare
What is cohens D
measure effect size
What is distribution
function that shows the possible values for a variable and their tendency to occur
normal dist stdev
68% within 1 stdev, 95% within 2, 99.7% within 3
what is a z score
(value - mean) / stdev (basically normalizes the value to the dist)
How to get the probability of a value occurring
get the z value - it describes the area to the left - look it up in a table
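The table lookup can be replaced by the standard normal CDF (the mean, stdev and value below are illustrative assumptions):

```python
import math

def z_score(x, mu, sigma):
    """Standardize a value against its distribution."""
    return (x - mu) / sigma

def phi(z):
    """Standard normal CDF: the area to the left of z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

z = z_score(110, mu=100, sigma=10)   # one stdev above the mean
p_left = phi(z)                      # area to the left, ~0.84
```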
what is central limit theorem
the dist of sample means gets closer and closer to normal the bigger the sample size
What is skew
tailing and fronting (asymmetry of the distribution)
kurtosis
is peakedness - can be narrow (peaked) or flat
How to test for normality
graphically: histogram, QQ plot
stat tests: Anderson-Darling etc