CHEMOMETRICS Flashcards
When do we use permutation tests or bootstrapping
when the observed data is sampled from an unknown or mixed distribution
when sample sizes are low
when outliers are a problem
when the distribution is too complex to estimate
Note: these are alternatives to parametric approaches
How do permutation tests work? On what basis?
If testing whether groups A and B are different, assume that if A and B are the same then the group labels don't matter
Steps:
1) calculate the observed test statistic (can be any statistic - t, ANOVA F, etc.) - called t0
2) place all observations in a single group
3) randomly assign them to groups of equal size
4) calculate the new test stat
5) repeat - for every possible placement into groups
6) arrange all the test stats in ascending order - this is an empirical distribution based on the data
7) if t0 falls outside the middle 95% of the empirical distribution then reject the null hypothesis
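The steps above can be sketched in Python (the course examples use R, but the logic is identical; the data and the difference-of-means statistic are illustrative assumptions):

```python
import random
from statistics import mean

def permutation_test(a, b, n_perm=10_000, seed=0):
    """Approximate two-sided permutation test on the difference of means."""
    rng = random.Random(seed)
    observed = mean(a) - mean(b)          # t0: the observed test statistic
    pooled = list(a) + list(b)            # pool all observations into one group
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)               # random relabelling of the pooled data
        perm_a = pooled[:len(a)]
        perm_b = pooled[len(a):]
        stat = mean(perm_a) - mean(perm_b)
        if abs(stat) >= abs(observed):    # as or more extreme than t0
            count += 1
    return count / n_perm                 # empirical two-sided p value

p = permutation_test([1.1, 1.3, 1.2, 1.4], [2.0, 2.2, 2.1, 2.3])
```

Because it samples permutations rather than enumerating all of them, this is an approximate (Monte Carlo) test, not an exact one.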
What is an exact test vs approximate test in permutations?
an exact test uses all possible combinations, whereas an approximate test randomly samples only some of them
What is Bootstrapping
Generates an empirical distribution, but by resampling the original sample with replacement - basically make a bunch of data sets with the same # of samples using those original values, which is the equivalent of running the experiment a bunch of times - this way we can see where the statistic really lies instead of having just one set
(again can do with any stat)
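A minimal bootstrap sketch in Python (illustrative data; the percentile interval shown is one of several bootstrap CI methods):

```python
import random
from statistics import mean

def bootstrap_ci(sample, stat=mean, n_boot=5000, alpha=0.05, seed=1):
    """Percentile bootstrap confidence interval for any statistic."""
    rng = random.Random(seed)
    n = len(sample)
    boots = sorted(
        stat([rng.choice(sample) for _ in range(n)])  # resample with replacement
        for _ in range(n_boot)
    )
    lo = boots[int((alpha / 2) * n_boot)]
    hi = boots[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

data = [9.8, 10.1, 10.3, 9.9, 10.0, 10.2, 9.7, 10.4]
low, high = bootstrap_ci(data)
```

Swapping `stat` for the median, stdev, or any other function shows why "can do with any stat" holds.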
What is Jackknifing
It's a means to estimate variance by subsampling (leaving samples out of the set one at a time)
What is K fold cross validation
used to validate a predictive model - splits the data into K subsets, each held out in turn as a validation set to test on
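The fold bookkeeping can be sketched as below (index handling only - model fitting and validation would happen inside a loop over the splits):

```python
def kfold_indices(n, k):
    """Yield (train, validation) index lists for K-fold cross-validation."""
    # distribute n samples over k folds as evenly as possible
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    for i, val in enumerate(folds):
        # every fold except fold i forms the training set
        train = [idx for j, f in enumerate(folds) if j != i for idx in f]
        yield train, val

splits = list(kfold_indices(10, 5))
```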
What is a time series?
longitudinal data sets - over time - plot the data (what happened) but also try to predict what happens next (forecast)
What are the steps in time series analysis
1) visualize data
2) smooth/clean the data
3) decomposition - e.g. seasonal data (monthly or quarterly) can be decomposed into a trend component (change in level over time) and a seasonal component
4) show the irregular component (what's left after removing trend and seasonal)
What are trends people see in time series
They see additive trend (increase over time)
Additive seasonal (it goes up and down with the seasons - almost sinusoidal)
and multiplicative trend (the seasonal swings get larger/wider over time)
How are things smoothed in time series
moving average - average the points next to you - k = how many points in the window
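A centered simple moving average can be sketched as (toy series; k assumed odd):

```python
def moving_average(series, k=3):
    """Centered simple moving average with an odd window size k."""
    half = k // 2
    out = []
    # the first and last `half` points have no full window and are dropped
    for i in range(half, len(series) - half):
        window = series[i - half:i + half + 1]
        out.append(sum(window) / k)
    return out

smoothed = moving_average([2, 4, 6, 8, 10, 12], k=3)
```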
Exponential forecasting models
single - a series with constant level and irregular component (no trend or seasonal)
Double (holt) - exponential- series with a level and a trend
Triple (Holt Winters) exponential- series with level, trend and seasonal
Types of Error
I - alpha rejection of true null hypothesis (false positive)
II - beta - non-rejection of a false null hypothesis (false negative)
What is LOD
lowest amount of analyte in sample that can be detected WITHIN a specific confidence level
is LOD agreed upon?
no - typically a signal-to-noise (s/n) relationship is used
Draw curves for signal to noise and blank and what shades represent what
Used for LOD determination - want the signal to be about 3x the stdev of the blank
The blank has its own distribution - we want the lowest signal we analyze to sit above it, but how much overlap between the two distributions is acceptable?
Ideally about 5% overlap, which takes roughly 3.3 stdev of separation - the portion of the sample distribution that falls below the decision threshold is the BETA rate - false negatives
and the portion of the blank distribution that falls above the threshold is the alpha rate - false positives
So a separation of about 3.3 stdev of the blank is used to achieve ~5% type I and type II error (type I comes from the blank distribution, type II from the sample distribution)
LOQ vs LOD
10x the noise (vs 3.3x for LOD)
Calculate LOD or LOQ from signal to noise
need to use it with another method to verify
it's the blank mean + either 3 (LOD) or 10 (LOQ) * stdev
with a linear cal curve it's 3.3 or 10 * stdev / b
where b is the slope of the linear regression
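The calibration-curve formulas above as a one-liner sketch (the sd_blank and slope values are made up for illustration):

```python
def lod_loq(sd_blank, slope):
    """LOD = 3.3 * s / b and LOQ = 10 * s / b for a linear calibration with slope b."""
    lod = 3.3 * sd_blank / slope
    loq = 10 * sd_blank / slope
    return lod, loq

# hypothetical blank stdev of 0.02 signal units, calibration slope 0.5 signal/conc
lod, loq = lod_loq(sd_blank=0.02, slope=0.5)
```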
What are selectivity and specificity
selectivity - ability of a method to determine the analyte in a complex matrix without interference
Specificity - confirms the method's ability to assess the analyte in the presence of any other components that might be present (including matrix)
so specificity is selectivity taken to its extreme
Accuracy vs precision
accuracy - trueness or bias - a measure of systematic error compared to a reference
Precision -closeness of repeated individual measurements under specified conditions
How to run accuracy and precision tests
against standard material - want accuracy within and between runs (bias) - use a low and a high QC
Precision - use % CV
ROBUST what is it
capacity of a method to be unaffected by natural variation - test over a range of parameters
UNCERTAINTY
significant sources must be identified and tabulated
2 types
A and B
A is random
B is systematic
examples - user skill, sampling, environment, instrument, etc
Stability
use QC samples - store at room temp, 4 C etc and test against fresh
HOW DO WE HANDLE NON DETECTS
Exclude or delete from the data set (worst option)
Substitute (0, 1/2 LOD, LOD etc)
Left- and right-censored indicate whether the unknown is below or above the measurable range
What is survival analysis and NADA
- how long will it be before an event occurs (e.g. medical survival times)
NADA is Nondetects And Data Analysis
What is fit for purpose
ensures the analytical method meets certain criteria of reliability and can perform - gives us confidence, shows it's reproducible and repeatable
(so as a list
reproducible, broad coverage - sensitivity and selectivity, linearity, precision, stable)
What are some key points to precision testing
sample should be stable and homogeneous (representative of what's being tested)
Should be applied to the whole sample preparation method analysis procedure
2 factors - precision estimate and design of precision experiment
What are the types of precision estimates
1) REPEATABILITY -within batch or intra assay - one analyst on the same equipment over a short time period
2) Intermediate precision - made in a single lab but under variable conditions - different days, analysts, equipment etc - within-lab reproducibility
3) Reproducibility - different labs - different equipment (interlab)
Types of precision experiments
Simple replication - repeated measurements on a suitable sample - want 6-15 reps
Nested design - used when you can't generate enough reps with simple replication (not feasible) - basically each batch has different params - so can be inter-lab, intra-lab etc
PRECISION limits - what are they and how to calc
repeatability limit: r = t * sqrt(2) * sr
a 95% confidence interval for the difference between two results obtained under repeatability conditions
reproducibility limit: R = t * sqrt(2) * sR
t is the two-tailed Student's t for the confidence level and DOF
They are calculated by multiplying the repeatability standard deviation (sr) or the reproducibility standard deviation (sR) by 2.8 respectively. The factor 2.8 is derived from 1.96 (95% of the population is within 1.96 standard deviations of the mean) times the square root of 2.
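The limit calculation as a sketch (t = 1.96 assumes the large-DOF 95% case, which gives the 2.8 factor mentioned above; the sd value is illustrative):

```python
import math

def precision_limit(sd, t=1.96):
    """Repeatability/reproducibility limit: t * sqrt(2) * sd (~= 2.8 * sd at 95%)."""
    return t * math.sqrt(2) * sd

r = precision_limit(0.15)   # repeatability limit from a hypothetical s_r = 0.15
```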
How to statistically evaluate precision estimates
F test
What is bias and how calculated and how evaluated
difference from the true value - so just mean minus accepted value - can be expressed as a %
t test statistic
Ruggedness study - how evaluated/set up
PLACKETT-BURMAN -
7 parameters to study - you pick them (e.g. extraction time), each with two levels (e.g. 30 min extraction vs 10 min)
to investigate an effect - take the difference between the average of results with the parameter at the normal level and the average at the alternate level
Measurement of uncertainty - what is and how tested
dispersion of the values possible for a measurement - e.g. stdev
can be propagated
ROC curves
Receiver operating characteristic - evaluates the prediction accuracy of a classifier model
a ROC curve is a plot of TP rate against FP rate across all classification thresholds
- tradeoff between sensitivity and specificity - the same picture as the LOD vs blank curves
formulas:
TPR = TP / all actual positives (TP + FN)
FPR = FP / all actual negatives (FP + TN)
we often summarize with AUC - area under the curve
AUC ranges from 0-1 - a model with 100% wrong predictions has an AUC of 0 and one with 100% right predictions has an AUC of 1
https://en.wikipedia.org/wiki/Receiver_operating_characteristic
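AUC can be computed without drawing the curve, via the equivalent rank (Mann-Whitney) formulation: the probability that a randomly chosen positive scores above a randomly chosen negative. A sketch with made-up scores:

```python
def roc_auc(scores, labels):
    """AUC as P(score of a positive > score of a negative); ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))

auc = roc_auc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0])   # perfectly ranked example
```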
What is metrology
formal system to enable informed decisions through data assessment - puts levels of confidence on what we're doing - a reliable network of measurements used to confidently make assessments about concentration
3 fundamentals of metrology
1) TRACEABILITY - to SI - translates units to results - go from standards to where you are (higher-order standards down to CRMs)
2) UNCERTAINTY (of measurement) - using the spread in results to make claims, not individual points - the dist
3) VALIDATION (of methods etc)
What is QA and exmaples
the planned and systematic activities implemented in a quality system so that quality
requirements for a service will be fulfilled; quality assurance occurs before the data is collected
eg suitable lab environment, educated staff, training, documented and validated methods, preventative actions, etc
What is QC
Quality Control: the observation techniques and activities used to evaluate and report quality,
quality control occurs during and after data is collected
examples: blanks, spiked samples, controls, reference materials etc
Example of system suitabiltiy testing
small number of standards - acquire data for accuracy and precision - not on bio samples; m/z accuracy, RT and peak shape are assessed
Dispersion ratio - what is it
compares the stdev of the pooled QC sample to that of the test samples - the D-ratio (MAD of QC / MAD of samples)
a D-ratio of 0% means technical variance is 0 - perfect measurement, all changes are due to biological causes
a D-ratio of 100% means all variance is due to noise - no bio info
Whats pooled QC
generate a single QC sample that can be distributed evenly throughout the analytical batch
Batch design and pooled QC - what can you do
basically run the pooled QC throughout the batch to see time-based variance
Main uses of reference materials
examine skills of analysts
controls
precision and accuracy
accreditation
measurement uncertainty estimation
How to score profiency testing results
2 steps - specify the assigned value
and set the standard deviation
ASSIGNED VALUE - can be known (CRM), a REFERENCE value (determined by one lab), or determined from the consensus of participating labs
STDEV - set by the scheme organizer - by prescription, from the results of a reproducibility experiment, or from a general model (e.g. the Horwitz function)
What is Z and Q in proficiency testing
Z score is what we think - (value minus assigned value) divided by stdev
|Z| less than 2 is pretty good - between 2 and 3 is questionable
Q score - alternative to Z - takes no account of stdev - the dist of Q is then centered on 0 - relies on an EXTERNAL PRESCRIPTION of acceptability
WHat is a YOUDEN plot
scatter plot - plots results from multiple labs on one graph to show
whether labs agree, outliers, inconsistencies etc
x and y each represent one of the reported values (e.g. concentration of analytes A and B)
draw lines parallel to the x and y axes - where points fall relative to them indicates various things about the results - e.g. random error vs systematic error
SHEWHART plot
sequential plot of observations from a QC material analyzed in succession - mean QC for each run vs measurement # (y axis shows the mean)
General princiiples of experimental design
Research method where you manipulate independent variables and observe the dependent variable
things to do:
arrange experiments for cancellation or comparison - to control bias
plan to do replication or independent uncertainty estimates (precision)
Need statistical analysis or approach
Experimental designs - list 4
Simple replication - series of observations on a single test material
Linear calibration design - observations at a range of levels (of some quantitative factor)
Nested - has levels of factors unique to each level above it
Factorial - has factors and levels that are not wholly distinct - e.g. one group can get one factor, another can get another, and another group can get both factors at once
Why do we randomize in experimental design
to minimize nuisance effects - unwanted effects that influence the results - e.g. so results aren't affected by run/sampling order
What is blocking
Basically have all replicates/groups of test materials subject to the same nuisance effects - e.g. with sets a, b and c we can run them separately or run all in the same trial so they are subject to the same effects
Sampling theory (define randomization, representation and composite)
Randomization - every member of the population has an equal chance of selection
Representation - sample enough of a population to draw inference on the total population
Composite - reduce effort by combining individuals to make a subset
List different sampling strategies
Simple - everything has an equal chance (easy but not great for long continuous sequences; also doesn't reflect subgroups in the population)
Stratified - divide the population into segments and randomly sample each segment - good because it minimizes variance further - can capture unique pockets
Systematic - first select a random start m, then sample at a fixed interval - simple and easy - covers everything regularly - but cannot deal with variation that matches the interval - will miss it
4 quantities of power analysis
sample size
significance level (alpha - probability of making a type I error)
Power - one minus the probability of making a type II error (probability of finding an effect that is there)
effect size - magnitude of the effect under the alternative hypothesis
How do you determine how many participants are needed for a study
power.t.test in R (similar functions in the pwr package) - uses sig level, power, effect size etc
versions exist for various tests - ANOVA, linear regression, chi-squared etc - need means, common error variance etc
What is proportionality constant k
basically the signal from the instrument = the concentration * this factor (S = kC)
What is single point cal
basically just using this proportionality factor with a single standard (S = k*C) - which by default forces the line through 0
Sensitivity from calibration curve
sensitivity is the slope b - the capability of responding reliably across changes in analyte concentration
What is r in cal curve -
It's the Pearson correlation coefficient - describes the relationship of response and concentration - ranges from 1 to -1
R^2 measures how close the data fit the linear model - 99% means 99% of the variability in our response is accounted for by changes in concentration
How to evaluate matrix effect
take the sample matrix - extract it and spike the analyte in - compare to a normal standard solution: (response in matrix / response in solvent) - 1 - a negative value means suppression
OR do spiked recovery - compare unspiked matrix to spiked matrix (same matrix)
this is (C spiked - C unspiked) / C added x 100
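Both calculations as a sketch (all response and concentration values are made up for illustration):

```python
def matrix_effect_pct(resp_matrix_spike, resp_solvent_std):
    """Matrix effect %: (response in extracted matrix / response in solvent - 1) * 100."""
    return (resp_matrix_spike / resp_solvent_std - 1) * 100

def spike_recovery_pct(conc_spiked, conc_unspiked, conc_added):
    """Spiked recovery %: (spiked - unspiked) / added * 100."""
    return (conc_spiked - conc_unspiked) / conc_added * 100

me = matrix_effect_pct(80.0, 100.0)        # negative => ion suppression
rec = spike_recovery_pct(14.5, 5.0, 10.0)  # recovery of a 10-unit spike
```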
Types of blanks -
method blank - unspiked sample
reagent blank - just solvent
field blank - unspiked sample that goes on the trip (trip blank is the same but unopened)
Weighted regression
error in a measurement is proportional to concentration, so larger concentrations carry more error - we give higher weights to the points where the error bars are smallest
Methods of standard addition
make the cal curve in the sample itself - spike increasing amounts of standard into the sample
ISTD
structural analogue, or stable isotope labeled
Isotope dilution
basically do consecutive dilutions to make the internal standard series - same ISTD in all samples
Multi LDR
basically if one linear range is not large enough - make 2 curves
OMICS quantitation - how
no cal curve, so use: response IS / conc IS = response target / conc target
What is a neural net
series of algorithms designed to recognize underlying relationships in a large data set (input layer, hidden layer, output layer)
What is machine learning
a computer program that improves its performance on a task through experience
4 ingredients of machine learning
1) data
2) a model that specifies how input data relate to output
3) a loss function - shows how well model performs
4) optimization algorithm - so it can improve the model and minimize the loss function
What is overfitting (also underfitting)
your model matches the training set too closely and isn't generalizable (underfitting is when it's too loose - wrong assumptions made)
What is supervised machine learning and what are common types
You tell it to develop a model based on input AND output
eg classification or regression
What is a decision tree
binary splits on predictor variables to create a tree and classify observations into one of two groups (repetitively)
this way we can choose the predictor that best splits the two groups - want HOMOGENEITY in each group maximized (i.e. the groupings make sense)
What is a conditional inference tree
splits based on significance tests
What is Random Forest
Ensemble learning approach - combines multiple learners (decision trees) to improve classification rates
What are support vector machines
SVMs are SUPERVISED machine learning - for classification and regression - seek the optimal hyperplane separating two classes in multidimensional space
What is a confusion matrix
a matrix of basically :
True Neg False Pos
False Neg True Pos
Stats from a confusion matrix
Sensitivity - TP / (total actual positives (FN+TP))
Specificity - TN / (total actual negatives (TN+FP))
False positive rate - FP / (TN+FP)
Precision = TP / (TP+FP)
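The four rates as a sketch (the counts are made up for illustration):

```python
def confusion_stats(tp, fp, tn, fn):
    """Common rates derived from a 2x2 confusion matrix."""
    return {
        "sensitivity": tp / (tp + fn),   # true positive rate (recall)
        "specificity": tn / (tn + fp),   # true negative rate
        "fpr":         fp / (fp + tn),   # false positive rate = 1 - specificity
        "precision":   tp / (tp + fp),
    }

stats = confusion_stats(tp=40, fp=10, tn=45, fn=5)
```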
What is PLS-DA
supervised pattern recognition - partial least squares discriminant analysis - asks if groups are different and which features explain the difference
How do distance based clustering methods work
1- calculate the centroid of each group
2- distance from each point to centroid of each group is calculated
3- sample assigned to group of closest centroid
Clustering - is it supervised or un - and describe it
unsupervised - a data reduction technique - exactly what it sounds like - cluster your observations
2 types of clustering - bottom up (agglomerative - start with each point separate and merge) and top down (divisive - start with one cluster and split)
How to normalize for clustering
scale
standardize to a mean of 0 and sd of 1
divide by max
Common steps for clustering
normalize
screen for outliers
calculate distances
What is a dendrogram
hierarchical clustering displayed as a tree - clade-like
pros and cons of dendrograms
finds compact clusters, but sensitive to outliers - need to pick an interpretation that makes the clustering make sense
What are the different linkage types
single, complete, average , centroid
How to interpret dendrograms
height indicates the order joined - read from the bottom up - height reflects distance
What is k means clustering
select k centroids - assign each data point to the closest centroid - recalculate the centroids as the average of all data points in a cluster
reassign each data point to the closest centroid - repeat those two steps until observations are no longer reassigned
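A plain k-means sketch on 2-D points following those steps (toy data; real implementations such as R's kmeans() typically add multiple random starts):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain k-means on 2-D points: assign to nearest centroid, then update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)                # pick k starting centroids
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                             # assign to nearest centroid
            d = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[d.index(min(d))].append(p)
        new = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centroids[i]                   # keep old centroid if empty
            for i, c in enumerate(clusters)
        ]
        if new == centroids:                         # assignments have settled
            break
        centroids = new
    return centroids, clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(pts, k=2)
```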
Partitioning around medoids
K-means is based on means so it's susceptible to outliers
PAM is like k-means but uses a medoid (an actual observation) instead of the mean
List the variable types
Continuous - numeric across any set of numbers
Ordinal - categorical but can be ranked eg grades
nominal -are categorical and cant be ranked
counts- are non negative integers (come from counting not ranking)
How do you test for a sig relationship between two nominal (categorical) variables?
Chi-squared - again interpret the p value (the p value is the probability of obtaining the sampled result under independence, so less than 0.05 means less than a 5% chance this is a false positive - low chance they are independent)
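The chi-squared statistic itself is simple to compute from a contingency table (the 2x2 counts are made up; the p value would come from the chi-squared distribution with (r-1)(c-1) DOF):

```python
def chi_square_stat(table):
    """Chi-squared statistic for an r x c contingency table of observed counts."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    grand = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / grand   # expected under independence
            stat += (obs - exp) ** 2 / exp
    return stat

chi2 = chi_square_stat([[30, 10], [20, 40]])
# df = (2-1)*(2-1) = 1; the 95% critical value for df=1 is 3.841
```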
Chi square limitations
should be used when total observations are greater than 50 and individual expected frequencies are no fewer than 5 - so for BIG samples
What is Fisher's exact test for
nominal/categorical variables with small sample sizes
What is Cochran-Mantel-Haenszel
tests whether 2 nominal variables are conditionally independent in each stratum of a 3rd variable
What is measure of association
for nominal variables - if you have a significant result from an independence test you can then test the strength of that relationship - e.g. after chi-squared
What is a mosaic plot
visualize data sets with 2 or more categorical variables - colors, shadings, size etc are all used to demonstrate things
What are generalized linear models vs logistic regression
linear models but for response variables whose distribution isn't normal
often the variable is categorical, like binary or different groupings (group A, group B)
or OUTCOME variables that are counts with a limited range, such as traffic accidents - often not normally distributed
LOGISTIC REGRESSION is used when the response is BINARY
Overdispersion what is it
when the observed variance is larger than expected, leading to inaccurate significance testing
can test with the deviance in R - if the ratio is close to 1 there is no overdispersion
What is a poisson regression
used when the response variable is a count of events - y is the response and x the predictor variable
interpret the results: the coefficients are on the log scale - e.g. if an x gets an estimate of 0.022, a 1-unit increase in x is associated with a 0.022 increase in the log mean of y
What is PCA and what is it used for
Unsupervised multivariate technique (encompasses simultaneous observation and analysis of more than one outcome) for high-dimensional data - used to identify patterns
every feature is used to calculate the principal components (so it's a dimension-reducing approach to summarize large data)
What type of data can be analyzed with PCA
multidimensional data sets (e.g. 2 groups, 3 reps each - biological reps, technical reps, profile analysis etc)
How is PCA done -on a base level
looks at the variability of a feature or variable across samples - and does that for each variable
plot all observations and draw the line of best fit - the one that maximizes the variance captured (equivalently, minimizes the perpendicular distances) - PC2 is made perpendicular to PC1 but calculated the same way - keep going until you stop
How do the PC’s in PCA compare
PC 1 is the most important and captures the most info
What is an issue with PCA
you give up some accuracy, since you are using less data by reducing it down (parsimony - want to explain the data with the fewest qualifiers)
How are PC’s calculated
based on the variance (which places it on the plot) and its magnitude (how much it influences the PC) - e.g. if looking at genes, those with the greatest variability have the greatest impact on the PCs
What do you need to consider before doing PCA
Scaling - to make variables and their magnitudes of influence comparable - e.g. log transformation, mean centering etc
Overall: TRANSFORM - CENTER - SCALE
transform: typically log
center: subtract the mean from each value; scale: divide by the stdev
What is a SCREE plot
Line graph showing the proportion of variability each PC accounts for
generally has an elbow shape, since the first one or two PCs capture most of the variance, with the first showing the most (the first 3 should cover ~80% or it's not a great PCA - maybe do something else)
What is a PCA scores plot
scores calculated for each PC plotted against each other (generally just PC1 and PC2 - kind of like a corrgram) - e.g. plot PC1 on the x and PC2 on the y to compare samples
What is a PCA LOADING plot
shows which features most greatly influence the PC scores (a point for each feature) - the farther from the origin, the greater the influence
What is a PCA biplot
combination of scores plot and loading plot (essentially superimposed on each other)
What are some outlier tests and what do they do
Dixon (Q) - single outlier - for small data sets
Grubbs - single outlier - assumes an otherwise normal distribution
Iglewicz-Hoaglin - robust test for multiple outliers - 2-sided - based on modified z-scores
What makes a method non-parametric vs robust
non-parametric methods use the median (no distribution assumed)
robust methods are based on the idea that the sample population is in fact NORMAL but has significant outliers
When to use non parametric
small data sets
non-normal or unknown distributions
categorical
What is the wilcoxon signed rank
the non-parametric version of the two-sample paired t test
What is the mann whitney U
the non-parametric version of the two-sample independent t test
What is kruskal wallis
non parametric ANOVA (one way)
What is SPearman rank correlation
non parametric pearson correlation
Local Regression LOWESS and LOESS
non-parametric local regression - fits a smooth curve through a scatter of points
What is regression used for
describe a relationship - give an equation
What is a residual in regression
signed difference between observed and fitted value
What is the correlation coefficient
degree of linear assoc between x and y variables
What is second order polynomial regression
DOF = n-3 and 3 params: a, b and c (y = cx^2 + bx + a)
Scatter Plot matrix
Plot linear relationship between a whole bunch of variables
Bonferroni-adjusted p value - what does it mean
adjust the p value based on the # of tests being done
What is hat statistic
average hat value is p/n - high values flag high-leverage points or outliers
Covariance what is
tells you how 2 data sets change together in tandem
Correlation
Tells you the degree to which a change in one variable is associated with a change in another
Covariacne vs correlation
covariance is affected by a change of scale
covariance keeps units
- both are 0 when the variables are independent
correlation is unitless and describes the degree to which 2 variables move in tandem
Assumptions for correlation
Normal dist
What is a corrgram
shows a bunch of variables against each other
like a scatter plot matrix but showing correlations
correlation does it equal causation?
no - causation means it causes it directly
What is ANOVA used for
analysis of variance - used when we have variance associated with 2 or more groups
tests whether the population means of the groups are all equal or not
data are grouped by a factor like dose
How do we calculate variance in anova (or what types are there)
there's within-group variance (e.g. within each analyst)
and between-group variance
What are assumptions for ANOVA
independence of observations
normality of residuals
homoscedasticity
ANova null and alt hypo
null: all means are the same - alt: at least one differs
What test do we get from ANOVA
we get F - compare F calc to F crit - if F calc is less than F crit there is no sig difference
What is a post hoc test
e.g. Tukey's HSD - tells you which groups are different
What is aconfounding factor
a variable that could also explain group differences in the dependent variable - we are not interested in it - it's a nuisance variable - we want to remove it
What type of anova deals with confounding factors?
ANCOVA - add your nuisance as a covariate
What is ANOVA with MULTIPLE dependant variables
MANOVA - multivariate analysis of variance
what is MANCOVA
multivariate with covariate
What are ANCOVA assumptions
Linearity between the covariate and the outcome variable at each level of the independent variable (so basically the covariate must in fact affect the dependent variable at each independent variable level)
Homogeneity of regression slopes - the slopes of the covariate against the outcome variable should be the same across groups (so basically no interaction between the independent variable and the covariate - same effect across all independent levels)
Outcome variable normally distributed
Homoscedasticity
What is 2 factor ANOVA
ANOVA but with subjects assigned to groups that are a cross-classification of two independent variables' levels
e.g. for TOEFL scores as the outcome
can initially have one independent variable - educational level (3 groups)
but then add another - learning style (4 groups)
2 way or factor ANOVA - assumptions
1) Dependent variable continuous
2) Both independent variables should have >= 2 levels
3) Independence of observations
4) Dependent variable normally distributed for each combo of independent variables
5) Homoscedasticity
6) Balanced design
What are 2 way ANOVA hypotheses
no difference in means of factor A
no difference in means of factor B
no interaction between A and B
What is 2 way factorial anova?
grouped again by 2 independent variables with cross-classification between the two (e.g. medicine type and also dose - so 2 med types and then 3 dosages for each med type)
Interaction plot
2 way anova plot
repeated measures vs replication
repeated measures are different measurements on the same subject; replication uses separate replicates
what is MANOVA
multivariate analysis of variance - means we have 2+ dependent variables (can combine with others: MANCOVA, 2-way MANOVA etc)
MANOVA assumptions
independent observations
no outliers for outcomes
multivariate normality
no multicollinearity (dependent variables can't be highly correlated)
linearity between all outcome variables for each group
homogeneity of variances
What is a Mahalanobis plot
if the data follow a multivariate normal dist, the points should fall on the line
How do you post hoc Manova
univariate one way anova for each outcome
or TUKEY
what are the 5 descriptive stats:
Frequency/counts (mode)
central tendency/location (mean)
dispersion (stdev)
position (quartiles - medians etc)
shape of observations (skew and kurtosis)
Steps for sig testing
state null hypo
state alternate hypo
check dist
select test
choose sig level
calc stat
obtain crit value and compare
What is cohens D
measure effect size
What is distribution
function that shows the possible values for a variable and their tendency to occur
normal dist stdev
68% within 1 stdev, 95% within 2, 99.7% within 3
what is a z score
(value - mean) / stdev (basically normalizes the value to the dist)
How to get the probability of a value occurring
get the z value - it describes the area to the left - look it up in a table
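The table lookup can be replaced by the standard normal CDF (the mean, stdev and value below are illustrative assumptions):

```python
import math

def z_score(x, mu, sigma):
    """Standardize a value against its distribution."""
    return (x - mu) / sigma

def phi(z):
    """Standard normal CDF: the area to the left of z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

z = z_score(110, mu=100, sigma=10)   # one stdev above the mean
p_left = phi(z)                      # area to the left, ~0.84
```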
what is central limit theorem
the dist of sample means gets closer and closer to normal the bigger the sample size
What is skew
tailing and fronting (asymmetry of the distribution)
kurtosis
is peakedness - can be narrow (peaked) or flat
How to test for normality
graphically: histogram, QQ plot
stat tests: Anderson-Darling etc