CHEMOMETRICS Flashcards

1
Q

When do we use permutation tests or bootstrapping

A

When the observed data are sampled from an unknown or mixed distribution
When sample sizes are small
When outliers are a problem
When the distribution is too complex to estimate
Note: this is an alternative to classical non-parametric approaches

2
Q

How do permutation tests work, and on what basis?

A

Basis: if groups A and B are the same, then the group labels don't matter
So, to test whether groups A and B are different:
1) calculate the observed test statistic, called t0 (for example a t statistic, but it can be anything: a non-parametric statistic, an ANOVA F, etc.)
2) pool all observations into a single group
3) randomly reassign them to groups of the same sizes as the originals
4) calculate the new test statistic
5) repeat for every possible random placement into groups
6) arrange all the test statistics in ascending order - this is an empirical distribution based on the data
7) if t0 falls outside the middle 95% of the empirical distribution, reject the null hypothesis
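
The steps above can be sketched in Python. This is a minimal illustration, not a reference implementation: it uses the difference in means as the statistic and draws random relabelings (the approximate version) rather than enumerating all of them. The function name, the data and the seed are made up for the example.

```python
import random
from statistics import mean

def permutation_test(a, b, n_perm=5000, seed=0):
    """Approximate two-sided permutation test on the difference in means."""
    rng = random.Random(seed)
    t0 = mean(a) - mean(b)              # step 1: observed statistic
    pooled = list(a) + list(b)          # step 2: one single group
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)             # step 3: random relabeling
        t = mean(pooled[:len(a)]) - mean(pooled[len(a):])   # step 4
        if abs(t) >= abs(t0):           # at least as extreme as observed
            extreme += 1
    # add 1 to both counts so the observed labeling is included
    return t0, (extreme + 1) / (n_perm + 1)

group_a = [5.1, 5.3, 5.0, 5.4, 5.2]
group_b = [5.9, 6.1, 6.0, 5.8, 6.2]
t0, p = permutation_test(group_a, group_b)
print(t0, p)
```

Counting how often a permuted statistic is at least as extreme as t0 is equivalent to checking where t0 falls in the sorted empirical distribution.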

3
Q

What is an exact test vs approximate test in permutations?

A

An exact test evaluates every possible relabeling, whereas an approximate test evaluates only a random sample of them

4
Q

What is Bootstrapping

A

Generates an empirical distribution by resampling the original sample with replacement - make many data sets with the same number of observations, drawn from the original values. This mimics running the experiment many times, so we can see how much a statistic varies instead of relying on just one data set
(again, this works with any statistic)
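
As a rough sketch of the idea, the following Python builds a percentile bootstrap confidence interval. The percentile method, the function name and the data are assumptions chosen for illustration; other interval types exist.

```python
import random
from statistics import mean

def bootstrap_ci(data, stat=mean, n_boot=5000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for any statistic."""
    rng = random.Random(seed)
    n = len(data)
    stats = sorted(
        stat(rng.choices(data, k=n))    # resample WITH replacement, same size
        for _ in range(n_boot)
    )
    lo = stats[int((1 - level) / 2 * n_boot)]
    hi = stats[int((1 + level) / 2 * n_boot) - 1]
    return lo, hi

sample = [4.8, 5.1, 5.0, 5.3, 4.9, 5.2, 5.6, 5.0]
low, high = bootstrap_ci(sample)
print(low, high)
```

Passing a different function as `stat` (for example `statistics.median`) gives the interval for that statistic instead.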

5
Q

What is Jackknifing

A

It's a means of estimating the variance (and bias) of a statistic by subsampling: systematically leave out one observation at a time and recompute the statistic on each reduced set
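
A minimal sketch of the leave-one-out jackknife in Python, assuming the goal is the standard error of a statistic. The function name is made up for the example.

```python
from statistics import mean

def jackknife_se(data, stat=mean):
    """Leave-one-out jackknife estimate of the standard error of a statistic."""
    n = len(data)
    # recompute the statistic n times, each time leaving out one observation
    loo = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    centre = mean(loo)
    variance = (n - 1) / n * sum((x - centre) ** 2 for x in loo)
    return variance ** 0.5

se = jackknife_se([2.0, 4.0, 4.0, 6.0])
print(se)
```

For the mean, this reproduces the familiar stdev / sqrt(n).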

6
Q

What is K fold cross validation

A

Used to validate a predictive model - the data are split into K subsets (folds); each fold is held out in turn as the validation set while the model is trained on the remaining K-1 folds

7
Q

What is a time series?

A

A longitudinal data set - observations recorded over time - analysis describes the data (what happened) and also tries to predict what happens next (forecast)

8
Q

What are the steps in time series analysis

A

1) visualize the data
2) smooth / clean
3) decompose (e.g. a seasonal series, monthly or quarterly, can be decomposed into a trend component (change in level over time), a seasonal component and an irregular component)
4) examine the irregular component (what is not part of the trend or the seasonality)

9
Q

What are trends people see in time series

A

Additive trend (level increases over time)
Additive seasonal (goes up and down with the seasons - almost sinusoidal)
Multiplicative trend (the seasonal swings get larger/wider as the level rises)

10
Q

How are things smoothed in time series

A

Moving average - replace each point with the average of itself and its neighbours - k = how many points are in the window
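
A short Python sketch of a centred moving average, assuming an odd window k so that each point has the same number of neighbours on both sides. The function name and data are illustrative.

```python
def moving_average(series, k=3):
    """Centred moving average with an odd window of k points."""
    if k < 1 or k % 2 == 0:
        raise ValueError("k must be a positive odd number")
    half = k // 2
    # the first and last `half` points have no full window and are dropped
    return [
        sum(series[i - half:i + half + 1]) / k
        for i in range(half, len(series) - half)
    ]

noisy = [3, 5, 4, 6, 8, 7, 9]
smoothed = moving_average(noisy, k=3)
print(smoothed)
```

A larger k gives a smoother curve but loses more points at the ends and blurs short-lived features.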

11
Q

Exponential forecasting models

A

Single exponential - for a series with a constant level and an irregular component (no trend or seasonality)
Double exponential (Holt) - for a series with a level and a trend
Triple exponential (Holt-Winters) - for a series with level, trend and seasonality
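
For the single (level only) case, the update rule can be sketched as below. The smoothing constant alpha and the choice to start from the first observation are assumptions for illustration; Holt and Holt-Winters add analogous updates for trend and season.

```python
def simple_exponential_smoothing(series, alpha=0.5):
    """Single exponential smoothing: level = alpha*y + (1 - alpha)*previous level."""
    level = series[0]                 # initialise with the first observation
    fitted = [level]
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level
        fitted.append(level)
    return fitted                     # the last level is the one-step-ahead forecast

levels = simple_exponential_smoothing([10, 12, 11, 13], alpha=0.5)
print(levels)
```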

12
Q

Types of Error

A

Type I - alpha - rejection of a true null hypothesis (false positive)
Type II - beta - non-rejection of a false null hypothesis (false negative)

13
Q

What is LOD

A

Limit of detection - the lowest amount of analyte in a sample that can be detected (distinguished from the blank) WITH a specified confidence level

14
Q

is LOD agreed upon?

A

No - there is no single agreed definition - it is typically based on a signal-to-noise (S/N) relationship

15
Q

Draw curves for signal to noise and blank and what shades represent what

A

Used for LOD determination - draw the blank distribution and the distribution of a sample at the LOD; the sample mean should sit about 3 * the stdev of the blank above the blank mean
The lowest signal we analyze has to sit above the blank distribution, but how much overlap between the two distributions is acceptable?
Ideally we want just 5% overlap on each side, which needs a separation of about 3.3 stdev (2 * 1.645)
The part of the sample distribution that falls in the blank region is the BETA rate - false negative (type II)
The part of the blank distribution that falls in the sample region is ALPHA - false positive (type I)
So a factor of about 3 (3.3 for 5% each) * sd of the blank is used - 5% for type I and 5% for type II error

16
Q

LOQ vs LOD

A

LOD uses about 3 * (or 3.3 *) the stdev; LOQ (limit of quantitation) uses 10 *

17
Q

Calculate LOD or LOQ from signal to noise

A

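An S/N estimate needs to be verified with another method
From the blank: signal at the limit = mean of blank + 3 (LOD) or 10 (LOQ) * stdev of blank
From a linear calibration curve: LOD = 3.3 * stdev / b and LOQ = 10 * stdev / b
where b is the slope of the linear regression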
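
The two calculations can be sketched as follows. Function names and example numbers are invented for illustration; which stdev to use (blank, residual, intercept) depends on the guideline being followed.

```python
from statistics import mean, stdev

def lod_loq_from_blank(blank_signals):
    """Signal-domain limits: blank mean + 3 (LOD) or 10 (LOQ) * blank stdev."""
    m, s = mean(blank_signals), stdev(blank_signals)
    return m + 3 * s, m + 10 * s

def lod_loq_from_curve(sd, slope):
    """Concentration-domain limits from a linear calibration curve."""
    return 3.3 * sd / slope, 10 * sd / slope

lod_signal, loq_signal = lod_loq_from_blank([1.0, 1.2, 0.8, 1.0])
lod_conc, loq_conc = lod_loq_from_curve(sd=0.02, slope=0.5)
print(lod_signal, loq_signal, lod_conc, loq_conc)
```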

18
Q

What are selectivity and specificity

A

Selectivity - ability of the method to determine the analyte in a complex matrix without interference
Specificity - confirmed ability of the method to assess the analytes in the presence of any other components that might be present (including the matrix)
so specificity is selectivity +

19
Q

Accuracy vs precision

A

Accuracy - trueness or bias - a measure of systematic error compared to a reference
Precision - closeness of repeated individual measurements under specified conditions

20
Q

How to run accuracy and precision tests

A

Accuracy - measure against a standard (reference) material - want accuracy (bias) within and between runs - use a low and a high QC
Precision - use % CV

21
Q

ROBUST what is it

A

Capacity of the method to be unaffected by natural variation in its parameters - test over a range of parameter values

22
Q

UNCERTAINTY

A

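Significant sources must be identified and tabulated
2 types: A and B
Type A is evaluated statistically from repeated measurements (often loosely described as random)
Type B is evaluated by other means, e.g. certificates, specifications, prior data (often loosely described as systematic)
Example sources - user skill, sampling, environment, instrument, etc.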

23
Q

Stability

A

Use QC samples - store at room temperature, 4 °C, etc., then test against freshly prepared samples

24
Q

HOW DO WE HANDLE NON DETECTS

A

Exclude or delete from the data set (worst option)
Substitute a value (0, 1/2 LOD, LOD, etc.)
Treat as censored data - left-censored means the unknown value is too low to measure, right-censored means it is too high

25
Q

What is survival analysis and NADA

A
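Survival analysis - how long will it be before an event occurs (e.g. medical) - built to handle censored data
NADA is Nondetects And Data Analysis - applies survival-analysis methods to non-detects (censored data)
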
26
Q

What is fit for purpose

A

Ensures the analytical method fulfils certain criteria of reliability and can perform as intended - gives us confidence and shows it is reproducible and repeatable
(so as a list:
reproducible, broad coverage, sensitivity and selectivity, linearity, precision, stable)

27
Q

What are some key points of precision testing

A

Sample should be stable and homogeneous (representative of what's being tested)
Should be applied to the whole procedure - sample preparation, method and analysis
2 factors - the precision estimate and the design of the precision experiment

28
Q

What are the types of precision estimates

A

1) Repeatability - within batch or intra-assay - one analyst on the same equipment over a short time period
2) Intermediate precision - made in a single lab but under variable conditions (different days, analysts, equipment, etc.) - within-lab reproducibility
3) Reproducibility - different labs, different equipment (interlab)

29
Q

Types of precision experiments

A

Simple replication - repeated measurements on a suitable sample - want 6-15 reps
Nested design - used when simple replication can't generate enough reps (not feasible) - each batch is run under different conditions, so it can cover inter-lab, intra-lab, etc.

30
Q

PRECISION limits - what are they and how to calculate

A

Repeatability limit: r = t * sqrt(2) * sr
This is the 95% confidence interval for the difference between two results obtained under repeatability conditions

Reproducibility limit: R = t * sqrt(2) * sR

t is the two-tailed Student's t value for the chosen confidence level and degrees of freedom.
In practice they are calculated by multiplying the repeatability standard deviation (sr) or the reproducibility standard deviation (sR), respectively, by 2.8. The factor 2.8 is derived from 1.96 (95% of the population is within 1.96 standard deviations of the mean) times the square root of 2.
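
A small arithmetic sketch of the limit, assuming the large-sample value t = 1.96. The function name and the sr value are illustrative.

```python
def precision_limit(s, t=1.96):
    """Limit for the difference between two results: t * sqrt(2) * s."""
    return t * 2 ** 0.5 * s

factor = precision_limit(1.0)      # the "2.8" multiplier
r = precision_limit(0.15)          # repeatability limit for sr = 0.15
print(round(factor, 2), round(r, 3))
```

Two results obtained under repeatability conditions that differ by more than r would be flagged as suspect.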

31
Q

How to statistically evaluate precision estimates

A

F test

32
Q

What is bias and how calculated and how evaluated

A

Difference from the true value - calculated as mean minus accepted value - can be expressed as %
Evaluated with a t test

33
Q

Ruggedness study - how evaluated/set up

A

Plackett-Burman design -
7 parameters to study - you pick them (e.g. extraction time) and each has two levels (e.g. 30 min vs 10 min extraction)
To investigate an effect - take the difference between the average of the results with the parameter at its normal level and the average of the results at its alternate level

34
Q

Measurement of uncertainty - what is and how tested

A

Dispersion of the values that could be attributed to the measurement - e.g. expressed as a stdev
Can be propagated (individual uncertainties are combined)

35
Q

ROC curves

A

A ROC curve is a plot of true positive rate (TPR) against false positive rate (FPR)
We often take the AUC - area under the curve
AUC ranges from 0 to 1 - a model with 100% wrong predictions has an AUC of 0 and a model with 100% right predictions has an AUC of 1

Receiver operating characteristic - evaluates the prediction accuracy of a classifier model
- shows the tradeoff between sensitivity and specificity - the same idea as the LOD vs blank curves
Formulas:
TPR = TP / all actual positives = TP / (TP + FN)
FPR = FP / all actual negatives = FP / (FP + TN)
An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. It plots two parameters: true positive rate and false positive rate.
https://en.wikipedia.org/wiki/Receiver_operating_characteristic
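
A plain-Python sketch of how the curve and its AUC can be built by sweeping the threshold over the scores. The labels and scores are invented; libraries such as scikit-learn provide tested versions of the same idea.

```python
def roc_curve(labels, scores):
    """Return (FPR, TPR) points obtained by sweeping the threshold over all scores."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the curve by the trapezoid rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

labels = [1, 1, 0, 1, 0, 0]               # 1 = actual positive
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]   # classifier output
area = auc(roc_curve(labels, scores))
print(area)
```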

36
Q

What is metrology

A

The science of measurement - a formal system to enable informed decisions through data assessment - gives levels of confidence in what we're doing - a reliable network of measurements that lets us confidently make assessments about concentration

37
Q

3 fundamentals of metrology

A

1) TRACEABILITY - to the SI - links results to units - go from standards down to where you are (higher-order standards to CRMs)
2) UNCERTAINTY (of measurement) - use the RSD / distribution of results to make claims, not individual points
3) VALIDATION (of methods, etc.)

38
Q

What is QA and examples

A

The planned and systematic activities implemented in a quality system so that quality requirements for a service will be fulfilled; quality assurance occurs before the data are collected

eg suitable lab environment, educated staff, training, documented and validated methods, preventative actions, etc

39
Q

What is QC

A

Quality Control: the observation techniques and activities used to evaluate and report quality,
quality control occurs during and after data is collected
examples: blanks, spiked samples, controls, reference materials etc

40
Q

Example of system suitability testing

A

Run a small number of standards (not in biological matrix) - acquire data to check accuracy and precision - m/z accuracy, RT and peak shape are assessed

41
Q

Dispersion ratio - what is it

A

Compares the standard deviation of the pooled QC samples with the stdev of the test samples - can use the D-ratio (MAD of QC / MAD of sample)
D-ratio of 0% means the technical variance is 0 - perfect measurement, all changes are due to biological causes
D-ratio of 100% means all variance is due to noise - no biological information

42
Q

Whats pooled QC

A

A single QC sample, generated by pooling aliquots of the study samples, that can be distributed evenly throughout the analytical batch

43
Q

Batch design and pooled QC - what can you do

A

Run the pooled QC at regular intervals throughout the batch to see time-based variance (drift)

44
Q

Main uses of reference materials

A

Examine the skills of the analyst
Controls
Precision and accuracy
Accreditation
Measurement uncertainty

45
Q

How to score proficiency testing results

A

2 steps - specifying the assigned value
and setting the standard deviation

ASSIGNED VALUE - can be a known value (CRM), a reference value (determined by one lab), or a consensus value determined from the results of the other labs

STDEV - set by the scheme organizer - by prescription, from the results of a reproducibility experiment, or from a general model (e.g. the Horwitz function)

46
Q

What is Z and Q in proficiency testing

A

Z score is what we think - (value minus assigned value) divided by stdev
|z| less than 2 is pretty good - between 2 and 3 is questionable - above 3 is unsatisfactory
Q score - alternative to z - takes no account of the stdev - the distribution of Q is centered on 0 - relies on an EXTERNAL PRESCRIPTION of acceptability
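
A sketch of z scoring in Python. The cut-offs at 2 and 3 follow the common convention described above; the function names and numbers are illustrative.

```python
def z_score(result, assigned_value, sd_for_pt):
    """z = (lab result - assigned value) / standard deviation set by the scheme."""
    return (result - assigned_value) / sd_for_pt

def classify(z):
    """|z| <= 2 satisfactory, 2 < |z| < 3 questionable, otherwise unsatisfactory."""
    if abs(z) <= 2:
        return "satisfactory"
    if abs(z) < 3:
        return "questionable"
    return "unsatisfactory"

z = z_score(result=10.5, assigned_value=10.0, sd_for_pt=0.2)
print(z, classify(z))
```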

47
Q

What is a YOUDEN plot

A

Scatter plot - plots results from multiple labs on one graph to show
whether labs are equivalent, and to reveal outliers, inconsistencies, etc.
x and y each represent one of the reported values (e.g. concentration of analytes A and B)
Draw lines parallel to the x and y axes - where the points fall relative to them indicates various things about the results - e.g. random error vs systematic error

48
Q

SHEWHART plot

A

Sequential plot of observations from a QC material analyzed in successive runs - the QC mean for each run (y axis) is plotted against run / measurement number (x axis)

49
Q

General principles of experimental design

A

Research method where we manipulate independent variables and look at the effect on the dependent variable

Things to do:
Arrange experiments for cancellation or comparison (to control bias)

Plan for replication or independent uncertainty estimates (precision)

Need a statistical analysis or approach

50
Q

Experimental designs list 4

A

Simple replication - series of observations on a single test material
Linear calibration design - observations at a range of levels (of some quantitative factor)
Nested - the levels of one factor are unique to each level of another factor
Factorial - the levels of the factors are crossed, not wholly distinct - e.g. one group can get one factor, another group the other factor, and another group both factors at once

51
Q

Why do we randomize in experimental design

A

To minimize nuisance effects - unwanted effects that influence the results - e.g. so results are not affected by run order / sampling order

53
Q

What is blocking

A

Have all replicates/groups of test materials subject to the same nuisance effects (run at the same time) - e.g. we have sets a, b and c - we can run them separately, or run them all in the same trial (block) so they are subject to the same effects

54
Q

Sampling theory (define randomization, representation and composite)

A

Randomization - all members of the population have an equal chance of selection
Representation - sample enough of the population to draw inferences about the total population
Composite - reduce effort by combining individual samples into one

55
Q

List different sampling strats

A

Simple random - everything has an equal chance (easy, but not great for long continuous sequences and doesn't reflect subgroups in the population)
Stratified - divide the population into segments and randomly sample each segment - good because it reduces variance further - can capture unique pockets
Systematic - first select a random starting point, then sample at a fixed interval - simple and easy - regularly covers everything - cannot deal with periodic variation that matches the interval - will miss it

56
Q

4 quantities of power analysis

A

Sample size
Significance level (alpha - probability of making a type I error)
Power - one minus the probability of making a type II error (probability of finding an effect that is there)
Effect size - magnitude of the effect under the alternative (research) hypothesis

57
Q

How do you determine how many participants are needed for a study

A

Do a power analysis - e.g. power.t.test in R or a power package - supply the significance level, desired power, effect size, etc. and solve for the sample size
Can be done for various tests (t test, ANOVA, linear regression, chi-squared, etc.) - need the expected means, common error variance, etc.
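
To show how the four quantities fit together, here is a Python sketch that uses the normal approximation for a two-sided, two-sample comparison of means. It is not a port of R's power.t.test, which works with the t distribution and so returns slightly larger sample sizes; the function name is made up for the example.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group (effect_size is Cohen's d)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # two-sided significance level
    z_beta = z.inv_cdf(power)            # power = 1 - beta
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

n_medium = n_per_group(0.5)
n_large = n_per_group(1.0)
print(n_medium, n_large)
```

Fixing any three of sample size, alpha, power and effect size determines the fourth, which is why smaller effects need many more participants.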

58
Q

What is proportionality constant k

A

The factor relating instrument signal to concentration: signal = k * concentration

59
Q

What is single point cal

A

Calibration using just this proportionality factor, obtained from one standard (S = k*C) - by default the line goes through 0

60
Q

Sensitivity from calibration curve

A

Sensitivity is the slope b - the capability of responding reliably to changes in analyte concentration

61
Q

What is r in cal curve -

A

It's the Pearson correlation coefficient - describes the relationship between response and concentration - ranges from -1 to 1
R^2 measures how closely the data fit the linear model - 99% means 99% of the variability in our response is accounted for by changes in concentration

62
Q

How to evaluate matrix effect

A

Take sample matrix - extract it and spike the analyte into the extract - compare with a normal standard solution: (response in matrix / response in standard) - 1 - a negative value means suppression
OR do a spiked recovery - compare unspiked matrix with spiked matrix (same matrix):
% recovery = (spiked sample - unspiked sample) / C added x 100
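
Both calculations in a short Python sketch. Function names and numbers are illustrative, and the matrix effect is expressed here as a percentage.

```python
def matrix_effect_percent(response_in_matrix, response_in_standard):
    """Negative = suppression, positive = enhancement."""
    return (response_in_matrix / response_in_standard - 1) * 100

def spike_recovery_percent(spiked, unspiked, conc_added):
    """Percent of the added concentration that is recovered."""
    return (spiked - unspiked) / conc_added * 100

me = matrix_effect_percent(80.0, 100.0)
rec = spike_recovery_percent(spiked=14.5, unspiked=5.0, conc_added=10.0)
print(me, rec)
```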

63
Q

Types of blanks

A

Method blank - unspiked sample taken through the whole method
Reagent blank - just solvent
Field blank - unspiked sample that goes on the sampling trip (trip blank - the same but unopened)

64
Q

Weighted regression

A

Used when the error of a measurement is proportional to concentration - larger concentrations carry more error, so we give more weight to the points where the error bars are smallest (e.g. weights of 1/x, 1/x^2 or 1/s^2)
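
A sketch of weighted least squares for a straight line in plain Python. The 1/x^2 weighting and the calibration data are assumptions for illustration; the right weights depend on how the error actually scales.

```python
def weighted_linear_fit(x, y, w):
    """Weighted least squares for y = a + b*x; returns (a, b)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw      # weighted mean of x
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw      # weighted mean of y
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    b = sxy / sxx
    return ybar - b * xbar, b

conc = [1, 2, 5, 10, 50, 100]
signal = [2.1, 3.9, 10.2, 19.5, 103.0, 196.0]
weights = [1 / c ** 2 for c in conc]       # assumed 1/x^2 weighting
a, b = weighted_linear_fit(conc, signal, weights)
print(a, b)
```

With these weights the low standards, which have the smallest absolute error, dominate the fit instead of the highest standard.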

65
Q

Methods of standard addition

A

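Make the calibration curve in the sample itself - spike increasing known amounts of analyte into aliquots of the sample and extrapolate the line back to the x-intercept to get the original concentration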

66
Q

ISTD

A

Internal standard - types: structural analogue, stable isotope labeled

67
Q

Isotope dilution

A

Add a known amount of isotopically labeled analyte as the internal standard - same ISTD in all samples - and quantify from the ratio of analyte response to ISTD response

68
Q

Multi LDR

A

If the linear dynamic range (LDR) is not large enough - make 2 curves (e.g. a low range and a high range)

69
Q

OMICS quantitation - how

A

No calibration curve, so use a single-point ratio to the internal standard: response IS / conc IS = response target / conc target

70
Q

What is a neural net

A

Series of algorithms designed to recognize underlying relationships in a large data set (input layer - hidden layers - output layer)

71
Q

What is machine learning

A

Computer program that improves its performance in a task through experience

72
Q

4 ingredients of machine learning

A

1) data
2) a model that specifies how input data relate to output
3) a loss function - shows how well model performs
4) optimization algorithm - so it can improve the model and minimize the loss function

73
Q

What is overfitting (also underfitting)

A

Your model matches the training set too closely and isn't generalizable (underfitting is when the model is too loose - a wrong assumption was made and it misses the real pattern)

74
Q

What is supervised machine learning and what are common types

A

You tell it to develop a model based on inputs AND known outputs
eg classification or regression

75
Q

What is a decision tree

A

Repeated binary splits on predictor variables create a tree that classifies observations into one of two groups at each split
At each split we choose the predictor that best separates the two groups - want HOMOGENEITY within each group maximized (i.e. the groupings make sense)

76
Q

What is a conditional inference tree

A

A decision tree whose splits are based on significance tests

77
Q

What is Random Forest

A

Ensemble learning approach - grows many decision trees on resampled data and combines their votes to improve classification rates

78
Q

What are support vector machines

A

SVMs are SUPERVISED machine learning - for classification and regression - they seek the optimal hyperplane for separating two classes in multidimensional space

79
Q

What is a confusion matrix

A

a matrix of basically :
True Neg False Pos
False Neg True Pos

80
Q

Stats from a confusion matrix

A

Sensitivity = TP / total actual positives (TP + FN)
Specificity = TN / total actual negatives (TN + FP)
False positive rate = FP / (TN + FP)
Precision = TP / (TP + FP)
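
The four statistics computed from the cell counts, as a small Python sketch with made-up counts.

```python
def confusion_stats(tp, fp, tn, fn):
    """Summary statistics from the four cells of a 2 x 2 confusion matrix."""
    return {
        "sensitivity": tp / (tp + fn),           # true positive rate
        "specificity": tn / (tn + fp),           # true negative rate
        "false_positive_rate": fp / (tn + fp),   # 1 - specificity
        "precision": tp / (tp + fp),
    }

stats = confusion_stats(tp=40, fp=10, tn=45, fn=5)
print(stats)
```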

81
Q

What is PLS-DA

A

Supervised pattern recognition - partial least squares discriminant analysis - asks whether groups are different and which features explain the difference

82
Q

How do distance based clustering methods work

A

1 - calculate a centroid for each group
2 - calculate the distance from each point to the centroid of each group
3 - assign each sample to the group with the closest centroid

83
Q

Clustering - is it supervised or un - and describe it

A

Unsupervised - data reduction technique - exactly what it sounds like - cluster your observations into groups

84
Q

2 types of clusterings - bottom up and top down explain

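A

Bottom up (agglomerative) - every observation starts as its own cluster and the closest clusters are merged step by step
Top down (divisive) - all observations start in one cluster, which is split step by step
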
85
Q

How to normalize for clustering

A

scale
standardize to a mean of 0 and sd of 1
divide by max

86
Q

Common steps for clustering

A

normalize
screen for outliers
calculate distances

87
Q

What is a dendrogram

A

Tree diagram of a hierarchical clustering - kind of like a clade

88
Q

Pros and cons of dendrogram

A

Pro - finds compact clusters. Cons - sensitive to outliers - and you still need an interpretation that makes the clustering make sense

89
Q

What are the different linkage types

A

single, complete, average , centroid

90
Q

How to interpret dendrograms

A

Height indicates the order in which clusters were joined - read from the bottom up - height reflects the distance between the clusters being joined

91
Q

What is k means clustering

A

1) select k centroids
2) assign each data point to its closest centroid
3) recalculate the centroids as the average of all data points in each cluster
4) reassign each data point to its closest centroid
Continue steps 3 and 4 until observations are no longer reassigned
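
The loop can be sketched in plain Python for 2-D points. Picking the starting centroids at random from the data, the function name and the data themselves are assumptions for illustration.

```python
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Plain k-means on a list of (x, y) tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)            # pick k starting centroids
    labels = None
    for _ in range(max_iter):
        # assign every point to its closest centroid (squared distance)
        new_labels = [
            min(range(k), key=lambda j: (p[0] - centroids[j][0]) ** 2
                                        + (p[1] - centroids[j][1]) ** 2)
            for p in points
        ]
        if new_labels == labels:                 # nothing reassigned: stop
            break
        labels = new_labels
        # recompute each centroid as the mean of its members
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:                          # keep old centroid if cluster is empty
                centroids[j] = (sum(m[0] for m in members) / len(members),
                                sum(m[1] for m in members) / len(members))
    return centroids, labels

data = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (8.0, 8.1), (7.9, 8.3), (8.2, 7.9)]
centroids, labels = kmeans(data, k=2)
print(labels)
```

Because the result depends on the starting centroids, k-means is usually run from several random starts and the best solution kept.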

92
Q

Partitioning around medoids

A

K means is based on means, so it is susceptible to outliers
PAM is like k means, but each cluster is represented by its most central actual observation (the medoid) rather than the mean

93
Q

List the variable types

A

Continuous - numeric, can take any value within a range
Ordinal - categorical but can be ranked, e.g. grades
Nominal - categorical and can't be ranked
Counts - non-negative integers (come from counting, not ranking)

94
Q

How do you test for a significant relationship between two nominal (categorical) variables

A

Chi-squared test - interpret the p value (the probability of obtaining a result at least as extreme as the one observed if the variables were truly independent) - p less than 0.05 means such a result would be unlikely under independence, so we conclude the variables are related

95
Q

Chi square limitations

A

Should be used when there are more than 50 observations and no individual expected frequency is below 5 - so BIG data sets

96
Q

What is Fisher's test for independence for

A

Testing independence of nominal (categorical) variables when sample sizes are small

97
Q

What is Cochran-Mantel-Haenszel

A

Tests whether 2 nominal variables are conditionally independent in each stratum of a 3rd variable

98
Q

What is measure of association

A

For nominal variables - if you have a significant result from an independence test (e.g. chi-squared), you can test the strength of that relationship

99
Q

What is a mosaic plot

A

Visualizes data sets with 2 or more categorical variables - colors, shading, tile size, etc. are all used to show the frequencies

100
Q

What are generalized linear models vs logistic regression

A

Generalized linear models - linear models extended to outcome variables whose distribution isn't normal
Often the outcome variable is categorical, like binary or different groupings or categories (group A, group B)
Or the OUTCOME variable is a count that takes a limited number of values, such as traffic accidents - not often normally distributed

LOGISTIC REGRESSION - is used when the response is BINARY

101
Q

Overdispersion what is it

A

When the observed variance is larger than the model expects, leading to inaccurate significance testing
Can test in R with the ratio of residual deviance to residual degrees of freedom - if the value is close to 1 there is no overdispersion

102
Q

What is a poisson regression

A

Used where the response variable is a count or number of events - y is the response and x is a predictor variable

Interpreting the results: coefficients are on the log scale, so e.g. if x gets an estimate of 0.022, a 1-unit increase in x is associated with a 0.022 increase in the log of the mean of y

103
Q

What is PCA and what is it used for

A

Unsupervised multivariate method (multivariate = simultaneous observation and analysis of more than one outcome) - for high-dimensional data - used to identify patterns
Every feature is used to calculate the principal components (so it is a dimension-reducing approach to summarize large data)

104
Q

What type of data can be analyzed with PCA

A

Multidimensional data sets (usually at least 2 groups, 3 reps each) - biological reps, technical reps, profile analysis, etc.

105
Q

How is PCA done -on a base level

A

Looks at the variability of each feature (variable) across samples.
Plot all observations and draw the line of best fit through them - minimizing the distances to the line is the same as maximizing the variance along it - that is PC1. PC2 is perpendicular to PC1 but calculated the same way (best fit to what is left) - keep going until all components are found

106
Q

How do the PC’s in PCA compare

A

PC 1 is the most important and captures the most info

107
Q

What is an issue with PCA

A

You give up some accuracy - since you are reducing the data down and using less of it (parsimony - want to explain the data with the fewest qualifiers)

108
Q

How are PC’s calculated

A

Based on the variance of each feature (which places it on the plot) and its magnitude, i.e. how much it influences the PC - e.g. if looking at genes, those with the greatest variability have the greatest impact on the PCs

109
Q

What do you need to consider before doing PCA

A

Scaling (to make variables and the magnitude of their influence comparable) - e.g. log transformation, mean centering, etc.

Overall - TRANSFORM - CENTER - SCALE
Transform - typically log
Center - subtract the mean from each value
Scale - divide by the stdev

110
Q

What is a SCREE plot

A

Line graph showing the proportion of variability each PC accounts for
Generally has an elbow shape, as the first one or two PCs explain most of the variance, with the first explaining the most (rule of thumb: the first 3 should reach 80% or else it's not a great PCA - maybe do something else)

111
Q

What is a PCA scores plot

A

Scores calculated for each PC are plotted against each other (generally just PC1 and PC2) - kind of like a corrgram - e.g. plot PC1 on the x axis and PC2 on the y axis to see how the observations group

112
Q

What is a PCA LOADING plot

A

Shows all features (variables) and which of them most greatly influence the PC scores - the farther from the origin, the greater the influence

113
Q

What is a PCA biplot

A

Combination of scores plot and loading plot (essentially superimposed upon each other)

114
Q

What are some outlier tests and what do they do

A

Dixon (Q) - tests a single outlier - for small data sets
Grubbs - tests a single outlier, assuming normally distributed data
Iglewicz-Hoaglin - robust test for multiple outliers - 2-sided - modified z score

115
Q

What makes a non-parametric method non-parametric vs robust

A

Non-parametric - makes no assumption about the distribution - uses ranks and the median
Robust - based on the idea that the sampled population is in fact NORMAL but has significant outliers

116
Q

When to use non parametric

A

small data sets
different (non-normal) distributions
categorical

117
Q

What is the wilcoxon signed rank

A

Non-parametric equivalent of the paired (two-sample) t test

118
Q

What is the mann whitney U

A

Non-parametric equivalent of the two-sample independent t test

119
Q

What is kruskal wallis

A

non parametric ANOVA (one way)

120
Q

What is SPearman rank correlation

A

non parametric pearson correlation

121
Q

Local Regression LOWESS and LOESS

A

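Non-parametric regression analysis - fits simple regressions to local subsets of neighbouring points to draw a smooth curve through the data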

122
Q

What is regression used for

A

Describe the relationship between variables - gives an equation

123
Q

What is a residual in regression

A

signed difference between observed and fitted value

124
Q

What is correlation coefficient

A

degree of linear assoc between x and y variables

125
Q

What is second order polynomial regression

A

y = cx^2 + bx + a - 3 parameters (a, b and c), so DOF = n - 3

126
Q

Scatter Plot matrix

A

Plots the pairwise relationships between a whole bunch of variables - one scatter plot per pair

127
Q

Bonferroni adjust p value what does it mean

A

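Adjusts the p value for the number of tests being done - multiply each p value by the number of tests (or compare it with alpha / number of tests)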

128
Q

What is hat statistic

A

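Measures leverage - the average hat value is p/n (p = number of model parameters, n = sample size) - values well above that flag high-leverage points or outliers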

129
Q

Covariance what is

A

tells you how 2 data sets change together in tandem

130
Q

Correlation

A

Tells you how strongly a change in one variable is associated with a change in another (from -1 to 1)

131
Q

Covariance vs correlation

A

Covariance is affected by a change in scale
Covariance keeps units
Both are 0 when the variables are independent
Correlation is unitless and describes the degree to which 2 variables move together
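
A small Python check of the scale point, using invented data: converting one variable to different units multiplies the covariance but leaves the correlation unchanged. The helper functions are written out by hand for the illustration.

```python
from statistics import mean

def covariance(x, y):
    """Sample covariance (n - 1 denominator)."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def correlation(x, y):
    """Pearson correlation = covariance scaled by both standard deviations."""
    return covariance(x, y) / (covariance(x, x) * covariance(y, y)) ** 0.5

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
y_mg = [v * 1000 for v in y]     # same data expressed in different units

print(covariance(x, y), covariance(x, y_mg))
print(correlation(x, y), correlation(x, y_mg))
```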

132
Q

Assumptions for correlation

A

Normal dist

133
Q

What is a corrgram

A

Shows the correlations among a bunch of variables, each against the others
See also scatter plot matrix

134
Q

correlation does it equal causation?

A

no - causation means it causes it directly

135
Q

What is ANOVA used for

A

Analysis of variance - compares the variance associated with 2 or more groups (e.g. analysts)
Tests whether the population means of the groups are all equal or not
Data are grouped by a factor, like dose

136
Q

How do we calculate variance in anova (or what types are there)

A

Within-group variance (e.g. within each analyst)
and between-group variance

137
Q

What are assumptions for ANOVA

A

independence of observations
normality of residuals
homoscedasticity

138
Q

ANova null and alt hypo

A

Null - all group means are the same; alternative - at least one is different

139
Q

What test do we get from ANOVA

A

We get F - compare F calc with F crit - if F calc is less than F crit there is no significant difference

140
Q

What is a post hoc test

A

A test run after a significant ANOVA to find which groups differ - e.g. Tukey's HSD tells you what's different

141
Q

What is a confounding factor

A

A variable that could also explain group differences in the dependent variable - we are not interested in it - it's a nuisance variable - want to remove it

142
Q

What type of anova deals with confounding factors?

A

ANCOVA - add your nuisance as a covariate

143
Q

What is ANOVA with MULTIPLE dependant variables

A

MANOVA - multivariate analysis of variance

144
Q

what is MANCOVA

A

Multivariate analysis of covariance - MANOVA with a covariate

145
Q

What are ANCOVA assumptions

A

Linearity between the covariate and the outcome variable at each level of the independent variable (the covariate really does affect the dependent variable, linearly, within every group)

Homogeneity of regression slopes - slopes of covariate against outcome variable should be the same across groups (so no interaction between the independent variable and the covariate - same effect across all levels of the independent variable)

Outcome variable normally distributed

Homoscedasticity

146
Q

What is 2 factor ANOVA

A

ANOVA - but with subjects assigned to groups that are a cross-classification of the levels of two independent variables
e.g. with TOEFL scores as the outcome
can initially have one independent variable - educational level (3 groups)
and then add another independent variable - learning style (4 groups)

147
Q

2 way or factor ANOVA - assumptions

A

1) dependent variable continuous
2) both independent variables should have >= 2 levels
3) independence of observations
4) dependent variable normally distributed for each combination of independent variables
5) homoscedasticity
6) balanced design

148
Q

What are 2 way ANOVA hypotheses

A

No difference in the means of factor A
No difference in the means of factor B
No interaction between A and B

149
Q

What is 2 way factorial anova?

A

Groups are again formed from 2 independent variables, cross-classified (e.g. medicine type and dose - 2 medicine types with 3 dosages for each medicine type)

150
Q

Interaction plot

A

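Plot used with 2 way ANOVA - group means against the levels of one factor, with a separate line for each level of the other factor - non-parallel lines suggest an interaction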

151
Q

Repeated measures vs replication

A

Repeated measures are different measurements on the same subject; replication is the same measurement on separate, independent replicates

152
Q

what is MANOVA

A

Multivariate analysis of variance - means that we have 2 or more dependent variables (can be combined with other designs - MANCOVA, 2 way MANOVA, etc.)

153
Q

MANOVA assumptions

A

independent observations
no outliers for outcome
multivariate normality
no multicollinearity (dependent variables can't be too highly correlated)
linearity between all outcome variables for each group
homogeneity of variances

154
Q

What is a Mahalanobis plot

A

QQ-type plot of Mahalanobis distances - if the data follow a multivariate normal distribution, the points should fall on the line

155
Q

How do you post hoc Manova

A

univariate one way anova for each outcome
or TUKEY

156
Q

What are the 5 descriptive stats:

A

Frequency / counts (mode)
central tendency/location (mean)
dispersion (stdev)
position (quartiles - medians etc)
shape of observations (skew and kurtosis)

157
Q

Steps for sig testing

A

state null hypo
state alternate hypo
check dist
select test
choose sig level
calc stat
obtain crit value and compare

158
Q

What is cohens D

A

A measure of effect size - the difference between two means divided by the pooled standard deviation

159
Q

What is distribution

A

Function that shows the possible values for a variable and how often each tends to occur

160
Q

normal dist stdev

A

68% of values fall within 1 stdev of the mean, 95% within 2, and 99.7% within 3

161
Q

what is z scores

A

(value - mean) divided by stdev (basically normalizes the value to the distribution)

162
Q

How to get probability of value occurring

A

Get the z value and look it up in a z table - the table gives the area (probability) to the left of z
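
The table lookup can be reproduced with the Python standard library. This sketch assumes normally distributed data; the function name and example values are illustrative.

```python
from statistics import NormalDist

def prob_below(value, mean, sd):
    """Area to the left of `value` under a normal curve, via its z score."""
    z = (value - mean) / sd
    return NormalDist().cdf(z)       # same number a z table would give

p_left = prob_below(1.96, 0, 1)
p_right = 1 - p_left                 # area to the right
print(round(p_left, 4), round(p_right, 4))
```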

163
Q

what is central limit theorem

A

The distribution of sample means gets closer and closer to normal as the sample size gets bigger

164
Q

What is skew

A

Asymmetry of the distribution - tailing (long right tail) and fronting (long left tail)

165
Q

Kurtosis

A

Peakedness of the distribution - can be narrow or flat

166
Q

How to test for normality

A

Graphically - histogram, QQ plot
Statistical test - Anderson-Darling, etc.