V7 Flashcards
statistical tests
the principles of microarrays (general info behind them)
spotted microarray: can measure the expression levels of more than 20,000 genes in a single experiment
hybridisation by forming H-bonds between complementary nucleotide base pairs -> DNA spots attached to a solid surface
- samples are labeled with fluorescent dyes
how to set up a microarray plate
columns are the different samples; every column is given the same sample all the way down
- every row is tested for a different gene
- gene expression levels can then be read off the plate
common microarray workflow
- create oligo-arrays
- acquire samples, extract RNA
- RNA to cDNA reverse transcription
- PCR (optional), Cy3 and Cy5 labelling
- hybridisation and scanning
- data storage
- extract expression levels
- data normalisation
- gene expression analysis
- data interpretation
why normalise?
- remove technical variation from noisy data
- assumption: global changes across samples are due to unwanted technical variability
- removing these differences has the potential to also remove interesting biologically driven variation
different options for normalisation/standardization
mean centering: new = old - mean(of group)
standard score (z-score)/Student's t-statistic: new = (old - mean) / SD
quantile normalisation: rank based method
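a minimal R sketch of mean centering and z-scoring (toy values assumed for illustration):
x <- c(5.1, 6.3, 4.8, 7.0, 5.5)     # expression values of one gene across samples
centered <- x - mean(x)             # mean centering: new = old - mean(of group)
zscore <- (x - mean(x)) / sd(x)     # standard score: new = (old - mean) / SD
# scale(x, center = TRUE, scale = TRUE) does the same in one call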
quantile normalization
- normalises between the samples -> makes them very homogeneous, even with samples from different tissues
- > might remove differences between the samples that naturally occur
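a minimal R sketch of the rank-based idea (toy matrix assumed; in practice Bioconductor's preprocessCore::normalize.quantiles does this):
m <- matrix(rexp(12), nrow = 4)                     # toy 4-gene x 3-sample matrix
ranks <- apply(m, 2, rank, ties.method = "first")   # rank values within each sample
sorted <- apply(m, 2, sort)                         # sort each sample
means <- rowMeans(sorted)                           # reference distribution
normalized <- apply(ranks, 2, function(r) means[r]) # assign row means back by rank
# afterwards every sample (column) has exactly the same distribution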
ratios
- simplest way to look at differences
- ratio = mean(tissue1)/mean(tissue2) -> the plain ratio is very biased, not advisable
- log2 ratio = log2(mean(tissue1)/mean(tissue2)) -> so the one 'tissue' doesn't overtake the other - it's less biased
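a quick R illustration with made-up expression values:
t1 <- c(10, 12, 11); t2 <- c(5, 6, 4)   # assumed tissue measurements
mean(t1) / mean(t2)                     # plain ratio: 2.2
log2(mean(t1) / mean(t2))               # log2 ratio: ~1.14
# a 2-fold change up gives +1, a 2-fold change down gives -1: symmetric around 0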
what defines statistical significance
- if it has been predicted as unlikely to have occurred by chance alone
- measured by probability value (p-value)
- the null hypothesis is rejected if p < 0.05
- the smaller the p value, the larger the significance
student’s t-test
- compares the means of two groups
- assumes normal distribution of data
- t.test() function in R
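a minimal sketch of t.test() in R (simulated data assumed):
set.seed(1)
groupA <- rnorm(50, mean = 10, sd = 2)
groupB <- rnorm(50, mean = 11, sd = 2)
t.test(groupA, groupB)   # Welch two-sample t-test by default; reports t, df and p-value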
one-sided t-test
- hypothesis: group A < group B
- > is more powerful
two-sided t-test
- hypothesis: group A != group B
- > always use this unless you have a directional hypothesis/previous info
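how to switch between the two in R, using made-up vectors:
a <- c(9.1, 10.2, 9.8, 10.5); b <- c(10.9, 11.4, 10.8, 12.0)
t.test(a, b, alternative = "less")   # one-sided: H1 is mean(a) < mean(b)
t.test(a, b)                         # two-sided (the default): H1 is mean(a) != mean(b)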
one-sample t-test
- tests the hypothesis (H0) that the population mean is equal to a specific value µ0
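a one-line sketch in R (values assumed):
x <- c(4.8, 5.2, 5.1, 4.9, 5.4)
t.test(x, mu = 5)   # H0: the population mean equals µ0 = 5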
two sample t-tests
- independent (unpaired) samples
- > 100 people - 50 control, 50 treatment
- > preferably two groups of equal size and variance
paired sample (“intervention” in the middle)
- > generally preferred
- > “repeated measures” t-test
- > before and after treatment measurements
- > reduces (or eliminates) the effects of confounding factors
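a minimal paired-vs-unpaired sketch in R (made-up before/after measurements):
before <- c(120, 135, 128, 140, 132)
after <- c(114, 130, 126, 133, 127)    # same subjects after treatment
t.test(before, after, paired = TRUE)   # repeated-measures t-test
# t.test(before, after) without paired = TRUE would wrongly treat them as independent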
normality assumptions - parametric tests
- assume the input data follows a known distribution
- each of the two populations being compared should follow a normal distribution
- variances of the two populations are also assumed to be equal
- samples should be random and independent
how to test for normality
Shapiro-Wilk test: shapiro.test()
how to handle differences in variance between populations
if variances are unequal: Welch's t-test (R's default)
if variances are equal: use t.test(..., var.equal = TRUE)
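a sketch of the full check in R (simulated samples assumed):
set.seed(1)
x <- rnorm(30); y <- rnorm(30, sd = 2)
shapiro.test(x)                  # H0: data are normally distributed
var.test(x, y)                   # F test, H0: the two variances are equal
t.test(x, y)                     # Welch's t-test (R's default)
t.test(x, y, var.equal = TRUE)   # classic Student's t-test, if variances are equal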
normality assumptions - non-parametric tests
- does not assume data follows a certain distribution
- > often rely on rank methods
Tests:
Mann-Whitney U test
wilcox.test()
Kruskal-Wallis test
Friedman test
Mann-Whitney U test
2-group Mann-Whitney U test
- wilcox.test(y~A): y = numeric measurements, A = two-level factor (GroupA/GroupB)
- wilcox.test(x,y): x = numeric group A measurements, y = numeric group B measurements
2-group Wilcoxon signed rank test
- wilcox.test(x, y, paired=TRUE): where x and y are numeric "repeated measurements"
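both calls side by side in R (made-up measurements):
a <- c(1.2, 3.4, 2.2, 5.1); b <- c(2.8, 4.4, 6.0, 3.9)
wilcox.test(a, b)                  # unpaired: Mann-Whitney U test
wilcox.test(a, b, paired = TRUE)   # paired: Wilcoxon signed rank test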
Kruskal-Wallis test
- one-way ANOVA by ranks
kruskal.test(y~A): y is numeric, A is a factor (many levels)
H0: all groups come from the same population
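a minimal sketch in R (toy data assumed):
y <- c(2.1, 2.5, 3.9, 4.2, 1.8, 2.0)
A <- factor(c("a", "a", "b", "b", "c", "c"))
kruskal.test(y ~ A)   # H0: all groups come from the same population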
Friedman test
- randomised block design
friedman.test(y~A|B): y = numeric data values, A is a grouping factor, B is a blocking factor - e.g. potato yield (y) of types of potato plants (A), measured across different fields (B)
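the potato example as a runnable R sketch (yields are made up):
yield <- c(4.2, 4.8, 3.9, 5.1, 5.5, 4.6, 3.8, 4.1, 3.5)
plant <- factor(rep(c("typeA", "typeB", "typeC"), each = 3))   # grouping factor A
field <- factor(rep(c("f1", "f2", "f3"), times = 3))           # blocking factor B
friedman.test(yield ~ plant | field)   # each plant type measured once per field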
Correlation
- is a measure of dependence between two variables
- R provides the cor() function
- correlations are useful because they can indicate a predictive relationship that can be exploited in practice
correlation != causation
types of correlation
pearson - no transformation : fast, but sensitive to outliers
spearman - rank based transformation: slower, but more robust
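a sketch in R showing the outlier sensitivity (simulated data assumed):
set.seed(1)
x <- 1:10
y <- x + rnorm(10, sd = 0.5)     # roughly linear relation
cor(x, y, method = "pearson")    # close to 1
cor(x, y, method = "spearman")   # rank based, also close to 1
y[1] <- 100                      # one outlier against the trend...
cor(x, y)                        # ...wrecks Pearson (can even flip its sign)
cor(x, y, method = "spearman")   # ...while Spearman stays moderate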
multiple testing and what is accepted as significant
- we test gene expression data for significant differences: does gene A significantly differ between conditions?
- we perform many of these tests (commonly 20,000 genes)
- as the number of comparisons increases, it becomes more likely that the groups being compared will appear to differ in terms of at least one attribute
- to preserve the 1-in-20 threshold, compensation for the number of tests performed is needed
- simplest correction: Bonferroni
p-value < 0.05/#tests -> with 20,000 genes: p < 2.5x10^-6
different errors in multiple testing
type I error: calling a gene significantly changed even though it differs just by chance - controlled by the Bonferroni correction
type II error: missing a significantly changed gene - mitigated by the Benjamini-Hochberg false discovery rate procedure
how to adjust the p-value
using the p.adjust function
p.adjust(0.0015, “bonferroni”, 10)
- p-value = 0.0015, # tests = 10
adjusted p-values below 0.05 are considered significant
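the card's example plus a vector of p-values, in R:
p.adjust(0.0015, "bonferroni", 10)        # 0.0015 * 10 = 0.015
p <- c(0.0001, 0.0015, 0.02, 0.04, 0.3)   # assumed raw p-values from 5 tests
p.adjust(p, method = "bonferroni")        # strict: multiplies by the number of tests
p.adjust(p, method = "BH")                # Benjamini-Hochberg false discovery rate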
how to get free microarray data
- Gene Expression Omnibus (NCBI)
- > only storage and retrieval
- ArrayExpress (EBI)
- > has the Gene Expression Atlas: curated, re-annotated archive data
- > storage, retrieval and analysis
- > different biological conditions across experiments