2. Analysis of Gene Expression Data Flashcards

1
Q

Linear Regression vs ANOVA Model

A

-ANOVA is used in a similar way to linear regression, but for the case where the x variable is categorical

2
Q

ANOVA Model

Description

A

-suppose we have p categories to compare (usually two, healthy and ill)
-let n be the number of observations in each group i, i=1,…,p, then N=np is the total number of individual observations
-let yij denote the jth observation of the ith experimental group:
yij = μ + αi + εij
-where μ is the baseline
and αi is the effect of group i
-and εij is the random variation assumed to be i.i.d. N(0,σ²)

3
Q

ANOVA Model

Constraint

A

-in order for the parameters to be estimable apply constraint:
Σαi = 0
-so for the case p=2:
α1 = -α2

4
Q

ANOVA Model

Least Squares

A

-the parameters can be estimated by the least squares method
-let the sum of squared residuals be:
S(μ,αi) = Σrij² = Σ[yij-μ-αi]²
-where the sum is over i and j
-the estimates are obtained by minimising S(μ,αi)
-calculate dS/dμ and each dS/dαi, set each to zero and solve for μ^ and αi^ for i=1,…,p

5
Q

ANOVA Model

Estimates

A
μ^ = y.._ = Σyij/N
-where the sum is over i and j
αi^ = yi._ - y.._
-where:
yi._ = Σyij/n
-where the sum is over j
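
As a small illustration of these estimates, the sketch below computes μ^ and αi^ for a made-up balanced data set and checks the constraint Σαi = 0. The group names and values are hypothetical, not taken from the notes.

```python
from statistics import mean

# Hypothetical balanced data: p = 2 groups, n = 3 observations each.
groups = {
    "healthy": [4.0, 5.0, 6.0],
    "ill": [7.0, 8.0, 9.0],
}

# Grand mean: sum over i and j divided by N = n * p.
all_obs = [y for ys in groups.values() for y in ys]
mu_hat = mean(all_obs)

# Group effects: group mean minus grand mean.
alpha_hat = {name: mean(ys) - mu_hat for name, ys in groups.items()}

print(mu_hat)     # 6.5
print(alpha_hat)  # {'healthy': -1.5, 'ill': 1.5}
```

The two estimated effects sum to zero, as the constraint requires.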
6
Q

ANOVA Model

Simplest Matrix Form

A

yi = μ + εi
-can be written in matrix form as:
y = Xβ + ε
-where y is an nx1 column vector with entries y1, y2, …, yn
-X is an nx1 matrix of 1s
-β is a 1x1 vector with entry μ
-ε is an nx1 column vector with entries ε1,…,εn

7
Q

ANOVA Model

Matrix Form, p=2

A

yij = μ + αi + εij
-consider the case where p=2
-matrix form:
y = Xβ + ε
-y is a column vector with entries y11,y12,…,y1n,y21,y22,…,y2n
-ε is a column vector with entries ε11,ε12,…,ε1n,ε21,ε22,…,ε2n
-X is a 2nx2 matrix, since there are 2n total observations and 2 independent parameters: the first column is all 1s and the second column is n 1s followed by n -1s
-β is a 2x1 vector with entries μ and α1
-then α2 is given by the constraint, α2=-α1
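
A minimal sketch of this matrix form, assuming hypothetical data with n = 3 per group: it builds the 2n x 2 design matrix and solves the least squares normal equations by hand for the two parameters μ and α1.

```python
# Hypothetical data: n = 3 observations in each of p = 2 groups.
y = [4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
n = 3

# Design matrix: a column of 1s for mu, then +1 (group 1) / -1 (group 2) for alpha_1.
X = [[1.0, 1.0]] * n + [[1.0, -1.0]] * n

# Normal equations (X^T X) beta = X^T y, written out for a 2-column X.
a = sum(r[0] * r[0] for r in X)
b = sum(r[0] * r[1] for r in X)
d = sum(r[1] * r[1] for r in X)
u = sum(r[0] * yi for r, yi in zip(X, y))
v = sum(r[1] * yi for r, yi in zip(X, y))

det = a * d - b * b
mu_hat = (d * u - b * v) / det
alpha1_hat = (a * v - b * u) / det
alpha2_hat = -alpha1_hat  # from the constraint alpha_1 + alpha_2 = 0

print(mu_hat, alpha1_hat, alpha2_hat)  # 6.5 -1.5 1.5
```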

8
Q

ANOVA Models for Microarray Data

General

A

yavdg = μ + Aa + Vv + Dd + Gg + VGvg + DGdg + AGag + Eavdg
-where μ, Aa, Vv, Dd, Gg are described as main effects
-VGvg, DGdg, AGag are interaction terms
-Eavdg is the error
Aa = the ath array effect
Vv = the vth variety (experimental group) effect
Dd = the dth dye effect
Gg = the gth gene effect

9
Q

ANOVA Models for Microarray Data

Interaction Terms

A

-interaction terms are included in the model if there are at least two observations per combination of groups

10
Q

ANOVA Models for Microarray Data

Interaction Terms VG

A

-the interaction term VG is the effect of gene g in experimental group v
-the term VG must be present in ANY model because the construction of a cDNA microarray experiment must involve two different genes
-our interest is in this interaction term since it represents the relationship between a gene and an experimental group
-differential gene expression between two groups:
(VG)1g - (VG)2g
-we are interested in differential expression between the two groups but they are not directly comparable at this stage due to the different sources of variation in the raw microarray data

11
Q

ANOVA Models for Microarray Data

Interaction Terms AG

A

-represents the ‘spot effect’, included in the model if a gene is repeated twice (on the same array)

12
Q

ANOVA Models for Microarray Data

Interaction Terms DG

A
  • the term DG represents gene-specific dye effect / dye bias
  • included if labelling and samples involved are repeated at least twice
  • can be off-set by ‘dye swap’
13
Q

Normalisation

A

-arrays are not directly comparable due to different sources of systematic variation
-normalisation is an effort to remove systematic non-biological variations
-e.g. array effects, subject effects (if applicable), dye effects etc.
-assumes that the majority of genes are unchanged (not differentially expressed between experimental groups)
-the aim is for arrays and spots to become directly comparable so that any remaining variation is due to the biological question of interest

14
Q

Seeing Variation in the cDNA Microarray Data

A

-transform the intensities from the linear scale to a log scale
-then plot log ratio:
M = log(R/G)
-vs the abundance:
A = log(√[RG])
-where √[RG] is the geometric mean
-this should follow a horizontal trend but there is a slight slant due to dye bias and other contributing factors
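
The M and A values above can be computed per spot as in the sketch below. The intensities are hypothetical, and base 2 logarithms are an assumption here (the notes only say "log").

```python
from math import log2, sqrt

# Hypothetical (red, green) intensities for three spots.
spots = [(1200.0, 300.0), (500.0, 500.0), (100.0, 400.0)]

def ma_values(red, green):
    """Return (M, A): log ratio and log geometric-mean intensity (base 2 assumed)."""
    m = log2(red / green)
    a = log2(sqrt(red * green))
    return m, a

ma = [ma_values(r, g) for r, g in spots]
for m, a in ma:
    print(round(m, 3), round(a, 3))
```

Plotting M against A for all spots gives the plot described above; a spot with equal red and green intensity has M = 0.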

15
Q

Seeing Variation in Affy Array Data

A
  • in single-colour arrays, the discrepancy is observed between arrays
  • the peaks of the intensity distributions are of different heights and in different places
16
Q

Normalisation Methods

Median Normalisation

A
  • shifts the whole log-ratio distribution so that its median equals zero
  • but this doesn’t correct the shape
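
A minimal sketch of median normalisation on one array, using hypothetical log ratios: every value is shifted by the same amount, so the median becomes zero while the shape of the distribution is unchanged.

```python
from statistics import median

# Hypothetical log ratios (M values) from one array.
m_values = [0.8, 1.1, 0.9, 2.5, -0.3, 1.0, 1.2]

# Subtract the array's median so the normalised log ratios have zero median.
shift = median(m_values)
m_norm = [m - shift for m in m_values]

print(shift)           # 1.0
print(median(m_norm))  # 0.0
```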
17
Q

Normalisation Methods

ANOVA Normalisation

A
  • models the log ratios in the linear model
  • and ‘compensates’ each expression value so that only the term(s) related to the biological variability remain
  • corrects y*avdg for the other terms in the model so that we are left with the term VG
  • requires a large number of arrays
18
Q

Normalisation Methods

Loess / Lowess Normalisation

A
  • corrects the distribution of log ratios across gene expression intensities
  • done by fitting a loess / lowess curve to the log ratios as a function of abundance, then subtracting the fitted curve value from each log ratio at its level of expression abundance
  • shifts the log-ratio distribution ‘adaptively’ wrt their abundance
  • it suppresses the first five terms in the model
19
Q

Affymetrix Arrays

Quantile Normalisation

A

-this method assumes that the distribution of gene abundance is nearly the same in all samples
-it constructs a ‘reference’ array (not physical array) where the probes are taken to be the median of probes across samples
-to normalise an array it computes the quantile of the value in the distribution of probe intensities
-then it transforms the original value to that quantile’s value on the reference array
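xnorm = Fref^(-1) (F(x))
-where F is the distribution of probe intensities on the array being normalised and Fref is the distribution on the reference array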
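
The steps above can be sketched as follows, assuming hypothetical intensities, arrays of equal length and no tied values. The reference uses the median across samples at each rank, as in the notes; many implementations use the mean instead.

```python
from statistics import median

# Hypothetical probe intensities: three arrays (samples), four probes each.
arrays = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.5, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]

# Reference distribution: at each rank, the median of the sorted values across arrays.
sorted_arrays = [sorted(a) for a in arrays]
reference = [median(col) for col in zip(*sorted_arrays)]

def quantile_normalise(values, reference):
    """Replace each value by the reference value of the same rank (ties not handled)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    for rank, i in enumerate(order):
        out[i] = reference[rank]
    return out

normalised = [quantile_normalise(a, reference) for a in arrays]
print(reference)      # [2.0, 3.0, 4.0, 5.0]
print(normalised[1])  # [4.0, 2.0, 5.0, 3.0]
```

After normalisation every array has exactly the same set of values as the reference; only the assignment of values to probes differs.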

20
Q

ANOVA Model

After Normalisation

A

yavdg = VGvg + DGdg + AGag + Eavdg
-if the requirement for inclusion of DG and AG is not met then the model simplifies to:
yavg = VGvg + Eavg
-normalisation suppresses main effects

21
Q

Log Ratios and Experimental Design in cDNA Microarray

Notes

A
  • each group / variety MUST be labelled by one of the dyes, red or green
  • one spot in the microarray carries information from both groups (hence both colours)
  • we usually take a log ratio (red/green) or difference of the log between them to avoid spot bias
22
Q

Notations for Log Ratios

A
  • theoretical notation
  • practical notation

23
Q

Theoretical Notation for Log Ratios

Description

A

-take the simplest model:
yavg = (VG)vg + Eavg
-thus the log ratio between ‘treatment’ group v=1 and ‘control’ group v=2 can be written:
yag~ = ya1g - ya2g
= (VG)1g - (VG)2g + eag
= δg + eag
-now yag~ is the log ratio of treatment over control in array a and gene g, and δg represents the differential expression of treatment over control

24
Q

Theoretical Notation for Log Ratios

Matrices

A

-in matrix form:
y = Xβ + E
-where y is an nx1 vector with entries y1g,y2g,…,yng
-and X is an nx1 matrix with all 1s
-and β is a 1x1 vector with entry δg, the differential expression of red over green
-and E is an nx1 vector with entries e1g,e2g,…,eng

25
Q

Direct Comparison Design
Practical Notation for Log Ratios
Description

A

-in practice we have dye swaps to compensate for dye bias
-e.g. on arrays 1 and 3 red will be treatment and green will be control but on arrays 2 and 4 red will be control and green will be treatment
-computer software will always compute the log ratio of red over green regardless of which experimental group the dye corresponds to
-but we want the treatment over control ratio regardless of dye colour
-denote the model as:
yag~ = ya(1)g - ya(2)g
= (VG)1g - (VG)2g + eag
= μg + eag
-where the brackets indicate the ordered indexing so that the ratio is treatment over control regardless of dye

26
Q

Direct Comparison Design
Practical Notation for Log Ratios
Matrices

A

-in matrix form:
y = Xβ + E
-where y is a 4x1 vector with entries y1g,y2g,y3g,y4g, the red over green log ratios
-and X is a 4x1 matrix with entries 1,-1,1,-1
-and β is a 1x1 vector with entry μg, the differential expression of treatment over control
-and E is a 4x1 vector with entries e1g,e2g,e3g,e4g
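
A minimal sketch of the least squares estimate for this dye-swap design, with hypothetical red over green log ratios. Because X has a single column, (XtX)^(-1) Xty reduces to a ratio of two sums.

```python
# Hypothetical red-over-green log ratios for one gene on four dye-swapped arrays.
y = [1.0, -0.8, 1.2, -1.0]

# +1 where red is treatment (arrays 1 and 3), -1 where red is control (arrays 2 and 4).
x = [1.0, -1.0, 1.0, -1.0]

# Least squares for a single-column X: (X^T X)^(-1) X^T y = sum(x*y) / sum(x*x).
xtx = sum(xi * xi for xi in x)
xty = sum(xi * yi for xi, yi in zip(x, y))
mu_g_hat = xty / xtx  # treatment over control

# Flipping the signs of X gives control over treatment instead.
x_flipped = [-xi for xi in x]
mu_g_hat_flipped = sum(xi * yi for xi, yi in zip(x_flipped, y)) / xtx

print(round(mu_g_hat, 6), round(mu_g_hat_flipped, 6))
```

Multiplying by ±1 undoes the dye swap, so the estimate is the mean treatment over control log ratio.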

27
Q

Practical Notation for Log Ratios

Control Over Treatment

A
  • for control over treatment define the X matrix as -1,1,-1,1 instead of 1,-1,1,-1
  • then μg in β represents the differential expression of control over treatment
28
Q

Common Reference Design

Description

A
  • consider a 4 array experiment
  • the green dye is always a reference group
  • the red dye is experimental group 1 on the arrays 1 and 2 and experimental group 2 on arrays 3 and 4
  • the main aim is to test for differential expression of group 2 over group 1
  • but a secondary aim can be to test differential expression of each of group 1 and 2 over the reference
29
Q

Common Reference Design

Example

A
  • the reference group may be patients given no treatment
  • group 1 may be patients given the standard treatment
  • group 2 may be patients given an experimental treatment
30
Q

Common Reference Design
Approach 1
Matrices

A

-we have the general relation:
y = Xβ + ε
-where y is an n-vector of log ratios of red over green for each of the four arrays
-X is a 4x2 matrix with entries 1,1,0,0, in the first column and 0,0,1,1, in the second column
-β is a 2x1 vector with entries β1 and β2
-β1 and β2 therefore represent the differential expression of group 1 over reference and group 2 over reference respectively
-set contrast C, a 2x1 matrix with entries -1,1
-calculate:
θ^ = Ct β^ = β2^ - β1^
-θ^ represents the differential expression of group 2 over group 1
-so a large positive value of θ^ means that the gene is more expressed in group 2 than group 1

31
Q

Common Reference Design
Approach 2
Matrices

A

-we have the general relation:
y = Xβ + ε
-where y is an n-vector of log ratios of red over green for each of the four arrays
-X is a 4x2 matrix with entries 1,1,1,1, in the first column and 0,0,1,1, in the second column
-β is a 2x1 vector with entries β1 and β2
-β1 represents the differential expression of group 1 over reference
-β2 represents the differential expression of group 2 over group 1, so no need to introduce contrast

32
Q

Common Reference Design
Equivalence of Approaches
Description

A
  • let y be an n-vector of log ratios of red over green
  • let n1 be the number of arrays with green as reference and group 1 as red
  • let n2 be the number of arrays with green as reference and group 2 as red
  • then n=n1+n2
  • let G1 and G2 be sets of array indices for group 1 and group 2 respectively
33
Q

Common Reference Design
Equivalence of Approaches
Approach 1

A

-X is an nx2 matrix with the first column being n1 1s and then n2 0s and the second column being n1 0s and then n2 1s
-β is a 2x1 vector with entries β1 and β2
-multiply the general equation by Xt to obtain the least squares normal equations:
Xty = XtXβ^
-then:
β^ = (XtX)^(-1) Xty
-β1 is then the mean log ratio for group 1 over reference
-β2 is the mean log ratio for group 2 over reference
-contrast C, is a 2x1 matrix with entries -1,1
θ^ = Ct β^ = β2^ - β1^
-so θ^ is the difference in mean log ratio between groups 2 and 1 i.e. the differential expression of group 2 over group 1

34
Q

Common Reference Design
Equivalence of Approaches
Approach 2

A

-X is an nx2 matrix with the first column being n=n1+n2 1s and the second column being n1 0s and then n2 1s
-β is a 2x1 vector with entries β1 and β2
-multiply the general equation by Xt to obtain the least squares normal equations:
Xty = XtXβ^
-then:
β^ = (XtX)^(-1) Xty
-β1 is then the mean log ratio for group 1 over reference
-β2 is the difference in mean log ratio between groups 2 and 1 i.e. the differential expression of group 2 over group 1
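
The equivalence of the two approaches can be checked numerically. The sketch below uses hypothetical log ratios for a 4 array experiment (n1 = n2 = 2) and a small hand-written solver for a two-column design matrix; the helper name is an illustration only.

```python
def least_squares_2col(X, y):
    """Solve (X^T X) beta = X^T y for a design matrix with two columns."""
    a = sum(r[0] * r[0] for r in X)
    b = sum(r[0] * r[1] for r in X)
    d = sum(r[1] * r[1] for r in X)
    u = sum(r[0] * yi for r, yi in zip(X, y))
    v = sum(r[1] * yi for r, yi in zip(X, y))
    det = a * d - b * b
    return [(d * u - b * v) / det, (a * v - b * u) / det]

# Hypothetical red-over-green log ratios: arrays 1-2 carry group 1, arrays 3-4 group 2.
y = [0.5, 1.5, 2.0, 3.0]

# Approach 1: one indicator column per group, then apply the contrast C = (-1, 1).
X1 = [[1, 0], [1, 0], [0, 1], [0, 1]]
beta_1 = least_squares_2col(X1, y)
theta_hat = -1 * beta_1[0] + 1 * beta_1[1]

# Approach 2: column of 1s plus a group 2 indicator, no contrast needed.
X2 = [[1, 0], [1, 0], [1, 1], [1, 1]]
beta_2 = least_squares_2col(X2, y)

print(beta_1, theta_hat)  # [1.0, 2.5] 1.5
print(beta_2)             # [1.0, 1.5]
```

In approach 1 the estimates are the two group means and the contrast gives their difference; in approach 2 the second coefficient is that same difference directly.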

35
Q

T-Test for Direct Comparison Design

Matrices

A
  • for direct comparison design with 4 alternating arrays, y=Xβ+ε, where X is a 4x1 vector with entries -1,1,-1,1
  • let y* be a vector of treatment over control log ratios, obtained elementwise as yi*=xi yi
  • then y*=X*β+ε*, where X* is a 4x1 vector with entries 1,1,1,1
36
Q

T-Test for Direct Comparison Design

β^ Estimate

A

β^ = (XtX)^(-1) Xt y

= 1/n Σ yi* = y*_

37
Q

T-Test for Direct Comparison Design

σε²^ Estimate

A

σε²^ = 1/(n-p) Σ (yi* - y*_)²

= Var(y*), the sample variance of the yi*, since p=1 here

38
Q

T-Test for Direct Comparison Design

Variance and Standard Error of β^

A

Var(β^) = σε²^ (XtX)^(-1)
= 1/n σε²^
SE(β^) = SE(y*_)

39
Q

T-Test for Direct Comparison Design

T-Test

A
-so testing for differential expression of treatment over control becomes:
tg = β^/SE(β^) = y*_/SE(y*_)
= [y*_] / [SD(y*)/√n]
-a simple one-sample t-test
-under Ho:β=0 (no DE) tg follows a t distribution with n-1 degrees of freedom
-we reject Ho if |tg|>tn-1(α/2)
-or if the p-value:
Pho(|T|>|tg|) < α
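
A minimal sketch of this one-sample t-test for a single gene, with hypothetical treatment over control log ratios. The critical value 3.182 (two-sided 5% level, 3 degrees of freedom) is taken from standard t tables rather than computed.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical treatment-over-control log ratios y* for one gene on n = 4 arrays.
y_star = [1.0, 0.8, 1.2, 1.0]
n = len(y_star)

beta_hat = mean(y_star)            # estimate of differential expression
se_beta = stdev(y_star) / sqrt(n)  # sample SD (n - 1 denominator) over sqrt(n)
t_g = beta_hat / se_beta

# Two-sided 5% critical value of the t distribution with n - 1 = 3 degrees of
# freedom, taken from standard tables (assumed here, not computed).
T_CRIT_3DF = 3.182
reject_h0 = abs(t_g) > T_CRIT_3DF

print(round(t_g, 3), reject_h0)
```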
40
Q

T-Test for Single Colour Arrays

Description

A
  • let yavg be the normalised log expression of gene g in array a and variety v
  • analysis is done on a gene by gene basis so g index can be dropped
  • let v=1,2 for two experimental groups so that there are a=1,…,n1 in group 1 and a=1,…,n2 in group 2
  • we assume that y1a~N(μ1,σ²) and y2a~N(μ2,σ²)
41
Q

T-Test for Single Colour Arrays

Hypothesis

A

-if there is no differential expression, μ1=μ2 hence the hypothesis is:
Ho : μ1-μ2 = 0
H1 : μ1-μ2 ≠ 0

42
Q

T-Test for Single Colour Arrays

Test Statistic

A

-define y1_=Σy1a/n1 and y2_=Σy2a/n2 where the sums are over a
-since μ1^=y1_ and μ2^=y2_, we construct a two-sample t-test (for each gene) as:
t = [y1_ - y2_] / SE(y1_ - y2_)

43
Q

T-Test for Single Colour Arrays

T-Test

A

-under Ho: μ1=μ2, t follows a t-distribution with n1+n2-2 degrees of freedom
-we reject Ho if |t|>tn1+n2-2(α/2)
-or if the p-value:
Pho(|T|>|t|) < α
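
A minimal sketch of the two-sample statistic for one gene, with hypothetical data. The standard error uses the pooled variance, which is an assumption consistent with the common σ² in the model and the n1+n2-2 degrees of freedom.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical normalised log expressions of one gene in two groups.
y1 = [5.0, 6.0, 7.0]
y2 = [8.0, 9.0, 10.0, 11.0]
n1, n2 = len(y1), len(y2)

# Pooled variance, using the common-variance assumption from the model.
s2_pooled = ((n1 - 1) * variance(y1) + (n2 - 1) * variance(y2)) / (n1 + n2 - 2)
se_diff = sqrt(s2_pooled * (1 / n1 + 1 / n2))

t_stat = (mean(y1) - mean(y2)) / se_diff
df = n1 + n2 - 2

print(round(t_stat, 3), df)
```

The statistic is then compared with the critical value of the t distribution with df degrees of freedom, or converted to a p-value.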

44
Q

Identification of Differentially Expressed Genes

Overview

A
  • two groups of samples
  • have a condition for rejection of the null hypothesis
  • in reality we will get false positives, V (type I errors), and false negatives, T (type II errors)
45
Q

Testing Multiple Hypotheses

Description

A
  • under Ho, the test statistic has a 5% chance of being significant at the α=0.05 level
  • when we are testing thousands of hypotheses, about 5% of these statistics are expected to be significant by chance even if there is no differential expression
46
Q

Testing Multiple Hypotheses

Experiment-Wise Error Rate, EER

A

-suppose we are testing m hypotheses, each at significance level α and assumed to be independent
-then the EER would be:
P{at least one test rejects Ho | Ho is true}
= 1 - P{no test rejects Ho | Ho true}
= 1 - (1-α)^m
-in reality the tests would not be independent and the overall error rate would be lower, so the above gives an upper bound on the overall error rate
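
To see how quickly the EER grows with the number of tests, the sketch below evaluates 1 - (1-α)^m at α = 0.05 for a few values of m (the function name is illustrative).

```python
def experiment_wise_error_rate(alpha, m):
    """P(at least one rejection | all nulls true) for m independent tests."""
    return 1 - (1 - alpha) ** m

for m in (1, 10, 100, 1000):
    print(m, round(experiment_wise_error_rate(0.05, m), 4))
```

With only 10 independent tests the chance of at least one false positive is already about 40%, and with 100 tests it exceeds 99%.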

47
Q

What should be done about false positives?

A
  • when you have thousands of tests, you are bound to get false positives
  • we can adjust the p-values to reflect that we are testing multiple hypotheses
  • our aim is to control false positives, NOT reduce!
  • if we don’t do this we will be declaring 5% of all tests significant without knowing if there are any real effects at all
48
Q

Types of Errors

Random Variables

A
R = an observable, the total number of genes declared DE by the test
S = number of genes declared DE that actually are DE
V = number of genes declared DE that actually are not DE
U = number of genes declared non-DE that actually are non-DE
T = number of genes declared non-DE that actually are DE
49
Q

Family Wise Error Rate

Definition

A
  • FWER
  • defined as P(V≥1|Ho)
  • the probability of AT LEAST ONE type-I error given complete null hypothesis (none of the genes DE)
50
Q

False Discovery Rate

Definition

A

-defined as
E(V/R) = E(V/R | R>0) P(R>0)
-so that FDR=0 when R=0
-this is the expected proportion of type I errors (false positives) among the rejected hypotheses

51
Q

Family Wise Error Rate

Method

A
  • we consider three methods; Bonferroni, Sidak, Holm
  • the procedure for m tests/genes is:
    1) order the p-values p1≤p2≤…≤pk≤…≤pm
    2) adjust the p-values p1~≤p2~≤…≤pk~≤…≤pm~
    3) declare k significant if pk~≤α and claim Pho(V>0)≤α
52
Q

Family Wise Error Rate

Bonferroni Method

A

-the simplest and most conservative
-multiply p-values by the total number of tests, m
pk~ = m*pk or 1
-whichever is smaller, i.e. pk~ = min(m*pk, 1)

53
Q

Family Wise Error Rate

Sidak Method

A

pk~ = 1 - (1-pk)^m

  • it can be shown that Pho(V>0)=α for independent tests
  • less conservative than the Bonferroni correction
54
Q

Family Wise Error Rate

Holm Method

A
-formally,
pk~ = max_{l=1,...,k} (m-l+1) pl
-this means:
p1~ = p1 * m
p2~ = p2 * (m-1) or p1~
p3~ = p3 * (m-2) or p2~
...
-taking whichever is greater
-less conservative than the Bonferroni correction
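
The three FWER adjustments can be compared on the same p-values, as in the sketch below. The p-values are hypothetical, and capping the adjusted values at 1 is the usual convention, assumed here for Bonferroni and Holm.

```python
def bonferroni(pvals):
    """Multiply each p-value by m, capped at 1."""
    m = len(pvals)
    return [min(m * p, 1.0) for p in pvals]

def sidak(pvals):
    """Sidak adjustment 1 - (1 - p)^m."""
    m = len(pvals)
    return [1 - (1 - p) ** m for p in pvals]

def holm(sorted_pvals):
    """Step-down adjustment; expects p-values sorted in increasing order."""
    m = len(sorted_pvals)
    adjusted, running_max = [], 0.0
    for k, p in enumerate(sorted_pvals, start=1):
        running_max = max(running_max, (m - k + 1) * p)
        adjusted.append(min(running_max, 1.0))
    return adjusted

# Hypothetical ordered p-values for m = 4 tests.
p = [0.001, 0.01, 0.02, 0.04]
p_bonf, p_sidak, p_holm = bonferroni(p), sidak(p), holm(p)

print([round(v, 4) for v in p_bonf])
print([round(v, 4) for v in p_sidak])
print([round(v, 4) for v in p_holm])
```

For every test the Sidak and Holm adjusted p-values are no larger than the Bonferroni ones, which is the sense in which they are less conservative.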
55
Q

Family Wise Error Rate

Remarks

A
  • generally considered conservative in highly multiple testing in a biological context
  • not many genes are found to be significant
  • low power (for large m), will not detect real changes when there are some
  • controls the probability of getting at least one false positive
56
Q

Family Wise Error Rate

In Practice

A
  • in practice, we ‘allow’ some false positives in our results as long as their proportion is controlled
  • this motivates the false discovery rate
  • perhaps 10-20% are allowed to be false positive as long as the majority of real changes are detected
  • this is hypothesis-generating research: genes at the top of the list will be validated by more sensitive techniques
57
Q

False Discovery Rate

Method

A

-order the p-values p1≤p2≤…≤pk≤…≤pm
-then:
pk~ = min_{l≥k} (m * pl/l)
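
A minimal sketch of this adjustment, commonly known as the Benjamini-Hochberg procedure, applied to hypothetical ordered p-values. Working from the largest p-value downwards lets the minimum over l≥k accumulate in one pass.

```python
def fdr_adjust(sorted_pvals):
    """FDR-adjusted p-values; expects p-values sorted in increasing order."""
    m = len(sorted_pvals)
    adjusted = [0.0] * m
    running_min = 1.0
    # Work from the largest p-value down so the minimum over l >= k accumulates.
    for k in range(m, 0, -1):
        running_min = min(running_min, m * sorted_pvals[k - 1] / k)
        adjusted[k - 1] = running_min
    return adjusted

# Hypothetical ordered p-values for m = 4 tests.
p = [0.001, 0.01, 0.02, 0.04]
p_fdr = fdr_adjust(p)

print([round(v, 4) for v in p_fdr])
```

Genes with an adjusted p-value at or below the chosen FDR level (e.g. 0.05) are declared significant.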

58
Q

False Discovery Rate

Purpose

A
  • can tell us, of all the significant tests, how many are expected to be real
  • if we declare 100 genes significant with 5% FDR, then 5 of them are expected to be false positives
59
Q

5% FWER

A
  • means a 5% probability of getting at least one false positive in the result
  • by contrast, if each test is simply run at an unadjusted 5% significance level, we expect 5% of the tests with no real effect to be significant by chance
60
Q

5% FDR

A

-means 5% of the significant tests are expected to be identified by chance

61
Q

FWER vs FDR

A
  • we defend against the possibility of there being no real effects at all using FWER; the probability of getting at least one false positive
  • if there are some real effects, we ‘allow’ some false positives in our results as long as the proportions are controlled
  • then FDR is used to control the expected proportion of false positives among significant results
  • FDR is more lenient than the Bonferroni, Sidak and Holm adjustments