2. Analysis of Gene Expression Data Flashcards
Linear Regression vs ANOVA Model
-ANOVA is used like linear regression, but when the x variable is categorical
ANOVA Model
Description
-suppose we have p categories to compare (usually two, healthy and ill)
-let n be the number of observations in group i, i=1,…,p, then N=np is the total number of individual observations
-let yij denote the jth observation of the ith experimental group:
yij = μ + αi + εij
-where μ is the baseline
and αi is the effect of group i
-and εij is the random variation assumed to be i.i.d. N(0,σ²)
ANOVA Model
Constraint
-in order for the parameters to be estimable apply constraint:
Σαi = 0
-so for the case p=2:
α1 = -α2
ANOVA Model
Least Squares
-the parameters can be estimated by the least squares method
-let the sum of squared residuals be:
S(μ,αi) = Σrij² = Σ[yij-μ-αi]²
-where the sum is over i and j
-the estimates are obtained by minimising S(μ,αi)
-calculate dS/dμ and each dS/dαi, set each to zero and solve for μ^ and αi^ for i=1,…,p
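The minimisation steps above can be written out as a short derivation. This is a sketch that assumes a balanced design (n observations in each of the p groups) and the constraint Σαi = 0:

```latex
\begin{align*}
S(\mu,\alpha) &= \sum_{i=1}^{p}\sum_{j=1}^{n} (y_{ij}-\mu-\alpha_i)^2 \\
\frac{\partial S}{\partial \mu} &= -2\sum_{i}\sum_{j}(y_{ij}-\mu-\alpha_i) = 0
  \;\Rightarrow\; \sum_{i}\sum_{j} y_{ij} = np\,\hat\mu + n\sum_{i}\hat\alpha_i = N\hat\mu \\
\frac{\partial S}{\partial \alpha_i} &= -2\sum_{j}(y_{ij}-\mu-\alpha_i) = 0
  \;\Rightarrow\; \hat\alpha_i = \frac{1}{n}\sum_{j} y_{ij} - \hat\mu
\end{align*}
```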
ANOVA Model
Estimates
μ^ = y.._ = Σyij/N
-where the sum is over i and j
αi^ = yi._ - y.._
-where:
yi._ = Σyij/n
-where the sum is over j
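As a small numerical illustration of these estimates, the sketch below computes the grand mean and the group effects for a balanced design. The data values, the group names and the function name `anova_estimates` are made up for illustration.

```python
from statistics import mean

# Hypothetical log-expression values: p = 2 groups, n = 3 observations each
groups = {
    "healthy": [2.0, 2.5, 3.0],
    "ill": [4.0, 4.5, 5.0],
}

def anova_estimates(groups):
    """Return (mu_hat, alpha_hat) for a balanced one-way ANOVA with sum(alpha) = 0."""
    all_values = [y for ys in groups.values() for y in ys]
    mu_hat = mean(all_values)  # grand mean over i and j
    # group effect = group mean minus grand mean
    alpha_hat = {name: mean(ys) - mu_hat for name, ys in groups.items()}
    return mu_hat, alpha_hat

mu_hat, alpha_hat = anova_estimates(groups)
print(mu_hat, alpha_hat)
```

With p=2 the two estimated effects are equal and opposite, as the constraint requires.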
ANOVA Model
Simplest Matrix Form
yi = μ + εi
-can be written in matrix form as:
y = Xβ + ε
-where y is a column vector with entries y1, y2, …, yn
-X is an nx1 matrix (a column) of 1s
-β is a 1x1 vector with entry μ
-ε is an nx1 column vector with entries ε1,…,εn
ANOVA Model
Matrix Form, p=2
yij = μ + αi + εij
-consider the case where p=2
-matrix form:
y = Xβ + ε
-y is a column vector with entries y11,y12,…,y1n,y21,y22,…,y2n
-ε is a column vector with entries ε11,ε12,…,ε1n,ε21,ε22,…,ε2n
-X is a 2nx2 matrix, since there are 2n observations in total and 2 independent parameters: the first column is all 1s, the second column is n 1s followed by n -1s
-β is a 2x1 vector with entries μ and α1
-then α2 is given by the constraint, α2=-α1
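The matrix form for p=2 can be checked numerically. The sketch below builds X as described and solves the 2x2 normal equations (XᵀX)β = Xᵀy directly. The data and the function names `design_matrix` and `least_squares_2col` are hypothetical.

```python
def design_matrix(n):
    """Rows [1, 1] for group 1, then rows [1, -1] for group 2 (sum-to-zero coding)."""
    return [[1, 1]] * n + [[1, -1]] * n

def least_squares_2col(X, y):
    """Solve the 2x2 normal equations (X^T X) beta = X^T y by Cramer's rule."""
    a = sum(r[0] * r[0] for r in X)             # (X^T X)[0][0]
    b = sum(r[0] * r[1] for r in X)             # off-diagonal entry
    d = sum(r[1] * r[1] for r in X)             # (X^T X)[1][1]
    u = sum(r[0] * yi for r, yi in zip(X, y))   # (X^T y)[0]
    v = sum(r[1] * yi for r, yi in zip(X, y))   # (X^T y)[1]
    det = a * d - b * b
    return [(d * u - b * v) / det, (a * v - b * u) / det]

# Hypothetical observations: y11, y12, y13, y21, y22, y23
y = [2.0, 2.5, 3.0, 4.0, 4.5, 5.0]
X = design_matrix(3)
mu_hat, alpha1_hat = least_squares_2col(X, y)
alpha2_hat = -alpha1_hat  # from the constraint
print(mu_hat, alpha1_hat, alpha2_hat)
```

In a balanced design the two columns of X are orthogonal, so μ^ is the grand mean and α1^ is the group 1 mean minus the grand mean.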
ANOVA Models for Microarray Data
General
yavdg = μ + Aa + Vv + Dd + Gg + VGvg + DGdg + AGag + Eavdg
-where μ,Aa,Vv,Dd,Gg are described as main effects
-VGvg,DGdg,AGag are interaction terms
-Eavdg is the error
Aa = ath array effect
Vv = vth variety, experimental group effect
Dd = the dth dye effect
Gg = the gth gene effect
ANOVA Models for Microarray Data
Interaction Terms
-interaction terms are included in the model if there are at least two observations per combination of groups
ANOVA Models for Microarray Data
Interaction Terms VG
-the interaction term VG is the effect of gene g in experimental group v
-the term VG must be present in ANY model because the construction of a cDNA microarray experiment must involve two different genes
-our interest is in this interaction term since it represents the relationship between a gene and an experimental group
-gene expression between two groups:
(VG)1g - (VG)2g
-we are interested in differential expression between the two groups but they are not directly comparable at this stage due to different sources of variation in the raw microarray data
ANOVA Models for Microarray Data
Interaction Terms AG
-represents the ‘spot effect’, included in the model if a gene is spotted at least twice on the same array
ANOVA Models for Microarray Data
Interaction Terms DG
- the term DG represents gene-specific dye effect / dye bias
- included if labelling and samples involved are repeated at least twice
- can be off-set by ‘dye swap’
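One way to see why a dye swap offsets the dye bias is the worked equation below. It is a sketch under stated assumptions: gene g is measured on two arrays with the dye labels swapped, the symbol d_g for the gene-specific dye bias of the red/green log ratio and the symbols M_{1g}, M_{2g} for the two log ratios are introduced here for illustration, and δg is the differential expression of treatment over control.

```latex
\begin{align*}
M_{1g} &= \delta_g + d_g + e_{1g} && \text{(treatment labelled red)} \\
M_{2g} &= -\delta_g + d_g + e_{2g} && \text{(dyes swapped: treatment labelled green)} \\
\tfrac{1}{2}\,(M_{1g} - M_{2g}) &= \delta_g + \tfrac{1}{2}\,(e_{1g} - e_{2g})
\end{align*}
```

The dye term d_g cancels in the difference, leaving only the differential expression and noise.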
Normalisation
-arrays are not directly comparable due to different sources of systematic variation
-normalisation is an effort to remove systematic non-biological variations
-e.g. array effects, subject effects (if applicable), dye effects etc.
-assumes that the majority of genes are unchanged (not differentially expressed between experimental groups)
-aims for arrays and spots to become directly comparable, so that any remaining variation is due to the biological question of interest
Seeing Variation in the cDNA Microarray Data
-transform the intensities from the linear scale to a log scale
-then plot log ratio:
M = log(R/G)
-vs the abundance:
A = log(√[RG])
-where √[RG] is the geometric mean
-this should follow a horizontal trend but there is a slight slant due to dye bias and other contributing factors
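The M and A values above can be computed per spot as in the sketch below. The base-2 logarithm is an assumption (the cards do not state a base), and the intensities and the function name `ma_values` are hypothetical.

```python
import math

def ma_values(red, green):
    """Return (M, A) per spot, with M = log2(R/G) and A = log2(sqrt(R*G))."""
    M = [math.log2(r / g) for r, g in zip(red, green)]
    # log2(sqrt(R*G)) is the same as 0.5 * log2(R*G)
    A = [0.5 * math.log2(r * g) for r, g in zip(red, green)]
    return M, A

# Hypothetical red and green intensities for three spots
red = [400.0, 1600.0, 100.0]
green = [100.0, 1600.0, 400.0]
M, A = ma_values(red, green)
print(M)
print(A)
```

Plotting M against A gives the MA plot; a spot with equal red and green intensity has M = 0.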
Seeing Variation in Affy Array Data
- with single-channel arrays, discrepancies are observed between arrays
- the peaks are of different heights and in different places
Normalisation Methods
Median Normalisation
- shifts the whole log-ratio distribution so that its median equals zero
- but this doesn’t correct the shape
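A minimal sketch of median normalisation, assuming the log ratios of one array are held in a plain list (the values and the function name `median_normalise` are hypothetical):

```python
from statistics import median

def median_normalise(log_ratios):
    """Subtract the median so the normalised log ratios have zero median."""
    m = median(log_ratios)
    return [x - m for x in log_ratios]

# Hypothetical log ratios from one array
M = [0.9, 0.1, 0.4, -0.2, 0.6]
M_norm = median_normalise(M)
print(M_norm)
```

Every value moves by the same amount, so the spread and shape of the distribution are unchanged.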
Normalisation Methods
ANOVA Normalisation
- models the log ratios in the linear model
- and ‘compensates’ each expression so that only the term(s) related to the biological variability remain
- corrects y*avdg for the different terms in the model so that we are left with the term VG
- requires a large number of arrays
Normalisation Methods
Loess / Lowess Normalisation
- corrects the distribution of log ratios across gene expression intensities
- done by fitting a loess / lowess curve of the log ratio against expression abundance, then subtracting the fitted curve value from each log ratio at its abundance level
- shifts the log-ratio distribution ‘adaptively’ wrt their abundance
- it suppresses the first five terms in the model
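The idea can be sketched with a simplified local linear smoother. This is not a full lowess implementation: it uses tricube weights and a nearest-neighbour bandwidth but no robustness iterations, and the function name `loess_fit`, the parameter `frac` and the data are assumptions made for illustration.

```python
def loess_fit(x, y, frac=0.5):
    """Simplified loess: a weighted local linear fit evaluated at each x value."""
    n = len(x)
    k = max(2, int(round(frac * n)))  # number of neighbours in each local fit
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12  # bandwidth: distance to the k-th nearest point
        # tricube weights; points at distance >= h get weight 0
        w = [(1 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

# Hypothetical data: log ratios with a purely intensity-dependent trend
A = [6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0]
M = [0.5 + 0.1 * a for a in A]
trend = loess_fit(A, M, frac=0.75)
M_norm = [m - t for m, t in zip(M, trend)]  # subtract the fitted curve
print(M_norm)
```

Because the example trend is exactly linear, the normalised log ratios come out at essentially zero; on real data the residual scatter around the curve is what remains.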
Affymetrix Arrays
Quantile Normalisation
-this method assumes that the distribution of gene abundance is nearly the same in all samples
-it constructs a ‘reference’ array (not physical array) where the probes are taken to be the median of probes across samples
-to normalise an array it computes the quantile of the value in the distribution of probe intensities
-then it transforms the original value to that quantile’s value on the reference array
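Xnorm = Fref^(-1)(F(x))
-where F is the distribution of probe intensities on the array being normalised and Fref is that of the reference array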
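A minimal sketch of these steps follows. It assumes every array has the same number of probes, ignores ties, and takes the reference value at each rank to be the median of the sorted intensities across arrays, which is one reading of the description above. The data and the function name `quantile_normalise` are hypothetical.

```python
from statistics import median

def quantile_normalise(arrays):
    """arrays: equal-length lists of probe intensities, one list per array."""
    sorted_arrays = [sorted(a) for a in arrays]
    # reference array: median across arrays at each rank
    reference = [median(vals) for vals in zip(*sorted_arrays)]
    normalised = []
    for a in arrays:
        # probe indices ordered from smallest to largest intensity
        order = sorted(range(len(a)), key=lambda i: a[i])
        out = [0.0] * len(a)
        for rank, i in enumerate(order):
            out[i] = reference[rank]  # replace value by the reference value at its rank
        normalised.append(out)
    return normalised

# Hypothetical intensities: three arrays, three probes each
arrays = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0], [3.0, 4.0, 8.0]]
normalised = quantile_normalise(arrays)
print(normalised)
```

After normalisation every array has the same set of values, so their distributions are identical, while the ordering of probes within each array is preserved.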
ANOVA Model
After Normalisation
yavdg = VGvg + DGdg + AGag + Eavdg
-if the requirement for inclusion of DG and AG is not met then the model simplifies to:
yavg = VGvg + Eavg
-normalisation suppresses main effects
Log Ratios and Experimental Design in cDNA Microarray
Notes
- each group / variety MUST be labelled by one of the dyes, red or green
- one spot in the microarray carries information from both groups (hence both colours)
- we usually take a log ratio (red/green), i.e. the difference of the logs, to avoid spot bias
Notations for Log Ratios
- theoretical notation
- practical notation
Theoretical Notation for Log Ratios
Description
-take the simplest model:
yavg = (VG)vg + Eavg
-thus the log ratio between ‘treatment’ group v=1 and ‘control’ group v=2 can be written:
yag~ = ya1g - ya2g
= (VG)1g - (VG)2g + eag
= δg + eag
-now yag~ is the log ratio of treatment over control in array a and gene g, and δg represents the differential expression of treatment over control
Theoretical Notation for Log Ratios
Matrices
-in matrix form:
y = Xβ + E
-where y is an nx1 vector with entries y1g,y2g,…,yng
-and X is an nx1 matrix with all 1s
-and β is a 1x1 vector with entry δg, the differential expression of red over green
-and E is an nx1 vector with entries e1g,e2g,…,eng
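With X a single column of 1s, the least squares estimate (XᵀX)⁻¹Xᵀy reduces to the mean of the log ratios across arrays. The sketch below illustrates this for one gene; the log-ratio values and the function name `estimate_delta` are hypothetical.

```python
# Hypothetical log ratios (treatment over control) for gene g on n = 4 arrays
y_g = [1.2, 0.8, 1.1, 0.9]

def estimate_delta(log_ratios):
    """Least squares estimate of delta_g when X is a column of ones."""
    xtx = len(log_ratios)   # X^T X = n
    xty = sum(log_ratios)   # X^T y = sum of the log ratios
    return xty / xtx

delta_hat = estimate_delta(y_g)
print(delta_hat)
```

Here the estimate is simply the average log ratio of the gene over the four arrays.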