2. Analysis of Gene Expression Data Flashcards

Question

Direct Comparison Design Practical Notation for Log Ratios Description

Answer 1

-in practice we use dye swaps to compensate for dye bias -e.g. on arrays 1 and 3 red will be treatment and green will be control, but on arrays 2 and 4 red will be control and green will be treatment -computer software will always compute the log ratio of red over green regardless of which experimental group the dye corresponds to -but we want the treatment over control ratio regardless of dye colour -denote the model as: yag~ = ya(1)g - ya(2)g = (VG)1g - (VG)2g + eag = μg + eag -where the brackets indicate the ordered indexing so that the ratio is treatment over control regardless of dye

Answer 2

-in matrix form: y = Xβ + E -where y is an nx1 vector (here n=4) with entries y1g,y2g,y3g,y4g, the red over green log ratios -and X is a 4x1 matrix with entries 1,-1,1,-1 -and β is a 1x1 vector with entry μg, the differential expression of treatment over control -and E is an nx1 vector with entries e1g,e2g,e3g,e4g

Answer 3

- for control over treatment define the X matrix as -1,1,-1,1 instead of 1,-1,1,-1 - then μg in β represents the differential expression of control over treatment
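As a rough illustration of the dye-swap design above, the sketch below estimates μg for one gene by least squares. The log ratios are made-up numbers, and the variable names are hypothetical. For a one-column X, (XtX)^(-1) Xty reduces to a ratio of two sums.

```python
# Hypothetical log2(red/green) ratios for one gene on four arrays.
# Arrays 1 and 3: red = treatment; arrays 2 and 4: red = control (dye swap).
y = [1.2, -0.9, 1.1, -1.0]
x = [1, -1, 1, -1]  # the single column of the design matrix X

# For a one-column design, (X^t X)^(-1) X^t y = sum(x*y) / sum(x*x).
xtx = sum(xi * xi for xi in x)
mu_hat = sum(xi * yi for xi, yi in zip(x, y)) / xtx  # treatment over control

# Flipping the sign of X gives control over treatment instead.
x_flipped = [-xi for xi in x]
mu_hat_flipped = sum(xi * yi for xi, yi in zip(x_flipped, y)) / xtx

print(mu_hat, mu_hat_flipped)
```

Changing the sign of X only changes the sign of the estimate, which is why either direction can be reported from the same data.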

Answer 4

- consider a 4 array experiment - the green dye is always a reference group - the red dye is experimental group 1 on the arrays 1 and 2 and experimental group 2 on arrays 3 and 4 - the main aim is to test for differential expression of group 2 over group 1 - but a secondary aim can be to test differential expression of each of group 1 and 2 over the reference

Answer 5

- the reference group may be patients given no treatment - group 1 may be patients given the standard treatment - group 2 may be patients given an experimental treatment

Answer 6

-we have the general relation: y = Xβ + ε -where y is an n-vector of log ratios of red over green for each of the four arrays -X is a 4x2 matrix with entries 1,1,0,0, in the first column and 0,0,1,1, in the second column -β is a 2x1 vector with entries β1 and β2 -β1 and β2 therefore represent the differential expression of group 1 over reference and group 2 over reference respectively -set contrast C, a 2x1 matrix with entries -1,1 -calculate: θ^ = Ct β^ = β2^ - β1^ -θ^ represents the differential expression of group 2 over group 1 -so a large positive value of θ^ means that the gene is more expressed in group 2 than group 1
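A minimal sketch of the contrast calculation, assuming made-up log ratios for the four arrays. The helper `ols_two_columns` is a hypothetical name; it solves the 2x2 normal equations directly rather than calling a linear algebra library.

```python
def ols_two_columns(X, y):
    """Solve the normal equations (X^t X) beta = X^t y for a two-column X."""
    a = sum(r[0] * r[0] for r in X)
    b = sum(r[0] * r[1] for r in X)
    d = sum(r[1] * r[1] for r in X)
    u = sum(r[0] * yi for r, yi in zip(X, y))
    v = sum(r[1] * yi for r, yi in zip(X, y))
    det = a * d - b * b
    # Explicit inverse of the 2x2 matrix [[a, b], [b, d]] applied to (u, v).
    return [(d * u - b * v) / det, (a * v - b * u) / det]

# Hypothetical log ratios (red over green): arrays 1-2 are group 1, 3-4 are group 2.
y = [0.5, 0.7, 1.4, 1.6]
X = [[1, 0], [1, 0], [0, 1], [0, 1]]

beta_hat = ols_two_columns(X, y)  # [group 1 over ref, group 2 over ref]
C = [-1, 1]                       # contrast: group 2 minus group 1
theta_hat = sum(c * b for c, b in zip(C, beta_hat))

print(beta_hat, theta_hat)
```

With these numbers θ^ is positive, so the gene would be read as more expressed in group 2 than in group 1.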

Answer 7

-we have the general relation: y = Xβ + ε -where y is an n-vector of log ratios of red over green for each of the four arrays -X is a 4x2 matrix with entries 1,1,1,1, in the first column and 0,0,1,1, in the second column -β is a 2x1 vector with entries β1 and β2 -β1 represents the differential expression of group 1 over reference -β2 represents the differential expression of group 2 over group 1, so no need to introduce contrast

Answer 8

- let y be an n-vector of log ratios of red over green - let n1 be the number of arrays with green as reference and group 1 as red - let n2 be the number of arrays with green as reference and group 2 as red - then n=n1+n2 - let G1 and G2 be the sets of array indices for group 1 and group 2 respectively

Answer 9

-X is an nx2 matrix with the first column being n1 1s and then n2 0s and the second column being n1 0s and then n2 1s -β is a 2x1 vector with entries β1 and β2 -multiply the general equation by Xt: Xty = XtXβ -then: β^ = (XtX)^(-1) Xty -β1^ is then the mean log ratio for group 1 over reference -β2^ is the mean log ratio for group 2 over reference -contrast C is a 2x1 matrix with entries -1,1, so θ^ = Ct β^ = β2^ - β1^ -so θ^ is the difference in mean log ratio between groups 2 and 1 i.e. the differential expression of group 2 over group 1

Answer 10

-X is an nx2 matrix with the first column being n=n1+n2 1s and the second column being n1 0s and then n2 1s -β is a 2x1 vector with entries β1 and β2 -multiply the general equation by Xt: Xty = XtXβ -then: β^ = (XtX)^(-1) Xty -β1^ is then the mean log ratio for group 1 over reference -β2^ is the difference in mean log ratio between groups 2 and 1 i.e. the differential expression of group 2 over group 1
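The two parametrisations can be compared numerically. The sketch below uses made-up data with n1=3 and n2=2 and a hypothetical helper, `solve_normal_equations`, to check that the contrast from the group-means design equals β2^ from the intercept-plus-difference design.

```python
def solve_normal_equations(X, y):
    """Return beta_hat = (X^t X)^(-1) X^t y for a design with two columns."""
    a = sum(r[0] * r[0] for r in X)
    b = sum(r[0] * r[1] for r in X)
    d = sum(r[1] * r[1] for r in X)
    u = sum(r[0] * yi for r, yi in zip(X, y))
    v = sum(r[1] * yi for r, yi in zip(X, y))
    det = a * d - b * b
    return [(d * u - b * v) / det, (a * v - b * u) / det]

n1, n2 = 3, 2
y = [0.4, 0.6, 0.5, 1.3, 1.5]  # hypothetical log ratios over the reference

# Parametrisation A: one column per group, so beta = (mean 1, mean 2).
X_means = [[1, 0]] * n1 + [[0, 1]] * n2
# Parametrisation B: column of 1s plus a group 2 indicator,
# so beta = (mean 1, mean 2 - mean 1).
X_diff = [[1, 0]] * n1 + [[1, 1]] * n2

b_means = solve_normal_equations(X_means, y)
b_diff = solve_normal_equations(X_diff, y)
theta_hat = b_means[1] - b_means[0]  # contrast C = (-1, 1)

print(b_means, b_diff, theta_hat)
```

Both routes give the same estimate of group 2 over group 1; the second simply builds the contrast into the design matrix.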

Answer 11

- for direct comparison design with 4 alternating arrays, y=Xβ+ε, where X is a 4x1 vector with entries -1,1,-1,1 - let y* be a vector of treatment over control ratios, y*=Xy taken element-wise (yi* = xi yi) - then y*=X*β+ε, where X* is a 4x1 vector with entries 1,1,1,1

Answer 12

β^ = (X*tX*)^(-1) X*t y* = 1/n Σ yi* = y*_

Answer 13

σε²^ = 1/(n-p) Σ (yi*-y*_)² = Var(y*) -here p=1, so this is the sample variance of y*

Answer 14

Var(β^) = σε²^ (X*tX*)^(-1) = 1/n σε²^ -so SE(β^) = √(σε²^/n) = SE(y*_)

Answer 15

``` -so testing for differential expression of treatment over control becomes: tg = β^/SE(β^) = y*_/SE(y*_) = [y*_] / [SD(y*)/√n] -a simple one-sample t-test -under Ho:β=0 (no DE) tg follows a t distribution with n-1 degrees of freedom -we reject Ho if |tg|>tn-1(α/2) -or if the p-value, Pho(|T|>|tg|) < α ```
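A small sketch of the one-sample test on sign-corrected log ratios, using only the standard library. The data are invented, and the critical value 3.182 (two-sided, α=0.05, 3 degrees of freedom) is taken from standard t tables rather than computed.

```python
import math
import statistics

# Hypothetical raw log ratios (red over green) and the dye-swap signs.
y_raw = [-1.2, 0.9, -1.1, 1.0]
x = [-1, 1, -1, 1]

# Sign-correct so every entry is treatment over control.
y_star = [xi * yi for xi, yi in zip(x, y_raw)]

n = len(y_star)
mean_star = statistics.mean(y_star)          # beta_hat
sd_star = statistics.stdev(y_star)           # sample SD, divisor n - 1
se_mean = sd_star / math.sqrt(n)             # SE of the mean
t_g = mean_star / se_mean
df = n - 1

T_CRIT_3DF = 3.182  # assumed table value for t_3(0.025)
reject_h0 = abs(t_g) > T_CRIT_3DF

print(t_g, df, reject_h0)
```

Computing an exact p-value would need the t distribution CDF, which is not in the standard library, so the sketch compares against a tabulated critical value instead.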

Answer 16

- let yavg be the normalised log expression of gene g in array a and variety v - analysis is done on a gene by gene basis so the g index can be dropped - let v=1,2 for the two experimental groups, so that there are arrays a=1,...,n1 in group 1 and a=1,...,n2 in group 2 - we assume that y1a~N(μ1,σ²) and y2a~N(μ2,σ²)

Answer 17

-if there is no differential expression, μ1=μ2 hence the hypothesis is: Ho : μ1-μ2 = 0 H1 : μ1-μ2 ≠ 0

Answer 18

-define y1_=Σy1a/n1 and y2_=Σy2a/n2 where the sums are over a -since μ1^=y1_ and μ2^=y2_, we construct a two-sample t-test (for each gene) as: t = [y1_-y2_] / SE(y1_-y2_)

Answer 19

-under Ho: μ1=μ2, t follows a t-distribution with n1+n2-2 degrees of freedom -we reject Ho if |t|>tn1+n2-2(α/2) -or if the p-value: Pho(|T|>|t|) < α
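The two-sample statistic can be sketched as below for one gene. The expression values are invented, and the pooled-variance form of SE(y1_-y2_) is an assumption that follows from the common σ² in the model; the flashcards do not spell out the SE formula.

```python
import math
import statistics

# Hypothetical normalised log expression values for one gene.
group1 = [7.1, 6.9, 7.3, 7.0]
group2 = [8.0, 8.4, 8.1]

n1, n2 = len(group1), len(group2)
mean1, mean2 = statistics.mean(group1), statistics.mean(group2)
var1, var2 = statistics.variance(group1), statistics.variance(group2)

# Pooled variance estimate under the equal-variance assumption.
pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))

t_stat = (mean1 - mean2) / se_diff
df = n1 + n2 - 2

print(t_stat, df)
```

The statistic is then compared with the t distribution on n1+n2-2 degrees of freedom, here 5.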

Answer 20

- two groups of samples - have a condition for rejection of the null hypothesis - in reality we will get false positives, V (type I errors) and false negatives, T (type II errors)

Answer 21

- under Ho, the test statistic has a 5% chance of being significant at the α=0.05 level - when we are testing thousands of hypotheses, about 5% of these statistics are expected to be significant by chance even if there is no differential expression

Answer 22

-suppose we are testing m hypotheses, each at significance level α and assumed to be independent -then the EER would be: P{at least one test rejects Ho | Ho is true} = 1 - P{no test rejects Ho | Ho true} = 1 - (1-α)^m -in reality the tests would not be independent; when they are positively correlated the overall error rate is lower, so the above then gives an upper bound on the overall error rate
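To see how quickly this error rate grows with m, the formula 1 - (1-α)^m can be evaluated directly. The function name below is hypothetical, and independence of the tests is assumed as in the derivation.

```python
def error_rate_independent(alpha, m):
    """P(at least one rejection | all nulls true) for m independent tests."""
    return 1 - (1 - alpha) ** m

for m in (1, 10, 100, 1000):
    print(m, round(error_rate_independent(0.05, m), 4))
```

Even at m=100 the probability of at least one false positive is already above 0.99, which is the motivation for adjusting p-values.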

Answer 23

- when you have thousands of tests, you are bound to get false positives - we can adjust the p-values to reflect that we are testing multiple hypotheses - our aim is to control false positives, NOT reduce them! - if we don't do this we will be declaring 5% of all tests significant without knowing if there are any real effects at all

Answer 24

```
R = an observable, the total number of genes declared differentially expressed (DE), R = S + V
S = number of genes declared DE that truly are DE
V = number of genes declared DE that truly are not DE (false positives)
U = number of genes declared non-DE that truly are not DE
T = number of genes declared non-DE that truly are DE (false negatives)
```
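The counts can be tallied from a list of true states and test decisions. This is only an illustration with invented labels; in a real experiment the true states are unknown, so only R is observable.

```python
# Hypothetical truth and test decisions for six genes.
truly_de =    [True, True, False, False, False, True]
declared_de = [True, False, True, False, False, True]

pairs = list(zip(truly_de, declared_de))
S = sum(1 for truth, call in pairs if truth and call)          # true positives
V = sum(1 for truth, call in pairs if not truth and call)      # false positives
U = sum(1 for truth, call in pairs if not truth and not call)  # true negatives
T = sum(1 for truth, call in pairs if truth and not call)      # false negatives
R = S + V                                                      # total declared DE

# Realised proportion of false positives among the rejections (0 if R == 0).
false_discovery_proportion = V / R if R > 0 else 0.0

print(S, V, U, T, R, false_discovery_proportion)
```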

Answer 25

- FWER - defined as P(V≥1|Ho) - the probability of AT LEAST ONE type-I error given complete null hypothesis (none of the genes DE)

Answer 26

-defined as E(V/R) = E(V/R | R>0) P(R>0) -so that FDR=0 when R=0 -this is the expected proportion of type I errors (false positives) among the rejected hypotheses

Answer 27

- we consider three methods; Bonferroni, Sidak, Holm - the procedure for m tests/genes is: 1) order the p-values p1≤p2≤...≤pk≤...≤pm 2) adjust the p-values p1~≤p2~≤...≤pk~≤...≤pm~ 3) declare k significant if pk~≤α and claim Pho(V>0)≤α

Answer 28

-the simplest and most conservative -multiply p-values by the total number of tests, m: pk~ = m*pk or 1 -whichever is smaller, i.e. pk~ = min(m*pk, 1)
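A minimal sketch of this adjustment, with adjusted values capped at 1 so they remain valid probabilities. The function name and the p-values are hypothetical.

```python
def bonferroni_adjust(pvalues):
    """Multiply each p-value by the number of tests, capping at 1."""
    m = len(pvalues)
    return [min(1.0, m * p) for p in pvalues]

pvals = [0.001, 0.01, 0.03, 0.4]
adjusted = bonferroni_adjust(pvals)
significant = [p_adj <= 0.05 for p_adj in adjusted]

print(adjusted, significant)
```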

Answer 29

pk~ = 1 - (1-pk)^m - it can be shown that Pho(V>0)=α when the tests are independent - less conservative than the Bonferroni correction
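The formula above can be coded in one line. The sketch below, with hypothetical names and p-values, also compares the result against m*pk capped at 1 to show that this adjustment is never the larger of the two.

```python
def sidak_adjust(pvalues):
    """Adjusted p-value 1 - (1 - p)^m for each of the m tests."""
    m = len(pvalues)
    return [1 - (1 - p) ** m for p in pvalues]

pvals = [0.001, 0.01, 0.03, 0.4]
adjusted = sidak_adjust(pvals)

# Reference values: multiply by m and cap at 1.
m = len(pvals)
times_m = [min(1.0, m * p) for p in pvals]

for p_adj, p_m in zip(adjusted, times_m):
    print(round(p_adj, 6), round(p_m, 6))
```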

Answer 30

```
-formally, pk~ = max_{l=1,...,k} (m-l+1) pl, capped at 1
-this means:
p1~ = p1 * m
p2~ = p2 * (m-1) or p1~
p3~ = p3 * (m-2) or p2~
...
-taking whichever is greater
-less conservative than the Bonferroni correction
```
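The step-down rule above can be sketched with a running maximum over the sorted p-values. The function name is hypothetical, the cap at 1 is added so that results stay valid probabilities, and adjusted values are returned in the original gene order.

```python
def holm_adjust(pvalues):
    """Step-down adjustment: running max of (m - l + 1) * p_(l), capped at 1."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):       # rank 0 corresponds to l = 1
        candidate = (m - rank) * pvalues[i]
        running_max = max(running_max, candidate)
        adjusted[i] = min(1.0, running_max)
    return adjusted

pvals = [0.01, 0.04, 0.03, 0.005]
adjusted = holm_adjust(pvals)
print(adjusted)
```

The running maximum is what keeps the adjusted p-values in the same order as the raw ones.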

Answer 31

- generally considered conservative in highly multiple testing in a biological context - not many genes are found to be significant - low power (for large m), will not detect real changes when there are some - controls the probability of getting at least one false positive

Answer 32

- in practice, we 'allow' some false positives in our results as long as their proportion is controlled - this motivates the false discovery rate - perhaps 10-20% are allowed to be false positives as long as the majority of real changes are detected - this is hypothesis-generating research; genes at the top of the list will be validated by more sensitive techniques

Answer 33

-order the p-values p1≤p2≤...≤pk≤...≤pm -then: pk~ = min_{l≥k} (m * pl/l)
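This adjustment, commonly known as the Benjamini-Hochberg adjustment, can be sketched by walking the sorted p-values from largest to smallest while keeping a running minimum. The function name is hypothetical; starting the running minimum at 1 caps the result at 1, which is an addition to the formula as written.

```python
def fdr_adjust(pvalues):
    """Adjusted p_k = min over l >= k of m * p_(l) / l, capped at 1."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):      # from the largest p-value down
        i = order[rank]
        candidate = m * pvalues[i] / (rank + 1)
        running_min = min(running_min, candidate)
        adjusted[i] = running_min
    return adjusted

pvals = [0.01, 0.04, 0.03, 0.005]
adjusted = fdr_adjust(pvals)
print(adjusted)
```

With these four hypothetical p-values every gene would be declared significant at a 5% FDR.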

Answer 34

- can tell us, out of all the significant tests, how many are expected to be real - if we declare 100 genes significant with 5% FDR, then 5 of them are expected to be false positives

Answer 35

- means a 5% probability of getting at least one false positive in the result - the significance level of each test is 5%, so we expect 5% of tests to be significant by chance

Answer 36

-means 5% of the significant tests are expected to be identified by chance

Answer 37

- we defend against the possibility of there being no real effects at all using FWER; the probability of getting at least one false positive - if there are some real effects, we 'allow' some false positives in our results as long as the proportions are controlled - then FDR is used to control the expected proportion of false positives among significant results - FDR is more lenient than the Bonferroni, Sidak and Holm adjustments