analysis of gene expression 2 Flashcards
1
Q
identification of differentially expressed genes
A
- data analysis methods:
- fold change/thresholds
- t-test
2
Q
fold change
A
- average fold changes of experimental replicates
- or average of log ratios (more accurate)
- decide threshold
- modification = additional criteria for intensity change
- require absolute change e.g. by 10 units
- or floor intensity data (set all values below 10 to 0)
- reduces number of false positives and corrects for systematic errors
- fewer hypotheses to be tested
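As a rough illustration of this card, the sketch below averages per-replicate log2 ratios and then applies the absolute-change criterion. The function names, the paired-replicate layout, the 2-fold threshold and the 10-unit minimum are assumptions made for the example, not values prescribed by the notes.

```python
import math

def mean_log2_ratio(treated, control):
    """Average of per-replicate log2(treated / control) ratios.

    Assumes replicates are paired and all intensities are positive.
    """
    ratios = [math.log2(t / c) for t, c in zip(treated, control)]
    return sum(ratios) / len(ratios)

def passes_fold_change(treated, control, fold_threshold=2.0, min_abs_change=10.0):
    """Hypothetical filter: fold-change threshold plus absolute-change criterion."""
    log_fc = mean_log2_ratio(treated, control)
    mean_diff = sum(treated) / len(treated) - sum(control) / len(control)
    big_enough_fold = abs(log_fc) >= math.log2(fold_threshold)
    big_enough_change = abs(mean_diff) >= min_abs_change
    return big_enough_fold and big_enough_change

# A well-expressed gene with a consistent 4-fold increase passes.
strong = passes_fold_change([400, 420, 380], [100, 105, 95])
# A poorly expressed gene shows a similar fold change but only a 3-unit shift.
weak = passes_fold_change([4, 5, 3], [1, 1, 1])
```

Here `strong` is kept and `weak` is rejected by the absolute-change criterion, which is how the extra criterion removes false positives among poorly expressed genes.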
3
Q
fold change advantages
A
- simple to implement and easy to calculate
- straightforward interpretation
- can use with few replicates
4
Q
fold change disadvantages
A
- small intensity changes can produce large calculated fold changes in poorly expressed genes
- doesn’t account for noisy data
- outliers have large effect on average fold change
- not statistically-based
- threshold choice is arbitrary
- chosen for convenience, not derived mathematically
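Two of these disadvantages can be seen with simple arithmetic. The numbers below are invented for illustration only: the first pair contrasts a poorly expressed gene with a highly expressed one, and the second shows one outlying replicate dominating the average fold change.

```python
import statistics

def fold_change(before, after):
    """Plain ratio of intensities (assumes before > 0)."""
    return after / before

# Poorly expressed gene: a 6-unit change gives a 4-fold change.
low_expressed = fold_change(2.0, 8.0)
# Highly expressed gene: a 500-unit change gives only a 1.5-fold change.
high_expressed = fold_change(1000.0, 1500.0)

# Four replicate fold changes, one of which is an outlier.
replicate_ratios = [1.1, 0.9, 1.0, 6.0]
mean_ratio = statistics.mean(replicate_ratios)      # pulled upward by the outlier
median_ratio = statistics.median(replicate_ratios)  # barely affected
```

The mean of the replicate ratios exceeds a 2-fold threshold even though three of the four replicates show essentially no change.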
5
Q
t test analysis
A
- statistical version of fold change
- assumes 2 samples are normally distributed
- investigates whether their means are the same or different by calculating a t statistic for each gene across its replicates
- null hypothesis - average expression is same for both samples
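A minimal sketch of the per-gene calculation follows. The notes do not say which form of the t-test is meant, so this assumes Welch's unequal-variance version; the function name is hypothetical. Converting t into a p-value needs the t distribution and is left out here.

```python
import math
import statistics

def welch_t_statistic(sample_a, sample_b):
    """Return (t, degrees_of_freedom) for two groups of replicates of one gene.

    Assumes each group has at least two replicates and non-zero variance.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    # Sample variances (n - 1 denominator), scaled by group size.
    var_a = statistics.variance(sample_a) / n_a
    var_b = statistics.variance(sample_b) / n_b
    t = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (var_a + var_b) ** 2 / (var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1))
    return t, df

# Example with made-up expression values for one gene in two conditions.
t_value, dof = welch_t_statistic([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
```

A large absolute t relative to the degrees of freedom is evidence against the null hypothesis that the average expression is the same in both samples.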
6
Q
multiple testing correction
A
- more tests means you increase the number of apparently significant results you would expect by chance alone
- if alpha = 0.05, you would expect about 50 false positives in 1000 tests even if no gene is truly differentially expressed
- need to correct for this
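The card does not name a correction method. As an assumption, the sketch below shows two widely used ones: the Bonferroni threshold and Benjamini-Hochberg adjusted p-values. Function names and the example p-values are made up for illustration.

```python
def expected_false_positives(alpha, n_tests):
    """Expected number of chance 'hits' if every null hypothesis is true."""
    return alpha * n_tests

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under the Bonferroni correction."""
    return alpha / n_tests

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

chance_hits = expected_false_positives(0.05, 1000)
strict_alpha = bonferroni_threshold(0.05, 1000)
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Bonferroni controls the chance of any false positive and is conservative with thousands of genes; Benjamini-Hochberg controls the expected fraction of false positives among the genes called significant.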
7
Q
t test advantages and disadvantages
A
- advantages:
- statistical
- fewer false positives than fold change
- can combine RNAseq and microarray data
- disadvantages:
- usually few replicates - limits statistical power
- can lead to large gene-to-gene fluctuations in calculated standard deviation with small replicate number
8
Q
DNA binding sites
A
- indicate how transcription is controlled
- can predict regions that lead to expression of particular genes
- experimental identification or de novo prediction to create binding site library
9
Q
using DNA motif knowledge
A
- search sequence for known sites
- identify and search for restriction sites
- use information to model binding site
- create consensus
- decide number of allowed mismatches
- depends on sequence properties
- create weight matrix
- create position frequency matrix
- create position weight matrix
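The first steps on this card can be sketched as follows: scan a sequence for matches to a consensus with a fixed number of allowed mismatches, then tally aligned sites into a position frequency matrix. The function names, the example consensus and the example sequences are hypothetical, and the sketch searches one strand only.

```python
def find_sites(sequence, consensus, max_mismatches=1):
    """Return (start, window, mismatches) for every window close to the consensus."""
    hits = []
    k = len(consensus)
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        mismatches = sum(1 for a, b in zip(window, consensus) if a != b)
        if mismatches <= max_mismatches:
            hits.append((start, window, mismatches))
    return hits

def position_frequency_matrix(sites):
    """Count each base at each position across equal-length aligned sites."""
    length = len(sites[0])
    pfm = {base: [0] * length for base in "ACGT"}
    for site in sites:
        for i, base in enumerate(site):
            pfm[base][i] += 1
    return pfm

hits = find_sites("GGTATAATCCTATCAT", "TATAAT", max_mismatches=1)
pfm = position_frequency_matrix(["TATAAT", "TATCAT", "TACAAT"])
```

Raising `max_mismatches` finds more degenerate sites but also more chance matches, which is why the allowed number depends on the properties of the sequence being searched.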
10
Q
binding site PWM
A
- probability of base b in position i, p(b,i)
- pseudocount to correct for finite number of input sequences
- sigma to represent the general (background) probability of each base occurring
- score sites to indicate certainty/uncertainty of particular base at that position
- search sequence for objects that are likely to arise from that PWM
- score directly related to binding energy of DNA-protein interaction
- statistical and energy-based model
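One common way to turn these ingredients into scores is a log-odds matrix. The sketch below is an assumption about the exact formula, since the card does not give one: it estimates p(b,i) with a pseudocount weighted by the background probability, takes log2 of p(b,i) over the background, and scores a candidate site by summing the entries. A uniform background of 0.25 per base and the example sites are also assumptions.

```python
import math

def position_weight_matrix(sites, pseudocount=1.0, background=None):
    """Log-odds PWM from equal-length aligned sites (illustrative formula)."""
    background = background or {base: 0.25 for base in "ACGT"}
    n_sites = len(sites)
    length = len(sites[0])
    pwm = {}
    for base in "ACGT":
        row = []
        for i in range(length):
            count = sum(1 for site in sites if site[i] == base)
            # The pseudocount keeps unseen bases from getting probability zero.
            p = (count + pseudocount * background[base]) / (n_sites + pseudocount)
            row.append(math.log2(p / background[base]))
        pwm[base] = row
    return pwm

def score_site(pwm, site):
    """Sum of per-position log-odds scores; assumes positions are independent."""
    return sum(pwm[base][i] for i, base in enumerate(site))

pwm = position_weight_matrix(["TATAAT", "TATAAT", "TATCAT", "TACAAT"])
consensus_score = score_site(pwm, "TATAAT")
unrelated_score = score_site(pwm, "GGGGGG")
```

Sites resembling the training set score well above zero, while unrelated windows score negative, so a score cutoff can be used when scanning a sequence.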
11
Q
assumptions of DNA motif knowledge approach
A
- nucleotide at one position has no effect on nucleotide present at adjoining position
- TFs have strict spatial requirements in binding sites that preclude variable spacing
12
Q
de novo prediction of DNA binding sites
A
- use of gene expression studies to identify coexpressed genes
- something upstream of coexpressed genes may explain expression behaviour
- statistical methods to identify the motif of interest in available sequences
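The card does not specify a statistical method. As a deliberately simplified stand-in, the sketch below counts which k-mers occur in most upstream regions of a coexpressed gene set but are rarer in a background set. Dedicated motif-finding tools use more sophisticated statistics; the function names, thresholds and sequences here are hypothetical.

```python
from collections import Counter

def kmer_sequence_counts(sequences, k):
    """Number of sequences in which each k-mer appears at least once."""
    counts = Counter()
    for seq in sequences:
        kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        counts.update(kmers)
    return counts

def enriched_kmers(coexpressed, background, k=6, min_fraction=0.8):
    """k-mers present in most coexpressed upstream regions and rarer in background."""
    fg = kmer_sequence_counts(coexpressed, k)
    bg = kmer_sequence_counts(background, k)
    results = []
    for kmer, n in fg.items():
        fg_frac = n / len(coexpressed)
        bg_frac = bg.get(kmer, 0) / len(background)
        if fg_frac >= min_fraction and fg_frac > bg_frac:
            results.append((kmer, fg_frac, bg_frac))
    # Largest foreground-minus-background difference first, ties by k-mer.
    return sorted(results, key=lambda r: (r[2] - r[1], r[0]))

upstream = ["ACGTATAATGC", "GGTATAATCCA", "TTTATAATGGG"]
control = ["ACGTACGTACG", "GGGCCCGGGCC", "TTTTGGGGCCC"]
candidates = enriched_kmers(upstream, control, k=6, min_fraction=1.0)
```

In this toy input the only 6-mer shared by all three upstream regions is the candidate motif; real data would need a proper significance test and tolerance for mismatches.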