TFBS Prediction Flashcards
Week 2 Lecture 2
Why do we want to predict TFBS?
- They are key elements in the regulation of gene expression.
- They help in the general understanding of gene expression
- Can help with gene finding
- Mutations in TFBS can lead to disease
- Experimentally validated sites are limited in number
- Most experimental methods have poor resolution - need to find the actual site within the experimental site
Transcription initiation in prokaryotes
RNA polymerase has a strong affinity for the promoter and basal transcription rate is high
Transcription initiation in eukaryotes
RNA polymerase II and RNA polymerase III pre-initiation complexes don’t assemble efficiently, so the basal transcription rate is low and other transcription factors are needed for effective initiation
What are transcription factors?
Sequence-specific DNA binding factors that activate the initiation of eukaryotic transcription
Types of transcription factors
- Constitutive: Work for many different genes and don’t respond to external signals
- Regulatory: Limited number of genes and respond to external signals
What do transcription factors recognise?
- Upstream promoter elements: they influence initiation at the promoter to which they are attached
- Targets within enhancers: influence several genes at once
Traditional view of how transcription factors work
They activate the formation of the pre-initiation complex by:
1. Making direct contact
2. Making indirect contact
3. Inducing a DNA bend
New view on how transcription factors work
- Some can modify histone proteins affecting nucleosome positioning
- Some can bend DNA into a specific shape bringing other TFs into contact with the pre-initiation complex (enhanceosome)
- Some do not bind to DNA but form protein-protein contacts with the pre-initiation complex
Examples of transcription factors
Up to 2600 TFs
e.g. Oct-1, Oct-2, Heat shock factor, Serum response factor, GATA-1
Size of consensus sequences
Most are 9-15 bp, mean 12.2 bp
How do we predict TFBS?
- Encode the patterns that describe a binding site
- Scan these patterns against DNA (needs to be more intelligent than naive scanning)
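A minimal sketch of the naive scanning mentioned above, for contrast with the scored approaches on later cards. It looks for exact matches of a single consensus string on both strands. The function names and the example pattern are illustrative assumptions, not from the lecture. Because it tolerates no mismatches, it misses degenerate sites, which is one reason scanning needs to be more intelligent.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def naive_scan(sequence, pattern):
    """Report (position, strand) for every exact match of pattern.

    Positions are 0-based and refer to the forward strand.
    """
    hits = []
    rc = reverse_complement(pattern)
    k = len(pattern)
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        if window == pattern:
            hits.append((i, "+"))
        # A palindromic pattern is reported once, on the + strand only.
        if window == rc and rc != pattern:
            hits.append((i, "-"))
    return hits
```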
Techniques to identify TFBS
- Electro-Mobility Shift Assay (EMSA)
- DNase I footprinting/protection
- Systematic Evolution of Ligands by Exponential enrichment (SELEX)
- SELEX-Seq
- Chromatin ImmunoPrecipitation (ChIP)
- ChIP-chip
- ChIP-seq
- ChIP-exo
Electro-Mobility Shift Assay (EMSA)
in vitro
- The mobility on a gel is different for DNA bound to protein
- Compare three lanes: control (DNA only), DNA plus a protein that does not bind, DNA plus a protein that does bind
DNase I footprinting/protection
in vitro
- Combines DNase I cleavage with electrophoresis
- The bound protein shields the DNA from cleavage
SELEX
in vitro
- Large DNA library
- Select for binders by affinity chromatography or by EMSA
- Amplify by PCR, potentially with low-stringency copying to allow mutations
- Subsequent rounds use higher stringency elution
- Sequence binders
SELEX-Seq
in vitro
- same as SELEX but uses next-gen sequencing
- SELEX sequences a maximum of 10^2 oligos and requires many rounds
- SELEX-Seq characterises >=10^7 oligos and only requires 1 or 2 rounds
ChIP
in vivo
- DNA and associated proteins are cross-linked
- DNA-protein complexes are sheared into ~500bp DNA fragments
- Crosslinked DNA/protein are immunoprecipitated with a protein-specific antibody
- DNA fragments are purified and sequenced
ChIP-chip
in vivo
- Cross-link and shear
- Select with specific antibody and release DNA
- Label DNA with fluorescent tag
- Hybridise with DNA micro-array from genomic region of interest
- Identify matches from fluorescence
- Computational analysis
ChIP-seq
in vivo
- Same as ChIP but uses next-gen sequencing
- Sequence everything from the ChIP experiment
- A single sequencing run can scan for genome-wide associations with fairly high resolution (100-300 bp)
- Computational statistical analysis required - peak calling
- Peak calling finds regions of the genome that are enriched with aligned reads
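A toy sketch of the idea behind peak calling, assuming reads have already been aligned to one chromosome. It counts read start positions in fixed windows and flags windows whose count is well above the average. The window size, the fold threshold, and the function name are assumptions for illustration. Real peak callers use proper statistical models and control samples.

```python
from collections import Counter

def call_peaks(read_starts, genome_length, window=200, fold=3.0):
    """Flag fixed-size windows whose read count is >= fold x the mean count.

    read_starts: 0-based alignment start positions on one chromosome.
    Returns a list of (window_start, window_end, read_count) tuples.
    """
    n_windows = (genome_length + window - 1) // window
    counts = Counter(pos // window for pos in read_starts)
    mean = len(read_starts) / n_windows
    peaks = []
    for w in range(n_windows):
        c = counts.get(w, 0)
        if mean > 0 and c >= fold * mean:
            start = w * window
            peaks.append((start, min(start + window, genome_length), c))
    return peaks
```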
ChIP-exo
Refinement of ChIP-seq
- Uses exonucleases to trim the protein-bound DNA
- Claimed single-base resolution of binding sites
Comparison of methods for identifying TFBS
- EMSA and DNase I footprinting can identify non-specific binding sites
- All methods find regions larger than the actual TFBS
- ChIP-seq has better resolution than ChIP-chip
- ChIP-exo has better resolution than ChIP-seq
- ChIP-seq is currently the gold standard
Motif-discovery methods to identify TFBS
- Enumerative methods
  - Identify overrepresented strings
  - Examine the frequencies of all DNA strings from these
  - Less chance of getting stuck in local optima
- Probabilistic methods
  - Generate local MSA and learn descriptive parameters
  - Expectation maximisation (MEME)
  - Gibbs sampling (Bayesian Monte Carlo)
  - Greedy approaches
  - Arbitrary motif model variations
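A minimal sketch of the enumerative idea: count every k-mer in a set of bound sequences and rank the k-mers by how much more frequent they are than in a background set. The pseudocount, the function names, and the use of a separate background set are assumptions made for illustration.

```python
from collections import Counter

def count_kmers(sequences, k):
    """Count every overlapping k-mer across a list of DNA strings."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def overrepresented(foreground, background, k, pseudocount=1.0):
    """Rank foreground k-mers by their foreground/background frequency ratio."""
    fg = count_kmers(foreground, k)
    bg = count_kmers(background, k)
    n_possible = 4 ** k  # number of distinct DNA k-mers
    fg_total = sum(fg.values()) + pseudocount * n_possible
    bg_total = sum(bg.values()) + pseudocount * n_possible
    ratios = {
        kmer: ((fg[kmer] + pseudocount) / fg_total)
        / ((bg[kmer] + pseudocount) / bg_total)
        for kmer in fg
    }
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)
```

Because every string is counted exhaustively, there is no iterative search that could stall in a local optimum, but the cost grows quickly with k.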
MEME
- Struggles with huge datasets from genome-wide techniques
- Can use a random subset of sequences but this may be inaccurate
- Newer tools designed for huge datasets: ChIPMUNK, HOMER, MEME-ChIP, rGADEM
- Recent evaluation suggests rGADEM generates the best motifs from ChIP-seq data
Position Weight Matrices (PWMs)
- PWM simply shows the frequency at which each base is seen at each position - raw counts or fractions
- Fractional version - express each value as a fraction of the total count at that position
- Assumes each position is independent
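A short sketch of building both versions of the matrix from a set of aligned, equal-length binding sites. The layout (one dictionary per position) and the function names are assumptions chosen for readability.

```python
BASES = "ACGT"

def count_matrix(sites):
    """Per-position base counts from equal-length aligned binding sites.

    Returns a list with one {base: count} dict per position.
    """
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    matrix = [{b: 0 for b in BASES} for _ in range(length)]
    for site in sites:
        for pos, base in enumerate(site):
            matrix[pos][base] += 1
    return matrix

def fraction_matrix(counts):
    """Divide each count by the total number of sites seen at that position."""
    fractions = []
    for col in counts:
        total = sum(col.values())
        fractions.append({b: col[b] / total for b in BASES})
    return fractions
```

Each position is counted separately from its neighbours, which is the independence assumption in code form.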
Complex models for encoding TFBS
- Pair-correlation models
- Trees
- Feature-based models
- Hidden Markov Models
etc.
(classical PWMs generally perform better)
FIMO
Converts PWM data into a log-odds score
FIMO uses dynamic programming to convert log-odds scores into p-values
Uses a zero-order background model to take into account the relative frequency of each base in the sequences being scanned
The program reports all motif occurrences with a p-value less than 10^-4
This is converted into a q-value - the minimal false discovery rate (FDR) for which a motif is deemed significant
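A rough sketch of the two steps described on this card, not FIMO's actual implementation. Frequencies are converted into log-odds scores against a zero-order background, then dynamic programming builds the exact distribution of (rounded) scores so that a score can be turned into a p-value. The pseudocount, the integer scaling factor, and all function names are assumptions.

```python
import math

BASES = "ACGT"

def log_odds_matrix(freqs, background, pseudo=0.01):
    """log2(smoothed motif frequency / background frequency) per position and base."""
    matrix = []
    for col in freqs:
        row = {}
        for b in BASES:
            smoothed = (col[b] + pseudo) / (1 + 4 * pseudo)
            row[b] = math.log2(smoothed / background[b])
        matrix.append(row)
    return matrix

def score_distribution(lo_matrix, background, scale=100):
    """Distribution of integer-rounded scores for random background sites.

    Built one motif position at a time: {integer_score: probability}.
    """
    dist = {0: 1.0}
    for col in lo_matrix:
        new = {}
        for total, p in dist.items():
            for b in BASES:
                s = total + round(col[b] * scale)
                new[s] = new.get(s, 0.0) + p * background[b]
        dist = new
    return dist

def p_value(site, lo_matrix, dist, scale=100):
    """P(score >= observed score) for a site drawn from the background."""
    observed = sum(round(col[b] * scale) for col, b in zip(lo_matrix, site))
    return sum(p for s, p in dist.items() if s >= observed)
```

Under these assumptions, a scanner would report only windows with a p-value below a cutoff such as 10^-4.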
Background models to scan TFBS
- Zero-order simply looks at the nucleotide frequency (it counts the occurrences of A, T, C, and G in the sequences dataset)
- Higher-order models assume the probability of observing a certain nucleotide depends on the previous nucleotides
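A small sketch contrasting the two kinds of background model, assuming sequences contain only A, C, G and T. The first-order version stands in for higher-order models in general, and the function names are illustrative.

```python
from collections import Counter, defaultdict

def zero_order(sequences):
    """Base frequencies pooled over all sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq)
    total = sum(counts.values())
    return {b: counts[b] / total for b in "ACGT"}

def first_order(sequences):
    """P(next base | previous base), estimated from adjacent base pairs."""
    pair_counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            pair_counts[prev][nxt] += 1
    model = {}
    for prev, nxts in pair_counts.items():
        total = sum(nxts.values())
        model[prev] = {b: nxts[b] / total for b in "ACGT"}
    return model
```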
Approaches to scanning TFBS
- Individual sites
- Clusters
The choice depends on the context. If you have prior knowledge of a gene it is best to use individual predictors. If no prior knowledge use a cluster predictor.