TFBS Prediction Flashcards

Week 2 Lecture 2

1
Q

Why do we want to predict TFBS?

A
  1. They are key elements in the regulation of gene expression.
    - They help in the general understanding of gene expression
    - Can help with gene finding
    - Mutations in TFBS can lead to disease
  2. Experimentally validated sites are limited in number
  3. Most experimental methods have poor resolution - need to find the actual site within the experimental site
2
Q

Transcription initiation in prokaryotes

A

RNA polymerase has a strong affinity for the promoter and basal transcription rate is high

3
Q

Transcription initiation in eukaryotes

A

RNA polymerase II and RNA polymerase III pre-initiation complexes do not assemble efficiently, so the basal transcription rate is low and additional transcription factors are needed for effective initiation

4
Q

What are transcription factors?

A

Sequence-specific DNA binding factors that activate the initiation of eukaryotic transcription

5
Q

Types of transcription factors

A
  • Constitutive: Work for many different genes and don’t respond to external signals
  • Regulatory: Limited number of genes and respond to external signals
6
Q

What do transcription factors recognise?

A
  • Upstream promoter elements: they influence initiation at the promoter to which they are attached
  • Targets within enhancers: influence several genes at once
7
Q

Traditional view of how transcription factors work

A

They activate the formation of the pre-initiation complex by:
1. Making direct contact
2. Making indirect contact
3. Inducing a DNA bend

8
Q

New view on how transcription factors work

A
  1. Some can modify histone proteins affecting nucleosome positioning
  2. Some can bend DNA into a specific shape bringing other TFs into contact with the pre-initiation complex (enhanceosome)
  3. Some do not bind to DNA but form protein-protein contacts with the pre-initiation complex
9
Q

Examples of transcription factors

A

Up to 2600 TFs
e.g. Oct-1, Oct-2, Heat shock factor, Serum response factor, GATA-1

10
Q

Size of consensus sequences

A

Most are 9-15 bp, mean 12.2 bp

11
Q

How do we predict TFBS?

A
  1. Encode the patterns that describe a binding site
  2. Scan these patterns against DNA (needs to be more intelligent than naive scanning)
12
Q

Techniques to identify TFBS

A
  1. Electro-Mobility Shift Assay (EMSA)
  2. DNase I footprinting/protection
  3. Systematic Evolution of Ligands by Exponential enrichment (SELEX)
  4. SELEX-Seq
  5. Chromatin ImmunoPrecipitation (ChIP)
  6. ChIP-chip
  7. ChIP-seq
  8. ChIP-exo
13
Q

Electro-Mobility Shift Assay (EMSA)

A

in vitro
- The mobility on a gel is different for DNA bound to protein
- Control (DNA only), DNA and protein that does not bind, DNA and protein that does bind

14
Q

DNase I footprinting/protection

A

in vitro
- Combines DNase I cleavage with electrophoresis
- The bound protein shields the DNA from cleavage

15
Q

SELEX

A

in vitro
- Large DNA library
- Select for binders by affinity chromatography or by EMSA
- Amplify by PCR potentially with low stringency copying to allow mutations
- Subsequent rounds use higher stringency elution
- Sequence binders

16
Q

SELEX-Seq

A

in vitro
- same as SELEX but uses next-gen sequencing
- SELEX sequences a maximum of 10^2 oligos and requires many rounds
- SELEX-Seq characterises >=10^7 oligos and only requires 1 or 2 rounds

17
Q

ChIP

A

in vivo
- DNA and associated proteins are cross-linked
- DNA-protein complexes are sheared into ~500bp DNA fragments
- Crosslinked DNA/protein are immunoprecipitated with a protein-specific antibody
- DNA fragments are purified and sequenced

18
Q

ChIP-chip

A

in vivo
- Cross-link and shear
- Select with specific antibody and release DNA
- Label DNA with fluorescent tag
- Hybridise with DNA micro-array from genomic region of interest
- Identify matches from fluorescence
- Computational analysis

19
Q

ChIP-seq

A

in vivo
- Same as ChIP but uses next-gen sequencing
- Sequence everything from the ChIP experiment
- A single sequencing run can scan for genome-wide associations with fairly high resolution (100-300 bp)
- Computational statistical analysis required - peak calling
- Peak calling finds regions of the genome that are enriched with aligned reads
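
The idea behind peak calling can be illustrated with a toy sketch: bin the aligned read positions into fixed windows and flag windows with unusually many reads. The function name `call_peaks`, the window size, the fold threshold and the read positions are all assumptions made for illustration; real peak callers use statistical models and control samples rather than this simple rule.

```python
from collections import Counter

def call_peaks(read_positions, window=200, fold=3.0):
    """Return (start, end, count) for windows enriched with aligned reads.

    Toy rule (an assumption, not a real peak caller): a window is a peak
    when its read count is at least `fold` times the mean count over
    all windows that contain at least one read.
    """
    counts = Counter(pos // window for pos in read_positions)
    if not counts:
        return []
    mean = sum(counts.values()) / len(counts)
    return [
        (idx * window, (idx + 1) * window, n)
        for idx, n in sorted(counts.items())
        if n >= fold * mean
    ]

# Made-up read start positions on one chromosome.
reads = [10, 950, 1010, 1020, 1030, 1050, 1060, 1100, 1150, 1190, 2500, 3700]
print(call_peaks(reads))  # [(1000, 1200, 8)]
```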

20
Q

ChIP-exo

A

Refinement of ChIP-seq
- Uses exonucleases to trim the protein-bound DNA
- Claimed single-base resolution of binding sites

21
Q

Comparison of identifying TFBS methods

A
  • EMSA and DNase I footprinting can identify non-specific binding sites
  • All methods find regions larger than the actual TFBS
  • ChIP-seq has better resolution than ChIP-chip
  • ChIP-exo has better resolution than ChIP-seq
  • ChIP-seq is currently the gold standard
22
Q

Motif-discovery methods to identify TFBS

A
  1. Enumerative methods
    - Identify overrepresented strings
    - Examine the frequencies of all DNA strings from these
    - Less chance of getting stuck in local optima
  2. Probabilistic methods
    - Generate local MSA and learn descriptive parameters
    - Expectation maximisation (MEME)
    - Gibbs sampling (Bayesian Monte Carlo)
    - Greedy approaches
    - Arbitrary motif model variations
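
A minimal sketch of the enumerative idea, assuming fixed-length strings (k-mers) and made-up input sequences: count how many sequences contain each k-mer and report the best supported ones. The function name `overrepresented_kmers` is hypothetical, and a real enumerative tool would also compare the counts against a background expectation.

```python
from collections import Counter

def overrepresented_kmers(sequences, k=6, top=3):
    """Count how many sequences contain each k-mer; return the most common."""
    support = Counter()
    for seq in sequences:
        seq = seq.upper()
        # Use a set so that a k-mer is counted once per sequence.
        kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        support.update(kmers)
    return support.most_common(top)

# Toy sequences (made up): only GATAAG occurs in all three.
seqs = ["AAGATAAGCC", "CCGATAAGTT", "TTGATAAGGG"]
print(overrepresented_kmers(seqs, k=6, top=1))  # [('GATAAG', 3)]
```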
23
Q

MEME

A
  • Struggles with huge datasets from genome-wide techniques
  • Can use a random subset of sequences but this may be inaccurate
  • Newer tools designed for huge datasets: ChIPMUNK, HOMER, MEME-ChIP, rGADEM
  • Recent evaluation suggests rGADEM generates the best motifs from ChIP-seq data
24
Q

Position Weight Matrices (PWMs)

A
  • PWM simply shows the frequency at which each base is seen at each position - raw counts or fractions
  • Fractional version - express each value as a fraction of the total count at that position
  • Assumes each position is independent
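
As a sketch of how such a matrix might be built, the code below counts bases at each position of a set of aligned sites and then converts the counts to fractions. The function names and the four example sites are made up for illustration; each position is handled on its own, which is the independence assumption noted above.

```python
def count_matrix(sites):
    """Per-position base counts for equal-length, aligned binding sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    counts = [{base: 0 for base in "ACGT"} for _ in range(length)]
    for site in sites:
        for pos, base in enumerate(site.upper()):
            counts[pos][base] += 1
    return counts

def to_fractions(counts):
    """Divide each count by the total for its position."""
    return [
        {base: n / sum(col.values()) for base, n in col.items()}
        for col in counts
    ]

# Four made-up aligned sites of length 6.
sites = ["TATAAT", "TATAAT", "TACAAT", "TATGAT"]
counts = count_matrix(sites)
fractions = to_fractions(counts)
print(counts[2])     # {'A': 0, 'C': 1, 'G': 0, 'T': 3}
print(fractions[2])  # {'A': 0.0, 'C': 0.25, 'G': 0.0, 'T': 0.75}
```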
25
Q

Complex models for encoding TFBS

A
  • Pair-correlation models
  • Trees
  • Feature-based models
  • Hidden Markov Models
    etc.
    (classical PWMs generally perform better)
26
Q

FIMO

A

Converts PWM data into a log-odds score
FIMO uses dynamic programming to convert log-odds scores into p-values
Uses a zero-order background model to take into account the relative frequency of each base in the sequences being scanned
The program reports all motif occurrences with a p-value less than 10^-4
Each p-value is then converted into a q-value - the minimal false discovery rate (FDR) at which that motif occurrence is deemed significant
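
The scoring steps can be sketched as follows. This is not FIMO's code: the function names, the pseudocount, the rounding to 1/100 of a bit and the example sites are assumptions, and the q-value step is left out. The sketch turns aligned sites into log-odds scores against a zero-order background, uses dynamic programming to get the exact distribution of scores for a random window, and reports windows whose p-value falls below a threshold. Input is assumed to be upper-case A, C, G, T only.

```python
import math
from collections import defaultdict

SCALE = 100  # scores are rounded to 1/SCALE bits (an assumption)

def log_odds_matrix(sites, background, pseudocount=1.0):
    """Integer-scaled log2(p_motif / p_background) per position and base."""
    matrix = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        total = len(column) + pseudocount * len(background)
        matrix.append({
            base: round(SCALE * math.log2(
                ((column.count(base) + pseudocount) / total) / background[base]))
            for base in background
        })
    return matrix

def score_distribution(matrix, background):
    """Dynamic programming: P(score) for a random window under the background."""
    dist = {0: 1.0}
    for column in matrix:
        nxt = defaultdict(float)
        for score, prob in dist.items():
            for base, s in column.items():
                nxt[score + s] += prob * background[base]
        dist = dict(nxt)
    return dist

def scan(sequence, matrix, dist, max_p=1e-4):
    """Return (start, window, score_in_bits, p_value) for windows with p < max_p."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(matrix[i][base] for i, base in enumerate(window))
        # p-value: probability that a background window scores at least as high.
        p = sum(prob for s, prob in dist.items() if s >= score)
        if p < max_p:
            hits.append((start, window, score / SCALE, p))
    return hits

background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
sites = ["TATAAT", "TATAAT", "TACAAT", "TATGAT"]  # made-up aligned sites
matrix = log_odds_matrix(sites, background)
dist = score_distribution(matrix, background)
# A 6 bp motif cannot reach p < 10^-4 on a uniform background
# (0.25**6 is about 2.4e-4), so the toy run uses a looser threshold.
hits = scan("GGCTATAATGCC", matrix, dist, max_p=1e-3)
print(hits)
```

With these toy inputs the only reported window is TATAAT at offset 3.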

27
Q

Background models to scan TFBS

A
  • Zero-order simply looks at the nucleotide frequency (it counts the occurrences of A, T, C, and G in the sequences dataset)
  • Higher-order models assume the probability of observing a certain nucleotide depends on the previous nucleotides
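
The difference between the two kinds of model can be sketched in a few lines. The function names and the input sequences are made up, the sequences are assumed to contain only A, C, G and T, and the first-order model stands in for higher-order models in general.

```python
from collections import Counter, defaultdict

def zero_order(sequences):
    """Base frequencies, ignoring neighbouring bases."""
    counts = Counter(base for seq in sequences for base in seq.upper())
    total = sum(counts.values())
    return {base: counts[base] / total for base in "ACGT"}

def first_order(sequences):
    """P(next base | previous base), estimated from adjacent pairs."""
    pairs = defaultdict(Counter)
    for seq in sequences:
        seq = seq.upper()
        for prev, nxt in zip(seq, seq[1:]):
            pairs[prev][nxt] += 1
    return {
        prev: {base: c[base] / sum(c.values()) for base in "ACGT"}
        for prev, c in pairs.items()
    }

seqs = ["ACGTACGT", "AACC"]  # made-up sequences
print(zero_order(seqs)["A"])        # frequency of A, here 4/12
print(first_order(seqs)["A"]["C"])  # P(C follows A) = 0.75
```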
28
Q

Approaches to scanning TFBS

A
  • Individual sites
  • Clusters
    The choice depends on the context. If you have prior knowledge of a gene, it is best to use an individual-site predictor. If there is no prior knowledge, use a cluster predictor.
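
One way a cluster predictor might work, as a rough sketch: take the positions of individually predicted sites and report regions where several fall close together. The function name `find_clusters`, the window size, the minimum site count and the positions are all assumptions made for illustration.

```python
def find_clusters(site_positions, window=200, min_sites=3):
    """Return (start, end) regions where >= min_sites sites lie within `window` bp."""
    positions = sorted(site_positions)
    clusters = []
    left = 0
    for right, pos in enumerate(positions):
        # Move the left edge until the span fits inside one window.
        while pos - positions[left] > window:
            left += 1
        if right - left + 1 >= min_sites:
            start = positions[left]
            if clusters and start <= clusters[-1][1]:
                # Overlaps the previous cluster, so extend it
                # (a merged cluster may end up longer than `window`).
                clusters[-1] = (clusters[-1][0], pos)
            else:
                clusters.append((start, pos))
    return clusters

# Made-up positions of predicted individual sites.
predicted = [100, 150, 220, 900, 2000, 2050, 2120, 2180, 5000]
print(find_clusters(predicted))  # [(100, 220), (2000, 2180)]
```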