TFBS Prediction Flashcards by Asha Shinde

Why do we want to predict transcription factor binding sites (TFBS)?

They are key elements in the regulation of gene expression.
- They help in the general understanding of gene expression
- Can help with gene finding
- Mutations in TFBS can lead to disease
Experimentally validated sites are limited in number
Most experimental methods have poor resolution - need to find the actual site within the experimentally identified region

Transcription initiation in prokaryotes

RNA polymerase has a strong affinity for the promoter and basal transcription rate is high

Transcription initiation in eukaryotes

RNA polymerase II and RNA polymerase III pre-initiation complexes don’t assemble efficiently, so the basal transcription rate is low and other transcription factors are needed for effective initiation

What are transcription factors?

Sequence-specific DNA binding factors that activate the initiation of eukaryotic transcription

Types of transcription factors

Constitutive: Work for many different genes and don’t respond to external signals
Regulatory: Work for a limited number of genes and respond to external signals

What do transcription factors recognise?

Upstream promoter elements: they influence initiation at the promoter to which they are attached
Targets within enhancers: influence several genes at once

Traditional view of how transcription factors work

They activate the formation of the pre-initiation complex by:
1. Making direct contact
2. Making indirect contact
3. Inducing a DNA bend

New view on how transcription factors work

Some can modify histone proteins affecting nucleosome positioning
Some can bend DNA into a specific shape bringing other TFs into contact with the pre-initiation complex (enhanceosome)
Some do not bind to DNA but form protein-protein contacts with the pre-initiation complex

Examples of transcription factors

Up to 2600 TFs
e.g. Oct-1, Oct-2, heat shock factor, serum response factor, GATA-1

Size of consensus sequences

Most are 9-15 bp, mean 12.2 bp

How do we predict TFBS?

Encode the patterns that describe a binding site
Scan these patterns against DNA (needs to be more intelligent than naive scanning)
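
For reference, the naive scanning baseline can be sketched as below, assuming the pattern is encoded as an IUPAC consensus string. The function name `scan_consensus`, the consensus and the test sequence are hypothetical and only illustrate the idea. Exact matching like this ignores how strongly each position is conserved, which is one reason smarter scanning is needed.

```python
# Naive baseline: slide an IUPAC consensus string along a DNA sequence.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}

def scan_consensus(sequence, consensus):
    """Return 0-based start positions where every base fits the consensus."""
    sequence = sequence.upper()
    width = len(consensus)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if all(base in IUPAC[code] for base, code in zip(window, consensus)):
            hits.append(start)
    return hits

# Hypothetical consensus and sequence, forward strand only.
hits = scan_consensus("TTATGCAAATGGATGCATAT", "ATGCWWAT")
print(hits)
```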

Techniques to identify TFBS

Electrophoretic Mobility Shift Assay (EMSA)
DNase I footprinting/protection
Systematic Evolution of Ligands by Exponential enrichment (SELEX)
SELEX-Seq
Chromatin ImmunoPrecipitation (ChIP)
ChIP-chip
ChIP-seq
ChIP-exo

Electrophoretic Mobility Shift Assay (EMSA)

in vitro
- The mobility on a gel is different for DNA bound to protein
- Control (DNA only), DNA and protein that does not bind, DNA and protein that does bind

DNase I footprinting/protection

in vitro
- Combines DNase I cleavage with electrophoresis
- The bound protein shields the DNA from cleavage

SELEX

in vitro
- Large DNA library
- Select for binders by affinity chromatography or by EMSA
- Amplify by PCR potentially with low stringency copying to allow mutations
- Subsequent rounds use higher stringency elution
- Sequence binders

SELEX-Seq

in vitro
- Same as SELEX but uses next-gen sequencing
- SELEX sequences a maximum of 10^2 oligos and requires many rounds
- SELEX-Seq characterises >=10^7 oligos and only requires 1 or 2 rounds

ChIP

in vivo
- DNA and associated proteins are cross-linked
- DNA-protein complexes are sheared into ~500 bp DNA fragments
- Crosslinked DNA/protein are immunoprecipitated with a protein-specific antibody
- DNA fragments are purified and sequenced

ChIP-chip

in vivo
- Cross-link and shear
- Select with specific antibody and release DNA
- Label DNA with fluorescent tag
- Hybridise to a DNA microarray covering the genomic region of interest
- Identify matches from fluorescence
- Computational analysis

ChIP-seq

in vivo
- Same as ChIP but uses next-gen sequencing
- Sequence everything from the ChIP experiment
- A single sequencing run can scan for genome-wide associations with fairly high resolution (100-300 bp)
- Computational statistical analysis required - peak calling
- Peak calling finds regions of the genome that are enriched with aligned reads
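
A toy sketch of the idea behind peak calling: count aligned read start positions in fixed windows and keep the windows that are strongly enriched over the average. Real peak callers use statistical background models rather than a simple fold cut-off. The function name `call_peaks`, the window size and the `min_fold` threshold are assumptions made for illustration.

```python
from collections import Counter

def call_peaks(read_starts, genome_length, window=200, min_fold=5.0):
    """Return (start, end, read_count) for windows enriched over the mean."""
    n_windows = (genome_length + window - 1) // window
    counts = Counter(pos // window for pos in read_starts)
    mean_count = len(read_starts) / n_windows
    peaks = []
    for index in sorted(counts):
        if counts[index] >= min_fold * mean_count:
            start = index * window
            end = min(start + window, genome_length)
            peaks.append((start, end, counts[index]))
    return peaks

# Toy data: one background read every 100 bp plus a pile-up of 40 reads
# between positions 1000 and 1200.
background = list(range(0, 2000, 100))
pile_up = [1000 + offset for offset in range(0, 200, 5)]
peaks = call_peaks(background + pile_up, genome_length=2000)
print(peaks)
```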

ChIP-exo

Refinement of ChIP-seq
- Uses exonucleases to trim the protein-bound DNA
- Claimed single-base resolution of binding sites

Comparison of methods for identifying TFBS

EMSA and DNase I footprinting can identify non-specific binding sites
All methods find regions larger than the actual TFBS
ChIP-seq has better resolution than ChIP-chip
ChIP-exo has better resolution than ChIP-seq
ChIP-seq is currently the gold standard

Motif-discovery methods to identify TFBS

Enumerative methods
- Examine the frequencies of all DNA strings of a given length
- Identify overrepresented strings from these
- Less chance of getting stuck in local optima
Probabilistic methods
- Generate local MSA and learn descriptive parameters
- Expectation maximisation (MEME)
- Gibbs sampling (Bayesian Monte Carlo)
- Greedy approaches
- Arbitrary motif model variations
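
The enumerative approach can be sketched as follows: count every string of length k in the bound sequences and rank them by how much more frequent they are than in a control set. The function names, the pseudocount and the toy sequences are hypothetical, and a real tool would also consider both strands and allow mismatches.

```python
from collections import Counter

def count_kmers(sequences, k):
    """Count every overlapping k-mer in a list of sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def overrepresented_kmers(bound, control, k, pseudocount=1.0):
    """Rank k-mers by (frequency in bound set) / (frequency in control set)."""
    fg = count_kmers(bound, k)
    bg = count_kmers(control, k)
    fg_total = sum(fg.values()) + pseudocount * 4 ** k
    bg_total = sum(bg.values()) + pseudocount * 4 ** k
    scores = {}
    for kmer, n in fg.items():
        fg_freq = (n + pseudocount) / fg_total
        bg_freq = (bg[kmer] + pseudocount) / bg_total
        scores[kmer] = fg_freq / bg_freq
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy data: "GATA" is planted in every bound sequence and absent from controls.
bound = ["CCGATAGG", "TTGATACC", "AGATAAGC"]
control = ["CCGTTAGG", "TTGCACCC", "AGCTGAGC"]
ranked = overrepresented_kmers(bound, control, k=4)
print(ranked[0])
```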

MEME

Struggles with huge datasets from genome-wide techniques
Can use a random subset of sequences but this may be inaccurate
Newer tools designed for huge datasets: ChIPMunk, HOMER, MEME-ChIP, rGADEM
Recent evaluation suggests rGADEM generates the best motifs from ChIP-seq data

Position Weight Matrices (PWMs)

A PWM simply shows how often each base is seen at each position of the binding site - as raw counts or as fractions
Fractional version - express each value as a fraction of the total for that row (one row per position)
Assumes each position is independent
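
A minimal sketch of building both versions from a set of aligned binding sites, with one row per position. The aligned sites and the function names are hypothetical. Each position is counted on its own, which is exactly the independence assumption.

```python
def count_matrix(sites):
    """Count each base at each position of equal-length aligned sites."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    matrix = [{base: 0 for base in "ACGT"} for _ in range(width)]
    for site in sites:
        for position, base in enumerate(site.upper()):
            matrix[position][base] += 1
    return matrix

def fraction_matrix(counts):
    """Divide each count by the total for its row (position)."""
    return [
        {base: n / sum(row.values()) for base, n in row.items()}
        for row in counts
    ]

# Hypothetical aligned sites, for illustration only.
sites = ["TGACTCA", "TGAGTCA", "TGACTAA", "TTACTCA"]
counts = count_matrix(sites)
fractions = fraction_matrix(counts)
for position, row in enumerate(fractions, start=1):
    print(position, row)
```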

Complex models for encoding TFBS

- Pair-correlation models
- Trees
- Feature-based models
- Hidden Markov Models etc.
(classical PWMs generally perform better)

FIMO

Converts PWM data into a log-odds score
FIMO uses dynamic programming to convert log-odds scores into p-values
Uses a zero-order background model to take into account the relative frequency of each base in the sequences being scanned
The program reports all motif occurrences with a p-value less than 10^-4
This is converted into a q-value - the minimal false discovery rate (FDR) for which a motif is deemed significant
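
The general idea of log-odds scoring and dynamic-programming p-values can be sketched as below. This is not FIMO's actual implementation: the function names, the pseudocount, the integer scaling factor and the toy motif are all assumptions. The sketch scans the forward strand only, expects uppercase A/C/G/T input, and uses a loose threshold because a 3 bp toy motif cannot reach 10^-4.

```python
import math
from collections import defaultdict

BASES = "ACGT"

def log_odds_matrix(fractions, background, pseudocount=0.01, scale=100):
    """Integer-scaled log2(motif probability / background probability)."""
    total = 1.0 + 4 * pseudocount
    return [
        {base: round(scale * math.log2((row[base] + pseudocount) / total
                                       / background[base]))
         for base in BASES}
        for row in fractions
    ]

def score_distribution(matrix, background):
    """Dynamic programming over positions: P(score) for a random site."""
    dist = {0: 1.0}
    for row in matrix:
        updated = defaultdict(float)
        for score, prob in dist.items():
            for base in BASES:
                updated[score + row[base]] += prob * background[base]
        dist = dict(updated)
    return dist

def p_value(score, dist):
    """Probability that a background site scores at least `score`."""
    return sum(prob for s, prob in dist.items() if s >= score)

def scan(sequence, matrix, background, threshold):
    """Return (start, score, p-value) for windows with p-value < threshold."""
    dist = score_distribution(matrix, background)
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(row[base] for row, base in zip(matrix, window))
        p = p_value(score, dist)
        if p < threshold:
            hits.append((start, score, p))
    return hits

# Hypothetical 3 bp motif "ACG" with a uniform zero-order background.
fractions = [
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0},
]
background = {base: 0.25 for base in BASES}
matrix = log_odds_matrix(fractions, background)
hits = scan("TTACGTT", matrix, background, threshold=0.02)
print(hits)
```

With a uniform background, only one of the 64 possible 3-mers reaches the top score, so the best site gets a p-value of 1/64.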

Background models to scan TFBS

- Zero-order simply looks at the nucleotide frequency (it counts the occurrences of A, T, C, and G in the sequences dataset)
- Higher-order models assume the probability of observing a certain nucleotide depends on the previous nucleotides
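
A minimal sketch of estimating both kinds of model from a set of sequences, using a first-order model (each base depends on the one before it) as the simplest higher-order case. The function names, the pseudocount and the toy sequences are assumptions.

```python
from collections import Counter

def zero_order(sequences):
    """Frequency of each base, ignoring context."""
    counts = Counter(
        base for seq in sequences for base in seq.upper() if base in "ACGT"
    )
    total = sum(counts.values())
    return {base: counts[base] / total for base in "ACGT"}

def first_order(sequences, pseudocount=1.0):
    """P(next base | previous base), estimated from adjacent base pairs."""
    pairs = {prev: {nxt: pseudocount for nxt in "ACGT"} for prev in "ACGT"}
    for seq in sequences:
        seq = seq.upper()
        for prev, nxt in zip(seq, seq[1:]):
            if prev in pairs and nxt in pairs[prev]:
                pairs[prev][nxt] += 1
    return {
        prev: {nxt: n / sum(row.values()) for nxt, n in row.items()}
        for prev, row in pairs.items()
    }

# Toy CG-rich sequences: G is common overall, and even more likely after C.
sequences = ["ACGCGCGT", "AACGCGTT"]
zero = zero_order(sequences)
first = first_order(sequences)
print(zero["G"], first["C"]["G"])
```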

Approaches to scanning TFBS

- Individual sites
- Clusters
The choice depends on the context. If you have prior knowledge of a gene, it is best to use individual predictors. If you have no prior knowledge, use a cluster predictor.
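
One simple way to turn individual site predictions into cluster predictions is sketched below: group hits that lie close together and keep only the groups with enough sites. The function name `find_clusters`, the `max_gap` and `min_sites` parameters and the hit positions are hypothetical.

```python
def find_clusters(hit_positions, max_gap=100, min_sites=3):
    """Group hits separated by at most max_gap bp; keep groups of min_sites+.

    Returns (first_position, last_position, number_of_sites) per cluster.
    """
    clusters = []
    current = []
    for pos in sorted(hit_positions):
        if current and pos - current[-1] > max_gap:
            if len(current) >= min_sites:
                clusters.append((current[0], current[-1], len(current)))
            current = []
        current.append(pos)
    if len(current) >= min_sites:
        clusters.append((current[0], current[-1], len(current)))
    return clusters

# Hypothetical start positions of predicted individual sites.
hits = [120, 150, 210, 900, 2000, 2040, 2075, 2110]
clusters = find_clusters(hits)
print(clusters)
```

The isolated hit at position 900 is dropped, which reflects the reasoning that a lone predicted site is more likely to be a false positive than a group of nearby sites.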

Week 2 Lecture 2 (28 cards)