Bioinformatics - investigating gene regulation Flashcards

1
Q

What is the old method for sequencing DNA?

A

Sanger sequencing
Also known as the chain-termination or dideoxy method
Invented by double Nobel Prize winner Fred Sanger
Was used to produce the first human genome sequence (2001)

2
Q

How does Sanger sequencing work?

A

Uses an E. coli enzyme to make complementary copies of the single-stranded DNA being sequenced

DNA polymerase I uses a single DNA strand as a template, takes dNTPs and assembles a complementary polynucleotide chain in the 5’ to 3’ direction
A small amount of ddNTP (which lacks the 3’-OH group) is included; when this analogue is incorporated, chain growth is terminated

3
Q

How can we visualise the sequenced DNA in the Sanger method?

A

The chain terminators (ddNTPs) are labelled by coloured fluorescent dyes

So the generated set of DNA fragments, each differing in length by one nucleotide, is separated by gel electrophoresis and passed through a detector to visualise the fluorescence
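
The read-out described above can be sketched in a few lines of Python. This is a toy model, not a simulation of real chemistry: it assumes one terminated fragment exists for every position, and the template sequence and function names are made up for illustration.

```python
def complement_strand(template_3to5: str) -> str:
    """Strand a polymerase would synthesise 5'->3' from a template written 3'->5'."""
    pairs = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in template_3to5)


def terminated_fragments(new_strand: str) -> list[str]:
    """One fragment per possible ddNTP stop: every prefix of the new strand."""
    return [new_strand[:i] for i in range(1, len(new_strand) + 1)]


def read_sequence(fragments: list[str]) -> str:
    """Order fragments by length (as electrophoresis does) and read each labelled final base."""
    return "".join(fragment[-1] for fragment in sorted(fragments, key=len))


template = "TACGGATC"  # hypothetical template, written 3'->5'
fragments = terminated_fragments(complement_strand(template))
print(read_sequence(fragments))  # ATGCCTAG
```

Because each fragment ends in a dye-labelled terminator, reading the colours from shortest to longest fragment spells out the newly synthesised strand.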

4
Q

What is done at the end of Sanger sequencing?

A

The reads that have been sequenced must be correctly assembled into the original strand of DNA
The reads are compared to find overlapping sequences
OR
Sonication - fragments are generated by exposing a solution of stiff double-stranded DNA to high-frequency sound waves
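
As a rough illustration of assembling reads by their overlaps, the sketch below merges two reads when the end of one exactly matches the start of the other. Real assemblers tolerate sequencing errors and handle many reads at once; the reads, the `min_overlap` parameter and the function names here are assumptions for the example.

```python
def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_reads(left: str, right: str, min_overlap: int = 3):
    """Join two reads on their overlap, or return None if they do not overlap enough."""
    size = overlap_length(left, right, min_overlap)
    return left + right[size:] if size else None


print(merge_reads("ATGCCTAG", "CTAGGTTA"))  # ATGCCTAGGTTA
```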

5
Q

What is the capacity and speed of modern Sanger sequencing?

A

Sequences up to 600-800 bases in one read
Typical machine does 96 samples in 2.5 hours with 15 mins of technician time = highly automated

Yearly output of a machine – 30 Mb (30 × 10^6 bases)
(cf – human genome 3 Gb (3 × 10^9 bases))
But you need to create much more sequence than that to reduce error and assemble the sequence
Typically each base needs to be sequenced 10s-100s of times
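
To put these figures together, the short calculation below estimates how many machine-years one Sanger instrument would need for a human genome. It uses the numbers on this card and takes 10x as the low end of the "10s-100s" range; the function name is illustrative.

```python
GENOME_BASES = 3_000_000_000        # human genome, about 3 x 10^9 bases
MACHINE_BASES_PER_YEAR = 30_000_000  # yearly output of one machine, 30 x 10^6 bases


def machine_years(coverage: float) -> float:
    """Machine-years needed to sequence every base `coverage` times on average."""
    return coverage * GENOME_BASES / MACHINE_BASES_PER_YEAR


print(machine_years(1))   # 100.0 machine-years for a single pass
print(machine_years(10))  # 1000.0 machine-years at 10x coverage
```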

6
Q

Describe the development from the first human genome?

A

The first human genome ‘reference’ sequence took years (~1990 – 2001/2004) to build
Cost approximately £1 billion
‘Next generation’ sequencing methods will sequence a genome in less than a week and at a cost of a few hundred pounds
These have been progressively improved since the early years of this century

Example - Illumina/Solexa method

7
Q

Describe Illumina sequencing?

A

Works by running many miniaturised sequencing reactions for different DNA fragments in parallel
Generates short sequence ‘reads’
As the method has developed these have been 36, 51, 75 and now >300 base pairs in length

8
Q

What is the output of Illumina?

A

‘Bench top’ sequencing – MiSeq, NextSeq
Run time 1-2 days
MiSeq – 25 million paired end reads (300 bp each)
NextSeq – 400 million

Production sequencing – HiSeq, NovaSeq
Runtime 1-3 days
5 billion – 20 billion paired end reads

Compare – human genome 3.1 G bases
1 billion 150 (paired) bp reads is 300 G bases – 100x the genome!
Most runs are now more than one experiment
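
The "100x the genome" comparison is a fold-coverage calculation, sketched below with the figures from this card. The function name is illustrative, and the calculation ignores reads that fail to map.

```python
def fold_coverage(read_pairs: int, read_length: int, genome_size: float) -> float:
    """Average number of times each base is sequenced (two reads per pair)."""
    return read_pairs * 2 * read_length / genome_size


sequenced_bases = 1_000_000_000 * 2 * 150
print(sequenced_bases)                               # 300000000000, i.e. 300 G bases
print(fold_coverage(1_000_000_000, 150, 3.1e9))      # about 96.8, roughly 100x
```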

9
Q

What is alignment?

A

Short sequence reads are aligned to the reference human genome
Similarity analysis of sequences begins with alignment
We can see the evolutionary relationship between sequences by identifying the aligned bases
Computationally intensive - there are millions of reads to align

You can identify substitution, insertion and deletion mutations of bases
Insertions and deletions are called indels or gaps
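
Once two sequences are aligned, spotting these mutation types is a matter of walking along the alignment columns. The sketch below assumes the alignment has already been computed and uses "-" for a gap; the example sequences and function name are made up.

```python
def classify_columns(ref_aln: str, read_aln: str) -> dict:
    """Count matches, substitutions, insertions and deletions in a gapped pairwise alignment."""
    if len(ref_aln) != len(read_aln):
        raise ValueError("aligned sequences must have the same length")
    counts = {"match": 0, "substitution": 0, "insertion": 0, "deletion": 0}
    for ref_base, read_base in zip(ref_aln, read_aln):
        if ref_base == "-":
            counts["insertion"] += 1      # base present in the read but not the reference
        elif read_base == "-":
            counts["deletion"] += 1       # base present in the reference but not the read
        elif ref_base == read_base:
            counts["match"] += 1
        else:
            counts["substitution"] += 1
    return counts


print(classify_columns("ACGT-ACGT", "ACCTTA-GT"))
# {'match': 6, 'substitution': 1, 'insertion': 1, 'deletion': 1}
```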

10
Q

What are the uses of high-throughput DNA sequencing?

A

Genetics - sequencing genomic DNA and finding variants and their associations with phenotype and disease

Sequencing has been revolutionary in epigenetics also

11
Q

DNase-seq - how can we find hypersensitive sites?

A

DNA is cut (digested) with DNase
DNA is more accessible to cutting in nucleosome free regions - often where proteins bind - promoters and enhancers
Sequence in from the cut ends
Align the resulting sequences to the genome
‘Peaks’ – regions where many reads align correspond to DNase hypersensitive sites
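
The last two steps can be pictured as counting aligned reads in fixed-size genomic windows and flagging the windows with many reads. This is a minimal sketch with made-up positions and an arbitrary threshold; real peak callers use a statistical test rather than a fixed cut-off.

```python
from collections import Counter


def count_reads_per_window(read_starts, window_size):
    """Map each aligned read start to a window index and count reads per window."""
    return Counter(position // window_size for position in read_starts)


def candidate_peaks(counts, min_reads):
    """Window indices holding at least `min_reads` reads."""
    return sorted(window for window, n in counts.items() if n >= min_reads)


read_starts = [105, 120, 130, 150, 160, 410, 905, 910]  # hypothetical aligned positions
counts = count_reads_per_window(read_starts, window_size=100)
print(candidate_peaks(counts, min_reads=3))  # [1], i.e. bases 100-199
```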

12
Q

What is the initial data analysis for DNase-seq?

A

Need computational methods for identifying peaks
Statistics to distinguish real peaks from noise

Ultimately reveals hypersensitive sites across the whole genome
These are cell type specific

Usually carried out on uniform populations of cells of the same type
Needs millions of cells

13
Q

What is the further data analysis by DNase-seq?

A

Detect transcription factor binding by locating binding sequences in hypersensitive sites - as many transcription factors bind to specific sequence patterns in DNA
Comprehensive collections of binding site patterns are in databases like TRANSFAC and JASPAR

14
Q

What is ATAC-seq?

A

Assay for Transposase-Accessible Chromatin using sequencing
Similar analysis to DNase-seq
It assesses genome-wide chromatin accessibility
Needs fewer cells

15
Q

What is ChIP?

A

Chromatin immunoprecipitation (ChIP)
We can detect where transcription factors are binding, where modified histones are or where any DNA binding protein is
First crosslink DNA and proteins - using formaldehyde (short link cross-linker) and isolate chromatin
Sonicate or digest (with micrococcal nuclease) chromatin into fragments of around 500 bp
Immunoprecipitate - use antibody against protein of interest and use protein A or G to pull it down and then wash
Digest with proteinase K and reverse crosslinks before purifying DNA
PCR amplify target sequences (or detect by hybridisation)
Use primers against a specific gene in the PCR reaction to determine if that region is bound by the protein of interest

16
Q

What can ChIP-seq tell us?

A

For example, finding binding sites for a particular transcription factor (TF)
Needs an antibody to the TF
TF bound DNA extracted, sequenced and mapped to genome
Peaks (genome regions where many sequences map) reveal where the TF is bound

17
Q

What has further analysis of ChIP-seq revealed?

A

Has revealed that TFs bind extensively across the genome
Sometimes there are 1000s of binding sites
Likely not all functional

Binding site co-occurrence analysis
Transcription factors often function in groups
Binding to same or nearby sites as a multifactor complex

Multiple ChIP experiments can reveal complexes

18
Q

How is ChIP involved with the histone code?

A
ChIP-seq (with an appropriate antibody) can map the occurrence and co-occurrence of these chromatin marks over the entire genome
H3K4Me1 - enhancers
H3K4Me3 – active promoters
H3K27Me3 - (polycomb) repressed genes
H3K27ac – active enhancers
19
Q

What is RNA-seq?

A

Genetic regulation is about controlling the level of gene expression
The level of gene expression can also be measured by sequencing - called RNA-seq

20
Q

Describe the basis of RNA-seq?

A

mRNA -> cDNA
Sequencing of the cDNA randomly samples the transcript pool
The number of times you expect to sequence a transcript from a given gene is proportional to the expression level of that gene
Tells you the relative expression level of each gene in a sample
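
One common way to turn raw read counts into relative expression levels is counts per million (CPM), sketched below. The gene names and counts are invented, and this simple scaling ignores transcript length, which some other normalisations account for.

```python
def counts_per_million(gene_counts: dict) -> dict:
    """Scale raw read counts so that they sum to one million across all genes."""
    total = sum(gene_counts.values())
    if total == 0:
        raise ValueError("no reads were counted")
    return {gene: count * 1_000_000 / total for gene, count in gene_counts.items()}


raw = {"geneA": 500, "geneB": 1500, "geneC": 8000}  # hypothetical read counts
print(counts_per_million(raw))
# {'geneA': 50000.0, 'geneB': 150000.0, 'geneC': 800000.0}
```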

21
Q

What is alternative splicing?

A

RNA-seq can even be ‘splice’ aware
Differential sequence coverage can allow the analysis to attribute coverage to different splice variants (expressing different exons) and different transcription start sites
Alternative splicing occurs when the mRNA products of a gene make use of different exons

22
Q

Give an overview of the effects of combining these techniques?

A

When combined these technologies allow us to measure both genetic regulation and the resulting patterns of gene expression
In different cell types as cells differentiate

And all this on the scale of the entire genome
Every gene (in man there are >20000) and every transcript variant
Every promoter
Every enhancer

23
Q

Describe the example project of blood cell differentiation?

A

Differentiation of blood cells in culture - from embryonic stem cells
Cell types distinguished by the genes they express
Example of dynamic genetic regulation

Genome scale data set acquired
Gene expression, open chromatin (number of DNase hypersensitive sites), histone modifications, transcription factors and number of transcription factor binding sites

24
Q

Describe the methods of visualising a single locus?

A

The TAL1 locus – a key blood TF
Dynamic gene expression (RNA-seq) increasing in intermediate cell types
Dynamic patterns of changing DHS (DNase hypersensitive sites)
Dynamic patterns of changing histone marks

25
Q

What did the patterns show within the blood differentiation project?

A

That gene expression defines the cell types

The cells progressively:
Lose stem cell character
Gain and then lose endothelial and vascular character
Gain blood cell like characteristics
Become macrophages (immune system cells)
26
Q

When analysing the DNase hypersensitive sites - which regulatory elements define the differences between cells?

A

The enhancers and other regulatory elements that control cell type specific gene expression
Analyse the DHSs to find those that are specific to particular cell types

27
Q

What was discovered to be involved in blood cell development?

A

TEAD transcription factor
We discovered that TEAD TF sequence binding patterns are found in DHS sites that are specific to early blood cell precursors
First indication that TEAD is involved in blood cell development

CACATTCC - common in blood cell development
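
Finding a binding pattern such as CACATTCC in a DHS sequence can be sketched as an exact string search on both strands. Real motif scanning usually scores position weight matrices rather than exact matches; the example sequence and function names below are made up.

```python
def reverse_complement(seq: str) -> str:
    """Sequence of the opposite strand, read 5'->3'."""
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(pairs[base] for base in reversed(seq))


def find_motif(seq: str, motif: str) -> list:
    """Return (0-based position, strand) for each exact occurrence on either strand."""
    hits = []
    rc_motif = reverse_complement(motif)
    for i in range(len(seq) - len(motif) + 1):
        window = seq[i:i + len(motif)]
        if window == motif:
            hits.append((i, "+"))
        elif window == rc_motif:
            hits.append((i, "-"))
    return hits


dhs_sequence = "TTCACATTCCAAGGAATGTGTT"  # hypothetical DHS sequence
print(find_motif(dhs_sequence, "CACATTCC"))  # [(2, '+'), (12, '-')]
```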

28
Q

Give an overview of viewing genome scale data in genome browsers?

A

A major feat of computational engineering
Allow you to view a whole range of data:
Genes and transcripts
Epigenetic marks (often from ENCODE cell lines)
Data on genetic variants in human populations
Data on evolutionary conservation (of genome regions in related mammals)
You can upload your own data ‘tracks’ and use the browser to visualize your own epigenetic data

UCSC – University of California at Santa Cruz
European equivalent – Ensembl - ensembl.org
Each horizontal data plot is called a ‘track’

29
Q

What is data analysis at the genome scale?

A

‘Next generation’ DNA sequencing produces short (ish) sequences that are first mapped to the genome
Many next generation sequencing problems concern counting the number of sequences mapping to genome regions

ATAC or DNase seq – high numbers of sequences mapping identify open chromatin
ChIP-seq – high numbers of sequences mapping identify protein binding sites
RNA-seq – the number of sequences mapping to each gene increases with increasing gene expression level

30
Q

What is significant about genome regions where there are ‘a lot’ of sequences in the map?

A

Over the entire genome even if your sequences were just randomly distributed (i.e. All chromatin equally open) then chance would mean that some regions have more sequences mapping than others
If something happens more than you expect ‘by chance’ then this can be evidence of some real and possibly interesting effect
For example - open chromatin, protein binding

It’s important to understand what ‘by chance’ means
With a coin, ‘by chance’ means we assume the probability of heads is 0.5, so we expect 100 tosses to give close to 50 heads
With some small random variation, perhaps not exactly 50, but not 95
This is a statistical model, and is often called the NULL (no interesting effect) model
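
The point about random variation under the null model can be seen in a small simulation: reads are dropped uniformly at random into equal-sized windows, yet the counts per window still differ. The read and window numbers are arbitrary choices for illustration.

```python
import random
from collections import Counter


def simulate_null_counts(n_reads: int, n_windows: int, seed: int = 0) -> list:
    """Assign each read to a uniformly random window and return the count per window."""
    rng = random.Random(seed)
    hits = Counter(rng.randrange(n_windows) for _ in range(n_reads))
    return [hits.get(window, 0) for window in range(n_windows)]


counts = simulate_null_counts(n_reads=10_000, n_windows=1_000)
mean_count = sum(counts) / len(counts)  # 10.0 by construction
print(mean_count, min(counts), max(counts))
```

Even though every window is equally likely to receive a read, the busiest window holds noticeably more than the mean, which is why a statistical test is needed before calling a window a peak.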

31
Q

How can we quantify counting these statistics?

A
We need to calculate probabilities and therefore require probability distributions
For example:
T and normal distributions
The binomial distribution
The Poisson distribution
32
Q

What is the binomial distribution?

A

The probability of a success or failure outcome in an experiment or survey that is repeated multiple times
The binomial is a type of distribution that has two possible outcomes
The binomial distribution, therefore, represents the probability for x successes in n trials, given a success probability p for each trial

p - the probability of success in each trial
r - the number of successes in the n trials (x above)
n - the number of trials

Criteria
The number of observations or trials is fixed
Each observation/trial is independent = no effect on the probability of the next trial
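
The binomial probability of exactly r successes in n trials is C(n, r) × p^r × (1 − p)^(n − r). A minimal sketch using only the standard library is shown below, applied to the coin-toss example; the function names are illustrative.

```python
from math import comb


def binomial_pmf(r: int, n: int, p: float) -> float:
    """Probability of exactly r successes in n independent trials with success probability p."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)


def binomial_tail(r: int, n: int, p: float) -> float:
    """Probability of r or more successes in n trials."""
    return sum(binomial_pmf(k, n, p) for k in range(r, n + 1))


print(binomial_pmf(50, 100, 0.5))   # about 0.08: exactly 50 heads in 100 tosses
print(binomial_tail(95, 100, 0.5))  # vanishingly small: 95 or more heads
```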

33
Q

What is the Poisson distribution?

A

This is a probability distribution that can be used to show how many times an event is likely to occur within a specified period of time - given the average number of times the event occurs in this time

P - The Poisson probability that exactly r successes occur in a Poisson experiment, when the mean number of successes is E
E - The mean number of successes that occur in a specified region
R - The actual number of successes that occur in a specified region
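
The Poisson probability of exactly r events when E are expected is e^(−E) × E^r / r!. The sketch below implements that formula and the "r or more" tail that is typically used to ask whether a count is surprisingly high; the function names and example values are illustrative.

```python
from math import exp, factorial


def poisson_pmf(r: int, expected: float) -> float:
    """Probability of exactly r events when `expected` (E) events occur on average."""
    return exp(-expected) * expected**r / factorial(r)


def poisson_tail(r: int, expected: float) -> float:
    """Probability of r or more events."""
    return 1.0 - sum(poisson_pmf(k, expected) for k in range(r))


print(poisson_pmf(0, 2.0))   # about 0.135: no events when 2 are expected
print(poisson_tail(1, 2.0))  # about 0.865: at least one event
```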

34
Q

Give a comparison of binomial and Poisson?

A

These are closely related, but have different parameters
Binomial – number of trials = n, probability of success in a trial = p
Poisson – just the expected number of events = E
If you set E=np, then the Poisson approximates the binomial
The approximation gets better as n gets bigger and p gets smaller

We often use the Poisson because
It’s easier to calculate
Parameterization in terms of E is often convenient
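
The approximation can be checked numerically. The sketch below holds E = np fixed at 5 and compares the two distributions as n grows and p shrinks; the particular (n, p) pairs and the function names are arbitrary choices for illustration.

```python
from math import comb, exp, factorial


def binomial_pmf(r: int, n: int, p: float) -> float:
    return comb(n, r) * p**r * (1 - p) ** (n - r)


def poisson_pmf(r: int, expected: float) -> float:
    return exp(-expected) * expected**r / factorial(r)


def max_abs_difference(n: int, p: float, up_to: int = 20) -> float:
    """Largest gap between binomial(n, p) and Poisson(E = n * p) over r = 0..up_to."""
    expected = n * p
    return max(
        abs(binomial_pmf(r, n, p) - poisson_pmf(r, expected))
        for r in range(min(n, up_to) + 1)
    )


for n, p in [(10, 0.5), (100, 0.05), (1000, 0.005)]:
    # The gap shrinks as n gets bigger and p gets smaller
    print(n, p, max_abs_difference(n, p))
```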

35
Q

What is a window in data analysis of genomic data?

A

Sliding windows are genomic intervals that literally “slide” across the genome, almost always by some constant distance
These windows are mapped to files containing signal or annotations of interest, such as: SNPs, motif binding site calls, DNaseI tags, conservation scores, etc.

Sliding windows can overlap or be disjoint
Overlapping windows are often used to “smooth” signal, to remove or reduce the impact of signal noise
Example - TFs would bind in the same ‘windows’
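
Generating sliding windows is straightforward: pick a window size and a step. The sketch below is a minimal example with made-up sizes; a step smaller than the window size gives overlapping windows, and a step equal to the size gives disjoint ones.

```python
def sliding_windows(length: int, size: int, step: int):
    """Yield half-open (start, end) intervals of `size` bases, moving by `step` each time."""
    for start in range(0, length - size + 1, step):
        yield (start, start + size)


print(list(sliding_windows(10, size=4, step=2)))  # overlapping: [(0, 4), (2, 6), (4, 8), (6, 10)]
print(list(sliding_windows(10, size=5, step=5)))  # disjoint: [(0, 5), (5, 10)]
```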

36
Q

What is the binding ‘window’ overlapping problem?

A

Do two ChIP-seq experiments identify binding window sets that overlap more than you would expect by chance?
We need to determine if they are just overlapping by chance

Example - TF
These calculations allow us to determine whether in genome scale data there is evidence that transcription factors bind ‘together’ to function rather than independently

37
Q

What are some technicalities with finding windows in practice?

A

In practice it is more complex for a range of reasons including
Not all genome regions are equally mappable by short reads (particularly repetitive regions)
The ‘background’ rate E may vary between chromosomes and chromosome regions
Some technologies, e.g. ChIP-seq, require modifications and different considerations for reads on positive and negative DNA strands
‘Paired-end’ sequencing introduces further complications
It may be necessary to account for sequencing artefacts caused by PCR, such as high rates of occurrence of duplicate sequences
Some experiments use ‘control’ samples, e.g. ChIP with non-specific antibody

38
Q

What are some approximations and other solutions instead of using Poisson for the window analysis?

A

Solving this with the binomial instead of the Poisson gives p=0.11 (cf. 0.12 from the Poisson)

We have made an approximation in both binomial and Poisson
Our analysis assumes that two blue windows could overlap the same red window
In reality they can’t, but the approximation is good so long as the number of red and blue windows is much smaller than the number of windows on the genome, which is true for TF binding data

To avoid that approximation we can solve this exactly with the hypergeometric distribution - this gives p=0.09
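
The three calculations can be compared side by side. The sketch below assumes a genome of `total` windows, of which `red` are bound in one experiment and `blue` in the other, and asks how likely it is to see at least `observed` shared windows by chance. The numbers used are hypothetical and do not reproduce the p-values quoted on this card, since the original window counts are not given here.

```python
from math import comb, exp, factorial


def poisson_tail(observed: int, expected: float) -> float:
    """P(overlap >= observed) with overlaps treated as Poisson with mean E."""
    return 1.0 - sum(exp(-expected) * expected**k / factorial(k) for k in range(observed))


def binomial_tail(observed: int, n: int, p: float) -> float:
    """P(overlap >= observed) with each of n blue windows hitting a red one with probability p."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(observed, n + 1))


def hypergeometric_tail(observed: int, total: int, red: int, blue: int) -> float:
    """Exact P(overlap >= observed) when `blue` windows are drawn without replacement."""
    denominator = comb(total, blue)
    return sum(
        comb(red, k) * comb(total - red, blue - k)
        for k in range(observed, min(red, blue) + 1)
    ) / denominator


total, red, blue, observed = 1000, 50, 40, 5  # hypothetical window counts
expected = blue * red / total                  # E = 2 overlaps expected by chance
print(poisson_tail(observed, expected))
print(binomial_tail(observed, blue, red / total))
print(hypergeometric_tail(observed, total, red, blue))
```

With red and blue window sets that are small relative to the genome, the three probabilities come out close to one another, which is why the simpler Poisson calculation is often good enough.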