Sequencing and Bioinformatics Flashcards

1
Q

dNTP vs ddNTP?
What do these abbreviations stand for?
Similarities and Differences?
Nickname for ddNTP?

A

Both can be incorporated into a newly synthesised DNA strand
Deoxynucleotide triphosphate (dNTP) - The hydroxyl (OH) group on the 3’ carbon allows chain extension
Dideoxynucleotide triphosphate (ddNTP) - The 3’ hydroxyl is absent, preventing chain extension; nicknamed the terminator nucleotide

2
Q

Explain Sanger Dideoxy sequencing
It is an example of ‘sequencing by _______’?

A

Sequencing by Synthesis

DNA is denatured, and a new strand is synthesised with a mixture of dNTPs and ddNTPs
The new DNA strand extends from a known primer
As each nucleotide is added to the chain, there’s a chance that a terminator nucleotide will be added
If this occurs then no more bases can be added
The products are then separated by size on a gel to read off the sequence
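
A minimal Python sketch of why the terminated products reveal the sequence. It assumes an idealised reaction in which termination happens at every position at least once, and the sequence and function names are hypothetical, chosen only for illustration:

```python
def chain_termination_fragments(new_strand):
    """Return one terminated product per possible stop position, keyed by length."""
    # Termination is random in a real reaction; with enough molecules,
    # a ddNTP is assumed to stop extension at every position at least once.
    return {length: new_strand[:length] for length in range(1, len(new_strand) + 1)}


def read_gel(fragments):
    """Order fragments shortest first and read the terminal base of each."""
    return "".join(fragments[length][-1] for length in sorted(fragments))


new_strand = "ATGCCGTA"  # hypothetical sequence synthesised from the primer
fragments = chain_termination_fragments(new_strand)
print(read_gel(fragments))  # ATGCCGTA
```

Sorting by length stands in for the gel: the shortest product ends in the first base after the primer, the next shortest in the second base, and so on.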

3
Q

Name of sequence once a ddNTP is added

A

Truncated sequence

4
Q

Alternate ways of carrying out Sanger sequencing?

A

Running 4 separate reactions, each with a different ddNTP, i.e. ddATP, ddGTP, ddCTP or ddTTP

5
Q

Explain Dye-Terminator Sequencing
What are the benefits?

A

Uses fluorescently labelled ddNTPs
- This allows all the products to be run in the same lane in capillary gel electrophoresis
Each base is identified by the colour/wavelength of its fluorescent tag

The process can be automated and scaled up to industrialise sequencing, allowing large amounts of DNA to be sequenced

6
Q

How many bp can the ABI 370 sequencer read at a time?

A

800bp

7
Q

Problems with sequencing human genome?

A

3 billion base pairs; Takes a long time if reading 800bp at a time
Piecing together the genome is also a challenge
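
To put a rough number on "a long time", here is a small Python sketch using the figures from this card (3 billion bp, 800bp per read). The `coverage` parameter and the function name are assumptions added for illustration; real projects sequence each base several times over:

```python
import math

GENOME_SIZE_BP = 3_000_000_000  # ~3 billion bp, from the card above
READ_LENGTH_BP = 800            # bp per read, from the card above


def min_reads(genome_bp, read_bp, coverage=1):
    """Minimum number of reads needed to cover the genome `coverage` times."""
    return math.ceil(genome_bp * coverage / read_bp)


print(min_reads(GENOME_SIZE_BP, READ_LENGTH_BP))  # 3750000
```

Even with perfectly tiled, non-overlapping reads this is millions of sequencing reactions, before any redundancy for assembly is added.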

8
Q

What process did the Human Genome Project (IHGSC) strategy use? (hint: BACs)

A
  1. Extract human genomic DNA from multiple people
  2. Fragment the DNA into large pieces suitable for cloning
  3. Size-select fragments of 100-200kb
  4. Clone the fragments into Bacterial Artificial Chromosomes (BACs)
9
Q

What are BACs and what do they do?

A

They are plasmids containing sequence elements which trick E. coli into replicating/copying them during the cell cycle (as if they were a native plasmid)

10
Q

Explain how BAC cloning is performed

A

BAC cloning amplifies DNA

BACs are transformed into E. coli cells, which grow into colonies
Each colony contains millions of copies of the same fragment

11
Q

What are genetic and physical maps used for in the sequencing of the genome?

A

To establish a dense set of genetic markers across the human genome

12
Q

What do genetic maps rely on?
What is the relationship between distance between markers and recombination frequency?

A

They rely on recombination frequency between markers
The further apart markers are from one another, the higher the recombination frequency

13
Q

How do they link BAC clones to the genetic and physical maps?

A

Clones are tested for PCR markers that have known locations on genetic and physical maps

14
Q

How are overlapping clones identified?
What is the likely origin for such clones?

A

The end of the insert in the BAC containing the marker is sequenced
This end sequence is used to design PCR primers, which are used to look for other BAC clones containing that end sequence
Such BAC clones are likely to come from a neighbouring genetic locus

15
Q

How are BAC clones organised?
Which types of BAC clones are identified?

A

They are ordered relative to the genetic map, with PCR primers
- These PCR primers are complementary to the sequence of a particular marker, allowing us to test and identify BAC clones that overlap the genetic marker
- Once we know that a BAC clone overlaps a marker, we know roughly where to place it on the genetic map

16
Q

How is sequence data from the end of the insert obtained?
What is done with this data?

A

Sequence from the vector backbone into the BAC insert
This sequence data is used to design a new set of primers

17
Q

How are these primers used with the BAC library?
Where are the clones identified likely to be?

A

They are used to screen the BAC library for clones which contain the end sequence but not the original genetic marker
These are likely to be derived from a genomic region adjacent to the source of the original BAC clone

18
Q

How is this BAC insert sequencing process repeated?
What can be done in the other direction?

A

The end of the second BAC clone insert is sequenced, allowing for another set of primers to be designed to identify a third BAC clone with an insert that overlaps the second
The same thing can be done at the other end of the original clone which overlaps the marker, to obtain overlapping BAC clones in the other direction

19
Q

What is this whole approach of identifying and ordering overlapping BAC inserts called?

A

Chromosome walking
Allows a library of clones to be built up in the correct order relative to the genetic map

20
Q

How does shotgun sequencing work?

A

Many fragments are sequenced at random and then assembled

21
Q

Size of BAC inserts used in shotgun sequencing and how are the BAC clones generated?

A

BAC clone DNA is fragmented into smaller 5-10kb fragments and cloned into plasmid vectors

22
Q

How are primers used in shotgun sequencing to derive plasmid insert sequence?

A

The sequence of the plasmid that the insert is in is known, so primers can be designed against the plasmid to sequence into the insert
This is done many times to derive a consensus of the insert sequence

23
Q

What did Celera want to do with the human genome data?

A

Patent and commercialise it

24
Q

What was the sequencing approach Celera used?
How is this different to IHGSC?

A

They used a shotgun Whole Genome Sequencing (WGS) approach
Instead of creating BAC clones, Celera fragmented the genome directly into much smaller fragments of 2-50 kbp

25
What did Celera demonstrate with shotgun sequencing?
That shotgun sequencing was feasible for even large and repetitive genomes
26
Problems with Sanger sequencing on an industrial scale? How did IHGSC get around this?
Doesn't scale well: when reading, each capillary can only produce 1 sequence at a time
The IHGSC had 'factories' with hundreds of sequencers
27
What happened to the cost of sequencing in 2007 and why?
Price to sequence human genome dropped drastically from 2007 onwards due to Next-Generation Sequencing
28
What is massively parallel sequencing and how does it improve on Sanger sequencing? What became the dominant platform for technologies like this?
Lots of molecules are sequenced at the same time as opposed to one; Reduced costs
Illumina is the main platform for this technology
29
How many sequence reads does 1 HiSeq run produce? Length of sequencing reads?
≈8 billion sequence reads
≈100bp sequence reads
30
Illumina uses sequencing by _______? What is used to detect each base?
Sequencing by synthesis
Fluorescent bases are used to detect each base
31
Why and how are the molecules amplified in Illumina sequencing?
Optical sensors are not sensitive enough to detect the signal from a single template molecule
PCR amplification of template molecules is done via a process called bridge amplification
32
First step of Illumina sequencing?
Fragment DNA and size select fragments of ≈500bp
33
Refer to cluster generation process images and explain the process
34
Outcomes of cluster generation? Overlapping?
Generates millions of separate clusters, each with sequence data from a different region of the genome
Clusters are large enough to be detected when they fluoresce
Some clusters overlap, potentially resulting in the loss of a few reads; this doesn't matter as there are so many clusters
35
What is used in Illumina's cluster sequencing by synthesis process? (type of terminator nucleotide)
Reversible terminator nucleotides; they block chain extension, but the block and dye can be removed
Once the block is removed, the nucleotide acts like a dNTP
36
Refer to cluster sequencing process and explain the process
Also explain how it is repeated on different clusters to generate the second read
37
Size of fragments selected in Illumina library preparation?
500bp
38
How far apart are read pairs when doing 2 rounds of sequencing clusters?
500bp
39
Advantages of 3rd Gen sequencing over Illumina? (4 advantages)
Single molecule sequencing - No amplification required
Real time sequencing - Data is generated during the run
Ultra-long read lengths - Up to 50kb (PacBio) or >2Mb (nanopore)
Can directly identify base modifications such as methylation
40
Disadvantages of 3rd Gen sequencing over Illumina?
Fewer reads per run than Illumina
More expensive per base
Individual reads have a high error rate, although consensus accuracy is good (a high error rate is inevitable as you are sequencing a single molecule at a time)
41
What are ZMWs in PacBio SMRT cells? How many?
Zero mode waveguides are wells that cover the aluminium surface
150,000 wells
42
What is the ds-template DNA bound to and where?
DNA is bound to a DNA polymerase and a sequencing primer
The polymerase is immobilised at the bottom of a well; 1 polymerase per well
43
What does ZMW do to light used to excite the fluorescently labelled nucleotides? What does this allow for?
Only allows the light to penetrate a small distance into the well, exciting a very small volume
This allows the signal from a single fluorescently labelled nucleotide to be detected
44
How is the DNA sequenced through this process? (ZMWs)
When a fluorescent nucleotide is bound by the polymerase, it remains within the illuminated zone and gives a detectable signal (unbound nucleotides diffuse in and out quickly, so do not give a consistent signal)
The label is then cleaved away, and the next nucleotide is incorporated and fluoresces in turn, revealing the sequence
45
What are SMRT-bell adapters?
Hairpin loops on either end of a ds-template DNA, which connects them to make a single continuous loop of DNA
46
Refer to the images of SMRT-bell adapters and explain how primers are involved in the process
47
Why is it better to use SMRT-bell adapters with a small insert rather than a larger one? What can be done with smaller insert reads? (hint CCS and CLR)
Larger inserts give reads of poorer quality and lower accuracy
Smaller inserts are sequenced several times as the polymerase goes round the loop
These passes can be combined to reduce the error rate, giving circular consensus sequences (CCS)
CCS reads can be used to correct the long but lower quality reads, giving corrected long reads (CLR)
48
How long are CLRs typically? Accuracy?
15kb
>99.999% accuracy
49
Advantages of Oxford Nanopore?
Read lengths of up to 100kb
Error rates of ≈1%
Direct sequencing of DNA, RNA and protein
No library prep; Sequence directly from biological samples
Small and portable models
50
How does Oxford Nanopore work?
Pore proteins are embedded within an artificial membrane which is electrically insulating
A motor protein pushes a single strand of DNA through the pore, resulting in a change in the electrical current flowing through the pore
51
How many bases are sequenced at once in Oxford Nanopore?
Several bases can pass through the pore at once
Short sequences (e.g. 5 bases) have their own characteristic signal
52
What is done in Whole Genome Shotgun Sequencing? How is assembly done? What is this known as? (de novo ____)
Genomic DNA is fragmented and paired reads are obtained from either end of each fragment
The original chromosome sequences are computationally reconstructed; This is called de novo assembly
53
What is read mapping? Repetitive regions?
Computationally determining the most likely position in a reference genome that each sequence read derives from
Repetitive regions of the genome are a problem for mapping
Once the reads are mapped to the reference genome, it is possible to identify positions that differ from the reference, a process called "variant detection"
54
Why would you want to resequence a genome? (4 reasons)
Individuals of a species are not all identical; Resequencing allows us to understand genetic variation within a population
For human populations, this is of particular interest for studying single-gene and complex genetic disorders
Cancers are effectively evolving organisms which are genetically different from the patient; Sequencing allows us to understand the genetic changes which occur as the cancer progresses
Functional genomic (identifying function of genomes) technologies such as RNA-seq and ChIP-seq involve resequencing
55
What is the traditional method of assessing gene expression? Explain this method? Size of band meaning?
Northern blotting
Radiolabelled probes are used to detect the presence of a particular transcript within a whole cell RNA extract
The level of expression can be assessed (semi)quantitatively by the size (intensity) of the band
56
What is RT-(q)PCR? Explain this method
Reverse transcription quantitative PCR
It uses reverse transcriptase to make cDNA from transcripts
The cDNA corresponding to a transcript can be PCR amplified; Fluorescent primers allow the transcript level to be quantified relative to a reference gene
57
What is a microarray? Differences to northern blot?
A glass slide onto which spots, each consisting of lots of copies of a probe sequence, can be attached
Can be done in parallel; Thousands of probes at a time
58
How is a microarray assessed?
A microarray scanner detects the average intensity of each spot on the microarray and uses it as a measure of the transcript level associated with each gene
59
What are technical and biological replicates?
Technical replicates involve assessing the same biological sample on multiple microarrays
Biological replication requires us to repeat the entire experiment independently
60
Typical relationship between biological and technical variation? What does this mean for biological replicates?
Typically biological variation is larger than technical variation, so it is usually appropriate to perform multiple biological replicates
61
Why is it not possible to have as many replicates for microarrays and RNA-seq? How do we get around this?
They are expensive
We look at many genes in parallel to estimate the “normal” level of variation between biological replicates; this allows us to identify genes where the difference in expression is greater than would be expected by chance
62
What do microarrays look for? How is this expressed (calculation)?
Differential expression between the experimental sample and control
Expressed as 'fold change' = Level of expression in the experimental / Level of expression in the control
63
What does logFC mean?
Log2 of fold change calculation
64
Explain logFC = 0, +1 and -1
0 = Expression remains at the same level in the experimental
+1 = Expression doubles in the experimental
-1 = Expression halves in the experimental
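
The fold change and logFC calculations from the last few cards can be sketched in Python as follows. The function names and expression values are made up for illustration:

```python
import math


def fold_change(experimental, control):
    """Expression in the experimental sample divided by expression in the control."""
    return experimental / control


def log_fc(experimental, control):
    """Log2 of the fold change."""
    return math.log2(fold_change(experimental, control))


print(log_fc(100, 100))  # 0.0  (unchanged)
print(log_fc(200, 100))  # 1.0  (doubled)
print(log_fc(50, 100))   # -1.0 (halved)
```

Taking log2 makes up- and down-regulation symmetric around 0, which plain fold change (2 vs 0.5) is not.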
65
Limitations of microarray (4 limitations)
Microarrays are a low-resolution sequencing technology
If we get a signal for a particular probe, we know that the sequence is present in our sample; however, we usually don’t know if that is the exact sequence that is present
We also don’t know if there are any sequences present which are not covered by our microarray probes
There is a limit to how much RNA can hybridise to a particular spot on the microarray; this can limit our ability to distinguish the expression levels of highly expressed genes
66
Process of RNA sequencing (RNA-seq)
Fragment input RNA
Reverse transcribe it to cDNA
Attach adaptor molecules to it
Sequence them to produce many sequence reads
67
How is RNA mapping of RNA-seq reads used to measure expression levels?
The reads are mapped to a reference genome
The number of reads mapping to each gene is used as a measure of the expression level of that gene
68
Why is RNA splicing a challenge for RNA-seq data analysis?
mRNA sequence does not exactly correspond to the sequence of the reference genome
Processed mRNAs consist of adjacent exon sequences, but the exons are separated by introns in the genome sequence
69
What happens to reads that overlap exon junctions? (hint: SAA)
They are split during mapping using a splice aware aligner (see image on notes)
70
What happens in the absence of a reference genome in RNA-seq assembly? Key features of this process?
De novo assembly of transcripts
- Not all transcripts are present at the same level
- Same gene may produce multiple different transcripts
- Assembled transcripts can be annotated in a similar way to genome sequences
71
RNA-seq vs Microarrays
Which has a larger dynamic range (ability to distinguish different levels of expression)?
Which gathers information for pre-selected regions, and which is genome-wide?
Which allows us to detect differences from the reference genome?
Which can be done without a reference genome?
RNA-seq has a larger dynamic range than microarrays (greater ability to distinguish different levels of expression)
Microarrays only give information for pre-selected regions of the genome; RNA-seq is genome-wide, and can detect novel transcripts
RNA-seq allows us to detect differences from the reference genome, such as SNPs in transcribed regions
RNA-seq can be done without a reference genome - de novo assembly of the transcriptome is possible
72
What is alternative splicing? How does this affect RNA-seq analysis?
Alternative splicing allows for a single gene to produce many different transcripts and proteins
Adds to the complexity of RNA-seq analysis
73
What is DNA methylation? What is this for and an example of?
Addition of a methyl group to C5 of cytosine
Acts to downregulate/regulate gene expression and is an example of epigenetics
74
What else is methylation involved in?
X chromosome inactivation
Silencing of germline-specific genes and repeat regions
Imprinting (distinguish maternal and paternal alleles)
75
How do bacteria use methylation? (2 ways)
To distinguish 'self' DNA from 'non-self'; non-self DNA can be digested by enzymes that act as an immune system
To control bacterial DNA replication, limiting it to a single replication per cell cycle
76
What are the 3 'contexts' of cytosine methylation?
CpG - C linked to G by phosphate backbone
CHG - C followed by 'not G' followed by a G
CHH - C followed by 2 non-G bases
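
As an illustration of the three contexts, here is a hedged Python sketch that classifies a cytosine by the one or two bases that follow it on the same strand. The function name is hypothetical, and the sketch ignores ambiguous bases and the opposite strand:

```python
def cytosine_context(seq, i):
    """Classify the cytosine at index i (reading 5' to 3') as CpG, CHG or CHH."""
    seq = seq.upper()
    if seq[i] != "C":
        raise ValueError("position is not a cytosine")
    downstream = seq[i + 1:i + 3]  # up to two bases after the C
    if len(downstream) >= 1 and downstream[0] == "G":
        return "CpG"
    if len(downstream) == 2:
        # H means "not G": the first downstream base is already known not to be G
        return "CHG" if downstream[1] == "G" else "CHH"
    return None  # too close to the end of the sequence to classify


print(cytosine_context("ACGT", 1))  # CpG
print(cytosine_context("CAGT", 0))  # CHG
print(cytosine_context("CATT", 0))  # CHH
```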
77
Which methylation persists and which must be re-established after cell division
CpG methylation persists whereas CHG and CHH methylation do not
78
What does 5-methylcytosine deaminate to?
Thymine
79
What are clusters of CpG in promoter regions called? What does it mean when they are unmethylated?
CpG islands
The gene is expressed
80
What is Bisulphite conversion?
Chemically inducing deamination of cytosine to uracil
Methylated cytosine does not undergo this change
81
What is BS-seq? How does it exploit the fact methylated cytosine remains unchanged?
Bisulphite Sequencing
A sample is sequenced before and after bisulphite conversion
These can be compared, as methylated cytosine will remain as cytosine whereas unmethylated cytosine will be changed to uracil and read as thymine
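
The before/after comparison can be sketched in Python. This is a toy model on a single strand with a made-up sequence and hypothetical function names, not a real BS-seq pipeline:

```python
def bisulphite_read(seq, methylated_positions):
    """Simulate the sequence read after bisulphite conversion."""
    out = []
    for i, base in enumerate(seq):
        if base == "C" and i not in methylated_positions:
            out.append("T")  # unmethylated C -> U, which is read as T
        else:
            out.append(base)  # methylated C (and every other base) is unchanged
    return "".join(out)


def call_methylation(before, after):
    """Positions that were C before conversion and are still C afterwards."""
    return [i for i, (b, a) in enumerate(zip(before, after)) if b == "C" and a == "C"]


original = "ACGTCCGA"  # hypothetical sequence; the C at index 1 is methylated
converted = bisulphite_read(original, methylated_positions={1})
print(converted)                              # ACGTTTGA
print(call_methylation(original, converted))  # [1]
```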
82
What is Reduced Representation Bisulphite Sequencing (RRBS)? How do they do this?
RRBS is a method of targeting BS-seq to regions which are likely to have a high CpG content (e.g. CpG islands)
This allows us to make the most of a sequencing run
RRBS exploits restriction enzymes which have a recognition site containing CpG
By digesting with this enzyme and selecting small fragments, we target regions of high CpG density
83
What is PacBio Single Molecule Real-Time (SMRT) sequencing? How does it work?
Allows methylated bases to be distinguished from unmethylated ones (adenine as well as cytosine)
The presence of a methylated base delays the progress of the polymerase; This can be detected by analysis of the polymerase kinetics
84
How does Oxford Nanopore detect cytosine methylation?
It detects a disruption in electrical current caused by a base passing through a pore in a membrane
Methylated bases give a distinct signal from unmethylated ones; this allows methylation to be measured directly
85
What is Chromatin Immunoprecipitation (ChIP)?
It is a method that can be used to isolate DNA bound by a specific protein
86
Explain ChIP process
1. Proteins are covalently crosslinked to DNA by treating with formaldehyde
2. Chromatin is sheared by sonication; adding an exonuclease digestion (ChIP-exo) allows the bound DNA to be trimmed to the binding site
3. Immunoprecipitation and purification of bound DNA using an antibody specific to the protein of interest
87
What is ChIP-on-chip? How does it work?
ChIP-on-chip involves identification of the ChIP-purified DNA using a microarray
The purified binding sites are labelled and hybridised to a tiling microarray to determine the genomic regions where the protein is bound
88
What is ChIP-seq? How does it work?
Sequencing of the ChIP-purified binding sites directly using high throughput sequencing platforms (e.g. Illumina)
Reads are mapped to the reference genome, and binding sites are identified as peaks in the signal
There is an offset between reads on the forward and reverse strand, which allows the exact boundaries of the binding site to be determined; Due to trimming DNA to binding site
89
What is Chromosome Conformation Capture? (hint: long range interactions)
Uses formaldehyde to identify and form cross-links between long-range interacting regions of the genome
The cross-linked chromatin is digested, the loose ends ligated, and the cross-link is removed to form a single continuous piece of DNA containing sequence from the 2 interacting regions
90
What are the 4 methods? How do they differ? (see image on notes) (hint: the different C's)
3C - Look for specific interaction between 2 known partners
4C - Identify remote regions which interact with region of interest
5C - Discovery of novel interactions
Hi-C - Allows comprehensive genome-wide characterisation of all of the interactions between remote chromosomal regions
91
Explain 3C
3C uses 2 specific primers, so is good for targeting interactions between 2 known loci
92
Explain 4C
4C introduces a circularisation step, meaning that only one of the interaction partners needs to be pre-selected
93
Explain 5C
5C uses amplification using primers with a universal “tail” sequence
PCR using primers which recognise this overhanging tail sequence can be used to amplify interactions between many interacting regions
94
Explain Hi-C
Biotin is incorporated into the cross-link between interacting loci
The protein streptavidin has a high affinity for biotin, and is used to purify out the biotin-labelled DNA containing interacting loci
This is followed by high-throughput sequencing to get a genome-wide view of all long range chromosomal interactions
95
What are the different functional elements of the human genome? (6 elements) (summary essentially)
Long range chromosomal interactions
Regions of open chromatin
DNA methylation sites
Transcription factor binding sites
Enhancers and promoters
Coding and non-coding transcribed regions
96
What is the ENCODE Project? Aims? Methods used? (3 methods) With what method were genes identified?
ENCyclopedia Of DNA Elements
Aims to identify all the functional elements in the human genome
RNA-seq, 5C and ChIP-seq were used
Genes were identified with RT-PCR or computational prediction
97
How are CLIP-seq and RIP-seq different to ChIP-seq?
They identify RNA-binding proteins, whereas ChIP-seq identifies DNA-binding proteins
98
What do DNase-seq and FAIRE-seq identify? How are they different?
Regions of open chromatin
- DNase-seq exploits open chromatin's hypersensitivity to DNase I digestion
- FAIRE-seq uses formaldehyde crosslinking of DNA to nucleosomes and purifies unbound DNA
99
What is ChIA-PET? Similarities and differences with 5C?
Both identify long range chromosomal interactions
ChIA-PET does this through ChIP-seq analysis of DNA-nucleosome interactions, instead of direct ligations of interacting DNA regions
100
What is methyl450k?
methyl450k is a microarray-based method of identifying DNA methylation
101
What did ENCODE's findings conflict with and how?
They were able to assign biochemical functions for 80% of the human genome
This conflicted with the previous view that much of the genome was “junk DNA” with no function
102
Why are exons more conserved than introns between species?
Mutations are more likely to cause problems if they are within exons
Functionally important regions of the genome tend to be evolutionarily conserved
103
General principle of evolutionary genomics? Explain
The rate of evolution of the genome is not uniform, and functionally important regions tend to evolve more slowly
Changes in important regions are more likely to be "deleterious" - they have a negative impact on fitness, which means they tend to be removed from the population through natural selection
104
What is TraDIS? What is it for?
Transposon Directed Insertion-site Sequencing
It is used to understand bacterial gene function
105
What are transposons? How do they work?
Mobile genetic elements
Transposons can move around the genome through a “cut and paste” mechanism
106
What is the structure of transposons? How can they be manipulated for mutant screens?
They consist of a transposase gene, flanked by inverted repeat sequences that are recognised by the transposase
If the transposase gene is removed, the transposon can still move and be inserted into a bacterial genome if transposase is supplied
Inclusion of an antibiotic resistance gene allows mutants to be selected
107
How are mutant screens involving transposons studied? (Observations and conclusions etc.)
If a gene is disrupted by the transposon, it will be inactivated
If the disrupted gene is essential, the mutant will not survive
Genes without insertions are likely to be essential
In transposon mutagenesis we do not see mutants in essential regions of the genome
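
The reasoning on this card (no insertions seen in a gene suggests it is essential) can be sketched in Python. The gene coordinates, insertion sites and function name are all invented for illustration; a real analysis would also account for gene length and insertion density:

```python
def genes_without_insertions(genes, insertion_sites):
    """Return genes with no transposon insertion among the surviving mutants."""
    candidates = []
    for name, (start, end) in genes.items():
        if not any(start <= site <= end for site in insertion_sites):
            candidates.append(name)  # no surviving mutant disrupts this gene
    return candidates


# Hypothetical gene coordinates and mapped insertion sites
genes = {"geneA": (100, 500), "geneB": (600, 900), "geneC": (1000, 1400)}
insertion_sites = [120, 450, 1010, 1200, 1399]

print(genes_without_insertions(genes, insertion_sites))  # ['geneB']
```

Here geneB is the candidate essential gene, because every mutant that survived carries its insertion somewhere else.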
108
'TraDIS can in some cases give information at a sub-genic level': What does this mean?
It can identify not just important genes, but important regions of genes
109
How is TraDIS used to identify which genes are essential?
Start with an input pool of random transposon mutants
Grow them under some form of stress
Compare the input and output pools to see which mutants (and therefore which disrupted genes) survived
This tells us which genes are essential and which are non-essential