GEN 3: Defining the Genome II - DNA Flashcards
List some methods used to analyse DNA and when they were first developed
- Sanger Sequencing: 1970s
- Restriction enzymes: 1970s
- DNA cloning with Restriction Enzymes: 1972
- Southern Blotting: 1975
- Polymerase Chain Reaction: 1985
- Next generation sequencing (NGS): 2006
Describe how Sanger sequencing works
- Sanger sequencing requires:
- the target DNA
- an oligonucleotide primer of ~20nt complementary to part of that DNA
- a DNA polymerase
- extends the primer, using the target DNA as template, until a ddNTP is incorporated and terminates extension
- a mixture of deoxyribonucleotide triphosphates (dNTPs) and dideoxynucleotide triphosphates (ddNTPs)
- the ddNTPs lack the 3’-OH group required for nucleotide chain extension
- fragments are separated by polyacrylamide gel electrophoresis or capillary electrophoresis, which distinguishes fragments differing in length by only one nucleotide
- labelling ddATP, ddCTP, ddGTP and ddTTP each with a different fluorophore produces coloured peaks that provide a direct read of nucleotide sequences up to ~1000 nucleotides long
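The chain-termination logic above can be sketched in a few lines of Python. This is a toy model rather than a base-calling pipeline: the template string, the primer length and the function names are illustrative assumptions, and every possible terminated product is assumed to be present in the reaction.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def synthesised_strand(template_3to5: str) -> str:
    """New strand (5'->3') made by copying a template written 3'->5'."""
    return "".join(COMPLEMENT[base] for base in template_3to5)


def sanger_fragments(template_3to5: str, primer_len: int) -> list[str]:
    """All chain-terminated products: the primer plus 1, 2, 3... added bases.

    Each fragment ends in the ddNTP that stopped extension.
    """
    strand = synthesised_strand(template_3to5)
    return [strand[:end] for end in range(primer_len + 1, len(strand) + 1)]


def read_sequence(fragments: list[str]) -> str:
    """Sort fragments by length (as electrophoresis does) and read the last base of each."""
    return "".join(fragment[-1] for fragment in sorted(fragments, key=len))


fragments = sanger_fragments("TACGGATC", primer_len=3)
print(read_sequence(fragments))
```

The read covers only the bases added after the primer, which is why the primer-binding region itself is not part of a Sanger read.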

Describe how restriction enzymes work to analyse DNA
- restriction enzymes are endonuclease enzymes that cut double-stranded DNA at specific sequences
- they cleave phosphodiester bonds to leave free terminal 3’-OH and 5’-phosphate groups
- they are able to cleave internal bonds and circular DNA, unlike exonucleases, which only cleave bonds at DNA ends
- these enzymes usually recognise short target sequences of 4 to 8 base pairs
- they can cut DNA into smaller fragments by targeting their specific restriction sites
- there are two types of restriction digestion
- restriction enzymes come from bacteria, and each is named after the species of bacteria from which it derives
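As a rough illustration of site-specific cutting, the sketch below digests a linear sequence at every occurrence of a recognition site. The example uses the well-known EcoRI site GAATTC, cut between G and A; this enzyme is not named in the flashcards and is an assumption chosen for illustration, as are the function name and the test sequence. Only one strand is modelled, so sticky ends are not represented.

```python
def digest(dna: str, site: str, cut_offset: int) -> list[str]:
    """Cut a linear DNA string at every occurrence of `site`.

    `cut_offset` is how many bases into the site the cut falls
    (1 for a cut between G and AATTC).
    """
    fragments = []
    start = 0
    position = dna.find(site)
    while position != -1:
        cut = position + cut_offset
        fragments.append(dna[start:cut])
        start = cut
        position = dna.find(site, position + 1)
    fragments.append(dna[start:])  # remainder after the last cut
    return fragments


print(digest("AAGAATTCTTTGAATTCCC", "GAATTC", cut_offset=1))
```

Two sites in a linear molecule give three fragments; a circular molecule with the same two sites would give two.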

Describe how DNA cloning with restriction enzymes works
- DNA is inserted into a plasmid
- to do this, both the DNA and plasmid are digested with the same restriction enzymes
- this produces complementary sticky ends
- DNA ligase then ligates the two together
- the resulting recombinant plasmid is then introduced into bacteria to generate a single colony (or clone) of bacterial cells, each carrying the same recombinant plasmid
- this is called recombinant DNA cloning or molecular cloning

What was DNA cloned with restriction enzymes used to create?
- it was used to create a genomic DNA library
- genomic DNA was fragmented into millions of small pieces
- these were ligated into a plasmid vector and introduced into bacteria, so that each individual clone carried a different genomic DNA fragment
- Sanger sequencing of each member was then used to assemble the sequence of the whole genome

Describe how Southern blotting is used to analyse DNA
- mixtures of DNA fragments are separated by electrophoresis through an agarose gel and blotted onto a nylon membrane
- a specific sequence in the mixture can then be detected using a DNA probe that is radioactively or fluorescently labelled

Using sickle cell disease, describe how Southern blotting is used to help determine the status of the beta-globin gene

Describe how PCR works as a way to analyse DNA

- a pair of oligonucleotide primers is designed that flank the region to be amplified and are complementary to opposite strands
- reactions contain template DNA, the chosen primer pair, dNTPs and a thermostable DNA polymerase (Taq polymerase)
- using a programmable temperature block, the PCR reaction is taken through multiple cycles of temperature incubations
- see image for why PCR is used

What’s the amplification factor in PCR?
- in principle, the target sequence is duplicated during each PCR cycle
- if the PCR runs for n cycles, this results in an amplification factor of 2^n
- this doesn’t happen in practice, but it is still powerful
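The doubling arithmetic can be checked with a short calculation. The second function is an assumption added for illustration: it models sub-ideal amplification with a per-cycle efficiency between 0 and 1, a common simplification that is not part of the original notes.

```python
def ideal_copies(start_copies: int, cycles: int) -> int:
    """Copies after `cycles` rounds if every template is duplicated each cycle."""
    return start_copies * 2 ** cycles


def copies_with_efficiency(start_copies: int, cycles: int, efficiency: float) -> float:
    """Copies when only a fraction `efficiency` of templates is duplicated per cycle."""
    return start_copies * (1 + efficiency) ** cycles


print(ideal_copies(1, 30))                 # ideal doubling from a single template
print(copies_with_efficiency(1, 30, 0.9))  # hypothetical 90% per-cycle efficiency
```

Even with reduced efficiency the yield after 30 cycles is many orders of magnitude above the starting amount, which is why PCR remains powerful in practice.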
What is Next Generation Sequencing (NGS)?
- NGS methods can sequence millions of short DNA fragments simultaneously
- known as massively parallel sequencing
- without the need for individual fragment isolation
Describe how Illumina, an NGS method, works
- the DNA is first broken into short (<250nt) fragments that are tagged and hybridised onto oligonucleotides attached to a solid support called a flow cell
- the bound fragments are PCR amplified in situ (bridge amplification)
- generating millions of distinct clusters, each derived from a single fragment
- clusters are then sequenced in parallel (at the same time) by primer-extension, one nucleotide at a time, using dNTPs that are reversibly modified with 3’-end blocks and fluorescence tags
- after the addition of the first nucleotide, the flow cell is laser-scanned to measure the position and colour of each cluster
- the information is then stored digitally
- the flow cell is then treated to remove the fluorescent tags and 3’-end blocking groups on the newly extended primers
- this process is then repeated enough times to generate sequence reads for each cluster
- bioinformatics software is then used to compare sequence reads, to identify any overlaps and so assemble the sequence of the starting DNA
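The final assembly step relies on finding overlaps between reads. The toy sketch below merges two reads when the end of one matches the start of the other. Real assemblers handle sequencing errors, repeats and millions of reads; the function names and the minimum overlap of 3 are assumptions for illustration only.

```python
from typing import Optional


def overlap_length(left: str, right: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_reads(left: str, right: str, min_len: int = 3) -> Optional[str]:
    """Join two reads across their overlap, or return None if they do not overlap."""
    size = overlap_length(left, right, min_len)
    if size == 0:
        return None
    return left + right[size:]


print(merge_reads("ATGCCG", "CCGTTA"))
```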

What are some applications of NGS?
- whole genome sequencing (WGS):
- allows sequence variation between individuals to be compared
- transcriptome sequencing (RNA-seq):
- sequencing of DNA reverse transcribed from RNA transcripts
- the most highly expressed genes give the greatest number of sequence ‘reads’
- targeted sequencing:
- a small region of the genome is sequenced in samples where there may be many variants
- e.g. exome sequencing may reveal protein-coding variations
- ChIPseq:
- antibody to a protein of interest is used to purify chromatin containing that protein, prior to WGS
- reveals protein-genome interactions
Referring to the illustration for Sanger sequencing, why does the incorporation of the first ddNTP (ddGTP) not prevent any further primer extension?
- the reaction contains many primed templates as well as a mixture of dNTPs and ddNTPs (the former being in excess)
- at each position in the sequence, therefore, only a small proportion of the primers have their 3’ ends blocked
- the majority will have unblocked 3’ ends and so will be extended
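This reasoning can be made quantitative with a simple model. Assume, as a simplification not stated in the notes, that at every position a growing chain has the same small probability p of incorporating a ddNTP instead of a dNTP. The fraction of chains that stop at position k is then p(1 - p)^(k - 1), and the fraction still extending after k positions is (1 - p)^k.

```python
def fraction_terminated_at(position: int, p: float) -> float:
    """Fraction of chains that incorporate their ddNTP exactly at `position` (1-based)."""
    return p * (1 - p) ** (position - 1)


def fraction_still_extending(position: int, p: float) -> float:
    """Fraction of chains with an unblocked 3' end after `position` bases."""
    return (1 - p) ** position


p = 0.01  # hypothetical ddNTP incorporation probability per position
for k in (1, 100, 500):
    print(k, fraction_terminated_at(k, p), fraction_still_extending(k, p))
```

With a small p, only a minority of chains is blocked at any one position, so fragments of every length are produced.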
It is important to optimise the temperature at step 2 of PCR (annealing the primer)
Can you predict the consequences of step 2 temperatures that are
a) too high
b) too low ?
a) too high:
- primer hybridisation would be impaired so no PCR products would be obtained
b) too low:
- primers would bind with less specificity and may give rise to spurious PCR products
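The notes do not say how a suitable annealing temperature is chosen. One widely used rule of thumb for short primers, the Wallace rule, estimates the melting temperature as 2 °C for each A or T plus 4 °C for each G or C. The rule, the 5 °C offset below the melting temperature and the example primer are all additions for illustration, not part of the original flashcards, and real protocols are tuned empirically.

```python
def wallace_tm(primer: str) -> int:
    """Approximate melting temperature (deg C) of a short primer by the Wallace rule."""
    primer = primer.upper()
    at_count = primer.count("A") + primer.count("T")
    gc_count = primer.count("G") + primer.count("C")
    return 2 * at_count + 4 * gc_count


def suggested_annealing_temp(primer: str, offset: int = 5) -> int:
    """Starting point for annealing: a few degrees below the estimated Tm (assumed offset)."""
    return wallace_tm(primer) - offset


primer = "ATGCATGCATGCATGCATGC"  # hypothetical 20 nt primer
print(wallace_tm(primer), suggested_annealing_temp(primer))
```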
At each sequencing step during Illumina NGS, what must happen in between laser scanning of the flow cell and addition of the next nucleotide?
- the 3’-blocks and fluorescent tags must be removed
How many nucleotides is the human haploid genome composed of?
- 3 billion nucleotides

What percentage of the human haploid genome are genes and gene-related DNA?
How many protein-coding genes are there?
What are some gene-related DNA examples?
- 37.5%
- there are about 21,000 protein-coding genes
- most of their DNA is non-coding
- introns, UTRs
- non-protein-coding-genes make RNAs with known and unknown functions
- rRNA, tRNA, miRNA, some lncRNAs
- other non-coding DNA is known to be gene-related but lacks function
- pseudogenes
- gene fragments

How much of the human genome is made up of highly repeated DNA?
What is it made up of?
- 54%
- 1740Mb
- it is made up of dispersed transposable elements and tandemly repeated DNA

Observe this diagram for the breakdown of the human genome constituents

Why is most of the DNA in protein-coding genes non-coding?
- there are non-coding regions such as:
- 5’ and 3’ untranslated regions (UTRs)
- enhancer sequences
- promoter sequences
- long introns
What is alternative splicing?
- alternative splicing allows a single gene to generate multiple mRNA isoforms
- and therefore, multiple protein isoforms
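One simple way to picture this is exon skipping, where some exons are optional. The sketch below lists every isoform obtainable from a gene when certain exons may be included or skipped. The exon names and the choice of which exons are optional are hypothetical, and real splicing involves other mechanisms that are not modelled here.

```python
from itertools import product


def isoforms(exons: list[str], optional: set[str]) -> list[list[str]]:
    """Enumerate mRNA isoforms where each optional exon is either kept or skipped."""
    choices = [(True, False) if exon in optional else (True,) for exon in exons]
    result = []
    for keep_flags in product(*choices):
        result.append([exon for exon, keep in zip(exons, keep_flags) if keep])
    return result


for isoform in isoforms(["E1", "E2", "E3", "E4"], optional={"E2", "E3"}):
    print("-".join(isoform))
```

With k optional exons this model gives 2^k isoforms from a single gene.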

Describe the gene density across humans, prokaryotes, simple eukaryotes
- human genomes have very low gene density compared to genomes of prokaryotes and simpler eukaryotes
- genes are tightly packed in bacteria and yeast (with only the occasional intron in yeast) while genes in higher eukaryotes are less tightly packed and routinely interrupted with introns

Describe the average lengths of a human protein-coding gene, exons and intron numbers and lengths
- Values range widely, but the average human protein-coding gene is 67 Kb long
- with 11 exons and 10 introns with mean sizes of 163 bp and 6.4 Kb, respectively
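A quick calculation with the averages quoted above shows how little of a typical gene is exonic. Multiplying mean counts by mean sizes is only a rough consistency check, and exonic sequence also includes UTRs, so the result should be read as an approximation.

```python
EXON_COUNT = 11
INTRON_COUNT = 10
MEAN_EXON_BP = 163
MEAN_INTRON_BP = 6_400  # 6.4 Kb

exonic_bp = EXON_COUNT * MEAN_EXON_BP
intronic_bp = INTRON_COUNT * MEAN_INTRON_BP
total_bp = exonic_bp + intronic_bp
exon_fraction = exonic_bp / total_bp

print(f"exonic: {exonic_bp} bp, intronic: {intronic_bp} bp, total: {total_bp} bp")
print(f"exonic fraction: {exon_fraction:.1%}")
```

The total comes out close to the quoted average gene length, with exons making up only a few percent of it.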
Describe the average gene density in the human genome
- The average gene density in the human genome also varies between chromosomes
- with about 25% of the genome consisting of gene deserts - regions of over a Mb that are devoid of genes.
How much of the genome is transcribed into RNA?
How much of this is transcribed into non-coding RNA (ncRNA)?
- about 75% of the genome
- a third of this represents non-coding RNA (ncRNA)
What are non-coding RNAs (ncRNA)?
List their classes
- ncRNAs are RNAs that are not translated into protein
- some of this may reflect background transcription of no functional significance
- but much is transcribed from genes with important functions
- 5 classes:
- ribosomal RNA genes
- transfer RNA genes
- small nuclear/nucleolar RNA genes
- micro RNA genes
- long non-coding RNA genes
Describe ribosomal RNA genes
- what do they code for
- transcribed by what
- ribosomal RNA is essential during protein translation by ribosomes
- rRNA genes are the best characterised of the ncRNAs
There are 4 rRNAs:
- 18S
- 5.8S
- 28S
- 5S
- The genes for these exist in multiple copies to ensure there is sufficient rRNA for translation
- 18S, 5.8S and 28S rRNA are derived from a 41S precursor transcribed by RNA polymerase I from ~ 300 copies of a gene that is tandemly repeated in 5 clusters on different chromosomes
- Similar numbers of 5S rRNA genes, also clustered as tandem repeats, are located elsewhere in the genome and transcribed by RNA polymerase III.

Describe what transfer RNA genes do
Transcribed by what?
- codes for transfer RNA (tRNA)
- Transfer RNA (tRNA) also functions during translation by delivering amino acids to the ribosome.
- tRNAs are small; 76-90 nucleotides, with a folded structure.
- They can base pair with the codons of an mRNA strand, and this process delivers the attached amino acid to the growing peptide chain
- Genes for the 49 different tRNAs also exist as multiple copies at various chromosome sites and are also transcribed by RNA polymerase III.

What do small nuclear / nucleolar RNA genes do?
What are they transcribed by?
Where in the genome are they?
- Genes for small nuclear and small nucleolar RNAs (snRNA and snoRNA) are dispersed throughout the genome
- transcribed by RNA Polymerases II or III
- Each gene makes a distinct RNA with a distinct function in the processing of mRNA, tRNA or rRNA into their mature forms.
- For example, many are essential components of the RNA splicing machinery (spliceosome).
Describe micro RNA genes
What are miRNAs?
- MicroRNAs (miRNAs) are small (~22nt) RNAs that control gene expression by RNA interference
- they physically block translation by binding to mRNAs to prevent ribosomal access
- Many different miRNAs, each targeting specific mRNAs, are transcribed from multiple genes by RNA polymerases II or III.
- Piwi interacting RNAs (piRNAs) also work by RNA interference but are expressed only in the germ line where they silence transposons.
- The number of genes assigned to these categories is growing.

Describe long non-coding RNA (lncRNA) genes
What are they transcribed by?
Size
Examples
- Genes for long non-coding RNA (lncRNA) are also being identified in increasing numbers.
- In common with protein-coding genes, most are transcribed by RNA polymerase II.
- lncRNA is between 200 and 17,000 nucleotides in length and can regulate mRNA expression by various mechanisms, including RNA interference, although relatively few have fully defined roles.
- A famous example is X-inactivation by Xist
What are pseudogenes?
How many categories of them are there?
- these are mutated genes that are no longer functional
- thought to be useless byproducts of genome evolution
- there are >20,000 pseudogenes in the human genome
- there are three categories
What are the three categories of pseudogenes?
- non-processed pseudogenes:
- arise by duplication of a functional gene followed by mutational inactivation
- processed pseudogenes:
- they are intronless and arise by reverse transcription of a spliced transcript followed by chromosomal integration and mutational inactivation
- gene fragments:
- these are non-functional remnants of genes resulting from genomic rearrangements
Look and learn the example of non-processed pseudogene at the human beta-globin gene on chromosome 11

What are the two types of highly repeated DNA?
- dispersed (or interspersed) repeats
- tandem repeats

Where are large amounts of highly repetitive DNA found?
- in higher eukaryotes, including humans
What are the functions of highly repetitive DNA?
- they are clearly major determinants of genome DNA sequence organisation
- They have also been identified as important sites of genetic variation.
- Furthermore, some are known to influence gene expression and it is thought that they have key roles in 3D folding of the genome.
- Nonetheless, the full extent of their functional significance remains unclear.
What are transposable elements (transposons)?
- a type of dispersed repetitive DNA
- by far the most abundant dispersed repeat sequences
- transposons are DNA sequences capable of changing their location within the genome
- there are two basic categories of transposable elements and these use different transposition mechanisms:
- RNA transposons (retrotransposons)
- DNA transposons
Describe the difference between RNA transposons and DNA transposons
Describe their mechanisms
- RNA transposons (retrotransposons):
- transcribe their DNA into RNA and use a reverse transcriptase to convert this back into DNA that inserts into a new site in the genome
- Remarkably, RNA transposons account for about 40% of the human genome!
- DNA transposons:
- make up about 3% of the human genome
- use a transposase to excise their DNA from one site and insert it elsewhere in the genome
- image caption: DNA transposons leave their original site to integrate elsewhere (‘cut and paste’), whereas RNA transposons remain at their original site but make copies that insert elsewhere (‘copy and paste’).
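The 'cut and paste' versus 'copy and paste' distinction can be modelled on a toy genome represented as an ordered list of features. The feature names, indices and function names are hypothetical; the sketch shows only the change in copy number and position, not the enzymology.

```python
def cut_and_paste(genome: list[str], source: int, destination: int) -> list[str]:
    """DNA transposon: excise the element at `source` and insert it at `destination`."""
    result = list(genome)
    element = result.pop(source)
    result.insert(destination, element)
    return result


def copy_and_paste(genome: list[str], source: int, destination: int) -> list[str]:
    """RNA transposon: leave the original in place and insert a new copy."""
    result = list(genome)
    result.insert(destination, result[source])
    return result


genome = ["geneA", "TE", "geneB", "geneC"]
print(cut_and_paste(genome, source=1, destination=2))
print(copy_and_paste(genome, source=1, destination=3))
```

Only the copy-and-paste route increases the number of elements, which fits the much larger genome share of RNA transposons.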

What are the three main types of RNA transposons?
- LTR retrotransposons
- LINEs
- SINEs

Use this diagram to describe how DNA transposons work

- DNA transposons consist of a transposase gene flanked by inverted repeat (IR) sequences
- the transposase protein recognises and cleaves the IR sequences, releasing the transposon DNA
- the transposase also catalyses the integration of the DNA at new, non-random sites in the host cell genome
Describe LTR retrotransposons
- long terminal repeat (LTR) retro-transposons are also called endogenous retroviruses (ERVs)
- their DNA consists of a coding region, flanked by a pair of LTR sequences (100 bp - 5 Kb in length)
- LTR retrotransposition involves an integration mechanism related to that used by retroviruses, which also have LTR sequences
- the genomes of LTR retro-transposons include two genes:
- gag: encodes a protein needed to make cytoplasmic virus-like particles
- pol: codes for reverse transcriptase
- unlike retroviruses, LTR retro-transposons are non-infective, as they lack the env gene that retroviruses need to leave their host cells and infect other cells

Describe LINEs RNA transposons
- length
- abundance
- structure
- long interspersed nuclear elements (LINEs) are around 6 Kb in length
- their mRNA codes for a reverse transcriptase that can copy the RNA back into DNA which can then integrate into a new site in the genome
- reverse transcription is often incomplete leading to a truncated, non-functional product being inserted back into the genome
- among the different families of human LINEs, one called LINE-1 or L1 is very abundant
- it accounts for 17% of the human genome!
- structure:
- 5’ UTR: includes a strong promoter driving transcriptional initiation by RNA polymerase II
- Open Reading Frame 1: encodes an RNA binding protein required for transposition
- Open Reading Frame 2: encodes a protein with endonuclease (EN) and reverse transcriptase (RT) activity required for transposition

Describe SINEs RNA transposons
- they are short interspersed nuclear elements
- 100-400 bp
- they are non-autonomous transposons, as they do not code for any proteins
- they depend on proteins (e.g. those encoded by other transposons) for transposition
- among the different families of human SINEs, one called Alu is very abundant
- it accounts for 11% of the human genome

What are the effects of transpositions of transposons?
- Even though the vast majority of transposons in the human genome have lost their ability to transpose, they have clearly had an enormous influence on genome evolution.
- Furthermore, they continue to modify the genome, either by rare transposition events or by recombining with each other to cause chromosomal rearrangements.
- Such changes may be oncogenic in somatic cells or, if they occur during gametogenesis, may cause genetic disease or contribute to genome evolution
What are tandem repeats?
- regions of DNA where an array of identical, or highly similar, repeat motifs (also called repeat units) is consecutively repeated
- less abundant than transposons

Why are tandem repeats sometimes called VNTR (variable number of tandem repeats) sequences?
- there is variability between individuals
- there is interest in these because of this
How are tandem repeats classified?
- according to their abundance, size of their motifs and arrays, they are classified as:
- macro-satellites
- mini-satellites
- micro-satellites

Describe macro-satellites in eukaryotes
- motif length
- array size
- locations
- they have motifs up to ~220 nt long
- form large arrays (~ 20,000 to 5,000,000 nt long) at 1-100 locations in the genome
- sequencing such large arrays is an ongoing challenge and is why the full sequence of the human genome is still incomplete
- located in heterochromatin and at or near centromeres and telomeres
- the major human centromeric macro-satellite, alpha-satellite DNA (repeat motif of 171 nt), is implicated in centromere function and chromosome segregation
- see diagram

Describe mini-satellites in eukaryotes
- motif size
- array size
- location
- uses
- motifs of 10-150 nt
- forming arrays of ~20 - 2000 nt at more than 1000 locations in the genome
- mostly in euchromatin at telomeres and centromeres
- these regions have no clear function
- but in 1984 Alec Jeffreys famously identified that the number of repeat units in mini-satellites varies between individuals.
- This discovery led to DNA fingerprinting.

Describe micro-satellites in eukaryotes
- also called Short Tandem Repeats (STR) or Simple Sequence Repeats (SSRs)
- repeat units (motifs) of 1-10 nt
- forming arrays of up to 1400 bp at 1000 - 1,000,000 genomic locations
- there are more than half a million micro-satellite arrays in the human genome with di-, tri-, tetra- and penta-nucleotide sequence motifs
- the number of repeats in any particular array is highly variable between individuals, making them useful for forensic, linkage and population studies
- arrays of a 6 bp repeat unit (TTAGGG) are found at the very end of all telomeres
- many micro-satellites are located close to or within genes and some may even be part of the protein-coding region
- e.g. the tri-nucleotide repeat (CAG)n array found in the gene HTT
- this is implicated in Huntington’s disease
Describe how Huntington’s disease is caused
- Many micro-satellites are located close to or within genes and some may even be part of the protein-coding region.
- A famous example is the tri-nucleotide repeat (CAG)n array found in the gene HTT.
- This codes for the protein Huntingtin, implicated in Huntington’s disease.
- Arrays in healthy individuals contain between 6 and 35 CAG repeats.
- Individuals affected with Huntington’s disease, however, have >35 repeats, which cause the protein product to have harmful properties.
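The repeat-count threshold described above can be expressed as a small check. The sketch finds the longest uninterrupted run of CAG in a sequence and compares it with the >35 cut-off from the notes. The input sequences are synthetic and the function names are assumptions; this is an illustration of the counting logic, not a diagnostic method.

```python
import re


def longest_cag_run(sequence: str) -> int:
    """Number of repeats in the longest uninterrupted (CAG)n run."""
    runs = re.findall(r"(?:CAG)+", sequence.upper())
    return max((len(run) // 3 for run in runs), default=0)


def is_expanded(repeat_count: int, threshold: int = 35) -> bool:
    """True when the repeat count exceeds the threshold quoted in the notes (>35)."""
    return repeat_count > threshold


healthy_like = "ATG" + "CAG" * 20 + "CCG"   # synthetic example
expanded_like = "ATG" + "CAG" * 40 + "CCG"  # synthetic example
for seq in (healthy_like, expanded_like):
    count = longest_cag_run(seq)
    print(count, is_expanded(count))
```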
