Lecture 2 Flashcards
The human genome
- 22 pairs of autosomes and 1 pair of sex chromosomes (XX/XY)
About 25,000 genes in the genome. Chromatin comes in two types:
- Euchromatin is loosely packaged and transcriptionally active
- Heterochromatin is densely packed and not transcriptionally active (its genes aren’t supposed to be turned on)
How are males and females different?
- Y chromosomes are only paternally inherited (nothing to pair up and recombine with)
- Mitochondrial DNA (mtDNA) is only maternally inherited, although it's found in both males and females. This is useful if we’re trying to reconstruct the evolutionary history of a population.
- Y and mtDNA are also different from the autosomes: they do not recombine
- Autosomal regions do recombine
Structure of DNA and RNA
- DNA has 4 nucleotides (bases): A, C, G and T. They are similar in structure; where they differ is in the base. A and G are purines, C and T are pyrimidines.
- Attached to a sugar (deoxyribose) and a phosphate group
- RNA has a different sugar (ribose) and also a different base; uracil instead of thymine
- The genetic code is always read in the 5’ to 3’ direction
- G always pairs with C
- A always pairs with T
- DNA is always double stranded - G and C, T and A.
- GC content: the % of positions that are G or C
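The pairing rules and GC content above can be sketched in a few lines of Python. This is a minimal illustration only; the function names `gc_content` and `reverse_complement` are made up for this example and assume sequences written with the letters A, C, G and T.

```python
# Watson-Crick pairing: A pairs with T, G pairs with C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def gc_content(seq: str) -> float:
    """Return the percentage of positions in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence is empty")
    gc = sum(1 for base in seq if base in "GC")
    return 100.0 * gc / len(seq)

def reverse_complement(seq: str) -> str:
    """Return the partner strand, written 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]

print(gc_content("ATGC"))          # 50.0
print(reverse_complement("ATGC"))  # GCAT
```

The strand is reversed after complementing because the two strands run in opposite directions and both are conventionally written 5' to 3'.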
Typical structure of a gene
No two genes are the same, but there are some common features:
- The coding parts are the exons; in between the exons are introns
- Upstream there are promoters and enhancers (involved in gene expression)
- The gene is transcribed and the introns get spliced out, leaving mature messenger RNA
The genetic code
Key point – some substitutions change the amino acid (nonsynonymous)
Others result in the same (synonymous) amino acid; e.g. CCA and CCG are both proline
Only relevant to the coding region
Redundancy of the genetic code: each amino acid is encoded by a three-base codon, and some amino acids are encoded by more than one codon.
This means the position within the codon where a mutation happens determines whether it changes the amino acid.
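To make the synonymous/nonsynonymous distinction concrete, here is a rough Python sketch. It builds the standard genetic code from the usual compact 64-letter string (bases in TCAG order); the helper names `classify_substitution` and `synonymous_fraction_by_position` are invented for this illustration, and stop codons are simply treated as the symbol `*`.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def classify_substitution(ref_codon: str, alt_codon: str) -> str:
    """Say whether changing ref_codon into alt_codon changes the amino acid."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    return "synonymous" if ref_aa == alt_aa else "nonsynonymous"

def synonymous_fraction_by_position() -> dict:
    """Fraction of all single-base changes that are synonymous, per codon position."""
    fractions = {}
    for pos in range(3):
        same = total = 0
        for codon in CODON_TABLE:
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutated = codon[:pos] + base + codon[pos + 1:]
                total += 1
                same += CODON_TABLE[mutated] == CODON_TABLE[codon]
        fractions[pos + 1] = same / total
    return fractions

print(classify_substitution("CCA", "CCG"))  # synonymous (both proline)
print(classify_substitution("GCG", "GTG"))  # nonsynonymous (alanine to valine)
print(synonymous_fraction_by_position())
```

The last line shows why the position matters: most of the redundancy sits at the third base of the codon.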
Genetic Variation
- Single nucleotide polymorphisms (SNPs) are most common form of genetic variation
- Change one nucleotide for another. Transitions (purine to purine or pyrimidine to pyrimidine, e.g. A to G) are more common than transversions (purine to pyrimidine or vice versa, e.g. A to C); see Lecture 10
- Genetic variation: evolution relies on it to happen
- Occasionally we get indels (insertions/deletions). These are disruptive in the coding region because they can result in a different set of codons.
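As a small illustration of the transition/transversion distinction, the sketch below classifies a SNP from its two alleles. The function name `classify_snp` is hypothetical, chosen just for this example.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify_snp(ref: str, alt: str) -> str:
    """Classify a single-nucleotide change as a transition or a transversion."""
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid:
        raise ValueError("alleles must be one of A, C, G, T")
    if ref == alt:
        raise ValueError("ref and alt are the same base")
    # Transition: both bases belong to the same chemical class.
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"

print(classify_snp("A", "G"))  # transition
print(classify_snp("A", "C"))  # transversion
```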
SNPs are fundamental to the entire module
- SNPs are the building blocks of all genetic variation
- (Usually) two alleles – the major (more common) and minor (rarer) allele, e.g. A is the minor allele and G is the major allele.
What's a minor allele frequency?
The frequency of the minor (rarer) allele in the population.
Individuals can be either homozygous (e.g. GG or AA) or heterozygous (GA)
How can SNPs have a greater impact on some proteins than others?
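A minor allele frequency can be worked out by counting alleles across genotypes. The sketch below assumes genotypes are given as two-letter strings such as "GG" or "GA"; the function names and the sample data are invented for illustration, not taken from the lecture.

```python
from collections import Counter

def allele_frequencies(genotypes):
    """Count every allele in every genotype and convert counts to frequencies."""
    counts = Counter(allele for g in genotypes for allele in g.upper())
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}

def minor_allele_frequency(genotypes):
    """Frequency of the rarer allele; 0.0 if only one allele is present."""
    freqs = allele_frequencies(genotypes)
    if len(freqs) < 2:
        return 0.0
    return min(freqs.values())

# Hypothetical sample: two GG homozygotes, one GA heterozygote, one AA homozygote.
sample = ["GG", "GG", "GA", "AA"]
print(allele_frequencies(sample))      # G: 5/8 = 0.625, A: 3/8 = 0.375
print(minor_allele_frequency(sample))  # 0.375
```

Each individual contributes two alleles, so four individuals give eight allele copies in total.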
- Ones that radically change a protein are unlikely to be neutral
- Change from GCG to GCA: still codes for alanine, so the change is silent (synonymous)
- GCG to GTG: alanine becomes valine. This is a conservative substitution, as the two amino acids are biochemically quite similar, so it won’t affect the protein that much. The protein may fold and function in the same way
- Non-conservative amino acid substitution: GCA to CTA changes the amino acid to a leucine, with a more profound effect on the protein. It can change the charge or other biochemical properties.
- Deletions and insertions: if GAA is deleted, one amino acid is lost, which might have a serious consequence, but downstream everything is still the same because the number of bases removed is divisible by 3
- Addition or deletion of 1 or 2 bases affects that amino acid and everything downstream (a frameshift), e.g. adding a G changes all the amino acids downstream and can add a stop codon early on. Likely to be harmful rather than positive or neutral
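The difference between an in-frame deletion and a frameshift can be seen by translating a short sequence. This is a sketch only: the example sequence is made up for illustration, `translate` is a hypothetical helper, and the codon table is the standard genetic code written in its usual compact form.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna: str) -> str:
    """Translate from the first base, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":  # stop codon
            break
        protein.append(aa)
    return "".join(protein)

original = "ATGGCTGAACTGAAATAA"                # ATG GCT GAA CTG AAA TAA
in_frame = original[:6] + original[9:]         # delete the three bases GAA
frameshift = original[:3] + "G" + original[3:]  # insert one G after ATG

print(translate(original))    # MAELK
print(translate(in_frame))    # MALK  (one amino acid lost, rest unchanged)
print(translate(frameshift))  # MG    (new reading frame hits a stop codon early)
```

In this made-up sequence the inserted G shifts the frame so that TGA appears as the third codon, which truncates the protein.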
How do we screen genetic variation? Capillary sequencing
- Also known as Sanger sequencing and chain termination sequencing
- Main idea is that incorporating a dideoxynucleotide (rather than a deoxynucleotide) terminates DNA synthesis during PCR.
- Dideoxynucleotides are at low concentration in the reaction mixture, and each base has a different fluorescent dye
- Products are passed through a capillary sequencer, and the dyes are read by a laser. The combination of size and dye reveals the sequence
- The laser reads the different wavelengths of the different dyes, which tells you the sequence of bases
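The logic of "size plus dye reveals sequence" can be mimicked with a toy model. This is a deliberately simplified sketch, not a simulation of the chemistry: it assumes exactly one terminated fragment of every possible length, and the dye labels (`dye_1` to `dye_4`) and function names are placeholders invented for this example.

```python
import random

# Hypothetical dye labels; real kits assign specific fluorophores to each base.
DYE_FOR_BASE = {"A": "dye_1", "C": "dye_2", "G": "dye_3", "T": "dye_4"}
BASE_FOR_DYE = {dye: base for base, dye in DYE_FOR_BASE.items()}

def terminated_fragments(new_strand: str):
    """One fragment per stopping point: (length, dye on the final ddNTP)."""
    return [
        (length, DYE_FOR_BASE[new_strand[length - 1]])
        for length in range(1, len(new_strand) + 1)
    ]

def read_sequence(fragments) -> str:
    """Sort fragments by size (as the capillary does) and read each dye in turn."""
    return "".join(BASE_FOR_DYE[dye] for _, dye in sorted(fragments))

fragments = terminated_fragments("GATTACA")
random.shuffle(fragments)        # the reaction mixture is unordered
print(read_sequence(fragments))  # GATTACA
```

Because every fragment ends in a dye-labelled terminator, ordering the fragments by length and reading the dye on each one recovers the bases in order.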
How do we screen genetic variation? Next-generation sequencing
Note the Log scale
1 Mbp costs about 1/1,000,000 of the price it did in 2001
Moore's law relates to the speed of computer processors. The cost of DNA sequencing followed Moore's law until 2007, then fell much faster
1 million times cheaper is a huge change: we can do things a lot more effectively
e.g. Illumina sequencing
Illumina sequencing key points
- DNA is fragmented and adaptors added to the ends.
- Fragments are then immobilised on a flow cell via their adaptors, and a process known as bridge amplification generates ~1000 identical sequences in millions of locations on the cell
- Dye-labelled terminating nucleotides are added to the flow cell, then washed away. The process is repeated until about 100bp are read.
- These short reads are then compared to a reference assembly.
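Comparing a short read to a reference can be sketched as sliding the read along the reference and keeping the position with the fewest mismatches. Real aligners use indexed, much faster methods; this naive version, with a made-up reference and a hypothetical function name `best_alignment`, only illustrates the idea.

```python
def best_alignment(read: str, reference: str):
    """Return (position, mismatches) for the best ungapped placement of read."""
    best_pos, best_mismatches = None, None
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if best_mismatches is None or mismatches < best_mismatches:
            best_pos, best_mismatches = pos, mismatches
    return best_pos, best_mismatches

reference = "ACGTTGCAAGGCTTAACG"  # made-up reference sequence
read = "GCATGGCT"                 # matches position 5 except for one base

print(best_alignment(read, reference))  # (5, 1)
```

The single mismatch is either a genuine variant or a sequencing error, which is why depth and quality scores matter when interpreting it.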
Comparison of sequencing methods
Capillary Sanger sequencing: Method: ddNTP termination and fluorescent detection; Max. read length (bp): 850; Run time: 1; Gb / run / machine: <<0.001; Pros: Accurate, useful for validation of NGS data; Cons: Low throughput, expensive
Illumina NovaSeq X: Method: Polymerase-based sequencing by synthesis; Max. read length (bp): 2 x 150 (paired end); Run time: 2; Gb / run / machine: 8000; Pros: Massive throughput; Cons: Short reads make assembly challenging
PacBio Revio: Method: Single-molecule real-time sequencing; Max. read length (bp): 15,000-20,000; Run time: 1; Gb / run / machine: 90; Pros: Very high throughput, very long reads; Cons: Less throughput than Illumina
Oxford Nanopore PromethION: Method: Single-molecule real-time sequencing; Max. read length (bp): 10,000-100,000; Run time: 3; Gb / run / machine: 50-110; Pros: Very high throughput, possibly longest reads; Cons: Slightly higher error rate, less throughput than Illumina
Errors during sequencing?
- Illumina errors can be corrected (and caused!) bioinformatically
- Sequencing errors are ~0.1% per read- get another file
- Q scores give an indication of reliability: Q = -10 log10 P(error), so Q30 means a 1 in 1,000 chance that the base call is wrong
- With high depth, errors are more obvious, and so can be corrected
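The Q score formula and the effect of depth can be illustrated as follows. This is a sketch under simple assumptions: `consensus_base` is a hypothetical majority-vote helper, not how any particular variant caller works, and the pileup string is invented.

```python
import math
from collections import Counter

def phred_quality(p_error: float) -> float:
    """Q = -10 * log10(P(error))."""
    return -10 * math.log10(p_error)

def error_probability(q: float) -> float:
    """Invert the Q score back to an error probability."""
    return 10 ** (-q / 10)

def consensus_base(pileup: str) -> str:
    """Majority vote over the bases that different reads report at one position."""
    return Counter(pileup).most_common(1)[0][0]

print(phred_quality(0.001))    # approximately 30
print(error_probability(20))   # approximately 0.01
# Ten reads cover this position; one disagrees and is outvoted.
print(consensus_base("GGGGAGGGGG"))  # G
```

With only one or two reads at a position the odd base out could not be told apart from a real variant, which is the sense in which depth makes errors more obvious.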
Exome sequencing
- The exome is only a few percent of the genome, yet contains the coding regions of genes.
- Sequencing just the exome is a cost-effective way of analysing a part of the genome that is likely to be important