Lecture 2 Flashcards
The human genome
- 22 pairs of autosomes and 1 pair of sex chromosomes (XX/XY)
About 25,000 genes in the genome. Chromatin comes in two types:
- Euchromatin is loosely packaged and transcriptionally active
- Heterochromatin is densely packed and not transcriptionally active (aren’t supposed to be turned on)
How are males and females different?
- Y chromosomes are only paternally inherited (nothing to pair up and recombine with)
- Mitochondrial DNA (mtDNA) is only maternally inherited, although it’s found in both males and females. This is useful if we’re trying to reconstruct the evolutionary history of a population.
- Y and mtDNA also differ from the autosomes: they do not recombine
- Autosomal regions do recombine
Structure of DNA and RNA
- DNA has 4 nucleotides (bases): A, C, G and T. They are similar in structure; where they differ is in the base part. A and G are purines, C and T are pyrimidines.
- Attached to a sugar (deoxyribose) and a phosphate group
- RNA has a different sugar (ribose) and also a different base; uracil instead of thymine
- The genetic code is always read in the 5’ to 3’ direction
- G always pairs with C
- A always pairs with T
- DNA is normally double stranded, with G paired to C and T paired to A.
- GC content - % of positions that are GC
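As a minimal sketch of the pairing rules and GC content described above (the function names `gc_content` and `reverse_complement` are made up for illustration and are not from the lecture):

```python
# Pairing rules from the notes: G pairs with C, A pairs with T.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def gc_content(seq: str) -> float:
    """Return the percentage of positions that are G or C."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence is empty")
    gc = sum(1 for base in seq if base in "GC")
    return 100.0 * gc / len(seq)


def reverse_complement(seq: str) -> str:
    """Return the partner strand, written in its own 5' to 3' direction."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


print(gc_content("ATGCGC"))          # 4 of the 6 positions are G or C
print(reverse_complement("ATGCGC"))
```

The partner strand is reversed as well as complemented because the two strands run in opposite directions and each is read 5' to 3'.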
Typical structure of a gene
No two genes are the same, but there are some common features
The coding parts are the exons; in between the exons are introns. Upstream there are promoters and enhancers (involved in gene expression). During transcription and RNA processing, introns get spliced out, leaving the mature messenger RNA.
The genetic code
Key point – some substitutions change the amino acid (nonsynonymous)
Others result in the same (synonymous) amino acid; e.g. CCA and CCG are both proline
Only relevant to the coding region
Redundancy of the genetic code: each amino acid is encoded by a three-base codon, and most amino acids are encoded by more than one codon.
This means the position where a mutation happens within a codon determines whether it changes the amino acid.
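The synonymous/nonsynonymous distinction can be sketched in a few lines. This is an illustration only: it builds the standard genetic code from a compact 64-letter string, and the helper name `classify_substitution` is an assumption, not something from the lecture.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(ref_codon: str, alt_codon: str) -> str:
    """Label a codon change as synonymous or nonsynonymous."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    return "synonymous" if ref_aa == alt_aa else "nonsynonymous"


print(classify_substitution("CCA", "CCG"))  # both codons encode proline (P)
print(classify_substitution("GCG", "GTG"))  # alanine (A) becomes valine (V)
```

Changes at the third codon position are the ones most often synonymous, which is why the position of the mutation within the codon matters.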
Genetic Variation
- Single nucleotide polymorphisms (SNPs) are the most common form of genetic variation
- A SNP changes one nucleotide for another. Transitions are more common than transversions (see Lecture 10)
- Evolution relies on genetic variation to happen
- Occasionally we get indels (insertions/deletions). These are disruptive in a coding region because they can result in a different set of codons.
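A transition swaps a purine for a purine (A and G) or a pyrimidine for a pyrimidine (C and T); a transversion swaps one class for the other. A small sketch of that rule, with `substitution_type` as a hypothetical helper name:

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(ref: str, alt: str) -> str:
    """Classify a single-base change as a transition or a transversion."""
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid:
        raise ValueError("bases must be one of A, C, G, T")
    if ref == alt:
        raise ValueError("ref and alt are the same base")
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


print(substitution_type("G", "A"))  # purine to purine
print(substitution_type("G", "T"))  # purine to pyrimidine
```

Of the 12 possible single-base changes, only 4 are transitions, so their being more common in real data is not what chance alone would predict.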
SNPs are fundamental to the entire module
- SNPs are the building blocks of all genetic variation
- (Usually) two alleles – the major (more common) and minor (rarer) allele. E.g. if most chromosomes carry G and a few carry A, A is the minor allele and G is the major allele.
What’s a minor allele frequency?
The frequency of the rarer (minor) allele in the population.
Individuals can be either homozygous (e.g. GG or AA) or heterozygous (GA)
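The minor allele frequency is the proportion of all allele copies in a sample that are the rarer allele. A minimal sketch, assuming a biallelic SNP and genotypes written as two-letter strings (the sample data and function names are invented for illustration):

```python
from collections import Counter


def allele_frequencies(genotypes: list[str]) -> dict[str, float]:
    """Count every allele copy (two per diploid individual)."""
    counts = Counter("".join(genotypes))
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def minor_allele_frequency(genotypes: list[str]) -> float:
    """Frequency of the rarer allele; 0.0 if only one allele is seen."""
    freqs = allele_frequencies(genotypes)
    if len(freqs) < 2:
        return 0.0
    return min(freqs.values())


# Five individuals: three GG homozygotes, one GA heterozygote, one AA homozygote.
sample = ["GG", "GG", "GA", "GG", "AA"]
print(allele_frequencies(sample))
print(minor_allele_frequency(sample))  # A is the minor allele here
```

Each heterozygote contributes one copy of each allele, so the sample of 5 people contains 10 allele copies, 3 of which are A.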
How can SNPs have a greater impact on some proteins than others
- Ones that radically change a protein are unlikely to be neutral
- Change from GCG to GCA: still codes for an alanine, so it is silent.
- GCG to GTG gives a valine. This is a conservative substitution, as the two amino acids are biochemically quite similar (similar charge and chemical properties), so it won’t affect the protein that much. The protein may fold and function in the same way.
- Non-conservative amino acid substitution: e.g. GCA to CTA changes alanine to a leucine. This has a more profound effect on the protein, typically because the charge or other chemical properties of the amino acid change.
- Deletions and insertions in multiples of 3: e.g. if GAA is deleted, one amino acid is lost, which might have a serious consequence, but everything downstream is still the same because the number of bases removed is divisible by 3.
- Addition or deletion of 1 or 2 bases affects that amino acid and everything downstream (a frameshift), e.g. adding a G changes all the amino acids downstream and adds a stop codon early on. Likely to be harmful rather than positive or neutral.
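The difference between an in-frame indel and a frameshift can be seen by splitting a sequence into codons before and after the change. The sequence below is made up for illustration, and `codons` and `is_frameshift` are hypothetical helper names:

```python
def codons(seq: str) -> list[str]:
    """Split a coding sequence into complete codons, dropping leftover bases."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def is_frameshift(indel_length: int) -> bool:
    """An indel shifts the reading frame unless its length is divisible by 3."""
    return indel_length % 3 != 0


reference = "ATGGCGGAACTGTCA"                      # invented example sequence
deleted_codon = reference[:6] + reference[9:]      # GAA removed (3 bases)
inserted_base = reference[:3] + "G" + reference[3:]  # one G added (1 base)

print(codons(reference))      # ATG GCG GAA CTG TCA
print(codons(deleted_codon))  # one codon lost, downstream codons unchanged
print(codons(inserted_base))  # every codon after the insertion is different
```

After the 3-base deletion the downstream codons CTG and TCA are still present; after the 1-base insertion none of the original downstream codons survive.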
How do we screen genetic variation? Capillary sequencing
- Also known as Sanger sequencing and chain termination sequencing
- Main idea is that incorporating a dideoxynucleotide (rather than a deoxynucleotide) terminates DNA synthesis during PCR.
- Dideoxynucleotides are at low concentration in reaction mixture, but each base has a different fluorescent dye
- Products passed through a capillary sequencer, and dyes read by a laser. Combination of size and dye reveals sequence
- Laser can read the different wavelengths of the different dyes, which tells you the sequence of letters
How do we screen genetic variation? Next-generation sequencing
Note the Log scale
1 Mbp costs about 1/1,000,000 of the price it did in 2001
Moore’s law relates to the growth in computer processing power. The cost of DNA sequencing followed Moore’s law until 2007, after which it fell faster.
1 million times cheaper - huge change- can do things a lot more effectively
e.g. Illumina sequencing
Illumina sequencing key points
- DNA is fragmented and adaptors added to the ends.
- Fragments are then immobilised on a flow cell via their adaptors, and a process known as bridge amplification generates ~1000 identical sequences in millions of locations on the cell
- Dye-labelled terminating nucleotides are added to the flow cell, then washed away. The process is repeated until about 100bp are read.
- These short reads are then compared to a reference assembly.
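Comparing short reads to a reference can be sketched as sliding each read along the reference and keeping the position with the fewest mismatches. This is a deliberately naive illustration with an invented reference and reads; real aligners use indexed data structures and also handle indels, which this does not.

```python
def best_alignment(read: str, reference: str) -> tuple[int, int]:
    """Return (position, mismatches) for the best ungapped placement of a read.

    Returns (-1, len(read) + 1) if the read is longer than the reference.
    """
    best_pos, best_mismatches = -1, len(read) + 1
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches < best_mismatches:
            best_pos, best_mismatches = pos, mismatches
    return best_pos, best_mismatches


reference = "ACGTTGCATGCCGATAGCTA"                 # invented reference
print(best_alignment("GCATGCCG", reference))    # perfect match
print(best_alignment("GCATGTCG", reference))    # same place, one mismatch
```

A read that maps cleanly but with one mismatch is either a sequencing error or a candidate SNP, which is why many overlapping reads are needed to tell the two apart.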
Comparison of sequencing methods
Capillary Sanger sequencing: ddNTP termination and fluorescent detection; Max. read length (bp): 850; Run time: 1; Gb / run / machine: << 0.001; Pros: Accurate, useful for validation of NGS data; Cons: Low throughput, expensive
Illumina NovaSeq X: Polymerase-based sequencing by synthesis; Max. read length (bp): 2 x 150 (paired end); Run time: 2; Gb / run / machine: 8000; Pros: Massive throughput; Cons: Short reads make assembly challenging
PacBio Revio: Single-molecule real-time sequencing; Max. read length (bp): 15,000-20,000; Run time: 1; Gb / run / machine: 90; Pros: Very high throughput, very long reads; Cons: Less throughput than Illumina
Oxford Nanopore PromethION: Single-molecule real-time sequencing; Max. read length (bp): 10,000-100,000; Run time: 3; Gb / run / machine: 50-110; Pros: Very high throughput, possibly longest reads; Cons: Slightly higher error rate, less throughput than Illumina
Errors during sequencing?
- Illumina errors can be corrected (and caused!) bioinformatically
- Sequencing errors are ~0.1% per read- get another file
- Q scores give an indication of reliability: Q = -10 × log10(P(error)), so Q30 means a 1 in 1,000 chance that the base call is wrong
- With high depth, errors are more obvious, and so can be corrected
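The Q score formula and the idea that depth exposes errors can both be sketched briefly. The pileup below is invented, and the majority-vote `consensus_base` is a simplification for illustration; real variant callers also weigh base qualities and allow for heterozygous sites.

```python
import math
from collections import Counter


def phred_to_error_probability(q: float) -> float:
    """Invert Q = -10 * log10(P): P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Q = -10 * log10(P(error))."""
    return -10 * math.log10(p)


def consensus_base(pileup: str) -> str:
    """Most common base among the reads covering one position."""
    return Counter(pileup).most_common(1)[0][0]


print(phred_to_error_probability(20))  # 1 in 100
print(phred_to_error_probability(30))  # 1 in 1,000
print(consensus_base("GGGGAGGGGG"))    # ten reads, one disagrees
```

With ten reads covering the position, the single A stands out as a likely error; with only one or two reads there would be no way to tell.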
Exome sequencing
- The exome is only a few percent of the genome, yet contains the coding regions of genes.
- Sequencing just the exome is a cost-effective way of analysing a part of the genome that is likely to be important
What is sequence capture?
It works by using biotin-labelled ‘baits’ that bind the target DNA and can then be attached to magnetic beads. The captured DNA is then stripped from the beads and sequenced.
Cataloguing human genetic variation
- Identified > 1 million SNPs from 4 populations
- Showed that recombination happened at ‘hotspots’
- Provided impetus for association studies
- Had profound effect on ability to study selection, evolution and population structure
SNP genotyping
- SNP chips use the principle of primer extension and termination (see sequencing).
- Fragments are annealed to beads on an array
- The two alleles are labelled with different fluorescent probes
- The different genotypes are ‘clustered’ which makes genotype calling very quick
- Some chips type >2 million SNPs for ~£100
- SNPs are chosen from HapMap data. Tag SNPs are highly correlated with other SNPs close by
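How strongly a tag SNP is correlated with a neighbouring SNP is usually summarised as r squared. A rough sketch, assuming genotypes are coded as the number of minor alleles each person carries (0, 1 or 2) and using invented data; correlating genotype counts like this is a simplification of the haplotype-based measure used in practice.

```python
def r_squared(x: list[int], y: list[int]) -> float:
    """Squared Pearson correlation between two SNPs' genotype counts."""
    if len(x) != len(y) or not x:
        raise ValueError("need two non-empty lists of equal length")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("a SNP with no variation has no defined correlation")
    return cov * cov / (var_x * var_y)


# Minor allele counts for six individuals (invented data).
tag_snp = [0, 1, 2, 1, 0, 2]
nearby_snp = [0, 1, 2, 1, 0, 2]    # always travels with the tag SNP
distant_snp = [1, 0, 1, 2, 1, 1]   # unrelated to the tag SNP

print(r_squared(tag_snp, nearby_snp))   # perfectly correlated
print(r_squared(tag_snp, distant_snp))  # uncorrelated
```

When r squared is close to 1, genotyping the tag SNP tells you the genotype at the neighbouring SNP as well, so the chip does not need a probe for both.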
The 1000 genomes project
Much more comprehensive description of human variation
Managed to estimate mutation rates for the first time, and detected regions under selection
Loss-of-function mutants detected and shown to be common in all of us
The ENCODE Project
- An ambitious (and expensive) attempt to understand the function of the parts of the genome that are non-coding (i.e. most of the genome)
- Sequences that are conserved across species are more likely to be functional
- ENCODE reported that about 80% of the genome is functional. Some scientists have disputed the 80% figure, saying the true value is much lower