Genomes Flashcards
What was the first genome that was sequenced?
The single-stranded DNA bacteriophage φX174. Its genome is 5386 bases long.
Explain the evolution of genome sequencing
Once the importance of sequencing was recognized, more effort went into automating the techniques. A major breakthrough was the replacement of autoradiography of gels (with each nucleotide occupying a separate lane) by four fluorescent dyes, allowing a single reaction. This technique was automated in a machine that supported a generation of sequencing projects. More recently, next-generation sequencing techniques have been developed.
Explain high-throughput sequencing
The initial draft of the first human genome took 10 years and cost US$3×10^9. More recently, 250 Gbp can be generated in a week. The largest dedicated institution in the field, the Beijing Genomics Institute, has about 200 state-of-the-art sequencing instruments, each of which can sequence 25×10^9 bp per day (one human genome at over 8× coverage). Running at full capacity, they can produce about 10000 human genomes per year. Advances in technology will continue to increase these rates.
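The figures above can be checked with a little arithmetic. This is only a rough sketch: the genome size of 3×10^9 bp and the 365-day year are assumptions, not values given in the card, and the resulting coverage per genome is derived rather than quoted.

```python
# Back-of-the-envelope check of the throughput figures quoted above.
# Assumptions (not stated in the card): a human genome of about 3 x 10^9 bp
# and instruments running 365 days a year.
GENOME_BP = 3e9
BP_PER_INSTRUMENT_PER_DAY = 25e9
INSTRUMENTS = 200
GENOMES_PER_YEAR = 10_000

# Coverage one instrument delivers for one genome in one day.
coverage_per_instrument_day = BP_PER_INSTRUMENT_PER_DAY / GENOME_BP

# Total bases from all instruments in a year.
bases_per_year = BP_PER_INSTRUMENT_PER_DAY * INSTRUMENTS * 365

# Coverage per genome implied by "about 10000 genomes per year".
implied_coverage = bases_per_year / (GENOMES_PER_YEAR * GENOME_BP)

print(f"{coverage_per_instrument_day:.1f}x per instrument per day")
print(f"{bases_per_year:.3e} bases per year")
print(f"about {implied_coverage:.0f}x per genome at 10000 genomes per year")
```

Under these assumptions, the quoted 10000 genomes per year corresponds to sequencing each genome many times over, not just once.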
There are two aspects of a large-scale sequencing project:
> Generating the raw data: most methods sequence long DNA molecules by fragmenting them and partially sequencing the pieces.
> Assembling the sequences: the short sequences must be assembled into the whole sequence by using overlaps between the individual fragments. Assembly is affected by the typical length of an individual short sequence, called the read length. The process is: gather the short sequences, search for overlapping regions among them, and assemble. Computer programs carry out this whole process.
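The gather, overlap and assemble steps can be sketched as a toy greedy assembler. This is an illustration only, not how production assemblers work: it assumes error-free reads from one strand, and the function names and example reads are made up for this card.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if the longest such overlap is shorter than min_len.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    seqs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(seqs):
            for j, b in enumerate(seqs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return seqs  # no overlaps left: what remains are the contigs
        merged = seqs[best_i] + seqs[best_j][best_n:]
        seqs = [s for k, s in enumerate(seqs) if k not in (best_i, best_j)]
        seqs.append(merged)


# Hypothetical reads taken from the made-up sequence GATTACAGGCTTCA.
reads = ["GATTACA", "TACAGGC", "AGGCTTC", "GCTTCA"]
print(greedy_assemble(reads))  # ['GATTACAGGCTTCA']
```

Longer reads give longer, less ambiguous overlaps, which is why read length matters so much for assembly.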
Explain De novo sequencing
It is the determination of the complete sequence of the first genome from a species. The total DNA must be fragmented into 200 bp long fragments. Single-end or paired-end sequencing is used to produce sequences; either way, the number of bases reported is the read length. With paired-end reads, a number of unknown bases lie between the two reads of a pair. Next, the sequences are assembled by identifying overlapping fragments to create contigs (contiguous sequences). The contigs are further assembled into super-contigs, or scaffolds. Assembly requires enough reads to cover the entire genome, with enough replicates to detect sequencing errors. Coverage is the ratio of the total number of bases sequenced during the project to the genome length; for novel genomes a coverage of 30x or 50x is required. For prokaryotic genomes, assembly can be accomplished accurately by computer, but for eukaryotic genomes it is a computationally intensive problem.
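The coverage ratio defined above is easy to compute. A minimal sketch, assuming every read has the same length. The 5 Mbp genome and 100-base reads in the example are hypothetical values chosen for illustration.

```python
import math


def coverage(num_reads, read_length, genome_length):
    """Coverage = total bases sequenced / genome length."""
    return num_reads * read_length / genome_length


def reads_needed(target_coverage, read_length, genome_length):
    """Smallest number of reads that reaches the target coverage."""
    return math.ceil(target_coverage * genome_length / read_length)


# Hypothetical example: a 5 Mbp genome sequenced with 100-base reads.
print(reads_needed(30, 100, 5_000_000))       # 1500000
print(coverage(1_500_000, 100, 5_000_000))    # 30.0
```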
Explain resequencing
Once a reference genome for a species is available (e.g. the human genome), the genome sequences of other individuals of that species are easier to determine. Resequencing does not assemble fragment sequences de novo; it maps them onto the reference genome. This is fairly straightforward, except for highly repetitive sequences. Coverage must be adequate so that the error rate of sequence determination is less than the frequency of natural variation (SNPs). In sequencing cancer cells, it is preferable to sequence normal cells from the same patient as well, rather than to infer the genome changes arising from the disease by comparison with the reference sequence.
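Mapping reads onto a reference and reading off the differences can be sketched as follows. This is a toy under strong assumptions: reads are placed at the offset with the fewest mismatches (no gaps, no reverse strand), and a variant is reported only when the majority base at a position differs from the reference. The function names, sequences and `min_depth` threshold are all made up for illustration. Positions are 0-based.

```python
from collections import Counter, defaultdict


def best_position(read, reference):
    """Offset in the reference where the read has the fewest mismatches."""
    def mismatches(pos):
        window = reference[pos:pos + len(read)]
        return sum(r != g for r, g in zip(read, window))
    return min(range(len(reference) - len(read) + 1), key=mismatches)


def call_variants(reads, reference, min_depth=2):
    """Return (position, reference base, observed base) for each variant."""
    pileup = defaultdict(Counter)  # position -> counts of bases seen there
    for read in reads:
        start = best_position(read, reference)
        for offset, base in enumerate(read):
            pileup[start + offset][base] += 1
    variants = []
    for pos in sorted(pileup):
        base, count = pileup[pos].most_common(1)[0]
        if base != reference[pos] and count >= min_depth:
            variants.append((pos, reference[pos], base))
    return variants


reference = "ACGTTGCATGCCGATAGC"
reads = [
    "ACGTTGCA",  # matches the reference exactly
    "GTTGCAAG",  # carries the real variant (T to A at position 8)
    "GCAAGCCG",  # carries the real variant
    "AGCCGATA",  # carries the real variant
    "ACGATGCA",  # one sequencing error at position 3
]
print(call_variants(reads, reference))  # [(8, 'T', 'A')]
```

The lone sequencing error at position 3 is outvoted by the other reads covering it, which is the reason coverage must be high enough for errors to be rarer than real variation.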
What is one goal of resequencing?
To determine how the genome of an individual varies from a reference genome. By correlating these variations with phenotype, it is possible to identify the genetic origin of a lesion.
Explain exome sequencing
It is the sequencing of the exome (the protein-coding regions). Many inherited diseases result from loss of activity of particular proteins. The loss of activity frequently arises from a specific mutation in the sequence coding for the protein. To identify such a mutation, it is not necessary to sequence the entire genome.
Discuss the protein-coding regions in the human genome
The central dogma is DNA to mRNA to protein. Protein-coding genes are transcribed to mRNA; after processing, ribosomes translate the mRNA into polypeptide chains. The human genome contains about 23000 protein-coding genes, which make up 2-3% of the genome. They are distributed unevenly across the chromosomes: there are gene-poor regions, such as the subtelomeric regions of all chromosomes and chromosomes 18 and X, and gene-rich regions, such as chromosomes 19 and 22. Protein-coding genes consist of exons (expressed regions) interrupted by introns (regions that are spliced out of the mRNA and not translated into protein). Splice signal sites mark the intron-exon junctions. The average exon is about 200 bp long. Protein-coding genes differ greatly in size, mainly because intron size is so variable. For example, the gene for insulin is 1.7 kb, the gene for the low-density lipoprotein receptor is 5.45 kb and the dystrophin gene is 2400 kb. Many protein-coding genes appear in multiple copies, either identical or diverged into families. For instance, humans have about 400 functional, related olfactory receptor genes, and some animals have more.
Protein-coding genes appear on both strands. In many cases, unrelated genes are fairly well separated. However, some genes partially overlap, and some entire genes lie on one strand within an intron of another gene.
Gene transcription may be under the control of cis-regulatory elements near the gene (upstream or downstream). There are also trans-regulatory elements that occur elsewhere in the genome, even on different chromosomes.
Closely related genes often occur in the same region because of a common mechanism of evolution: gene duplication followed by divergence. In some cases, multiple identical copies of a gene appear on different chromosomes, e.g. the gene for ubiquitin.
What is the proteome?
The amino acid sequences of the proteins expressed.
Ideally, after determining a genome sequence it would be possible to infer the proteome, but the genome-proteome relationship is not one-to-one. What mechanisms complicate this relationship?
> In eukaryotes, alternative splicing generates variety at the protein level from a single gene sequence. A mature mRNA is formed from different choices of exons from a gene, usually, but not always, in the order in which they appear in the gene. About 95% of multi-exon protein-coding genes in the human genome produce splice variants. There are also cases in which multiple promoters lead to transcription of parts of the same region into different mRNAs. If the reading frames of these transcripts are not in phase, they encode different proteins.
> In prokaryotes and eukaryotes, RNA editing can produce one or more proteins whose amino acid sequences differ from those predicted from the genome sequence. E.g. in the wine grape, the mRNAs that arise from mitochondrial protein-coding genes are subject to multiple C to U editing events, most of which alter the encoded amino acid. In humans, nuclear protein-coding genes are subject to editing that changes adenosine to inosine. Editing, and therefore the amino acid sequence, can be tissue-specific.
> There are also post-translational modifications, e.g. the binding of prosthetic groups, or the phosphorylation of side chains to regulate protein activity.
> Special combinatorial splicing of DNA produces the great diversity of antibodies.
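To see how alternative splicing and reading frame turn one gene into several proteins, here is a small sketch. It makes simplifying assumptions: the first and last exons are always kept, internal exons are kept or skipped in genomic order, and translation starts at the first base. The exon sequences are invented. The codon table is the standard genetic code.

```python
from itertools import combinations, product

# Standard genetic code, codons enumerated with base order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(cds):
    """Translate from the first base until a stop codon or the end."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def isoforms(exons):
    """Yield spliced sequences: first and last exon kept, internal exons optional."""
    first, *internal, last = exons
    for k in range(len(internal) + 1):
        for chosen in combinations(internal, k):  # keeps genomic order
            yield first + "".join(chosen) + last


# Made-up exons: the 6-base exon keeps the frame, the 4-base exon shifts it.
EXONS = ["ATGGCT", "CCAGGT", "GAAC", "TGAAGTGA"]
for mrna in isoforms(EXONS):
    print(mrna, translate(mrna))
# ATGGCTTGAAGTGA MA
# ATGGCTCCAGGTTGAAGTGA MAPG
# ATGGCTGAACTGAAGTGA MAELK
# ATGGCTCCAGGTGAACTGAAGTGA MAPGELK
```

Including or skipping the 4-base exon changes the frame in which the last exon is read, so the same DNA yields a different C-terminal sequence or an early stop.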
Discuss the regions of the genome that produce non-protein-coding RNA molecules
RNAs other than mRNAs include tRNAs, rRNAs (components of ribosomes), microRNAs (miRNAs), small interfering RNAs (siRNAs), which regulate gene expression, and piwi-interacting RNAs (piRNAs), which have several functions, including protecting genome integrity by silencing transposable elements. About 3000 genes code for RNAs other than mRNA. The RNA-ome is much richer than was once suspected. Most non-coding RNAs, apart from tRNAs and rRNAs, are involved in the control of gene expression.
Discuss the regions that contain pseudogenes
Pseudogenes are degenerate genes that have mutated so far from their original sequences that the polypeptides they encode would be non-functional. Some are processed pseudogenes: mRNA was picked up by a virus, reverse transcribed (mRNA to cDNA) and inserted into the genome. We can recognize these because their introns have been lost. Processed pseudogenes lack promoters, so they are not expressed as the original proteins. However, some are transcribed and play a regulatory role by competing with mRNAs for binding to miRNAs. Some pseudogenes retain function: they are rescued by translational read-through of a stop codon.
Discuss the regions that are responsible for the regulation of transcription
Other regions contain binding sites for ligands responsible for the regulation of transcription, e.g. promoters. Much of the genome is dedicated to control, including regulatory sites and all the encoded proteins and RNAs that have regulatory functions.
Discuss the repetitive elements of unknown function
They account for large fractions of the genome. Long and short interspersed elements (LINEs and SINEs) account for 21% and 13% of the genome, respectively. Highly repeated sequences (minisatellites and microsatellites) may appear in tens or even hundreds of thousands of copies, in aggregate amounting to 15% of the genome.
Provide examples of repetitive elements in the human genome: moderately and highly repetitive DNA
Moderately repetitive DNA includes:
> Functional: dispersed gene families that are created by gene duplication followed by divergence, e.g. actin and globin. Also tandem gene family arrays: rRNA genes (250 copies), tRNA genes (50 sites with 10-100 copies each in human) and histone genes in many species.
> Without known function: SINEs (e.g. Alu), which are 200-300 bp long, present in 100000s of copies (300000 for Alu) and scattered rather than in tandem repeats. Also LINEs, which are 1-5 kb long with 10-100000 copies in a genome, and pseudogenes.
Highly repetitive DNA includes:
> Minisatellites: composed of 14-500 bp segments, 1-5 kb long; there are many different ones, scattered throughout the genome.
> Microsatellites: composed of repeats of up to 13 bp, 100s of kb long, 10^6 copies/genome; they make up most of the heterochromatin around the centromere.
> Telomeres: 250-1000 repeats of a short unit (usually 6 bp) at the end of each chromosome.
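Tandem repeats with short units, such as microsatellites and telomere repeats, can be spotted with a regular expression. A minimal sketch, assuming exact (unmutated) copies. The function name and the test sequence are made up, and the limit of 13 bases per unit simply follows the microsatellite definition above. TTAGGG is the human telomere repeat unit.

```python
import re


def tandem_repeats(seq, max_unit=13, min_copies=3):
    """Find exact tandem repeats; return (start, unit, copies) tuples.

    The lazy quantifier makes the regex try the shortest repeat unit first,
    and the backreference \\1 requires the unit to be repeated in tandem.
    Start positions are 0-based.
    """
    pattern = re.compile(
        r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1)
    )
    return [
        (m.start(), m.group(1), len(m.group(0)) // len(m.group(1)))
        for m in pattern.finditer(seq)
    ]


# Made-up sequence: a CA dinucleotide repeat, a spacer, then telomere-like repeats.
seq = "GATC" + "CA" * 6 + "GTCGA" + "TTAGGG" * 4
print(tandem_repeats(seq))  # [(4, 'CA', 6), (21, 'TTAGGG', 4)]
```

Real repeat finders also tolerate mismatches between copies, which this exact-match sketch does not.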