Genomes Flashcards

Question 1

Q

What was the first genome to be sequenced?

Answer

A

The bacteriophage φX174 (phi X 174), a single-stranded DNA virus. Its genome is 5386 bases long.

Question 2

Q

Explain the evolution of genome sequencing

Answer

A

Once the importance of sequencing was recognized, greater efforts were made to automate the techniques. A major breakthrough was the replacement of autoradiography of gels (with each nucleotide occupying a separate lane) by 4 fluorescent dyes, one per nucleotide, allowing a single reaction. This technique was automated in a machine that supported a generation of sequencing projects. More recently, next-generation sequencing techniques have been developed.

Question 3

Q

Explain high-throughput sequencing

Answer

A

The initial draft of the first human genome took 10 years to sequence and cost US$3 x 10^9. More recently, 250 Gbp can be generated in a week. The largest dedicated institution in the field, the Beijing Genomics Institute, has about 200 state-of-the-art sequencing instruments, each of which can sequence 25 x 10^9 bp per day (one human genome at over 8x coverage). Running at full capacity, they can produce about 10000 human genomes per year. Advances in the technology will continue to accelerate.
There are two aspects of a large-scale sequencing project:
> Generating the raw data- most methods sequence long DNA molecules by fragmenting them and partially sequencing the pieces.
> Assembling the sequences- the short sequences must be assembled into the whole sequence by using overlaps between the individual fragments. Assembly is affected by the typical length of an individual short sequence, called the read length. The process is: gather the short sequences, search for overlapping regions among them, and assemble. Computer programs carry out this whole process.
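
The gather, search-for-overlaps and assemble steps can be illustrated with a small greedy sketch. This is a toy, not how production assemblers work: the function names, the `min_len` threshold and the example reads are all hypothetical, and it assumes error-free reads from a single strand with no repeats.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if the best overlap is shorter than min_len.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the two sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)


# Three made-up reads that overlap by 4 bases each.
reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
print(greedy_assemble(reads))  # ['ATGGCGTACCTGA']
```

Longer reads give longer, less ambiguous overlaps, which is one reason read length matters for assembly.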

Question 4

Q

Explain De novo sequencing

Answer

A

It is the determination of the complete sequence of the first genome from a species. The total DNA is fragmented into pieces about 200 bp long. Single-end or paired-end sequencing is used to produce sequences; either way, the number of bases reported is the read length. With paired-end reads, a number of unknown bases lie between the two reads of a pair. Next, the sequences are assembled by identifying overlaps between fragments to create contigs (contiguous sequences). The contigs are further assembled into super-contigs or scaffolds. Assembly requires enough reads to cover the entire genome (coverage), with enough replicates to detect sequencing errors. Coverage is the ratio of the total number of bases sequenced during the project to the genome length; for novel genomes a coverage of 30x or 50x is required. For prokaryotic genomes, computers accomplish the assembly accurately, but for eukaryotic genomes it is a computationally intensive problem.
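
The coverage definition above is simple arithmetic, sketched here. The function names are hypothetical, and the 5 Mbp genome and 100-base read length are assumed illustrative values, not figures from these notes.

```python
def coverage(n_reads, read_length, genome_length):
    """Coverage = total bases sequenced / genome length."""
    return n_reads * read_length / genome_length


def reads_needed(target_coverage, read_length, genome_length):
    """Smallest number of reads that reaches the target coverage."""
    total_bases = target_coverage * genome_length
    return -(-total_bases // read_length)  # integer ceiling division


# Assumed example: a 5 Mbp bacterial genome sequenced with 100-base reads.
genome_length = 5_000_000
read_length = 100

print(reads_needed(30, read_length, genome_length))       # 1500000
print(coverage(1_500_000, read_length, genome_length))    # 30.0
```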

Question 5

Q

Explain resequencing

Answer

A

Once a reference genome for an individual of a species is available (e.g. the human genome), the genome sequences of other individuals of the species are easier to determine. Resequencing does not assemble fragment sequences de novo; it maps them onto the reference genome. This is fairly straightforward, except for highly repetitive sequences. Coverage must be adequate so that the error rate of sequence determination is less than the frequency of natural variation (SNPs). When sequencing cancer cells, it is preferable to sequence normal cells from the same patient rather than to infer the genome changes arising from the disease by comparison with the reference sequence.
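
A minimal sketch of mapping reads onto a reference and reporting single-base differences follows. The function names, the mismatch limit, the `min_support` threshold and the sequences are all hypothetical; real read mappers use indexed search and quality scores rather than this brute-force scan.

```python
from collections import Counter, defaultdict


def best_position(read, reference, max_mismatches=1):
    """Return the reference offset where the read fits with the fewest mismatches.

    Returns None if no offset has at most max_mismatches mismatches.
    """
    best = None
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(r != g for r, g in zip(read, window))
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (pos, mismatches)
    return None if best is None else best[0]


def call_variants(reads, reference, min_support=2):
    """Pile up mapped reads and report sites where the majority base differs."""
    pileup = defaultdict(Counter)
    for read in reads:
        pos = best_position(read, reference)
        if pos is None:
            continue  # unmappable read
        for offset, base in enumerate(read):
            pileup[pos + offset][base] += 1
    variants = {}
    for site, counts in sorted(pileup.items()):
        base, n = counts.most_common(1)[0]
        if base != reference[site] and n >= min_support:
            variants[site] = (reference[site], base)
    return variants


reference = "ACGTTGCAAGCTTAGC"
individual = "ACGTTACAAGCTTAGC"  # differs from the reference at index 5 (G to A)
reads = [individual[0:8], individual[2:10], individual[4:12], individual[8:16]]
print(call_variants(reads, reference))  # {5: ('G', 'A')}
```

Requiring several reads to support a variant (`min_support`) is a crude stand-in for the point made above: coverage must be high enough that sequencing errors are not mistaken for natural variation.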

Question 6

Q

What is one goal of resequencing?

Answer

A

The goal is to determine how the genome of an individual varies from a reference genome. By correlating these variations with the phenotype, it is possible to identify the genetic origin of a lesion.

Question 7

Q

Explain exome sequencing

Answer

A

It is the sequencing of the exome (the protein-coding regions). Many inherited diseases result from loss of activity of particular proteins. The loss of activity frequently arises from a specific mutation in the sequence coding for the protein. To identify such a mutation, it is not necessary to sequence the entire genome.

Question 8

Q

Discuss the protein-coding regions in the human genome

Answer

A

The central dogma is DNA to mRNA to protein. Protein-coding genes are transcribed to mRNA; after processing, ribosomes translate the mRNA into polypeptide chains. The human genome contains about 23000 protein-coding genes, which make up 2-3% of the genome. They are distributed unevenly across the chromosomes: there are gene-poor regions, such as the subtelomeric regions of all chromosomes and chromosomes 18 and X, and gene-rich regions, such as chromosomes 19 and 22. Protein-coding genes consist of exons (expressed regions) interrupted by introns (regions that are spliced out of the mRNA and not translated into protein). Splice signal sites mark the intron-exon junctions. The average exon is about 200 bp long. Protein-coding genes differ greatly in size, mainly because of the variability of intron size. For example, the gene for insulin is 1.7 kb, the gene for the low-density lipoprotein receptor is 5.45 kb and the dystrophin gene is 2400 kb. Many protein-coding genes appear in multiple copies, either identical or diverged into families. For instance, humans have about 400 functional, related olfactory receptor genes, and some animals have more.
Protein-coding genes appear on both strands. In many cases, unrelated genes are fairly well separated. However, some genes partially overlap, and there are entire genes that lie on one strand within an intron of another gene.
Gene transcription may be under the control of cis-regulatory elements near the gene (upstream or downstream). There are also trans-regulatory elements that occur elsewhere in the genome, even on different chromosomes.
Often closely related genes occur in the same region because of a common mechanism of evolution called gene duplication followed by divergence. In some cases multiple identical copies of a gene may appear on different chromosomes, e.g. the gene for ubiquitin.

Question 9

Q

What is the proteome?

Answer

A

The set of amino acid sequences of the proteins expressed.

Question 10

Q

Ideally after determining a genome sequence it would be possible to infer the proteome, but there is variety within the genome-proteome relationship. What are the different mechanisms that complicate this relationship?

Answer

A

> In eukaryotes, alternative splicing generates variety at the protein level from a single gene sequence. A mature mRNA is formed from different choices of exons from a gene, not always in the order in which they appear in the gene. About 95% of multi-exon protein-coding genes in the human genome produce splice variants. There are also cases where multiple promoters lead to transcription of parts of the same region into different transcripts. If the reading frames of these transcripts are not in phase, they encode different proteins.
> In prokaryotes and eukaryotes, RNA editing can produce one or more proteins whose amino acid sequences differ from those predicted from the genome sequence. E.g. in the wine grape, the mRNAs that arise from the mitochondrial protein-coding genes are subject to multiple C to U editing events, most of which alter the amino acid. Also, in humans, transcripts of nuclear protein-coding genes are subject to editing that changes adenosine to inosine. The editing, and therefore the amino acid sequence, can be tissue-specific.
> Post-translational modifications, e.g. the binding of prosthetic groups or the phosphorylation of side chains to regulate protein activity.
> Special combinatorial splicing of DNA, which produces a highly diverse set of antibodies.
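
To see how exon choice multiplies the number of possible transcripts and can shift the reading frame, here is a small sketch. It makes two simplifying assumptions that are not from the notes: the first and last exons are always kept, and exons stay in gene order. The function name and the exon sequences are made up.

```python
from itertools import combinations


def splice_variants(exons):
    """List (mRNA, phase) pairs for every in-order choice of internal exons.

    phase is the position of the last exon's first base modulo 3, i.e. the
    reading frame in which the last exon is read.
    """
    first, *internal, last = exons
    variants = []
    for k in range(len(internal) + 1):
        for chosen in combinations(internal, k):  # preserves gene order
            upstream = first + "".join(chosen)
            variants.append((upstream + last, len(upstream) % 3))
    return variants


# Made-up exons: including the 2-base exon "GG" shifts the downstream frame.
exons = ["ATGGCC", "AAA", "GG", "TTTTAA"]
for mrna, phase in splice_variants(exons):
    print(mrna, phase)
```

With n internal exons this simple model gives 2^n candidate transcripts, and variants whose downstream exons fall in different phases would be translated into different amino acid sequences.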

Question 11

Q

Discuss the regions of the genome that produce non-protein-coding RNA molecules

Answer

A

Non-protein-coding RNAs (i.e. RNAs other than mRNAs) include tRNAs, rRNAs (components of ribosomes), microRNAs or miRNAs, small interfering RNAs or siRNAs (which regulate gene expression) and piwi-interacting RNAs or piRNAs, which have several functions, including protecting genome integrity by silencing transposable elements. There are about 3000 genes that code for RNAs other than mRNA. The RNA-ome is much richer than once suspected. Apart from tRNAs and rRNAs, most non-coding RNAs are involved in the control of gene expression.

Question 12

Q

Discuss the regions that contain pseudogenes

Answer

A

Pseudogenes are degenerate genes that have mutated so far from their original sequences that the polypeptides they encode are non-functional. In some cases, mRNA is picked up by viruses and reverse transcribed (mRNA to cDNA) to give processed pseudogenes; we can tell because the introns have been lost. Processed pseudogenes lack promoters, so they are not expressed as the original proteins. However, they are sometimes transcribed and play a regulatory role by competing with mRNAs for binding to miRNAs. Some pseudogenes retain their function: they are rescued by translational read-through of a stop codon.

Question 13

Q

Discuss the regions that are responsible for the regulation of transcription

Answer

A

Other regions contain binding sites for ligands responsible for the regulation of transcription, e.g. promoters. Much of the genome is dedicated to control, including regulatory sites and all the encoded proteins and RNAs that have regulatory functions.

Question 14

Q

Discuss the repetitive elements of unknown function

Answer

A

They account for large fractions of the genome. Long and short interspersed elements (LINEs and SINEs) account for 21% and 13% of the genome, respectively. Highly repeated sequences (minisatellites and microsatellites) may appear as tens or even hundreds of thousands of copies, in aggregate amounting to 15% of the genome.

Question 15

Q

Provide examples of repetitive elements in the human genome- moderately and highly repetitive DNA

Answer

A

Moderately repetitive DNA includes:
> Functional- dispersed gene families that are created by gene duplication followed by divergence, e.g. actin and globin. Also tandem gene family arrays: rRNA genes (250 copies), tRNA genes (50 sites with 10-100 copies each in human) and, in many species, histone genes.
> Without known function- SINEs (e.g. Alu), 200-300 bp long, in 100000s of copies (for Alu it is 300000) at scattered locations (not tandem repeats). Also LINEs (1-5 kb long, 10-100000 copies in a genome) and pseudogenes.

Highly repetitive DNA includes:
> Minisatellites- composed of 14-500 bp segments, 1-5 kb long. There are many different ones, scattered throughout the genome.
> Microsatellites- composed of repeats of up to 13 bp, 100s of kb long, 10^6 copies/genome. They make up most of the heterochromatin around the centromere.
> Telomeres- 250-1000 repeats of a short unit (usually 6 bp) at the end of each chromosome.
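
Short tandem repeats like these can be spotted with a back-referencing regular expression, sketched below. The function name, the `min_copies` threshold and the test sequence are hypothetical; the 13 bp unit limit follows the microsatellite description above. Dedicated repeat finders also tolerate imperfect copies, which this sketch does not.

```python
import re


def find_tandem_repeats(seq, max_unit=13, min_copies=3):
    """Find perfect tandem repeats of a short unit.

    Returns a list of (start, unit, copies) tuples.
    """
    # ([ACGT]{1,max_unit}?) captures the shortest possible unit;
    # \1{n,} then requires at least n further back-to-back copies of it.
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1))
    hits = []
    for match in pattern.finditer(seq):
        unit = match.group(1)
        hits.append((match.start(), unit, len(match.group(0)) // len(unit)))
    return hits


seq = "GGATT" + "CA" * 5 + "TTGAC"  # a (CA)5 repeat between made-up flanks
print(find_tandem_repeats(seq))  # [(5, 'CA', 5)]
```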

Question 16

Q

Define transposable elements

Answer

A

They are skittish segments of DNA that move around the genome and are found in all organisms.

Question 17

Q

What are the different types of transposable elements?

Answer

A

Retrotransposons (class I)- replicate via an RNA intermediate, using reverse transcriptase (RTase). They use the copy-and-paste mode: replication leaves the original copy in place and inserts a new copy elsewhere. Many of them are degenerate retroviruses.
Transposons (class II)- produce DNA copies without an RNA intermediate stage. They encode an enzyme called transposase, which recognizes sequences within the transposon, cuts it out and inserts it elsewhere. They use the cut-and-paste mode, which moves the transposon from one location to another. The excision is sloppy and leaves a mutation at the original site. Sometimes a bit of the surrounding sequence adheres to and accompanies the transposed material.
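
The difference between the two modes can be shown on a string model of a genome. This is only a sketch: the function names and sequences are made up, and it ignores the RNA intermediate, the sloppy excision and any flanking sequence carried along.

```python
def copy_and_paste(genome, start, end, target):
    """Class I style: the element stays in place and a copy is inserted at target."""
    element = genome[start:end]
    return genome[:target] + element + genome[target:]


def cut_and_paste(genome, start, end, target):
    """Class II style: the element is excised, then inserted at target.

    target is an index into the genome after excision.
    """
    element = genome[start:end]
    remainder = genome[:start] + genome[end:]
    return remainder[:target] + element + remainder[target:]


genome = "aaaa" + "TETE" + "cccc"  # upper case marks the element
print(copy_and_paste(genome, 4, 8, 12))  # aaaaTETEccccTETE
print(cut_and_paste(genome, 4, 8, 8))    # aaaaccccTETE
```

Copy-and-paste lengthens the genome with every event, which fits the high copy numbers of retrotransposon-derived repeats; cut-and-paste leaves the copy number unchanged.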

Question 18

Q

Transposable elements can replicate, therefore they are related to some of the types of repetitive sequences found in the genome. What are these sequences?

Answer

A

Mammalian genomes contain retrotransposons called LINEs and SINEs. LINEs are 1-5 kb long and occur in 10 to tens of thousands of copies; the most common LINE is L1, which appears around 20000 times in the genome. SINEs are 200-300 bp long and occur in hundreds of thousands of copies at scattered locations; the most common SINE is the Alu element, which is 280 bp long and present in 300000 copies in the genome. Together, L1 and Alu make up 7% of the human genome. LINEs encode a reverse transcriptase and can replicate autonomously, but SINEs are too short to encode their own RTase, so they depend on LINEs or other sources of the required activities for replication.
Transposons contain inverted repeats at their ends, which are targets of the excision machinery. Replication may occur in the cut-and-paste mode. If two equivalent transposons are nearby they can move a whole segment including all the genetic material between them. The transfer of multiple genes to a plasmid is a common mechanism of generation of antibiotic resistance in bacteria.

Question 19

Q

List the biological effects of transposable elements

Answer

A

> Sequence broadcasting- multiple copies of a sequence element may be distributed to various locations in the genome (genetic markers).
> Altering properties of genes- if a fragment of sequence is inserted into a coding region, it can render the gene product non-functional, creating a knockout effect. If a segment is inserted near a gene, it may affect the gene's regulation or alter its splicing pattern (TEs at promoters). If a segment is inserted in an intron, it can affect the rate of transcription by slowing down the RNA polymerase as it passes through.
> Acting as an important engine of evolution- TEs provide a mechanism for gene evolution by gene fusion or exon shuffling. Transposable element insertion can cause species-specific alternative splicing patterns, which can produce new isoforms. Variant splicing can also lead to disease: it can change the reading frame, causing a truncated protein.
> Causing chromosomal rearrangements- these can include inversions, translocations, transpositions and duplications, perhaps through mispairing of chromosomes during cell division. The deletions leading to Prader-Willi and Angelman syndromes are associated with a mutation in the sequence of a nearby transposable element.
> Leakage of epigenetic modification- from the landlord's point of view, TEs are squatters. They make up 70% of the maize genome. They clutter up the DNA, and eukaryotes have to defend themselves against the expression of TEs. They do this by methylating the TEs or by using siRNAs. Some cancers and other diseases that lead to hypomethylation of DNA can cause transcriptional reactivation of some TEs. Methylation that silences TEs also cuts down on the mobility of those TEs that require transcription in order to move. However, the methylation can leak into and affect neighbouring genes.