Chapter 21: Genomics, Bioinformatics, and Proteomics Flashcards
Which program in the Human Genome Project was designed to ensure that personal genetic information would not be used in discriminatory ways?
ELSI
The Ethical, Legal, and Social Implications (ELSI) program was put into place to study social implications arising from the Human Genome Project.
Which of the following is a characteristic of the human genome?
Larger and more intron-rich genes than in genomes of invertebrates
Both gene size and intron content tend to increase with complexity of the organism.
The Human Genome Project, which got under way in 1990, is an international effort to ________.
construct a physical map of the billions of base pairs in the human genome
Compared with eukaryotic chromosomes, bacterial chromosomes are ________.
small, with high gene density
The analysis of proteins and enzymatic pathways in cells is known as ________.
metabolomics
This is the analysis of proteins and enzymatic pathways involved in cell metabolism.
Compared with prokaryotic chromosomes, eukaryotic chromosomes are ________.
large, linear, less densely packed with protein-coding genes, mainly organized in single gene units with introns
Most of the bacterial genomes described in the text have fewer than ________.
10,000 genes
The study of genomic data collected from environmental samples is called ________.
metagenomics
In metagenomics, genomes from entire communities of microorganisms are sequenced. The DNA comes from environmental samples of air, water, and soil.
The human genome contains approximately 20,000 protein-coding genes, yet it has the capacity to produce several hundred thousand gene products. What can account for the vast difference in gene number and product number?
Alternative splicing occurs.
Proteomics is the ________.
process of defining the complete set of proteins encoded by a genome
One major difference between prokaryotic and eukaryotic genes is that eukaryotic genes can contain internal sequences, called ________, that get removed in the mature message.
introns
How does shotgun cloning differ from the clone-by-clone method?
No genetic or physical maps of the genome are needed to begin shotgun cloning.
Shotgun cloning randomly sequences clones with no prior knowledge of their location in the genome.
Understanding why the chromosome is broken into fragments
Sequencing machines cannot analyze sequences that are more than about 800–1,000 bases long. Therefore, the chromosome must be broken into fragments before any sequencing can take place.
DNA cloning using plasmids.
A typical DNA sequencing reaction requires about 1 microgram of DNA, so the amplification of DNA through cloning is a crucial step in shotgun sequencing. One type of DNA cloning involves plasmids. A plasmid is a small, circular DNA molecule found in bacteria in addition to the bacterial chromosome. Each time a bacterium reproduces, it replicates each of its plasmids.
To clone DNA using plasmids, molecular biologists insert DNA fragments into plasmids and then introduce the plasmids into bacteria. Because bacteria reproduce so rapidly, they can make more than a million copies of a DNA fragment in less than 24 hours.
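The "more than a million copies in less than 24 hours" figure can be checked with simple arithmetic. The sketch below assumes idealized exponential growth in which every cell division doubles the number of fragment copies. The 20-minute doubling time is an illustrative assumption, not a number from these notes, and plasmid copy number per cell is ignored.

```python
import math

def doublings_needed(target_copies, starting_copies=1):
    """Number of doublings required to reach at least target_copies."""
    return math.ceil(math.log2(target_copies / starting_copies))

def hours_needed(target_copies, doubling_minutes=20.0):
    """Time to reach target_copies; doubling_minutes is an assumed value."""
    return doublings_needed(target_copies) * doubling_minutes / 60.0

n = doublings_needed(1_000_000)
print(n, "doublings give", 2 ** n, "copies")
print("hours at the assumed doubling time:", hours_needed(1_000_000))
```

Under these assumptions the million-copy mark is reached well inside a day, which is consistent with the claim above.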
Why is overlap between the fragment sequences important?
Why must the fragment sequences overlap?
Overlap enables the computer to match up the fragments and determine how they fit together.
Steps in shotgun sequencing:
What are the steps in the shotgun approach to whole-genome sequencing?
1) Multiple copies of the same chromosome are prepared.
2) The chromosome copies are broken into 1-kb fragments.
3) The 1-kb fragments are cloned into plasmids.
4) The plasmids are sequenced.
5) A computer combines the fragment sequences.
Transcribing the 1-kb fragments into RNA is not a step in the shotgun approach; the cloned fragments are sequenced directly as DNA (for example, by the Sanger method).
In shotgun sequencing, the DNA from many copies of an entire chromosome is cut into fragments.
The fragments are inserted into plasmids and cloned in bacteria. Plasmid DNA is isolated from the bacteria, purified, and sequenced. Finally, a computer assembles the fragment sequences into the continuous sequence of the whole chromosome, based on overlap between the fragments.
Assembling a complete sequence from fragment sequences
In the last step of shotgun sequencing, a computer analyzes a large number of fragment sequences to determine the DNA sequence of a whole chromosome. Given the following fragment sequences, what is the overall DNA sequence?
Sequences of DNA fragments: GATGAC, CGATGCG, GGCGTCAG, GACATGGC, TCAGTCGA
The five fragment sequences can be arranged to form the complete sequence:
Fragment          GATGAC
Fragment             GACATGGC
Fragment                  GGCGTCAG
Fragment                      TCAGTCGA
Fragment                           CGATGCG
Complete sequence GATGACATGGCGTCAGTCGATGCG
In shotgun sequencing, a computer program takes millions of bases into consideration when determining the sequence of an entire chromosome. The program arranges the fragment sequences so there is a maximum amount of overlap.
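The overlap-maximizing idea can be sketched with a small greedy assembler: repeatedly merge the two fragments that share the longest suffix-to-prefix overlap. This is only an illustrative sketch using the five fragments from the card above. The function names are made up for this example, and real assemblers must also handle sequencing errors, repeats, and both DNA strands, none of which is attempted here.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(fragments):
    """Merge the best-overlapping pair until one sequence remains."""
    frags = list(fragments)
    while len(frags) > 1:
        best_k, best_i, best_j = -1, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        # Join the pair, writing the shared bases only once.
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags)
                 if idx not in (best_i, best_j)] + [merged]
    return frags[0]

reads = ["GATGAC", "CGATGCG", "GGCGTCAG", "GACATGGC", "TCAGTCGA"]
print(greedy_assemble(reads))
```

For these fragments the greedy choice reproduces the complete sequence given above; with repetitive sequence a greedy strategy can choose the wrong merge, which is one reason map-based information is still useful.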
The dog (Canis familiaris) genome has recently been sequenced. About what percentage of the dog’s genes are shared with humans?
75%
A number of generalizations can be made about the organization of protein-coding genes in bacterial chromosomes. First, the gene density is very high, averaging about ________ gene per _____ basepairs of DNA.
A number of generalizations can be made about the organization of protein-coding genes in bacterial chromosomes. First, the gene density is very high, averaging about 1 gene per 1,000 basepairs of DNA.
Assembling a contig from short reads
In whole-genome shotgun sequencing, computers are used to assemble short DNA sequences (short reads) into an overlapping, contiguous sequence (contig). In this part of the tutorial, you will manually assemble a contig from seven short reads.
You have just done on a short DNA segment what computers do on an entire genome in whole-genome shotgun sequencing – build a contig by finding overlaps between short reads. The sequence of the contig is
AAGACCCGCCGGGAGGCAGAGGACCTGCAGGG
TGAGCCAACCGCCCATTGCT
In whole-genome shotgun sequencing, the genomic DNA must be prepared so that there are overlaps between the short reads. This can be done either by partial restriction digestion or randomized DNA shearing. If there are highly repetitive sequences or gaps in the contigs, map-based sequencing can be used to fill in the gaps and determine the correct number of repetitive sequences.
In which section of the search results can you find nucleotide-by-nucleotide comparisons between your query sequence and similar database sequences?
Alignments
The three sections of the search results page provide different information about the hit sequences from the database.
Understand the statistical significance of your hit sequences
One of the key features of BLAST is that it permits you to make quantitative comparisons of sequence similarities (alignments) between your query sequence and every other sequence in the database. The quantity that is most frequently reported in a statistical comparison of sequence alignments is the E value (expectation value). The E value is the number of alignments scoring at least as well as that particular hit that would be expected to occur by chance in a database of the same size; for very small values it is approximately the probability of finding such an alignment by chance.
Scroll to the Descriptions section of your results and examine the E values for your hits.
How do the E values change as you go from the top of the list of hits to the bottom?
The E values get larger.
The most similar sequences to your query sequence have E values of about 2 × 10⁻⁵⁴ (written as 2e-54 in the search results). This means that there is almost no possibility of finding a better sequence alignment by chance. As you scroll down the list of hits, the E values get slightly larger (smaller negative exponents). These slightly larger E values indicate less statistically significant sequence alignments. In other words, the hits near the top of the list are the most statistically significant, and hits with equal E values are statistically equivalent.
In general, a good alignment has an E value of 1 × 10⁻⁵ or smaller.
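The 1 × 10⁻⁵ rule of thumb is easy to apply to a list of hits. The sketch below parses E values written the way the results page shows them (for example, 2e-54), sorts them, and keeps those at or below the threshold. The hit names and all values other than 2e-54 are hypothetical placeholders, not real BLAST output.

```python
def significant_hits(hits, threshold=1e-5):
    """Return (name, E value) pairs at or below threshold, smallest E first."""
    parsed = [(name, float(e_text)) for name, e_text in hits]
    kept = [hit for hit in parsed if hit[1] <= threshold]
    return sorted(kept, key=lambda hit: hit[1])

# Hypothetical hit list; E values are written as they appear in the results table.
example_hits = [
    ("hit_C", "0.002"),
    ("hit_A", "2e-54"),
    ("hit_D", "4.1"),
    ("hit_B", "3e-20"),
]

for name, e_value in significant_hits(example_hits):
    print(name, e_value)
```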
Whole-genome shotgun sequencing has largely replaced map-based sequencing as a faster, cheaper method of sequencing full genomes.
Nevertheless, map-based cloning remains useful for areas of highly repetitive DNA because it can organize such DNA into a physical map before sequencing it.
Consequently, modern genome sequencing projects typically employ a combination of shotgun and map-based methods.
Both methods involve fragmenting genomes into overlapping segments and using the areas of overlap to assemble the segments into contiguous sequences, or contigs.
Once the complete sequence is determined, genomes are annotated to identify open reading frames, introns, exons, and other sequences.
proteomics
The study of the expressed proteins present in a cell at a given time.
comparative genomic hybridization (CGH)
A microarray-based method for the analysis of copy number variations in genomic DNA or in specific cell types, such as tumor cells.
YAC
A cloning vector in the form of a yeast artificial chromosome, constructed using chromosomal components including telomeres (from a ciliate) and centromeres, origin of replication, and marker genes from yeast.
YACs are used to clone long stretches of eukaryotic DNA.
Previous genome analysis
Used model systems
Screen for natural & induced mutants
Map studies
Linkage analysis to map genes
Required at least 1 mutant/ gene to find
Studies difficult to perform
Labor intensive
Some mutants lethal, will not find the genes associated with those
Genomics
Move to molecular methods (1980s), away from classical methods
Genomic library clones pieced together
Clones sequenced
Genome Sequencing Methods:
Clone by Clone
In the clone-by-clone method, a genomic library is prepared, and clones are organized into genetic and physical maps by observing the inheritance pattern of genetic markers in heterozygous families.
After the clones are arranged into physical maps, they are broken into smaller, overlapping clones that cover each chromosome.
Each smaller clone is sequenced, and the genomic sequence is assembled by stringing together the nucleotide sequence of the clones.
Genome Sequencing Methods:
Shotgun Method
In the shotgun method, a genomic library is constructed from fragments of genomic DNA.
Clones are selected from the library at random, and sequenced.
The sequence is assembled by looking for sequence overlaps between clones from different libraries. This is usually done by computer, using assembler software designed for genomic analysis.
NCBI
National Center for Biotechnology Information
Repository of Sequence/Annotation data
Genome sequence databases
Protein databases
Bioinformatic tools, e.g., BLAST for sequence similarity searches
bioinformatics
A field that focuses on the design and use of software and computational methods for the storage, analysis, and management of biological information such as nucleotide or amino acid sequences.
Bioinformatics
Analyze and store vast amounts of data
Visualize data
Access data
Data mining
Many companies have developed bioinformatic software
Prokaryotic Genomes:
Eubacteria genomes
Genome sizes vary
Most circular, but not all
Importance of plasmids, some essential:
When is a plasmid a chromosome?
Gene density is high, little “wasted genome”
Not all operons contain genes from same biochemical pathway, which was unexpected
Overlapping genes in eubacteria, also unexpected
The E. coli genome.
The origin and terminus of replication.
The outer circle of bars represents genes transcribed in a clockwise direction, and the inner circle represents genes transcribed in a counterclockwise direction.
Vibrio cholerae genome
The Vibrio cholerae genome is contained in 2 chromosomes.
The larger chromosome (chromosome 1) contains most of the genes for essential cellular functions and infectivity.
Most of the genes on chromosome 2 (52 percent of 115) are of unknown function.
The bias in gene content and the presence of plasmidlike sequences on chromosome 2 suggest that this chromosome was a megaplasmid captured by an ancestral Vibrio species.
An operon from the A. aeolicus genome.
This operon contains genes for protein synthesis (gatC), for DNA recombination (recA and recJ), for a motility protein (pilU), for nucleotide biosynthesis (cmk), and for lipid biosynthesis (pgsA1). This organization challenges the conventional idea that genes in an operon encode products that control a common biochemical pathway.
Prokaryotic Genomes:
Archaea genomes
Organisms typically extremophiles
Structurally similar to eubacteria, metabolically more similar to eukaryotes
Have histone chromosomal proteins, chromosomes may be organized into chromatin
Introns in tRNA genes
Eukaryotic Genomes
A mosaic of organization patterns
Usually linear chromosomes
Mitochondrial genomes
Chloroplast genomes in plants
Genome size highly variable
Low gene density
Introns present
Repetitive elements (up to 80% of genome in maize!)
C. elegans worm genome
Very different from other eukaryotic genomes
~25% of genes in polycistronic “operons”
Large number of introns
Genes within introns
Operons are common in bacterial genomes, but rare in most eukaryotic genomes.
In C. elegans, however, about 25% of all genes are part of operons.
operon
A genetic unit consisting of one or more structural genes encoding polypeptides, and an adjacent operator gene that regulates the transcriptional activity of the structural gene or genes.
Arabidopsis genome
Small genome
Many gene duplications
Used as a model organism
Higher plants have larger genomes but approximately the same number of genes
Genome organization in several large-genome plants vs. the compact genome of Arabidopsis.
In large-genome plants, genes are located in clusters, separated by long, gene-empty spaces of repetitive DNA sequences. Within the gene clusters, the intergenic spaces contain many transposons.
In the Arabidopsis genome, gene-empty regions have been lost, and transposable elements have been lost or reduced. The result is a much smaller genome with genes at a much higher density throughout the genome.
Human Genome Results
Only ~20,000–23,000 coding genes (5% of genome)
50% of genome repeat elements
Gene clusters & gene deserts
Wide range of intron numbers
Bacterial derived genes in human genome
Human genes tend to be large & contain multiple introns
Gene distribution not even on chromosomes
Duplicated regions found on some chromosomes
Function of ~2/3 of genes determined
A preliminary list of assigned functions for 26,588 genes in the human genome.
These are based on similarity to proteins of known function. Among the most common genes are those involved in nucleic acid metabolism (7.5% of all genes identified), receptors (5%), protein kinases (2.8%), and cytoskeletal structural proteins (2.8%).
A total of 12,809 predicted proteins (41%) have unknown functions, reflecting the work needed to fully decipher our genome.
Chromosomes 21 and 22 are the smallest chromosomes in the human genome.
The regions already sequenced are shown in red adjacent to each chromosome.
Some of the disease genes identified on each chromosome are shown below the chromosome.
Species comparisons:
Human vs. Chimpanzee
- Human 21 and chimp 22 share 179 genes with coding sequences of identical length
- these shared genes are 99.29% similar at nucleotide level, 99.18% at amino acid level
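Similarity figures such as 99.29% are percent identities over aligned positions. A minimal sketch of that calculation follows, assuming two sequences that are already aligned, equal in length, and free of gaps. The toy sequences are invented for illustration and are not real human or chimpanzee data.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of aligned positions at which the two sequences match."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)

# Invented toy alignment: 7 of 8 positions match.
print(percent_identity("ACGTACGT", "ACGTACGA"))
```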
A take home message:
There is much to learn as we sequence multiple genomes and move away from heavily studied “MODEL” organisms.
Results may challenge our models of generalization of gene organization, mechanisms, etc.
Genome Evolution
Compare genomes to gain insight to genome evolution
Bacteria: 3.5 billion years ago
Eukaryotes: 1.4 billion years ago
Gene density in four organisms.
(a) In E. coli, gene density is high, and there are very few repetitive sequences. In eukaryotes (b–d), gene density is lower, and portions of the genome are occupied by repetitive DNA sequences. (b) A 50-kb region from chromosome III of yeast contains over 20 genes and little repetitive DNA. (c) A 50-kb region from human chromosome 11 contains 6 genes and stretches of repetitive DNA. (d) A 50-kb region of the maize genome surrounding the Adh locus. This gene is surrounded by long stretches of repetitive DNA.
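The density comparison can be put into numbers. The sketch below converts gene counts per region into genes per kilobase, using only figures quoted in these notes: about 1 gene per 1,000 bp for bacteria, over 20 genes in 50 kb of yeast chromosome III (treated here as a lower bound of 20), and 6 genes in 50 kb of human chromosome 11. The function and dictionary names are illustrative.

```python
def genes_per_kb(gene_count, region_bp):
    """Gene density expressed as genes per 1,000 base pairs."""
    return gene_count / (region_bp / 1000)

# (gene count, region length in bp), taken from the notes above.
regions = {
    "bacterial average": (1, 1_000),
    "yeast chromosome III, 50 kb (lower bound)": (20, 50_000),
    "human chromosome 11, 50 kb": (6, 50_000),
}

for label, (genes, length_bp) in regions.items():
    print(f"{label}: {genes_per_kb(genes, length_bp):.2f} genes per kb")
```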
Gene Duplication
Important in origin & evolution of eukaryotic genomes
Increases genetic diversity
Results in multigene families
- arise by unequal crossover
- arise by replication errors
Globin Duplication/Evolution
Ancestral duplication of an oxygen transport gene
~800 mya → 2 sister genes
1 became modern day myoglobin (muscle oxygen carrier)
Second became the ancestral globin gene
This second gene duplicated again ~500 mya
The second duplication resulted in the alpha and beta globin gene families
Further duplication of alpha globin results in 3 alpha genes on human chromosome 16
Further duplication of beta globin results in 5 beta globin genes on chromosome 11
Evolutionary history of the globin gene superfamily.
About 700–800 million years ago (mya), a duplication event in an ancestral gene gave rise to two lineages.
One led to the myoglobin gene, which in humans is located on chromosome 22.
The other lineage underwent a second duplication event about 500 mya, giving rise to the ancestors of the alpha and beta subfamilies.
Duplications about 200 mya produced the alpha and beta globin subfamilies. In humans, the alpha-globin genes are located on chromosome 16 and the beta-globin genes are on chromosome 11.
Results of globin duplications:
3 alpha globin genes
Zeta expressed only in embryo
Alpha1 expressed in fetus
Alpha2 expressed in adults
Results of globin duplications:
5 beta globin genes
3 genes expressed prior to birth
Delta and beta globin expressed after birth
Increased Gene Diversity through Gene Recombination
~20,000 genes in human genome
How do we obtain incredible diversity of immune system genes?
Immunoglobulin gene subunit diversity
Differential splicing to increase diversity
Break & nibble mechanism to increase diversity
Pages 616-618 in text
Result: Incredible amount of diversity of immune system genes to produce antibodies for antigens
Antibodies
Antibodies (immunoglobulins) IgM, IgD, IgG, IgA, IgE found on plasma B cells
Proteins produced by vertebrates as a defense against infection.
Millions of different forms, each with a different binding site that specifically recognizes another molecule (antigen)
antigen
A molecule, often a cell-surface protein, that is capable of eliciting the formation of antibodies.
A typical antibody (immunoglobulin) molecule.
The molecule is Y shaped and contains 4 polypeptide chains
The longer arms are H chains and the shorter arms are L chains.
The chains are joined by disulfide bonds.
Each chain contains a variable region and a constant region.
The variable and hypervariable regions of a pair of L and H chains form a combining site that interacts with a specific antigen.
Different combinations of chains create different types of Ig classes
e.g., IgE (kappa2 epsilon2 or lambda2 epsilon2)
IgE class of antibodies
Involved in fighting parasitic infections
Involved in allergic responses
Tetramer: [kappa2 epsilon2] or [lambda2 epsilon2]
Immune response
Somatic Recombination occurs in maturing B cells
Each mature B cell makes ONE type of light chain (kappa or lambda) and ONE type of heavy chain
An antigen stimulates a particular B cell with an antibody for that antigen
Produces population of plasma cells with antibody for that particular antigen
Lymphomas
Different types of lymphoma arise depending on the stage of B-cell maturation at which the cancer develops
Formation of the DNA segments encoding a human kappa chain and the subsequent transcription, mRNA splicing, and translation leading to the final polypeptide chain.
One set of L-V regions joined to one of the joining regions during B cell maturation
Joining event is imprecise, happens over a six-base region; bases are also added or removed at the recombination region (“break & nibble”)
Transcription removes other “J” region, links to constant (C) region
Finally, splicing removes intervening regions
In germ-line DNA, 70–100 different L-V (leader-variable) segments are present. These are separated from the J regions by a long-noncoding sequence. The J regions are separated from a single C segment by an intron that must be spliced out of the initial mRNA transcript. Following translation, the amino acid sequence derived from the leader RNA is cleaved off as the mature polypeptide chain passes across the cell membrane.
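The combinatorial part of this diversity is simple multiplication: any L-V segment can be joined to any J region. The sketch below uses the 70–100 L-V range stated above. The number of J regions is not given in these notes, so the value used here is an explicit assumption for illustration, and the extra diversity from imprecise joining ("break & nibble") is not counted.

```python
def vj_combinations(n_lv_segments, n_j_segments):
    """Number of distinct L-V / J pairings from combinatorial joining alone."""
    return n_lv_segments * n_j_segments

# ASSUMPTION: the J-region count is not stated in the notes; 5 is illustrative.
ASSUMED_J_SEGMENTS = 5

for n_lv in (70, 100):  # range of L-V segments given in the notes
    print(n_lv, "L-V segments:", vj_combinations(n_lv, ASSUMED_J_SEGMENTS), "pairings")
```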
Proteomics
Study of gene products
When & where produced
Post translational modification
Cellular localization
Proteome: complete set of proteins expressed during a cell’s lifetime
Interactome
Description of protein-protein interactions within an organism
Aids in understanding pathways & interactions
May provide insight for therapeutic interruption of pathway in disease treatment
Extensions of proteomics
Interactome: protein interactions with each other
Kinome: interaction of kinase proteins, important in cancer research
Ionome: study of ions in organisms, e.g., Fe, Mn, Mg, K, Cu, Ca, Ni, S, etc.
knockout mice
Mice created by a process in which a normal gene is cloned, inactivated by the insertion of a marker (such as an antibiotic resistance gene), and transferred to embryonic stem cells, where the altered gene will replace the normal gene (in some cells). These cells are injected into a blastocyst-stage embryo, producing a mouse that is then bred to yield mice homozygous for the mutated gene.
gene knockout
The introduction of a null mutation into a gene that is subsequently introduced into an organism using transgenic techniques, whereby the organism loses the function of the gene. Often used in mice.
Matching genes, structure, and function
E.g., nuclear pore complex
Example of genomics combined with proteomics to determine structure and function
Genomic information determined genes involved in pore complex
Proteomics to determine molecular architecture of pore complex
Next: study protein-protein interactions (interactome) to determine what interacts with Nuclear Pore Complex
contig
Genome sequences are assembled by lining up overlaps in sequence between fragments. The resulting sequence is said to be contiguous, and the assembled fragments are called a contig.
genomic library
A genomic library is a collection of large DNA fragments from one genome cloned into vectors, such as bacterial artificial chromosomes (BACs).
subcloning
Genomic libraries contain fragments too large for one sequencing reaction. Therefore, before they can be sequenced, they must be cut into smaller fragments and recloned, in a process called subcloning.
annotation
Once an entire genome sequence has been determined, computer algorithms can be used to analyze it for important sequences, such as open reading frames, introns, and regulatory sequences. This process is called genome annotation.
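One of the simplest annotation steps, scanning for open reading frames, can be sketched in a few lines. The example below looks for an ATG followed in the same frame by a stop codon, on the forward strand only. It is a toy illustration with an invented sequence; real annotation pipelines also scan the reverse strand and account for introns, codon usage, and regulatory signals.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=2):
    """Return (start, end, sequence) for forward-strand ORFs.

    min_codons counts codons from the ATG up to, but not including, the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first in-frame start codon opens an ORF
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, dna[start:pos + 3]))
                start = None  # ORF closed; look for the next ATG
    return orfs

# Invented toy sequence containing one short ORF.
print(find_orfs("CCATGGCTTAAGG"))
```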
The relationship between recombination maps, physical maps, and genome sequencing
A recombination map is made by determining how often chromosomal locations, or loci, recombine into new combinations during meiosis. Although recombination frequency depends on the number of nucleotides separating two linked loci, different parts of the genome have different rates of recombination. This means that a recombination map is only an estimate of the true (physical) map. Despite this caveat, recombination maps are useful for ordering large clones into a physical map if loci on the clones can be identified and lined up with a known recombination map.
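Recombination frequency is the fraction of offspring that carry a new combination of alleles at two loci, and by convention 1% recombination corresponds to 1 map unit (centimorgan). A minimal sketch of that calculation follows; the offspring counts are hypothetical.

```python
def recombination_frequency(recombinants, total_offspring):
    """Percent recombination between two loci (equals map units, cM)."""
    if total_offspring <= 0:
        raise ValueError("total_offspring must be positive")
    return 100.0 * recombinants / total_offspring

# Hypothetical cross: 50 recombinant offspring out of 1,000 scored.
print(recombination_frequency(50, 1000), "map units")
```

Because recombination rates differ across the genome, the same map distance can correspond to different physical distances, which is the caveat described above.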
SNPs are changes in nucleotide sequence. Which of the following SNPs would be likely to change the appearance of a protein’s band(s) on a Western blot?
A nonsense mutation (a change from an amino acid codon to a stop codon) after the antibody binding site
A nonsense mutation (a change from an amino acid codon to a stop codon) before the antibody binding site
An altered amino acid codon in a protease site that prevents cleavage of the precursor protein
start of transcription
A gene encoded in genomic DNA is expressed when RNA polymerase transcribes it into a primary RNA transcript. Transcription begins at a site on the DNA called the start of transcription and requires specific DNA sequences upstream of that site. These sequences recruit and stabilize the transcription complex, which includes RNA polymerase II.
TATA box
Many eukaryotic genes include a TATA box (a sequence of nucleotides with the consensus sequence TATAAA) as part of the sequences that recruit the transcription complex.
introns and exons
In the nucleus, introns are spliced out of the primary RNA transcript, and exons are joined together into a mature messenger RNA (mRNA) molecule.
start of translation
The mRNA, once exported from the nucleus, will dock on a ribosome and be translated into an amino acid sequence beginning with the start of translation (a methionine codon, AUG).
Messenger RNA also has untranslated sequences upstream (toward the 5′ end) and downstream (toward the 3′ end) of the coding region.
Because multiple AUG sequences may be present, the correct reading frame is favored by the interaction of sequences in the 5′ untranslated region with the ribosome.
How does comparing an mRNA sequence to a genomic sequence help in annotation?
Aligning an mRNA sequence to a genomic sequence reveals exon/intron boundaries.
Aligning an mRNA sequence to a genomic sequence indicates the start of transcription of a gene.
Comparing an mRNA sequence to its genomic counterpart is very useful for annotating a gene.
The beginning of the mRNA corresponds to the first transcribed nucleotide, or start of transcription.
A TATA box, if present, is typically 24-25 nucleotides upstream of the start of transcription.
Also, mRNA has had introns removed, and comparing it to a genomic sequence reveals where exon/intron boundaries are.
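The way an mRNA-to-genome comparison exposes exon/intron boundaries can be sketched with exact string matching. The function below walks along the mRNA (written with T instead of U), finds the longest stretch that matches the genomic sequence, records it as an exon, and treats the skipped genomic stretch as an intron. The sequences and function names are invented for illustration. Real aligners tolerate mismatches and use splice-site signals to resolve boundaries that exact matching leaves ambiguous.

```python
def map_exons(mrna, genomic, min_exon=4):
    """Return genomic (start, end) coordinates of exons, in order."""
    exons = []
    i = 0  # current position in the mRNA
    g = 0  # search resumes here in the genomic sequence
    while i < len(mrna):
        # Try the longest remaining mRNA prefix first, then shorter ones.
        for length in range(len(mrna) - i, min_exon - 1, -1):
            pos = genomic.find(mrna[i:i + length], g)
            if pos != -1:
                exons.append((pos, pos + length))
                i += length
                g = pos + length
                break
        else:
            # Also reached when fewer than min_exon bases remain.
            raise ValueError("no exact match for the remaining mRNA")
    return exons

def introns_between(exons):
    """Genomic intervals lying between consecutive exons."""
    return [(a_end, b_start) for (_, a_end), (b_start, _) in zip(exons, exons[1:])]

# Invented toy gene: two exons separated by one intron.
exon1, intron, exon2 = "ATGGCCAAG", "GTAAGTCCTTTTCAG", "CATCTGTAA"
genomic = "CC" + exon1 + intron + exon2 + "GG"
exons = map_exons(exon1 + exon2, genomic)
print(exons, introns_between(exons))
```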
Southern blot analysis has several steps:
cleaving genomic DNA with restriction enzymes,
separating the resulting fragments on an electrophoresis gel,
transferring the separated fragments to a membrane (blotting),
exposing the membrane to a labeled nucleotide probe.
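The first, second, and last of these steps can be mimicked on a toy sequence. The sketch below cuts DNA at every EcoRI recognition site (GAATTC, cut after the first G), orders the fragments by size as a gel would, and keeps the fragments containing a probe sequence. The input sequence and probe are invented, only one strand is modeled, and probe binding is simplified to an exact substring match rather than hybridization to a complementary strand.

```python
def digest(dna, site="GAATTC", cut_offset=1):
    """Cut dna at every occurrence of site; cut_offset=1 models G^AATTC."""
    fragments = []
    start = 0
    pos = dna.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(dna[start:cut])
        start = cut
        pos = dna.find(site, pos + 1)
    fragments.append(dna[start:])
    return fragments

def by_size(fragments):
    """Order fragments from largest to smallest, as separated on a gel."""
    return sorted(fragments, key=len, reverse=True)

def probe_hits(fragments, probe):
    """Fragments that contain the probe sequence (simplified detection)."""
    return [f for f in fragments if probe in f]

# Invented toy sequence with two EcoRI sites.
pieces = digest("AAGAATTCTTTTGAATTCGG")
print(by_size(pieces))
print(probe_hits(pieces, "TTTT"))
```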
T cells are cells of the immune system that recognize foreign proteins displayed on cell surfaces using the T cell receptor. The human body produces randomized T cell receptors, thereby generating the diversity of receptors needed to recognize unknown pathogens. This randomized process also generates, by chance, T cell receptors that bind to self proteins, including insulin. The immune system avoids releasing self-reactive T cells into the circulation by exposing them to cells that display self proteins in the thymus.
The type III allele promotes insulin gene expression in the thymus, increasing the probability that insulin-reactive T cells generated there will bind with self cells displaying insulin before those T cells are released. Lower insulin gene expression in the thymus (such as that seen with the type I allele) increases the probability that self-reactive T cells will escape the thymus and be carried by the blood to the pancreas, where they can target insulin-producing islet cells. The destruction of these pancreatic islet cells leads to the loss of the ability to produce insulin (type 1 diabetes).
Genomics
the study of the entire genomes of organisms.
The shotgun approach is usually somewhat random.