Human Genome Flashcards
Craig Venter
In 1995 Craig Venter, then at TIGR (The Institute for Genomic Research), shocked the scientific community by sequencing the bacterium Haemophilus influenzae with a novel method called whole-genome shotgun sequencing. He sequenced the roughly 2 million base pair genome to 10-12 fold coverage, then used computer power to find all the overlaps among his 500 bp reads (about 40,000 of them) and assemble the whole sequence. (10-fold coverage, or 10X, means that for a 1 million base pair genome [1 Mb] you sequence 10 million bases.) Some finishing work was needed to fill in the gaps, but it worked. He argued that the same method would work for the human genome, but critics said it never would, because the human genome has too many repeats that would prevent assembly.
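The coverage arithmetic above can be checked with a short Python sketch. The function names are illustrative and not part of any sequencing toolkit; the numbers are the ones quoted in the card.

```python
import math


def reads_needed(genome_size_bp: int, coverage: float, read_length_bp: int) -> int:
    """Reads required so that total sequenced bases = coverage * genome size."""
    total_bases = genome_size_bp * coverage
    # Round up: a fractional read still has to be sequenced as a whole read.
    return math.ceil(total_bases / read_length_bp)


def coverage_from_reads(n_reads: int, read_length_bp: int, genome_size_bp: int) -> float:
    """Fold coverage achieved by n_reads reads of a fixed length."""
    return n_reads * read_length_bp / genome_size_bp


# ~2 Mb genome, 10X coverage, 500 bp reads
print(reads_needed(2_000_000, 10, 500))           # 40000
print(coverage_from_reads(40_000, 500, 2_000_000))  # 10.0
```

This only counts bases; it says nothing about how evenly the reads are spread, which is why finishing work was still needed to close gaps.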
How is clone-mapping used in genome sequencing?
Identify large clones, in order along the chromosomes, that together cover the whole sequence. The sequence of each clone can then be determined, giving the sequence of the whole genome.
What are repeats and why do they pose a problem for genome sequencing?
There are millions of repeat sequences in the genome. The question was how to get around the repeats with only 500 bp reads, since some repeats are 10,000 bp long. The copies are not exact, but they are so similar that they can only be placed correctly if there is other information on their location. In response, Craig Venter developed the paired-end sequences method.
What is the paired-end sequences method?
500 bp sequences were read from clones of three different sizes: 2 kb, 10 kb and 50 kb. If both ends of a clone are sequenced, one end serves as an anchor for the other. You automatically know that the opposite end is 2, 10 or 50 kb away, and you also know the orientation of the two pieces (the 3’ ends face each other). By using overlapping 10 kb and 50 kb fragments it was possible to span any repeat smaller than 50 kb. By combining these paired-end sequences with the public sequence data to reach about 18x coverage, the whole genome could be assembled.
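A rough way to see why the larger clones matter is the Python sketch below. It assumes a deliberately simple rule that is not stated in the card: a clone spans a repeat only if both 500 bp end reads land entirely in unique sequence on either side, so the repeat has to fit in the gap between the two reads. The names `LIBRARY_SIZES` and `spanning_libraries` are hypothetical.

```python
READ_LENGTH_BP = 500
# Clone (insert) sizes mentioned in the card.
LIBRARY_SIZES = {"2kb": 2_000, "10kb": 10_000, "50kb": 50_000}


def spanning_libraries(repeat_length_bp: int, read_length_bp: int = READ_LENGTH_BP) -> list:
    """Return the clone libraries whose end reads can anchor on both sides of a repeat.

    Simplifying assumption: both end reads must fall completely outside the
    repeat, so the repeat must fit between them.
    """
    spanning = []
    for name, insert_size in LIBRARY_SIZES.items():
        inner_gap = insert_size - 2 * read_length_bp  # bases between the two end reads
        if repeat_length_bp <= inner_gap:
            spanning.append(name)
    return spanning


print(spanning_libraries(800))     # ['2kb', '10kb', '50kb']
print(spanning_libraries(10_000))  # ['50kb']
print(spanning_libraries(60_000))  # []
```

Under this rule a 10,000 bp repeat is bridged only by the 50 kb clones, which is the situation the paired-end method was designed for.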
What is annotation?
Annotation is the process of finding all the interesting regions of the genome and labeling them as exons, promoters, pseudogenes, centromeres, telomeres, etc. Most attention has focused on the genes, and especially the exons.
What are SNPs?
Single nucleotide polymorphisms: single-nucleotide differences between individuals. SNPs occur about every 1000 bp. The 10x coverage of the genome has allowed detection of millions of these SNPs. They are useful markers to map inheritance in families and in populations. In fact there is a whole new discipline called archaeogenetics that studies human migrations and population differences by studying SNPs. Disease gene hunters are also using SNPs to identify the regions of human chromosomes that are consistently inherited with a given disease.
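As a minimal illustration of what "single nucleotide difference" means, the Python sketch below compares two already-aligned sequences of equal length and lists the mismatching positions. The sequences are made up for the example, and real SNP calling works from many overlapping reads with quality scores, which this does not attempt.

```python
def find_snps(reference: str, sample: str) -> list:
    """List (1-based position, reference base, sample base) for each mismatch.

    Assumes the two sequences are already aligned and have the same length.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (i + 1, ref_base, sample_base)
        for i, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base
    ]


reference = "ATGCCGTAAGCT"  # hypothetical reference sequence
sample = "ATGCCATAAGCT"     # hypothetical individual with one difference
print(find_snps(reference, sample))  # [(6, 'G', 'A')]
```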
Why is locating exons often seen as one of the most difficult aspects of genome assembly and gene delineation?
1) Assembling genes requires complete sequence data
2) Polymorphisms, or normal variations in DNA sequence, can cause large deletions of a gene
3) *** Important: Exons are typically much, much shorter than the introns surrounding them (e.g. in human P450 CYP3A43, the first intron is over 8000 bp long and the first exon encodes only 24 amino acids, i.e. 72 bp. Seven exons encode fewer than 40 amino acids and one encodes only 18). Any differences in these exons (possibly caused by SNPs) can make their detection even more challenging.
NOTE: Luckily the CYP3A subfamily is pretty well conserved and these short exon fragments can be detected by a systematic search with other CYP3A sequences. This would not be the case if the percent identity between a new sequence and some other known sequence were low. In that case one would need to depend on cDNA sequence to help locate the missing pieces, especially the N-terminus. (cDNA, or complementary DNA, is made by reverse transcription of mRNA.)
What is Junk DNA?
Junk DNA is a term for the 98% of DNA that is non-coding. Part of its function is regulatory. Genes exist in context in the DNA. Some genes have controlling sequences that are very far away, maybe a million base pairs. By comparing human and mouse genomes, many thousands of conserved sequences have been found that are not genes. They do not make protein, but they have been kept intact and highly conserved for 75 million years. Almost certainly, there is important biology behind this non-coding conserved DNA. Some of it may be transcribed into ncRNA (non-coding RNA). One newly recognized class within this conserved sequence is microRNA, or miRNA (epigenetic regulation).
NOTE: Research has shown that ~75% of the human genome is transcribed
What is alternative splicing?
A phenomenon where multiple protein products may be derived from a single gene. Alternative splicing joins alternative first exons onto a transcript or sometimes causes exons in the middle to be skipped. Sequence data indicate that ~95% of human multi-exon genes make alternative splice variants, with some making 10 or more. Thus, our 20,687 genes may make well over 100,000 different proteins, making us very complex.
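To see how exon skipping alone can multiply products, here is a toy Python sketch. It assumes the first and last exons are always kept and every internal exon is independently either included or skipped; real splicing is regulated and most of these combinations never occur, so the count is an upper bound for this toy model only. The exon labels are placeholders.

```python
from itertools import product


def exon_skipping_isoforms(exons: list) -> list:
    """Enumerate isoforms where each internal exon is either kept or skipped.

    Toy model: first and last exons are always retained.
    """
    if len(exons) < 2:
        return [tuple(exons)]
    first, *internal, last = exons
    isoforms = []
    for keep_flags in product([True, False], repeat=len(internal)):
        kept = [exon for exon, keep in zip(internal, keep_flags) if keep]
        isoforms.append((first, *kept, last))
    return isoforms


for isoform in exon_skipping_isoforms(["E1", "E2", "E3", "E4"]):
    print("-".join(isoform))
# E1-E2-E3-E4
# E1-E2-E4
# E1-E3-E4
# E1-E4
```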
What is the ENCODE Project?
ENCODE (ENCyclopedia Of DNA Elements) is a worldwide project to assign functions to each base in the human genome. It revealed that 76% of the human genome is transcribed, even though only about 2% of the genome is protein-coding sequence (think ncRNA and miRNA). It found 3.9 million sites where transcription factors bind. 8% of the genome falls in a transcription factor binding site, and this will increase (maybe double) as more transcription factors are screened. Gene regulatory regions are prevalent around gene transcription start sites, but they may extend out 100,000 bp or more. This research challenges the idea of junk DNA.
What is a BLAST search used for?
A BLAST search rapidly compares your protein sequence (derived from cDNA) against all the DNA in the human genome (or whatever database you specify). In this protein-versus-DNA mode (tblastn), the DNA is translated in all six possible reading frames (3 per DNA strand) and the best matches are returned to you. A hit need not be an exact match; it may be a related sequence, probably in the same protein family. Used primarily for annotation of genes and exons.
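The "all possible reading frames" step can be sketched in Python as follows. This shows only the six-frame translation with the standard genetic code; it does not implement BLAST's alignment or scoring, and the function names are illustrative.

```python
# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna: str) -> str:
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna: str) -> str:
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna: str) -> dict:
    """Translate a DNA sequence in 3 frames on each strand (6 frames total)."""
    dna = dna.upper()
    frames = {}
    for strand_name, strand in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand_name}{offset + 1}"] = translate(strand[offset:])
    return frames


frames = six_frame_translation("ATGGCCTAA")
print(frames["+1"])  # MA*
print(frames["-1"])  # LGH
```

A protein query is then compared against each of the six translations, which is why a match can be found without knowing the gene's strand or frame in advance.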
What are protein families?
Groups of proteins identified by sequence regions that are conserved between members of the same family. Prosite is a database that catalogs these characteristic sequences, which are called motifs. Short motifs that can be written out are called signature sequences (e.g. FXXGXXXCX[G, A] is a signature sequence for many cytochrome P450 proteins). A motif match can tell you what the protein does, or at least give a clue to its function.
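A signature sequence written this way maps directly onto a regular expression: X becomes "any residue" and a bracket lists the allowed alternatives. The Python sketch below converts the card's example and scans a protein for it. The converter handles only this simple notation (not the full Prosite pattern syntax), and the test sequence is invented for illustration.

```python
import re


def signature_to_regex(signature: str) -> "re.Pattern":
    """Convert a signature such as FXXGXXXCX[G, A] into a compiled regex.

    Assumes X means any residue and brackets list alternatives; spaces and
    commas inside brackets are dropped.
    """
    cleaned = signature.replace(" ", "").replace(",", "")
    return re.compile(cleaned.replace("X", "."))


def find_signature(signature: str, protein: str) -> list:
    """Return (1-based start position, matched text) for every match."""
    pattern = signature_to_regex(signature)
    return [(m.start() + 1, m.group()) for m in pattern.finditer(protein)]


p450_signature = "FXXGXXXCX[G, A]"
hypothetical_protein = "MKLFSAGKRICIGQ"  # made-up sequence, not a real P450
print(signature_to_regex(p450_signature).pattern)           # F..G...C.[GA]
print(find_signature(p450_signature, hypothetical_protein))  # [(4, 'FSAGKRICIG')]
```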
What does the BLOCKS database do?
Protein families often have more than one motif. The chance of a protein being in a family goes up if it contains more than one motif from that family. The BLOCKS database arranges protein motifs in order as blocks of conserved sequences. These can be searched to identify what family a protein belongs to. The Pfam database also uses this approach for protein family identification.
InterPro links these databases for enhanced searching.
What is the Online Mendelian Inheritance in Man (OMIM) database used for?
Finding out how the gene you are interested in influences human disease. There are thousands of entries on human genes, covering everything that is known about their genetics and their role in disease.
NOTE: Not every human gene is linked to a disease. Other genes are complex and cause more than one type of disease.
What is the FOXP2 gene?
FOXP2 is one of the genes that is a candidate for being unique to humans and partially responsible for our ability to communicate. Defects in this forkhead transcription factor gene affect human language ability. The FOXP2 gene in Neanderthals is the same as in humans, suggesting they could speak and probably had language.