Human Genome Flashcards

1
Q

Craig Venter

A

In 1995 Craig Venter, then at TIGR (The Institute for Genomic Research), shocked the scientific community by sequencing the bacterium Haemophilus influenzae with a novel method called whole-genome shotgun sequencing. He did 10-12 fold coverage of the whole genome of about 2 million base pairs, then used computer power to find all the overlaps among his 500 bp reads (about 40,000 of them) and reconstruct, or assemble, the whole sequence. (10-fold coverage, or 10X, means that for a 1 million base pair genome [1 Mb] you sequence 10 million bases.) Some finishing work was needed to fill in the gaps, but it worked. He argued that the same method would work for the human genome, but critics said it would never work because the human genome has too many repeats that would prevent assembly.
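
NOTE: The coverage arithmetic on this card can be checked with a minimal Python sketch. The function name fold_coverage is only illustrative; the read count, read length and genome size are the approximate figures quoted on the card.

```python
def fold_coverage(num_reads, read_length_bp, genome_size_bp):
    """Return fold coverage: total bases sequenced divided by genome size."""
    return num_reads * read_length_bp / genome_size_bp


# Approximate figures from the card: ~40,000 reads of 500 bp, ~2 Mb genome.
coverage = fold_coverage(40_000, 500, 2_000_000)
print(f"{coverage:.0f}X coverage")  # prints "10X coverage"
```

The same function gives 10X for the card's other example: 10 million bases sequenced over a 1 Mb genome.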

2
Q

How is clone mapping used in genome sequencing?

A

Clone mapping identifies large clones, in order along the chromosomes, that together cover the whole sequence. The sequence of each clone can then be determined, which gives the sequence of the whole genome.

3
Q

What are repeats and why do they pose a problem for genome sequencing?

A

There are millions of repeat sequences in the genome. The problem was how to get past the repeats with only 500 bp reads, since some of the repeats are 10,000 bp long. The copies are not exact, but they are very similar, so reads from them can only be put together if there is other information on their location. In response, Craig Venter developed the paired-end sequences method.

4
Q

What is the paired-end sequences method?

A

500 bp sequences were read from clones of three different sizes: 2 kb, 10 kb and 50 kb. If both ends of a clone are sequenced, one end serves as an anchor for the other. You automatically know that the opposite end is 2, 10 or 50 kb away, and you also know the orientation of the two pieces (the 3’ ends face each other). By using overlapping 10 kb and 50 kb fragments it was possible to span any repeat smaller than 50 kb. By combining these paired-end sequences with the public sequence data and his own, for about 18x coverage, the whole genome could be assembled.
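
NOTE: The idea that a clone can bridge a repeat can be sketched in Python. This is a simplified illustration, not the assembler actually used: it assumes a clone spans a repeat when the stretch between its two 500 bp end reads is at least as long as the repeat. The names clones_that_span, READ_LENGTH_BP and CLONE_SIZES_BP are hypothetical.

```python
READ_LENGTH_BP = 500                       # read length quoted on the card
CLONE_SIZES_BP = [2_000, 10_000, 50_000]   # the three clone sizes on the card


def clones_that_span(repeat_length_bp,
                     clone_sizes_bp=CLONE_SIZES_BP,
                     read_length_bp=READ_LENGTH_BP):
    """Return the clone sizes whose end reads can both sit outside the repeat.

    Simplifying assumption: the gap between the two end reads
    (clone size minus both reads) must be at least the repeat length.
    """
    return [size for size in clone_sizes_bp
            if size - 2 * read_length_bp >= repeat_length_bp]


print(clones_that_span(800))     # a short repeat: every clone size spans it
print(clones_that_span(10_000))  # a 10 kb repeat: only the 50 kb clones span it
print(clones_that_span(60_000))  # longer than any clone: nothing spans it
```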

5
Q

What is annotation?

A

Annotation is the process of finding all the interesting regions of the genome and labeling them as exons, promoters, pseudogenes, centromeres, telomeres, etc. Most of the attention has been on the genes, especially the exons.

6
Q

What are SNPs?

A

SNPs are single nucleotide polymorphisms: single nucleotide differences between individuals. They occur about every 1000 bp. The 10x coverage of the genome has allowed detection of millions of these SNPs. They are useful markers for mapping inheritance in families and in populations. In fact there is a whole new discipline called archaeogenetics that studies human migrations and population differences by studying SNPs. Disease gene hunters are also using SNPs to identify the regions of human chromosomes that are consistently inherited with a given disease.
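
NOTE: Spotting SNPs between two already-aligned sequences can be sketched in Python. The function find_snps and both toy sequences are made up for illustration; real SNP calling works from many overlapping reads and has to account for sequencing errors.

```python
def find_snps(reference, sample):
    """List (position, reference_base, sample_base) where the sequences differ.

    Positions are 0-based. Both sequences must already be aligned
    and of equal length (no insertions or deletions).
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [(i, r, s)
            for i, (r, s) in enumerate(zip(reference, sample))
            if r != s]


# Toy sequences (invented): they differ at a single position.
reference = "ATGGCGTACGTTAGC"
sample = "ATGGCGTACATTAGC"
print(find_snps(reference, sample))  # [(9, 'G', 'A')]
```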

7
Q

Why is locating exons often seen as one of the most difficult aspects of genome assembly and gene delineation?

A

1) Assembling genes requires complete sequence data.
2) Polymorphisms, or normal variations in DNA sequence, can cause large deletions of a gene.
3) *** Important: Exons are typically much, much shorter than the introns surrounding them (e.g. in human P450 CYP3A43, the first intron is over 8000 bp long and the first exon encodes only 24 amino acids. Seven exons encode fewer than 40 amino acids and one encodes only 18). Any differences in these exons (possibly caused by SNPs) can make their detection even more challenging.

NOTE: Luckily the CYP3A subfamily is pretty well conserved, and these short exon fragments can be detected by a systematic search with other CYP3A sequences. This would not be the case if the percent identity between a new sequence and some other known sequence were low. In that case one would need to depend on cDNA sequence to help locate the missing pieces, especially the N-terminus. (cDNA, or complementary DNA, is made by reverse transcription of mRNA.)

8
Q

What is Junk DNA?

A

Junk DNA is a term for the 98% of DNA that is non-coding. Part of its function is regulatory. Genes exist in context in the DNA. Some genes have controlling sequences that are very far away, maybe a million base pairs. By comparing human and mouse genomes, many thousands of conserved sequences have been found that are not genes. They do not make protein, but they have been kept intact and highly conserved for 75 million years. Almost certainly, there is important biology behind this non-coding conserved DNA. Some of it may be transcribed into ncRNA (non-coding RNA). One newer class of this conserved sequence is microRNA, or miRNA (epigenetic regulation).

NOTE: Research has shown that ~75% of the human genome is transcribed

9
Q

What is alternative splicing?

A

Alternative splicing is a phenomenon in which multiple protein products are derived from a single gene. It joins alternative first exons onto the rest of a transcript, or sometimes causes exons in the middle to be skipped. Sequence data support that ~95% of human multi-exon genes make alternative splice variants, with some making 10 or more. Thus, our 20,687 genes may make well over 100,000 different proteins, making us very complex.
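
NOTE: How exon skipping multiplies the products of one gene can be sketched in Python. This toy model assumes the first and last exons are always kept and that any subset of middle exons may be skipped; real splicing is regulated, so only some of these combinations occur. The function name exon_skipping_isoforms and the exon labels are hypothetical.

```python
from itertools import combinations


def exon_skipping_isoforms(exons):
    """Yield exon lists keeping the first and last exon plus any subset
    of the internal exons, always in their original order.

    Expects at least two exons.
    """
    first, *internal, last = exons
    for k in range(len(internal), -1, -1):
        for kept in combinations(internal, k):
            yield [first, *kept, last]


# A hypothetical four-exon gene gives 2**2 = 4 possible isoforms.
for isoform in exon_skipping_isoforms(["E1", "E2", "E3", "E4"]):
    print("-".join(isoform))
```

With n exons this model gives 2**(n - 2) candidate isoforms, which illustrates why a modest number of genes can yield many more proteins.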

10
Q

What is the ENCODE Project?

A

ENCODE (ENCyclopedia Of DNA Elements) is a worldwide project to assign functions to each base in the human genome. It revealed that 76% of the human genome is transcribed, even though only about 2% of the genome is protein-coding sequence (think ncRNA and miRNA). The project found 3.9 million sites where transcription factors bind, covering 8% of the genome; both figures will increase (the latter may double) as more transcription factors are screened. Gene regulatory regions are prevalent around gene transcription start sites, but they may extend 100,000 bp away or more. This research challenges the idea of junk DNA.

11
Q

What is a BLAST search used for?

A

A BLAST search rapidly compares your protein sequence (derived from cDNA) against all the DNA in the human genome (or whatever data you specify). The DNA is translated in all six possible reading frames (3 per DNA strand) and the best matches are returned to you. A hit need not be an exact match; it may be a match to a related sequence, probably in the same protein family. BLAST is used primarily for annotation of genes and exons.
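
NOTE: Translating DNA in all six reading frames (3 per strand) can be sketched in Python. This only illustrates the translation step, not the BLAST search itself. It uses the standard genetic code; the function names and the toy input are illustrative, and the input is assumed to contain only A, C, G and T.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from the start of dna; '*' marks a stop."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


def six_frame_translations(dna):
    """Return translations keyed '+1'..'+3' (given strand) and '-1'..'-3'."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for frame, protein in six_frame_translations("ATGGCCTAA").items():
    print(frame, protein)
```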

12
Q

What are protein families?

A

Protein families are groups of proteins identified by sequence regions that are conserved between members of the same family. PROSITE is a database that catalogs these characteristic sequences, which are called motifs. Short motifs that can be written out are called signature sequences (e.g. FXXGXXXCX[G, A] is a signature sequence for many cytochrome P450 proteins). A motif can tell you what the protein does, or at least give a clue to its function.
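
NOTE: A written-out signature sequence behaves like a pattern that can be searched for. The Python sketch below converts the card's P450 signature into a regular expression, assuming X means any amino acid and a bracket lists the allowed alternatives. The function signature_to_regex and the toy protein sequence are made up for illustration; this is not PROSITE's own pattern syntax.

```python
import re


def signature_to_regex(signature):
    """Compile a signature such as FXXGXXXCX[G, A] into a regex.

    Assumes X stands for any residue and [..] lists allowed residues.
    """
    pattern = signature.replace(" ", "").replace(",", "").replace("X", ".")
    return re.compile(pattern)


p450_signature = signature_to_regex("FXXGXXXCX[G, A]")
print(p450_signature.pattern)  # F..G...C.[GA]

# Invented toy sequence containing one stretch that fits the signature.
match = p450_signature.search("PHSLPFGAGRRVCVGEGLA")
print(match.start(), match.group())  # 5 FGAGRRVCVG
```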

13
Q

What does the BLOCKS database do?

A

Protein families often have more than one motif. The chance of a protein being in a family goes up if it contains more than one motif from that family. The BLOCKS database arranges protein motifs in order as blocks of conserved sequences. These can be searched to identify what family a protein belongs to. The Pfam database also uses this mechanism for protein family identification.

InterPro links these databases for enhanced searching.

14
Q

What is the Online Mendelian Inheritance in Man (OMIM) database used for?

A

OMIM is used to find out how the gene you are interested in influences human disease. It has thousands of entries on human genes, covering everything that is known about their genetics and their role in disease.

NOTE: Not every human gene is linked to a disease. Other genes are complex and cause more than one type of disease.

15
Q

What is the FOXP2 gene?

A

FOXP2 is one of the genes that is a candidate for being unique to humans and partially responsible for our ability to communicate. Defects in this forkhead transcription factor gene affect human language ability. The FOXP2 gene in Neanderthals is the same as in humans, suggesting they could speak and probably had language.

16
Q

How can you get around the problem of finding exons in sequenced DNA?

A

Create cDNA from mRNA by reverse transcription. Because introns are spliced out during mRNA processing, this cDNA contains only exons.
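
NOTE: Why cDNA contains only exons can be illustrated with a small Python sketch that joins exon slices of a genomic sequence and drops the intron between them. The function spliced_sequence, the toy sequence and the exon coordinates are invented for illustration; in practice the coordinates are exactly what you are trying to find, and comparing cDNA against genomic DNA reveals them.

```python
def spliced_sequence(genomic_dna, exon_coords):
    """Join exon slices in order, dropping the introns between them.

    exon_coords holds (start, end) pairs, 0-based and end-exclusive.
    """
    return "".join(genomic_dna[start:end] for start, end in exon_coords)


# Toy gene (invented): exon 1, a 12 bp intron, exon 2.
genomic = "ATGGCC" + "GTAAGTTTTCAG" + "GATTAA"
exons = [(0, 6), (18, 24)]
print(spliced_sequence(genomic, exons))  # ATGGCCGATTAA
```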

17
Q

What are Expressed Sequence Tags?

A

ESTs are DNA sequences taken from cDNA. They do not contain intron sequences, only the protein-coding regions and some untranslated sequence on the ends. They can be very helpful in defining the intron-exon boundaries of new genes.

18
Q

What does the caspase12 gene do?

A

Lack of caspase12 may influence the development of Alzheimer’s disease in humans. It is suspected that the major differences between chimps and humans will not be due to protein sequence differences, but rather to changes in regulatory parts of genes that alter expression levels, tissue specificity and timing.

19
Q

What is Herceptin?

A

Herceptin is a monoclonal antibody that recognizes HER2 receptors. These receptors are over-expressed in HER2-positive metastatic breast cancer, and the antibody shuts off the receptor signaling pathway. Imatinib (Gleevec) is a kinase inhibitor of the Abl oncogene. This gene is permanently activated in chronic myeloid leukemia by a gene fusion event. Imatinib has doubled the five-year survival rate to 90%. Gefitinib (Iressa) blocks the epidermal growth factor receptor (EGFR) tyrosine kinase when it carries a particular amino acid sequence found in ~10% of the population. People without this sequence do not respond, so this is a genome-based treatment. The drug is used in non-small cell lung cancer, but it could be useful for other cancers as well. Erlotinib is another EGFR inhibitor. About 60% of EGFR-positive patients respond to this drug.