Genomics - Irina Flashcards
What are ESTs and how are they obtained?
Expressed sequence tags (ESTs)
→ clones of reverse-transcribed mRNA representing a specific physiological condition or tissue
→ produced in a high-throughput, single-pass pipeline, leading to a large collection of sequence reads of 400-700 bp
→ first thing to do when commencing a genomic project de novo
Which criteria should be fulfilled for the recognition of genes in a genome? Describe the principle of each of them.
- identifying ORFs: a series of amino acid-coding triplets (codons) bounded by a start and a stop codon = open reading frame (ORF)
- codon bias: for most amino acids two or more codons are available in the genetic code, and some codons occur more frequently than others; many uncommon codons → gene may not be actively transcribed
- homology search: gene identification using their similarity to known genes from other species, works only with conserved genes
- association with promoter elements: characteristic sequences upstream of the ORF that match known transcription factor-binding sites and shed light on the physiological context in which the gene is expressed = phylogenetic footprinting
- match with transcript or protein sequences: large collection of cDNA sequences, ESTs, serial analysis of gene expression (SAGE) tags
Define the term “gene”!
gene = complete chromosomal segment responsible for making a functional product
includes structural and regulatory elements (promoter, terminator, transcription factor-binding sites, etc.)
What is an open reading frame?
a series of amino acid-coding triplets (codons) bounded by a start and a stop codon
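The definition can be turned into a small scan over the three forward reading frames. This is only a minimal sketch: it assumes the standard start codon ATG and the stop codons TAA, TAG and TGA, ignores the reverse strand, and the function name `find_orfs` and the `min_codons` cutoff are illustrative choices, not part of the notes.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=2):
    """Return (start, end, frame) for each ORF on the forward strand.

    start is the index of the A in ATG, end is one past the stop codon.
    min_codons counts the codons from the start codon up to (not including) the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the frame codon by codon.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs

example = "CCATGAAATTTTAGGG"
for start, end, frame in find_orfs(example):
    print(frame, example[start:end])
```

Real gene finders combine such a scan with the other criteria (codon bias, homology, promoter elements, transcript evidence), because short ORFs also occur by chance.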
What is codon bias?
- For most amino acids two or more codons are available in the genetic code. Some codons occur much more frequently than others.
- tRNAs for different codons are not equally abundant
- Synonymous mutations are not under neutral selection – selective pressure for the use of preferred codons
- If many uncommon codons are seen in an ORF, it may indicate that the gene is not actively transcribed
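Codon bias can be made visible by counting, for each amino acid, how often each of its synonymous codons is used in an ORF. The sketch below assumes the standard genetic code (built from the usual TCAG ordering) and an ORF containing only A, C, G and T; the function name `relative_codon_usage` is hypothetical.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def relative_codon_usage(orf):
    """Fraction of each codon among the synonymous codons used for its amino acid."""
    orf = orf.upper()
    codons = [orf[i:i + 3] for i in range(0, len(orf) - 2, 3)]
    counts = Counter(codons)
    # Total number of codons observed per amino acid (or stop, '*').
    per_aa = defaultdict(int)
    for codon, n in counts.items():
        per_aa[CODON_TABLE[codon]] += n
    return {codon: n / per_aa[CODON_TABLE[codon]] for codon, n in counts.items()}

usage = relative_codon_usage("ATGCTGCTGCTATAA")
print(usage["CTG"], usage["CTA"])  # share of each leucine codon in this toy ORF
```

Comparing such fractions with the codon usage of highly expressed genes of the same organism is one way to judge whether an ORF is rich in uncommon codons.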
What are synonymous and nonsynonymous mutations?
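Synonymous mutations change a codon without changing the encoded amino acid. They are not under neutral selection: there is selective pressure for the use of preferred codons.
Non-synonymous mutations change the encoded amino acid and are therefore exposed to selection acting on the protein.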
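Whether a single-base substitution is synonymous can be checked by translating the codon before and after the change. A minimal sketch, assuming the standard genetic code; `classify_substitution` and its arguments are illustrative names.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def classify_substitution(codon, position, new_base):
    """Classify a single-base change at position 0, 1 or 2 of a codon."""
    codon = codon.upper()
    mutated = codon[:position] + new_base.upper() + codon[position + 1:]
    if CODON_TABLE[codon] == CODON_TABLE[mutated]:
        return "synonymous"
    return "nonsynonymous"

print(classify_substitution("CTG", 2, "A"))  # CTG (Leu) -> CTA (Leu)
print(classify_substitution("GAG", 1, "T"))  # GAG (Glu) -> GTG (Val)
```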
What is an orphan gene?
Orphan genes = previously unknown genes without recognizable relatives in other genomes; every newly sequenced genome comes with 20-60% of them
What are transcription factors and what is their general function in the genome?
A transcription factor (sequence-specific DNA-binding factor) is a protein that binds to specific DNA sequences, thereby controlling the transcription of genetic information from **DNA to mRNA**. Transcription factors perform this function alone or with other proteins in a complex, by promoting (as an activator) or blocking (as a repressor) the **recruitment of RNA polymerase** to specific genes.
Name some databases which are in use for the naming and classification of genes.
KEGG: Kyoto Encyclopedia of Genes and Genomes
KOGs: Eukaryotic Orthologous Groups
the Gene Ontology
JGI: Joint Genome Institute
How does genome size correlate with the complexity of the organisms? What is C value paradox?
It does not → remarkable lack of correspondence between genome size and the organism's complexity = C value paradox
Genome size increases from viruses to prokaryotes
C value = amount of DNA (in picograms) in the haploid genome of a cell
Non-genic fraction of the DNA is responsible for the C value paradox, in eukaryotes 30-99% are non-coding DNA.
What is the C value, how is it calculated?
Measured biochemically or by flow cytometry; C value = picograms of DNA in the haploid genome per cell
G (genome size in nt) = 0.987 x 10^9 x C
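The conversion is a single multiplication, sketched below. The factor is taken as written on this card (0.987 x 10^9 per picogram); published conversion factors differ slightly (about 0.978 x 10^9 bp per pg is widely cited), so it is left as a parameter. The function name and the example C value of 3.5 pg are illustrative assumptions.

```python
# Conversion factor as given in these notes; treat it as an adjustable assumption.
NT_PER_PICOGRAM = 0.987e9

def genome_size_from_c_value(c_value_pg, nt_per_pg=NT_PER_PICOGRAM):
    """Genome size G from the C value (picograms of DNA per haploid genome)."""
    return c_value_pg * nt_per_pg

# Hypothetical C value of 3.5 pg
print(f"{genome_size_from_c_value(3.5):,.0f}")
```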
Explain what gene number and genome size are, how they differ, and how they are calculated or obtained.
Gene number = the number of chromosomal segments that are responsible for making a functional product. It has nothing to do with genome size.
G (genome size in nt) = 0.987 x 10^9 x C
C = picograms of DNA in the haploid genome per cell
Non-genic fraction of the DNA is responsible for the C value paradox.
In eukaryotes 30-99% of the genome consists of non-coding DNA
– Repetitive sequences
– Mobile elements
– Introns
– Intergenic spacers
– etc.
Why is genome size in organisms not proportional to their gene number?
In eukaryotes 30-99% of the genome consists of non-coding DNA
– Repetitive sequences
– Mobile elements
– Introns
– Intergenic spacers
– etc.
Briefly describe the types of repeated sequences in the human genome.
- simple sequence repeats
- variable number tandem repeats (VNTRs)
- highly repeated sequences at centromeric and subtelomeric regions
- segmental duplications
- transposon-derived repeats
- retroviral-like elements
- transposons
Describe the ways in which genomes become enlarged.
Global polyploidization
- global genome duplication: highly deleterious (cell division and meiosis)
- destroys the mechanisms of dosage compensation of X chromosomes
- triploidy usually leads to sterility
- an even number of chromosome sets may be a mechanism of evolutionary innovation
- common in plants, but rare in animals
Regional genome duplication
- leads to localized repeat sequences
- unequal crossing-over
Duplicative transpositions
- transposable elements (copy & paste, cut & paste)
What is polyploidy? Name some examples among animals and plants.
Organisms or cells that have more than two “sets” of chromosomes are termed polyploid.
Global genome duplication: highly deleterious (cell division and meiosis)
- Destroys the mechanism of dosage compensation of X chromosomes
- Triploidy usually leads to sterility
- An even number of chromosome sets may be a mechanism of evolutionary innovation
- Common in plants, but rare in animals: Brassica napus is an allotetraploid (2n = 38, i.e. n = 19), certain frogs are triploid, and one rodent, the viscacha rat, has been reported to be tetraploid.
What is a gene family?
Genes are categorized into families based on shared nucleotide or protein sequences, but also on protein secondary structures.
→ Phylogenomics
- prediction of gene function
- establishment and clarification of evolutionary relationships
- prediction and retracing of lateral gene transfer
If the genes of a gene family encode proteins, the term protein family is often used in an analogous manner to gene family (e.g. Pfam, PROSITE, PIRSF, PASS2, SUPERFAMILY, SCOP & CATH)
Explain the possibilities of gene functionalization in case of gene duplication.
A large number of genes are similar to each other due to their common descent from a duplication event = paralogous genes.
- subfunctionalization
- nonfunctionalization
- superfunctionalization
- neofunctionalization
Explain the differences between the paralogous and orthologous genes.
Paralogous genes = large number of genes similar to each other due to a common descent from a duplication event
Orthologous genes = genes in different species that descend from the same gene in their common ancestor (separated by speciation, not duplication)
Explain the term phylogenomics. What is it useful for?
Genes are categorized into families based on shared nucleotide or protein sequences, but also on protein secondary structure.
Phylogenetic techniques can be used for:
- Prediction of gene function
- Establishment and clarification of evolutionary relationships
- Prediction and retracing of lateral gene transfer.
What is the GC content? Where in genomes do deviations from random occurrence happen? How does the GC content correlate with the complexity of organisms? Draw a CpG island.
- Percentage of G and C bases among all bases of a genetic fragment (gene, locus, non-coding region, chromosome), or compared across species.
- A, T, G and C are not distributed randomly in DNA
- Disregarding the DNA itself, deviations from random occurrence:
– GC content is higher in coding regions than in the flanking regions of the gene
– 5’ flanking regions (promoter) are GC-richer than 3’ flanking regions
– GC content is biased over long stretches of DNA (ca. 300 kb)
- Stretches of DNA of at least 200 bp that are rich in CpG dinucleotides = CpG islands
- GC content varies between organisms (variation in selection, mutational bias and bias in recombination-associated DNA repair)
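GC content and a simple CpG island test can be computed directly from a sequence. The sketch below is illustrative only: the 200 bp minimum comes from the notes, while the other two thresholds (GC content above 50%, observed/expected CpG ratio above 0.6) are commonly used criteria assumed here, and all function names are hypothetical.

```python
def gc_content(seq):
    """Fraction of G and C bases among all bases of the sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: count(CG) * length / (count(C) * count(G))."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)

def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_obs_exp=0.6):
    # min_len is from the notes; min_gc and min_obs_exp are assumed thresholds.
    return (len(seq) >= min_len
            and gc_content(seq) > min_gc
            and cpg_obs_exp(seq) > min_obs_exp)

print(gc_content("GGCCAATT"))
print(looks_like_cpg_island("CG" * 100), looks_like_cpg_island("AT" * 100))
```

In practice such a test is applied in a sliding window along the sequence rather than to one fixed fragment.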
What is synteny? Explain the term on one example.
- Genetics: two loci located on the same chromosome
- Genomics: a series of genes is arranged in the same order in different genomes
- Passarge et al. (1999): Colinearity
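The genomics sense of synteny (colinearity) can be illustrated by looking for runs of genes that appear in the same consecutive order in two genomes. This is a toy sketch under strong assumptions: gene names are unique one-to-one orthologs, only strictly adjacent genes in the same orientation count, and the gene names in the example are invented.

```python
def collinear_blocks(order_a, order_b, min_genes=2):
    """Runs of genes that are adjacent and in the same order in both genomes."""
    pos_b = {gene: i for i, gene in enumerate(order_b)}
    blocks, current = [], []
    for gene in order_a:
        extends = (gene in pos_b and current
                   and pos_b[gene] == pos_b[current[-1]] + 1)
        if extends:
            current.append(gene)
        else:
            # Close the running block and start a new one if the gene is shared.
            if len(current) >= min_genes:
                blocks.append(current)
            current = [gene] if gene in pos_b else []
    if len(current) >= min_genes:
        blocks.append(current)
    return blocks

genome_a = ["g1", "g2", "g3", "x", "g5", "g6"]
genome_b = ["g5", "g6", "y", "g1", "g2", "g3"]
print(collinear_blocks(genome_a, genome_b))
```

Real synteny tools tolerate gaps, inversions and duplicated genes, which this sketch deliberately ignores.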
How is the order of certain genes maintained during genome recombination?
- Selective pressure acting upon the cluster as an integrated whole.
- Coherent temporal expression such as in Hox genes.
- A single locus-control region controls the expression of a group of genes, so moving a gene out of the cluster carries a severe selective disadvantage (e.g. beta-globin cluster)
- Interdigitation of regulatory elements = regulatory elements may be physically linked to nearby genes, e.g. located in their introns
Explain diverse applications of genome sequencing.
- medicine
- microbes for energy and environment
- bioanthropology
- agriculture, livestock, breeding, bioprocessing
- DNA identification
List some facts about the human genome.
- 3 billion basepairs
- human genome is 99.9% the same in all people
- only about 2% of the human genome contains genes, which are instructions for making proteins
- humans have an estimated 30,000 genes; the functions of more than half of them are unknown
- almost half of all human proteins share similarities with those of other organisms, underscoring the unity of life
Explain how you would obtain a sequence of a gene for which you don’t have the genome? Explain several approaches.
- Hierarchical sequencing of a genome
- construction of whole-genome clone library
- chromosome libraries
- cDNA clone libraries
- Shotgun sequencing
- Hierarchical shotgun sequencing for large genomes
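The principle behind shotgun sequencing, reconstructing a longer sequence from overlapping random reads, can be shown with a toy greedy assembler. This is a sketch for illustration only: it assumes error-free reads from one strand, the minimum overlap of 3 bases and the example reads are invented, and real assemblers use far more robust methods.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # no overlaps left: remaining reads stay separate contigs
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads

print(greedy_assemble(["ATGCGT", "CGTACC", "ACCTTG"]))
```

Hierarchical shotgun sequencing applies the same idea clone by clone, which keeps repeats from different genome regions from being merged by mistake.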