Genes and Genomes Flashcards
Definition
a method of DNA sequencing first commercialized by Applied Biosystems, based on the selective incorporation of chain-terminating dideoxynucleotides by DNA polymerase during in vitro DNA replication
Sanger sequencing
Definition
the material of which the chromosomes of organisms other than bacteria (i.e. eukaryotes) are composed, consisting of protein, RNA, and DNA
Chromatin
Define
Methyl-cytosine
the normal cytosine nucleotide in DNA that has been modified by the addition of a methyl group to its 5th carbon
Definition
non-autonomous, non-coding transposable elements (TEs) that are about 100 to 700 base pairs in length. They are a class of retrotransposons, DNA elements that amplify themselves throughout eukaryotic genomes, often through RNA intermediates
SINES
What is considered the fifth base in DNA?
Methyl-cytosine
Definition
a unit made up of linked genes which is thought to regulate other genes responsible for protein synthesis
Operon
Mobile genetic elements are not usually found in gene exons/introns. Examples are retrotransposons which move via a DNA/RNA intermediate
Mobile genetic elements are not usually found in gene exons. Examples are retrotransposons which move via an RNA intermediate
Where are CpG islands usually found?
Mainly at the 5’ end of genes
How many bases does the human genome contain?
3162 million bases
Define
Whole genome shotgun (WGS)
entails sequencing many overlapping DNA fragments in parallel and then using a computer to assemble the small fragments into larger contigs and, eventually, chromosomes
Definition
a functional RNA molecule that is transcribed from DNA but not translated into proteins
non-coding RNA/ncRNA
What is the Whole Genome Shotgun Method?
Genomic DNA is sheared randomly into fragments before being read. This is repeated many times to ensure at least 30x read depth coverage. The reads are then reassembled into the genome sequence
Definition
an algorithm for comparing primary biological sequence information, such as the amino-acid sequences of proteins or the nucleotides of DNA and/or RNA sequences
BLAST search
What makes up junk DNA?
Pseudogenes
Mobile genetic elements (i.e. LINES, SINES, incomplete retroviral-like elements and transposon remnants)
Definition
Describing a type of messenger RNA that can encode more than one polypeptide separately within the same RNA molecule
Multicistronic
What is used to sort out the contigs given in de novo assembly?
PacBio
What is a hypothetical protein?
A predicted protein that is not similar to any characterised protein
What BLAST program is used for a protein query search in the protein database?
BLASTp
What are the major characteristics of SINES?
They do not encode reverse transcriptase, endonuclease or integrase
Definition
a text-based format for representing either nucleotide sequences or amino acid (protein) sequences, in which nucleotides or amino acids are represented using single-letter codes
FASTA
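As an illustration of the FASTA card above, here is a minimal Python sketch of reading FASTA text. It relies on the usual convention that each record starts with a header line beginning with ">" followed by one or more sequence lines. The function name `parse_fasta` and the example records are hypothetical and chosen only for this sketch.

```python
def parse_fasta(text):
    """Return a dict mapping each header (without '>') to its full sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            # Sequence lines may be wrapped, so collect and join them later
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# Hypothetical example: one nucleotide record and one protein record
example = """>seq1 hypothetical example
ATGGCC
TTAA
>seq2
MKV
"""
records = parse_fasta(example)
print(records)
```

Wrapped sequence lines are joined, so a record split over several lines comes back as one string.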
Define
non-coding RNA/ncRNA
a functional RNA molecule that is transcribed from DNA but not translated into proteins
Define
Draft genome sequence
Sequence of genomic DNA having lower accuracy than finished sequence; some segments are missing or in the wrong order or orientation
Definition
Elements that are transcribed into RNA, reverse-transcribed into DNA and then inserted into a new site in the genome
Retroviral-like elements
Definition
a text-based format for storing both a biological sequence (usually nucleotide sequence) and its corresponding quality scores
FASTQ
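As an illustration of the FASTQ card above, here is a minimal Python sketch that reads the common four-line FASTQ layout: an "@" header, the sequence, a "+" separator line and a quality string with one character per base. The function name `parse_fastq` and the example read are hypothetical, and the sketch assumes records are not wrapped over extra lines.

```python
def parse_fastq(text):
    """Return a list of (read_id, sequence, quality_string) tuples."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    records = []
    # Step through the file four lines at a time
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"Malformed FASTQ record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError("Sequence and quality string differ in length")
        records.append((header[1:], seq, qual))
    return records


# Hypothetical single-read example
example = "@read1\nGATTACA\n+\nIIIIII!\n"
reads = parse_fastq(example)
print(reads)
```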
Define
Genome annotation
the process of identifying the locations of genes and all of the coding regions in a genome and determining what those genes do
Define
Retroviral-like elements
Elements that are transcribed into RNA, reverse-transcribed into DNA and then inserted into a new site in the genome
Define
Paralogue
Either of a pair of genes that derives from the same ancestral gene
Define
SINES
non-autonomous, non-coding transposable elements (TEs) that are about 100 to 700 base pairs in length. They are a class of retrotransposons, DNA elements that amplify themselves throughout eukaryotic genomes, often through RNA intermediates
The ENCODE project is an editing/annotation approach that has built a map of functional elements within the human genome, suggesting that over 50%/70% is biologically active
The ENCODE project is an annotation approach that has built a map of functional elements within the human genome, suggesting that over 70% is biologically active
What is the genome data problem?
The ever increasing analysis gap that is occurring because our ability to analyse is not keeping up with the data available
Definition
a project that seeks to interpret the sequence of DNA that makes up the human genome
ENCODE project
What were the strategies used by HGP and Celera to sequence the human genome?
HGP used an ordered or directed strategy
Celera used a shotgun strategy
Define
Pseudogenes
a section of a chromosome that is an imperfect copy of a functional gene
What were the key findings of the ENCODE project?
Around 80% of the human genome is associated with at least one biochemical event
___________ arise by gene duplication followed by gene inactivation - contain introns
____________ are formed by integration of DNA copies of mRNA - do not contain introns
Classical pseudogenes arise by gene duplication followed by gene inactivation - contain introns
Processed pseudogenes are formed by integration of DNA copies of mRNA - do not contain introns
Definition
DNA that does not code for a protein, usually occurs in repetitive sequences of nucleotides, and does not seem to serve any useful purpose
Junk DNA
True or False:
The HGP sequence tells us nothing about the genetic variation between individuals
True
What BLAST program is used for a nucleotide query search in the protein database?
BLASTx
Definition
Either of a pair of genes that derives from the same ancestral gene
Paralogue
Why does the sequence CpG occur at a lower than expected frequency in vertebrates?
During DNA damage, deamination of unmethylated C gives rise to U, which is recognised as a fault by DNA repair machinery. Deamination of methylated C gives rise to T, which is not recognised as an error by DNA repair machinery. Over evolutionary time, methylated Cs have been mutated to T, so CpG is under-represented in vertebrate DNA
Define
CpG island
stretches of DNA 500–1500 bp long with a CG:GC ratio of more than 0.6, and they are normally found at promoters and contain the 5′ end of the transcript
How do SINES move?
Using enzymes produced by other mobile elements e.g. LINES
Definition
a set of overlapping DNA segments that together represent a consensus region of DNA
Contig

Zero
What is an unbroken consensus sequence called?
Contig
True or False:
The sequence data found in the HGP is inaccessible by regular people
False
It is publicly available
Definition
entails sequencing many overlapping DNA fragments in parallel and then using a computer to assemble the small fragments into larger contigs and, eventually, chromosomes
Whole genome shotgun (WGS)
What are the two types of Illumina sequencing? Which is faster?
HiSeq (3 days; 2000 gigabases)
MiSeq (56 hrs; 20 gigabases)
MiSeq is faster, but HiSeq yields far more data per run
Define
De novo
starting from the beginning
True or False:
Only transposon remnants are evident in the human genome
True
Definition
a section of a chromosome that is an imperfect copy of a functional gene
Pseudogenes
What are the similarities between a draft and a closed genome sequence?
- Both have all the genes
- Both predict the encoded proteins
- Predict function by similarity to characterised proteins
- Overview of the organism’s genetic capability
In reality, how many contigs do we get per chromosome? Why?
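You expect only one, but in reality you get many, although the whole genome sequence will still be represented. This is because there are several copies of the same sequence in the genome, and the assembler cannot place these repeats unambiguously, so contigs break at them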
What symbols indicate Bad and Excellent Phred quality scores?
Bad = !"#$%
Excellent = EFGHIJK
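To see why these symbols count as bad or excellent, the sketch below converts quality characters to Phred scores. It assumes the widely used Phred+33 encoding (score = ASCII code minus 33) and the standard relation between a score Q and the error probability, P = 10^(-Q/10). The function names are hypothetical.

```python
def phred_scores(quality, offset=33):
    """Convert a quality string to a list of Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


low = phred_scores("!#$%")      # symbols from the 'bad' end of the scale
high = phred_scores("EFGHIJK")  # symbols from the 'excellent' end
print(low, high)
print(error_probability(40))    # error probability for score 40
```

Under this encoding "!" is score 0 and "I" is score 40, which corresponds to roughly one error in 10,000 base calls.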
Define
BLAST search
an algorithm for comparing primary biological sequence information, such as the amino-acid sequences of proteins or the nucleotides of DNA and/or RNA sequences
Define
LINES
a group of non-LTR (long terminal repeat) retrotransposons which are widespread in the genome of many eukaryotes
Definition
a transposon whose sequence shows homology with that of a retrovirus
Retrotransposons
What are the major characteristics of non-retroviral retrotransposons (LINES)?
They have a promoter and encode a protein with combined endonuclease and reverse transcriptase activity
Definition
a group of non-LTR (long terminal repeat) retrotransposons which are widespread in the genome of many eukaryotes
LINES
What form is each entry in the GenBank database in?
A text file containing DNA sequence data and any associated information (annotation)
Define
Chromatin
the material of which the chromosomes of organisms other than bacteria (i.e. eukaryotes) are composed, consisting of protein, RNA, and DNA
Definition
the process of identifying the locations of genes and all of the coding regions in a genome and determining what those genes do
Genome annotation
Which nucleotides can be methylated?
Cytosine but only when next to a guanine
What were the aims of the Human Genome Project?
To determine the entire nucleotide sequence of human DNA
To identify all the genes within the human genome
Why are Illumina and PacBio often used together?
Illumina provides good quality reads whereas PacBio provides good read length
On completion of the human genome project it was evident that over 50%/70%/90% of the genome does not encode protein/microRNA/tRNA, consistent with the idea of waste/garbage/junk DNA
On completion of the human genome project it was evident that over 90% of the genome does not encode protein, consistent with the idea of junk DNA
What is the read depth coverage equation?
Depth = N x L / G
N = number of reads
L = length of each read
G = estimated genome size
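The depth equation above can be checked with a short calculation. This is a minimal Python sketch. The function names and the example numbers (one million 150-base reads against a 5-million-base genome) are hypothetical and chosen only to give a round answer.

```python
import math


def read_depth(num_reads, read_length, genome_size):
    """Depth = N x L / G, as on the card above."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return num_reads * read_length / genome_size


def reads_needed(target_depth, read_length, genome_size):
    """Rearranged equation: N = Depth x G / L, rounded up to whole reads."""
    return math.ceil(target_depth * genome_size / read_length)


depth = read_depth(num_reads=1_000_000, read_length=150, genome_size=5_000_000)
print(depth)                               # 30.0
print(reads_needed(30, 150, 5_000_000))    # 1000000
```

Rearranging the same equation gives the number of reads to aim for when planning a run at a chosen depth, such as the 30x figure mentioned elsewhere in the deck.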
Retrotransposons move from one point to another in the genome via what?
RNA intermediates
Define
Processed pseudogenes
a type of pseudogene that is copied from messenger RNA and incorporated into the chromosome
Define
Junk DNA
DNA that does not code for a protein, usually occurs in repetitive sequences of nucleotides, and does not seem to serve any useful purpose
Define
Mobile genetic elements
DNA sequences that can move around the genome, changing their number of copies or simply changing their location, often affecting the activity of nearby genes
Definition
a fluorescent chemical compound that can re-emit light upon light excitation
Fluorophores
Define
FASTA
a text-based format for representing either nucleotide sequences or amino acid (protein) sequences, in which nucleotides or amino acids are represented using single-letter codes
Define
Reference genome
a digital nucleic acid sequence database, assembled by scientists as a representative example of a species’ set of genes
Define
Retrotransposons
a transposon whose sequence shows homology with that of a retrovirus
Define
Fluorophores
a fluorescent chemical compound that can re-emit light upon light excitation
What are the two classes of retrotransposons?
Retroviral-like
Non-retroviral-like
Define
Multicistronic
Describing a type of messenger RNA that can encode more than one polypeptide separately within the same RNA molecule
Definition
the normal cytosine nucleotide in DNA that has been modified by the addition of a methyl group to its 5th carbon
Methyl-cytosine
Define
Contig
a set of overlapping DNA segments that together represent a consensus region of DNA
Definition
one of two or more homologous gene sequences found in different species
Orthologue
How many genes are in the human genome?
Between 20000 and 25000
Define
Orthologue
one of two or more homologous gene sequences found in different species
Definition
Sequence of genomic DNA having lower accuracy than finished sequence; some segments are missing or in the wrong order or orientation
Draft genome sequence
What percentage of the genome encodes proteins?
2%
What does DNA methylation do?
Helps turn genes off by altering chromatin structure
Definition
a digital nucleic acid sequence database, assembled by scientists as a representative example of a species’ set of genes
Reference genome
Define
Amplicon
a piece of DNA or RNA that is the source and/or product of amplification or replication events
Definition
stretches of DNA 500–1500 bp long with a CG:GC ratio of more than 0.6, and they are normally found at promoters and contain the 5′ end of the transcript
CpG island
Definition
a piece of DNA or RNA that is the source and/or product of amplification or replication events
Amplicon
Definition
a chromosomal segment that can undergo transposition, especially a segment of bacterial DNA that can be translocated as a whole between chromosomal, phage, and plasmid DNA in the absence of a complementary sequence in the host DNA
Transposon
What is the name of the modified nucleotides used in Sanger sequencing?
Dideoxy nucleotides
Definition
starting from the beginning
De novo
Define
ENCODE project
a project that seeks to interpret the sequence of DNA that makes up the human genome
True or False:
Retroviral-like retrotransposons do not encode coat proteins
True
A query protein is 26% identical to a guide protein. What can we say about these two proteins?
They might have similar functions
Define
Transposon
a chromosomal segment that can undergo transposition, especially a segment of bacterial DNA that can be translocated as a whole between chromosomal, phage, and plasmid DNA in the absence of a complementary sequence in the host DNA
Define
Read coverage depth
the number of unique reads that include a given nucleotide in the reconstructed sequence
Definition
a type of pseudogene that is copied from messenger RNA and incorporated into the chromosome
Processed pseudogenes
Definition
the number of unique reads that include a given nucleotide in the reconstructed sequence
Read coverage depth
Define
Sanger sequencing
a method of DNA sequencing first commercialized by Applied Biosystems, based on the selective incorporation of chain-terminating dideoxynucleotides by DNA polymerase during in vitro DNA replication
Definition
DNA sequences that can move around the genome, changing their number of copies or simply changing their location, often affecting the activity of nearby genes
Mobile genetic elements
What are the components of Illumina sequencing?
‘Blocked’ nucleotides
Oligonucleotide primer
ssDNA template
DNA polymerase
What proportion of nucleotides are identical in all people?
99%
What can you say about proteins that are over 35% identical to a guide protein?
They probably have a related function
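The percentages on these cards refer to how many aligned positions two proteins share. Below is a minimal Python sketch of that calculation for two sequences that have already been aligned to the same length, with "-" marking gaps. It assumes identity is counted as identical positions divided by alignment length. Other tools may use a different denominator, and the function name and example sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of alignment columns where both sequences have the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("Aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("Alignment is empty")
    # A gap paired with a gap is not counted as a match
    matches = sum(1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-")
    return 100.0 * matches / len(aligned_a)


# Hypothetical four-column alignments
print(percent_identity("MKVL", "MKIL"))  # 75.0
print(percent_identity("MK-L", "MKAL"))  # 75.0
```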
The human genome comprises 3 million/billion/trillion base pairs encoding approximately 10,000/20,000/50,000 genes. The number, position and order of introns/exons/genes is identical between individuals/proteins/tRNA
The human genome comprises 3 billion base pairs encoding approximately 20,000 genes. The number, position and order of genes is identical between individuals


Define
Operon
a unit made up of linked genes which is thought to regulate other genes responsible for protein synthesis
Define
FASTQ
a text-based format for storing both a biological sequence (usually nucleotide sequence) and its corresponding quality scores
Long interspersed nuclear element 1 (LINE 1) mobile genetic elements…
Select one:
a. are derived from viruses
b. encode enzymes essential for their replication
c. are found only in A / T rich regions of the genome
d. emerged from the genomes of ancient parasitic bacteria
Gene annotation is the process of…
Select one:
a. manually sequencing “difficult” regions of the human genome
b. depositing new nucleotide sequence data in a public database
c. adding information on biological function to a nucleotide sequence file
d. deleting redundant data files
When completed in 2003, the Human Genome Project lacked information on the…
Select one:
a. order of genes in the human genome
b. approximate number of genes in the human genome
c. approximate number of alleles in the human genome
d. percentage of protein-encoding genes in the human genome
What is the current thinking about junk DNA?
Select one:
a. It serves no useful purpose
b. It consists of non-functional ancestral genes
c. It makes up less than 10% of the human genome
d. It is largely made up of mobile genetic elements
The figure below represents a visual overview of a part of the DNA sequence of a bacterial genome (approximate base range 2400 to 8000). The overview is produced using the Artemis software and shows reading frame (RF) one through to six. The short black vertical lines indicate stop codons.
The sequence for each stop codon in the sequence displayed can be…
Select one:
a. GGG only
b. any of ATG CTG GTG
c. any of TAG, TAA, TGA
d. any of UAG, UAA, UGA
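A six-frame stop codon view like the one described can be reproduced in a few lines. The sketch below is not how Artemis itself works. It is a minimal Python illustration that scans a DNA string in three forward and three reverse-complement reading frames for the standard stop codons written in the DNA alphabet (T rather than U). The function names, frame numbering and example sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def stop_positions(seq):
    """Map each frame (1, 2, 3, -1, -2, -3) to 0-based stop codon starts.

    Positions for negative frames are indices on the reverse-complement strand.
    """
    seq = seq.upper()
    strands = {1: seq, -1: reverse_complement(seq)}
    result = {}
    for sign, strand in strands.items():
        for offset in range(3):
            frame = sign * (offset + 1)
            result[frame] = [
                i
                for i in range(offset, len(strand) - 2, 3)
                if strand[i:i + 3] in STOP_CODONS
            ]
    return result


# Hypothetical 11-base example
stops = stop_positions("ATGTAACCTGA")
print(stops)
```

In this example frame 1 contains TAA at index 3 and frame 3 contains TGA at index 8, while the three reverse frames contain no stop codons.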

BLAST search…
Select one:
a. predicts protein function from the predicted 3D structure of the query protein sequence
b. is widely used to map millions of short DNA sequences onto a reference genome
c. finds sequence similar to the query sequence in the subject database
d. is a basic global alignment search tool
The Whole Genome Shotgun (WGS) method for genome sequencing…
Select one:
a. uses long read sequencing technology to produce a single read that spans the whole bacterial chromosome
b. is likely to work best when the total number of sequenced bases is the same as the predicted number of bases in the bacterial chromosome
c. is rarely used for bacterial genome sequencing
d. is an approach based on the sequencing of randomly selected fragments of the genomic DNA, that collectively cover the whole genome
For the final three questions, consider the following information:
The genome sequence of the Reference strain was determined using a combination of long-read and short-read sequencing technologies (Assembled genome sequence: one circular chromosome and no plasmids). The genome of the Mutant strain was sequenced using a short-read sequencing (Illumina, paired-end, 150 base reads). Table 1 shows all sequence differences between the Reference and Mutant strains.
Table 1. Sequence differences between strains
The phenotypic difference is that the Reference strain has a flagellum and the Mutant strain does not.
–
The initiation codon for pwpS is located:
Select one:
a. within 100 bases of position 4,684,444
b. between 100 and 300 bases from position 4,684,444
c. between 301 and 999 bases from position 4,684,444
d. more than 1,000 bases from position 4,684,444

The phenotypic difference is likely to be caused by:
Select one:
a. any one of the differences observed in protein coding regions
b. all three differences observed in protein coding regions
c. the intergenic difference
d. the intergenic difference, the difference in the pwpS gene, or both
