Bioinformatics Flashcards
What is bioinformatics?
The essential computational technology of storing and analysing biological data
Why do we sequence nucleic acids?
To deduce the amino acid sequences of proteins and, to an extent, their structure and function; to compare sequences and infer evolutionary relationships; and to find mutations that cause inherited disease
They are duplicated, modified and expressed in order to do this
What is the overall strategy for sequencing a nonidentical polymer?
- Cleave the polymer into fragments that are small enough to be fully sequenced (restriction endonucleases - sticky ends)
- Determine the sequence of residues in each fragment
- Determine the order of the fragments in the original polymer by aligning fragments that contain overlapping sequences
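The ordering step above can be sketched as a greedy overlap merge. This is only an illustration, not how production assemblers work: the function names, the `min_len` threshold and the example fragments are all assumptions made for the sketch.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of fragments with the largest overlap.

    Returns the remaining contigs (a single string if everything could be joined).
    """
    frags = list(fragments)
    while len(frags) > 1:
        best_size, best_i, best_j = 0, -1, -1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # no overlaps left, so the remaining pieces cannot be ordered
        merged = frags[best_i] + frags[best_j][best_size:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags


# Hypothetical fragments of one short sequence
print(greedy_assemble(["TACCTGA", "ATGGCGT", "GCGTACC"]))
```

Greedy merging can be fooled by repeated sequences, which is one reason assembly is harder than the sequencing itself.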
What is the traditional DNA sequencing method? (the analysis of the DNA)
Chain-Terminator method or Sanger method
Uses an E. coli enzyme to make complementary copies of the single-stranded DNA being sequenced
DNA polymerase I uses a single DNA strand as a template and assembles dNTPs into a complementary polynucleotide chain in the 5’ to 3’ direction
A small amount of ddNTP (which lacks the 3’-OH group) is included; when this analog is incorporated, chain growth is terminated
How can we visualise the sequenced DNA in the Sanger method?
The chain terminators (ddNTPs) are labelled by coloured fluorescent dyes
So the generated set of DNA fragments, each differing in length by one nucleotide, is separated by gel electrophoresis and passed through a detector to visualise the fluorescence
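The chain-terminator idea can be illustrated with a toy simulation. It is a simplified sketch: it ignores the primer, assumes every position is terminated in at least one copy, and all function names and the example template are made up for illustration.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def synthesized_strand(template_3_to_5: str) -> str:
    """Complementary strand (written 5'->3') for a template written 3'->5'."""
    return "".join(COMPLEMENT[base] for base in template_3_to_5)


def terminated_fragments(new_strand: str) -> list[str]:
    """One chain-terminated product per position; each ends in a labelled ddNTP."""
    return [new_strand[:i] for i in range(1, len(new_strand) + 1)]


def read_gel(fragments: list[str]) -> str:
    """Sort fragments by length (as electrophoresis does) and read each terminal base."""
    return "".join(fragment[-1] for fragment in sorted(fragments, key=len))


template = "TACGGA"  # hypothetical template, written 3'->5'
fragments = terminated_fragments(synthesized_strand(template))
print(read_gel(fragments))  # sequence of the newly synthesized strand
```

Reading the dye colour of the shortest fragment first and the longest last gives the new strand in the 5’ to 3’ direction.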
What is done at the end of Sanger sequencing?
The reads that have been sequenced must be correctly assembled into the original strand of DNA
The reads are compared with overlapping sequences
OR
Sonication - fragments are generated by exposing a solution of stiff double-stranded DNA to high-frequency sound waves
What are some computational problems for genome sequencing?
Base calling - finding the peaks
Assigning a quality score to the base
Assembling longer sequences: bacteria genomes, chromosomes (designing efficient algorithms)
They need to account for sequencing error (false overlaps: due to more fragments and larger genomes)
It requires a lot of computing power
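As an illustration of assigning a quality score to a base, the sketch below uses the widely used Phred scale, Q = -10 log10(P), where P is the estimated probability that the base call is wrong. The notes do not name a particular scale, so choosing Phred here is an assumption, as are the function names.

```python
import math


def phred_quality(error_probability: float) -> float:
    """Phred-scaled quality: Q = -10 * log10(P of a wrong base call)."""
    return -10.0 * math.log10(error_probability)


def error_probability(quality: float) -> float:
    """Inverse conversion: P = 10 ** (-Q / 10)."""
    return 10.0 ** (-quality / 10.0)


print(phred_quality(0.001))   # roughly 30: a 1 in 1000 chance of error
print(error_probability(20))  # roughly 0.01: a 1 in 100 chance of error
```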
How is a protein separated into its subunits for sequencing?
A fluorescent compound dansyl chloride reacts with primary amines to form dansylated polypeptides - in order to find the N-terminus, to reveal the number of types of subunits
High heat and aqueous acid liberate the N-terminal residue, which can be chromatographically separated from the other free amino acids
Mercaptoethanol is then used to break the disulphide bridges (iodoacetate is added to prevent them reforming)
Once proteins are separated into subunits, what happens to the polypeptide chains?
They are cleaved/fragmented:
Endopeptidases - enzymes that catalyze the hydrolysis of internal peptide bonds
Exopeptidases - catalyze the hydrolysis of N- or C-terminal residues
e.g. trypsin (an endopeptidase); cyanogen bromide is a chemical reagent that also cleaves internal peptide bonds
How are proteins sequenced to find their amino acid sequences?
Edman degradation
This removes a peptide’s N-terminal amino acid residue, leaving the rest of the polypeptide chain intact
The PTH-amino acid is later identified by chromatography
The amino acid sequences of overlapping peptide fragments are then compared to put the fragments in order
Finally disulphide bond locations are identified
What is gene finding?
The task of finding protein-coding sequences in genomic DNA
Most proteins are now found by sequencing DNA, because proteins are much harder to sequence
Gene finding is harder in eukaryotes than prokaryotes due to introns
Larger amounts of DNA lead to uncertainty in predicted gene numbers
What are some next-generation sequencing technologies?
Pyrosequencing - one DNA molecule is immobilised per microscopic plastic bead
It is amplified and a complementary strand is grown (primer, polymerase and dNTPs added)
Luciferase generates a flash of light when a dNTP is incorporated, and a detector records which dNTP produced it
Illumina sequencing - DNA segments attached to a glass plate and amplified
Fluorescent dNTPs are added with primer and polymerase
A laser identifies the fluorescent groups
What stores nucleotide sequences?
Databases e.g. GenBank
The hardest part isn’t the DNA sequencing but the assembly of millions of reads in the correct order
What is metagenomic sequencing?
The DNA sequences of multiple organisms are analyzed as a single dataset
This is used to characterise complex interdependent microbial communities
e.g. microbiome of the human gut
What did we discover from sequencing the human genome?
Half the genome consists of repeating sequences
80% of the genome is transcribed into RNA
Only 21,000 protein-encoding genes (1.2%) AKA open reading frames (ORFs)
Very small fraction of human proteins are unique to us
2 randomly selected genomes differ by 1 nucleotide per 1000 on average (99.9% identical)
What does evolution result from?
Sequence mutations:
Point mutations - single-nucleotide errors (from mispairing of bases in DNA replication)
Recombination - exchange of DNA between chromosomes
Transposition - movement of genes within/between chromosomes
Altered transcribed mRNA -> different protein may have properties that confer an advantage
What are SNPs?
Single nucleotide polymorphisms:
Single base variations in the genome (essentially errors); they occur about once every 1200 bases, so we each carry individual information
They are associated with disease and other disorders
10 million have been cataloged
What has evolution brought us?
The 3 domains of life bacteria, archaea and eukarya (phylogenetic tree)
Diversity of species has arisen from: random variation, selection, divergence and speciation
Common ancestors have more obvious similarities at the molecular level
Similarity in core metabolic enzymes covers the whole tree of life
What is alignment?
Similarity analysis of sequences begins with alignment
We can see the evolutionary relationship between sequences by identifying the aligned bases
You can identify substitution, insertion and deletion mutations of bases
Insertions and deletions are called indels or gaps
What terms describe the type of mutations found within alignment?
Neutral drift - Mutations being accepted if they don’t affect the protein function
Negative selection - Mutations being selected against if they cause harm to the protein function
Positive selection - Mutations are accepted if they introduce a useful protein function
How can we determine close/far aligned sequences?
You can work out a score for the aligned sequences
A higher score = a closer evolutionary relationship
A gap results from the insertion or deletion of bases; a simple scoring scheme is:
Aligned identical base = +1
Aligned non-identical base = 0
A gap = -1
This can be done by hand for short sequences, but for longer ones we use dynamic programming (computer algorithms) to find the best alignment
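The scoring scheme and the dynamic programming step can be sketched as follows. This is a minimal global-alignment (Needleman-Wunsch style) scorer using the +1 / 0 / -1 values from the card. The function names are illustrative, and real tools use more elaborate gap penalties.

```python
def score_aligned(a: str, b: str, match: int = 1, mismatch: int = 0, gap: int = -1) -> int:
    """Score two already-aligned strings of equal length ('-' marks a gap)."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            total += gap
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


def global_alignment_score(a: str, b: str, match: int = 1, mismatch: int = 0, gap: int = -1) -> int:
    """Best achievable global alignment score, found by dynamic programming."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap  # a prefix of a aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap  # a prefix of b aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[-1][-1]


print(score_aligned("GATTACA", "GA-TACA"))          # 6 matches and 1 gap
print(global_alignment_score("GATTACA", "GATACA"))  # best score over all alignments
```

Each cell of the table holds the best score for aligning two prefixes, so the answer is built up without trying every possible alignment.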
What is the PAM?
Point accepted mutation
The substitution of a single amino acid that is accepted by natural selection
A PAM matrix is a matrix where each column and row represents one of the twenty standard amino acids
PAM matrices are regularly used as substitution matrices to score sequence alignments for proteins
What are PAM scores based on?
The scores are based on the calculation of probabilities
P_DE - the probability that a D mutates to an E in a fixed evolutionary time
Calculated for all possible pairs of amino acids
The scores in the PAM matrices are not probabilities themselves
They compare the probability of 2 amino acids aligned from evolutionary mutation to the probability of them aligned by chance
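The comparison described above is a log-odds ratio. The sketch below assumes the common convention of taking 10 × log10 of the ratio and rounding to an integer; the probabilities used are made-up numbers, not values from a real PAM matrix.

```python
import math


def log_odds_score(p_by_mutation: float, p_by_chance: float, scale: float = 10.0) -> int:
    """Rounded log-odds score: scale * log10(P aligned by mutation / P aligned by chance)."""
    return round(scale * math.log10(p_by_mutation / p_by_chance))


# Hypothetical probabilities for illustration only
print(log_odds_score(0.02, 0.002))   # pair seen more often than chance: positive score
print(log_odds_score(0.001, 0.01))   # pair seen less often than chance: negative score
```

A positive score means the pair is aligned more often than expected by chance, a negative score less often.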
What PAMs are often used?
PAM250 and PAM120
PAM250 - an evolutionary distance of 250 PAMs
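Mutation probabilities for a distance of 250 PAMs are obtained by raising the PAM1 probability matrix to the 250th power. The sketch below shows the idea on a toy two-letter alphabet; the 2×2 matrix is invented for illustration, whereas a real PAM1 matrix is 20×20.

```python
def mat_mul(x: list[list[float]], y: list[list[float]]) -> list[list[float]]:
    """Multiply two square matrices of the same size."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def mat_pow(m: list[list[float]], power: int) -> list[list[float]]:
    """Raise a square matrix to a non-negative integer power by repeated squaring."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = [row[:] for row in m]
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result


# Toy "PAM1": each letter has a 1% chance of changing in one PAM unit (assumed values)
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]
toy_pam250 = mat_pow(toy_pam1, 250)
for row in toy_pam250:
    print([round(p, 4) for p in row])
```

After 250 steps the toy matrix is close to 50/50, which shows why higher-numbered PAM matrices suit more distantly related sequences.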
What is another series of substitution matrices?
BLOSUM
Considered better than PAM because it is built from observed alignments, with no extrapolation from closely similar sequences
What is the value of PAM/BLOSUM matrices?
Evolutionary relationships can be detected at the level of the protein sequence even when the coding DNA sequence has changed so much that the relationship is undetectable at the DNA level
The matrices reward substitutions that preserve amino acid physico-chemical properties, which leave the protein stable and functional
We can detect more distant evolutionary relationships with protein sequences
What is a tool to search databases?
BLAST - basic local alignment search tool
It will search and report scores and statistics such as E-values (expectation values)
E-values give a measure of statistical significance
Smaller E-value = higher statistical significance that the sequences are evolutionary relatives
Iterative searches (repeated with intermediate sequences) should be used to find distant relatives
What are some issues with BLAST?
Proteins are often made of domains, e.g. five different proteins may all contain a homeobox (HBX) domain
BLAST detects local sequence similarity
It might not cover the entire sequence (just a domain)
Such proteins do not have the same function, but BLAST would still find similarity in their HBX domains
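Local similarity can be illustrated with a Smith-Waterman style local alignment score. This is not the heuristic BLAST actually uses, only a sketch of what "local" means. The +1 / -1 / -1 scoring and the two example sequences (a shared stretch inside unrelated flanks) are assumptions.

```python
def local_alignment_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Best local alignment score: the highest-scoring pair of substrings."""
    best = 0
    previous = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A score never drops below 0, so an alignment can restart anywhere
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


# Hypothetical sequences sharing one "domain" (WRKQHMED) and nothing else
protein_1 = "AAAAWRKQHMEDCCCC"
protein_2 = "GGGWRKQHMEDTTT"
print(local_alignment_score(protein_1, protein_2))  # score comes from the shared stretch only
```

The high score comes entirely from the shared stretch, so a strong hit says nothing about the rest of the two sequences.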
What is homology?
Homology means descended from a common ancestor by divergent evolution
Two such sequences are said to be homologs or homologous
You can’t have a % homology
Significant sequence similarity may be taken as evidence of possible homology
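What can be given as a percentage is identity (or similarity), not homology. The sketch below computes percent identity over the aligned, gap-free columns. Conventions for the denominator differ between tools, so this choice is an assumption, as is the function name.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of gap-free alignment columns in which the two residues are identical."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"]
    if not columns:
        return 0.0
    identical = sum(1 for x, y in columns if x == y)
    return 100.0 * identical / len(columns)


print(percent_identity("ACGT", "ACGA"))        # 3 of 4 columns identical
print(percent_identity("GATTACA", "GA-TACA"))  # the gap column is ignored
```

Two sequences are either homologous or not; the percentage only measures how similar they are.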
What causes divergent evolution of protein families?
Speciation – the emergence of new (reproductively isolated) species - genes evolve independently
Gene duplication – a process that produces redundant copies of genes in the genome
Many become pseudogenes - not fulfilling the original purpose
What are the types of homolog?
Orthologs – homologous genes in different organisms that have the same function
Paralogs – homologous genes in the same organism often with different functions
Why do proteins evolve at different rates?
Effect of amino acid changes on the protein’s function
Protein’s structural ability
Use of domains from other proteins
What is the PDB?
Protein Data Bank
Stores the three-dimensional structures of macromolecules; structural bioinformatics covers how these structures are displayed and compared
What tools help facilitate the classification and comparison of protein structures?
CATH - Class, Architecture, Topology and Homologous superfamily
CE - Combinatorial Extension
Pfam - Protein families
SCOP - Structural Classification Of Proteins
VAST - Vector Alignment Search Tool
What is genomics?
The study of organisms’ genomes
What is the C-value paradox?
The amount of genetic material roughly parallels the complexity of an organism’s morphology and metabolism
But lungfishes and some algae have incredibly large genomes
Most of the ‘extra’ DNA is unexpressed
Here lies the paradox
What is an open reading frame?
A protein-coding gene
Not interrupted by STOP codons, but ends with a STOP codon
Exons are relatively short in comparison to introns
Introns - non-coding sequences that interrupt the exons of a gene
ORFs with no known function are AKA orphan genes
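The ORF definition (a run of codons not interrupted by STOP codons, ending with a STOP codon) can be sketched as a simple scan. The sketch assumes an ORF begins at an ATG start codon, searches only the given strand, and ignores introns, so it suits prokaryote-style sequences. The function name, the `min_codons` parameter and the example sequence are illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 2) -> list[str]:
    """Return ORFs (ATG ... STOP) found in the three reading frames of one strand.

    min_codons is the minimum number of codons before the STOP codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append(dna[start:i + 3])
                start = None  # look for the next ORF in this frame
    return orfs


print(find_orfs("CCATGAAATTTTAGGG"))  # hypothetical sequence with one short ORF
```

In eukaryotes the same scan is less reliable, because introns interrupt the coding sequence.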
Why is eukaryotic gene finding difficult?
Difficult to find exons amongst the introns
This is why there is a substantial uncertainty in the number of human protein coding genes
De novo (ORFs) gene finding is used in eukaryotes also but it is less accurate than in prokaryotes
Junctions between exons and introns are not identified reliably
What is non-coding DNA made up of?
Repeated sequences
Unique sequences
What are the repeated sequences of DNA made up of?
Transposons:
LINEs - long interspersed nuclear elements, molecular parasites that have accumulated mutations
SINEs - short interspersed nuclear elements
Retrotransposons with long terminal repeats (LTRs)
As these sequences are unexpressed they accumulate polymorphisms much faster
What is systems biology?
Collecting and integrating enormous amounts of data in searchable databases so the properties and dynamics of entire biological networks can be analyzed
What are DNA microarrays or DNA chips?
They help create an accurate picture of gene expression, which is the goal of transcriptomics (studying a cell’s transcriptome)
What is everything we can study within a cell?
Genome
Transcriptome
Proteome
Metabolome