Bioinformatics Flashcards
What is bioinformatics?
The essential computational technology of storing and analysing biological data
Why do we sequence nucleic acids?
To deduce amino acid sequences and, to an extent, the structure and function of proteins; to infer evolutionary relationships from sequence comparisons; and to identify the mutations that cause inherited disease
Nucleic acids are duplicated, modified and expressed in order to do this
What is the overall strategy for sequencing a nonidentical polymer?
- Cleave the polymer into fragments that are small enough to be fully sequenced (restriction endonucleases - sticky ends)
- Determine the sequence of residues in each fragment
- Determine the order of the fragments in the original polymer by aligning fragments that contain overlapping sequences
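A minimal sketch of the last step (ordering fragments by their overlaps), assuming error-free fragments from one strand. The function names and the greedy merging strategy are illustrative choices, not something the cards specify; real assemblers are far more elaborate.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that is also a prefix of b.

    Returns 0 if the best overlap is shorter than min_len.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two fragments with the largest overlap.

    Returns the remaining contigs (a single string if everything joins up).
    """
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no usable overlaps left
        merged = frags[best_i] + frags[best_j][best_n:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags


# Three hypothetical overlapping fragments
print(greedy_assemble(["ATGGCGT", "GCGTACG", "TACGTTA"]))
```

The min_len threshold is there because very short overlaps occur by chance; raising it reduces false joins at the cost of missing genuine short overlaps.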
What is the traditional DNA sequencing method? (the analysis of the DNA)
Chain-Terminator method or Sanger method
Uses an E. coli enzyme, DNA polymerase I, to make complementary copies of the single-stranded DNA being sequenced
The polymerase uses a single DNA strand as a template and assembles dNTPs into a complementary polynucleotide chain in the 5’ to 3’ direction
A small amount of ddNTP (which lacks a 3’-OH group) is included; when this analog is incorporated, chain growth is terminated
How can we visualise the sequenced DNA in the Sanger method?
The chain terminators (ddNTPs) are labelled by coloured fluorescent dyes
So the generated set of DNA fragments, each differing in length by one nucleotide, is separated by gel electrophoresis and passed through a detector to visualise the fluorescence
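A toy simulation of why this works, assuming an idealised reaction in which termination happens once at every position and there are no errors. The function names are hypothetical. Each fragment is represented only by its length and the labelled ddNTP at its end; sorting by length (what the gel does) and reading the labels recovers the sequence of the newly made strand.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def sanger_fragments(template_3_to_5: str) -> list[tuple[int, str]]:
    """Return (fragment length, terminal labelled base) for every stop position.

    The template is written 3'->5', so its complement reads 5'->3'.
    """
    copy = "".join(COMPLEMENT[base] for base in template_3_to_5)
    return [(i + 1, copy[i]) for i in range(len(copy))]


def read_gel(fragments: list[tuple[int, str]]) -> str:
    """Order fragments shortest to longest and read off the terminal labels."""
    return "".join(base for _, base in sorted(fragments))


fragments = sanger_fragments("TACGGA")
fragments.reverse()  # the reaction tube holds the fragments in no particular order
print(read_gel(fragments))
```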
What is done at the end of Sanger sequencing?
The reads that have been sequenced must be correctly assembled into the original strand of DNA
The reads are ordered by comparing their overlapping sequences
The fragments themselves are generated with restriction endonucleases OR
sonication - fragments are generated by exposing a solution of stiff double-stranded DNA to high-frequency sound waves
What are some computational problems for genome sequencing?
Base calling - finding the peaks
Assigning a quality score to the base
Assembling longer sequences: bacterial genomes, chromosomes (this requires designing efficient algorithms)
The algorithms need to account for sequencing errors and false overlaps (which become more likely with more fragments and larger genomes)
Assembly needs a lot of computing power
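The cards do not name a scoring scheme for base quality. A common one is the Phred scale, Q = -10 log10(P), where P is the estimated probability that the base call is wrong; the sketch below assumes that scale, and the function names are illustrative.

```python
import math


def phred_quality(p_error: float) -> float:
    """Convert an error probability (0 < p <= 1) to a Phred quality score."""
    return -10 * math.log10(p_error)


def error_probability(q: float) -> float:
    """Convert a Phred quality score back to an error probability."""
    return 10 ** (-q / 10)


for p in (0.1, 0.01, 0.001):
    print(p, round(phred_quality(p), 1))
```

On this scale every extra 10 points means a tenfold lower chance that the call is wrong.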
How is a protein separated into its subunits for sequencing?
The fluorescent compound dansyl chloride reacts with primary amines to form dansylated polypeptides; this identifies the N-terminal residues and so reveals the number of different types of subunit
High heat and aqueous acid liberate the dansylated N-terminal residue, which can be chromatographically separated from the other free amino acids
Mercaptoethanol is then used to break the disulphide bridges (iodoacetate is added to prevent them reforming)
Once proteins are separated into subunits, what happens to the polypeptide chains?
They are cleaved/fragmented:
Endopeptidases - enzymes that catalyze the hydrolysis of internal peptide bonds
Exopeptidases - catalyze the hydrolysis of N- or C-terminal residues
e.g. trypsin (an endopeptidase); cyanogen bromide is a chemical reagent, not an enzyme, that also cleaves internal peptide bonds
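An in-silico sketch of the two example cleavages, using the usual textbook specificities, which the cards themselves do not state. Trypsin cuts on the C-terminal side of Lys (K) or Arg (R) unless the next residue is Pro (P); cyanogen bromide cuts after Met (M). The helper names and the test peptide are hypothetical, and the chemical modification of Met by cyanogen bromide is ignored.

```python
def cleave_after(peptide: str, residues: str, blocked_by: str = "") -> list[str]:
    """Split a one-letter peptide after any residue in `residues`.

    A site is skipped if the following residue is in `blocked_by`.
    """
    fragments, start = [], 0
    for i, aa in enumerate(peptide):
        is_last = i == len(peptide) - 1
        if aa in residues and not is_last and peptide[i + 1] not in blocked_by:
            fragments.append(peptide[start:i + 1])
            start = i + 1
    fragments.append(peptide[start:])
    return fragments


def trypsin(peptide: str) -> list[str]:
    return cleave_after(peptide, "KR", blocked_by="P")


def cyanogen_bromide(peptide: str) -> list[str]:
    return cleave_after(peptide, "M")


peptide = "AGKTMRPLKFMW"  # made-up sequence
print(trypsin(peptide))
print(cyanogen_bromide(peptide))
```

Because the two reagents cut at different places, the two fragment sets overlap, which is what later allows the fragments to be put in order.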
How are proteins sequenced to find their amino acid sequences?
Edman degradation
This removes a peptide’s N-terminal amino acid residue, leaving the rest of the polypeptide chain intact
The released PTH-amino acid is then identified by chromatography
The order of the peptide fragments is then worked out by comparing the amino acid sequences of overlapping fragments
Finally, the disulphide bond locations are identified
What is gene finding?
The task of finding protein-coding sequences in genomic DNA
Most protein sequences are now deduced by sequencing DNA, because proteins are much harder to sequence directly
Gene finding is harder in eukaryotes than prokaryotes due to introns
Larger amounts of DNA lead to uncertainty in predicted gene numbers
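A very simplified gene finder, assuming an intron-free (prokaryote-style) sequence and scanning only the three forward reading frames: it reports stretches that start with ATG and run to the first in-frame stop codon. The function name and the min_codons threshold are illustrative. This approach fails on eukaryotic genes interrupted by introns, which is one reason gene finding is harder there.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 2) -> list[tuple[int, int, str]]:
    """Return (start, end, sequence) for each ATG...stop stretch.

    Positions are 0-based, end is exclusive, and the stop codon is included.
    min_codons counts the codons before the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None
    return orfs


print(find_orfs("CCATGGCTTGGTAACC"))
```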
What are some next-generation sequencing technologies?
Pyrosequencing - one DNA molecule is immobilised on each microscopic plastic bead
It is amplified and a complementary strand is grown (primer and polymerase are added, then one type of dNTP at a time)
Luciferase generates a flash of light when a nucleotide is incorporated, and a detector records which dNTP produced the flash
Illumina sequencing - DNA segments attached to a glass plate and amplified
Fluorescent dNTPs are added with primer and polymerase
A laser identifies the fluorescent groups
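A toy model of the pyrosequencing readout, assuming the four dNTPs are washed over the bead one at a time in a fixed cycle and that the light signal is proportional to the number of nucleotides incorporated in that wash. The flow order, the function names and the error-free signal are all assumptions made for illustration.

```python
def flowgram(new_strand: str, flow_order: str = "TACG",
             max_cycles: int = 50) -> list[tuple[str, int]]:
    """Simulate light signals while `new_strand` (5'->3') is synthesised.

    Each entry is (dNTP added, number of bases incorporated in that flow).
    """
    signals = []
    pos = 0
    for _ in range(max_cycles):
        for base in flow_order:
            if pos >= len(new_strand):
                return signals
            count = 0
            while pos < len(new_strand) and new_strand[pos] == base:
                count += 1
                pos += 1
            signals.append((base, count))  # count 0 means no flash
    return signals


def call_bases(signals: list[tuple[str, int]]) -> str:
    """Turn the recorded flashes back into a sequence."""
    return "".join(base * count for base, count in signals)


signals = flowgram("TTAGC")
print(signals)
print(call_bases(signals))
```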
What stores nucleotide sequences?
Databases, e.g. GenBank
The hardest part isn’t the DNA sequencing but the assembly of millions of reads in the correct order
What is metagenomic sequencing?
The DNA sequences of multiple organisms are analyzed as a single dataset
This is used to characterise complex interdependent microbial communities
e.g. microbiome of the human gut
What did we discover from sequencing the human genome?
Half the genome consists of repeating sequences
80% of the genome is transcribed into RNA
Only about 21,000 protein-encoding genes, AKA open reading frames (ORFs), making up about 1.2% of the genome
Only a very small fraction of human proteins is unique to us
Two randomly selected human genomes differ by 1 nucleotide per 1000 on average (99.9% identical)
What does evolution result from?
Sequence mutations:
Point mutations - single-nucleotide errors (from mispairing of bases in DNA replication)
Recombination - exchange of DNA between chromosomes
Transposition - movement of genes within/between chromosomes
Altered transcribed mRNA -> a different protein, which may have properties that confer an advantage
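A small sketch of how a point mutation can change the encoded protein, using the standard genetic code. The DNA sequence is made up and the function names are illustrative. One substitution changes a codon for Glu (E) into one for Val (V), while another substitution in the same codon is silent.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(coding_dna: str) -> str:
    """Translate a coding strand (5'->3') until a stop codon or the end."""
    protein = []
    for i in range(0, len(coding_dna) - 2, 3):
        amino_acid = CODON_TABLE[coding_dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def point_mutate(dna: str, position: int, new_base: str) -> str:
    """Replace the base at a 0-based position."""
    return dna[:position] + new_base + dna[position + 1:]


original = "ATGGAGAAA"  # hypothetical coding sequence
print(translate(original))
print(translate(point_mutate(original, 4, "T")))  # GAG -> GTG
print(translate(point_mutate(original, 5, "A")))  # GAG -> GAA, silent
```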
What are SNPs?
Single nucleotide polymorphisms:
Single-base variations in the genome (essentially errors); they occur about once every 1200 bases, so each of us carries individual genetic information
They are associated with disease and other disorders
10 million have been cataloged
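A sketch of how single-base differences can be listed once two sequences are aligned, plus the arithmetic for scaling a per-base rate up to a whole genome. The sequences are made up, the function names are illustrative, and the genome size of 3.2 billion bases is an assumed round figure that does not come from these cards.

```python
def snp_positions(seq_a: str, seq_b: str) -> list[tuple[int, str, str]]:
    """List (0-based position, base in a, base in b) where aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


def expected_differences(genome_size: int, bases_per_difference: int) -> int:
    """Approximate number of differing bases for a given rate."""
    return genome_size // bases_per_difference


print(snp_positions("ATGCCGTA", "ATGTCGTA"))
# Assumed genome size of 3.2 billion bases, at 1 difference per 1000 bases
print(expected_differences(3_200_000_000, 1000))
```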