Bioinformatics Flashcards by Kirsty Welshman

What is bioinformatics?

The essential computational technology of storing and analysing biological data

Why do we sequence nucleic acids?

To deduce amino acid sequences and, to an extent, the structure and function of proteins; to infer evolutionary relationships from sequence comparisons; and to obtain information about mutations that cause inherited disease

They are duplicated, modified and expressed in order to do this
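
As a rough illustration of deducing an amino acid sequence from a nucleotide sequence, here is a minimal sketch. It assumes the standard genetic code, a DNA coding strand read from its first base, and one-letter amino acid codes; the function name `translate` is made up for this example.

```python
from itertools import product

# Standard genetic code, with codons ordered by base in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding-strand DNA sequence until the first STOP codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # "*" marks a STOP codon
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCTTTTAA"))  # MAF
```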

What is the overall strategy for sequencing a nonidentical polymer?

Cleave the polymer into fragments that are small enough to be fully sequenced (restriction endonucleases - sticky ends)
Determine the sequence of residues in each fragment
Determine the order of the fragments in the original polymer by aligning fragments that contain overlapping sequences
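
The last step, ordering fragments by their overlapping sequences, can be sketched with a simple greedy merge. This is only an illustration under strong assumptions (error-free fragments, a single strand, no repeats, no fragment contained in another); real assemblers are far more elaborate, and the names `overlap` and `greedy_assemble` are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of fragments with the longest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates
    while len(frags) > 1:
        best = (0, None, None)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:  # nothing overlaps any more
            break
        merged = frags[i] + frags[j][n:]
        frags = [f for k, f in enumerate(frags) if k not in (i, j)] + [merged]
    return frags


print(greedy_assemble(["ATGCGT", "CGTACC", "ACCTGA"]))  # ['ATGCGTACCTGA']
```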

What is the traditional DNA sequencing method? (the analysis of the DNA)

Chain-Terminator method or Sanger method
Uses an E. coli enzyme to make complementary copies of the single-stranded DNA being sequenced

DNA polymerase I uses a single DNA strand as a template and assembles dNTPs into a complementary polynucleotide chain in the 5’ to 3’ direction
A small amount of ddNTP (which lacks the 3’-OH group) is added; when this analog is incorporated, chain growth is terminated

How can we visualise the sequenced DNA in the Sanger method?

The chain terminators (ddNTPs) are labelled by coloured fluorescent dyes

The generated set of DNA fragments, each differing in length by one nucleotide, is separated by gel electrophoresis and passed through a detector that reads the fluorescence
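
The read-out logic can be sketched as a toy simulation. It assumes an idealised reaction in which termination happens once at every position of the new strand, and it represents each product only by its length and its labelled terminal base; the function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def terminated_fragments(template_3_to_5):
    """Return every chain-terminated product as (length, terminal base)."""
    # The new strand grows 5' to 3', complementary to the template.
    new_strand = "".join(COMPLEMENT[base] for base in template_3_to_5)
    return [(i + 1, new_strand[i]) for i in range(len(new_strand))]


def read_gel(fragments):
    """Order fragments by length and read off the labelled terminal bases."""
    return "".join(base for _, base in sorted(fragments))


fragments = terminated_fragments("TACGGA")
# The fragments are mixed in the tube; electrophoresis sorts them by size.
print(read_gel(reversed(fragments)))  # ATGCCT
```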

What is done at the end of Sanger sequencing?

The reads that have been sequenced must be correctly assembled into the original strand of DNA
The reads are compared with overlapping sequences
OR
Sonication - fragments are generated by exposing a solution of stiff double-stranded DNA to high-frequency sound waves

What are some computational problems for genome sequencing?

Base calling - finding the peaks
Assigning a quality score to the base
Assembling longer sequences: bacteria genomes, chromosomes (designing efficient algorithms)
They need to account for sequencing error (false overlaps: due to more fragments and larger genomes)
It requires a large amount of computing power
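
The cards do not say which quality scale is meant. As an assumption, the sketch below uses the widely used Phred convention, in which the quality score Q of a base call is related to its estimated error probability P by Q = -10 log10(P).

```python
import math


def phred_quality(p_error):
    """Phred-style quality score for an estimated error probability."""
    return -10 * math.log10(p_error)


def error_probability(quality):
    """Inverse mapping: error probability implied by a quality score."""
    return 10 ** (-quality / 10)


# A base call with a 1 in 1000 chance of being wrong scores about 30.
print(round(phred_quality(0.001)))
```

On this scale each step of 10 corresponds to a tenfold drop in the error probability.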

How are proteins separated into their subunits for sequencing?

A fluorescent compound dansyl chloride reacts with primary amines to form dansylated polypeptides - in order to find the N-terminus, to reveal the number of types of subunits

High heat and aqueous acid liberate the N-terminal residue, which can be chromatographically separated from the other free amino acids

Mercaptoethanol is then used to break the disulphide bridges (iodoacetate is added to prevent them from reforming)

Once proteins are separated into subunits, what happens to the polypeptide chains?

They are cleaved/fragmented:
Endopeptidases - enzymes that catalyze the hydrolysis of internal peptide bonds
Exopeptidases - catalyze the hydrolysis of N- or C-terminal residues

e.g. trypsin (an endopeptidase); chemical reagents such as cyanogen bromide are also used for cleavage

How are proteins sequenced to find their amino acid sequences?

Edman degradation
This removes a peptide’s N-terminal amino acid residue, leaving the rest of the polypeptide chain intact
The PTH-amino acid is later identified by chromatography

Then comparing amino acid sequences of the overlapping peptide fragments
Finally disulphide bond locations are identified

What is gene finding?

The task of finding protein-coding sequences in genomic DNA

Most protein sequences are now deduced by sequencing DNA, because proteins are much harder to sequence directly
Gene finding is harder in eukaryotes than prokaryotes due to introns
Larger amounts of DNA lead to uncertainty in predicted gene numbers

What are some next-generation sequencing technologies?

Pyrosequencing - one DNA molecule is immobilised per microscopic plastic bead
The DNA is amplified and a complementary strand is grown (primer, polymerase and dNTPs added)
When a dNTP is incorporated, luciferase generates a flash of light, and a detector records which dNTP produced it

Illumina sequencing - DNA segments attached to a glass plate and amplified
Fluorescent dNTPs are added with primer and polymerase
A laser identifies the fluorescent groups

What stores nucleotide sequences?

Databases e.g. GenBank

The hardest part isn’t the DNA sequencing but the assembly of millions of reads in the correct order

What is metagenomic sequencing?

The DNA sequences of multiple organisms are analyzed as a single dataset
This is used to characterise complex interdependent microbial communities
e.g. microbiome of the human gut

What did we discover from sequencing the human genome?

Half the genome consists of repeating sequences
80% of the genome is transcribed into RNA
Only about 21,000 protein-encoding genes (about 1.2% of the genome), also known as open reading frames (ORFs)
Very small fraction of human proteins are unique to us
2 randomly selected genomes differ by 1 nucleotide per 1000 on average (99.9% identical)

What does evolution result from?

Sequence mutations:
Point mutations - single-nucleotide errors (from mispairing of bases in DNA replication)
Recombination - exchange of DNA between chromosomes
Transposition - movement of genes within/between chromosomes

An altered transcribed mRNA gives a different protein, which may have properties that confer an advantage

What are SNPs?

Single nucleotide polymorphisms:
Single-base variations in the genome (essentially errors); a difference occurs at around 1 base in 1200, so we all carry individual information

They are associated with disease and other disorders
10 million have been cataloged
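
As a small illustration, single-base differences between two sequences can be listed by comparing them position by position. The sketch assumes the two sequences are already aligned and of equal length; `find_snps` is a hypothetical name, and positions are counted from 0.

```python
def find_snps(seq_a, seq_b):
    """List (position, base_in_a, base_in_b) wherever the sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(seq_a, seq_b))
        if a != b
    ]


print(find_snps("ACGTACGT", "ACGTTCGT"))  # [(4, 'A', 'T')]
```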

What has evolution brought us?

The 3 domains of life bacteria, archaea and eukarya (phylogenetic tree)

Diversity of species has arisen from: random variation, selection, divergence and speciation
Species that share a common ancestor have more obvious similarities at the molecular level
Similarity in core metabolic enzymes covers the whole tree of life

What is alignment?

Similarity analysis of sequences begins with alignment
We can see the evolutionary relationship between sequences by identifying the aligned bases

You can identify substitution, insertion and deletion mutations of bases
Insertions and deletions are called indels or gaps
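
A minimal sketch of reading mutations off an existing alignment follows. It assumes two equal-length aligned strings in which "-" marks a gap, and it describes indels relative to the first (reference) sequence; the function name is made up for this example.

```python
def classify_columns(reference, query):
    """Count matches, substitutions, insertions and deletions per column."""
    if len(reference) != len(query):
        raise ValueError("aligned sequences must have the same length")
    counts = {"match": 0, "substitution": 0, "insertion": 0, "deletion": 0}
    for r, q in zip(reference, query):
        if r == "-" and q == "-":
            continue  # a column of two gaps carries no information
        if r == "-":
            counts["insertion"] += 1  # base present only in the query
        elif q == "-":
            counts["deletion"] += 1  # base missing from the query
        elif r == q:
            counts["match"] += 1
        else:
            counts["substitution"] += 1
    return counts


print(classify_columns("ACG-TA", "AC-TTG"))
```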

What terms describe the type of mutations found within alignment?

Neutral drift - Mutations being accepted if they don’t affect the protein function

Negative selection - Mutations being selected against if they cause harm to the protein function

Positive selection - Mutations are accepted if they introduce a useful protein function

How can we determine whether aligned sequences are closely or distantly related?

You can work out the score for aligned sequences
A higher score indicates a closer evolutionary relationship
Gaps result from insertions and deletions of bases; a simple scoring scheme is:
Aligned identical base = +1
Aligned non-identical base = 0
A gap = -1
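
The scheme above can be applied column by column to an alignment that already exists. This sketch assumes equal-length aligned strings with "-" for gaps; `alignment_score` is a hypothetical name.

```python
def alignment_score(seq_a, seq_b, match=1, mismatch=0, gap=-1):
    """Score an existing alignment: +1 identical, 0 non-identical, -1 gap."""
    score = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            score += gap
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


print(alignment_score("GATTACA", "GAT-ACA"))  # 5
```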

This is easy to do by hand for short sequences, but for longer ones we use dynamic programming (computer algorithms) to find the best alignment
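
One classic dynamic programming approach to global alignment is the Needleman-Wunsch algorithm; the cards do not name a specific algorithm, so treat this as one possible illustration. The sketch uses the +1 / 0 / -1 scheme from this card and returns one best-scoring alignment (there may be several with the same score).

```python
def global_alignment(a, b, match=1, mismatch=0, gap=-1):
    """Return (best score, aligned a, aligned b) for a global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap  # a aligned against leading gaps
    for j in range(1, cols):
        score[0][j] = j * gap  # b aligned against leading gaps

    def pair(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the table: each cell is the best of diagonal, up and left moves.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Trace back from the bottom-right corner to recover one alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


best, top, bottom = global_alignment("GATTACA", "GATACA")
print(best)
print(top)
print(bottom)
```

The table has one cell per pair of prefixes, so the work grows with the product of the two sequence lengths rather than with the number of possible alignments.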

What is the PAM?

Point accepted mutation
The substitution of a single amino acid that is accepted by natural selection

A PAM matrix is a matrix where each column and row represents one of the twenty standard amino acids
PAM matrices are regularly used as substitution matrices to score sequence alignments for proteins

What are PAM scores based on?

The scores are based on the calculation of probabilities
PDE - the probability that a D mutates to an E in a fixed evolutionary time
Calculated for all possible pairs of amino acids

The scores in the PAM matrices are not probabilities themselves
They compare the probability that two amino acids are aligned because of evolutionary mutation with the probability that they are aligned by chance
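
Such a comparison is commonly expressed as a log-odds score. The sketch below assumes a scaling of 10 x log10 of the ratio, rounded to an integer; the probabilities passed in are invented purely for illustration and are not taken from any real matrix.

```python
import math


def log_odds_score(p_by_mutation, p_by_chance, scale=10):
    """Integer log-odds score comparing two probabilities of alignment."""
    return round(scale * math.log10(p_by_mutation / p_by_chance))


# Hypothetical values: pairing seen twice as often as chance predicts.
print(log_odds_score(0.02, 0.01))   # 3
# Hypothetical values: pairing seen half as often as chance predicts.
print(log_odds_score(0.005, 0.01))  # -3
```

A positive score means the pair is aligned more often than expected by chance, a negative score less often, and zero means no difference.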

What PAMs are often used?

PAM250 and PAM120

PAM250 - an evolutionary distance of 250 PAMs
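
Matrices for larger distances are usually described as extrapolations: the mutation probability matrix for 1 PAM is multiplied by itself n times to model n PAMs. The cards do not spell this out, so take it as background. The sketch uses an invented two-letter alphabet so that the arithmetic stays readable; a real matrix covers twenty amino acids.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [
        [sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def mat_power(m, n):
    """Multiply matrix m by itself n times, starting from the identity."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Hypothetical 1-PAM step: each residue has a 1% chance of changing.
TOY_PAM1 = [[0.99, 0.01], [0.01, 0.99]]
toy_pam250 = mat_power(TOY_PAM1, 250)
print(toy_pam250[0])
```

After 250 steps the two toy residues are almost equally likely, which shows how much signal is lost at large evolutionary distances.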

What is another series of substitution matrices?

BLOSUM
Considered better than PAM because its scores are not extrapolated from similar sequences

What is the value of PAM/BLOSUM matrices?

Evolutionary relationships can be detected at the level of the protein sequence when the coding DNA sequence has changed so much that the relationship is undetectable
This helps preserve amino acid physico-chemical properties, with the gene remaining stable and functional
We can detect more distant evolutionary relationships with protein sequences

What is a tool to search databases?

BLAST - basic local alignment search tool
It reports scores and statistics such as E-values (expectation values)
E-values give a measure of statistical significance
Smaller E-value = higher statistical significance that the sequences are evolutionary relatives
Should do iterative searches (repeat with intermediate sequences) for distant relatives
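
The relationship between score and E-value can be sketched with the Karlin-Altschul formula, E = K x m x n x exp(-lambda x S), where S is the raw alignment score, m the query length and n the database length. The parameter values used below are placeholders chosen for illustration only; in practice K and lambda depend on the scoring matrix and gap penalties.

```python
import math


def expect_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance hits scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


# Placeholder parameters (assumed, not taken from a real BLAST run).
LAM, K = 0.3, 0.1
for score in (40, 60, 80):
    print(score, expect_value(score, 300, 1_000_000, LAM, K))
```

The E-value falls exponentially as the score rises and grows in proportion to the size of the database searched.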

What are some issues with BLAST?

Proteins are often made of domains; for example, several different proteins can all contain a homeobox (HBX) domain
BLAST detects local sequence similarity
It might not cover the entire sequence (just a domain)
Such proteins do not have the same function, but BLAST would find similarity in their HBX domains
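
To show how a shared domain alone can produce a strong local match, here is a sketch of Smith-Waterman local alignment scoring. BLAST itself uses a faster heuristic, so this is a stand-in for the idea of local similarity rather than BLAST's method. The scoring values and the toy sequences are assumptions made for the example.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (Smith-Waterman)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Scores never drop below 0, so an alignment can start anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Two made-up proteins that share only one 8-residue "domain".
DOMAIN = "RKRGRQTY"
protein_1 = "MSSL" + DOMAIN + "PEDV"
protein_2 = "GHHW" + DOMAIN + "LLNC"
print(local_alignment_score(protein_1, protein_2))  # 16
```

The score comes entirely from the shared segment, even though the flanking regions are unrelated.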

What is homology?

Homology means descended from a common ancestor by divergent evolution
Two such sequences are said to be homologs or homologous
You can’t have a % homology
Significant sequence similarity may be taken as evidence of possible homology

What causes divergent evolution of protein families?

Speciation - the emergence of new (reproductively isolated) species; genes evolve independently
Gene duplication - a process that produces redundant copies of genes in the genome
Many become pseudogenes, not fulfilling the original purpose

What are the types of homolog?

Orthologs - homologous genes in different organisms that have the same function
Paralogs - homologous genes in the same organism, often with different functions

Why do proteins evolve at different rates?

Effect of amino acid changes on the protein's function
Protein's structural ability
Use of domains from other proteins

What is the PDB?

Protein Data Bank
Stores macromolecular structure data; structural bioinformatics deals with how these structures are displayed and compared

What tools help facilitate the classification and comparison of protein structures?

CATH - Class, Architecture, Topology and Homologous superfamily
CE - Combinatorial Extension
Pfam - Protein families
SCOP - Structural Classification Of Proteins
VAST - Vector Alignment Search Tool

What is genomics?

The study of organisms' genomes

What is the C-value paradox?

The amount of genetic material roughly parallels the complexity of an organism's morphology and metabolism
But lungfishes and some algae have incredibly large genomes
Most of the 'extra' DNA is unexpressed
Here lies the paradox

What is an open reading frame?

A protein-coding gene
Not interrupted by STOP codons, but ends with a STOP codon
Exons are relatively short in comparison to introns
Orphan genes - genes (ORFs) with no known function
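
A minimal ORF scan can be sketched as follows. It assumes that an ORF starts at an ATG codon (the card does not mention start codons), looks only at the three reading frames of the given strand, and ignores introns, so it is closer to the prokaryotic case. The name `find_orfs` and the `min_codons` threshold are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, sequence) for ORFs in the three forward frames."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate reading frame
            elif start is not None and codon in STOP_CODONS:
                # Keep the ORF if it has enough codons before the STOP.
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None
    return orfs


print(find_orfs("CCATGAAATTTTAGGG"))  # [(2, 14, 'ATGAAATTTTAG')]
```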

Why is eukaryotic gene finding difficult?

Difficult to find exons amongst the introns
This is why there is substantial uncertainty in the number of human protein-coding genes
De novo (ORF) gene finding is also used in eukaryotes, but it is less accurate than in prokaryotes
Junctions between exons and introns are not identified reliably

What is non-coding DNA made up of?

Repeated sequences
Unique sequences

What are the repeated sequences of DNA made up of?

Transposons:
LINEs - long interspersed nuclear elements; molecular parasites that have accumulated mutations
SINEs - short interspersed nuclear elements
Retrotransposons with long terminal repeats (LTRs)
As these sequences are unexpressed, they accumulate polymorphisms much faster

What is systems biology?

Collecting and integrating enormous amounts of data in searchable databases so the properties and dynamics of entire biological networks can be analyzed

What are DNA microarrays or DNA chips?

They help create an accurate picture of gene expression, which is the goal of transcriptomics (studying a cell's transcriptome)

What is everything we can study within a cell?

Genome
Transcriptome
Proteome
Metabolome

Bioinformatics Flashcards

(43 cards)