Bioinformatics Flashcards

1
Q

What is bioinformatics?

A

The essential computational technology of storing and analysing biological data

2
Q

Why do we sequence nucleic acids?

A

To deduce amino acid sequences and, to an extent, the structure and function of proteins; to infer evolutionary relationships from sequence comparisons; and to obtain information about mutations that cause inherited disease

They are duplicated, modified and expressed in order to do this

3
Q

What is the overall strategy for sequencing a nonidentical polymer?

A
  1. Cleave the polymer into fragments that are small enough to be fully sequenced (restriction endonucleases - sticky ends)
  2. Determine the sequence of residues in each fragment
  3. Determine the order of the fragments in the original polymer by aligning fragments that contain overlapping sequences
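
Step 3 above can be sketched in a few lines of Python. This is only an illustration of ordering fragments by their overlaps, not a real assembler: the function names (overlap, greedy_assemble), the min_len threshold and the example fragments are all made up for this card, and the greedy "merge the largest overlap first" strategy is one simple choice among many.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that is also a prefix of b (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of fragments with the largest overlap.

    Returns the remaining contigs (a single string if everything overlaps).
    """
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no overlaps left, so the remaining fragments stay separate
        merged = frags[best_i] + frags[best_j][best_n:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags


if __name__ == "__main__":
    # Hypothetical fragments of one short sequence.
    print(greedy_assemble(["ATGCGTAC", "GTACGTTA", "GTTAGC"]))
```

Real data also contains sequencing errors and repeats, so exact suffix/prefix matching like this would not be enough in practice.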
4
Q

What is the traditional DNA sequencing method? (the analysis of the DNA)

A

Chain-Terminator method or Sanger method
Uses an E. coli enzyme to make complementary copies of the single-stranded DNA being sequenced

DNA polymerase I uses a single DNA strand as a template and assembles dNTPs into a complementary polynucleotide chain in the 5’ to 3’ direction
A small amount of ddNTP (which lacks the 3’-OH group) is included; when this analog is incorporated, chain growth is terminated

5
Q

How can we visualise the sequenced DNA in the Sanger method?

A

The chain terminators (ddNTPs) are labelled by coloured fluorescent dyes

The generated set of DNA fragments, each differing in length by one nucleotide, is separated by gel electrophoresis and passed through a detector that reads the fluorescence

6
Q

What is done at the end of Sanger sequencing?

A

The reads that have been sequenced must be correctly assembled into the original strand of DNA
The reads are compared with overlapping sequences
OR
sonication - fragments are generated by exposing a solution of stiff double-stranded DNA to high-frequency sound waves

7
Q

What are some computational problems for genome sequencing?

A

Base calling - finding the peaks
Assigning a quality score to the base
Assembling longer sequences: bacterial genomes, chromosomes (this requires designing efficient algorithms)
The algorithms need to account for sequencing errors and false overlaps, which become more likely with more fragments and larger genomes
Assembly needs a lot of computing power
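
The card does not say which quality score is meant. A minimal sketch, assuming the widely used Phred scale, where Q = -10 * log10(P) and P is the estimated probability that the base call is wrong (the function names are made up for this card):

```python
import math


def phred_quality(error_prob: float) -> float:
    """Phred-scaled quality: Q = -10 * log10(P), with P the base-call error probability."""
    if not 0.0 < error_prob <= 1.0:
        raise ValueError("error_prob must be in (0, 1]")
    return -10.0 * math.log10(error_prob)


def error_probability(quality: float) -> float:
    """Inverse of phred_quality: P = 10 ** (-Q / 10)."""
    return 10.0 ** (-quality / 10.0)


if __name__ == "__main__":
    for p in (0.1, 0.01, 0.001):
        print(p, round(phred_quality(p), 1))
```

On this scale a higher score means a more reliable base call: each extra 10 points corresponds to a tenfold lower error probability.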

8
Q

How is a protein separated into its subunits for sequencing?

A

The fluorescent compound dansyl chloride reacts with primary amines to form dansylated polypeptides; this labels each N-terminus, which reveals the number of different types of subunits

High heat and aqueous acid liberate the N-terminal residue, which can be chromatographically separated from the other free amino acids

Mercaptoethanol is then used to break the disulphide bridges (iodoacetate is added to prevent them reforming)

9
Q

Once proteins are separated into subunits, what happens to the polypeptide chains?

A

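They are cleaved/fragmented:
Endopeptidases - enzymes that catalyze the hydrolysis of internal peptide bonds
Exopeptidases - enzymes that catalyze the hydrolytic removal of N- or C-terminal residues

e.g. trypsin (an endopeptidase that cleaves after Lys or Arg); cyanogen bromide (a chemical reagent, not an enzyme, that cleaves after Met)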

10
Q

How are proteins sequenced to find their amino acid sequences?

A

Edman degradation
This removes a peptide’s N-terminal amino acid residue, leaving the rest of the polypeptide chain intact
The released PTH-amino acid is then identified by chromatography

The fragments are then ordered by comparing the amino acid sequences of overlapping peptide fragments
Finally, the disulphide bond locations are identified

11
Q

What is gene finding?

A

The task of finding protein-coding sequences in genomic DNA

Most protein sequences are now deduced from sequenced DNA, because proteins are much harder to sequence directly
Gene finding is harder in eukaryotes than prokaryotes due to introns
Larger amounts of DNA lead to uncertainty in predicted gene numbers

12
Q

What are some next-generation sequencing technologies?

A

Pyrosequencing - one DNA molecule is immobilised per microscopic plastic bead
It is amplified, and a complementary strand is then grown (primer, polymerase & dNTP added)
When a dNTP is incorporated, luciferase generates a flash of light, and a detector records which dNTP produced it

Illumina sequencing - DNA segments are attached to a glass plate and amplified
Fluorescently labelled dNTPs are added with primer and polymerase
A laser is used to identify which fluorescent group has been incorporated

13
Q

What stores nucleotide sequences?

A

Databases e.g. GenBank

The hardest part isn’t the DNA sequencing but the assembly of millions of reads in the correct order

14
Q

What is metagenomic sequencing?

A

The DNA sequences of multiple organisms are analyzed as a single dataset
This is used to characterise complex interdependent microbial communities
e.g. microbiome of the human gut

15
Q

What did we discover from sequencing the human genome?

A

Half the genome consists of repeating sequences
80% of the genome is transcribed into RNA
Only about 21,000 protein-encoding genes, AKA open reading frames (ORFs), making up about 1.2% of the genome
A very small fraction of human proteins are unique to us
Two randomly selected human genomes differ by 1 nucleotide per 1000 on average (99.9% identical)

16
Q

What does evolution result from?

A

Sequence mutations:
Point mutations - single-nucleotide errors (from mispairing of bases in DNA replication)
Recombination - exchange of DNA between chromosomes
Transposition - movement of genes within/between chromosomes

An altered gene is transcribed into altered mRNA -> a different protein, which may have properties that confer an advantage

17
Q

What are SNPs?

A

Single nucleotide polymorphisms:
Single-base variations in the genome (essentially copying errors); individuals differ at around 1 base in 1200, so we all carry individual genetic information

They are associated with disease and other disorders
About 10 million have been cataloged

18
Q

What has evolution brought us?

A

The 3 domains of life: bacteria, archaea and eukarya (phylogenetic tree)

Diversity of species has arisen from: random variation, selection, divergence and speciation
Descent from a common ancestor is most obvious as similarity at the molecular level
Similarity in core metabolic enzymes covers the whole tree of life

19
Q

What is alignment?

A

Similarity analysis of sequences begins with alignment
We can see the evolutionary relationship between sequences by identifying the aligned bases

You can identify substitution, insertion and deletion mutations of bases
Insertions and deletions are collectively called indels, and appear as gaps in the alignment

20
Q

What terms describe the type of mutations found within alignment?

A

Neutral drift - Mutations being accepted if they don’t affect the protein function

Negative selection - Mutations being selected against if they cause harm to the protein function

Positive selection - Mutations are accepted if they introduce a useful protein function

21
Q

How can we determine close/far aligned sequences?

A

You can work out a score for aligned sequences
A higher score indicates a closer evolutionary relationship
Gaps result from the insertion and deletion of bases
A simple scoring scheme:
Aligned identical base = +1
Aligned non-identical base = 0
A gap = -1

Finding the best alignment by hand is feasible only for short sequences; for longer ones we use dynamic programming (computer algorithms) to find the best alignment
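
The scoring scheme above (+1 / 0 / -1) can be tried out directly. The sketch below first scores an alignment that has already been made, then uses a standard dynamic-programming recurrence (the Needleman-Wunsch global alignment algorithm, chosen here as an example because the card does not name a specific algorithm) to find the best achievable score. The function names and example sequences are made up for illustration.

```python
MATCH, MISMATCH, GAP = 1, 0, -1  # scoring scheme from the card


def score_alignment(a: str, b: str) -> int:
    """Score two already-aligned strings of equal length ('-' marks a gap)."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            total += GAP
        elif x == y:
            total += MATCH
        else:
            total += MISMATCH
    return total


def best_global_score(s: str, t: str) -> int:
    """Best global alignment score of s and t, computed by dynamic programming."""
    # prev[j] holds the best score for aligning the processed prefix of s with t[:j].
    prev = [j * GAP for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        cur = [i * GAP]
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (MATCH if s[i - 1] == t[j - 1] else MISMATCH)
            cur.append(max(diag, prev[j] + GAP, cur[j - 1] + GAP))
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    print(score_alignment("GATTACA", "GA-TACA"))
    print(best_global_score("GATTACA", "GATACA"))
```

Only the score is returned here; recovering the alignment itself would need a traceback through the full table, which this sketch leaves out to stay short.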

22
Q

What is the PAM?

A

Point accepted mutation
The substitution of a single amino acid that has been accepted by natural selection

A PAM matrix is a matrix where each column and row represents one of the twenty standard amino acids
PAM matrices are regularly used as substitution matrices to score sequence alignments for proteins

23
Q

What are PAM scores based on?

A

The scores are based on the calculation of probabilities
P(D→E) - the probability that a D mutates to an E in a fixed evolutionary time
Calculated for all possible pairs of amino acids

The scores in the PAM matrices are not probabilities themselves
They compare the probability of 2 amino acids being aligned as a result of evolutionary mutation with the probability of them being aligned by chance (a log-odds ratio)
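
The comparison described above is usually expressed as a log-odds ratio. A minimal sketch, assuming a 10 * log10 scaling with rounding to the nearest integer; the function name and the probabilities used are hypothetical, not values from a real PAM matrix:

```python
import math


def log_odds_score(p_related: float, p_chance: float, scale: float = 10.0) -> int:
    """Rounded log-odds score: scale * log10(p_related / p_chance).

    p_related: probability of seeing the pair aligned because of evolutionary mutation
    p_chance:  probability of seeing the pair aligned by chance
    """
    if p_related <= 0 or p_chance <= 0:
        raise ValueError("probabilities must be positive")
    return round(scale * math.log10(p_related / p_chance))


if __name__ == "__main__":
    # Hypothetical probabilities, for illustration only.
    print(log_odds_score(0.02, 0.002))  # pair seen more often than chance predicts
    print(log_odds_score(0.001, 0.01))  # pair seen less often than chance predicts
```

A positive score means the pair is aligned more often in related sequences than chance would predict, a negative score means less often, and zero means no difference.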

24
Q

What PAMs are often used?

A

PAM250 and PAM120

PAM250 - an evolutionary distance of 250 PAMs

25
Q

What is another series of substitution matrices?

A

BLOSUM

Considered better than PAM because its scores are derived directly from observed alignments rather than extrapolated from closely related sequences

26
Q

What is the value of PAM/BLOSUM matrices?

A

Evolutionary relationships can be detected at the level of the protein sequence when the coding DNA sequence has changed so much that the relationship is undetectable

This is because accepted substitutions tend to preserve amino acid physico-chemical properties, so the protein remains stable and functional
We can detect more distant evolutionary relationships with protein sequences

27
Q

What is a tool to search databases?

A

BLAST - basic local alignment search tool

It searches a database and reports scores and statistics such as E-values (expectation values)
E-values give a measure of statistical significance
Smaller E-value = the match is less likely to have arisen by chance, i.e. stronger evidence that the sequences are evolutionary relatives

Iterative searches (repeating the search with intermediate sequences) should be used to find distant relatives
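
The card does not give a formula for the E-value. As a rough sketch, the commonly quoted relation between a normalised bit score S' and the E-value is E = m * n * 2^(-S'), where m is the query length and n is the total database length. The function name and the numbers below are made up, and real BLAST uses corrected "effective" lengths, which are ignored here.

```python
def expect_value(bit_score: float, query_len: int, db_len: int) -> float:
    """Approximate E-value from a bit score: E = m * n * 2 ** (-S').

    query_len (m) and db_len (n) are raw lengths; edge-effect corrections are ignored.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Hypothetical search: a 300-residue query against a 10-million-residue database.
    for score in (20, 40, 60):
        print(score, expect_value(score, 300, 10_000_000))
```

The same score is less significant in a larger database, and each extra bit of score halves the E-value.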

28
Q

What are some issues with BLAST?

A

Proteins are often made of domains – e.g. five different proteins may all contain a homeobox (HBX) domain
BLAST detects local sequence similarity
A match might not cover the entire sequence (just a domain)
Such proteins do not have the same function but BLAST would find similarity in their HBX domains

29
Q

What is homology?

A

Homology means descended from a common ancestor by divergent evolution
Two such sequences are said to be homologs or homologous
You can’t have a % homology
Significant sequence similarity may be taken as evidence of possible homology

30
Q

What causes divergent evolution of protein families?

A

Speciation – the emergence of new (reproductively isolated) species - genes evolve independently

Gene duplication – a process that produces redundant copies of genes in the genome
Many become pseudogenes - not fulfilling the original purpose

31
Q

What are the types of homolog?

A

Orthologs – homologous genes in different organisms that have the same function

Paralogs – homologous genes in the same organism, often with different functions

32
Q

Why do proteins evolve at different rates?

A

Effect of amino acid changes on the protein’s function
Protein’s structural ability
Use of domains from other proteins

33
Q

What is the PDB?

A

Protein Data Bank

Stores macromolecular structures; structural bioinformatics deals with how these structures are displayed and compared

34
Q

What tools help facilitate the classification and comparison of protein structures?

A

CATH - Class, Architecture, Topology and Homologous superfamily
CE - Combinatorial Extension
Pfam - Protein families
SCOP - Structural Classification Of Proteins
VAST - Vector Alignment Search Tool

35
Q

What is genomics?

A

The study of organisms’ genomes

36
Q

What is the C-value paradox?

A

The amount of genetic material in an organism roughly parallels the complexity of its morphology and metabolism

But lungfishes and some algae have incredibly large genomes
Most of the ‘extra’ DNA is unexpressed

Here lies the paradox

37
Q

What is an open reading frame?

A

A sequence that potentially codes for a protein (a candidate protein-coding gene)

It is not interrupted by STOP codons, but ends with a STOP codon
In eukaryotic genes, exons are relatively short in comparison to introns
Orphan genes - ORFs with no known function
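
A simple ORF scan can be sketched as follows. This is a toy example under stated assumptions: it reads only the forward strand, treats ATG as the only start codon, ignores introns (so it suits prokaryote-like sequences), and the function name, min_codons threshold and test sequence are made up for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 2) -> list[str]:
    """Return forward-strand ORFs: ATG, then codons with no stop, then a stop codon.

    min_codons is the minimum number of codons before the stop (start codon included).
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):  # the three forward reading frames
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append(dna[start:i + 3])
                start = None  # close it and keep scanning this frame
    return orfs


if __name__ == "__main__":
    print(find_orfs("CCATGAAATTTTAGGG"))
```

A fuller gene finder would also scan the reverse complement and, for eukaryotes, would have to model exon/intron junctions, which is why gene finding is harder there.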

38
Q

Why is eukaryotic gene finding difficult?

A
It is difficult to find exons amongst the introns
This is why there is substantial uncertainty in the number of human protein-coding genes
De novo (ORF-based) gene finding is also used in eukaryotes, but it is less accurate than in prokaryotes
Junctions between exons and introns are not identified reliably
39
Q

What is non-coding DNA made up of?

A

Repeated sequences

Unique sequences

40
Q

What are the repeated sequences of DNA made up of?

A

Transposons:
LINEs - long interspersed nuclear elements, molecular parasites that have accumulated mutations
SINEs - short interspersed nuclear elements
Retrotransposons with long terminal repeats (LTRs)

As these sequences are unexpressed, they accumulate polymorphisms much faster

41
Q

What is systems biology?

A

Collecting and integrating enormous amounts of data in searchable databases so the properties and dynamics of entire biological networks can be analyzed

42
Q

What are DNA microarrays or DNA chips?

A

They help create an accurate picture of gene expression, which is the goal of transcriptomics (studying a cell’s transcriptome)

43
Q

What is everything we can study within a cell?

A

Genome
Transcriptome
Proteome
Metabolome