Final Flashcards

1
Q

What is the phylogenetic inference problem?

A

Inferring a tree that represents the evolutionary relationships among a set of sequences, based on their DNA sequence data.

2
Q

What is a CTMC?

A

Continuous-time Markov chain (CTMC)

It is a stochastic process that runs along the branches of the tree.

3
Q

Compare first- and second-order CTMCs.

A

First: the probability of the next state depends only on the current state
Second: the probability of the next state depends on the current and the previous state

4
Q

How can a CTMC be applied to nucleotide substitution?

A

By using a substitution rate matrix that describes the history of substitutions that became fixed in a population. SEE EQUATION IN NOTES.
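
As a hedged illustration (not necessarily the equation in the notes), the simplest substitution model, Jukes-Cantor, assumes every nucleotide changes to each other nucleotide at the same rate. The sketch below uses that model's closed-form transition probabilities. The function name, the rate `alpha`, and the example values are assumptions made for illustration.

```python
import math

NUCLEOTIDES = "ACGT"

def jc69_transition_probs(alpha, t):
    """Return Jukes-Cantor transition probabilities as a dict keyed by (from, to).

    alpha is the assumed substitution rate from one nucleotide to each
    specific other nucleotide; t is the elapsed time along a branch.
    """
    decay = math.exp(-4.0 * alpha * t)
    p_same = 0.25 + 0.75 * decay  # nucleotide is the same after time t
    p_diff = 0.25 - 0.25 * decay  # nucleotide is one specific other base
    return {(a, b): (p_same if a == b else p_diff)
            for a in NUCLEOTIDES for b in NUCLEOTIDES}

probs = jc69_transition_probs(alpha=0.1, t=1.0)
# Each row of the transition matrix is a probability distribution.
row_sum = sum(probs[("A", b)] for b in NUCLEOTIDES)
```

With no elapsed time the nucleotide cannot have changed, and over very long branches every nucleotide becomes equally likely (0.25 each).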

5
Q

What are the rules for comparing nucleotide and amino acid sequences?

A

If a protein sequence is longer than 100 aa, it is considered homologous if at least 25% of residues are identical
If a nucleotide sequence is longer than 100 nt, it is considered homologous if at least 70% of bases are identical
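
A minimal sketch of how this rule of thumb could be checked for two sequences that have already been aligned. The function names are hypothetical, gaps are assumed to be written as `-`, and gap columns are ignored when counting, which is only one of several possible conventions.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of non-gap alignment columns where both sequences match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)

def passes_rule_of_thumb(aligned_a, aligned_b, is_protein):
    """Apply the >100 residues and 25% (protein) / 70% (nucleotide) rule."""
    threshold = 25.0 if is_protein else 70.0
    compared = sum(1 for a, b in zip(aligned_a, aligned_b)
                   if a != "-" and b != "-")
    return compared > 100 and percent_identity(aligned_a, aligned_b) >= threshold
```

For example, `percent_identity("ACGT", "ACGA")` gives 75.0, but a four-base alignment is far too short for the rule to apply.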

6
Q

What is BLAST?

A

Basic Local Alignment Search Tool. It is the most widely used heuristic algorithm in bioinformatics.

7
Q

What are the three goals of BLAST?

A

Speed: database sizes keep growing, so searches must stay fast
Sensitivity: must find all (or most) true matches
Specificity: all (or most) reported matches must be correct

8
Q

What is the word assumption in BLAST?

A

BLAST operates under the assumption that if two sequences are similar, they will have at least one word (short subsequence) in common.

9
Q

What are the three steps to the BLAST algorithm?

A

1) Find all possible words in the query sequence (removing those that score under the threshold)
2) Scan the databases for occurrences of these words
3) Score the resulting matches using a scoring matrix
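
The first two steps can be sketched with exact word matching, as below. This is a simplified illustration and not BLAST's actual implementation: it skips the threshold filtering and the scoring step, and the function names, word length `k`, and toy database are assumptions.

```python
def query_words(query, k):
    """Map each overlapping word of length k to its positions in the query."""
    words = {}
    for i in range(len(query) - k + 1):
        words.setdefault(query[i:i + k], []).append(i)
    return words

def find_seed_hits(query, database, k=3):
    """Return (sequence name, query position, subject position) per shared word."""
    words = query_words(query, k)
    hits = []
    for name, subject in database.items():
        for j in range(len(subject) - k + 1):
            # Look up the subject word among the query words.
            for i in words.get(subject[j:j + k], []):
                hits.append((name, i, j))
    return hits

database = {"seq1": "TTGACGTA", "seq2": "CCCCCCCC"}  # toy data
hits = find_seed_hits("GACG", database, k=3)
```

Here the query shares the words GAC and ACG with seq1, so both hits fall in seq1 and none in seq2.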

10
Q

Why does BLAST remove words scored under the threshold?

A

Low-scoring words occur commonly and are more likely to match by chance than through true homology, so excluding them saves time.

11
Q

What is the E value?

A
The number of alignments with a score of at least s that would be expected by chance.
E = expected value
m = length of query sequence
n = length of database
s = raw score
$\lambda$ and K = scaling factors
$E = K m n e^{-\lambda s}$
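
A minimal sketch of this formula in code. The two scaling factors depend on the scoring system, so the values used here are made-up placeholders, and the function and parameter names are assumptions.

```python
import math

def expect_value(raw_score, query_length, database_length,
                 k_scale, exponent_scale):
    """E value: alignments expected by chance at this raw score.

    k_scale is the multiplicative scaling factor (K) and exponent_scale
    is the scaling factor applied to the raw score in the exponent.
    """
    search_space = query_length * database_length
    return k_scale * search_space * math.exp(-exponent_scale * raw_score)

# Hypothetical scaling factors, chosen only for illustration.
e_low_score = expect_value(20, 100, 1_000_000, k_scale=0.1, exponent_scale=0.3)
e_high_score = expect_value(80, 100, 1_000_000, k_scale=0.1, exponent_scale=0.3)
```

A higher raw score gives a smaller E value, and doubling the database length doubles E for the same score.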
12
Q

What do the different E values mean?

A

< 10^-100: identical

10^-50 to 1: may be random

13
Q

What are the different types of BLAST?

A

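BLASTN: query and database are both nucleotide (nt) sequences
BLASTP: query and database are both amino acid (aa) sequences
TBLASTN: query is aa and database is nt (the database is translated)
BLASTX: query is nt and database is aa (the query is translated)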

14
Q

Describe polymerase chain reaction.

A

Denatured at 94 °C
Primers annealed at ~68 °C
Elongated with dNTPs at 72 °C

15
Q

What are the objectives of sequencing?

A

Quick, accurate, easy, and cheap

16
Q

What is Sanger sequencing?

A

Developed by Frederick Sanger, it was the first sequencing method to be automated.

17
Q

Describe Sanger sequencing.

A

DNA is amplified using PCR and fragmented. A radioactively labelled primer is added along with DNA polymerase and dNTPs. The solution is divided between four tubes, each containing a different ddNTP. Each reaction is then run on a gel. Shorter fragments migrate further, and each successive fragment is one nucleotide longer, allowing you to read the sequence.

18
Q

How can you automate Sanger sequencing?

A

Attach a fluorescent label of a different colour to each type of terminal ddNTP. Flashes of colour then specify the order of nucleotides.

19
Q

What characterizes next generation sequencing?

A

High degree of parallelization
High throughput
Low cost
Short read length

20
Q

How do you prepare a library?

A

Target sequences are fragmented to the desired length either enzymatically or physically.
Oligonucleotide adapters are added to the ends of the target fragments and the final library is quantified for sequencing.

21
Q

What are the typical steps for NGS?

A
  1. Library prep
  2. Attachment of target DNA to a surface via adaptor annealing
  3. Amplification of target DNA on surface
  4. Observation of sequencing results
22
Q

What is Illumina sequencing?

A

A type of NGS that uses the sequencing-by-synthesis method. One of the most popular methods, it uses a camera to detect the incorporation of nucleotides.

23
Q

What are the steps of Illumina sequencing?

A
  1. Prep of genomic DNA: specialized enzymes fragment the DNA and attach short adapter tags.
  2. Attachment of DNA to flow cell surface: DNA is denatured and introduced to the flow cell, where the adapters anneal to the primer lawn. The DNA binds to the primers and the complementary strand is created by DNA polymerase. The strands are denatured and the original is washed away, leaving the complementary strand attached to the cell.
  3. Bridge amplification: the complementary strand bends and anneals to another primer on the lawn. A new complementary strand is generated.
  4. Denaturation and amplification: strands are denatured and bridge amplification is repeated, resulting in clusters.
  5. Sequencing by recording flashes of light as nucleotides are incorporated.
24
Q

What is 454?

A

The first commercially successful NGS platform. It uses pyrosequencing technology, which detects the pyrophosphate released during nucleotide incorporation. The DNA fragments are attached to beads, which are placed in wells.

25
Q

What is SOLiD?

A

Another type of NGS that also uses DNA fragments attached to beads. It uses labelled segments with 2 binding nucleotides and a fluorescent code based on each dinucleotide, each pair being given a specific colour. Sequencing then occurs multiple times with primers slightly offset from each other for full sequence coverage.

26
Q

What is Ion Torrent?

A

Another NGS method based on attaching fragments to beads. The beads are placed in wells in a semiconductor chip. Every few seconds, a nucleotide solution floods the chip. If the nucleotide is incorporated, then the released H+ changes the chip voltage. The voltage change is then recorded as a nucleotide incorporation.

27
Q

What kind of data is generated from NGS?

A

Gigabytes of data of varying quality are produced, which must be assessed to remove dubious data. Illumina typically outputs data in FASTQ format.

28
Q

How is quality assessment achieved in NGS?

A

Quality is assessed from the FASTQ file, which stores each sequence as 4 lines:
1 - sequence identifier
2 - raw sequence data
3 - (+) symbol
4 - quality scores encoded as ASCII characters from ! to ~
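
A minimal sketch of reading one such four-line record. It assumes the common Phred+33 encoding, in which `!` (ASCII 33) stands for a quality score of 0. The function name and the example record are made up for illustration.

```python
def parse_fastq_record(lines):
    """Split one 4-line FASTQ record into (identifier, sequence, scores)."""
    header, sequence, separator, quality = [line.strip() for line in lines]
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    # Assumes Phred+33: subtract 33 from each character's ASCII code.
    scores = [ord(char) - 33 for char in quality]
    return header[1:], sequence, scores

record = ["@read1", "ACGT", "+", "!5I~"]  # hypothetical read
name, seq, scores = parse_fastq_record(record)
```

In this example the four quality characters decode to 0, 20, 40, and 93, which spans the full `!` to `~` range.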

29
Q

How and why is low quality data removed from NGS generated data?

A

Separate programs are used to remove low quality reads. This removal prevents low quality data from leading to incorrect reassembly of the genome and reduces the amount of data that needs processing.
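
As a rough sketch of what such a filtering step does, the code below keeps only reads whose mean quality score reaches a cutoff. Real trimming tools apply more elaborate rules. The cutoff of 20, the Phred+33 offset, and the function names are assumptions.

```python
def mean_quality(quality_string, offset=33):
    """Average quality score of a read, assuming Phred+33 by default."""
    return sum(ord(c) - offset for c in quality_string) / len(quality_string)

def filter_reads(reads, min_mean_quality=20):
    """Keep (name, sequence, quality) reads with a high enough mean quality."""
    return [read for read in reads
            if read[2] and mean_quality(read[2]) >= min_mean_quality]

# Hypothetical reads: 'I' encodes quality 40 and '!' encodes quality 0.
reads = [("good", "ACGT", "IIII"), ("bad", "ACGT", "!!!!")]
kept = filter_reads(reads)
```

Only the read named "good" survives the filter here.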

30
Q

What are the two methods of genome assembly after NGS?

A

Reference based

De Novo

31
Q

What is reference based assembly?

A

Uses previously assembled genomes to provide the scaffolding to which sequence reads can be aligned. When enough reads are aligned to the reference, minor errors in the reads are filtered out.

32
Q

What is de novo assembly?

A

Involves building a sequence from scratch when there is no known reference genome. This takes advantage of how DNA fragments can overlap to stitch reads into contigs (consensus regions of DNA). The contigs are collected into a possible scaffold and aligned to a similar genome.
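
The overlap idea can be sketched with a toy greedy assembler that repeatedly merges the two reads with the longest suffix-prefix overlap. This is only an illustration: real assemblers use overlap or de Bruijn graphs and must handle sequencing errors, repeats, and reverse complements. The function names and reads are made up.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of left that is a prefix of right."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0

def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best = (0, None, None)
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i != j:
                    size = overlap_length(left, right, min_overlap)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:
            break  # nothing overlaps any more: keep separate contigs
        merged = contigs[i] + contigs[j][size:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (i, j)]
        contigs.append(merged)
    return contigs

contigs = greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"])
```

The three toy reads overlap by four bases each and collapse into the single contig ATGGCGTACCTGA.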

33
Q

What is required to achieve de novo assembly?

A

High coverage, long reads, and good quality reads.

34
Q

What is coverage?

A

A measure of how many reads we have for a sequence. Ideally, sequencing will result in a high degree of coverage across the genome and a high depth of coverage for each base pair.
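
A small sketch of the two notions of coverage, using reads that are already placed on the genome. Mean depth is estimated as (number of reads x read length) / genome length. The function names and toy numbers are assumptions.

```python
def mean_depth(num_reads, read_length, genome_length):
    """Average number of reads covering each base."""
    return num_reads * read_length / genome_length

def per_base_depth(read_starts, read_length, genome_length):
    """Count how many reads cover each position of the genome."""
    depth = [0] * genome_length
    for start in read_starts:
        for pos in range(start, min(start + read_length, genome_length)):
            depth[pos] += 1
    return depth

def breadth(depth):
    """Fraction of positions covered by at least one read."""
    return sum(1 for d in depth if d > 0) / len(depth)

# Toy example: three reads of length 4 on a 10-base genome.
depth = per_base_depth([0, 2, 4], read_length=4, genome_length=10)
```

In this toy case the last two bases receive no reads, so breadth is 0.8 even though the mean depth is 1.2.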

35
Q

What is annotation?

A

The process of identifying the regions of your sequenced DNA that contain genes and coding regions. Many programs will annotate a genome, but most are organism specific.

36
Q

What are some issues with NGS?

A

PCR has issues with high GC content, resulting in poor coverage.
De novo assembly has issues with short reads, resulting in contigs that are difficult to combine into a whole genome assembly.
Lots of man hours
Resolution of gene location

37
Q

What is third generation sequencing?

A

This is a series of sequencing techniques primarily characterized by the long reads they output. They utilize single-molecule sequencing, which uses individual pieces of DNA instead of short sequence amplification, and real-time sequencing, which runs constantly.

38
Q

What is PacBio?

A

A third generation sequencing technique that utilizes single-molecule real-time (SMRT) sequencing and uses circular fragments to allow multiple passes to be done for each segment. It has a high single-pass error rate (~13%).

39
Q

What is nanopore sequencing?

A

A third generation sequencing (TGS) technique that uses ionic solutions separated by a membrane. DNA molecules pass through a pore in the membrane one base at a time, changing the current. These shifts are recorded and used to identify the sequence. This also has a high error rate (~15%) and requires high quality DNA fragments.

40
Q

What are the objectives of the Plant Protection Act?

A

To prevent the introduction and spread within Canada of plant pests of quarantine significance
To detect and control or eradicate designated plant pests in Canada
To certify plant and plant products for domestic and export trade

41
Q

Use a coin toss to simulate maximum likelihood.

A

Let heads be coded by 1 and tails by 0. Let h be the probability of heads and 1-h the probability of tails. Let x be the result of the toss.
$p(x=1) = h$
$p(x \mid h) = h^{x}(1-h)^{1-x}$
In a set of tosses:
SEE NOTES
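
A hedged sketch of the standard result for a set of independent tosses, which may differ in notation from the notes: the likelihood is the product of the per-toss probabilities, and it is maximized when h equals the observed fraction of heads. The toss data and function names are made up for illustration.

```python
import math

def log_likelihood(h, tosses):
    """Log-likelihood of independent tosses (1 = heads, 0 = tails)."""
    heads = sum(tosses)
    tails = len(tosses) - heads
    return heads * math.log(h) + tails * math.log(1.0 - h)

def maximum_likelihood_estimate(tosses):
    """Closed-form estimate: the fraction of tosses that were heads."""
    return sum(tosses) / len(tosses)

tosses = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]  # hypothetical: 7 heads in 10 tosses
h_hat = maximum_likelihood_estimate(tosses)

# Cross-check by scanning candidate values of h on a grid.
grid = [i / 100 for i in range(1, 100)]
best_on_grid = max(grid, key=lambda h: log_likelihood(h, tosses))
```

Both the closed-form estimate and the grid search give h = 0.7 for this data.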

42
Q

What is maximum likelihood inference?

A

Choosing the parameter value that maximizes the likelihood function.

43
Q

Who was Margaret Dayhoff?

A

(1925-1983)
She pioneered bioinformatics with the creation of the Atlas of Protein Sequence and Structure in 1965. It was the beginning of collecting biological data into a single place.

44
Q

Who was Motoo Kimura?

A

(1924-1994)
He was a population geneticist who studied amino acid sequences and the underlying nucleotide sequences. He proposed the neutral theory of molecular evolution while collecting and analyzing data.

45
Q

What is the neutral theory of molecular evolution?

A

We can make certain predictions about population variation, and Kimura noticed there was a lot more variation than expected. This is explained by multiple variants having the same fitness, which allows them to coexist. Many of these variants were neutral (neither good nor bad).

46
Q

Who is Linus Torvalds?

A

1969 - present

Created Linux in 1991, which became the basis of much statistical computing

47
Q

Define the maximum likelihood principle.

A

It is a method of estimating the parameters of a statistical model so that the observed data are most probable. In protein design, it is used to infer values given the parameters.

48
Q

What are the parameters used for pairwise disabilities?

A

The same number of observed interactions between pairs as seen in a real database.

49
Q

What is Markov chain Monte Carlo?

A

Assumes likelihood is maximized by gradient descent and is numerically estimated by thermodynamic integration.

50
Q

Describe the paper discussed in lecture 7.

A

Aimed to formulate a protein design problem using model-based statistical inference. They used maximum likelihood principles to estimate the unknown parameters of a statistical potential, called inverse potential. This was then based on Markov chain Monte Carlo and applied to simple pairwise contact potential.

51
Q

What were the strengths of the paper discussed in lecture 7?

A

Guaranteed an optimal predictive power of the resulting potential and was very general, able to be applied to any form of statistical potential.