Maximum Parsimony Flashcards
According to evolutionary theory, speciation (the evolution of new organisms) is driven by
Mutation and Selection bias
The DNA sequence can be changed by single-base changes or by deletion/insertion of DNA segments
Mutation
Quantifies the factor by which a mutation with effect is more or less likely to be chosen during the population sampling after it first occurs
Selection Bias
A _____ event leads to the creation of different species
Speciation
True or false: Speciation is caused by physical separation into groups where different genetic variants become dominant
True (read it again)
Define Evolution Theory
Any two species share a (possibly distant) common ancestor
DNA and protein sequences evolve at a rate that is relatively constant over time and among different organisms
Molecular clock hypothesis
What is stated by Michael Lynch, Jeff Palmer, Matt Hann et al. (Indiana University) in Lynch: The origin of Genome Complexity
According to this model, much of the restructuring of eukaryotic genomes was initiated by nonadaptive processes, and this in turn provided novel substrates for the secondary evolution of phenotypic complexity by natural selection (hopefully you read this far)
A graph reflecting the approximate distances between a set of objects (species, genes, proteins, families) in a hierarchical fashion
Phylogenetic Tree
Current species; sequences in current species
Leaves
Hypothetical common ancestor
Internal Nodes
“Time” from one speciation to the next (branching represents speciation into new species)
Branches (edges) Length
Satisfies the molecular clock hypothesis: all leaves are at the same distance from the root
Rooted Tree
Branches are also called
edges
What do edges reflect
Evolutionary distances
Classical phylogenetic analysis
Morphological features: presence or absence of fins
Modern biological methods allow the use of molecular features
Gene sequences and Protein Sequences
A phylogenetic tree that represents the evolutionary pathways of a group of species
Species tree
A phylogenetic tree constructed from a single gene from each of the species under study
Gene tree
We can get different trees depending on the
- Input sequences
- Multiple alignment programs
- Substitution models
- Phylogenetic tree reconstruction methods
Display one sequence above another with spaces (termed gaps) inserted in both to reveal similarity of nucleotides or amino acids
Sequence alignment
Gaps represent ____
Indels
Mismatches represent
Mutations
Insertions and deletions are collectively called
Indels
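To make the three cards above concrete, here is a small sketch that walks down two already-aligned rows and counts matches, mismatches (candidate mutations), and gap columns (candidate indels). The function name `alignment_summary` and the use of `-` as the gap symbol are assumptions for illustration only.

```python
def alignment_summary(row_a, row_b, gap="-"):
    """Count matches, mismatches and gap columns in two aligned rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    counts = {"match": 0, "mismatch": 0, "gap": 0}
    for x, y in zip(row_a, row_b):
        if x == gap or y == gap:
            counts["gap"] += 1        # column reflects an indel
        elif x == y:
            counts["match"] += 1
        else:
            counts["mismatch"] += 1   # column reflects a substitution
    return counts


print(alignment_summary("GATT-ACA", "GACTTACA"))
```

Each gap column is counted once here; a real scoring scheme might treat a run of adjacent gaps as a single indel event.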
Aligns two or more sequences to highlight their similarity, inserting a small number of gaps into each sequence (usually denoted by dashes) to align identical or similar characters wherever possible
Basic Sequence Alignment Algorithm
Aligns two sequences to identify similarities/differences.
Pairwise Alignment
Its challenges include handling large datasets and optimizing alignments for highly divergent sequences.
Multiple Sequence alignment
Aligns the most similar subsequence within the sequences.
Local Alignment
Aligns the entire length of both sequences from start to end.
Global Alignment
Useful when searching for similar regions within sequences that might differ significantly overall.
Local Alignment
Best for comparing sequences of similar length to assess their overall similarity and evolutionary relationship.
Global Alignment
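The difference between the global and local cards can be sketched with the two classic dynamic-programming recurrences, Needleman-Wunsch for global and Smith-Waterman for local alignment. These algorithm names are not in the cards themselves, and the scoring values (match +1, mismatch -1, gap -1) are assumed purely for illustration. Only the best score is returned, not the alignment itself.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch: best score aligning all of a against all of b."""
    prev = [j * gap for j in range(len(b) + 1)]   # first row: leading gaps
    for i in range(1, len(a) + 1):
        curr = [i * gap]                          # first column: leading gaps
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman: best score over all pairs of subsequences."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Resetting to 0 lets the alignment start anywhere.
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
            best = max(best, curr[j])
        prev = curr
    return best


print(global_score("ACGT", "AGT"))
print(local_score("TTTGATTACATTT", "CCGATTACACC"))
```

The only structural differences are the zero floor and taking the maximum over the whole table in the local version, which is why local alignment finds a shared region even when the flanking sequence is unrelated.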
Why compare biological sequences
* To obtain functional or mechanistic insight about a sequence by inference from another potentially better characterized sequence
* To find whether two (or more) genes or proteins are evolutionarily related
* To find structurally or functionally similar regions within sequences (e.g. catalytic sites, binding sites for other molecules, etc)
Distance based tree methods
UPGMA and NJ
Character based (discrete) tree methods
Maximum Parsimony, Maximum Likelihood, Bayesian Methods
Distance methods are
Relationships based upon sequence similarity
Advantages of Distance method
- Computationally fast
- Single Best tree found
Disadvantages of Distance methods
Assumptions
* additive distances (always)
* molecular clock (sometimes)
Information loss occurs due to data transformation
Uninterpretable branch lengths
Single “best tree” found.
These methods attempt to map the history of gene sequences onto a tree and decide what the tree looks like
Character based methods
How to choose the best tree
To decide which tree is best we can use an optimality criterion.
* Parsimony is one such criterion (the other criteria: Maximum likelihood, minimum evolution, bayesian)
* It chooses the tree which requires the fewest mutations to explain the data.
* The Principle of Parsimony is the general scientific principle that accepts the simplest of two explanations as preferable.
Principle of Parsimony
Looks for a tree with the minimum total number of substitutions of symbols between species and their ancestors in the phylogenetic tree
The preferred evolutionary tree is the one that requires “the minimum net amount of evolution”
Principle of Parsimony
Maximum Parsimony: because character conflict (homoplasy) is common, we need a method to resolve this conflict
* We can brush aside the problem and use an algorithmic method, like neighbor-joining, which builds one tree from distance data [NOT recommended] - more on this later
* Or we can use an optimality criterion that allows us to rank alternate trees from best to worst
Maximum Parsimony is an
Parsimony is an optimality criterion
Maximum Parsimony prefers
the tree or trees that minimize the amount of evolutionary change required to explain the data
Maximum Parsimony is based on ______: “shave away all that is unnecessary” - plurality should not be posited without necessity; when there are multiple hypotheses that explain the data equally well, choose the simplest one
- Choice of simplest hypothesis is a good rule of thumb (but remember, the data matter far more than the method!)
Ockham's razor
Parsimony will allow one to find the tree that minimizes homoplasy, aka the
Shortest tree
Parsimony (just read this answer)
* But if you have made careless homology decisions (e.g. poorly aligned your data), even the most parsimonious tree may be horribly wrong
* Thus, some cladists emphasize that we don't use parsimony because it is the method most likely to find the true tree - we use it because it provides the “least falsified” hypothesis (truth is unknowable)
Assumption of character-based parsimony
- Each taxon is described by a set of characters
- Each character can be in one of finite number of states
- In one step certain changes are allowed in character states
- Goal: find evolutionary tree that explains the states of the taxa with minimal number of changes
In parsimony, the score is simply the minimum number of mutations that could possibly produce the data.
* Pro: ?
* Con: ?
- Pro: There are fast algorithms that guarantee that any tree can be scored correctly
- Con: There are lots of possible trees to choose between…
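One fast scoring algorithm of the kind the Pro card mentions is Fitch's algorithm, which is not named in the cards and is shown here only as a sketch. It assumes a rooted binary tree, unordered characters, and equal cost for every change. The tree encoding (a leaf name, or a pair of subtrees) and the function names are assumptions made for this example.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, minimum changes) for one site.

    tree   -- a leaf name (str) or a pair (left, right) of subtrees
    states -- dict mapping each leaf name to its character at this site
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:
        # Children can agree on a state: no extra change is needed here.
        return common, left_cost + right_cost
    # Children disagree: one change must occur on a branch below this node.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the per-site Fitch scores over all columns of the alignment."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for site in range(n_sites):
        column = {name: seq[site] for name, seq in alignment.items()}
        total += fitch_site(tree, column)[1]
    return total


alignment = {"A": "AG", "B": "AG", "C": "CG", "D": "CT"}
print(parsimony_score((("A", "B"), ("C", "D")), alignment))
print(parsimony_score((("A", "C"), ("B", "D")), alignment))
```

Scoring one tree is cheap (one pass per site). The hard part is the Con card: the number of candidate trees to score grows explosively with the number of taxa.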
Drawbacks of Maximum Parsimony
the score of a tree is completely determined by the minimum number of mutations among all of the reconstructions of ancestral sequences.
* Fails to account for the fact that the number of changes is unlikely to be equal on all branches in the tree.
o As a result, it is susceptible to “long-branch attraction”, in which two long branches that are not adjacent on the true tree are inferred to be closest relatives
* In practice this is still pretty good, but ML/Bayesian methods are better (hopefully you read this far)
any test or metric that uses random sampling with replacement and falls under the broader class of resampling methods.
Bootstrapping
uses sampling with replacement to estimate the sampling distribution for the desired estimator.
Bootstrapping
used to assess the reliability of sequence based phylogeny.
Bootstrapping
Define bootstrapping
Bootstrap values in a phylogenetic tree indicate how many times out of 100 the same branch is observed when the tree is rebuilt from a resampled set of data.
* If we get this observation 100 times out of 100, then this supports your result.
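The resampling step can be sketched as follows: alignment columns are drawn with replacement to build a pseudo-alignment of the same length, a tree is rebuilt from it, and we count how often a given group of taxa reappears. Here `build_tree` is a hypothetical callable standing in for whatever tree method is used; it is assumed to return a set of clades, each a `frozenset` of leaf names.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement (same length as input)."""
    names = list(alignment)
    n_sites = len(alignment[names[0]])
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(alignment[name][i] for i in picks) for name in names}


def bootstrap_support(alignment, build_tree, clade, replicates=100, seed=0):
    """Percentage of replicates whose tree contains the given clade.

    build_tree is a placeholder: any function mapping an alignment to a
    set of frozensets of leaf names.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(replicates):
        clades = build_tree(bootstrap_alignment(alignment, rng))
        if frozenset(clade) in clades:
            hits += 1
    return 100 * hits / replicates
```

Whole columns are resampled, so every taxon keeps the same sites in each replicate; only the weighting of sites changes.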
The result of multiple substitutions at the same site in a sequence, or identical substitution in different sequences such that the apparent sequence divergence rate is lower than the actual divergence that has occurred
Genetic Saturation
Saturation results in _____, where the most distant lineages have misleadingly short branch lengths, which also decreases the phylogenetic information contained in the sequences
Long Branch Attraction (LBA)
is a process where genetic material moves between organisms in a way other than traditional parent-to-offspring inheritance (vertical transfer).
Horizontal Gene Transfer
Sequences diverged after a speciation event
Orthologs
Sequences diverged after a duplication event
Paralogs
Sequences diverged after a horizontal transfer
Xenologs
Maximum Parsimony
Optimality criterion:
The ‘most-parsimonious’ tree is the one that requires the fewest number of evolutionary events (e.g., nucleotide substitutions, amino acid replacements) to explain the sequences.
Advantage of Maximum Parsimony
Are simple, intuitive, and logical (many possible by pencil-and-paper).
* Can be used on molecular and non-molecular (e.g., morphological) data.
* Can tease apart types of similarity (shared-derived, shared-ancestral, homoplasy)
* Can be used for character (can infer the exact substitutions) and rate analysis.
* Can be used to infer the sequences of the extinct (hypothetical) ancestors.
Disadvantages of Maximum Parsimony
Are simple, intuitive, and logical (derived from “Medieval logic”, not statistics)
* Can be fooled by high levels of homoplasy (‘same’ events).
* Can become positively misleading in the “Felsenstein Zone”
Phylogeny (phylogenetic tree) reconstruction:
overview
- Tree topology & branch lengths
- Computational challenge
- Huge number of tree topologies
3 sequences: 1 (unrooted)
4 sequences: 3
5 sequences: 15
10 sequences: 2,027,025
20 sequences: 221,643,095,476,699,771,875
n sequences (unrooted & rooted): ??
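The counts above follow a standard formula for fully bifurcating, labeled trees: (2n-5)!! unrooted topologies for n >= 3 and (2n-3)!! rooted topologies for n >= 2, where !! is the double factorial (the product of every other integer down to 1). A short sketch that reproduces the table, with function names chosen only for this example:

```python
def double_factorial(k):
    """k!! = k * (k - 2) * (k - 4) * ... down to 1 (returns 1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def unrooted_trees(n):
    """Number of unrooted bifurcating topologies for n >= 3 labeled leaves."""
    return double_factorial(2 * n - 5)


def rooted_trees(n):
    """Number of rooted bifurcating topologies for n >= 2 labeled leaves."""
    return double_factorial(2 * n - 3)


for n in (3, 4, 5, 10, 20):
    print(n, unrooted_trees(n), rooted_trees(n))
```

A rooted tree on n leaves corresponds to an unrooted tree on n + 1 leaves, which is why the rooted count is always one step ahead of the unrooted one.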
Phylogeny (phylogenetic tree) reconstruction:
most methods are
Heuristic: a mental shortcut or practical approach used to solve problems or make decisions quickly
Phylogeny (phylogenetic tree) reconstruction: Two types of methods
Distance based (input: distance matrix; UPGMA, NJ)
Character based (input: multiple alignment)
Models of evolutionary distance
Many HHAHAHAH
Model of ED: Equal probability of change to any nucleotide
Jukes-Cantor Model (Simplest case)
Different probabilities for transitions, transversions
Kimura
Different probabilities for transitions, transversions, & takes into account genomic nucleotide base frequencies
HKY
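For the two simplest models, the corrected distance has a closed form. With p the observed proportion of differing sites, Jukes-Cantor gives d = -(3/4) ln(1 - 4p/3). With P and Q the observed proportions of transitions and transversions, Kimura's two-parameter model gives d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q). The sketch below assumes ungapped DNA sequences of equal length; the helper names are illustrative, and HKY is left out because it has no equally simple closed form.

```python
import math

PURINES = {"A", "G"}


def site_differences(a, b):
    """Return (transition fraction P, transversion fraction Q)."""
    transitions = transversions = 0
    for x, y in zip(a, b):
        if x == y:
            continue
        # A<->G and C<->T stay within purines or within pyrimidines.
        if (x in PURINES) == (y in PURINES):
            transitions += 1
        else:
            transversions += 1
    return transitions / len(a), transversions / len(a)


def jukes_cantor(p):
    """Jukes-Cantor distance; only defined for p < 3/4."""
    return -0.75 * math.log(1 - 4 * p / 3)


def kimura_2p(P, Q):
    """Kimura two-parameter distance from transition/transversion fractions."""
    return -0.5 * math.log(1 - 2 * P - Q) - 0.25 * math.log(1 - 2 * Q)


P, Q = site_differences("AAGT", "GACT")
print(P, Q, jukes_cantor(P + Q), kimura_2p(P, Q))
```

Both corrections always exceed the raw proportion of differences, because multiple substitutions at the same site hide some of the change that actually occurred.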