4. Phylogenetics Flashcards
Cross-Over
- the idea of localising the disease gene is based on the event of ‘cross over’
- during the meiosis stage there is an exchange of genetic material between homologous chromosomes
- this results in exchange of genes, genetic recombination
Haplotypes Example
Outline
- two loci: A and B
- genotypes Af, Am, Bf, Bm where f indicates father and m indicates mother
- formed by haplotypes of two gametes AfBf from father and AmBm from mother
Haplotypes Example
Non-Recombinant
-after meiosis can have
AfBf and AmBm
-i.e. no recombination between the parental haplotypes
Haplotypes Example
Recombinant
- during meiosis if crossing over occurs, can get recombination between parental haplotypes:
e.g. AmBf and AfBm
Recombination Fraction
- usually denoted θ
- the probability that a gamete is recombinant with respect to the two loci
Likelihood of Crossover
- for loci in different chromosomes, independent segregation ensures that R and NR gametes are equally likely to occur, θ=1/2
- for loci in the same chromosome, separation of two paternal / maternal alleles requires the occurrence of crossover between the two loci
- the closer the two loci the less likely this is, θ<1/2
Linkage Definition
- two loci with a recombination fraction less than 1/2 are said to be in linkage
- the smaller the recombination fraction, the more tightly linked the two loci are
Morgans
-the genetic map distance between two loci is defined as the expected number of crossovers occurring between them on a single chromatid during meiosis, unit Morgans
1cM ~ 1 million bases
Map Functions
- a mathematical relationship that converts map distance (m) to recombination fraction θ is called a map function
- the function connects two key quantities; genetic distance and recombination probabilities
- the most famous map functions are Haldane and Kosambi
The Haldane Map Function
Description
- assumes that crossovers occur at random, independently of each other
- the occurrence between two loci is a Poisson process i.e. they are equally likely at any point between the loci and the number of crossovers between loci follows a Poisson distribution
The Haldane Map Function
Function
θ = [1-exp(-2m)]/2
-with inverse
m = -1/2 log(1-2θ)
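A minimal sketch of the Haldane function and its inverse as written above, with m in Morgans. The function names are illustrative and not from the notes.

```python
import math

def haldane_theta(m):
    """Recombination fraction from map distance m (Morgans): [1 - exp(-2m)] / 2."""
    return 0.5 * (1.0 - math.exp(-2.0 * m))

def haldane_distance(theta):
    """Map distance (Morgans) from recombination fraction: -1/2 log(1 - 2 theta)."""
    if not 0.0 <= theta < 0.5:
        raise ValueError("theta must lie in [0, 1/2)")
    return -0.5 * math.log(1.0 - 2.0 * theta)

# Round trip: a distance of 0.3 Morgans maps to theta and back again.
theta = haldane_theta(0.3)
m_back = haldane_distance(theta)
```

For small m the function is close to θ = m, and θ approaches 1/2 as m grows.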
The Kosambi Map Function
Description
-a generalisation of the Haldane function that allows for interference between crossovers
The Kosambi Map Function
Function
m = 1/4 log{[1+2θ]/[1-2θ]}
-with inverse
θ = 1/2 [exp(4m)-1]/[exp(4m)+1]
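A minimal sketch of the Kosambi function above. It uses the identity [exp(4m)-1]/[exp(4m)+1] = tanh(2m) to avoid overflow for large m. The function names are illustrative.

```python
import math

def kosambi_theta(m):
    """Recombination fraction from map distance m (Morgans): 1/2 tanh(2m)."""
    return 0.5 * math.tanh(2.0 * m)

def kosambi_distance(theta):
    """Map distance (Morgans): 1/4 log[(1 + 2 theta) / (1 - 2 theta)]."""
    if not 0.0 <= theta < 0.5:
        raise ValueError("theta must lie in [0, 1/2)")
    return 0.25 * math.log((1.0 + 2.0 * theta) / (1.0 - 2.0 * theta))

# Round trip at m = 0.3 Morgans.
theta = kosambi_theta(0.3)
m_back = kosambi_distance(theta)
```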
Genetic Marker
Definition
- genetic variants with known DNA sequence and known location
- for the purpose of linkage analysis these markers need to be easily and reliably detectable
Microsatellites
-repeats of simple DNA sequences, e.g.:
…CACACACACAC…
SNPs
-single nucleotide polymorphisms, e.g.
…CTGGTAGCTA…
…CTGGCAGCTA…
Linkage Analysis
Description
-based on the event of crossover during meiosis
-need a known genetic marker to estimate the recombination fraction θ
-once we have θ^, can estimate the distance between the marker and location of interest
-if R and NR gametes in a random sample can be counted:
θ^ = #R / [#(R+NR)]
-a test for linkage simplifies further to testing:
Ho : θ = 1/2
vs
H1 : θ < 1/2
Identifying R and NR Gametes
- to identify which gametes are R and NR, we need to know their phases
- phase describes which allele of the disease gene lies on the same strand as which allele of the marker
- in lab conditions (with animals) this is easily achievable
- in a human population, we need a three generation pedigree to be able to know the phase with certainty
θ^ Estimate
θ^ = R/N when R<N/2, and θ^ = 1/2 when R≥N/2
-since θ>1/2 is inadmissible on biological grounds
θ^ Estimate
R < N/2
-then:
θ^ = R/N
-with approximate standard error:
√[θ^(1-θ^)/N]
-to test Ho:θ=1/2 vs H1:θ<1/2 can use a chi-square test, likelihood ratio test and LOD score
θ^ Estimate
Chi-Square Test
-under Ho, expected numbers of R and NR are both N/2
-test statistic:
T = [R-N/2]²/[N/2] + [N-R-N/2]²/[N/2]
= [N-2R]² /N
-a one-tailed test with 1DoF
-if R > N/2, T reassigned to 0 and conclude test is not significant
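A minimal sketch of the phase-known estimate and chi-square statistic above, for R recombinants out of N gametes. The one-tailed p-value helper is an addition, not part of the notes: it halves the upper tail of a chi-square with 1 DoF, which equals 0.5·erfc(√(T/2)).

```python
import math

def estimate_theta(R, N):
    """theta-hat = R/N capped at 1/2, with its approximate standard error."""
    theta_hat = min(R / N, 0.5)
    se = math.sqrt(theta_hat * (1.0 - theta_hat) / N)
    return theta_hat, se

def linkage_chi_square(R, N):
    """T = (N - 2R)^2 / N, reassigned to 0 when R >= N/2."""
    if R >= N / 2:
        return 0.0
    return (N - 2 * R) ** 2 / N

def one_tailed_p(T):
    """Half the upper tail of chi-square(1 DoF) at T (assumed convention)."""
    return 0.5 * math.erfc(math.sqrt(T / 2.0))

# Hypothetical counts: 2 recombinants among 10 gametes.
theta_hat, se = estimate_theta(2, 10)
T = linkage_chi_square(2, 10)
p = one_tailed_p(T)
```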
θ^ Estimate
Likelihood Ratio Test
L(θ) = θ^R [1-θ]^(N-R)
-can take log for log likelihood l(θ)
-likelihood ratio is:
Λ(θ) = L(θ) / L(θ=1/2)
-test statistic:
X = 2logΛ = 2 [ l(θ) - l(θ=1/2)]
θ^ Estimate
LOD Score
Z(θ) = log10(Λ(θ))
-the conventional critical value for calling a test significant is Z≥3
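A minimal sketch of the likelihood ratio statistic and LOD score for phase-known counts, evaluated at θ^ = min(R/N, 1/2). It assumes natural logs for X and base-10 logs for Z, so X = 2·ln(10)·Z. The function names are illustrative.

```python
import math

def log_likelihood(theta, R, N):
    """l(theta) = R log(theta) + (N - R) log(1 - theta), taking 0*log(0) as 0."""
    def xlogy(x, y):
        return 0.0 if x == 0 else x * math.log(y)
    return xlogy(R, theta) + xlogy(N - R, 1.0 - theta)

def lrt_and_lod(R, N):
    """Return (theta_hat, X, Z) for the test of theta = 1/2."""
    theta_hat = min(R / N, 0.5)
    log_ratio = log_likelihood(theta_hat, R, N) - log_likelihood(0.5, R, N)
    X = 2.0 * log_ratio                 # likelihood ratio statistic
    Z = log_ratio / math.log(10.0)      # LOD score
    return theta_hat, X, Z

# Hypothetical counts: 2 recombinants among 10 gametes.
theta_hat, X, Z = lrt_and_lod(2, 10)
```

With these formulas, ten gametes with no recombinants give Z = 10·log10(2), just above the conventional threshold of 3.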
Unknown Parental Haplotypes
-two possible phases for parent:
AB / ab
Ab / aB
-a priori these are equally likely with probability 1/2
-so likelihood is:
L(θ) = 0.5θ^4[1-θ]² + 0.5θ²[1-θ]^4
-can show MLE of θ is 1/2
-once θ^ is obtained, can use the same likelihood ratio test or LOD score as phase known pedigree
-but NOT chi-square test
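A minimal sketch of the phase-unknown likelihood, generalised to a gametes of one type and b of the other (a=4, b=2 reproduces the likelihood above). The grid search over [0, 1/2] is an illustrative choice for finding the MLE, and the names a, b and steps are hypothetical.

```python
import math

def phase_unknown_likelihood(theta, a, b):
    """Average of the likelihoods under the two equally likely phases."""
    return (0.5 * theta**a * (1 - theta)**b
            + 0.5 * theta**b * (1 - theta)**a)

def grid_mle(a, b, steps=1000):
    """Maximise the likelihood over an evenly spaced grid on [0, 1/2]."""
    grid = [0.5 * k / steps for k in range(steps + 1)]
    return max(grid, key=lambda th: phase_unknown_likelihood(th, a, b))

def lod(theta, a, b):
    """LOD score Z(theta) = log10[L(theta) / L(1/2)]."""
    return math.log10(phase_unknown_likelihood(theta, a, b)
                      / phase_unknown_likelihood(0.5, a, b))

theta_hat = grid_mle(4, 2)   # the example above
```

For a=4, b=2 the maximum sits at the boundary θ = 1/2, so the LOD score there is 0.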
Model Free Linkage Analysis
Description
- does not depend on prior specification of a model of inheritance for the disease of interest
- genotype frequencies and penetrance need not be known in advance
- several methods:
- -affected sib pair test (ASP)
- -non-parametric linkage (NPL) score
- concepts of allele sharing are needed
Allele Sharing Between Individuals
- IBS and IBD are concepts of allele sharing between individuals
- allele sharing is comparing the DNA sequence or allele at the same locus between two individuals
Identical by State (IBS)
-alleles are IBS if they have the same form (i.e. having the same DNA sequence) independent of ancestral origin
Identical by Descent (IBD)
- alleles are IBD if they have the same form AND have the same ancestral origin i.e. the same chromosomal region has been inherited in both individuals from a common ancestor
- alleles that are IBD must be IBS
Kinship Coefficient
Description
- denoted, ф
- defined as the probability that a randomly drawn allele at any locus of an individual is IBD with a randomly drawn allele at the same locus from another individual
Kinship Coefficient and IBD Sharing
- there is a simple linear relationship between kinship coefficient and the pattern of IBD sharing
1) two alleles IBD: given any allele picked at the locus from one individual, we have a probability of 1/2 of sampling the IBD allele from the other individual
2) one-allele IBD the same probability is 1/4
3) zero-allele IBD, the same probability is 0
Kinship Coefficient
Definition
ф = 1/2P{IBD=2} + 1/4P{IBD=1} + 0P{IBD=0}
= 1/2 E[IBD]
-where E[IBD] is the expected proportion of alleles shared IBD at the locus for the two individuals concerned
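A minimal sketch of the kinship formula above, taking the three IBD-sharing probabilities as inputs. The function name and the example relationships are illustrative.

```python
def kinship_coefficient(p0, p1, p2):
    """phi = 1/2 P{IBD=2} + 1/4 P{IBD=1} + 0 P{IBD=0}."""
    if abs(p0 + p1 + p2 - 1.0) > 1e-9:
        raise ValueError("IBD probabilities must sum to 1")
    return 0.5 * p2 + 0.25 * p1

# Full sibs share 0, 1, 2 alleles IBD with probabilities 1/4, 1/2, 1/4.
phi_sibs = kinship_coefficient(0.25, 0.5, 0.25)
# A parent and child always share exactly one allele IBD.
phi_parent_child = kinship_coefficient(0.0, 1.0, 0.0)
```

Both cases give ф = 1/4, so the expected proportion of alleles shared IBD is 1/2 in each.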
Coefficient of Relationship
E[IBD] = π
- where π is the coefficient of relationship
- and π is twice the kinship coefficient in the case of no inbreeding
Affected Sib Pair (ASP) Test
Description
-compares the observed number of independent affected sibling pairs sharing zero (no), one (n1) or two (n2) alleles at a given marker locus to the expected under no linkage
-with hypotheses
Ho : (po,p1,p2) = (1/4,1/2,1/4)
H1 : (po,p1,p2) ≠ (1/4,1/2,1/4)
-ASP tests can be broadly classified into score tests (chi-square, proportion, mean) and the likelihood ratio test
Affected Sib Pair (ASP) Test
Chi-Square Test
Ho : (po,p1,p2) = (1/4,1/2,1/4)
H1 : (po,p1,p2) ≠ (1/4,1/2,1/4)
-test statistic:
T = Σ [ni-ei]²/ei
-where the sum is from i=0 to i=2
-under Ho T follows a chi-square distribution with 2DoF
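A minimal sketch of the ASP chi-square test, with expected counts (n/4, n/2, n/4) under Ho. The p-value uses the fact that the upper tail of a chi-square with 2 DoF is exp(-T/2). The counts in the example are made up for illustration.

```python
import math

def asp_chi_square(n0, n1, n2):
    """Return (T, p_value) for counts of sib pairs sharing 0, 1, 2 alleles IBD."""
    n = n0 + n1 + n2
    observed = (n0, n1, n2)
    expected = (n / 4, n / 2, n / 4)
    T = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.exp(-T / 2.0)   # upper tail of chi-square with 2 DoF
    return T, p_value

# Hypothetical counts for 80 affected sib pairs.
T, p_value = asp_chi_square(10, 40, 30)
```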
Affected Sib Pair (ASP) Test
Proportion Test
-testing whether the proportion of ASPs sharing two alleles IBD (p2) is 1/4 under Ho
Ho : p2 = 1/4
H1 : p2 ≠ 1/4
-test statistic
Tprop = [n2-n/4]² / [3n/16]
-where 3n/16 = n(1/4)(3/4) is the variance of n2 under Ho
-under Ho T follows a chi-square distribution with 1DoF
Affected Sib Pair (ASP) Test
Mean Test
-tests whether the mean proportion of alleles shared IBD by ASPs, counting pairs sharing one allele with weight 1/2 and pairs sharing two alleles with weight 1, is equal to 1/2
Ho : (p1/2 + p2) = 1/2
H1 : (p1/2 + p2) ≠ 1/2
-test statistic
z = [(n1/2 + n2)-n/2] / √[n/8]
-under Ho z follows a standard normal distribution
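A minimal sketch of the mean test statistic above. The two-sided p-value from the standard normal, erfc(|z|/√2), is included as an illustration. The counts are made up.

```python
import math

def asp_mean_test(n0, n1, n2):
    """Return (z, two-sided p) for z = [(n1/2 + n2) - n/2] / sqrt(n/8)."""
    n = n0 + n1 + n2
    z = ((n1 / 2 + n2) - n / 2) / math.sqrt(n / 8)
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_two_sided

# Hypothetical counts for 80 affected sib pairs.
z, p = asp_mean_test(10, 40, 30)
```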
Affected Sib Pair (ASP) Test
Likelihood Ratio Test
-NOT A SCORE TEST
-more accurate
-the number of sib pairs who share zero, one and two alleles IBD follow a multinomial distribution with parameters po, p1, p2 respectively
-test statistic
X = 2 Σ ni log(ni/ei)
-where the sum is from i=0 to i=2
-and X follows a chi-square distribution with 2DoF
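A minimal sketch of the likelihood ratio statistic X = 2 Σ ni log(ni/ei), with ei taken from the no-linkage proportions (1/4, 1/2, 1/4). Terms with ni = 0 are taken as 0. The counts are made up.

```python
import math

def asp_likelihood_ratio(n0, n1, n2):
    """Return (X, p_value), with p from chi-square with 2 DoF: exp(-X/2)."""
    n = n0 + n1 + n2
    observed = (n0, n1, n2)
    expected = (n / 4, n / 2, n / 4)
    X = 2.0 * sum(o * math.log(o / e)
                  for o, e in zip(observed, expected) if o > 0)
    return X, math.exp(-X / 2.0)

# Hypothetical counts for 80 affected sib pairs.
X, p_value = asp_likelihood_ratio(10, 40, 30)
```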
Nonparametric Linkage (NPL) Score Description
- analysis of allele-sharing may be extended to other types of relative pairs and to larger sets of relatives by counting all possible inheritance patterns
- Spairs counts the number of alleles IBD shared for each affected relative pair (ARP)
- this is summed over all pairs of affected relatives
Nonparametric Linkage (NPL) Score Test
-let xi denote the number of alleles shared IBD by the ith sib-pair
-want to create test with standardised xi:
zi = [xi - E(xi)] / √[Var(xi)]
= √2 (xi-1)
-given a collection of pedigrees, the total NPL score for n affected sib pairs is
Z = 1/√n Σzi
-where the sum is from i=1 to i=n
-under Ho, Z approximately follows a standard normal distribution
Nonparametric Linkage (NPL) Score Expectation and Variance
-denote P(xi=0)=po, P(xi=1)=p1, P(xi=2)=p2
-the expected number of alleles shared IBD:
E[xi] = p1 + 2p2
E[xi²] = p1 + 4p2
-variance
Var(xi) = E(xi²) - E(xi)²
-under Ho: p1=1/2, p2=1/4
=>
E(xi) = 1
Var(xi) = 1/2
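A minimal sketch that checks the moments above and computes the total NPL score for sib pairs, using zi = √2 (xi - 1). The IBD counts in the example are made up.

```python
import math

def ibd_moments(p0, p1, p2):
    """Return (E[x], Var(x)) for the number of alleles shared IBD."""
    mean = p1 + 2 * p2
    second_moment = p1 + 4 * p2
    return mean, second_moment - mean ** 2

def npl_score(ibd_counts):
    """Z = (1/sqrt(n)) * sum of z_i, with z_i = sqrt(2) * (x_i - 1)."""
    n = len(ibd_counts)
    z = [math.sqrt(2.0) * (x - 1) for x in ibd_counts]
    return sum(z) / math.sqrt(n)

mean, var = ibd_moments(0.25, 0.5, 0.25)      # under Ho: 1 and 1/2
Z = npl_score([2, 2, 1, 1, 0, 2, 1, 2])       # hypothetical IBD counts
```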
Introduction to Phylogenetics
- different organisms often contain similar DNA sequences
- in the theory of evolution this may be because a common ancestor experienced evolutionary mutational processes of substitution, insertion or deletion
Phylogeny
- any set of species is related and this relationship is called phylogeny
- this is usually described in a phylogenetic tree
What are the two types of tree?
- rooted trees
- unrooted trees
Notes on Phylogenetic Trees
- all trees are assumed to be binary
- a node is an endpoint of an edge
- the ‘root’ is the ultimate ancestor
- a labelled branching pattern is referred to as a topology
- the length of the ith edge is denoted ti
How many nodes and edges are there in a rooted tree of n leaves?
- as we move up the tree, the edges coalesce and the number of edges is reduced to one
- this gives a total of 2n-1 nodes, n terminal nodes and n-1 internal nodes
- and therefore 2n-2 edges (discounting the edge above the root node)
Pairwise Distance
Introduction
- a phylogenetic tree is constructed from a multiple alignment of DNA sequences
- a non-parametric construction of phylogenetic tree depends on pairwise distance between species
Process of Constructing a Phylogenetic Tree
1) select species (DNA sequences)
2) multiple alignment of DNA sequences, assuming fixed length and no gaps - compute pairwise distances
3) infer phylogenetic tree
Tree Construction Methods
- parametric and non-parametric
- the non-parametric methods we will be focusing on are distance matrix methods
- in particular the neighbour-joining method and clustering method
Pairwise Distance
Definition
-the pairwise distance between sequences x^i and x^j, denoted dij, is defined as the number of DNA bases that differ between the two sequences, i.e. the Hamming distance
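A minimal sketch of the pairwise distance for aligned, equal-length sequences with no gaps. The first two sequences are the SNP example from earlier in these notes. The third is made up.

```python
def hamming(x, y):
    """Number of positions at which two aligned sequences differ."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(x, y))

def distance_matrix(seqs):
    """Symmetric matrix of pairwise Hamming distances d_ij."""
    return [[hamming(s, t) for t in seqs] for s in seqs]

seqs = ["CTGGTAGCTA", "CTGGCAGCTA", "CTGACAGCTT"]
d = distance_matrix(seqs)
```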
Distance Methods
-distance methods reconstruct trees (rooted or unrooted) from a set of pairwise distances between the sequences in alignment (assumed given)
Distance Function
Definition
- let M be a set and let d: MxM -> ℝ be a function
- we say that d is a distance function on M if:
1) d(u,v)>0 for all u, v ∈ M with u≠v
2) d(u,u)=0 for all u ∈ M
3) d(u,v)=d(v,u) for all u, v ∈ M
4) the triangle inequality holds: d(u,v) ≤ d(u,w) + d(w,v) for all u,v,w ∈ M
Tree Generated Distance Function
Definition
-if we fix an unrooted tree T relating to the sequences (OTUs) we obtain a tree generated distance function d^T on M by declaring:
d^T(x^i,x^j) = dij^T
-to be the length of the shortest path from x^i to x^j in T
Does there exist a tree T that generates d, which means that d^T = d (dij^T=dij) ?
N=2
-the answer is obviously yes for N=2, since there is only one possible path between each node anyway
Does there exist a tree T that generates d, which means that d^T = d (dij^T=dij) ?
N=3
-looking for positive numbers x, y, z such that:
x + y = d12
x + z = d13
y + z = d23
-there is a unique tree that generates a given distance function
-this uniqueness is a general fact for additive distance functions
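A minimal sketch solving the three equations above for the edge lengths. Adding two equations and subtracting the third gives x = (d12 + d13 - d23)/2, and similarly for y and z. The function name is illustrative.

```python
def three_leaf_edges(d12, d13, d23):
    """Edge lengths (x, y, z) from the internal node to leaves 1, 2, 3."""
    x = (d12 + d13 - d23) / 2
    y = (d12 + d23 - d13) / 2
    z = (d13 + d23 - d12) / 2
    return x, y, z

x, y, z = three_leaf_edges(3, 4, 5)
```

The lengths are all positive exactly when each triangle inequality is strict.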
Does there exist a tree T that generates d, which means that d^T = d (dij^T=dij) ?
N≥4
-not every distance function on M is additive, it can be characterised in the following way, theorem:
‘let d be a distance function on M and N≥4 then d is additive if and only if the following condition holds:
for every set of four distinct numbers 1≤i,j,k,l≤N, two of the sums dij+dkl, dik+djl, dil+djk coincide and are greater than or equal to the third one’
-this condition is called the four point condition
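A minimal sketch of a four point condition check on a distance matrix (a list of lists). The tolerance parameter tol is an illustrative addition for floating-point input. The example matrix comes from a made-up tree with two cherries joined by an edge of length 5.

```python
from itertools import combinations

def satisfies_four_point(d, tol=1e-9):
    """True if, for every quartet, the two largest of the three sums coincide."""
    n = len(d)
    for i, j, k, l in combinations(range(n), 4):
        sums = sorted([d[i][j] + d[k][l],
                       d[i][k] + d[j][l],
                       d[i][l] + d[j][k]])
        # After sorting, sums[1] and sums[2] are the two largest.
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True

# Leaves 0,1 hang off one internal node (edges 1, 2); leaves 2,3 off another (edges 3, 4).
additive = [[0, 3, 9, 10],
            [3, 0, 10, 11],
            [9, 10, 0, 7],
            [10, 11, 7, 0]]
ok = satisfies_four_point(additive)
```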
Neighbour Joining Algorithm
Description
- an iterative algorithm that on every step replaces a pair of OTUs with a single OTU and iterates until there are only three OTUs left
- it can stop there because for N=3 there is just one unrooted tree topology
Neighbour Joining Algorithm
ri
-for every i=1,…,N define:
ri = 1/[N-2] Σdik
-where the sum is from k=1 to N
Neighbour Joining Algorithm
Dij
-for all i,j=1,…,N and i<j define:
Dij = dij - (ri + rj)
Neighbour Joining Algorithm
Steps
-calculate the matrix D=(Dij)
-pick a pair with 1≤i, j≤N and i≠j for which Dij is minimal, such a pair may not be unique
-group x^i and x^j and replace them with x^(N+1) which represents an internal node of the future tree connected to x^i and x^j and is placed at:
d(N+1)i = 1/2 (dij + ri - rj)
d(N+1)j = 1/2 (dij + rj - ri)
-we define the distances between x^(N+1) and any x^m with m≠i,j as:
d(N+1)m = 1/2 (dim + djm - dij)
-we now have a collection of N-1 OTUs:
M’ = {x^m, x^(N+1), m≠i,j}
-repeat the above procedure again until only three OTUs are left in which case there is just one unrooted tree topology
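A minimal sketch of the steps above, assuming the usual criterion Dij = dij - (ri + rj). Distances are held in a dict keyed by frozenset pairs. The internal node labels ("internal0", ...) and the function name are hypothetical. Ties in the minimum are broken by taking the first pair found.

```python
from itertools import combinations

def neighbour_joining(names, d):
    """names: OTU labels; d: {frozenset({a, b}): distance}.
    Returns a list of (node, node, edge_length) for an unrooted tree."""
    d = dict(d)                                   # work on a copy
    dist = lambda a, b: d[frozenset((a, b))]
    nodes, edges, next_id = list(names), [], 0
    while len(nodes) > 3:
        n = len(nodes)
        r = {a: sum(dist(a, b) for b in nodes if b != a) / (n - 2)
             for a in nodes}
        # Pair minimising D_ij = d_ij - (r_i + r_j).
        i, j = min(combinations(nodes, 2),
                   key=lambda p: dist(*p) - r[p[0]] - r[p[1]])
        new = f"internal{next_id}"
        next_id += 1
        edges.append((new, i, 0.5 * (dist(i, j) + r[i] - r[j])))
        edges.append((new, j, 0.5 * (dist(i, j) + r[j] - r[i])))
        for m in nodes:
            if m not in (i, j):
                d[frozenset((new, m))] = 0.5 * (dist(i, m) + dist(j, m) - dist(i, j))
        nodes = [m for m in nodes if m not in (i, j)] + [new]
    # Three OTUs left: join them to one central node.
    a, b, c = nodes
    centre = f"internal{next_id}"
    edges.append((centre, a, 0.5 * (dist(a, b) + dist(a, c) - dist(b, c))))
    edges.append((centre, b, 0.5 * (dist(a, b) + dist(b, c) - dist(a, c))))
    edges.append((centre, c, 0.5 * (dist(a, c) + dist(b, c) - dist(a, b))))
    return edges

# Made-up additive distances: cherries (a, b) and (c, d) joined by an edge of length 5.
raw = {("a", "b"): 3, ("a", "c"): 9, ("a", "d"): 10,
       ("b", "c"): 10, ("b", "d"): 11, ("c", "d"): 7}
edges = neighbour_joining(["a", "b", "c", "d"],
                          {frozenset(k): v for k, v in raw.items()})
```

On this additive input the sketch recovers the generating edge lengths 1, 2, 3, 4 and 5.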
Clustering Method
Steps
1) assign each (initial) node x^i to C^i, i.e. each node is assumed to be a cluster on its own
2) choose two clusters C^i and C^j for which d(C^i,C^j) is minimal (excluding i=j)
3) define a new cluster C^(N+1)=C^i ∪ C^j and set the distance to the remaining clusters with the distance between clusters equation
4) introduce a new internal node x^(N+1) (associated with cluster C^(N+1)) and place it at the total height d(C^i,C^j)/2 and redefine the new distance matrix
5) repeat the process until we have only one cluster and the node represents the root
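A minimal sketch of the clustering steps above. The notes do not state the distance between clusters equation, so this assumes the average of all leaf-to-leaf distances between the two clusters (the UPGMA choice). The function name and example distances are made up.

```python
from itertools import combinations

def cluster_tree(names, d):
    """names: leaf labels; d: {frozenset({a, b}): distance}.
    Returns a list of merges (cluster_i, cluster_j, height of new node)."""
    clusters = [frozenset([x]) for x in names]

    def avg(ci, cj):
        # Assumed between-cluster distance: mean over all leaf pairs.
        total = sum(d[frozenset((p, q))] for p in ci for q in cj)
        return total / (len(ci) * len(cj))

    merges = []
    while len(clusters) > 1:
        ci, cj = min(combinations(clusters, 2), key=lambda p: avg(*p))
        height = avg(ci, cj) / 2          # new internal node placed at d/2
        merges.append((set(ci), set(cj), height))
        clusters = [c for c in clusters if c not in (ci, cj)] + [ci | cj]
    return merges

d = {frozenset(("a", "b")): 2,
     frozenset(("a", "c")): 6,
     frozenset(("b", "c")): 6}
merges = cluster_tree(["a", "b", "c"], d)
```

The last merge is the root, here at height 3.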
Matrix of Transition Probabilities
4x4 matrix with entries pij=pij(t) with i,j∈{A,C,G,T}
-assume a Markov model where, if at time t0 the site was in state i∈{A,C,G,T}, then the probability that at time t0+t the site will be in state j∈{A,C,G,T} depends only on i, j and t
Rate Matrix
P(t) = exp(tQ)
-where Q=P’(0) is the ‘rate matrix’ or matrix of instantaneous change
Jukes-Cantor Model
Matrices
-sets entries in Q to -3α/4 on diagonal and α/4 elsewhere for some positive constant α
-then P(t) has elements rt on the diagonal and st elsewhere, where:
rt = pii(t) = 1/4 + 3/4 exp(-αt), for all i
st = pij(t) = 1/4 - 1/4 exp(-αt), for i≠j
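A minimal sketch that builds P(t) from the closed forms for rt and st above and checks two properties any transition matrix of this kind should have: rows sum to 1, and P(s)P(t) = P(s+t). The function names and parameter values are illustrative.

```python
import math

def jukes_cantor_matrix(alpha, t):
    """4x4 matrix with r_t on the diagonal and s_t elsewhere (order A, C, G, T)."""
    decay = math.exp(-alpha * t)
    r = 0.25 + 0.75 * decay
    s = 0.25 - 0.25 * decay
    return [[r if i == j else s for j in range(4)] for i in range(4)]

def matmul(A, B):
    """Plain 4x4 matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

alpha = 1.0                                   # hypothetical rate constant
P_sum = jukes_cantor_matrix(alpha, 0.8)
P_prod = matmul(jukes_cantor_matrix(alpha, 0.3), jukes_cantor_matrix(alpha, 0.5))
```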
Jukes-Cantor Model
Nucleotide Equilibrium Frequencies
-when t->∞, rt and st both tend to 1/4, which means that the nucleotide equilibrium frequencies in this model are:
qA = qC = qG = qT = 1/4
Jukes-Cantor Model
Probability
P{x1u, x2u | T, t1, t2} = Σ qa P{x1u | a,t1} P{x2u | a,t2}
-where the sum is over a∈{A,C,G,T}
Jukes-Cantor Model
Likelihood
-if there are N positions (if length of sequence is N):
L(t1, t2 | T, x1, x2) = P{x1, x2 | T, t1, t2}
= ∏ P{x1u, x2u | T, t1, t2}
-where the multiplication is over u=1 to u=N
= 1/[16^(n1+n2)] {1+3exp[-α(t1+t2)]}^n1 {1-exp[-α(t1+t2)]}^n2
-where n1 is the number of positions where the nucleotides in the two sequences are identical and n2 is the number of locations where a substitution occurs
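A minimal sketch of the log of the likelihood above, which depends on t1 and t2 only through t1+t2. The closed-form maximiser is not in the notes. It comes from setting the derivative with respect to exp[-α(t1+t2)] to zero, which gives exp[-α(t1+t2)] = 1 - 4n2/[3(n1+n2)]. The counts and α=1 are made up.

```python
import math

def jc_log_likelihood(total_time, n1, n2, alpha=1.0):
    """log L as a function of total_time = t1 + t2 (requires total_time > 0)."""
    decay = math.exp(-alpha * total_time)
    return (n1 * math.log(1.0 + 3.0 * decay)
            + n2 * math.log(1.0 - decay)
            - (n1 + n2) * math.log(16.0))

def jc_mle_total_time(n1, n2, alpha=1.0):
    """Maximiser of the likelihood in t1 + t2 (derived here, see lead-in)."""
    fraction_different = n2 / (n1 + n2)
    if fraction_different >= 0.75:
        raise ValueError("no finite maximum when n2 >= 3 * n1")
    return -math.log(1.0 - 4.0 * fraction_different / 3.0) / alpha

# Hypothetical alignment: 90 identical sites and 10 differing sites.
t_hat = jc_mle_total_time(90, 10)
```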