4. Phylogenetics Flashcards

1
Q

Cross-Over

A
  • the idea of localising the disease gene is based on the event of ‘cross over’
  • during the meiosis stage there is an exchange of genetic material between homologous chromosomes
  • this results in exchange of genes, genetic recombination
2
Q

Haplotypes Example

Outline

A
  • two loci: A and B
  • genotypes Af, Am, Bf, Bm where f indicates father and m indicates mother
  • formed by haplotypes of two gametes AfBf from father and AmBm from mother
3
Q

Haplotypes Example

Non-Recombinant

A

-after meiosis can have
AfBf and AmBm
-i.e. no recombination between the parental haplotypes

4
Q

Haplotypes Example

Recombinant

A
  • during meiosis if crossing over occurs, can get recombination between parental haplotypes:
    e. g. AmBf and AfBm
5
Q

Recombination Fraction

A
  • usually denoted θ

- the probability that a gamete is recombinant with respect to the two loci

6
Q

Likelihood of Crossover

A
  • for loci on different chromosomes, independent segregation ensures that R and NR gametes are equally likely to occur, θ=1/2
  • for loci in the same chromosome, separation of two paternal / maternal alleles requires the occurrence of crossover between the two loci
  • the closer the two loci the less likely this is, θ<1/2
7
Q

Linkage Definition

A
  • two loci with a recombination fraction less than 1/2 are said to be in linkage
  • the smaller the recombination fraction, the more tightly linked the two loci are
8
Q

Morgans

A

-the genetic map distance between two loci is defined as the expected number of crossovers occurring between them on a single chromatid during meiosis
-the unit is the Morgan (M), with 1 M = 100 cM
-as a rough rule of thumb, 1cM ~ 1 million bases

9
Q

Map Functions

A
  • a mathematical relationship that converts map distance (m) to recombination fraction θ is called a map function
  • the function connects two key quantities; genetic distance and recombination probabilities
  • the most famous map functions are Haldane and Kosambi
10
Q

The Haldane Map Function

Description

A
  • assumes that crossovers occur at random, independently of each other
  • the occurrence between two loci is a Poisson process i.e. they are equally likely at any point between the loci and the number of crossovers between loci follows a Poisson distribution
11
Q

The Haldane Map Function

Function

A

θ = [1-exp(-2m)]/2
-with inverse
m = -1/2 log(1-2θ)
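
As a minimal sketch (the function names are illustrative and not part of the notes), the Haldane function and its inverse can be written directly from the two formulas above, with m measured in Morgans:

```python
import math

def haldane_theta(m):
    """Recombination fraction theta for a map distance m (in Morgans)."""
    return 0.5 * (1.0 - math.exp(-2.0 * m))

def haldane_distance(theta):
    """Map distance m (in Morgans) for a recombination fraction 0 <= theta < 1/2."""
    if not 0.0 <= theta < 0.5:
        raise ValueError("theta must lie in [0, 0.5)")
    return -0.5 * math.log(1.0 - 2.0 * theta)

# Hypothetical example: two loci 10 cM (0.10 Morgans) apart.
theta = haldane_theta(0.10)
print(theta, haldane_distance(theta))
```

For small m the two quantities are nearly equal, and θ approaches 1/2 as m grows, which matches the unlinked case.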

12
Q

The Kosambi Map Function

Description

A

-a generalisation of the Haldane function that allows for interference between crossovers

13
Q

The Kosambi Map Function

Function

A

m = 1/4 log{[1+2θ]/[1-2θ]}
-with inverse
θ = 1/2 [exp(4m)-1]/[exp(4m)+1]
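
A hedged sketch of the same pair of formulas in code (function names are illustrative). It uses the identity [exp(4m)-1]/[exp(4m)+1] = tanh(2m), which gives the same value but avoids overflow for large m:

```python
import math

def kosambi_distance(theta):
    """Map distance m (in Morgans) for a recombination fraction 0 <= theta < 1/2."""
    if not 0.0 <= theta < 0.5:
        raise ValueError("theta must lie in [0, 0.5)")
    return 0.25 * math.log((1.0 + 2.0 * theta) / (1.0 - 2.0 * theta))

def kosambi_theta(m):
    """Recombination fraction for map distance m; equals 1/2 [exp(4m)-1]/[exp(4m)+1]."""
    return 0.5 * math.tanh(2.0 * m)

# Hypothetical example: round trip for a distance of 0.25 Morgans.
print(kosambi_distance(kosambi_theta(0.25)))
```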

14
Q

Genetic Marker

Definition

A
  • genetic variants with known DNA sequence and known location
  • for the purpose of linkage analysis these markers need to be easily and reliably detectable
15
Q

Microsatellites

A

-repeats of simple DNA sequences, e.g.:

…CACACACACAC…

16
Q

SNPs

A

-single nucleotide polymorphisms, e.g.
…CTGGTAGCTA…
…CTGGCAGCTA…

17
Q

Linkage Analysis

Description

A

-based on the event of crossover during meiosis
-need a known genetic marker to estimate the recombination fraction θ
-once we have θ^, can estimate the distance between the marker and location of interest
-if R and NR gametes in a random sample can be counted:
θ^ = #R / [#(R+NR)]
-a test for linkage simplifies further to testing:
Ho : θ = 1/2
vs
H1 : θ < 1/2

18
Q

Identifying R and NR Gametes

A
  • to identify which gametes are R and NR, we need to know their phases
  • phase is the situation where one of the alleles of the disease gene is on the same strand as one allele of the marker
  • in lab conditions (with animals) this is easily achievable
  • in a human population, we need a three generation pedigree to be able to know the phase with certainty
19
Q

θ^ Estimate

A

θ^ = R/N when R < N/2
θ^ = 1/2 when R ≥ N/2

-since θ>1/2 is inadmissible on biological grounds

20
Q

θ^ Estimate

R < N/2

A
-then:
θ^ = R/N
-with approximate standard error:
√[θ^(1-θ^)/N]
-to test Ho:θ=1/2 vs H1:θ<1/2 can use a chi-square test, likelihood ratio test or LOD score
21
Q

θ^ Estimate

Chi-Square Test

A

-under Ho, expected numbers of R and NR are both N/2
-test statistic:
T = [R-N/2]²/[N/2] + [N-R-N/2]²/[N/2]
= [N-2R]² /N
-a one-tailed test with 1DoF
-if R > N/2, T reassigned to 0 and conclude test is not significant
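
The estimate, its approximate standard error and the chi-square statistic can be sketched as below. The counts are hypothetical and the function names are illustrative. The one-sided p-value uses the fact that a chi-square variable with 1 DoF is the square of a standard normal, so the one-tailed probability is half the usual upper tail; treat that helper as an assumption about how the one-tailed test is carried out.

```python
import math

def estimate_theta(r, n):
    """Return (theta_hat, approximate standard error) from r recombinants out of n gametes."""
    theta_hat = min(r / n, 0.5)          # values above 1/2 are inadmissible
    se = math.sqrt(theta_hat * (1.0 - theta_hat) / n)
    return theta_hat, se

def linkage_chi_square(r, n):
    """T = (N - 2R)^2 / N, reassigned to 0 when R > N/2."""
    if r > n / 2:
        return 0.0
    return (n - 2 * r) ** 2 / n

def one_sided_p_value(t):
    """Half the upper tail of a chi-square(1) distribution, intended for T > 0."""
    return 0.5 * math.erfc(math.sqrt(t / 2.0))

# Hypothetical data: 4 recombinants among 20 phase-known gametes.
r, n = 4, 20
theta_hat, se = estimate_theta(r, n)
t = linkage_chi_square(r, n)
print(theta_hat, se, t, one_sided_p_value(t))
```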

22
Q

θ^ Estimate

Likelihood Ratio Test

A
L(θ) = θ^R [1-θ]^(N-R)
-can take log for log likelihood l(θ)
-likelihood ratio is:
Λ(θ) = L(θ) / L(θ=1/2)
-test statistic:
X = 2logΛ
= 2 [ l(θ) - l(θ=1/2)]
23
Q

θ^ Estimate

LOD Score

A

Z(θ) = log10(Λ(θ))

-the conventional critical value for calling a test significant is Z≥3
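
A small sketch, assuming phase-known data with R recombinants out of N gametes, of how the LOD score and the likelihood ratio statistic relate. Function names and counts are illustrative. Because Z uses log base 10 and X uses the natural log, X = 2 ln(10) Z, so Z = 3 corresponds to X of about 13.8.

```python
import math

def log_likelihood(theta, r, n):
    """l(theta) = R log(theta) + (N - R) log(1 - theta), skipping terms with zero count."""
    ll = 0.0
    if r > 0:
        ll += r * math.log(theta)
    if n - r > 0:
        ll += (n - r) * math.log(1.0 - theta)
    return ll

def lrt_statistic(theta, r, n):
    """X = 2 [l(theta) - l(1/2)]."""
    return 2.0 * (log_likelihood(theta, r, n) - log_likelihood(0.5, r, n))

def lod_score(theta, r, n):
    """Z(theta) = log10 of the likelihood ratio against theta = 1/2."""
    return (log_likelihood(theta, r, n) - log_likelihood(0.5, r, n)) / math.log(10.0)

# Hypothetical data: 4 recombinants among 20 gametes, evaluated at theta_hat = 0.2.
print(lod_score(0.2, 4, 20), lrt_statistic(0.2, 4, 20))
```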

24
Q

Unknown Parental Haplotypes

A
-two possible phases for parent:
AB / ab
Ab / aB
-a priori these are equally likely with probability 1/2
-so likelihood is:
L(θ) = 0.5θ^4[1-θ]² + 0.5θ²[1-θ]^4
-can show MLE of θ is 1/2
-once θ^ is obtained, can use the same likelihood ratio test or LOD score as phase known pedigree
-but NOT chi-square test
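
The claim that the MLE is 1/2 for the example likelihood above can be checked numerically. This is only a sketch: it evaluates L(θ) on a grid over the admissible range [0, 1/2] rather than maximising analytically, and the grid size is an arbitrary choice.

```python
def phase_unknown_likelihood(theta):
    """L(theta) = 0.5 theta^4 (1-theta)^2 + 0.5 theta^2 (1-theta)^4, averaging over both phases."""
    return 0.5 * theta ** 4 * (1.0 - theta) ** 2 + 0.5 * theta ** 2 * (1.0 - theta) ** 4

# Grid search over the admissible range 0 <= theta <= 1/2.
grid = [i / 1000 for i in range(501)]
theta_hat = max(grid, key=phase_unknown_likelihood)
print(theta_hat, phase_unknown_likelihood(theta_hat))
```

Since θ^ = 1/2 here, the likelihood ratio against θ = 1/2 is 1 and the LOD score is 0, so this particular data set carries no evidence for linkage.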
25
Q

Model Free Linkage Analysis

Description

A
  • does not depend on prior specification of a model of inheritance for the disease of interest
  • genotype frequencies and penetrance need not be known in advance
  • several methods:
  • -affected sib pair test (ASP)
  • -non-parametric linkage (NPL) score
  • concepts of allele sharing are needed
26
Q

Allele Sharing Between Individuals

A
  • IBS and IBD are concepts of allele sharing between individuals
  • allele sharing is comparing the DNA sequence or allele at the same locus between two individuals
27
Q

Identical by State (IBS)

A

-alleles are IBS if they have the same form (i.e. having the same DNA sequence) independent of ancestral origin

28
Q

Identical by Descent (IBD)

A
  • alleles are IBD if they have the same form AND have the same ancestral origin i.e. the same chromosomal region has been inherited in both individuals from a common ancestor
  • alleles that are IBD must be IBS
29
Q

Kinship Coefficient

Description

A
  • denoted ф
  • defined as the probability that a randomly drawn allele at any locus of an individual is IBD with a randomly drawn allele at the same locus from another individual
30
Q

Kinship Coefficient and IBD Sharing

A
  • there is a simple linear relationship between kinship coefficient and the pattern of IBD sharing
    1) two alleles IBD: given any allele picked at the locus from one individual, we have a probability of 1/2 of sampling the IBD allele from the other individual
    2) one-allele IBD the same probability is 1/4
    3) zero-allele IBD, the same probability is 0
31
Q

Kinship Coefficient

Definition

A

ф = 1/2P{IBD=2} + 1/4P{IBD=1} + 0P{IBD=0}
= 1/2 E[IBD]
-where E[IBD] is the expected proportion of alleles shared IBD at the locus for the two individuals concerned

32
Q

Coefficient of Relationship

A

E[IBD] = π

  • where π is the coefficient of relationship
  • and π is twice the kinship coefficient in the case of no inbreeding
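
A short sketch of the two formulas above, assuming no inbreeding (function names are illustrative). The example uses full siblings, whose IBD sharing probabilities are (1/4, 1/2, 1/4):

```python
def kinship_coefficient(p0, p1, p2):
    """phi = 1/2 P{IBD=2} + 1/4 P{IBD=1} + 0 P{IBD=0}."""
    if abs(p0 + p1 + p2 - 1.0) > 1e-9:
        raise ValueError("IBD probabilities must sum to 1")
    return 0.5 * p2 + 0.25 * p1

def coefficient_of_relationship(p0, p1, p2):
    """pi = E[IBD] = 2 * phi, valid in the absence of inbreeding."""
    return 2.0 * kinship_coefficient(p0, p1, p2)

# Full siblings: phi = 1/4 and pi = 1/2.
print(kinship_coefficient(0.25, 0.5, 0.25), coefficient_of_relationship(0.25, 0.5, 0.25))
```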
33
Q

Affected Sib Pair (ASP) Test

Description

A

-compares the observed number of independent affected sibling pairs sharing zero (no), one (n1) or two (n2) alleles at a given marker locus to the expected under no linkage
-with hypotheses
Ho : (po,p1,p2) = (1/4,1/2,1/4)
H1 : (po,p1,p2) ≠ (1/4,1/2,1/4)
-ASP tests can be broadly classified into score tests (chi-square, proportion, mean) and the likelihood ratio test

34
Q

Affected Sib Pair (ASP) Test

Chi-Square Test

A
Ho : (po,p1,p2) = (1/4,1/2,1/4)
H1 : (po,p1,p2) ≠ (1/4,1/2,1/4)
-test statistic:
T = Σ [ni-ei]²/ei
-where the sum is from i=0 to i=2
-under Ho T follows a chi-square distribution with 2DOF
35
Q

Affected Sib Pair (ASP) Test

Proportion Test

A

-tests whether the proportion of ASPs sharing two alleles IBD (p2) is 1/4, as expected under Ho
Ho : p2 = 1/4
H1 : p2 ≠ 1/4
-test statistic
Tprop = [n2-n/4]² / [3n/16]
-where 3n/16 = n(1/4)(3/4) is the binomial variance of n2 under Ho
-under Ho Tprop follows a chi-square distribution with 1DoF

36
Q

Affected Sib Pair (ASP) Test

Mean Test

A

-tests whether the mean proportion of alleles shared IBD by ASPs, i.e. the proportion sharing one allele (weighted by 1/2) plus the proportion sharing two alleles, is equal to 1/2
Ho : (p1/2 + p2) = 1/2
H1 : (p1/2 + p2) ≠ 1/2
-test statistic
z = [(n1/2 + n2)-n/2] / √[n/8]
-under Ho z follows a standard normal distribution

37
Q

Affected Sib Pair (ASP) Test

Likelihood Ratio Test

A

-not a score test (unlike the chi-square, proportion and mean tests)
-more accurate
-the number of sib pairs who share zero, one and two alleles IBD follow a multinomial distribution with parameters po, p1, p2 respectively
-test statistic
X = 2 Σ ni log(ni/ei)
-where the sum is from i=0 to i=2
-and X follows a chi-square distribution with 2DoF
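
The chi-square, mean and likelihood ratio statistics for affected sib pairs can be sketched together. The counts below are hypothetical, the function name is illustrative, and categories with zero observed count are skipped in the likelihood ratio sum (their contribution is taken as 0).

```python
import math

NULL_PROBS = (0.25, 0.5, 0.25)  # (po, p1, p2) under no linkage

def asp_tests(n0, n1, n2):
    """Return the chi-square, mean and likelihood ratio statistics for ASP counts."""
    counts = (n0, n1, n2)
    n = sum(counts)
    expected = [n * p for p in NULL_PROBS]
    chi_square = sum((o - e) ** 2 / e for o, e in zip(counts, expected))
    mean_z = ((n1 / 2 + n2) - n / 2) / math.sqrt(n / 8)
    lrt = 2.0 * sum(o * math.log(o / e) for o, e in zip(counts, expected) if o > 0)
    return {"chi_square": chi_square, "mean_z": mean_z, "lrt": lrt}

# Hypothetical data: 100 affected sib pairs sharing 0, 1 and 2 alleles IBD.
print(asp_tests(10, 50, 40))
```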

38
Q
Nonparametric Linkage (NPL) Score
Description
A
  • analysis of allele-sharing may be extended to other types of relative pairs and to larger sets of relatives by counting all possible inheritance patterns
  • Spairs counts the number of alleles IBD shared for each affected relative pair (ARP)
  • this is summed over all pairs of affected relatives
39
Q
Nonparametric Linkage (NPL) Score
Test
A

-let xi denote the number of alleles shared IBD by the ith sib-pair
-want to create test with standardised xi:
zi = [xi - E(xi)] / √[Var(xi)]
= √2 (xi-1)
-given a collection of pedigrees, the total NPL score for n affected sib pairs is
Z = 1/√n Σzi
-where the sum is from i=1 to i=n
-Z follows a standard normal distribution

40
Q
Nonparametric Linkage (NPL) Score
Expectation and Variance
A

-denote P(xi=0)=po, P(xi=1)=p1, P(xi=2)=p2
-the expected number of alleles shared IBD:
E[xi] = p1 + 2p2
E[xi²] = p1 + 4p2
-variance
Var(xi) = E(xi²) - E(xi)²
-under Ho: p1=1/2, p2=1/4
=>
E(xi) = 1
Var(xi) = 1/2
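
A minimal sketch of the NPL score for affected sib pairs, using E(xi) = 1 and Var(xi) = 1/2 under Ho as derived above. The IBD counts are hypothetical and the function names are illustrative.

```python
import math

def null_moments(p0=0.25, p1=0.5, p2=0.25):
    """Return (E[x], Var(x)) for the number of alleles shared IBD."""
    mean = p1 + 2.0 * p2
    second_moment = p1 + 4.0 * p2
    return mean, second_moment - mean ** 2

def npl_score(ibd_counts):
    """Z = (1/sqrt(n)) * sum of z_i, with z_i = sqrt(2) (x_i - 1)."""
    n = len(ibd_counts)
    z = [math.sqrt(2.0) * (x - 1) for x in ibd_counts]
    return sum(z) / math.sqrt(n)

# Hypothetical data: alleles shared IBD by eight affected sib pairs.
print(null_moments(), npl_score([2, 1, 2, 2, 1, 0, 2, 1]))
```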

41
Q

Introduction to Phylogenetics

A
  • different organisms often contain similar DNA sequences
  • in the theory of evolution this may be because a common ancestor experienced evolutionary mutational processes of substitution, insertion or deletion
42
Q

Phylogeny

A
  • any set of species is related and this relationship is called phylogeny
  • this is usually described in a phylogenetic tree
43
Q

What are the two types of tree?

A
  • rooted trees

  • unrooted trees

44
Q

Notes on Phylogenetic Trees

A
  • all trees are assumed to be binary
  • a node is an endpoint of an edge
  • the ‘root’ is the ultimate ancestor
  • a labelled branching pattern is referred to as a topology
  • the length of the ith edge is denoted ti
45
Q

How many nodes and edges are there in a rooted tree of n leaves?

A
  • as we move up the tree, the edges coalesce and the number of edges is reduced to one
  • this gives a total of 2n-1 nodes, n terminal nodes and n-1 internal nodes
  • and therefore 2n-2 edges (discounting the edge above the root node)
46
Q

Pairwise Distance

Introduction

A
  • a phylogenetic tree is constructed from a multiple alignment of DNA sequences
  • a non-parametric construction of phylogenetic tree depends on pairwise distance between species
47
Q

Process of Constructing a Phylogenetic Tree

A

1) select species (DNA sequences)
2) multiple alignment of DNA sequences, assuming fixed length and no gaps - compute pairwise distances
3) infer phylogenetic tree

48
Q

Tree Construction Methods

A
  • parametric and non-parametric
  • the non-parametric methods we will be focusing on are distance matrix methods
  • in particular the neighbour-joining method and clustering method
49
Q

Pairwise Distance

Definition

A

-the pairwise distance between sequences x^i and x^j, denoted dij, is defined as the number of DNA bases that differ between the two sequences, i.e. the Hamming distance
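
A minimal sketch of this definition for aligned, equal-length sequences without gaps (function names and the example sequences are illustrative):

```python
def hamming_distance(a, b):
    """Number of positions at which two aligned, equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

def distance_matrix(sequences):
    """Symmetric matrix of pairwise Hamming distances d_ij."""
    return [[hamming_distance(a, b) for b in sequences] for a in sequences]

# Hypothetical aligned sequences.
seqs = ["CTGGTAGCTA", "CTGGCAGCTA", "CTGACAGCTT"]
for row in distance_matrix(seqs):
    print(row)
```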

50
Q

Distance Methods

A

-distance methods reconstruct trees (rooted or unrooted) from a set of pairwise distances between the sequences in alignment (assumed given)

51
Q

Distance Function

Definition

A
  • let M be a set and let d: MxM -> ℝ be a function
  • we say that d is a distance function on M if:
    1) d(u,v)>0 for all u, v ∈ M with u ≠ v
    2) d(u,u)=0 for all u ∈ M
    3) d(u,v)=d(v,u) for all u, v ∈ M
    4) the triangle inequality holds: d(u,v) ≤ d(u,w) + d(w,v) for all u,v,w ∈ M
52
Q

Tree Generated Distance Function

Definition

A

-if we fix an unrooted tree T relating to the sequences (OTUs) we obtain a tree generated distance function d^T on M by declaring:
d^T(x^i,x^j) = dij^T
-to be the length of the shortest path from x^i to x^j in T

53
Q

Does there exist a tree T that generates d, which means that d^T = d (dij^T=dij) ?
N=2

A

-the answer is obviously yes for N=2, since there is only one possible path between each node anyway

54
Q

Does there exist a tree T that generates d, which means that d^T = d (dij^T=dij) ?
N=3

A

-looking for positive numbers x, y, z such that:
x + y = d12
x + z = d13
y + z = d23
-there is a unique tree that generates a given distance function
-this uniqueness is a general fact for additive distance functions
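
For reference, the three equations can be solved explicitly. Here x, y and z are taken to be the lengths of the edges joining leaves 1, 2 and 3 to the single internal node, which is the reading implied by the system above:

```latex
\begin{aligned}
x &= \tfrac{1}{2}\,(d_{12} + d_{13} - d_{23}),\\
y &= \tfrac{1}{2}\,(d_{12} + d_{23} - d_{13}),\\
z &= \tfrac{1}{2}\,(d_{13} + d_{23} - d_{12}).
\end{aligned}
```

Adding any two of these recovers the corresponding distance, e.g. x + y = d12. Each value is non-negative by the triangle inequality, and strictly positive when the triangle inequalities are strict.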

55
Q

Does there exist a tree T that generates d, which means that d^T = d (dij^T=dij) ?
N≥4

A

-not every distance function on M is additive, it can be characterised in the following way, theorem:

‘let d be a distance function on M and N≥4 then d is additive if and only if the following condition holds:
for every set of four distinct numbers 1≤i,j,k,l≤N, two of the sums dij+dkl, dik+djl, dil+djk coincide and are greater than or equal to the third one’

-this condition is called the four point condition
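
A hedged sketch of a four point condition check on a distance matrix (the function name and tolerance are illustrative). It relies on the observation that "two of the sums coincide and are greater than or equal to the third" is equivalent to the two largest sums being equal.

```python
from itertools import combinations

def satisfies_four_point(d, tol=1e-9):
    """Check the four point condition for a symmetric distance matrix d (list of lists)."""
    n = len(d)
    for i, j, k, l in combinations(range(n), 4):
        sums = sorted([d[i][j] + d[k][l], d[i][k] + d[j][l], d[i][l] + d[j][k]])
        if abs(sums[2] - sums[1]) > tol:   # the two largest sums must coincide
            return False
    return True

# Hypothetical additive example: tree ((1,2),(3,4)) with all edge lengths equal to 1.
additive = [[0, 2, 3, 3],
            [2, 0, 3, 3],
            [3, 3, 0, 2],
            [3, 3, 2, 0]]
# Hypothetical non-additive example: one distance perturbed.
non_additive = [[0, 2, 3, 3],
                [2, 0, 4, 3],
                [3, 4, 0, 2],
                [3, 3, 2, 0]]
print(satisfies_four_point(additive), satisfies_four_point(non_additive))
```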

56
Q

Neighbour Joining Algorithm

Description

A
  • an iterative algorithm that on every step replaces a pair of OTUs with a single OTU and iterates until there are only three OTUs left
  • once N=3 the procedure stops, since there is just one unrooted tree topology on three OTUs
57
Q

Neighbour Joining Algorithm

ri

A

-for every i=1,…,N define:
ri = 1/[N-2] Σdik
-where the sum is from k=1 to N

58
Q

Neighbour Joining Algorithm

Dij

A

-for all i,j=1,…,N and i<j define:
Dij = dij - (ri + rj)

59
Q

Neighbour Joining Algorithm

Steps

A

-calculate the matrix D=(Dij)
-pick a pair with 1≤i<j≤N for which Dij is minimal, such a pair may not be unique
-group x^i and x^j and replace them with x^(N+1) which represents an internal node of the future tree connected to x^i and x^j and is placed at:
d(N+1)i = 1/2 (dij + ri - rj)
d(N+1)j = 1/2 (dij + rj - ri)
-we define the distances between x^(N+1) and any x^m with m≠i,j as:
d(N+1)m = 1/2 (dim + djm - dij)
-we now have a collection of N-1 OTUs:
M’ = {x^m, x^(N+1), m≠i,j}
-repeat the above procedure again until only three OTUs are left in which case there is just one unrooted tree topology
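
One iteration of the procedure can be sketched as follows. This is an illustration under a stated assumption: the criterion is taken to be Dij = dij - (ri + rj), the usual neighbour-joining form that goes with ri defined using 1/(N-2), since the card defining Dij is incomplete. Function and variable names are illustrative, and ties are broken by taking the first minimal pair.

```python
def neighbour_joining_step(d):
    """One NJ step on a symmetric distance matrix d (list of lists, N >= 3).

    Returns (i, j, d_new_i, d_new_j, d_new_m), where d_new_m maps each remaining
    index m to its distance from the new internal node.
    """
    n = len(d)
    r = [sum(d[i]) / (n - 2) for i in range(n)]
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            big_d = d[i][j] - (r[i] + r[j])      # assumed form of D_ij
            if best is None or big_d < best[0]:
                best = (big_d, i, j)
    _, i, j = best
    d_new_i = 0.5 * (d[i][j] + r[i] - r[j])
    d_new_j = 0.5 * (d[i][j] + r[j] - r[i])
    d_new_m = {m: 0.5 * (d[i][m] + d[j][m] - d[i][j])
               for m in range(n) if m not in (i, j)}
    return i, j, d_new_i, d_new_j, d_new_m

# Hypothetical additive distances from the tree ((0,1),(2,3)) with unit edge lengths.
d = [[0, 2, 3, 3],
     [2, 0, 3, 3],
     [3, 3, 0, 2],
     [3, 3, 2, 0]]
print(neighbour_joining_step(d))
```

In this example OTUs 0 and 1 are joined, each at distance 1 from the new node, and the new node is at distance 2 from OTUs 2 and 3, which agrees with the generating tree.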

60
Q

Clustering Method

Steps

A

1) assign each (initial) node x^i to C^i, i.e. each node is assumed to be a cluster on its own
2) choose two clusters C^i and C^j for which d(C^i,C^j) is minimal (excluding i=j)
3) define a new cluster C^(N+1)=C^i ∪ C^j and set the distance to the remaining clusters with the distance between clusters equation
4) introduce a new internal node x^(N+1) (associated with cluster C^(N+1)) and place it at the total height d(C^i,C^j)/2 and redefine the new distance matrix
5) repeat the process until we have only one cluster and the node represents the root
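
The steps above can be sketched in code. The notes do not reproduce the "distance between clusters equation", so this sketch assumes the size-weighted average used in UPGMA, d(Ck, Ci ∪ Cj) = (|Ci| d(Ck,Ci) + |Cj| d(Ck,Cj)) / (|Ci| + |Cj|). That choice, the function name and the example distances are all assumptions for illustration.

```python
def cluster_tree(d):
    """Average-linkage clustering on a symmetric distance matrix d (list of lists).

    Returns a list of merges (i, j, new_id, height), where new_id labels the new
    internal node and height = d(C^i, C^j) / 2.
    """
    n = len(d)
    size = {i: 1 for i in range(n)}
    dist = {frozenset((i, j)): d[i][j] for i in range(n) for j in range(i + 1, n)}
    merges = []
    next_id = n
    while len(size) > 1:
        pair = min(dist, key=dist.get)           # closest pair of clusters
        i, j = sorted(pair)
        dij = dist.pop(pair)
        new = next_id
        next_id += 1
        for k in list(size):
            if k in (i, j):
                continue
            dik = dist.pop(frozenset((i, k)))
            djk = dist.pop(frozenset((j, k)))
            # Assumed cluster distance: size-weighted average (UPGMA).
            dist[frozenset((new, k))] = (size[i] * dik + size[j] * djk) / (size[i] + size[j])
        size[new] = size.pop(i) + size.pop(j)
        merges.append((i, j, new, dij / 2))
    return merges

# Hypothetical ultrametric distances for four sequences.
d = [[0, 2, 6, 10],
     [2, 0, 6, 10],
     [6, 6, 0, 10],
     [10, 10, 10, 0]]
print(cluster_tree(d))
```

The last merge creates the root, so the final tuple gives the root height.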

61
Q

Matrix of Transition Probabilities

A

4x4 matrix with entries pij=pij(t) with i,j∈{A,C,G,T}
-assume a Markov model where, if at time t0 the site was in state i∈{A,C,G,T}, then the probability that at time t0+t the site will be in state j∈{A,C,G,T} depends only on i, j and t

62
Q

Rate Matrix

A

P(t) = exp(tQ)

-where Q=P’(0) is the ‘rate matrix’ or matrix of instantaneous change

63
Q

Jukes-Cantor Model

Matrices

A

-sets entries in Q to -3α/4 on diagonal and α/4 elsewhere for some positive constant α
-then P(t) has elements rt on the diagonal and st elsewhere, where:
rt = pii(t) = 1/4 + 3/4 exp(-αt), for all i
st = pij(t) = 1/4 - 1/4 exp(-αt), for i≠j
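
As a sketch, the transition matrix can be built directly from the closed forms for rt and st (the function name and parameter values are illustrative). Two sanity checks follow from the Markov model: each row sums to 1, and P(s)P(t) = P(s+t).

```python
import math

BASES = "ACGT"

def jc_transition_matrix(t, alpha):
    """4x4 matrix P(t) with r_t on the diagonal and s_t elsewhere."""
    e = math.exp(-alpha * t)
    r_t = 0.25 + 0.75 * e
    s_t = 0.25 - 0.25 * e
    return [[r_t if i == j else s_t for j in range(4)] for i in range(4)]

# Hypothetical parameters: alpha = 1.0, t = 0.5.
P = jc_transition_matrix(0.5, 1.0)
for base, row in zip(BASES, P):
    print(base, [round(p, 4) for p in row], "row sum:", round(sum(row), 12))
```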

64
Q

Jukes-Cantor Model

Nucleotide Equilibrium Frequencies

A

-when t->∞, rt=st=1/4 which means that the nucleotide equilibrium frequencies in this model are:
qA = qC = qG = qT = 1/4

65
Q

Jukes-Cantor Model

Probability

A

P{x1u, x2u | T, t1, t2} = Σ qa P{x1u | a,t1} P{x2u | a,t2}

-where the sum is over a∈{A,C,G,T}

66
Q

Jukes-Cantor Model

Likelihood

A

-if there are N positions (if length of sequence is N):
L(t1, t2 | T, x1, x2) = P{x1, x2 | T, t1, t2}
= ∏ P{x1u, x2u | T, t1, t2}
-where the multiplication is over u=1 to u=N
= 1/[16^(n1+n2)] {1+3exp[-α(t1+t2)]}^n1 {1-exp[-α(t1+t2)]}^n2
-where n1 is the number of positions where the nucleotides in the two sequences are identical and n2 is the number of locations where a substitution occurs
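
A hedged sketch of the log of this likelihood, written as a function of the product α(t1+t2) since that is the only combination of parameters that appears. Setting the derivative with respect to exp[-α(t1+t2)] to zero gives exp[-α(t1+t2)] = 1 - 4p/3 with p = n2/(n1+n2), which is the basis of the second function. Function names and counts are illustrative.

```python
import math

def jc_log_likelihood(n1, n2, alpha_t):
    """log L for n1 identical and n2 differing sites, with alpha_t = alpha * (t1 + t2).

    Requires alpha_t > 0 whenever n2 > 0.
    """
    e = math.exp(-alpha_t)
    ll = -(n1 + n2) * math.log(16.0)
    ll += n1 * math.log(1.0 + 3.0 * e)
    if n2 > 0:
        ll += n2 * math.log(1.0 - e)
    return ll

def jc_mle_alpha_t(n1, n2):
    """Maximiser of the likelihood in alpha * (t1 + t2); needs n2 / (n1 + n2) < 3/4."""
    p = n2 / (n1 + n2)
    if p >= 0.75:
        raise ValueError("proportion of differing sites must be below 3/4")
    return -math.log(1.0 - 4.0 * p / 3.0)

# Hypothetical data: 90 identical sites and 10 substitutions.
best = jc_mle_alpha_t(90, 10)
print(best, jc_log_likelihood(90, 10, best))
```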