Chapter 5 Flashcards
What are the 3 ways training data may arise, and how does each constrain the choice of analysis?(3)
- As a random sample from the joint distribution of Y and X. This might be the case, for example, in a medical study where we observe patients from some population (e.g. presenting with a particular complaint) and we record various clinical variables X and whether or not the patient has a particular disease Y. In this case we can learn both about the group membership probabilities πk and about the distribution of X, and so we can use either a regression- or discriminant-based approach.
- As separate random samples chosen from each group, i.e. each value of Y. This might be the case, for example, in a clinical trial where we deliberately choose a sample of healthy patients and another sample of patients with a particular disease, taking measurements on various clinical variables X in each case. Under this sampling scenario we have no way of learning about the group membership probabilities Pr(Y = k) = πk, only the conditional distribution of X given Y. It is therefore not appropriate to use regression-based approaches. Moreover, we can only use the Bayes approach to discriminant analysis if estimates for the πk can be provided by other means.
- As random samples of the group label Y for a chosen set of measurements X. This might be the case, for example, in a dose-ranging study where X represents the dose of a drug, chosen from a fixed set, and Y represents whether or not a patient suffers a particular side-effect. Under this sampling scenario we have no way of learning about the distribution of X, only the conditional distribution of Y given X. It is therefore only appropriate to use a regression-based approach.
What are mean, variance and covariance of multivariate normal?(3)
E(Xi) = µi, Var(Xi) = σii, and Cov(Xi, Xj) = σij.
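A minimal numpy sketch checking these moments by simulation (the values of µ and Σ below are made-up examples, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])          # assumed example mean vector
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])      # assumed example covariance matrix

X = rng.multivariate_normal(mu, Sigma, size=100_000)

print(X.mean(axis=0))               # ~ mu:    E(X_i) = mu_i
print(np.cov(X, rowvar=False))      # ~ Sigma: Var(X_i) = sigma_ii, Cov(X_i, X_j) = sigma_ij
```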
Suppose X ∼ Np(µ, Σ), A is a q × p matrix and b is a q-vector.
Then the linear transformation Y = AX + b is also…
Multivariate normal, with Y ∼ Nq(Aµ + b, AΣAᵀ).
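A quick simulation sketch of this result (A, b, µ and Σ are illustrative choices, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([0.0, 1.0, 2.0])          # p = 3, made-up values
Sigma = np.diag([1.0, 2.0, 3.0])
A = np.array([[1.0, 0.0, -1.0],
              [2.0, 1.0,  0.0]])        # q x p matrix, q = 2
b = np.array([5.0, -5.0])               # q-vector

X = rng.multivariate_normal(mu, Sigma, size=200_000)
Y = X @ A.T + b                         # Y = AX + b, applied row-wise

print(Y.mean(axis=0), A @ mu + b)       # empirical mean vs. A*mu + b
print(np.cov(Y, rowvar=False))          # empirical covariance ...
print(A @ Sigma @ A.T)                  # ... vs. A Sigma A^T
```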
What are allocation regions?(1)
A set of discriminant functions Q1, Q2, …, QK defines a partition of the sample space S for X into K distinct regions R1, R2, …, RK (Rk ∩ Rl = ∅ for k ≠ l, and R1 ∪ R2 ∪ … ∪ RK = S) such that, given X = x, if Qk(x) > Ql(x) for all l ≠ k, then we assign Y = k.
The subsets R1, R2,…,RK are sometimes called allocation regions.
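A small sketch of allocation by discriminant functions; the functions Qk here are hypothetical placeholders:

```python
import numpy as np

def allocate(x, Qs):
    # Qs is a list of discriminant functions Q_1, ..., Q_K (hypothetical here).
    # x falls in allocation region R_k exactly when Q_k(x) beats all the others,
    # so we assign Y = k by taking the argmax.
    scores = [Q(x) for Q in Qs]
    return int(np.argmax(scores)) + 1   # groups labelled 1..K

# Example with two made-up discriminant functions:
Qs = [lambda x: -abs(x - 1.0), lambda x: -abs(x + 1.0)]
print(allocate(0.2, Qs))                # 1: x = 0.2 is closer to 1 than to -1
```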
What is the Bayes classifier? What are the corresponding discriminant functions?(3)
Assigns an observation x to the class k for which the posterior probability pk(x) = Pr(Y = k|X = x) is largest. In this case
pk(x) > pl(x) for all l ≠ k
⇔ fk(x)πk / Σₘ fm(x)πm > fl(x)πl / Σₘ fm(x)πm for all l ≠ k (sum over m = 1, …, K)
⇔ fk(x)πk > fl(x)πl for all l ≠ k.
Corresponding discriminant functions are:
Qk(x) = fk(x)πk, k = 1, …, K.
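A sketch of the Bayes classifier assuming Gaussian class-conditional densities fk (the two groups, their parameters and the priors πk are made up for illustration):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Two made-up groups: class-conditional densities f_k and priors pi_k.
priors = [0.3, 0.7]
densities = [
    multivariate_normal(mean=[0.0, 0.0], cov=np.eye(2)),
    multivariate_normal(mean=[2.0, 2.0], cov=np.eye(2)),
]

def bayes_classify(x):
    # Discriminant functions Q_k(x) = f_k(x) * pi_k; the shared denominator
    # sum_m f_m(x) pi_m cancels, so maximising Q_k maximises the posterior p_k(x).
    Q = [f.pdf(x) * pi for f, pi in zip(densities, priors)]
    return int(np.argmax(Q)) + 1

print(bayes_classify([1.5, 1.0]))   # allocates to group 2 with these parameters
```

Note that with equal priors πk = 1/K the same argmax reduces to the maximum likelihood rule of the next card.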
What is the maximum likelihood discriminant rule?(1)
If the prior group probabilities πk are all equal to 1/K, the Bayes rule reduces to assigning each observation to the group k with maximum likelihood fk(x).
What are decision boundaries?(1)
The decision boundary between allocation regions Rk and Rl is the set of points x where Qk(x) = Ql(x), i.e. where the classification switches between groups k and l.
What is the difference between linear and quadratic discriminant analysis?(1)
Linear discriminant analysis (LDA) assumes all groups share a common covariance matrix Σ, giving decision boundaries that are linear in x. Quadratic discriminant analysis (QDA) is similar except the different groups are allowed to have different covariance matrices Σk, giving quadratic decision boundaries.
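A sketch contrasting the two using scikit-learn's LDA/QDA estimators on toy data (the group parameters are invented):

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)

rng = np.random.default_rng(2)
# Two made-up groups with deliberately different covariances.
X1 = rng.multivariate_normal([0, 0], [[1.0, 0.0], [0.0, 1.0]], size=200)
X2 = rng.multivariate_normal([2, 2], [[3.0, 1.0], [1.0, 2.0]], size=200)
X = np.vstack([X1, X2])
y = np.array([0] * 200 + [1] * 200)

lda = LinearDiscriminantAnalysis().fit(X, y)     # pools one common Sigma
qda = QuadraticDiscriminantAnalysis().fit(X, y)  # estimates one Sigma_k per group

print(lda.predict([[1.0, 1.0]]), qda.predict([[1.0, 1.0]]))
```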
Within-groups sample covariance matrix.(1)
SW = (1/(n − K)) Σₖ₌₁..K (nk − 1) Sk.
Within-group-k sample covariance matrix.(1)
Sk = (1/(nk − 1)) Xkᵀ H_nk Xk, where Xk is the nk × p data matrix for group k and H_nk = I_nk − (1/nk)11ᵀ is the centring matrix.
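A numpy sketch computing Sk via the centring matrix and then pooling into SW, covering this card and the previous one (the group data Xk are assumed to be nk × p arrays):

```python
import numpy as np

def group_cov(Xk):
    # S_k = X_k^T H_nk X_k / (n_k - 1), with H_n = I_n - (1/n) 1 1^T.
    nk = Xk.shape[0]
    H = np.eye(nk) - np.ones((nk, nk)) / nk   # centring matrix H_nk
    return Xk.T @ H @ Xk / (nk - 1)           # equals np.cov(Xk, rowvar=False)

def pooled_within_cov(groups):
    # S_W = (1/(n - K)) * sum_k (n_k - 1) S_k, pooled over the K groups.
    n = sum(Xk.shape[0] for Xk in groups)
    K = len(groups)
    return sum((Xk.shape[0] - 1) * group_cov(Xk) for Xk in groups) / (n - K)
```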
How do we measure misclassification? How do we know we have a good classification scheme based on this?(2)
Create a K × K matrix P containing pij = Pr(allocate to group j | observation from group i).
For a perfect classification scheme, P would be equal to the K × K identity matrix IK. In practice, the best we can hope for is a matrix with diagonal elements close to 1 and
off-diagonal elements close to 0.
What are two types of in-sample validation methods used? What are the differences between these?(3)
- Plug-in method: calculates the pij from their analytic expressions, replacing parameters with values estimated from the data.
- Empirical method: estimates the pij using the same data that were used to derive the discriminant rules, via the empirical proportion p̂ij = nij/ni, where nij = #(x ∈ Rj from group i) and ni = Σⱼ₌₁..K nij is the number of observations in group i (see the sketch after this list).
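A minimal sketch of the empirical method (labels are assumed coded 0, …, K−1, and y_pred would come from whatever discriminant rule is being assessed):

```python
import numpy as np

def empirical_confusion(y_true, y_pred, K):
    # p_hat[i, j] = n_ij / n_i, estimated from the training data.
    # Assumes every group appears at least once in y_true (else n_i = 0).
    n = np.zeros((K, K))
    for i, j in zip(y_true, y_pred):
        n[i, j] += 1                          # count group-i points sent to group j
    return n / n.sum(axis=1, keepdims=True)   # divide each row by n_i
```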
Drawbacks of plug-in method.(1)
Produces overestimates of the diagonal elements and underestimates of the off-diagonal elements due to over-fitting and ignoring the sampling variability in the parameter estimates.
Also, it requires an underlying probability model: classification schemes without an underpinning model lack a probabilistic foundation, so there are no analytic expressions for the pij and the plug-in method cannot be used.
Drawbacks of empirical method.(1)
Although the empirical method is simple and widely used, it also leads to optimistic estimates of the performance of the classification scheme due to over-fitting. Again, this is essentially caused by the double use of the data for constructing and testing the classifier.
How do you calculate training error rate?(1)
Training error rate = 1 − (1/n) Σᵢ₌₁..K nii.
Remember this is generally an over-optimistic view of the classifier; where possible, out-of-sample testing is preferred.
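A one-liner sketch of the training error rate, under the same label coding as the confusion-matrix sketch above:

```python
import numpy as np

def training_error_rate(y_true, y_pred):
    # 1 - (1/n) * sum_i n_ii: since (1/n) sum_i n_ii is the proportion of
    # training points classified correctly, this is the proportion misallocated.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return 1.0 - np.mean(y_true == y_pred)
```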