Classification Flashcards
Week 7
Classification probabilities
With P(class | features) we are essentially asking: what is the probability that an observation belongs to a particular class, given its features?
Approaches to Classification
Generative classifiers:
Generative classifiers try to understand how data is generated, modeling both the features and the classes together.
These obtain P(class|predictors) indirectly, by first estimating other distributions: the prior probability of each class and the distribution of the predictors within each class.
They rely on statistical theory, principally Bayes’ theorem, to combine these into posterior probabilities.
Discriminative classifiers:
Discriminative classifiers focus on predicting the class directly based on the observed features, without necessarily understanding the underlying data generation process.
Estimate P(class|predictors) directly.
Also referred to as conditional classifiers.
prior probability for a class
The “prior probability for a class” is the probability of that class occurring before any evidence or features are considered. It represents our initial belief about the likelihood of each class, before we observe any data.
In classification problems, prior probabilities are used as the starting point for making predictions.
Posterior Probability for a Class:
P(j∣x) represents the probability of class j given a feature vector x. This is the quantity we want to find: the updated probability of each class after we have observed the features.
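These quantities are linked by Bayes’ theorem. A standard statement, writing π_j for the prior of class j and f_j(x) for the class-conditional density (notation matching the LDA cards below):

```latex
% Posterior probability of class j given features x
P(j \mid x) = \frac{\pi_j \, f_j(x)}{\sum_{k=1}^{C} \pi_k \, f_k(x)}
```

This is how generative classifiers turn priors and class-conditional densities into the posterior P(class|predictors).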
misclassification rate
The performance of a classifier is usually measured by its misclassification rate. The misclassification rate is the proportion of observations assigned to the wrong class.
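In symbols, for n observations with true classes y_i and predicted classes ŷ_i:

```latex
% Misclassification rate: the fraction of wrongly classified observations
\text{misclassification rate} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{\hat{y}_i \neq y_i\}
```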
Linear Discriminant Analysis
- Linear Discriminant Analysis is often abbreviated to LDA.
- LDA is applicable when all the features are quantitative.
- We assume that f_j is a (joint) normal probability distribution.
- In addition, we assume that the covariance matrix is the same from class to class.
- The classes are differentiated by the locations of their means (see the sketch below).
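A minimal sketch of fitting LDA in R with MASS::lda(); the iris data here is just a stand-in example, not necessarily the course’s dataset:

```r
library(MASS)  # provides lda()

# Fit LDA: class-conditional normals with a common covariance matrix
fit <- lda(Species ~ ., data = iris)

# Posterior probabilities P(j | x) and the predicted classes
pred <- predict(fit, iris)
head(pred$posterior)  # P(j | x) for each class j
head(pred$class)      # class with the largest posterior probability
```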
Kernel Discriminant Analysis (KDA):
KDA extends LDA by allowing more complex, nonlinear decision boundaries between classes. Rather than assuming each class-conditional density f_j is normal with a common covariance matrix, KDA estimates f_j flexibly from the data using a kernel density estimate, so each class’s distribution, and hence the boundary between classes, can take a much more general shape.
In the context of classification trees, the Gini index (or Gini coefficient) at a node is defined by

G = \sum_{i=1}^{C} p_i (1 - p_i)

Explain why the Gini index is a measure of node impurity. As part of your answer you should define the meaning of the quantities p_i and C in the above equation.
C is the number of classes [1], and p_i is the proportion of observations at the node belonging to class i [1]. G can be thought of as the probability of incorrectly classifying an observation at the node when the observations are randomly reallocated to classes in proportion to the p_i [1]. So G = 0, its smallest value, occurs when p_i = 1 for some i (i.e. zero impurity), while G is largest when the p_i are all equal to 1/C [1].
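A quick worked example with C = 2: at a node with proportions p = (0.9, 0.1), G = 0.9 × 0.1 + 0.1 × 0.9 = 0.18 (fairly pure); at p = (0.5, 0.5), G = 0.5, which is the maximum value 1 − 1/C.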
Give a short description of how the term f_j(x) is estimated in (a) linear discriminant analysis, and (b) kernel discriminant analysis.
In LDA we assume that f_j in each class follows a multivariate normal distribution with possibly different means μ_j [1] but a common covariance matrix [1]. In KDA we estimate f_j using a kernel density estimate [2].
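For reference, one standard form of the kernel density estimate for class j (K_h is a kernel scaled by bandwidth h; the exact form used in the course notes may differ):

```latex
% Kernel density estimate of the class-conditional density f_j,
% built from the n_j training observations x_i that belong to class j
\hat{f}_j(x) = \frac{1}{n_j} \sum_{i : y_i = j} K_h(x - x_i)
```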
Describe the difference between prediction, classification and clustering problems.
Prediction and classification are problems where the data contain a target variable, either continuous (prediction) or categorical (classification), and the task is to train a model to predict that target [1]. Clustering, on the other hand, has no target variable in the data; the purpose is instead to group the data into clusters based on their similarity [1].
cost_complexity=0.005 argument.
The cost_complexity argument controls the pruning of the tree: a split is kept only if the improvement in fit is at least 0.005 [2].
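A hedged tidymodels sketch of how such a specification might look; the object names and training data (`tree_spec`, `train_data`, outcome `class`) are hypothetical placeholders:

```r
library(tidymodels)

# Tree specification: pruning strength controlled by cost_complexity
tree_spec <- decision_tree(cost_complexity = 0.005) |>
  set_engine("rpart") |>
  set_mode("classification")

# Fit on a (hypothetical) training set with categorical outcome `class`
tree_fit <- fit(tree_spec, class ~ ., data = train_data)
```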
augment() function
The augment() function generates predictions on the test set. It has been used because performance must be assessed on independent data.
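A sketch continuing from the tree fit above, where `test_data` is a hypothetical held-out set:

```r
# augment() attaches predictions (e.g. .pred_class) to the supplied data
test_aug <- augment(tree_fit, new_data = test_data)

# Misclassification rate on the independent test set
mean(test_aug$.pred_class != test_aug$class)
```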
Describe the aim of cluster analysis and give one application to demonstrate its use.
Cluster analysis is a method of data grouping, or data segmentation.
The aim of cluster analysis is to delineate ‘natural groups’ of data, with high within-class similarity and low between-class similarity.
Examples: grouping investors into classes; customer segmentation for marketing purposes.
Give three possible choices of distance measures for use in clustering, commenting on their applicability to different types of data.
Euclidean distance: ‘as the crow flies’, the straight-line distance in variable space. By Pythagoras’ theorem it is the square root of the sum of squared differences on each dimension. Most commonly used for continuous data.
Manhattan distance: ‘as the taxi drives’, simply the sum of the absolute differences in each dimension. Less affected by outliers.
Mahalanobis distance: takes the covariances between variables into account, so it is well suited to elliptical (correlated) clusters.
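A base-R sketch of computing each distance; the matrix `x` is a made-up placeholder:

```r
set.seed(1)
x <- matrix(rnorm(20), nrow = 5)  # 5 observations on 4 variables

d_euclid <- dist(x, method = "euclidean")  # straight-line distance
d_manhat <- dist(x, method = "manhattan")  # sum of absolute differences

# Squared Mahalanobis distance of each row from the sample mean,
# accounting for the covariances between the variables
d_mahal <- mahalanobis(x, center = colMeans(x), cov = cov(x))
```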
Explain the difference between supervised and unsupervised classification. Which of these is k-means cluster analysis, and why?
Supervised classification has a target variable; unsupervised classification does not. k-means is unsupervised because there is no target variable: observations are grouped purely on the similarity of their features.
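A minimal k-means sketch in base R; note that no target variable appears anywhere in the call:

```r
set.seed(1)
x <- matrix(rnorm(100), ncol = 2)  # unlabelled data: features only

km <- kmeans(x, centers = 3)  # partition into k = 3 clusters
km$cluster                    # cluster assignment for each observation
km$centers                    # the fitted cluster means
```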