Classification, Clustering, And Feature Learning Flashcards
When is classification used?
When relating categorical response variables to predictor variables.
What is a supervised process?
Processes that require a training dataset that shows how particular attributes match up to an outcome of interest.
E.g. a dataset in which we know for certain whether each individual has a disease, along with various attributes measured for each individual.
The training data are used to build a model, which is then tested on separate test data.
What is an unsupervised process?
A process that doesn’t require a training dataset or any predefined outcomes.
What are the methods of classification?
1) Logistic regression
2) Random forest classification
3) Support vector machines
What does the logistic function do?
It transforms a linear function of the predictor so that the output (y) always lies between 0 and 1.
This is useful in a binary classification problem, where the thing you want to predict can only take a value of 0 or 1 (no or yes).
It allows for a continuous predictor variable on the X axis and the probability of the outcome coded as 1 on the y axis.
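The logistic transformation can be sketched in a few lines of Python. The intercept and slope below are hypothetical values chosen only for illustration; in a real analysis they would be estimated from the training data.

```python
import math

def logistic(z):
    """Map any real number z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(x, intercept, slope):
    """Probability of the outcome coded as 1 for predictor value x."""
    # The linear function (intercept + slope * x) can take any value;
    # the logistic function squeezes it into the range 0 to 1.
    return logistic(intercept + slope * x)

# Hypothetical coefficients, not fitted to any real data.
INTERCEPT, SLOPE = -4.0, 2.0

probabilities = [predict_probability(x, INTERCEPT, SLOPE) for x in (0.0, 2.0, 4.0)]
for x, p in zip((0.0, 2.0, 4.0), probabilities):
    print(f"x = {x}: P(class 1) = {p:.3f}")
```

With a positive slope, the probability of class 1 rises as the predictor increases, and it equals 0.5 where the linear function is zero.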
What does the X axis of a logistic regression curve show?
The continuous predictor variable.
What does the y axis of a logistic regression curve show?
The probability of the outcome we have coded as “1”.
For each value of the predictor variable (X axis), the prediction (y axis) is the probability of observing that outcome; the probability of the outcome coded as “0” is one minus this value.
How do support vector machines achieve classification?
They map the training data points to points in k-dimensional space, where k is the number of attributes of the data.
Once the data are plotted in k-dimensional space, a hyperplane is chosen that separates the known categories with the maximum possible margin between them.
What is a hyperplane?
A hyperplane has one dimension fewer than the space it sits in: k − 1 dimensions in k-dimensional space (a line in 2D, a plane in 3D). This lets the hyperplane slice through k-dimensional space, separating the categories.
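As a rough illustration, a hyperplane can be written as weights · x + bias = 0, and a point is classified by which side of it the point falls on. The weights and bias below are hypothetical, hand-picked values; a real support vector machine would find them by maximising the separation between the categories in the training data.

```python
import math

def decision_value(point, weights, bias):
    """Signed score: positive on one side of the hyperplane, negative on the other."""
    return sum(w * x for w, x in zip(weights, point)) + bias

def classify(point, weights, bias):
    """Assign class +1 or -1 depending on the side of the hyperplane."""
    return 1 if decision_value(point, weights, bias) >= 0 else -1

def distance_to_hyperplane(point, weights, bias):
    """Perpendicular distance from the point to the hyperplane."""
    norm = math.sqrt(sum(w * w for w in weights))
    return abs(decision_value(point, weights, bias)) / norm

# Hypothetical hyperplane in 2-dimensional space (k = 2): the line x1 + x2 = 3.
WEIGHTS, BIAS = (1.0, 1.0), -3.0

for point in [(0.0, 0.0), (3.0, 3.0)]:
    label = classify(point, WEIGHTS, BIAS)
    dist = distance_to_hyperplane(point, WEIGHTS, BIAS)
    print(point, "-> class", label, "at distance", round(dist, 3))
```

Here k = 2, so the hyperplane is a one-dimensional line; the same functions work unchanged for any number of attributes.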
What does random forest classification do?
The algorithm takes the training dataset and uses it to build a set of decision trees, each based on a subset of the training data.
A new data point is classified by every decision tree in the forest.
Each tree then votes for how the data point should be classified, and the share of votes for each class gives a measure of the probability that the data point belongs in that class.
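The voting step can be sketched as follows. The three "trees" here are hand-written rules standing in for learned decision trees, and the attribute names (age, bmi, smoker) are made up for the example.

```python
from collections import Counter

# Hypothetical stand-ins for trained decision trees: each takes a data point
# and returns a class label. Real trees would be learned from training data.
def tree_a(point):
    return "disease" if point["age"] > 50 else "healthy"

def tree_b(point):
    return "disease" if point["bmi"] > 30 else "healthy"

def tree_c(point):
    return "disease" if point["smoker"] else "healthy"

def forest_vote(trees, point):
    """Return the share of trees voting for each class."""
    votes = Counter(tree(point) for tree in trees)
    total = sum(votes.values())
    return {label: count / total for label, count in votes.items()}

new_point = {"age": 60, "bmi": 22, "smoker": True}
shares = forest_vote([tree_a, tree_b, tree_c], new_point)
predicted_class = max(shares, key=shares.get)
print(shares, "->", predicted_class)
```

Two of the three trees vote "disease", so the vote share for that class is 2/3 and it becomes the predicted class.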
What is an advantage of random forest classification?
Random forest algorithms offer embedded feature selection: the model provides measures of how important each variable is to the classification, which we can inspect.
What is mean decrease accuracy?
R permutes each variable of interest in turn and measures how the accuracy of classification is affected.
The greater the decrease in accuracy when a variable is randomly permuted, the more important that variable is.
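The idea of permuting a variable and measuring the drop in accuracy can be sketched in plain Python. This is a simplified illustration, not the procedure any particular package uses: the model, data, and function names are made up, and real random forest implementations typically do this per tree on data that tree did not see during training.

```python
import random

def accuracy(model, rows, labels):
    """Fraction of rows the model classifies correctly."""
    correct = sum(model(row) == label for row, label in zip(rows, labels))
    return correct / len(rows)

def decrease_in_accuracy(model, rows, labels, column, rng):
    """Shuffle one column and return how much accuracy drops."""
    baseline = accuracy(model, rows, labels)
    shuffled = [row[column] for row in rows]
    rng.shuffle(shuffled)
    permuted_rows = [
        row[:column] + [value] + row[column + 1:]
        for row, value in zip(rows, shuffled)
    ]
    return baseline - accuracy(model, permuted_rows, labels)

# Hypothetical model that only ever looks at column 0.
def model(row):
    return 1 if row[0] > 0.5 else 0

rows = [[0.1, 0.9], [0.2, 0.1], [0.8, 0.3], [0.9, 0.7], [0.3, 0.5], [0.7, 0.2]]
labels = [0, 0, 1, 1, 0, 1]

rng = random.Random(0)
drop_col0 = decrease_in_accuracy(model, rows, labels, 0, rng)
drop_col1 = decrease_in_accuracy(model, rows, labels, 1, rng)
print("column 0:", drop_col0, "column 1:", drop_col1)
```

Because the model ignores column 1, permuting it cannot change any prediction, so its decrease in accuracy is zero; permuting column 0 can only leave accuracy the same or lower it.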
What is mean decrease Gini?
Gini impurity measures how homogeneous or heterogeneous (mixed) the classes at a node are.
Each split in each decision tree is associated with a decrease in Gini impurity for the descendant nodes compared to the parent node.
Adding up these decreases for every split made on a particular variable, and averaging across the trees in the forest, gives a measure of importance for that variable.
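A small sketch of the calculation for a single split, assuming the usual definition of Gini impurity (one minus the sum of squared class proportions) and weighting each child node by its share of the parent's data points. The labels are made up.

```python
from collections import Counter

def gini_impurity(labels):
    """0 for a perfectly homogeneous node; higher when classes are mixed."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted_children = (
        (len(left) / n) * gini_impurity(left)
        + (len(right) / n) * gini_impurity(right)
    )
    return gini_impurity(parent) - weighted_children

parent = ["a", "a", "b", "b"]

# A split that separates the classes perfectly.
perfect = gini_decrease(parent, ["a", "a"], ["b", "b"])
# A split that leaves both children as mixed as the parent.
useless = gini_decrease(parent, ["a", "b"], ["a", "b"])
print("perfect split:", perfect, "useless split:", useless)
```

The perfect split removes all of the parent's impurity (a decrease of 0.5), while the useless split removes none (a decrease of 0).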
What is a test data set? And what can it be used for in classification?
A test dataset is data held back from training for which you know the true classification of each data point. You test the model against it to see how well its predictions match the true classifications.
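Holding back test data and scoring a model against it might look like the sketch below. The data are made up, and the model is a fixed cut-off used as a stand-in; in practice the model would be trained on the training portion only.

```python
import random

def train_test_split(rows, labels, test_fraction, seed):
    """Shuffle the row indices, then hold back a fraction of them as test data."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = ([rows[i] for i in train_idx], [labels[i] for i in train_idx])
    test = ([rows[i] for i in test_idx], [labels[i] for i in test_idx])
    return train, test

def accuracy(model, rows, labels):
    """Fraction of data points whose prediction matches the true class."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)

# Made-up data: one measured attribute per individual; class 1 = has the disease.
rows = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

(train_rows, train_labels), (test_rows, test_labels) = train_test_split(
    rows, labels, test_fraction=0.3, seed=1
)

# Hypothetical stand-in for a trained model: a fixed cut-off on the attribute.
def model(value):
    return 1 if value > 5.5 else 0

test_accuracy = accuracy(model, test_rows, test_labels)
print(len(train_rows), "training points,", len(test_rows), "test points")
print("accuracy on the test data:", test_accuracy)
```

The key point is that the test points play no part in building the model, so the accuracy measured on them reflects how the model handles data it has not seen.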
Draw a confusion matrix.
[Drawing of a confusion matrix]
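A confusion matrix cross-tabulates the true classes against the predicted classes. As a minimal sketch for a binary problem, with made-up true and predicted labels (1 = positive, 0 = negative):

```python
from collections import Counter

def confusion_matrix(true_labels, predicted_labels):
    """Count each (true class, predicted class) combination for classes 0 and 1."""
    counts = Counter(zip(true_labels, predicted_labels))
    return {
        "TP": counts[(1, 1)],  # true 1, predicted 1
        "FN": counts[(1, 0)],  # true 1, predicted 0
        "FP": counts[(0, 1)],  # true 0, predicted 1
        "TN": counts[(0, 0)],  # true 0, predicted 0
    }

def print_matrix(m):
    """Print the matrix with true classes as rows and predictions as columns."""
    print("            predicted 1   predicted 0")
    print(f"true 1      {m['TP']:>11}   {m['FN']:>11}")
    print(f"true 0      {m['FP']:>11}   {m['TN']:>11}")

# Made-up labels for eight test data points.
true_labels = [1, 1, 1, 0, 0, 0, 0, 1]
predicted_labels = [1, 0, 1, 0, 1, 0, 0, 1]

matrix = confusion_matrix(true_labels, predicted_labels)
print_matrix(matrix)

# Accuracy is the share of points on the diagonal (correct predictions).
overall_accuracy = (matrix["TP"] + matrix["TN"]) / len(true_labels)
print("accuracy:", overall_accuracy)
```

Correct predictions sit on the diagonal (TP and TN); the off-diagonal cells (FN and FP) show which kind of mistake the model makes.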