Week 3 - Classification Flashcards
what is the difference between a parameter and hyperparameter?
parameter = learned by the model during training, e.g. regression coefficients
hyperparameter = set by the user or by grid search, e.g. the regularisation strength lambda
what is the function for a logistic regression called?
The sigmoid (logistic) function
Implicit in this function is a threshold (typically 0.5), which is used to make classifications
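The sigmoid and its implicit threshold can be sketched in a few lines of NumPy (the 0.5 cut-off is the usual default, but any threshold works):

```python
import numpy as np

def sigmoid(z):
    # maps any real-valued score to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

scores = np.array([-2.0, 0.0, 3.0])   # raw model outputs (log-odds)
probs = sigmoid(scores)
labels = (probs >= 0.5).astype(int)   # apply the implicit threshold
```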
what is the cost function for logistic regression?
You want to minimise the log loss, also called the cross-entropy loss
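A minimal NumPy sketch of the log loss (the `eps` clipping is a standard numerical guard, not part of the definition):

```python
import numpy as np

def log_loss(y_true, p):
    # average cross-entropy between true labels (0/1) and predicted probabilities
    eps = 1e-12
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y = np.array([1, 0, 1])
p = np.array([0.9, 0.1, 0.8])
loss = log_loss(y, p)  # confident, correct predictions -> small loss
```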
what is the equation for accuracy and what does it mean?
meaning: overall proportion of correct classifications
accuracy = correct predictions/total predictions
what is sensitivity?
The proportion of actual positives that are correctly identified / the true positive rate
TP/(TP+FN) = TP/P
Sensitivity is also known as recall
AKA how good is it at identifying positives
What is specificity?
The proportion of actual negatives that are correctly identified
TN/(TN+FP) = TN/N
This is equivalent to 1 - false positive rate
Also known as true negative rate
AKA how good is it at identifying negatives
What is precision?
TP/(TP+FP) = TP/PP (predicted positives)
The proportion of positive results that were correctly classified
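All four metrics above fall out of the same confusion-matrix counts; a quick check with illustrative numbers (TP/FN/FP/TN chosen arbitrarily):

```python
# hypothetical confusion-matrix counts
TP, FN, FP, TN = 40, 10, 5, 45

sensitivity = TP / (TP + FN)                  # recall / true positive rate
specificity = TN / (TN + FP)                  # true negative rate
precision   = TP / (TP + FP)                  # of predicted positives, how many were right
accuracy    = (TP + TN) / (TP + FN + FP + TN) # overall proportion correct
```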
What do the rows and columns in the confusion matrix correspond to?
Rows = predicted class
Columns = actual class (note: conventions vary; e.g. scikit-learn puts the actual class in the rows)
How can we change the sensitivity and specificity of a logistic regression classifier?
We can adjust the threshold
A lower threshold will increase sensitivity
A higher threshold will increase specificity
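This trade-off is easy to see numerically; a small sketch with made-up labels and probabilities, comparing a low and a high threshold:

```python
import numpy as np

def rates(y_true, probs, threshold):
    # sensitivity and specificity at a given probability cut-off
    pred = (probs >= threshold).astype(int)
    tp = np.sum((pred == 1) & (y_true == 1))
    tn = np.sum((pred == 0) & (y_true == 0))
    return tp / np.sum(y_true == 1), tn / np.sum(y_true == 0)

y = np.array([0, 0, 1, 1, 1, 0])
p = np.array([0.2, 0.4, 0.6, 0.8, 0.3, 0.7])
low = rates(y, p, 0.25)    # low threshold: catches more positives
high = rates(y, p, 0.75)   # high threshold: rejects more negatives
```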
What does the ROC curve show?
The trade-off between sensitivity and specificity.
The X axis shows FPR (1 - specificity)
The Y axis shows TPR (Sensitivity)
The ROC curve shows how sensitivity and specificity vary across different classification thresholds. E.g. for a threshold that classifies ALL the true positives correctly, what would the FPR be? For a threshold that captures 0.7 of the true positives, what would the FPR be? And so on.
The point at (0, 0) represents a threshold that doesn't classify anything as positive
ROC curves make it easy to identify the best thresholds for making a decision
What does the AUC show?
The AUC is the area under the ROC curve
A bigger AUC indicates better discrimination overall: a higher true positive rate for any given false positive rate
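A from-scratch sketch of the ROC/AUC computation (sweeping every score as a threshold, then integrating TPR over FPR with the trapezoid rule; the toy data is illustrative):

```python
import numpy as np

def roc_auc(y_true, scores):
    # one ROC point per distinct threshold, highest first
    thresholds = np.sort(np.unique(scores))[::-1]
    P = np.sum(y_true == 1)
    N = np.sum(y_true == 0)
    tpr, fpr = [0.0], [0.0]
    for t in thresholds:
        pred = scores >= t
        tpr.append(np.sum(pred & (y_true == 1)) / P)
        fpr.append(np.sum(pred & (y_true == 0)) / N)
    tpr, fpr = np.array(tpr), np.array(fpr)
    # area under the curve via the trapezoid rule
    return np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)

y = np.array([0, 0, 1, 1])
s = np.array([0.1, 0.4, 0.35, 0.8])
auc = roc_auc(y, s)
```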
When might precision-recall curve be more useful than a ROC curve?
If there’s a highly imbalanced sample. This is because precision and recall are both computed relative to the positive class, rather than involving the large pool of true negatives, so the curve remains sensitive to predictions on the minority class
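A worked example of why this matters, with hypothetical counts for an imbalanced problem (10 positives, 990 negatives): the ROC metrics look excellent while precision exposes the problem.

```python
# hypothetical imbalanced result: classifier finds all 10 positives
# but also flags 90 of the 990 negatives
TP, FN = 10, 0
FP, TN = 90, 900

tpr = TP / (TP + FN)        # 1.00 -> ROC looks excellent
fpr = FP / (FP + TN)        # ~0.09 -> ROC still looks excellent
precision = TP / (TP + FP)  # 0.10 -> PR curve reveals most flags are wrong
```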
what is the margin in support vector machines?
The distance between the decision boundary (threshold) and the closest observations on either side
how does allowing misclassifications impact the bias variance trade off?
Allowing misclassifications reduces the variance (at the cost of some bias). Otherwise the boundary may fit itself to outliers
SVM then often uses a soft margin, which allows for some misclassifications
how does a kernel support vector machine work?
It aims to find a hyperplane that classifies classes with the Maximum margin
SVMs optimise the hinge loss, which measures how well the classifier’s decision boundary separates the classes. Points that are classified correctly and lie beyond the margin incur no loss, no matter how far they are from the boundary; points inside the margin or on the wrong side are penalised in proportion to how far they fall short of the margin
Non-linear kernels can separate data linearly by adding extra dimensions
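A minimal sketch of the hinge loss, using labels in {-1, +1} and signed distances f(x) as is conventional for SVMs (the example scores are made up):

```python
import numpy as np

def hinge_loss(y, scores):
    # zero loss beyond the margin (y * f(x) >= 1);
    # linear penalty inside the margin or on the wrong side
    return np.mean(np.maximum(0.0, 1.0 - y * scores))

y = np.array([1, 1, -1, -1])
f = np.array([2.5, 0.5, -3.0, 0.4])
# 2.5: correct, beyond margin -> 0 loss
# 0.5: correct but inside margin -> small loss
# -3.0: correct, beyond margin -> 0 loss
# 0.4: wrong side of the boundary -> largest loss
loss = hinge_loss(y, f)
```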
What is the kernel trick
By adding extra dimensions, the support vector machine can find new ways to linearly separate the data
SVMs use kernel functions to systematically find support vector classifiers in higher dimensions, without explicitly computing the coordinates in those dimensions
Polynomial kernel with d = 1 uses the original dimension. With d = 2 it adds a second dimension of x^2
With d = 3 it adds a third dimension of x^3
We can find a good value for d using cross-validation
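A toy demonstration of the d = 2 idea: 1-D points whose positives sit between the negatives cannot be split by a single threshold on x, but lifting to (x, x^2) makes them linearly separable (the cut-off x^2 < 2 is chosen by eye for this made-up data):

```python
import numpy as np

# 1-D data: positives in the middle, negatives on both sides
x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.array([-1, -1, 1, 1, 1, -1, -1])

# lift to a second dimension with the polynomial map x -> (x, x^2);
# in the lifted space a horizontal line separates the classes
x2 = x ** 2
pred = np.where(x2 < 2.0, 1, -1)
```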
how does k-nearest neighbours work?
For the data point to classify, it measures the distance to the surrounding data points.
It then assigns the dominant class amongst the data point’s k nearest neighbours
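The two steps above fit in a short from-scratch sketch (Euclidean distance and simple majority vote; not a production implementation):

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    # distance from the new point to every training point
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # majority vote amongst their labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y = np.array([0, 0, 0, 1, 1, 1])
```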
what is the difference between KNN and other classification algorithms?
The model doesn’t fit itself to the training dataset; the model essentially IS the training data.
However it’s still useful to use cross-validation and split the data, because we still have hyperparameters such as the optimal k or the optimal distance metric
how does one vs. rest work for multi-class classification?
Classification is split into as many binary problems as there are classes. E.g. AD, MCI and CN would give AD vs. rest, MCI vs. rest and CN vs. rest
what is one vs. one?
n classes = n * (n-1)/2 classifiers
out of all the binary classifiers, the prediction = the class with the most votes
If the classes receive an equal number of votes, you can consider how far the data point is from each decision boundary
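The one-vs-one vote count for the AD/MCI/CN example can be sketched directly (the individual classifier outcomes are hypothetical):

```python
import numpy as np

# 3 classes -> 3 * (3 - 1) / 2 = 3 pairwise classifiers;
# hypothetical winner of each pairwise contest
votes = {"AD vs MCI": "AD", "AD vs CN": "AD", "MCI vs CN": "CN"}

# prediction = class with the most pairwise wins
winners, counts = np.unique(list(votes.values()), return_counts=True)
prediction = winners[np.argmax(counts)]
```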
What are strengths and weaknesses of SVM, logistic regression and KNN?
LOGISTIC REGRESSION:
- probabilistic interpretation
- can be regularised to avoid overfitting
- However tends to underperform for non-linear boundaries
SVM:
- Can model non-linear boundaries
- However it’s tricky to tune, as you need to select the right kernel, and it’s also computationally intensive
KNN:
- simple, with no training time
- However you need to select k, and it’s very computationally intensive for large datasets