Jupyter Notebook 1.3-Binary_Classification Flashcards
What is classification?
- Definition: Classification is a type of supervised learning where the goal is to predict a categorical label for an input.
- Examples: Spam detection (Spam or Not Spam), Tumor diagnosis (Malignant or Benign).
- Key Algorithms: Logistic Regression, Decision Trees, Random Forest, Support Vector Machines (SVM), k-Nearest Neighbors (k-NN).
- Output: Discrete values (e.g., classes like 0/1, Yes/No).
- Performance Metrics: Accuracy, Precision, Recall, F1-Score, AUC-ROC.
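One of the listed algorithms, k-Nearest Neighbors, can be sketched in a few lines of plain Python to show what a discrete (class-label) output looks like. This is a minimal illustration, not a production implementation; the function name `knn_predict` and the toy data are made up for this example.

```python
import math

def knn_predict(train_X, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    # Sort training points by Euclidean distance to x.
    dists = sorted(
        (math.dist(p, x), label) for p, label in zip(train_X, train_y)
    )
    votes = [label for _, label in dists[:k]]
    # Discrete output: the most common class label among the k neighbors.
    return max(set(votes), key=votes.count)

# Toy data: class 0 clusters near the origin, class 1 near (5, 5).
X = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
y = [0, 0, 0, 1, 1, 1]
print(knn_predict(X, y, (0.5, 0.5)))  # near the class-0 cluster: prints 0
print(knn_predict(X, y, (5.5, 5.5)))  # near the class-1 cluster: prints 1
```

Note the output is always one of the class labels (0 or 1), never a continuous value, which is the defining contrast with regression.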
What is regression?
- Definition: Regression is a type of supervised learning where the goal is to predict a continuous value based on input features.
- Examples: House price prediction, Stock market forecasting, Temperature prediction.
- Key Algorithms: Linear Regression, Polynomial Regression, Decision Trees, Random Forest, Support Vector Regression (SVR).
- Output: Continuous values (e.g., numerical quantities).
- Performance Metrics: Mean Squared Error (MSE), Mean Absolute Error (MAE), R-squared.
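The three regression metrics above are simple enough to compute by hand; here is a small pure-Python sketch of their definitions (the toy `y_true`/`y_pred` values are invented for illustration).

```python
def mse(y_true, y_pred):
    """Mean Squared Error: average of squared prediction errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean Absolute Error: average of absolute prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """R-squared: 1 minus (residual sum of squares / total sum of squares)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0]
y_pred = [2.5, 5.0, 8.0]
print(mse(y_true, y_pred))        # (0.25 + 0 + 1) / 3
print(mae(y_true, y_pred))        # (0.5 + 0 + 1) / 3 = 0.5
print(r_squared(y_true, y_pred))  # 1 - 1.25/8 = 0.84375
```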
What do the notations below stand for?
P
N
TP
FP
TN
FN
P = All actual positive data points
N = All actual negative data points
TP = True positives (correctly identified positives)
FP = False positives (negatives wrongly identified as positives)
TN = True negatives (correctly identified negatives)
FN = False negatives (positives wrongly identified as negatives)
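The classification metrics from the earlier card can all be written in terms of these four counts. A small sketch, using example counts invented for illustration (the function name `classification_metrics` is not from the notebook):

```python
def classification_metrics(TP, FP, TN, FN):
    """Derive standard metrics from the four confusion-matrix counts."""
    P = TP + FN  # all actual positives
    N = TN + FP  # all actual negatives
    accuracy = (TP + TN) / (P + N)
    precision = TP / (TP + FP)   # of predicted positives, how many are real
    recall = TP / P              # of actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Example: 100 samples, 45 actual positives, 55 actual negatives.
acc, prec, rec, f1 = classification_metrics(TP=40, FP=10, TN=45, FN=5)
print(acc)   # (40 + 45) / 100 = 0.85
print(prec)  # 40 / 50 = 0.8
```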
What are the two main forms of supervised learning?
Classification and Regression
What is the special case of cross-validation called where K is set to the number of data points in the training set?
Possibly a random exam question.
It’s called: leave-one-out.
Each fold is then a single sample
How does cross-validation work?
We randomly split the training set into K parts, called folds, then train a model K times, each time using a different fold for evaluation and training on the remaining K-1 folds. The average score of the K runs is used to estimate the model's performance.
What is the Trap of Unbalanced Datasets?
A situation where one class significantly outnumbers the other, leading to misleading model performance, such as high accuracy despite poor detection of the minority class (diabetes competition)
Key problems:
* Accuracy Paradox: High overall accuracy but poor minority class detection
* Biased Models: The model may focus on the majority class, ignoring minority cases
Solutions:
* Resampling: Oversample the minority class or undersample the majority class
* Adjust Metrics: Use precision, recall, F1-score or balanced accuracy
* Class Weights: Penalize wrong predictions on the minority class.
In the diabetes competition there were far more healthy individuals than people diagnosed with diabetes, so the dataset was heavily unbalanced.
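The accuracy paradox is easy to demonstrate with invented numbers in the spirit of the diabetes example: a "model" that always predicts the majority class scores high accuracy while catching zero minority cases.

```python
# Toy labels: 95 healthy (0) vs 5 diabetic (1), invented for illustration.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # always predict the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
TP = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
FN = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
recall = TP / (TP + FN)

print(accuracy)  # 0.95, looks impressive
print(recall)    # 0.0, not a single diabetic case is detected
```

This is exactly why recall, precision, F1-score, or balanced accuracy should be reported alongside plain accuracy on unbalanced data.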
What does the StandardScaler do in machine learning?
Standardizes features by scaling them to have a mean of 0 and standard deviation of 1.
z = (x-μ)/σ
x: Original feature value
μ: Mean of the feature
σ: Standard deviation of the feature
- Ensures features contribute equally to the model.
- Improves performance of algorithms sensitive to data scale (e.g., SGD, SVM, KNN, Neural Networks).
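The formula z = (x-μ)/σ can be applied by hand with a few lines of stdlib Python. This is only a sketch of the math (the function name `standardize` and the toy feature values are invented); in practice you would use scikit-learn's StandardScaler, which also remembers μ and σ so the same transform can be applied to test data.

```python
import statistics

def standardize(values):
    """Apply z = (x - mu) / sigma so the result has mean 0 and std 1."""
    mu = statistics.mean(values)
    # Population standard deviation (ddof=0), matching StandardScaler.
    sigma = statistics.pstdev(values)
    return [(x - mu) / sigma for x in values]

feature = [10.0, 20.0, 30.0, 40.0]
z = standardize(feature)
print(statistics.mean(z))   # 0.0
print(statistics.pstdev(z)) # 1.0
```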