Fundamentals of Machine Learning I (Not Part of Certification) Flashcards
Characteristics of Regression?
- Predicts a numeric value (an amount of something) from feature values
- Splits the data into training and validation subsets, then fits a mathematical function by penalizing prediction error
- Supervised
- Metrics
- MAE, MSE, RMSE, R^2
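A minimal sketch of how the four regression metrics above are usually computed. The function name `regression_metrics` and the sample values are illustrative assumptions, not part of the original cards.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R^2 for paired actual/predicted values."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n   # mean absolute error
    mse = sum(e * e for e in errors) / n    # mean squared error
    rmse = math.sqrt(mse)                   # back in the label's units

    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                  # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total variation
    r2 = 1 - ss_res / ss_tot

    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}

# Made-up actual and predicted values, for illustration only.
metrics = regression_metrics([2.0, 4.0, 6.0, 8.0], [3.0, 4.0, 5.0, 10.0])
print(metrics)
```

Lower MAE, MSE and RMSE mean smaller errors, while an R^2 closer to 1 means the model explains more of the variance.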
What is binary classification, and what are its characteristics?
- Predicts one of two classes (e.g., diabetic vs. not diabetic)
- Supervised
- Uses probability
- Maps the model output to a probability with a sigmoid curve, then assigns a class using a threshold (commonly 0.5)
- Logistic Regression is a binary classifier
- Trained on a random subset of the data, with the rest held back for validation
- Evaluated with a confusion matrix
- F1, Recall, Precision, AUC, TPR, FPR, ROC
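A small sketch of the sigmoid-and-threshold idea from this card, assuming a single feature and a linear score. The names `predict_class`, `weight`, and `bias`, and the values passed in, are hypothetical and not learned from any data.

```python
import math

def sigmoid(z):
    """Squash any real-valued score into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(x, weight, bias, threshold=0.5):
    """Return (predicted class, probability of the positive class)."""
    probability = sigmoid(weight * x + bias)
    label = 1 if probability >= threshold else 0
    return label, probability

# Illustrative parameters only; a real model would learn these from training data.
print(predict_class(2.0, weight=1.5, bias=-2.0))  # score  1.0 -> class 1
print(predict_class(1.0, weight=1.5, bias=-2.0))  # score -0.5 -> class 0
```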
What is RMSE
RMSE (Root Mean Squared Error) Square root of the MSE. Measures the
typical size of prediction errors in the same units as the label
What is MAE
MAE (Mean Absolute Error) Mean of the absolute differences between predicted and actual values
What is F1
F1 Score
- Harmonic mean of Precision and Recall.
- Formula: 2 * (Precision * Recall) / (Precision + Recall).
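A worked sketch of the formula, starting from confusion-matrix counts. The function name and the counts (`tp`, `fp`, `fn`) are made-up values for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * (precision * recall) / (precision + recall)
    return precision, recall, f1

# Illustrative counts: 8 true positives, 2 false positives, 4 false negatives.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(precision, recall, f1)
```

Because F1 is a harmonic mean, it stays low unless both precision and recall are reasonably high.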
What is Recall
Recall (Sensitivity, TPR)
- Proportion of actual positives correctly identified by the model.
- Formula: TP / (TP + FN).
What is AUC
AUC (Area Under the Curve)
- Measures overall classification performance as the area under the ROC curve.
- Values range from 0 to 1, with 1 being perfect and 0.5 representing random guessing.
What is TPR
TPR (True Positive Rate, Recall)
- Proportion of actual positives classified as positive.
- Formula: TP / (TP + FN).
What is Precision
Precision
- Proportion of predicted positives that are actual positives.
- Formula: TP / (TP + FP).
What is FPR
FPR (False Positive Rate)
- Proportion of actual negatives incorrectly classified as positive.
- Formula: FP / (FP + TN).
What is ROC
ROC (Receiver Operating Characteristic) Curve
- Plots TPR vs. FPR across different thresholds.
- Used to assess model performance over varying decision boundaries.
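One way to sketch the ROC curve and its area: sweep a threshold over the predicted scores, record (FPR, TPR) at each step, and integrate with the trapezoid rule. The helper names `roc_points` and `auc` and the sample scores are assumptions for illustration, and the sketch assumes both classes are present in the labels.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, from the strictest threshold to the loosest."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve using the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Made-up labels and predicted probabilities.
labels = [0, 0, 1, 1]
probabilities = [0.1, 0.4, 0.35, 0.8]
curve = roc_points(labels, probabilities)
print(curve, auc(curve))
```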
What is R^2
R^2 (Coefficient of determination) Proportion of the variance in the
actual values that the model explains. Values closer to 1 indicate a better fit
What is MSE
MSE (Mean Squared Error) Mean of the squared differences between predicted
and actual values. Squaring amplifies larger errors
What is multiclass classification?
Multiclass classification is used to predict which of multiple possible classes an observation belongs to. It calculates probability values for each class label and predicts the most probable class.
(Supervised)
What are the two types of algorithms used in multiclass classification?
The two types of algorithms are:
One-vs-Rest (OvR): Trains a binary classification function for each class.
Multinomial: Creates a single function that returns a probability distribution for all possible classes.
How does the One-vs-Rest (OvR) algorithm work?
The OvR algorithm trains separate binary classification functions for each class. Each function calculates the probability that an observation belongs to a specific class compared to all others, and the class with the highest probability is predicted.
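A rough sketch of OvR prediction, assuming each class already has its own trained binary model of the form sigmoid(weights · features + bias). The class names, weights, and the `ovr_predict` helper are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def ovr_predict(features, class_models):
    """Score the observation with every per-class binary model and pick the best."""
    probabilities = {}
    for label, (weights, bias) in class_models.items():
        score = sum(w * x for w, x in zip(weights, features)) + bias
        probabilities[label] = sigmoid(score)  # P(this class vs. all others)
    predicted = max(probabilities, key=probabilities.get)
    return predicted, probabilities

# Hypothetical parameters for three binary classifiers, one per class.
models = {
    "A": ([1.0, 0.0], 0.0),
    "B": ([0.0, 1.0], 0.0),
    "C": ([-1.0, -1.0], 0.0),
}
predicted, probabilities = ovr_predict([2.0, 0.5], models)
print(predicted, probabilities)
```

Each probability comes from an independent binary model, so the values generally do not sum to 1.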
What is a multinomial algorithm, and how does it work?
A multinomial algorithm creates a single function that returns a vector with probability values for each class. These values add up to 1, and the class with the highest probability is predicted. An example is the softmax function.
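A short sketch of the softmax function mentioned here. The input scores are arbitrary example values, and subtracting the maximum is a common numerical-stability trick rather than something stated on the card.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # avoids overflow in exp() without changing the result
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Arbitrary scores for three classes.
probs = softmax([2.0, 1.0, 0.1])
predicted = max(range(len(probs)), key=probs.__getitem__)  # index of most probable class
print(probs, predicted)
```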
How can you evaluate a multiclass classification model?
You can evaluate by calculating binary classification metrics for each class or aggregate metrics across all classes. Metrics like accuracy, recall, precision, and F1-score can be derived from a multiclass confusion matrix.
How are binary classification metrics used in evaluating multiclass classification models?
In multiclass classification, binary metrics such as accuracy, recall, precision, and F1-score can be calculated for each individual class. These are derived from the confusion matrix, where True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN) are recorded for each class. Aggregate metrics can then be calculated for overall model evaluation.
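A sketch of deriving per-class TP, FP, FN and TN from a multiclass confusion matrix. It assumes rows are actual classes and columns are predicted classes; the labels, counts, and function names are made up for illustration.

```python
def per_class_metrics(matrix, labels):
    """Treat each class as 'positive' in turn and compute its binary metrics."""
    total = sum(sum(row) for row in matrix)
    results = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        fn = sum(matrix[i]) - tp                  # actual class i, predicted otherwise
        fp = sum(row[i] for row in matrix) - tp   # predicted class i, actually otherwise
        tn = total - tp - fn - fp
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.0)
        results[label] = {"TP": tp, "FP": fp, "FN": fn, "TN": tn,
                          "precision": precision, "recall": recall, "f1": f1}
    return results

def overall_accuracy(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / sum(sum(row) for row in matrix)

# Made-up confusion matrix: rows = actual, columns = predicted.
class_labels = ["cat", "dog", "bird"]
confusion = [[5, 1, 0],
             [2, 6, 2],
             [0, 1, 3]]
results = per_class_metrics(confusion, class_labels)
macro_recall = sum(r["recall"] for r in results.values()) / len(results)
print(results["dog"], overall_accuracy(confusion), macro_recall)
```

Averaging the per-class values, as `macro_recall` does, is one common way to get an aggregate metric.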
What metrics can be derived from a multiclass confusion matrix?
Metrics derived from a multiclass confusion matrix include:
Accuracy, Recall, Precision, F1-Score
Define Unsupervised Learning
Unsupervised learning uses data without labels to find patterns or groupings. Examples: clustering, dimensionality reduction.
Define Supervised Learning
Supervised learning uses labeled data (features and known outcomes) to train a model to make predictions. Examples: classification, regression.
What is clustering in machine learning?
Clustering is an unsupervised learning method where observations are grouped into clusters based on similarities in their features, without using labels.
Why is clustering considered unsupervised learning?
Clustering is unsupervised because it doesn’t rely on known label values. Instead, it groups data points based solely on feature similarities.
What is an example of clustering?
In a flower dataset with features like the number of leaves and petals, clustering groups similar flowers based on these features without knowing their species.
What is the centroid process in K-Means clustering?
- Initialization: Choose 𝑘 clusters and randomly select 𝑘 initial centroids.
- Assignment Step: Calculate distances from each point to centroids, assigning points to the nearest centroid.
- Update Step: Recalculate centroids by finding the mean of assigned points.
- Repeat: Iterate the assignment and update steps until centroids stabilize or a maximum number of iterations is reached.
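The four steps above can be sketched in plain Python as follows. The function name `kmeans`, the `seed` parameter (added so the random initialization is repeatable), and the sample points are assumptions for illustration.

```python
import math
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Basic K-Means on a list of coordinate tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # Initialization: k random points
    clusters = [[] for _ in range(k)]
    for _ in range(max_iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:  # Repeat until the centroids stabilize
            break
        centroids = new_centroids
    return centroids, clusters

# Two visibly separate groups of made-up 2D points.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
        (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(data, k=2)
print(sorted(centroids))
```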
What is K-Means clustering?
K-Means is a clustering algorithm that assigns data points to clusters by minimizing the distance to the cluster’s centroid, repeatedly adjusting centroids until stable clusters are formed.
How do you evaluate a clustering model?
Clustering models are evaluated by how well the clusters are separated, using metrics like average distance to centroid, maximum distance to centroid, and silhouette score.
What is a silhouette score in clustering?
A silhouette score measures cluster separation, ranging from -1 to 1, where values closer to 1 indicate better-defined clusters.
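A sketch using the standard silhouette definition, which the card does not spell out: for each point, a is the mean distance to the other points in its own cluster, b is the lowest mean distance to the points of any other cluster, and the point's score is (b - a) / max(a, b). The function name and sample data are illustrative, and at least two clusters are assumed.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette over all points; requires at least two clusters."""
    scores = []
    for i, p in enumerate(points):
        own = [math.dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)  # common convention for single-point clusters
            continue
        a = sum(own) / len(own)
        b = min(
            sum(math.dist(p, q) for q, lab in zip(points, labels) if lab == other)
            / labels.count(other)
            for other in set(labels) if other != labels[i]
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

# Made-up points: the first labeling matches the visible groups, the second does not.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
        (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
good = silhouette_score(data, [0, 0, 0, 1, 1, 1])
poor = silhouette_score(data, [0, 1, 0, 1, 0, 1])
print(good, poor)
```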
What metrics are used to evaluate clusters?
Metrics include average distance to cluster center, average distance to other centers, maximum distance to center, and silhouette score.
What is a Centroid
A centroid is the central point of a cluster, representing the mean position of all data points assigned to that cluster in the feature space. It serves as the reference for assigning new points to clusters in algorithms like K-Means.