Testing & Evaluating Flashcards
Accuracy
A metric used to evaluate the performance of a classification model by measuring the proportion of correctly classified instances among all the instances in the dataset. It is calculated as the ratio of the number of correct predictions to the total number of predictions made by the model.
Accuracy provides an overall assessment of the model’s ability to correctly classify instances across all classes and is commonly used as a performance measure for balanced datasets with roughly equal class distributions. However, accuracy may not be suitable for imbalanced datasets, where the class distribution is skewed, as it can be misleading and biased towards the majority class. In such cases, other evaluation metrics, such as precision, recall, F1 score, or area under the ROC curve, may provide a more comprehensive assessment of the model’s performance.
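For illustration, a minimal sketch using scikit-learn’s accuracy_score (the labels below are toy values):

```python
# Accuracy = correct predictions / total predictions
from sklearn.metrics import accuracy_score

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]

print(accuracy_score(y_true, y_pred))  # 5 correct out of 6 -> ~0.83
```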
Area under ROC Curve
The area under the receiver operating characteristic (ROC) curve, often abbreviated as AUC-ROC or AUC, is a metric used to evaluate the performance of binary classification models by measuring the trade-off between true positive rate (sensitivity) and false positive rate (1 - specificity) across different decision thresholds. The ROC curve plots the true positive rate against the false positive rate for various threshold values, and the AUC-ROC represents the area under this curve. A higher AUC-ROC value indicates better discrimination and predictive performance of the model, with a value of 1 indicating perfect classification, while a value of 0.5 indicates random guessing. AUC-ROC is widely used in binary classification tasks to compare and select models, especially in scenarios where the class distribution is imbalanced or the costs of false positives and false negatives are unequal.
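A minimal sketch using scikit-learn’s roc_auc_score (the scores below are toy probabilities for the positive class):

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]  # predicted probabilities for the positive class

print(roc_auc_score(y_true, y_score))  # 0.75 for these toy values
```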
Bias (of the model)
Error introduced by approximating a real-world problem with a simplified model that does not capture all the underlying patterns or relationships in the data. A model with high bias tends to underfit the training data, meaning it has high error on both the training and test datasets due to oversimplified assumptions or inadequate complexity. Bias measures how closely the average prediction of a model matches the true underlying value it is trying to predict. Minimizing bias typically involves increasing the complexity or flexibility of the model to capture more nuanced patterns in the data.
A model has low bias if it predicts the labels of the training data well. If the model makes many mistakes on the training data, we say that it has high bias, or that it underfits.
Bias-variance trade-off
A fundamental concept in supervised learning that describes the relationship between bias, variance, and model complexity. Bias measures the error introduced by simplifying assumptions in the model, while variance measures the variability of model predictions across different training datasets. The bias-variance trade-off states that as the complexity of a model increases, bias decreases but variance increases, and vice versa. The goal in machine learning is to find the optimal balance between bias and variance to minimize the overall error or generalization error of the model. Overly simple models suffer from high bias and underfitting, while overly complex models suffer from high variance and overfitting.
Binary cross-entropy
Also known as log loss or logistic loss, is a loss function used in binary classification tasks to measure the difference between the predicted probabilities and the actual binary labels. It quantifies the discrepancy between the predicted probability distribution and the true distribution of the binary outcomes. Binary cross-entropy is commonly used as the objective function in training logistic regression models and binary classifiers based on neural networks. Minimizing binary cross-entropy during training helps optimize the model parameters to improve its ability to discriminate between the two classes and make accurate predictions.
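A hedged sketch showing binary cross-entropy computed from its definition and via scikit-learn’s log_loss (toy values):

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1])
y_prob = np.array([0.9, 0.2, 0.7, 0.6])  # predicted probability of the positive class

# BCE = -mean(y * log(p) + (1 - y) * log(1 - p))
bce_manual = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
print(bce_manual, log_loss(y_true, y_prob))  # the two values agree
```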
Chi-squared test
A statistical hypothesis test used to determine whether there is a significant association between two categorical variables in a dataset. It is used to test the independence or dependence of categorical variables by comparing observed frequencies of variable combinations to expected frequencies under a null hypothesis of independence. The chi-squared test calculates the chi-squared statistic, which quantifies the discrepancy between observed and expected frequencies, and compares it to a chi-squared distribution to assess the significance of the association. Chi-squared tests are commonly used in contingency table analysis, goodness-of-fit tests, and feature selection in machine learning and data analysis.
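A minimal sketch using scipy’s chi2_contingency on a made-up 2x2 contingency table:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: group A / group B; columns: outcome yes / no (toy counts)
table = np.array([[30, 10],
                  [20, 40]])

chi2, p_value, dof, expected = chi2_contingency(table)
print(chi2, p_value)  # a small p-value suggests the variables are not independent
```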
Classification report
A summary of the performance of a classification model on a dataset, providing metrics such as precision, recall, F1 score, and support for each class in the dataset. It is generated after evaluating the model on a test dataset and provides insights into the model’s ability to correctly classify instances across different classes. A typical classification report includes metrics such as precision (the ratio of true positive predictions to the total predicted positives), recall (the ratio of true positive predictions to the total actual positives), F1 score (the harmonic mean of precision and recall), and support (the number of instances in each class). Classification reports are commonly used to evaluate and compare the performance of different classification models and assess their suitability for specific tasks.
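An illustrative use of scikit-learn’s classification_report on toy labels:

```python
from sklearn.metrics import classification_report

y_true = [0, 1, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1]

# Prints precision, recall, F1 score, and support for each class
print(classification_report(y_true, y_pred))
```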
Coefficient of determination (R^2)
The coefficient of determination, often denoted as R², is a statistical measure used to assess the goodness of fit of a regression model. It represents the proportion of the variance in the dependent variable that is predictable from the independent variables.
R² ranges from 0 to 1:
* 0: Model explains none of the variance.
* 1: Model perfectly explains all the variance.
R² is calculated as the ratio of the explained sum of squares (ESS) to the total sum of squares (TSS), where ESS measures the variance explained by the model, and TSS measures the total variance in the dependent variable.
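A small sketch computing R² with scikit-learn’s r2_score on toy regression outputs:

```python
from sklearn.metrics import r2_score

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

print(r2_score(y_true, y_pred))  # ~0.95 for these toy values
```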
Confusion Matrix
A table used to evaluate the performance of a classification model by tabulating the actual and predicted classes of observations. It provides a summary of the model’s predictions, including true positive, true negative, false positive, and false negative counts for each class. Confusion matrices are commonly used to compute evaluation metrics such as accuracy, precision, recall, F1-score, and visualize the performance of classification models.
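A minimal sketch with scikit-learn’s confusion_matrix (toy labels):

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))  # [[2, 1], [1, 2]] here
```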
Cosine Similarity
Cosine similarity is a measure of similarity between two vectors in a multi-dimensional space, often used in information retrieval, recommendation systems, and text mining to compare the similarity of documents or feature vectors. Cosine similarity measures the cosine of the angle between the two vectors, with values ranging from -1 to 1. A cosine similarity of 1 indicates that the vectors are identical (pointing in the same direction), while a cosine similarity of -1 indicates that the vectors are diametrically opposed (pointing in opposite directions). Cosine similarity is calculated as the dot product of the two vectors divided by the product of their magnitudes.
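A hedged sketch computing cosine similarity directly from its definition with NumPy:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

# Dot product divided by the product of the magnitudes
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos_sim)  # 1.0, since b points in the same direction as a
```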
Cost function
A mathematical function used to quantify the error or discrepancy between the predicted outputs of a machine learning model and the true labels or targets in the training data. The cost function measures how well the model’s predictions align with the true values and provides a measure of the model’s performance. The goal of training a machine learning model is to minimize the cost function by adjusting the model parameters (weights and biases) using optimization algorithms such as gradient descent. Common cost functions include mean squared error (MSE) for regression tasks, cross-entropy loss for classification tasks, and various custom loss functions tailored to specific machine learning problems.
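An illustrative sketch of one common cost function, mean squared error, computed with NumPy on toy values:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5])
y_pred = np.array([2.5, 5.0, 3.0])

# MSE = mean of squared differences between predictions and targets
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # training adjusts the model parameters to drive this value down
```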
Cost-Sensitive accuracy
A performance metric used to evaluate the effectiveness of a classification model, taking into account the costs associated with different types of classification errors. In scenarios where the costs of false positives and false negatives are unequal or asymmetric, traditional accuracy metrics may not adequately reflect the true performance of the model. Cost-sensitive accuracy adjusts the accuracy metric by weighting the contributions of different types of errors based on their associated costs. For example, in a medical diagnosis task, misclassifying a patient with a serious condition as healthy (false negative) may incur higher costs than misclassifying a healthy patient as having the condition (false positive). Cost-sensitive accuracy provides a more comprehensive evaluation of the model’s performance by considering the relative importance or costs of different types of errors.
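A hedged sketch of one way to weight errors by their costs; the cost values and the normalization against the worst case are assumptions for illustration, not a standard API:

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1])

# Assumed costs: a false negative is 5x as costly as a false positive
cost_fn, cost_fp = 5.0, 1.0
fn_cost = np.sum((y_true == 1) & (y_pred == 0)) * cost_fn
fp_cost = np.sum((y_true == 0) & (y_pred == 1)) * cost_fp
worst_case = np.sum(y_true == 1) * cost_fn + np.sum(y_true == 0) * cost_fp

# Cost-sensitive accuracy: 1 minus the incurred cost as a fraction of the worst case
print(1 - (fn_cost + fp_cost) / worst_case)
```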
Cross entropy
Used to measure the difference between two probability distributions or the dissimilarity between predicted and true probability distributions. In machine learning, cross entropy is commonly used as a loss function in classification tasks, where it quantifies the difference between the predicted class probabilities and the actual class labels. Minimizing cross entropy is equivalent to maximizing the likelihood of the correct class labels given the model predictions, making it a popular choice for training classifiers in neural networks and other machine learning models.
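An illustrative multi-class sketch using scikit-learn’s log_loss (toy labels and probabilities):

```python
from sklearn.metrics import log_loss

y_true = [0, 2, 1]                 # true class labels
y_prob = [[0.7, 0.2, 0.1],         # predicted class probabilities per instance
          [0.1, 0.3, 0.6],
          [0.2, 0.6, 0.2]]

# Lower values mean the predicted distribution matches the true labels better
print(log_loss(y_true, y_prob))
```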
Cross-Validation (CV)
A resampling technique used to assess the performance and generalization ability of a machine learning model by partitioning the dataset into multiple subsets, or folds, and iteratively training and evaluating the model on different combinations of training and validation data. In k-fold cross-validation, the dataset is divided into k equally sized folds, and the model is trained k times, each time using a different fold as the validation set and the remaining folds as the training set. Cross-validation helps estimate the model’s performance on unseen data and detect potential issues such as overfitting or underfitting by providing a more reliable estimate of the model’s generalization error compared to a single train-test split.
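A minimal sketch of 5-fold cross-validation with scikit-learn’s cross_val_score; the mean of the per-fold scores is the CV score described in the next entry:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5)  # accuracy on each of the 5 folds
print(scores, scores.mean())                 # the mean is the model's CV score
```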
CV Score
CV score, or cross-validation score, refers to the evaluation metric used to assess the performance of a machine learning model during cross-validation. It represents the average performance of the model across multiple folds of the dataset and provides an estimate of the model’s generalization ability. The CV score is typically calculated as the average of the evaluation metric (such as accuracy, precision, recall, or F1 score) computed on each fold of the cross-validation process. A higher CV score indicates better performance, while a lower CV score suggests poorer generalization ability of the model.
Entropy
Entropy is a measure of uncertainty or disorder in a system, commonly used in information theory and decision tree algorithms to quantify the impurity of a set of data. In the context of decision trees and classification algorithms, entropy is calculated based on the distribution of class labels in a dataset and represents the average amount of information required to classify an instance in the dataset. Higher entropy indicates higher uncertainty or disorder, while lower entropy indicates more homogeneous or pure class distributions. Entropy is used as a splitting criterion in decision tree algorithms such as C4.5 and CART to determine the optimal feature and threshold for partitioning the data into subsets that maximize the purity of the resulting nodes.
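A hedged sketch computing the entropy (in bits) of a class-label distribution with NumPy, as a decision tree would at a node:

```python
import numpy as np

labels = np.array([0, 0, 1, 1, 1, 1])           # toy class labels at a node
_, counts = np.unique(labels, return_counts=True)
p = counts / counts.sum()                       # class proportions

entropy = -np.sum(p * np.log2(p))               # H = -sum(p_i * log2(p_i))
print(entropy)                                  # ~0.918; a pure node would give 0
```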
F1 Score
Metric used to evaluate the performance of a classification model, particularly in binary classification tasks, where there are two classes (positive and negative). It is especially useful when the classes are imbalanced (one class occurs much more frequently than the other). It is also useful when false positives and false negatives have different costs (for example, in medical diagnostics, a false negative (missing a disease) may be much more important to avoid than a false positive (further testing)).
It is the harmonic mean of precision and recall, calculated as:
F1 = 2 * (precision * recall) / (precision + recall)
Precision and recall often have a trade-off relationship. The F1 score helps to find a balance between the two for an overall evaluation of your model. A high F1 score indicates that the model has both high precision (few false positives) and high recall (few false negatives), making it a useful metric for evaluating classifiers when the class distribution is imbalanced or when false positives and false negatives have different costs or implications.
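An illustrative sketch comparing the formula above with scikit-learn’s f1_score (toy labels):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]

p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)
print(2 * (p * r) / (p + r), f1_score(y_true, y_pred))  # both give the same value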
Feature Importance
The degree to which each feature in a dataset contributes to the predictive power of a machine learning model. It helps in understanding which features are most influential in making predictions. Feature importance analysis is often performed after training a model to identify the most informative features and to prioritize them for further analysis or feature engineering.
There are several libraries for feature importance:
- scikit-learn: Provides feature_importances_ for tree-based models and permutation importance.
- Yellowbrick: A visualization library that helps you visualize feature importance.
- SHAP: A library specifically designed for explaining model outputs, including robust feature importance explanations.
Common techniques for determining feature importance in machine learning (two of these are sketched after the list):
- Feature Importance from Tree-Based Models
- Permutation Importance
- Coefficient Magnitude in Linear Models
- Mean Decrease Impurity
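A hedged sketch of two of the techniques above, tree-based importances and permutation importance, using scikit-learn on the iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Impurity-based importances from the trained trees
print(model.feature_importances_)

# Permutation importance: drop in score when each feature is shuffled
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
print(result.importances_mean)
```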