Testing & Evaluating Flashcards

1
Q

Accuracy

A

A metric used to evaluate the performance of a classification model by measuring the proportion of correctly classified instances among all the instances in the dataset. It is calculated as the ratio of the number of correct predictions to the total number of predictions made by the model.

Accuracy provides an overall assessment of the model’s ability to correctly classify instances across all classes and is commonly used as a performance measure for balanced datasets with roughly equal class distributions. However, accuracy may not be suitable for imbalanced datasets, where the class distribution is skewed, as it can be misleading and biased towards the majority class. In such cases, other evaluation metrics, such as precision, recall, F1 score, or area under the ROC curve, may provide a more comprehensive assessment of the model’s performance.
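
A minimal sketch with scikit-learn’s accuracy_score (the labels below are made up for illustration):

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels: 8 of the 10 predictions match the ground truth.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1, 1, 0]

print(accuracy_score(y_true, y_pred))  # 0.8 = correct predictions / total predictions
```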

2
Q

Area under ROC Curve

A

The area under the receiver operating characteristic (ROC) curve, often abbreviated as AUC-ROC or AUC, is a metric used to evaluate the performance of binary classification models by measuring the trade-off between true positive rate (sensitivity) and false positive rate (1 - specificity) across different decision thresholds. The ROC curve plots the true positive rate against the false positive rate for various threshold values, and the AUC-ROC represents the area under this curve. A higher AUC-ROC value indicates better discrimination and predictive performance of the model, with a value of 1 indicating perfect classification, while a value of 0.5 indicates random guessing. AUC-ROC is widely used in binary classification tasks to compare and select models, especially in scenarios where the class distribution is imbalanced or the costs of false positives and false negatives are unequal.
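
A minimal sketch with scikit-learn’s roc_auc_score, assuming you already have true labels and predicted probabilities for the positive class (the values here are made up):

```python
from sklearn.metrics import roc_auc_score

# Hypothetical binary labels and predicted probabilities for the positive class.
y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]

# 1.0 would mean perfect ranking of positives above negatives; 0.5 is random guessing.
print(roc_auc_score(y_true, y_score))
```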

3
Q

Bias (of the model)

A

Error introduced by approximating a real-world problem with a simplified model that does not capture all the underlying patterns or relationships in the data. A model with high bias tends to underfit the training data, meaning it has high error on both the training and test datasets due to oversimplified assumptions or inadequate complexity. Bias measures how closely the average prediction of a model matches the true underlying value it is trying to predict. Minimizing bias typically involves increasing the complexity or flexibility of the model to capture more nuanced patterns in the data.

A model has low bias if it predicts the labels of the training data well. If the model makes many mistakes on the training data, we say that it has high bias, or that it underfits.

4
Q

Bias-variance trade-off

A

A fundamental concept in supervised learning that describes the relationship between bias, variance, and model complexity. Bias measures the error introduced by simplifying assumptions in the model, while variance measures the variability of model predictions across different training datasets. The bias-variance trade-off states that as the complexity of a model increases, bias decreases but variance increases, and vice versa. The goal in machine learning is to find the optimal balance between bias and variance to minimize the overall error or generalization error of the model. Overly simple models suffer from high bias and underfitting, while overly complex models suffer from high variance and overfitting.

5
Q

Binary cross-entropy

A

Also known as log loss or logistic loss, is a loss function used in binary classification tasks to measure the difference between the predicted probabilities and the actual binary labels. It quantifies the discrepancy between the predicted probability distribution and the true distribution of the binary outcomes. Binary cross-entropy is commonly used as the objective function in training logistic regression models and binary classifiers based on neural networks. Minimizing binary cross-entropy during training helps optimize the model parameters to improve its ability to discriminate between the two classes and make accurate predictions.
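
A minimal sketch on made-up labels and probabilities, comparing the formula written by hand with scikit-learn’s log_loss:

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1])          # hypothetical binary labels
y_prob = np.array([0.9, 0.2, 0.7, 0.6])  # predicted probabilities for class 1

# Manual binary cross-entropy: -(1/N) * sum(y*log(p) + (1-y)*log(1-p))
manual = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

print(manual, log_loss(y_true, y_prob))  # the two values should agree
```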

6
Q

Chi-squared test

A

A statistical hypothesis test used to determine whether there is a significant association between two categorical variables in a dataset. It is used to test the independence or dependence of categorical variables by comparing observed frequencies of variable combinations to expected frequencies under a null hypothesis of independence. The chi-squared test calculates the chi-squared statistic, which quantifies the discrepancy between observed and expected frequencies, and compares it to a chi-squared distribution to assess the significance of the association. Chi-squared tests are commonly used in contingency table analysis, goodness-of-fit tests, and feature selection in machine learning and data analysis.
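
A minimal sketch using scipy.stats.chi2_contingency on a made-up 2x2 contingency table:

```python
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = two groups, columns = clicked / did not click.
observed = [[30, 70],
            [45, 55]]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value, dof)  # compare p_value to a chosen significance level (e.g. 0.05)
```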

7
Q

Classification report

A

A summary of the performance of a classification model on a dataset, providing metrics such as precision, recall, F1 score, and support for each class in the dataset. It is generated after evaluating the model on a test dataset and provides insights into the model’s ability to correctly classify instances across different classes. A typical classification report includes metrics such as precision (the ratio of true positive predictions to the total predicted positives), recall (the ratio of true positive predictions to the total actual positives), F1 score (the harmonic mean of precision and recall), and support (the number of instances in each class). Classification reports are commonly used to evaluate and compare the performance of different classification models and assess their suitability for specific tasks.
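
A minimal sketch with scikit-learn’s classification_report on made-up labels:

```python
from sklearn.metrics import classification_report

y_true = [0, 1, 2, 2, 1, 0, 1, 2]  # hypothetical true labels for three classes
y_pred = [0, 1, 2, 1, 1, 0, 0, 2]  # hypothetical model predictions

# Prints precision, recall, F1 score and support for every class.
print(classification_report(y_true, y_pred))
```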

8
Q

Coefficient of determination (R^2)

A

The coefficient of determination, often denoted as R², is a statistical measure used to assess the goodness of fit of a regression model. It represents the proportion of the variance in the dependent variable that is predictable from the independent variables.

R² ranges from 0 to 1:
* 0: Model explains none of the variance.
* 1: Model perfectly explains all the variance.

R² is calculated as the ratio of the explained sum of squares (ESS) to the total sum of squares (TSS), where ESS measures the variance explained by the model, and TSS measures the total variance in the dependent variable.
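
A minimal sketch with scikit-learn’s r2_score on made-up values. Note that scikit-learn’s implementation can also return negative values for a model that fits worse than simply predicting the mean:

```python
from sklearn.metrics import r2_score

y_true = [3.0, 5.0, 2.5, 7.0]  # hypothetical observed values
y_pred = [2.8, 5.3, 2.9, 6.6]  # hypothetical model predictions

print(r2_score(y_true, y_pred))  # closer to 1 means more variance explained
```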

9
Q

Confusion Matrix

A

A table used to evaluate the performance of a classification model by tabulating the actual and predicted classes of observations. It provides a summary of the model’s predictions, including true positive, true negative, false positive, and false negative counts for each class. Confusion matrices are commonly used to compute evaluation metrics such as accuracy, precision, recall, F1-score, and visualize the performance of classification models.
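
A minimal sketch with scikit-learn’s confusion_matrix on made-up binary labels:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes. For binary labels
# ordered [0, 1] the layout is:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))
```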

10
Q

Cosine Similarity

A

Cosine similarity is a measure of similarity between two vectors in a multi-dimensional space, often used in information retrieval, recommendation systems, and text mining to compare the similarity of documents or feature vectors. Cosine similarity measures the cosine of the angle between the two vectors, with values ranging from -1 to 1. A cosine similarity of 1 indicates that the vectors are identical (pointing in the same direction), while a cosine similarity of -1 indicates that the vectors are diametrically opposed (pointing in opposite directions). Cosine similarity is calculated as the dot product of the two vectors divided by the product of their magnitudes.
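
A minimal sketch of the formula with NumPy (the vectors are made up):

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product of the vectors divided by the product of their magnitudes."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])
print(cosine_similarity(a, b))  # 1.0: the vectors point in the same direction
```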

11
Q

Cost function

A

A mathematical function used to quantify the error or discrepancy between the predicted outputs of a machine learning model and the true labels or targets in the training data. The cost function measures how well the model’s predictions align with the true values and provides a measure of the model’s performance. The goal of training a machine learning model is to minimize the cost function by adjusting the model parameters (weights and biases) using optimization algorithms such as gradient descent. Common cost functions include mean squared error (MSE) for regression tasks, cross-entropy loss for classification tasks, and various custom loss functions tailored to specific machine learning problems.

12
Q

Cost-Sensitive accuracy

A

A performance metric used to evaluate the effectiveness of a classification model, taking into account the costs associated with different types of classification errors. In scenarios where the costs of false positives and false negatives are unequal or asymmetric, traditional accuracy metrics may not adequately reflect the true performance of the model. Cost-sensitive accuracy adjusts the accuracy metric by weighting the contributions of different types of errors based on their associated costs. For example, in a medical diagnosis task, misclassifying a patient with a serious condition as healthy (false negative) may incur higher costs than misclassifying a healthy patient as having the condition (false positive). Cost-sensitive accuracy provides a more comprehensive evaluation of the model’s performance by considering the relative importance or costs of different types of errors.

13
Q

Cross entropy

A

Used to measure the difference between two probability distributions or the dissimilarity between predicted and true probability distributions. In machine learning, cross entropy is commonly used as a loss function in classification tasks, where it quantifies the difference between the predicted class probabilities and the actual class labels. Minimizing cross entropy is equivalent to maximizing the likelihood of the correct class labels given the model predictions, making it a popular choice for training classifiers in neural networks and other machine learning models.

14
Q

Cross-Validation (CV)

A

A resampling technique used to assess the performance and generalization ability of a machine learning model by partitioning the dataset into multiple subsets, or folds, and iteratively training and evaluating the model on different combinations of training and validation data. In k-fold cross-validation, the dataset is divided into k equally sized folds, and the model is trained k times, each time using a different fold as the validation set and the remaining folds as the training set. Cross-validation helps estimate the model’s performance on unseen data and detect potential issues such as overfitting or underfitting by providing a more reliable estimate of the model’s generalization error compared to a single train-test split.
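
A minimal k-fold sketch with scikit-learn’s cross_val_score; the dataset and model here are placeholders chosen only for illustration. The mean of the fold scores is what is usually reported as the CV score (next card):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: each fold is used once as the validation set.
scores = cross_val_score(model, X, y, cv=5)
print(scores, scores.mean())
```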

15
Q

CV Score

A

CV score, or cross-validation score, refers to the evaluation metric used to assess the performance of a machine learning model during cross-validation. It represents the average performance of the model across multiple folds of the dataset and provides an estimate of the model’s generalization ability. The CV score is typically calculated as the average of the evaluation metric (such as accuracy, precision, recall, or F1 score) computed on each fold of the cross-validation process. A higher CV score indicates better performance, while a lower CV score suggests poorer generalization ability of the model.

16
Q

Entropy

A

Entropy is a measure of uncertainty or disorder in a system, commonly used in information theory and decision tree algorithms to quantify the impurity of a set of data. In the context of decision trees and classification algorithms, entropy is calculated based on the distribution of class labels in a dataset and represents the average amount of information required to classify an instance in the dataset. Higher entropy indicates higher uncertainty or disorder, while lower entropy indicates more homogeneous or pure class distributions. Entropy is used as a splitting criterion in decision tree algorithms such as C4.5 and CART to determine the optimal feature and threshold for partitioning the data into subsets that maximize the purity of the resulting nodes.
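
A minimal sketch of Shannon entropy computed from class labels with NumPy:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

print(entropy([0, 0, 0, 0]))  # 0.0 -> perfectly pure node
print(entropy([0, 0, 1, 1]))  # 1.0 -> maximum uncertainty for two classes
```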

17
Q

F1 Score

A

Metric used to evaluate the performance of a classification model, particularly in binary classification tasks, where there are two classes (positive and negative). It is especially useful when your classes are imbalanced (one class occurs much more frequently than another) and when false positives and false negatives have different costs. For example, in medical diagnostics, a false negative (missing a disease) might be much more important to avoid than a false positive (further testing).

It is the harmonic mean of precision and recall (it combines the two), calculated as:
2 * (precision * recall) / (precision + recall).

Precision and recall often have a trade-off relationship. The F1 score helps to find a balance between the two for an overall evaluation of your model. A high F1 score indicates that the model has both high precision (few false positives) and high recall (few false negatives), making it a useful metric for evaluating classifiers when the class distribution is imbalanced or when false positives and false negatives have different costs or implications.
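
A minimal sketch on made-up labels, checking the harmonic-mean formula against scikit-learn’s f1_score:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)
print(2 * (p * r) / (p + r), f1_score(y_true, y_pred))  # harmonic mean matches f1_score
```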

18
Q

Feature Importance

A

The degree to which each feature in a dataset contributes to the predictive power of a machine learning model. It helps in understanding which features are most influential in making predictions. Feature importance analysis is often performed after training a model to identify the most informative features and to prioritize them for further analysis or feature engineering.

There are several libraries for feature importance:
- scikit-learn: Provides feature_importances_ for tree-based models and permutation importance.
- Yellowbrick: A visualization library that helps you visualize feature importance.
- SHAP: A library specifically designed for explaining model outputs, including robust feature importance explanations.

Common techniques for determining feature importance in machine learning (see the sketch after this list):
- Feature Importance from Tree-Based Models
- Permutation Importance
- Coefficient Magnitude in Linear Models
- Mean Decrease Impurity
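
A minimal sketch of the first technique, reading feature_importances_ from a tree-based scikit-learn model; the dataset and model are placeholders chosen only for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(data.data, data.target)

# Importances from a tree-based model (mean decrease in impurity), one value per feature.
for name, importance in zip(data.feature_names, forest.feature_importances_):
    print(f"{name}: {importance:.3f}")
```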

19
Q

Ground truth

A

Ground truth refers to the true or correct labels, values, or outcomes associated with a dataset, typically used as a reference for evaluating the performance of a machine learning model. In supervised learning tasks, ground truth represents the actual target variable or response variable that the model aims to predict. Ground truth is often obtained through manual labeling, expert knowledge, or experimental measurements and serves as the gold standard against which the model’s predictions are compared. Ground truth provides an objective basis for assessing the accuracy, precision, recall, and other performance metrics of the model and helps identify errors, biases, and limitations in the model’s predictions.

20
Q

LB Score

A

LB score, or leaderboard score, refers to the evaluation metric used to rank participants in a machine learning competition or challenge based on their model’s performance on a held-out test dataset or leaderboard. The LB score represents the model’s performance according to the competition’s evaluation criteria and is used to determine the final standings and winners of the competition. The LB score is typically calculated using the same evaluation metric specified in the competition’s rules and guidelines, such as accuracy, log loss, mean absolute error (MAE), or root mean square error (RMSE). Participants submit their model predictions on the test dataset, and their LB scores are computed and displayed on a public leaderboard for comparison and ranking.

21
Q

Leave-One-Out Cross-Validation

A

A model evaluation technique particularly suited for small datasets where you want to maximize data usage. In LOOCV, you repeatedly split your dataset into training and testing sets. For each split:

* Training set: all but one single sample from your dataset.
* Testing set: the one sample you held out.
* Model building: train a model on the training set and evaluate its performance on the single test sample.
* Iterate: repeat this process, using each sample in your dataset as the test sample exactly once.
* Average performance: finally, average the error metrics across all the iterations to get an overall estimate of model performance.

Key points (see the sketch below):

* Exhaustive: uses every data point for testing, good for squeezing information from small datasets.
* Computationally expensive: with large datasets it can be slow, as you train many models.
* Bias-variance trade-off: tends to have low bias (captures complex patterns well) but can suffer from high variance (sensitive to small data changes).
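
A minimal sketch with scikit-learn’s LeaveOneOut; the dataset, model, and scoring choice are placeholders for illustration:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_diabetes(return_X_y=True)
model = LinearRegression()

# One model per sample: each observation is held out exactly once.
scores = cross_val_score(model, X, y, cv=LeaveOneOut(),
                         scoring="neg_mean_absolute_error")
print(-scores.mean())  # average absolute error across all held-out samples
```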

22
Q

MAE (Mean absolute error)

A

MAE (Mean Absolute Error): A metric used to evaluate the accuracy of a regression model. It measures the average absolute difference between predicted values and true values.

Calculation: The average of the absolute differences between each predicted value (ŷᵢ) and its corresponding true value (yᵢ) across all instances (i) in the dataset.

Properties:
* Robust to outliers.
* Interpretable in the same units as the target variable.
* Lower MAE indicates better accuracy (0 means perfect predictions).
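
A minimal sketch with scikit-learn’s mean_absolute_error on made-up values:

```python
from sklearn.metrics import mean_absolute_error

y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

# (0.5 + 0 + 2 + 1) / 4 = 0.875, in the same units as the target.
print(mean_absolute_error(y_true, y_pred))
```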

23
Q

Maximum likelihood

A

Instead of minimizing the average loss, as in linear regression, we maximize the likelihood of the training data. This optimization criterion is used, for example, in logistic regression.

24
Q

MSE (Mean square error)

A

MSE (Mean Squared Error): A metric used to evaluate the accuracy of a regression model. It measures the average squared difference between predicted values and true values.

Calculation: The average of the squared differences between each predicted value (ŷᵢ)
and its corresponding true value (yᵢ) across all instances in the dataset.

Properties:
* Sensitive to outliers (due to squaring the errors).
* Common loss function for regression problems.
* Lower MSE indicates better accuracy (0 means perfect predictions).
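
A minimal sketch with scikit-learn’s mean_squared_error on made-up values, with RMSE derived from it:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mse = mean_squared_error(y_true, y_pred)  # (0.25 + 0 + 4 + 1) / 4 = 1.3125
print(mse, np.sqrt(mse))                  # RMSE brings it back to the target's units
```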

25
Q

Overfitting

A

Overfitting is the property of a model such that the model predicts the labels of the training examples very well but frequently makes errors when applied to examples that weren’t seen by the learning algorithm during training.

Occurs when a machine learning model learns to capture noise or random fluctuations in the training data instead of identifying the underlying patterns or relationships. As a result, the model performs well on the training data but fails to generalize to new, unseen data. Overfitting often happens when the model is too complex relative to the size of the training dataset, leading to memorization of the training examples rather than learning meaningful representations. Common symptoms of overfitting include excessively low training error but high test error, poor performance on new data, and high variance in model predictions. Techniques to mitigate overfitting include regularization, cross-validation, early stopping, and using simpler model architectures.

26
Q

P-value

A

A statistical measure used to assess the strength of evidence against a null hypothesis in a hypothesis test. It represents the probability of observing the test results as extreme as, or more extreme than, the observed results under the assumption that the null hypothesis is true. A low p-value indicates strong evidence against the null hypothesis, suggesting that the observed results are unlikely to occur by chance alone, leading to rejection of the null hypothesis. Conversely, a high p-value indicates weak evidence against the null hypothesis, suggesting that the observed results are likely to occur by chance, leading to failure to reject the null hypothesis. The p-value is typically compared to a predefined significance level (alpha) to determine the statistical significance of the test results.

27
Q

Permutation importance

A

A technique used to assess the importance or contribution of each feature in a machine learning model to its predictive performance. It measures the change in model performance (such as accuracy, precision, or recall) when the values of a feature are randomly permuted while keeping other features unchanged. A feature with high permutation importance indicates that shuffling its values disrupts the model’s predictions and reduces its performance, suggesting that the feature carries valuable information for making accurate predictions. Permutation importance provides insights into feature importance that are independent of the model architecture and can help identify influential features, prioritize feature selection, and interpret model predictions.
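
A minimal sketch with scikit-learn’s permutation_importance; the dataset and model are placeholders chosen only for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature several times and measure the drop in test accuracy.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
print(result.importances_mean)  # one mean importance value per feature
```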

28
Q

Precision

A

A performance metric for classification models. It measures the proportion of true positive predictions among all positive predictions and represents the model’s ability to correctly identify relevant instances (true positives) while minimizing false positives.
(Imagine asking an archive for a specific type of document. Precision is the proportion of relevant documents in the list of all returned documents.)

Calculation: True Positives / (True Positives + False Positives)

Higher precision: Fewer false positives, more confidence in the model’s positive predictions.
Lower precision: More false positives, less confidence.
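
A minimal sketch with scikit-learn’s precision_score on made-up labels (recall_score works the same way for the metric on the next card):

```python
from sklearn.metrics import precision_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 1, 1, 0]

# TP = 3, FP = 2 -> precision = 3 / (3 + 2) = 0.6
print(precision_score(y_true, y_pred))
```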

29
Q

Predefined significance level (alpha)

A

In statistical hypothesis testing, the predefined significance level, denoted as alpha (α), is the threshold used to determine whether to reject the null hypothesis. It represents the probability of rejecting the null hypothesis when it is actually true. Commonly chosen values for alpha are 0.05 or 0.01, but the specific value depends on the context of the analysis and the desired balance between Type I and Type II errors. If the p-value calculated from the test statistic is less than alpha, the null hypothesis is rejected, indicating that the results are statistically significant.

30
Q

Recall

A

Also known as sensitivity or true positive rate. A performance metric for classification models that measures the completeness or coverage of a classifier. It measures the proportion of true positive predictions out of all actual positive instances and represents the model’s ability to identify all relevant instances (minimize false negatives).
(Imagine asking an archive for a specific type of document. Recall is the ratio of the relevant documents returned by the search engine to the total number of relevant documents that could have been returned.)

Calculation: True Positives / (True Positives + False Negatives)

Higher recall: Fewer false negatives, meaning the model misses fewer actually positive cases.
Lower recall: More false negatives, meaning the model fails to identify many positive cases.

31
Q

Residual

A

The difference between the observed (actual) value of the dependent variable and the value predicted by the regression model. Represents unexplained variation (error) in the data the model doesn’t account for.

Formula: Residualᵢ = Observedᵢ - Predictedᵢ

Uses (Residual Analysis):
* Assess model fit
* Identify outliers
* Diagnose assumption violations in the model
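
A minimal sketch of the formula with NumPy on made-up observed and predicted values:

```python
import numpy as np

observed = np.array([10.0, 12.0, 9.0, 15.0])   # hypothetical actual values
predicted = np.array([11.0, 11.5, 9.5, 13.0])  # hypothetical model predictions

residuals = observed - predicted
print(residuals)  # [-1.   0.5 -0.5  2. ] -> unexplained error for each observation
```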

32
Q

Similarity score

A

A measure used to quantify the similarity or distance between two objects or entities in a dataset, often employed in clustering, recommendation systems, information retrieval, and other machine learning tasks. Similarity scores can take various forms depending on the nature of the data and the application context. Similarity scores play a crucial role in clustering algorithms to group similar data points together and in recommendation systems to identify items or users with similar preferences.

Common similarity measures include cosine similarity, Euclidean distance, Manhattan distance, Jaccard similarity, Pearson correlation coefficient, and edit distance, among others.

A higher similarity score indicates greater similarity or proximity between the objects, while a lower similarity score indicates greater dissimilarity or distance.

33
Q

Support (in Classification report)

A

Support is about your dataset, not directly about model performance. The support column in a classification report indicates the number of true instances (samples) of each class present in your dataset. It helps you see whether there are large differences in the number of examples in each class. Imbalance can impact model performance, and metrics like accuracy can be misleading with imbalanced data; support lets you interpret the other metrics in the context of how many samples each class has.

34
Q

Underfitting

A

Underfitting occurs when a machine learning model is too simple to capture the underlying patterns or relationships in the training data, leading to poor performance on both the training and test datasets. An underfit model exhibits high bias and low variance, failing to learn from the training data and making overly simplistic predictions that do not generalize well to new, unseen data. Common causes of underfitting include using insufficiently complex models, omitting relevant features, and applying excessive regularization or constraints that restrict the model’s capacity to learn from the data. Underfitting can be detected by observing excessively high training error and poor performance on the test data, indicating that the model has not captured the true underlying structure of the data.

If the model makes many mistakes on the training data, we say that the model has high bias or that the model underfits. So, underfitting is the inability of the model to predict well the labels of the data it was trained on. There could be several reasons for underfitting, the most important of which are:
* your model is too simple for the data (for example, a linear model can often underfit);
* the features you engineered are not informative enough.

35
Q

Validation

A

The goal of validation is to check how well the model can make predictions on new, unseen data points that it hasn’t been trained on. Validation is crucial for learning how to improve the model by tuning hyperparameters during development and for evaluating the reliability and effectiveness of a model before deploying it in real-world scenarios.

Validation is essential, but it’s not the same as testing on completely unseen data. Validation results often lead to refining your model and training process.

Training-Validation Split: The dataset is divided into two parts: a training set and a validation set. The model is trained on the training set and then evaluated on the validation set. This technique is simple but may lead to high variance in performance estimates, especially with small datasets. The results inform the tweaks we might want to make to the model.

Holdout Validation: Essentially the same as the training-validation split, but used at the end of the project to assess a model that has already been developed. This is what we usually call testing at the end of the project (only testing the model after deploying it to the real world can give more accurate information about its performance).

K-Fold Cross-Validation: The dataset is partitioned into k equal-sized folds. The model is trained k times, each time using k-1 folds for training and one fold for validation. The performance metrics are averaged over all iterations. K-fold cross-validation provides more reliable estimates of model performance, especially with limited data.

Leave-One-Out Cross-Validation (LOOCV): Each data point in the dataset is sequentially held out as a validation set, and the model is trained on the remaining data points. This process is repeated for each data point, and the performance metrics are averaged over all iterations. LOOCV provides an unbiased estimate of model performance but can be computationally expensive, especially with large datasets.

Stratified Cross-Validation: Similar to k-Fold Cross-Validation, but ensures that each fold contains approximately the same proportion of samples from each class. This is particularly useful for imbalanced datasets where certain classes are underrepresented.

36
Q

Validation loss

A

The error or loss calculated on a separate validation dataset during the training process. It is used to monitor the performance of the model on data that it hasn’t seen during training and serves as an estimate of how well the model will generalize to new, unseen data.
Solely minimizing the training loss can lead to overfitting, where the model becomes too specialized to the training data and performs poorly on new data. This is where validation loss comes in.

The validation loss is calculated using the same loss function as the training loss but on a separate validation dataset that is not used for training. This dataset serves as a proxy for unseen data, allowing us to evaluate how well the model generalizes. By monitoring the validation loss during training, we can detect signs of overfitting. If the validation loss starts to increase while the training loss continues to decrease, it indicates that the model is overfitting to the training data and may not generalize well to new data.

The goal during training is to find the point at which both the training loss and the validation loss are minimized. This point represents the best compromise between fitting the training data well and generalizing to new data. Techniques such as early stopping, where training is halted when the validation loss starts to increase, can help prevent overfitting and improve model generalization.
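
A minimal early-stopping sketch over a made-up sequence of per-epoch validation losses (in practice these values would come from evaluating the model on the validation set after each epoch):

```python
# Hypothetical validation loss recorded after each training epoch.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.48, 0.49, 0.50, 0.52, 0.53, 0.55]

best_val_loss = float("inf")
patience, epochs_without_improvement = 3, 0

for epoch, val_loss in enumerate(val_losses):
    if val_loss < best_val_loss:
        best_val_loss = val_loss          # validation loss still improving: keep training
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1   # validation loss got worse (possible overfitting)
        if epochs_without_improvement >= patience:
            print(f"early stop at epoch {epoch}, best validation loss {best_val_loss}")
            break
```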

37
Q

Variance (Model)

A

Variance refers to the sensitivity of the model’s predictions to small fluctuations or variations in the training data. A model with high variance exhibits excessive sensitivity to the training data, capturing noise or random fluctuations in the data rather than the underlying patterns or relationships. High variance models tend to overfit the training data, leading to low training error but high test error and poor generalization to new, unseen data. Variance measures the extent to which the model’s predictions differ across different training datasets sampled from the same underlying distribution.

Techniques to reduce variance and mitigate overfitting include regularization, cross-validation, ensemble methods, and using simpler model architectures.

38
Q

Sensitivity

A

Sensitivity (True Positive Rate): A measure of how often a test accurately detects the presence of the condition in people who actually have it. High sensitivity means the test misses few positive cases (low false negatives). Think of it as the ability to detect true danger. For example, a highly sensitive medical test is good at detecting a disease even in its early stages.

The counterpart of Specificity (True Negative Rate).

39
Q

Specificity

A

Specificity (True Negative Rate): A measure of how often a test correctly identifies people who truly don’t have the condition. High specificity means the test has low false positives. Think of it as the ability to give a correct “all-clear” signal. For example, a highly specific medical test is good at correctly indicating the absence of a disease.

The counterpart of Sensitivity (True Positive Rate).

40
Q

Classification metrics:

A
  • Accuracy
  • Confusion Matrix
  • Precision
  • Recall
  • F1 Score
  • AUC-ROC Curve
  • Specificity
  • Cross-entropy Loss
  • Matthews Correlation Coefficient (MCC)
41
Q

Fold (In Cross-validation)

A

Cross-validation is a technique to evaluate how well a machine learning model generalizes to unseen data. Your dataset is split into several smaller subsets called folds. A common choice is 5-fold or 10-fold cross-validation. Each fold gets a chance to play the role of the “testing set”: The model is trained on the combination of all the other folds. Performance is evaluated on the held-out fold.

42
Q

Holdout Validation

A

A technique for evaluating how well a machine learning model generalizes to unseen data. The core idea is to split your dataset into two parts: a ‘training set’ and a ‘holdout set’. The model is trained exclusively on the training set, and its performance is then evaluated on the holdout set, which simulates how the model would perform on new, previously unseen data. Since the holdout set wasn’t used during training, it gives a more reliable estimate of the model’s true generalization ability, helping to avoid overfitting where a model learns the training data too specifically.

You might use a holdout set iteratively throughout development, adjusting your model based on its performance. The test set is meant to be used only once. If you use a holdout set repeatedly to tune your model, you risk it subtly influencing your choices and biasing your estimation. The test set, held strictly separate, avoids this.

Holdout set is used primarily during the model development process. Test set is used for final, rigorous assessment reserved until the very end of the development process. This is meant to give an unbiased estimate of the final model’s performance to help you decide if it’s ready for deployment.
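
A minimal holdout sketch with scikit-learn’s train_test_split; the dataset and model are placeholders for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 20% of the data; the model never sees it during training.
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(model.score(X_holdout, y_holdout))  # accuracy on the held-out data
```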

43
Q

Redundancy

A

Redundancy refers to the inclusion of unnecessary or duplicate information in a system, process, or data set. Redundancy can have both positive and negative implications depending on the context in which it is applied. Understanding the role of redundancy and carefully managing it can help optimize system performance, reliability, and resilience while minimizing unnecessary overhead and complexity.

In machine learning, on the other hand, redundancy may waste resources by processing unnecessary information, but it can also help the model pick up an important signal or feature by exposing the model to it more than once.

44
Q

Regression metrics:

A

Mean Absolute Error (MAE): Averages the absolute differences between the predicted and true values. Easy to understand and less sensitive to outliers than MSE. Gives insight into the average size of errors made by the model.

Mean Squared Error (MSE): Averages the squared differences between the predicted and true values. Also gives insight into the average size of errors, but emphasizes larger errors due to squaring, making it more sensitive to outliers than MAE.

Root Mean Squared Error (RMSE): The square root of MSE, bringing the error back to the same units as the target variable.

Median Absolute Error: Similar to MAE, but even less affected by extreme outliers.

R-Squared (Coefficient of Determination): Measures the proportion of variance in the target variable explained by the model, indicating how well the model captures the variability of the data. Ranges from 0 to 1, with higher values being better.

Explained Variance Score: Similar to R-Squared, but adjusted for the number of features used in the model. Also indicates how well the model captures the variability of the data.

Max Error: Reports the worst-case (largest) error, which is important for safety-critical applications.

45
Q

Repeated measures design

A

In a repeated measures design, the same participants are measured multiple times under different conditions or across time. The goal is to see how the individuals change in response to these variations. For example, you might test a group of participants’ memory performance before and after a training intervention. The key advantage is that each participant serves as their own control, reducing the impact of individual differences and making it easier to detect the effects of the thing you’re manipulating (like the training intervention).

46
Q

Stratified Cross-Validation

A

Technique used to evaluate machine learning models that ensures each fold (subset) of your data maintains the same proportions of the target classes as the overall dataset. Imagine you have a dataset with an imbalanced class distribution, where perhaps 80% of your data represents one class and only 20% represents another. Regular cross-validation might randomly split the data, resulting in some folds having very few, or even none, of the minority class examples. This makes it difficult to get a reliable assessment of how well the model performs on that minority class. Stratified cross-validation addresses this by splitting the data in a way that preserves the original class balance: an 80%/20% class distribution in the full dataset is maintained in every fold. This leads to a more robust evaluation of your model, which is especially crucial when dealing with imbalanced data where performance on all classes matters. The steps are listed below, followed by a short sketch.

  1. Grouping: The dataset is divided based on class labels.
  2. Shuffling within Groups: Data points within each class group are randomly shuffled.
  3. Fold Creation: Samples from each class group are selected proportionally to maintain the original class balance in each fold.
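
A minimal sketch with scikit-learn’s StratifiedKFold; the dataset and model are placeholders for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Each of the 5 folds keeps roughly the same class proportions as the full dataset.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
print(cross_val_score(model, X, y, cv=cv).mean())
```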