Testing & Evaluating Flashcards
Accuracy
A metric used to evaluate the performance of a classification model by measuring the proportion of correctly classified instances among all the instances in the dataset. It is calculated as the ratio of the number of correct predictions to the total number of predictions made by the model.
Accuracy provides an overall assessment of the model’s ability to correctly classify instances across all classes and is commonly used as a performance measure for balanced datasets with roughly equal class distributions. However, accuracy may not be suitable for imbalanced datasets, where the class distribution is skewed, as it can be misleading and biased towards the majority class. In such cases, other evaluation metrics, such as precision, recall, F1 score, or area under the ROC curve, may provide a more comprehensive assessment of the model’s performance.
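For illustration, a minimal sketch using scikit-learn’s accuracy_score (the labels below are toy values):

```python
# Accuracy = correct predictions / total predictions
from sklearn.metrics import accuracy_score

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]

print(accuracy_score(y_true, y_pred))  # 5 correct out of 6 -> ~0.83
```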
Area under ROC Curve
The area under the receiver operating characteristic (ROC) curve, often abbreviated as AUC-ROC or AUC, is a metric used to evaluate the performance of binary classification models by measuring the trade-off between true positive rate (sensitivity) and false positive rate (1 - specificity) across different decision thresholds. The ROC curve plots the true positive rate against the false positive rate for various threshold values, and the AUC-ROC represents the area under this curve. A higher AUC-ROC value indicates better discrimination and predictive performance of the model, with a value of 1 indicating perfect classification, while a value of 0.5 indicates random guessing. AUC-ROC is widely used in binary classification tasks to compare and select models, especially in scenarios where the class distribution is imbalanced or the costs of false positives and false negatives are unequal.
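A minimal sketch using scikit-learn’s roc_auc_score (the scores below are toy probabilities for the positive class):

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]  # predicted probabilities for the positive class

print(roc_auc_score(y_true, y_score))  # 0.75 for these toy values
```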
Bias (of the model)
Error introduced by approximating a real-world problem with a simplified model that does not capture all the underlying patterns or relationships in the data. A model with high bias tends to underfit the training data, meaning it has high error on both the training and test datasets due to oversimplified assumptions or inadequate complexity. Bias measures how closely the average prediction of a model matches the true underlying value it is trying to predict. Minimizing bias typically involves increasing the complexity or flexibility of the model to capture more nuanced patterns in the data.
A model has low bias if it predicts the labels of the training data well. If the model makes many mistakes on the training data, we say that it has high bias, or that it underfits.
Bias-variance trade-off
A fundamental concept in supervised learning that describes the relationship between bias, variance, and model complexity. Bias measures the error introduced by simplifying assumptions in the model, while variance measures the variability of model predictions across different training datasets. The bias-variance trade-off states that as the complexity of a model increases, bias decreases but variance increases, and vice versa. The goal in machine learning is to find the optimal balance between bias and variance to minimize the overall error or generalization error of the model. Overly simple models suffer from high bias and underfitting, while overly complex models suffer from high variance and overfitting.
Binary cross-entropy
Also known as log loss or logistic loss, is a loss function used in binary classification tasks to measure the difference between the predicted probabilities and the actual binary labels. It quantifies the discrepancy between the predicted probability distribution and the true distribution of the binary outcomes. Binary cross-entropy is commonly used as the objective function in training logistic regression models and binary classifiers based on neural networks. Minimizing binary cross-entropy during training helps optimize the model parameters to improve its ability to discriminate between the two classes and make accurate predictions.
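A hedged sketch showing binary cross-entropy computed from its definition and via scikit-learn’s log_loss (toy values):

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1])
y_prob = np.array([0.9, 0.2, 0.7, 0.6])  # predicted probability of the positive class

# BCE = -mean(y * log(p) + (1 - y) * log(1 - p))
bce_manual = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
print(bce_manual, log_loss(y_true, y_prob))  # the two values agree
```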
Chi-squared test
A statistical hypothesis test used to determine whether there is a significant association between two categorical variables in a dataset. It is used to test the independence or dependence of categorical variables by comparing observed frequencies of variable combinations to expected frequencies under a null hypothesis of independence. The chi-squared test calculates the chi-squared statistic, which quantifies the discrepancy between observed and expected frequencies, and compares it to a chi-squared distribution to assess the significance of the association. Chi-squared tests are commonly used in contingency table analysis, goodness-of-fit tests, and feature selection in machine learning and data analysis.
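A minimal sketch using scipy’s chi2_contingency on a made-up 2x2 contingency table:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: group A / group B; columns: outcome yes / no (toy counts)
table = np.array([[30, 10],
                  [20, 40]])

chi2, p_value, dof, expected = chi2_contingency(table)
print(chi2, p_value)  # a small p-value suggests the variables are not independent
```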
Classification report
A summary of the performance of a classification model on a dataset, providing metrics such as precision, recall, F1 score, and support for each class in the dataset. It is generated after evaluating the model on a test dataset and provides insights into the model’s ability to correctly classify instances across different classes. A typical classification report includes metrics such as precision (the ratio of true positive predictions to the total predicted positives), recall (the ratio of true positive predictions to the total actual positives), F1 score (the harmonic mean of precision and recall), and support (the number of instances in each class). Classification reports are commonly used to evaluate and compare the performance of different classification models and assess their suitability for specific tasks.
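An illustrative use of scikit-learn’s classification_report on toy labels:

```python
from sklearn.metrics import classification_report

y_true = [0, 1, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1]

# Prints precision, recall, F1 score, and support for each class
print(classification_report(y_true, y_pred))
```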
Coefficient of determination (R^2)
The coefficient of determination, often denoted as R², is a statistical measure used to assess the goodness of fit of a regression model. It represents the proportion of the variance in the dependent variable that is predictable from the independent variables.
R² ranges from 0 to 1:
* 0: Model explains none of the variance.
* 1: Model perfectly explains all the variance.
R² is calculated as the ratio of the explained sum of squares (ESS) to the total sum of squares (TSS), where ESS measures the variance explained by the model, and TSS measures the total variance in the dependent variable.
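A small sketch computing R² with scikit-learn’s r2_score on toy regression outputs:

```python
from sklearn.metrics import r2_score

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

print(r2_score(y_true, y_pred))  # ~0.95 for these toy values
```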
Confusion Matrix
A table used to evaluate the performance of a classification model by tabulating the actual and predicted classes of observations. It provides a summary of the model’s predictions, including true positive, true negative, false positive, and false negative counts for each class. Confusion matrices are commonly used to compute evaluation metrics such as accuracy, precision, recall, F1-score, and visualize the performance of classification models.
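A minimal sketch with scikit-learn’s confusion_matrix (toy labels):

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))  # [[2, 1], [1, 2]] here
```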
Cosine Similarity
Cosine similarity is a measure of similarity between two vectors in a multi-dimensional space, often used in information retrieval, recommendation systems, and text mining to compare the similarity of documents or feature vectors. Cosine similarity measures the cosine of the angle between the two vectors, with values ranging from -1 to 1. A cosine similarity of 1 indicates that the vectors are identical (pointing in the same direction), while a cosine similarity of -1 indicates that the vectors are diametrically opposed (pointing in opposite directions). Cosine similarity is calculated as the dot product of the two vectors divided by the product of their magnitudes.
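A hedged sketch computing cosine similarity directly from its definition with NumPy:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

# Dot product divided by the product of the magnitudes
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos_sim)  # 1.0, since b points in the same direction as a
```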
Cost function
A mathematical function used to quantify the error or discrepancy between the predicted outputs of a machine learning model and the true labels or targets in the training data. The cost function measures how well the model’s predictions align with the true values and provides a measure of the model’s performance. The goal of training a machine learning model is to minimize the cost function by adjusting the model parameters (weights and biases) using optimization algorithms such as gradient descent. Common cost functions include mean squared error (MSE) for regression tasks, cross-entropy loss for classification tasks, and various custom loss functions tailored to specific machine learning problems.
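An illustrative sketch of one common cost function, mean squared error, computed with NumPy on toy values:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5])
y_pred = np.array([2.5, 5.0, 3.0])

# MSE = mean of squared differences between predictions and targets
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # training adjusts the model parameters to drive this value down
```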
Cost-Sensitive accuracy
A performance metric used to evaluate the effectiveness of a classification model, taking into account the costs associated with different types of classification errors. In scenarios where the costs of false positives and false negatives are unequal or asymmetric, traditional accuracy metrics may not adequately reflect the true performance of the model. Cost-sensitive accuracy adjusts the accuracy metric by weighting the contributions of different types of errors based on their associated costs. For example, in a medical diagnosis task, misclassifying a patient with a serious condition as healthy (false negative) may incur higher costs than misclassifying a healthy patient as having the condition (false positive). Cost-sensitive accuracy provides a more comprehensive evaluation of the model’s performance by considering the relative importance or costs of different types of errors.
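A hedged sketch of one way to weight errors by their costs; the cost values and the normalization against the worst case are assumptions for illustration, not a standard API:

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1])

# Assumed costs: a false negative is 5x as costly as a false positive
cost_fn, cost_fp = 5.0, 1.0
fn_cost = np.sum((y_true == 1) & (y_pred == 0)) * cost_fn
fp_cost = np.sum((y_true == 0) & (y_pred == 1)) * cost_fp
worst_case = np.sum(y_true == 1) * cost_fn + np.sum(y_true == 0) * cost_fp

# Cost-sensitive accuracy: 1 minus the incurred cost as a fraction of the worst case
print(1 - (fn_cost + fp_cost) / worst_case)
```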
Cross entropy
Used to measure the difference between two probability distributions or the dissimilarity between predicted and true probability distributions. In machine learning, cross entropy is commonly used as a loss function in classification tasks, where it quantifies the difference between the predicted class probabilities and the actual class labels. Minimizing cross entropy is equivalent to maximizing the likelihood of the correct class labels given the model predictions, making it a popular choice for training classifiers in neural networks and other machine learning models.
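An illustrative multi-class sketch using scikit-learn’s log_loss (toy labels and probabilities):

```python
from sklearn.metrics import log_loss

y_true = [0, 2, 1]                 # true class labels
y_prob = [[0.7, 0.2, 0.1],         # predicted class probabilities per instance
          [0.1, 0.3, 0.6],
          [0.2, 0.6, 0.2]]

# Lower values mean the predicted distribution matches the true labels better
print(log_loss(y_true, y_prob))
```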
Cross-Validation (CV)
A resampling technique used to assess the performance and generalization ability of a machine learning model by partitioning the dataset into multiple subsets, or folds, and iteratively training and evaluating the model on different combinations of training and validation data. In k-fold cross-validation, the dataset is divided into k equally sized folds, and the model is trained k times, each time using a different fold as the validation set and the remaining folds as the training set. Cross-validation helps estimate the model’s performance on unseen data and detect potential issues such as overfitting or underfitting by providing a more reliable estimate of the model’s generalization error compared to a single train-test split.
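A minimal sketch of 5-fold cross-validation with scikit-learn’s cross_val_score; the mean of the per-fold scores is the CV score described in the next entry:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5)  # accuracy on each of the 5 folds
print(scores, scores.mean())                 # the mean is the model's CV score
```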
CV Score
CV score, or cross-validation score, refers to the evaluation metric used to assess the performance of a machine learning model during cross-validation. It represents the average performance of the model across multiple folds of the dataset and provides an estimate of the model’s generalization ability. The CV score is typically calculated as the average of the evaluation metric (such as accuracy, precision, recall, or F1 score) computed on each fold of the cross-validation process. A higher CV score indicates better performance, while a lower CV score suggests poorer generalization ability of the model.
Entropy
Entropy is a measure of uncertainty or disorder in a system, commonly used in information theory and decision tree algorithms to quantify the impurity of a set of data. In the context of decision trees and classification algorithms, entropy is calculated based on the distribution of class labels in a dataset and represents the average amount of information required to classify an instance in the dataset. Higher entropy indicates higher uncertainty or disorder, while lower entropy indicates more homogeneous or pure class distributions. Entropy is used as a splitting criterion in decision tree algorithms such as C4.5 and CART to determine the optimal feature and threshold for partitioning the data into subsets that maximize the purity of the resulting nodes.
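A hedged sketch computing the entropy (in bits) of a class-label distribution with NumPy, as a decision tree would at a node:

```python
import numpy as np

labels = np.array([0, 0, 1, 1, 1, 1])           # toy class labels at a node
_, counts = np.unique(labels, return_counts=True)
p = counts / counts.sum()                       # class proportions

entropy = -np.sum(p * np.log2(p))               # H = -sum(p_i * log2(p_i))
print(entropy)                                  # ~0.918; a pure node would give 0
```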
F1 Score
Metric used to evaluate the performance of a classification model, particularly in binary classification tasks, where there are two classes (positive and negative). It is especially useful when the classes are imbalanced (one class occurs much more frequently than the other). It is also useful when false positives and false negatives have different costs (for example, in medical diagnostics, a false negative (missing a disease) may be much more important to avoid than a false positive (further testing)).
It is the harmonic mean of precision and recall, calculated as:
F1 = 2 * (precision * recall) / (precision + recall)
Precision and recall often have a trade-off relationship. The F1 score helps to find a balance between the two for an overall evaluation of your model. A high F1 score indicates that the model has both high precision (few false positives) and high recall (few false negatives), making it a useful metric for evaluating classifiers when the class distribution is imbalanced or when false positives and false negatives have different costs or implications.
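An illustrative sketch comparing the formula above with scikit-learn’s f1_score (toy labels):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]

p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)
print(2 * (p * r) / (p + r), f1_score(y_true, y_pred))  # both give the same value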
Feature Importance
The degree to which each feature in a dataset contributes to the predictive power of a machine learning model. It helps in understanding which features are most influential in making predictions. Feature importance analysis is often performed after training a model to identify the most informative features and to prioritize them for further analysis or feature engineering.
There are several libraries for feature importance:
- scikit-learn: Provides feature_importances_ for tree-based models and permutation importance.
- Yellowbrick: A visualization library that helps you visualize feature importance.
- SHAP: A library specifically designed for explaining model outputs, including robust feature importance explanations.
Common techniques for determining feature importance in machine learning (two of these are sketched after the list):
- Feature Importance from Tree-Based Models
- Permutation Importance
- Coefficient Magnitude in Linear Models
- Mean Decrease Impurity
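A hedged sketch of two of the techniques above, tree-based importances and permutation importance, using scikit-learn on the iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Impurity-based importances from the trained trees
print(model.feature_importances_)

# Permutation importance: drop in score when each feature is shuffled
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
print(result.importances_mean)
```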