Linear Reg Flashcards

1
Q

How do we evaluate a regression model?

A

• Given N examples, i.e., pairs (\mathbf{x}_i, y_i), linear regression computes a model f(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + b
• So that for each point, \hat{y}_i = f(\mathbf{x}_i) \approx y_i
• We evaluate the model by computing the Residual Sum of Squares (RSS): \text{RSS} = \sum_{i=1}^N (y_i - \hat{y}_i)^2

The goal of linear regression is thus to find the weights that minimize RSS
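A minimal sketch of the RSS computation, assuming NumPy and hand-picked toy data and weights:

```python
import numpy as np

# Toy data: N examples with a single feature x and target y (made-up values)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

# A candidate linear model y_hat = w * x + b (weights picked by hand here)
w, b = 2.0, 0.0
y_hat = w * x + b

# Residual Sum of Squares: squared differences between targets and predictions
rss = np.sum((y - y_hat) ** 2)
print(rss)  # linear regression searches for the (w, b) that minimize this value
```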

2
Q

What are the assumptions made for linear regression?

A

• Linearity
– When applying linear regression, the prediction is a linear combination of the inputs

• Normality
– The target outcome follows a normal distribution

• Homoscedasticity
– The variance of the error terms is assumed to be constant over the entire feature space

• Independence
– Each instance is independent of the others

• Absence of Multicollinearity
– There are no strongly correlated features

3
Q

What is the coefficient of determination R squared in linear regression? What does it indicate?

A

• Total sum of squares: \text{TSS} = \sum_{i=1}^N (y_i - \bar{y})^2
• Coefficient of determination: R^2 = 1 - \frac{\text{RSS}}{\text{TSS}}
• R^2 measures how well the regression line approximates the real data points. When R^2 is 1, the regression line perfectly fits the data.

• R^2 increases with the number of features even if they do not convey any information about the target
• Therefore, it is usually better to use the adjusted R^2 = 1 - (1 - R^2)\frac{N-1}{N-p-1} (with N examples and p features), which penalizes uninformative features
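A small sketch of R^2 and adjusted R^2, assuming NumPy and made-up predictions from a one-feature model:

```python
import numpy as np

def r2_scores(y, y_hat, p):
    """Return (R^2, adjusted R^2) for predictions y_hat of a model with p features."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)        # residual sum of squares
    tss = np.sum((y - np.mean(y)) ** 2)   # total sum of squares
    r2 = 1.0 - rss / tss
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)  # penalizes extra features
    return r2, adj_r2

# Made-up targets and predictions from a one-feature model
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y_hat = np.array([2.2, 3.8, 6.1, 7.9, 10.3])
print(r2_scores(y, y_hat, p=1))
```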

4
Q

How do we evaluate a model?

A

• Models should be evaluated using data that have not been used to build the model itself
• Example: would it be feasible to evaluate students using exactly the same problems solved in class?
• The available data must be split between training and test
–Training data will be used to build the model
–Test data will be used to evaluate the model performance
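A minimal sketch of this split, assuming scikit-learn and a synthetic regression dataset:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data stands in for the available examples
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Hold out 25% of the data for testing; the model never sees it during training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)  # build the model on training data
print(model.score(X_test, y_test))                # evaluate (R^2) on the unseen test data
```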

5
Q

What is cross-validation?

A

• First step
– Data is split into k subsets of equal size
• Second step
–Each subset in turn is used for testing and the remainder for training
• This is called k-fold cross-validation and avoids overlapping test sets
• Often the subsets are stratified before cross-validation is performed
• The error estimates are averaged to yield an overall error estimate

• Standard method for evaluation: stratified ten-fold cross-validation
• Why ten? Experiments have shown that this is the best choice to get an accurate estimate
• Stratification reduces the estimate’s variance
• Even better: repeated stratified cross-validation
• Ten-fold cross-validation is repeated ten times and the results are averaged (reduces the variance)
• Other approaches appear to be robust, e.g., 5x2 cross-validation
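A sketch of stratified, repeated ten-fold cross-validation, assuming scikit-learn and one of its bundled datasets:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified 10-fold CV repeated 10 times: 100 accuracy estimates that get averaged
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)

print(scores.mean(), scores.std())  # overall error estimate and its variability
```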

6
Q

What is overfitting?

A

Very good performance on the training set (the model precisely fits patterns present in the training data)
Terrible performance on the test set (those patterns were just noise and are no longer present)

7
Q

Why and how do regularizations such as Ridge and Lasso work?

A

• Ridge and Lasso add a penalty on the magnitude of the weights to the least-squares cost, shrinking the weights and thus reducing overfitting (at the price of some bias)
– Ridge (L2) adds \alpha \sum_j w_j^2, which shrinks all weights towards zero
– Lasso (L1) adds \alpha \sum_j |w_j|, which can drive some weights exactly to zero and therefore also performs feature selection
• The hyperparameter \alpha controls the strength of the penalty: the larger \alpha, the smaller the weights
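A minimal sketch contrasting ordinary least squares with Ridge and Lasso, assuming scikit-learn; the alpha values are arbitrary:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge

X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)  # L2 penalty: shrinks all weights
lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty: drives some weights to exactly zero

print(abs(ols.coef_).max(), abs(ridge.coef_).max())  # compare weight magnitudes
print((lasso.coef_ == 0).sum())                      # number of weights lasso zeroed out
```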
8
Q

How can we analyze the effect of regularizations on models?

A

We could plot the weight values before and after the application of the regularizations.

We could also analyze the effect of the regularizations as the alpha value changes, by plotting the weight values against the variation of alpha.

9
Q

What are the strategies to evaluate the best alpha value?

A

• To select the best value of α we cannot use the test set since it is going to be used for evaluating the final model (which uses α)

• Need to reserve part of the training data to evaluate possible candidate values of α and to select the best one

• If we have enough data, we can extract a validation set from the training data which will be used to select α

• If we don’t have enough data, we should select α by applying k-fold cross-validation over the training data choosing the α corresponding to the lowest average cost over the k folds
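A sketch of the k-fold strategy using scikit-learn's GridSearchCV over the training data only; the candidate α values are arbitrary:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_regression(n_samples=300, n_features=30, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Candidate alpha values are compared with 5-fold CV on the training data only
search = GridSearchCV(Ridge(), param_grid={"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)           # alpha with the best average score over the folds
print(search.score(X_test, y_test))  # final evaluation on the untouched test set
```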

10
Q

What are some of the metrics used to evaluate classification models?

A

• Accuracy
–Classifier accuracy in predicting the correct class labels

• Speed
–Time to construct the model (training time)
–Time to use the model to label unseen data

• Other Criteria
–Robustness in handling noise
– Scalability
– Interpretability

11
Q

What are linear classifiers? How do they work?

A

Linear classifiers are algorithms used in machine learning to classify data points by separating them into different classes using a linear decision boundary. They work by finding a hyperplane (a line in 2D, a plane in 3D, or a higher-dimensional equivalent) that best divides the data points of different classes.

Key Components:

1.	Linear Decision Boundary: The boundary is defined by a linear equation of the form:

f(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + b

where:
• \mathbf{x} is the input feature vector.
• \mathbf{w} is the weight vector that defines the orientation of the hyperplane.
• b is the bias term that shifts the hyperplane.
2. Classification Rule:
• A data point is classified based on which side of the hyperplane it lies. For binary classification:

\text{Class 1 if } f(\mathbf{x}) \geq 0, \text{ otherwise Class 2}.
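A tiny sketch of this decision rule, assuming NumPy and hand-picked weights:

```python
import numpy as np

w = np.array([1.5, -2.0])  # weight vector: orientation of the hyperplane (made up)
b = 0.5                    # bias: shifts the hyperplane

def predict(x):
    """Class 1 if the point lies on the non-negative side of the hyperplane, else Class 2."""
    return 1 if w @ x + b >= 0 else 2

print(predict(np.array([2.0, 1.0])))  # 1.5*2 - 2.0*1 + 0.5 = 1.5 >= 0 -> Class 1
print(predict(np.array([0.0, 2.0])))  # 1.5*0 - 2.0*2 + 0.5 = -3.5 < 0 -> Class 2
```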

How Linear Classifiers Work:

1.	Training: The algorithm adjusts the weights ( \mathbf{w} ) and bias ( b ) during training using labeled data so that the hyperplane best separates the classes.
•	Algorithms like Perceptron, Support Vector Machine (SVM), or optimization techniques like Gradient Descent are used for this purpose.
2.	Prediction: For a new input, the model calculates  f(\mathbf{x})  and determines the class based on the sign or value of  f(\mathbf{x}) .
3.	Evaluation: The performance of the classifier is measured using metrics like accuracy, precision, recall, and others.

Common Examples of Linear Classifiers:

1.	Logistic Regression: Models the probability of a binary outcome and uses a logistic function.
2.	Support Vector Machines (Linear Kernel): Maximizes the margin between classes while finding the optimal hyperplane.
3.	Perceptron Algorithm: A simple linear classifier that adjusts weights iteratively.

Limitations:

•	Not Suitable for Non-linear Data: Linear classifiers cannot model complex relationships or datasets where classes are not linearly separable.
•	Sensitive to Feature Scaling: The performance depends heavily on how features are scaled.

Extensions for Non-linear Data:

•	Kernel methods (e.g., in SVMs) or feature transformations (e.g., polynomial features) can help handle non-linear data while still using a linear classifier approach.
12
Q

Detail the logistic regression technique

A

Logistic regression is a supervised learning technique used for binary classification problems, where the output variable can take one of two possible values (e.g., yes/no, 0/1, spam/not spam). Unlike linear regression, logistic regression predicts the probability that a given input belongs to a particular class, mapping the output to a range between 0 and 1 using a sigmoid function.

  1. Key Concepts

Model Equation

Logistic regression uses the following model:

P(y=1 | \mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x} + b)

where:
• \mathbf{x} : Input feature vector.
• \mathbf{w} : Weight vector (coefficients).
• b : Bias (intercept).
• \sigma(z) : Sigmoid function defined as:

\sigma(z) = \frac{1}{1 + e^{-z}}

The sigmoid function maps any real-valued number to the range [0, 1].

Decision Boundary

To classify data points, logistic regression uses a threshold (e.g., 0.5):
• If P(y=1 | \mathbf{x}) \geq 0.5 , classify as class 1.
• Otherwise, classify as class 0.

The decision boundary is a linear hyperplane, defined by:

\mathbf{w}^T\mathbf{x} + b = 0

  2. Training Process

Log-Likelihood Function

The model is trained by maximizing the likelihood of the observed data. The likelihood for a dataset with n samples is:

L(\mathbf{w}, b) = \prod_{i=1}^n P(y_i | \mathbf{x}_i)

Taking the logarithm (log-likelihood) simplifies computation:

\log L(\mathbf{w}, b) = \sum_{i=1}^n \Big[ y_i \log P(y_i | \mathbf{x}_i) + (1 - y_i) \log (1 - P(y_i | \mathbf{x}_i)) \Big]

Optimization

The log-likelihood function is maximized to find the optimal weights ( \mathbf{w} ) and bias ( b ):
1. Gradient Descent or variants like Stochastic Gradient Descent (SGD) are commonly used to optimize the parameters.
2. The gradients of the log-likelihood with respect to the parameters are computed to update them iteratively:

\mathbf{w} \gets \mathbf{w} + \eta \nabla_{\mathbf{w}} \log L

where \eta is the learning rate.
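A compact sketch of this optimization loop (gradient ascent on the mean log-likelihood), assuming NumPy and a made-up toy dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))              # toy features
y = (X[:, 0] + X[:, 1] > 0).astype(float)  # toy binary labels

w, b = np.zeros(2), 0.0
eta = 0.1  # learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(500):
    p = sigmoid(X @ w + b)           # P(y=1 | x) under the current parameters
    grad_w = X.T @ (y - p) / len(y)  # gradient of the mean log-likelihood w.r.t. w
    grad_b = np.mean(y - p)
    w += eta * grad_w                # ascend the log-likelihood
    b += eta * grad_b

print(w, b, np.mean((p >= 0.5) == y))  # learned parameters and training accuracy
```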

  3. Advantages
    • Probabilistic Output: Predicts probabilities, which makes it interpretable and useful in risk-based decision-making.
    • Efficient: Works well for linearly separable datasets and is computationally efficient.
    • Feature Importance: The learned weights ( \mathbf{w} ) provide insights into feature importance.
  4. Limitations
    • Linear Decision Boundary: Cannot handle non-linear relationships unless features are transformed.
    • Imbalanced Data: Can perform poorly if one class dominates the dataset. Techniques like class weighting or oversampling are needed.
    • Outliers: Sensitive to outliers, which can significantly affect the decision boundary.
  5. Extensions

Multinomial Logistic Regression:

For multi-class classification, logistic regression can be extended using the softmax function, which generalizes the sigmoid function to multiple classes.

P(y = k | \mathbf{x}) = \frac{e^{\mathbf{w}_k^T \mathbf{x}}}{\sum_{j=1}^K e^{\mathbf{w}_j^T \mathbf{x}}}

Regularized Logistic Regression:

Adding regularization terms helps prevent overfitting:
• L1 Regularization: Adds \lambda \sum |w_i| (LASSO).
• L2 Regularization: Adds \lambda \sum w_i^2 (Ridge).

  6. Applications
    • Medical Diagnosis: Predicting the presence of a disease (e.g., diabetes).
    • Spam Filtering: Classifying emails as spam or not spam.
    • Customer Churn Prediction: Identifying customers likely to leave a service.
    • Credit Scoring: Determining the likelihood of loan default.

By mapping probabilities to binary outcomes with a linear decision boundary, logistic regression is both a simple yet powerful classification tool.

13
Q

Define the one-versus-the-rest multi-class classification technique

A

• For each class, it creates one classifier that predicts the target class against all the others

• Given three classes A, B, C, it computes three models
– One that predicts A against B and C
– One that predicts B against A and C, and
– One that predicts C against A and B

• Then, given an example, all three classifiers are applied and the label with the highest probability is returned

• Alternative approaches include minimizing the multinomial loss fit across the entire probability distribution

14
Q

How can we use logistic regression for multiclass classification?

A

Logistic regression can be extended to handle multiclass classification problems (where the output has more than two classes) using two main approaches: One-vs-Rest (OvR) and Multinomial Logistic Regression (Softmax Regression). Here’s how they work:

  1. One-vs-Rest (OvR) Approach

In this method, logistic regression is applied multiple times, once for each class. For a problem with K classes, the approach works as follows:
1. Binary Classifiers: Train K binary logistic regression classifiers, where each classifier distinguishes one class from the rest (e.g., “Class 1 vs. Not Class 1,” “Class 2 vs. Not Class 2,” and so on).
2. Prediction:
• For a new input, each classifier predicts a probability for its respective class.
• The class with the highest probability is assigned as the final prediction:

\hat{y} = \arg\max_{k \in \{1, 2, \dots, K\}} P(y=k | \mathbf{x})

Advantages of OvR:

•	Simple to implement using binary logistic regression.
•	Efficient for problems with a small number of classes.

Limitations of OvR:

•	Can be computationally expensive for large numbers of classes (since  K  models are trained).
•	May not perform as well if the classes are highly imbalanced.
  2. Multinomial Logistic Regression (Softmax Regression)

This is the direct extension of logistic regression for multiclass classification, where a single model predicts the probabilities for all K classes simultaneously. It uses the softmax function to ensure the output probabilities for all classes sum to 1.

Model

For a dataset with K classes, the probability of a data point \mathbf{x} belonging to class k is given by:

P(y = k | \mathbf{x}) = \frac{\exp(\mathbf{w}_k^T \mathbf{x} + b_k)}{\sum_{j=1}^K \exp(\mathbf{w}_j^T \mathbf{x} + b_j)}

where:
• \mathbf{w}_k and b_k are the weight vector and bias for class k .
• The denominator normalizes the probabilities.
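A small sketch of these softmax probabilities, assuming NumPy; the weights, biases, and input are illustrative:

```python
import numpy as np

def softmax_probs(x, W, b):
    """P(y = k | x) for every class k, given per-class weight rows W[k] and biases b[k]."""
    scores = W @ x + b                    # one linear score per class
    scores = scores - scores.max()        # numerical stability; probabilities unchanged
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()  # denominator normalizes over all K classes

W = np.array([[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]])  # K=3 classes, 2 features (made up)
b = np.array([0.0, 0.1, -0.1])
x = np.array([2.0, 1.0])

p = softmax_probs(x, W, b)
print(p, p.sum(), p.argmax())  # probabilities sum to 1; argmax gives the predicted class
```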

Decision Rule

The predicted class is the one with the highest probability:

\hat{y} = \arg\max_{k \in \{1, 2, \dots, K\}} P(y = k | \mathbf{x})

Training

The model is trained by maximizing the log-likelihood for all classes. For n samples, the log-likelihood is:

\log L = \sum_{i=1}^n \sum_{k=1}^K \mathbf{1}(y_i = k) \log P(y_i = k | \mathbf{x}_i)

where \mathbf{1}(y_i = k) is an indicator function (1 if y_i = k , 0 otherwise).

Optimization is done using methods like Gradient Descent or Stochastic Gradient Descent.

Advantages of Softmax Regression:

•	Single model handles all classes.
•	Provides probabilistic outputs for all classes.
•	Works well for balanced and separable datasets.

Limitations of Softmax Regression:

•	Computationally expensive for datasets with a large number of classes.
•	Assumes linear separability in the feature space.
  3. Regularization

To prevent overfitting, regularization can be applied to both approaches:
• L1 Regularization (LASSO): Encourages sparsity in weights.
• L2 Regularization (Ridge): Penalizes large weights to improve generalization.

  4. Comparison of OvR and Softmax

Feature | OvR | Softmax Regression
Number of Models | K binary models | 1 multinomial model
Training Complexity | Linear in K | More complex (joint training for all classes)
Output | Class probabilities for each binary model | Probabilities for all classes in one step
Use Case | Few classes, simpler datasets | Balanced and larger datasets

  5. Applications
    • Image Classification: Recognizing objects (e.g., dog, cat, car) in images.
    • Document Classification: Classifying documents into categories (e.g., sports, technology, politics).
    • Medical Diagnosis: Predicting types of diseases or conditions.

By choosing between OvR and Softmax Regression based on the dataset and problem requirements, logistic regression becomes a versatile tool for multiclass classification tasks.

15
Q

Define the confusion matrix and its attributes. What is the importance of distinguishing the different types of errors?

A

Confusion Matrix: Definition

A confusion matrix is a tool used to evaluate the performance of a classification model. It provides a summary of the predictions made by the model compared to the actual labels in the dataset. It breaks down the outcomes into four categories: True Positives, True Negatives, False Positives, and False Negatives, which give insight into the types of errors the model makes.

Attributes of the Confusion Matrix

1.	True Positives (TP):
•	Instances where the model correctly predicts the positive class.
•	For example, the model predicts “disease present” when the disease is indeed present.
2.	True Negatives (TN):
•	Instances where the model correctly predicts the negative class.
•	For example, the model predicts “no disease” when there is no disease.
3.	False Positives (FP):
•	Instances where the model incorrectly predicts the positive class.
•	For example, the model predicts “disease present” when there is no disease.
•	This is also known as a Type I error or a “false alarm.”
4.	False Negatives (FN):
•	Instances where the model incorrectly predicts the negative class.
•	For example, the model predicts “no disease” when the disease is present.
•	This is also known as a Type II error or a “miss.”

Importance of Distinguishing Different Types of Errors

1.	Context-Specific Impact:
•	The severity of false positives and false negatives depends on the application.
•	In medical diagnosis, a false negative (missing a disease) may be life-threatening, while a false positive (incorrectly diagnosing a disease) may cause unnecessary anxiety and tests.
2.	Decision-Making:
•	By understanding the types of errors, we can adjust the model to minimize the more critical error type. For example, in fraud detection, reducing false negatives (undetected fraud) is often more important than reducing false positives (flagging legitimate transactions as fraud).
3.	Model Evaluation:
•	Metrics like precision, recall, and F1-score depend on these error types. For instance, precision focuses on minimizing false positives, while recall emphasizes reducing false negatives.
4.	Imbalanced Datasets:
•	In datasets with imbalanced classes (e.g., rare diseases), accuracy alone can be misleading. Distinguishing errors helps ensure the model is evaluated based on how well it handles the minority class.
5.	Real-World Implications:
•	Understanding and balancing the trade-offs between false positives and false negatives ensures the model’s outputs align with the desired outcomes in practical scenarios.

By analyzing the confusion matrix, we can fine-tune a model to achieve a balance that best fits the specific goals and constraints of the application.
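A short sketch that builds a confusion matrix and reads off its four cells, assuming scikit-learn and hypothetical labels:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = "disease present", 0 = "no disease"
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

# With labels=[0, 1] the binary matrix is laid out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
```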

16
Q

How can we use the confusion matrix to calculate the accuracy of a model? Why is that not enough? In which situations is accuracy not effective/useful?

A

Using the Confusion Matrix to Calculate Accuracy

The accuracy of a model measures the proportion of correct predictions (both true positives and true negatives) out of the total predictions. It can be calculated from the confusion matrix using the formula:

\text{Accuracy} = \frac{\text{True Positives (TP)} + \text{True Negatives (TN)}}{\text{Total Predictions (TP + TN + FP + FN)}}

In simple terms, it is the ratio of correctly classified instances (both positive and negative) to the total number of instances in the dataset.
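A one-line sketch of the formula, using illustrative counts:

```python
# Counts read off a hypothetical confusion matrix
tp, tn, fp, fn = 30, 50, 10, 10

accuracy = (tp + tn) / (tp + tn + fp + fn)  # correct predictions over all predictions
print(accuracy)  # 0.8
```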

Why Accuracy Is Not Always Enough

Although accuracy is intuitive and easy to calculate, it does not always provide a complete picture of model performance. This is because:
1. Class Imbalance:
• In datasets where one class dominates (e.g., 95% of samples belong to Class A and only 5% to Class B), a model that always predicts Class A will achieve 95% accuracy but will fail completely at identifying Class B.
• In such cases, accuracy is misleading because it does not account for the model’s ability to correctly classify the minority class.
2. No Insight into Error Types:
• Accuracy does not distinguish between false positives (Type I errors) and false negatives (Type II errors). For certain applications, one type of error might be far more critical than the other.
• Example: In cancer detection, missing a cancer case (false negative) is more serious than falsely diagnosing cancer (false positive).
3. Lack of Granularity:
• Accuracy is a single metric and does not provide insights into specific aspects of the model’s performance, such as precision, recall, or the trade-offs between them.
4. Overfitting and Bias:
• High accuracy might indicate overfitting to the training data or bias in the dataset, where the model memorizes patterns instead of generalizing well.

Situations Where Accuracy Is Not Effective

1.	Imbalanced Datasets:
•	Example: In fraud detection, where only 1% of transactions are fraudulent, a model predicting all transactions as “non-fraudulent” will have 99% accuracy but will fail to detect any fraud cases.
2.	High Cost of Specific Errors:
•	Example: In medical diagnosis, missing a disease (false negative) might have serious consequences, even if the model achieves high accuracy overall.
3.	Multi-Class Problems:
•	In multi-class classification, accuracy alone does not reveal which classes are being misclassified and whether certain classes are disproportionately affected.
4.	Anomalies and Rare Events:
•	Example: In cybersecurity, detecting rare attacks is crucial, and a high accuracy model might fail to identify these rare cases effectively.

Better Alternatives to Accuracy

When accuracy is not effective, other metrics derived from the confusion matrix are more informative:
1. Precision:
• Focuses on the reliability of positive predictions ( \frac{TP}{TP + FP} ).
• Useful when false positives are costly (e.g., spam filtering).
2. Recall (Sensitivity):
• Measures the ability to identify all actual positives ( \frac{TP}{TP + FN} ).
• Useful when false negatives are costly (e.g., medical diagnosis).
3. F1-Score:
• Combines precision and recall into a single metric ( 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} ).
• Useful for imbalanced datasets.
4. Specificity:
• Measures the ability to identify actual negatives ( \frac{TN}{TN + FP} ).
• Important when false positives need to be minimized.
5. ROC-AUC (Receiver Operating Characteristic - Area Under Curve):
• Evaluates the trade-off between true positive and false positive rates across different thresholds.

By considering these metrics alongside accuracy, we gain a more comprehensive understanding of model performance, especially in critical or imbalanced scenarios.

17
Q

How can we use a cost matrix along with the confusion matrix to better evaluate a model?

A

Using a Cost Matrix with a Confusion Matrix

A cost matrix is a tool used to quantify the cost or impact of different types of errors (false positives and false negatives) and correct predictions (true positives and true negatives). By combining it with a confusion matrix, we can evaluate a model’s performance more realistically, considering the actual costs or consequences of its predictions.

How a Cost Matrix Works

A cost matrix assigns a numerical value (cost) to each outcome in the confusion matrix:
• True Positives (TP): Often assigned a reward or zero cost.
• True Negatives (TN): Often assigned a reward or zero cost.
• False Positives (FP): Associated with the cost of a Type I error.
• False Negatives (FN): Associated with the cost of a Type II error.

Example Cost Matrix for Binary Classification:

Actual \ Predicted | Positive | Negative
Positive (Actual) | 0 (Reward) | FN Cost
Negative (Actual) | FP Cost | 0 (Reward)

Steps to Use a Cost Matrix with a Confusion Matrix

1.	Calculate the Confusion Matrix:
•	Derive the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) for the model’s predictions.
2.	Define the Cost Matrix:
•	Assign costs based on the problem’s context. For example:
•	In fraud detection: Cost of missing a fraud (FN) is much higher than falsely flagging a legitimate transaction (FP).
•	In medical diagnosis: Missing a disease (FN) may have life-threatening consequences, while misdiagnosing a healthy patient (FP) may result in unnecessary tests.
3.	Compute the Total Cost:
•	Multiply the confusion matrix values by the corresponding costs from the cost matrix.
•	Calculate the total cost using the formula:

\text{Total Cost} = (TP \cdot C_{TP}) + (FP \cdot C_{FP}) + (FN \cdot C_{FN}) + (TN \cdot C_{TN})

•	Where  C_{TP}, C_{FP}, C_{FN}, C_{TN}  are the costs from the cost matrix.
4.	Evaluate the Model:
•	Compare the total costs of different models to identify the one that minimizes the overall cost, rather than solely relying on metrics like accuracy.

Why a Cost Matrix Improves Model Evaluation

1.	Realistic Decision-Making:
•	Incorporates the real-world consequences of errors, making the evaluation more aligned with the application’s requirements.
•	Example: In fraud detection, the cost of missing fraud is higher than the cost of flagging a legitimate transaction.
2.	Prioritization of Errors:
•	Helps prioritize reducing specific errors (false positives or false negatives) based on their impact.
3.	Balancing Class Imbalances:
•	Adjusts for the unequal importance of classes, especially in datasets with rare but critical events (e.g., fraud, diseases).
4.	Guides Threshold Selection:
•	A cost-sensitive approach can help choose an optimal decision threshold to minimize overall costs.

Example

Scenario: Medical Diagnosis

•	TP (Correctly detects disease): Cost = $0 (Reward for correct detection).
•	FP (Healthy person diagnosed with disease): Cost = $100 (Cost of unnecessary tests).
•	FN (Missed disease): Cost = $10,000 (Cost of untreated disease).
•	TN (Correctly identifies healthy): Cost = $0.

Suppose the confusion matrix for a model is:
• TP = 90, FP = 10, FN = 5, TN = 95.

Cost Computation:

\text{Total Cost} = (90 \cdot 0) + (10 \cdot 100) + (5 \cdot 10,000) + (95 \cdot 0)

\text{Total Cost} = 0 + 1,000 + 50,000 + 0 = 51,000

This cost-driven evaluation reveals the high penalty for false negatives, emphasizing the need for a model with higher recall.
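The same computation as a small sketch, with the counts and costs copied from the example above:

```python
# Confusion-matrix counts from the example above
counts = {"TP": 90, "FP": 10, "FN": 5, "TN": 95}

# Cost matrix: dollars per outcome (correct predictions modeled as zero cost)
costs = {"TP": 0, "FP": 100, "FN": 10_000, "TN": 0}

total_cost = sum(counts[k] * costs[k] for k in counts)
print(total_cost)  # 51000, dominated by the five false negatives
```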

When to Use a Cost Matrix

1.	Applications with High-Stakes Errors:
•	Fraud detection, medical diagnosis, credit risk analysis, cybersecurity.
2.	Imbalanced Datasets:
•	When class distribution is skewed, and some errors (e.g., false negatives) are more critical than others.
3.	Cost-Sensitive Decision Making:
•	In scenarios where the focus is on minimizing the overall cost rather than maximizing general metrics like accuracy.

By using a cost matrix, we shift from generic model evaluation to cost-sensitive optimization, enabling better alignment with real-world objectives.

18
Q

Define the precision and recall metrics

A

• Alternatives to accuracy, introduced in the area of information retrieval and search engines

• Precision
– In the information retrieval context, precision represents the percentage of the documents shown as results that are actually good (relevant)
– Percentage of items classified as positive that are actually positive: TP / (TP + FP)

• Recall
– Percentage of positive examples that are classified as positive: TP / (TP + FN)
– In the information retrieval context, recall represents the percentage of good documents shown with respect to all the existing good ones.

19
Q

What is the F1 metric? How can we calculate it?

A

F1 Metric: Definition

The F1 metric (or F1-score) is a measure of a model’s accuracy that considers both precision and recall. It is the harmonic mean of precision and recall, providing a single metric that balances the trade-off between these two measures. The F1-score is especially useful in cases of imbalanced datasets, where accuracy might be misleading.
• Precision: The proportion of true positive predictions out of all positive predictions made by the model.
• Recall (Sensitivity): The proportion of true positive predictions out of all actual positive instances.

The F1-score is calculated to give equal weight to precision and recall, making it effective when both metrics are important.

Formula to Calculate F1-Score

The F1-score is computed using the formula:

F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}

Where:
• \text{Precision} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Positives (FP)}}
• \text{Recall} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Negatives (FN)}}

Steps to Calculate F1-Score

1.	Determine Precision:
•	Count the true positives (TP) and false positives (FP).
•	Calculate precision using the formula:  \text{Precision} = \frac{TP}{TP + FP} .
2.	Determine Recall:
•	Count the true positives (TP) and false negatives (FN).
•	Calculate recall using the formula:  \text{Recall} = \frac{TP}{TP + FN} .
3.	Calculate the F1-Score:
•	Use the precision and recall values in the F1 formula:

F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}

Why Use the F1-Score?

1.	Balances Precision and Recall:
•	When both false positives and false negatives are critical, the F1-score provides a balanced evaluation.
2.	Useful for Imbalanced Datasets:
•	Accuracy may appear high if the majority class dominates, but the F1-score reflects the model’s performance for minority classes by focusing on TP, FP, and FN.
3.	Handles Trade-Offs:
•	High precision and low recall (or vice versa) result in a low F1-score, emphasizing the importance of balancing the two.

Example Scenario

Consider a binary classification model:
• The model predicts 50 positives.
• Of these, 30 are true positives (TP) and 20 are false positives (FP).
• There are 70 actual positives in the dataset, so there are 40 false negatives (FN).

Step 1: Calculate Precision

\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} = \frac{30}{30 + 20} = 0.6

Step 2: Calculate Recall

\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} = \frac{30}{30 + 40} = 0.4286

Step 3: Calculate F1-Score

F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = 2 \cdot \frac{0.6 \cdot 0.4286}{0.6 + 0.4286} \approx 0.5
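The same worked example as a short sketch:

```python
tp, fp, fn = 30, 20, 40  # counts from the scenario above

precision = tp / (tp + fp)                          # 0.6
recall = tp / (tp + fn)                             # ~0.4286
f1 = 2 * precision * recall / (precision + recall)  # ~0.5
print(round(precision, 4), round(recall, 4), round(f1, 4))
```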

Limitations of F1-Score

•	It does not account for true negatives (TN), so it might not fully reflect the model’s overall performance, particularly when TN is important.
•	The F1-score assumes equal importance for precision and recall. If one is more critical, other metrics like the weighted F1-score or a customized cost function might be more suitable.

When to Use the F1-Score

•	When dealing with imbalanced datasets.
•	When both false positives and false negatives are significant but need to be balanced.
•	In applications like fraud detection, medical diagnosis, or spam filtering, where one type of error might dominate but balancing both errors is essential.
20
Q

What are the interpretations for the weights according to the attribute type?

A

• Numerical variables
– Increasing the numerical feature by one unit changes the estimated outcome by its weight

• Binary variables
– Changing the variable’s value modifies the outcome by the variable’s weight

• Nominal variables
– They are generally transformed using one-hot encoding, thus the values are mapped into binary variables

• Intercept
– The interpretation of this weight makes the most sense when the values have been normalized (standardized)
– In this case, the intercept reflects the predicted outcome when all the variables are at mean value

21
Q

What is class imbalance? How could we solve it?

A

Class Imbalance

In many data sets there are a disproportionate number of instances that belong to different classes

In health-care applications, we expect to observe a smaller number of subjects who are positively diagnosed.

In credit card fraud detection, fraudulent transactions are greatly outnumbered by legitimate transactions.

Strategies for Imbalance Datasets

• A basic approach for creating balanced training sets is to generate a sample of training instances where the rare class has adequate representation.
• Two types of sampling methods to enhance the representation of the minority class: undersampling and oversampling

• Undersampling
–The frequency of the majority class is reduced to match the frequency of the minority class
– However, some of the useful negative examples may not be chosen for training, therefore, resulting in an inferior classification model.

Oversampling:

• Examples of the minority class are artificially created to make them equal in proportion to the number of negative instances (e.g., by duplicating existing examples or creating new ones)
• Duplicating a positive instance is analogous to doubling its weight during the training stage. The same effect can be achieved by assigning higher weights to positive instances than to negative instances (an approach that can be used, for example, with logistic regression, ANN, and SVM).
• Duplicated examples have an artificially lower variance compared with their true distribution in the overall data. This can bias the classifier to the specific distribution of training instances, which may not be representative of the distribution of test instances, leading to poor generalizability.

22
Q

What is the SMOTE technique?

A

• To overcome the limitations of oversampling by duplication, we can generate synthetic positive instances in the neighborhood of existing positive instances.

• Synthetic Minority Oversampling Technique (SMOTE)
– First determine the k-nearest positive neighbors of every positive instance x
– Then generate a synthetic positive instance at some intermediate point along the line segment joining x to one of its randomly chosen k-nearest neighbors, x_k.
– Repeat the process until the desired number of positive instances is reached

• SMOTE generates new positive instances in the convex hull of the existing positive class. Hence, it does not improve the representation of the positive class outside the boundary of existing positive instances
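A minimal NumPy sketch of the SMOTE idea described above (illustrative only; libraries such as imbalanced-learn provide production implementations):

```python
import numpy as np

def smote_like(X_pos, n_new, k=5, rng=None):
    """Generate n_new synthetic minority instances by interpolating towards k-nearest neighbors."""
    if rng is None:
        rng = np.random.default_rng(0)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_pos))
        x = X_pos[i]
        dists = np.linalg.norm(X_pos - x, axis=1)  # distances to every positive instance
        neighbors = np.argsort(dists)[1:k + 1]     # k nearest positives, excluding x itself
        x_k = X_pos[rng.choice(neighbors)]         # one randomly chosen neighbor
        lam = rng.random()                         # random point on the segment [x, x_k]
        synthetic.append(x + lam * (x_k - x))
    return np.array(synthetic)

X_pos = np.random.default_rng(1).normal(size=(20, 2))  # toy minority-class instances
print(smote_like(X_pos, n_new=30).shape)               # (30, 2)
```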

23
Q

How to compare the relative performance among competing models?

• Suppose we have two models
–Model MA with an accuracy = 82% computed using 10-fold cross-validation
–Model MB with an accuracy = 80% computed using 10-fold cross-validation
• How much confidence can we place on accuracy of MA and MB?
• Can we say MA is better than MB?
• Can the performance difference be the result of random fluctuations in the test set?

A

How do we know that the difference in performance is not just due to chance?

We compute the odds of it! Apply the t-test and compute the p-value

The p-value represents the probability that the reported difference is due to chance

24
Q

What is the general idea when applying the Student's t-test to two models?

A

• First decide on a confidence level, for example, 95%
– Corresponds to false discovery (false positive) rate: 𝛂 = 5%
– How frequently you are willing to declare difference when there is none

• Apply k-fold cross-validation to each model
– Obtaining k evaluations for each algorithm over same folds

• Apply Student’s t-test and compute p-value to determine whether reported difference is statistically significant
– If p-value > 𝛂 then the difference is not significant (we can claim nothing)
– If p-value < 𝛂 then the difference is significant (claim one is better than the other)
– Note that the t-test can be paired or unpaired
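A sketch of the paired test on per-fold accuracies, assuming SciPy; the fold scores are made up:

```python
from scipy import stats

# Hypothetical accuracies of two models over the same 10 cross-validation folds
acc_a = [0.83, 0.81, 0.85, 0.80, 0.82, 0.84, 0.79, 0.83, 0.82, 0.81]
acc_b = [0.80, 0.79, 0.82, 0.78, 0.80, 0.81, 0.77, 0.80, 0.79, 0.80]

alpha = 0.05  # 95% confidence level
t_stat, p_value = stats.ttest_rel(acc_a, acc_b)  # paired t-test over the same folds
print(p_value, "significant" if p_value < alpha else "not significant")
```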

25
Q

What about multiple hypothesis testing? How could we do this?

A

Bonferroni Correction

• Assume that individual tests are independent.
• Divide the desired p-value threshold by the number of tests performed.
• Example (20 tests, desired threshold 0.05)
– The threshold is set to 0.05/20 = 0.0025
– P(making a mistake on a single test) = 0.0025
– P(not making a mistake on a single test) = 0.9975
– P(not making any mistake over the 20 tests) = 0.9975^20 = 0.9512
– P(making at least one mistake) = 1 - 0.9512 = 0.0488

Non Parametric Tests

• They do not make any assumption about the distribution of the variable in the population

• Mann-Whitney U Test
–Nonparametric equivalent of the independent t-test

• Wilcoxon matched-pairs signed rank test
–Used to compare two related groups

26
Q

What are the impacts of the classifier threshold on precision and recall?

A

We can use the threshold to optimize our precision and recall:
a higher threshold increases precision and lowers recall;
a lower threshold decreases precision and increases recall.

• Suppose we use a near one threshold to classify positive examples

• Then, we will classify as positives only examples for which we are very confident (this is a pessimistic classifier)

• Precision will be high
– In fact, we are not likely to produce many false positives

• Recall will be low
– In fact, we are likely to produce many false negatives

• Suppose we use a near zero threshold to classify positive examples

• Then, we will classify everything as positives (this is an optimistic classifier)

• Precision will be low as we are going to generate the maximum number of false positives (everything is positive!)

• Recall will be high since by classifying everything as positive we are going to generate the minimum number of false negatives

27
Q

How can we determine the best classification threshold?

A

• Plot precision as a function of recall for varying threshold values
• The best classifier would be the one that always has a precision equal to one (but this never happens)
• More in general, classifiers will show curves of different shapes
• How to decide among multiple classifiers?
– Use the area under the curve (the nearer to one, the better)
– Use the F1 measure

Determining the Best Classification Threshold Using a Precision-Recall Curve and F1-Score

A classification threshold is a value that determines the point at which a model classifies predictions as positive or negative. Adjusting this threshold directly affects the precision and recall of the model, and consequently, the F1-score. The best threshold balances these metrics based on the application’s requirements.

Steps to Determine the Best Threshold

  1. Generate the Precision-Recall Curve
    • A precision-recall curve plots precision (y-axis) against recall (x-axis) for various threshold values.
    • Lower thresholds increase recall but may reduce precision, while higher thresholds increase precision but reduce recall.
  2. Calculate F1-Score for Each Threshold
    • For each threshold on the precision-recall curve:
    • Compute precision: \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}
    • Compute recall: \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}
    • Compute the F1-score:

F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}

  3. Select the Threshold with the Highest F1-Score
    • The threshold with the maximum F1-score represents the optimal trade-off between precision and recall.
    • This threshold is ideal when precision and recall are equally important.

Visualization Approach

•	Plot the precision-recall curve and annotate it with F1-scores for key thresholds.
•	Mark the threshold corresponding to the highest F1-score on the curve.
•	This helps in visually understanding how threshold adjustments impact precision, recall, and F1.

Why Use the F1-Score?

•	The F1-score balances precision and recall, making it a suitable metric for imbalanced datasets.
•	When both false positives and false negatives are important, the F1-score helps identify a threshold that minimizes both error types.

When Precision or Recall is More Critical

•	If precision is more important (e.g., minimizing false positives in spam detection), select a threshold that maximizes precision, even if it lowers recall.
•	If recall is more important (e.g., detecting all cases of disease), select a threshold that maximizes recall, even if precision is slightly reduced.

Limitations

•	The optimal threshold based on F1-score might not be suitable if precision and recall have unequal importance. In such cases, weighted metrics or a cost-sensitive approach might be more appropriate.
•	Real-world applications may require domain-specific adjustments to thresholds beyond what the F1-score alone suggests.

By calculating the F1-score for various thresholds on the precision-recall curve, you can identify the best classification threshold for balancing precision and recall effectively, ensuring the model performs optimally for its intended use case.
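A sketch of this procedure, assuming scikit-learn, that picks the threshold maximizing F1 on a held-out set:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

probs = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_test)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_test, probs)
f1 = 2 * precision * recall / (precision + recall + 1e-12)  # avoid division by zero

best = np.argmax(f1[:-1])          # the last precision/recall pair has no threshold
print(thresholds[best], f1[best])  # threshold with the best precision/recall trade-off
```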

28
Q

What are the ROC curves? How could we use them to compare models?

A

ROC Curves: Definition

A Receiver Operating Characteristic (ROC) curve is a graphical representation of a classification model’s performance across various decision thresholds. It plots:
• True Positive Rate (TPR) (y-axis), also known as Recall or Sensitivity:

\text{TPR} = \frac{\text{TP}}{\text{TP} + \text{FN}}

•	False Positive Rate (FPR) (x-axis):

\text{FPR} = \frac{\text{FP}}{\text{FP} + \text{TN}}

The curve shows the trade-off between sensitivity and specificity as the classification threshold is adjusted.

Key Features of the ROC Curve

1.	Diagonal Line (Baseline):
•	Represents a random classifier. A curve close to this line indicates poor performance.
2.	Perfect Model:
•	A perfect model reaches the top-left corner of the graph, indicating  \text{TPR} = 1  and  \text{FPR} = 0 .
3.	Area Under the Curve (AUC):
•	The AUC-ROC score quantifies the overall performance of the model. It ranges from 0 to 1:
•	1.0: Perfect model.
•	0.5: Random guess.
•	< 0.5: Worse than random.

Using ROC Curves to Compare Models

  1. Compare AUC Scores
    • Models with a higher AUC score generally perform better across all thresholds.
    • Example: If Model A has an AUC of 0.90 and Model B has an AUC of 0.75, Model A is better at distinguishing between classes.
  2. Visual Analysis
    • Compare the shapes of the ROC curves:
    • A steeper curve near the top-left corner indicates better performance at higher TPRs and lower FPRs.
    • A flatter curve closer to the diagonal line suggests poor discrimination ability.
  3. Focus on Specific Regions
    • Depending on the application, some thresholds might matter more than others:
    • In medical diagnosis, prioritize the part of the curve where FPR is low, as false positives can be costly.
    • In fraud detection, focus on higher TPR regions to detect as many true positives as possible.
  4. Threshold Selection
    • Use the ROC curve to select a threshold that balances TPR and FPR according to the application’s needs.

Advantages of ROC Curves

1.	Threshold Independence:
•	ROC curves evaluate performance across all thresholds, offering a holistic view of the model.
2.	Class Imbalance Resilience:
•	Unlike accuracy, ROC curves are not affected by imbalanced datasets since they focus on TPR and FPR.
3.	Comparative Analysis:
•	Easily compare multiple models’ discrimination abilities in the same plot.

Limitations of ROC Curves

1.	Not Suitable for Imbalanced Datasets:
•	In highly imbalanced datasets, FPR might appear low due to the abundance of true negatives, making the curve misleading.
•	In such cases, a Precision-Recall (PR) curve is often preferred.
2.	Application-Specific Metrics:
•	ROC curves focus on TPR and FPR but might not reflect application-specific costs or priorities (e.g., false negatives being more critical than false positives).

Example Scenario: Comparing Models

Case:

You have three models (A, B, and C) for a binary classification task. Plot their ROC curves:
1. Model A:
• AUC = 0.95 (steep curve near the top-left corner).
• Excellent at distinguishing between positive and negative classes.
2. Model B:
• AUC = 0.80 (moderate curve).
• Performs well but less reliably than Model A.
3. Model C:
• AUC = 0.60 (curve close to the diagonal line).
• Barely better than random guessing.

Decision:

•	Select Model A for the best performance across thresholds.
•	If your use case prioritizes specific thresholds (e.g., low FPR), examine the corresponding regions of the ROC curves.

By using ROC curves and their AUC scores, you can compare models, analyze trade-offs, and select the best model for your specific application.
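A sketch comparing two models by AUC, assuming scikit-learn; the models and data are placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for name, model in [("logistic", LogisticRegression(max_iter=1000)),
                    ("shallow tree", DecisionTreeClassifier(max_depth=3, random_state=0))]:
    scores = model.fit(X_train, y_train).predict_proba(X_test)[:, 1]
    print(name, roc_auc_score(y_test, scores))  # higher AUC = better ranking of positives
```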

29
Q

What does the no free lunch theorem stipulate?

A

• If the goal is to obtain good generalization performance, there are no context-independent or usage-independent reasons to favor one classification method over another

• If one algorithm seems to outperform another in a certain situation, it is a consequence of its fit to the particular problem, not the general superiority of the algorithm

• When confronting a new problem, this theorem suggests that we should focus on the aspects that matter most
–Prior information
–Data distribution
–Amount of training data
–Cost or reward

30
Q

What is the k-nearest neighbor method? How does this method relate to instance-based methods? What are the differences between them?

A

The k-Nearest Neighbor (k-NN) is a simple, yet powerful, supervised machine learning algorithm often used for classification and regression tasks. It classifies a data point based on the majority class of its nearest neighbors or predicts a value by averaging the values of its nearest neighbors.

How k-NN Works:

1.	Training Phase:
•	k-NN does not perform explicit training. It simply stores the training data.
•	This is why it is considered a lazy learning algorithm.
2.	Prediction Phase:
•	When given a new data point, the algorithm computes the distance between the point and all points in the training set.
•	It selects the k nearest neighbors (commonly using Euclidean distance, but other metrics like Manhattan or Minkowski can be used).
•	For classification, the predicted class is determined by majority voting among the neighbors.
•	For regression, the prediction is often the average (or weighted average) of the neighbors’ values.
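A small sketch, assuming scikit-learn's KNeighborsClassifier and the bundled iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "Training" just stores the data; the distance computations happen at prediction time
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(knn.score(X_test, y_test))  # accuracy via majority vote of the 5 nearest neighbors
```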

Relation to Instance-Based Methods:

•	Instance-based methods are a family of machine learning techniques where the model “learns” by storing the training data and making predictions based on these stored instances.
•	k-NN is a classic example of an instance-based method because it relies entirely on stored training instances to make predictions, rather than building an explicit model or deriving parameters.

Differences Between k-NN and Instance-Based Methods (Broadly):

While k-NN is a specific implementation of instance-based learning, other instance-based methods may differ in the following ways:

Aspect | k-NN | Other Instance-Based Methods
Specificity | A specific algorithm within instance-based methods. | A broader category including other methods like RBF networks or case-based reasoning.
Distance Function | Typically uses Euclidean or similar distance metrics. | May use more complex similarity measures depending on the method.
Prediction Strategy | Majority voting (classification) or averaging (regression). | May employ weighting, kernel functions, or heuristics.
Memory Usage | Stores all training data, often leading to high memory requirements. | Some methods may condense or preprocess the instances for efficiency.
Adaptation | Predictions rely on all data points near a query. | Might use only specific "prototypical" instances or adapt based on context.

Key Takeaways:

•	k-NN is a specific type of instance-based learning method.
•	All k-NN methods are instance-based, but not all instance-based methods are k-NN.
•	Instance-based methods include a range of algorithms that rely on stored instances for predictions, sometimes with enhancements to address k-NN’s drawbacks, such as sensitivity to irrelevant features or large memory requirements.
31
Q

What is the impact of the number of neighbors chosen in the k-nearest neighbor method?

A

If k is too small, classification might be sensitive to noise points
If k is too large, neighborhood may include quite dissimilar examples

32
Q

What is the cost of applying k-nearest neighbor? What are the methods to improve this cost?

A

• Basic Approach
–Linear scan of the data
–Classification time for a single instance depends on the number of data points and the number of variables: O(nd) for n instances of d variables
–This becomes prohibitive when the training set is large

• Nearest-neighbor search can be sped up by using
–KD-Trees
– Ball-Trees

33
Q

Define the KD-Tree method. Discuss its effectiveness

A

Split the space hierarchically using a tree generated from the data
To find the neighbor of a specific example, navigate the tree using the example

Effectiveness of KD-trees

• Search complexity depends on the depth of the tree
• It is the logarithm of the number of nodes for a balanced tree, O(log(n))
• Occasional rebalancing of the tree may be needed; randomizing the order of the data is another option
• But the amount of backtracking required depends on the quality of the tree
• Some nodes cover roughly square regions (good) while others are skinny (bad)

34
Q

If we try logistic regression and the k-nearest neighbor method, and the results of the first attempt are bad but the ones from the second are good, what does that say about the nature of the problem?

A

It suggests that the quantity we are trying to predict has local properties

35
Q

What does Naive Bayes classifier do? What is its general idea? What assumption does it make?

A

• What’s the probability of the class given an example?
• An example is represented as a tuple of attributes
• Given the target y (identifying the class value for the instance) we are looking for the class with the highest probability for x

• Naïve Bayes classifiers assume that attributes are statistically independent. Thus, the evidence splits into parts that are independent

• Training
–Count the frequency of tuples (xi,y) for each attribute value xi and each class value y
–Use the counts to compute estimates for the class probability P(y) and the conditional probability P(xi|y)
• Testing
– Given an example x, compute the most likely class as \hat{y} = \arg\max_{y} P(y) \prod_{i} P(x_i | y)

• Two assumptions
–Attributes are equally important
–Attribute are statistically independent
• Statistically independent means
–That knowing the value of one attribute xj says nothing about the value of another xi if the
class y is known, that is, P(xi|xj,y) = P(xi|y)
–Independence assumption is almost never correct! But the scheme works well in practice
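A tiny counting-based sketch of these training and testing steps on made-up categorical data (Laplace smoothing omitted for brevity):

```python
from collections import Counter, defaultdict

# Toy categorical dataset: each example is (attribute tuple, class label)
data = [(("sunny", "hot"), "no"), (("sunny", "mild"), "no"),
        (("rainy", "mild"), "yes"), (("overcast", "hot"), "yes"),
        (("rainy", "cool"), "yes"), (("sunny", "cool"), "yes")]

# Training: count class frequencies and (attribute value, class) frequencies
class_counts = Counter(y for _, y in data)
cond_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
for x, y in data:
    for i, v in enumerate(x):
        cond_counts[(i, y)][v] += 1

def predict(x):
    """Testing: argmax over classes of P(y) * prod_i P(x_i | y), estimated from counts."""
    best, best_score = None, -1.0
    for y, cy in class_counts.items():
        score = cy / len(data)  # P(y)
        for i, v in enumerate(x):
            score *= cond_counts[(i, y)][v] / cy  # P(x_i | y)
        if score > best_score:
            best, best_score = y, score
    return best

print(predict(("sunny", "hot")))  # "no" wins for this toy data
```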

36
Q

Given the example in DM3 image calculate the probabilities of the

A

Answer in image DM3

37
Q

What is the zero frequency problem in Naive Bayes classifiers? How could we solve it?

A

• What if an attribute value does not occur with every class value? (for instance, “Outlook = overcast” for class “no”)

• The corresponding probability will be zero, and posteriori probability will also be zero! (No matter how likely the other values are!)

• Typical remedy is to add 1 to count for every (attribute value, class) pair
–Process called smoothing.
–Adding 1 is called a Laplace estimator

• Resulting probabilities will never be zero! It also stabilizes probability estimates

38
Q

How does Naive Bayes deal with missing values?

A

• During training, instance is not included in frequency count for attribute value-class combination

• During testing, the attribute will be omitted from calculation

39
Q

How does Naive Bayes deals with numeric attributes?

A

• So far, we applied Naïve Bayes to categorical data.
• What if some (or all) of the attributes are numeric?
• Two options:
– Discretize the data to make it binary or discrete
– Compute a probability density for each class, either by assuming a parametric form for the distribution and estimating its parameters (e.g., assume attribute values for the class follow a Gaussian distribution), or by directly estimating the probability density from the data (e.g., use kernel smoothing to estimate the density of values along the axis for the class)

40
Q

Calculate the probability of … in image DM4

A

Answer in DM4

41
Q

What are Bayesian Belief Networks? How do they work?

A

• Bayesian Belief Networks (BBN) provide a graphical representation of probabilistic relationships among a set of random variables

• Describe the probability distribution governing a set of variables by specifying
– Conditional independence assumptions that apply on subsets of the variables
–A set of conditional probabilities

• Two key elements
– A directed acyclic graph, encoding the dependence relationships among the variables
– The network topology imposes conditions regarding the variables’ conditional independence
– A probability table associating each node with its immediate parent nodes

42
Q

Calculate the example in image DM5

A

Answer in DM5