Linear Reg Flashcards
How do we evaluate a regression model?
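• Given N examples, pairs (\mathbf{x}_i, y_i), linear regression computes a model
\hat{y} = \mathbf{w}^T\mathbf{x} + b
• So that for each point,
y_i = \mathbf{w}^T\mathbf{x}_i + b + \epsilon_i
• We evaluate the model by computing the Residual Sum of Squares (RSS)
RSS = \sum_{i=1}^N (y_i - \hat{y}_i)^2
The goal of linear regression is thus to find the weights that minimize RSS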
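As a minimal sketch of this idea, assuming NumPy is available: the toy data and the helper name `rss` below are illustrative and not part of the original notes. A least-squares solver returns the weights that minimize RSS.

```python
import numpy as np

def rss(X, y, w):
    """Residual Sum of Squares for weights w (X already holds a column of ones)."""
    residuals = y - X @ w
    return float(residuals @ residuals)

# Hypothetical toy data generated exactly from y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x
X = np.column_stack([np.ones_like(x), x])  # first column models the intercept

# Least-squares solution: the weights that minimize RSS.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Any other weight vector, for example all zeros, gives a larger RSS on the same data.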
What are the assumptions made for linear regression?
• Linearity
– When applying linear regression, the prediction is a linear combination of the inputs
• Normality
–The target outcome, given the features, follows a normal distribution (equivalently, the error terms are normally distributed)
• Homoscedasticity
–The variance of the error terms is assumed to be constant over the entire feature space
• Independence
– Each instance is independent from one another
• Absence of Multicollinearity
–There are no strongly correlated features
What is the coefficient of determination R squared in linear regression? What does it indicate?
• Total sum of squares
TSS = \sum_{i=1}^N (y_i - \bar{y})^2
where \bar{y} is the mean of the target values
• Coefficient of determination
R^2 = 1 - \frac{RSS}{TSS}
• R2 measures how well the regression line approximates the real data points. When R2 is 1, the regression line perfectly fits the data.
• R2 never decreases as features are added, even if they do not convey any information about the target
• Therefore, it is usually better to use the adjusted R2
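A small sketch of both quantities, assuming NumPy and made-up predictions. The adjusted R2 formula used here, 1 - (1 - R2)(n - 1)/(n - p - 1) with n samples and p features, is the common textbook definition and is not spelled out in the notes themselves.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - RSS / TSS."""
    rss = np.sum((y_true - y_pred) ** 2)
    tss = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - rss / tss

def adjusted_r2(r2, n_samples, n_features):
    """Penalizes R2 for the number of features used by the model."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Hypothetical targets and predictions from a one-feature model.
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8, 5.0])
r2 = r2_score(y_true, y_pred)
adj = adjusted_r2(r2, n_samples=5, n_features=1)
```

The adjusted value is always at most R2, and the gap widens as features are added without a matching reduction in RSS.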
How do we evaluate a model?
• Models should be evaluated using data that have not been used to build the model itself
• Example: would it be fair to evaluate students using exactly the same problems solved in class?
• The available data must be split between training and test
–Training data will be used to build the model
–Test data will be used to evaluate the model performance
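One possible way to perform such a split, sketched with NumPy. The function name `train_test_split`, the 30% test fraction, and the toy arrays are assumptions made for illustration.

```python
import numpy as np

def train_test_split(X, y, test_fraction=0.2, seed=0):
    """Shuffle the row indices, then reserve a fraction of them for testing."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    n_test = int(round(test_fraction * len(X)))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Hypothetical dataset: 10 examples with 2 features each.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_fraction=0.3)
```

No example appears in both parts, so the test data play no role in building the model.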
What is cross-validation?
• First step
– Data is split into k subsets of equal size
• Second step
–Each subset in turn is used for testing and the remainder for training
• This is called k-fold cross-validation and avoids overlapping test sets
• Often the subsets are stratified before cross-validation is performed
• The error estimates are averaged to yield an overall error estimate
• Standard method for evaluation: stratified ten-fold cross-validation
• Why ten? Extensive experiments have shown that this is about the right number of folds to get an accurate estimate
• Stratification reduces the estimate’s variance
• Even better: repeated stratified cross-validation
• Ten-fold cross-validation is repeated ten times and results are averaged (reduces the variance)
• Other approaches appear to be robust, e.g., 5x2 cross-validation
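The two steps above can be sketched as follows, assuming NumPy and an ordinary least-squares model as the learner. The helper names and the toy data are hypothetical, and stratification is omitted because this example is a regression task.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

def cross_val_mse(X, y, k=5):
    """Average test MSE of ordinary least squares over the k folds."""
    errors = []
    for train_idx, test_idx in k_fold_indices(len(X), k):
        w, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
        errors.append(np.mean((y[test_idx] - X[test_idx] @ w) ** 2))
    return float(np.mean(errors))

# Hypothetical noiseless data from y = 3 - 2x, so the error estimate is ~0.
x = np.linspace(0.0, 1.0, 20)
X = np.column_stack([np.ones_like(x), x])
y = 3.0 - 2.0 * x
cv_error = cross_val_mse(X, y, k=5)
```

Repeated cross-validation would call `cross_val_mse` with different shuffling seeds and average the results.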
What is overfitting?
Very good performance on the training set (model fits precisely patterns present in training data)
Terrible performance on the test set (patterns were just noise and are no longer present)
Why and how do regularizations such as Ridge and Lasso work?
How can we analyze the effect of regularizations on models?
We could plot the weight values before and after the application of the regularizations
We could also analyze the effect of the regularizations as the alpha value changes, by plotting the weight values against alpha.
What are the strategies to select the best alpha value?
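A sketch of such an analysis for Ridge, assuming NumPy, synthetic data, and the closed-form Ridge solution without an intercept. The function name `ridge_weights` and the list of alpha values are illustrative choices.

```python
import numpy as np

def ridge_weights(X, y, alpha):
    """Closed-form Ridge solution (X^T X + alpha I)^-1 X^T y, no intercept."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Hypothetical synthetic data with known generating weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=50)

alphas = [0.0, 1.0, 10.0, 100.0, 1000.0]
# One row of weights per alpha; each column can be plotted against alphas.
path = np.array([ridge_weights(X, y, a) for a in alphas])
# Overall weight magnitude for each alpha.
norms = np.linalg.norm(path, axis=1)
```

The first row (alpha = 0) holds the unregularized weights, and the norm of the weight vector shrinks as alpha grows.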
• To select the best value of α we cannot use the test set since it is going to be used for evaluating the final model (which uses α)
• Need to reserve part of the training data to evaluate possible candidate values of α and to select the best one
• If we have enough data, we can extract a validation set from the training data which will be used to select α
• If we don’t have enough data, we should select α by applying k-fold cross-validation over the training data choosing the α corresponding to the lowest average cost over the k folds
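The cross-validation strategy can be sketched as below, assuming NumPy, Ridge in closed form, and synthetic training data. The candidate alpha values, helper names, and data are all hypothetical.

```python
import numpy as np

def ridge_weights(X, y, alpha):
    """Closed-form Ridge solution without an intercept."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

def cv_cost(X, y, alpha, k=5):
    """Average validation MSE of Ridge with this alpha over k contiguous folds."""
    folds = np.array_split(np.arange(len(X)), k)
    costs = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w = ridge_weights(X[train], y[train], alpha)
        costs.append(np.mean((y[val] - X[val] @ w) ** 2))
    return float(np.mean(costs))

# Hypothetical training data only; the test set is never touched here.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(60, 4))
y_train = X_train @ np.array([1.5, 0.0, -2.0, 0.0]) + 0.5 * rng.normal(size=60)

candidate_alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
costs = {a: cv_cost(X_train, y_train, a) for a in candidate_alphas}
best_alpha = min(costs, key=costs.get)  # lowest average cost over the folds
```

The final model would then be retrained on all the training data with `best_alpha` and evaluated once on the held-out test set.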
What are some of the metrics used to evaluate classification models?
• Accuracy
–Classifier accuracy in predicting the correct class labels
• Speed
–Time to construct the model (training time)
–Time to use the model to label unseen data
• Other Criteria
–Robustness in handling noise
– Scalability
– Interpretability
What are linear classifiers? How do they work?
Linear classifiers are algorithms used in machine learning to classify data points by separating them into different classes using a linear decision boundary. They work by finding a hyperplane (a line in 2D, a plane in 3D, or a higher-dimensional equivalent) that best divides the data points of different classes.
Key Components:
1. Linear Decision Boundary: The boundary is defined by a linear equation of the form:
f(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + b
where:
• \mathbf{x} is the input feature vector.
• \mathbf{w} is the weight vector that defines the orientation of the hyperplane.
• b is the bias term that shifts the hyperplane.
2. Classification Rule:
• A data point is classified based on which side of the hyperplane it lies. For binary classification:
\text{Class 1 if } f(\mathbf{x}) \geq 0, \text{ otherwise Class 2}.
How Linear Classifiers Work:
1. Training: The algorithm adjusts the weights ( \mathbf{w} ) and bias ( b ) during training using labeled data so that the hyperplane best separates the classes.
• Algorithms like Perceptron, Support Vector Machine (SVM), or optimization techniques like Gradient Descent are used for this purpose.
2. Prediction: For a new input, the model calculates f(\mathbf{x}) and determines the class based on the sign or value of f(\mathbf{x}) .
3. Evaluation: The performance of the classifier is measured using metrics like accuracy, precision, recall, and others.
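The training and prediction steps can be illustrated with a perceptron, one of the algorithms named above. This is a minimal sketch assuming NumPy, classes encoded as +1 and -1, and a tiny hand-made linearly separable dataset.

```python
import numpy as np

def predict(X, w, b):
    """Class +1 if f(x) = w^T x + b >= 0, otherwise class -1."""
    return np.where(X @ w + b >= 0, 1, -1)

def train_perceptron(X, y, epochs=100, lr=1.0):
    """Perceptron rule: move w and b toward each misclassified point."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:  # wrong side of (or on) the hyperplane
                w += lr * yi * xi
                b += lr * yi
                mistakes += 1
        if mistakes == 0:  # every point is correctly classified
            break
    return w, b

# Hypothetical linearly separable data.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = train_perceptron(X, y)
```

On this data the loop stops after the second pass, once a full pass produces no mistakes.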
Common Examples of Linear Classifiers:
1. Logistic Regression: Models the probability of a binary outcome and uses a logistic function.
2. Support Vector Machines (Linear Kernel): Maximizes the margin between classes while finding the optimal hyperplane.
3. Perceptron Algorithm: A simple linear classifier that adjusts weights iteratively.
Limitations:
• Not Suitable for Non-linear Data: Linear classifiers cannot model complex relationships or datasets where classes are not linearly separable.
• Sensitive to Feature Scaling: The performance depends heavily on how features are scaled.
Extensions for Non-linear Data:
• Kernel methods (e.g., in SVMs) or feature transformations (e.g., polynomial features) can help handle non-linear data while still using a linear classifier approach.
Detail the logistic regression technique
Logistic regression is a supervised learning technique used for binary classification problems, where the output variable can take one of two possible values (e.g., yes/no, 0/1, spam/not spam). Unlike linear regression, logistic regression predicts the probability that a given input belongs to a particular class, mapping the output to a range between 0 and 1 using a sigmoid function.
- Key Concepts
Model Equation
Logistic regression uses the following model:
P(y=1 | \mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x} + b)
where:
• \mathbf{x} : Input feature vector.
• \mathbf{w} : Weight vector (coefficients).
• b : Bias (intercept).
• \sigma(z) : Sigmoid function defined as:
\sigma(z) = \frac{1}{1 + e^{-z}}
The sigmoid function maps any real-valued number to the open interval (0, 1).
Decision Boundary
To classify data points, logistic regression uses a threshold (e.g., 0.5):
• If P(y=1 | \mathbf{x}) \geq 0.5 , classify as class 1.
• Otherwise, classify as class 0.
The decision boundary is a linear hyperplane, defined by:
\mathbf{w}^T\mathbf{x} + b = 0
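A minimal sketch of the model equation and the thresholding rule, assuming NumPy. The weight values and inputs below are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    """P(y=1 | x) for each row of X."""
    return sigmoid(X @ w + b)

def predict(X, w, b, threshold=0.5):
    """Class 1 when the probability reaches the threshold, otherwise class 0."""
    return (predict_proba(X, w, b) >= threshold).astype(int)

# Hypothetical parameters: the decision boundary is the line x1 = x2.
w = np.array([1.0, -1.0])
b = 0.0
X = np.array([[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
probs = predict_proba(X, w, b)
labels = predict(X, w, b)
```

The third point lies exactly on the boundary, so its probability is 0.5 and the rule assigns it to class 1.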
- Training Process
Log-Likelihood Function
The model is trained by maximizing the likelihood of the observed data. The likelihood for a dataset with n samples is:
L(\mathbf{w}, b) = \prod_{i=1}^n P(y_i | \mathbf{x}_i)
Taking the logarithm (log-likelihood) simplifies computation. Writing p_i = P(y=1 | \mathbf{x}_i) :
\log L(\mathbf{w}, b) = \sum_{i=1}^n \Big[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \Big]
Optimization
The log-likelihood function is maximized to find the optimal weights ( \mathbf{w} ) and bias ( b ):
1. Gradient ascent on the log-likelihood (equivalently, gradient descent on the negative log-likelihood), or variants like Stochastic Gradient Descent (SGD), are commonly used to optimize the parameters.
2. The gradients of the log-likelihood with respect to the parameters are computed to update them iteratively:
\mathbf{w} \gets \mathbf{w} + \eta \nabla_{\mathbf{w}} \log L
where \eta is the learning rate.
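The update rule can be sketched as batch gradient ascent, assuming NumPy and labels in {0, 1}. The learning rate, iteration count, and toy data are illustrative assumptions. The code uses the fact that the gradient of the log-likelihood is the sum over samples of x_i (y_i - \sigma(w^T x_i + b)).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(X, y, w, b):
    """Sum of y log p + (1 - y) log(1 - p) over all samples."""
    p = sigmoid(X @ w + b)
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

def fit_logistic(X, y, eta=0.1, n_iters=500):
    """Batch gradient ascent on the log-likelihood."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(n_iters):
        error = y - sigmoid(X @ w + b)  # per-sample term y_i - p_i
        w += eta * (X.T @ error)        # w <- w + eta * gradient w.r.t. w
        b += eta * np.sum(error)        # same update for the bias
    return w, b

# Hypothetical one-feature data with overlapping classes.
X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0]])
y = np.array([0, 0, 1, 0, 1, 1])
w, b = fit_logistic(X, y)
```

The classes overlap on purpose: with perfectly separable data the unregularized weights would keep growing instead of settling.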
- Advantages
• Probabilistic Output: Predicts probabilities, which makes it interpretable and useful in risk-based decision-making.
• Efficient: Works well for linearly separable datasets and is computationally efficient.
• Feature Importance: The learned weights ( \mathbf{w} ) provide insights into feature importance.
- Limitations
• Linear Decision Boundary: Cannot handle non-linear relationships unless features are transformed.
• Imbalanced Data: Can perform poorly if one class dominates the dataset. Techniques like class weighting or oversampling are needed.
• Outliers: Sensitive to outliers, which can significantly affect the decision boundary.
- Extensions
Multinomial Logistic Regression:
For multi-class classification, logistic regression can be extended using the softmax function, which generalizes the sigmoid function to multiple classes.
P(y = k | \mathbf{x}) = \frac{e^{\mathbf{w}_k^T \mathbf{x}}}{\sum_{j=1}^K e^{\mathbf{w}_j^T \mathbf{x}}}
Regularized Logistic Regression:
Adding regularization terms helps prevent overfitting:
• L1 Regularization: Adds \lambda \sum |w_i| (LASSO).
• L2 Regularization: Adds \lambda \sum w_i^2 (Ridge).
- Applications
• Medical Diagnosis: Predicting the presence of a disease (e.g., diabetes).
• Spam Filtering: Classifying emails as spam or not spam.
• Customer Churn Prediction: Identifying customers likely to leave a service.
• Credit Scoring: Determining the likelihood of loan default.
By mapping probabilities to binary outcomes with a linear decision boundary, logistic regression is a simple yet powerful classification tool.
Define the one-versus-rest multiclass classification technique
• For each class, it creates one classifier that predicts the target class against all the others
• Given three classes A, B, C, it computes three models
– One that predicts A against B and C
– One that predicts B against A and C, and
– One that predicts C against A and B
• Then, given an example, all the three classifiers are applied and the label with the highest probability is returned
• An alternative approach is to fit a single model that minimizes the multinomial loss over the entire probability distribution across classes
How can we use logistic regression for multiclass classification?
Logistic regression can be extended to handle multiclass classification problems (where the output has more than two classes) using two main approaches: One-vs-Rest (OvR) and Multinomial Logistic Regression (Softmax Regression). Here’s how they work:
- One-vs-Rest (OvR) Approach
In this method, logistic regression is applied multiple times, once for each class. For a problem with K classes, the approach works as follows:
1. Binary Classifiers: Train K binary logistic regression classifiers, where each classifier distinguishes one class from the rest (e.g., “Class 1 vs. Not Class 1,” “Class 2 vs. Not Class 2,” and so on).
2. Prediction:
• For a new input, each classifier predicts a probability for its respective class.
• The class with the highest probability is assigned as the final prediction:
\hat{y} = \arg\max_{k \in \{1, 2, \dots, K\}} P(y=k | \mathbf{x})
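A compact sketch of OvR, assuming NumPy, classes numbered from 0, and a simple gradient-ascent logistic regression as the binary learner. The helper names and the three-cluster toy data are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_binary(X, y, eta=0.1, n_iters=500):
    """Binary logistic regression trained with averaged gradient ascent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(n_iters):
        error = y - sigmoid(X @ w + b)
        w += eta * (X.T @ error) / len(X)
        b += eta * np.mean(error)
    return w, b

def fit_ovr(X, y, n_classes):
    """Train one 'class k vs. rest' model per class."""
    return [fit_binary(X, (y == k).astype(float)) for k in range(n_classes)]

def predict_ovr(X, models):
    """Apply all K classifiers and return the class with the highest probability."""
    scores = np.column_stack([sigmoid(X @ w + b) for w, b in models])
    return np.argmax(scores, axis=1)

# Hypothetical data: three well-separated clusters.
X = np.array([[0.0, 0.0], [0.2, 0.1], [3.0, 0.0],
              [3.1, 0.2], [0.0, 3.0], [0.1, 3.2]])
y = np.array([0, 0, 1, 1, 2, 2])
models = fit_ovr(X, y, n_classes=3)
pred = predict_ovr(X, models)
```

The K probabilities come from independently trained models, so they need not sum to 1; only their ranking is used.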
Advantages of OvR:
• Simple to implement using binary logistic regression.
• Efficient for problems with a small number of classes.
Limitations of OvR:
• Can be computationally expensive for large numbers of classes (since K models are trained).
• May not perform as well if the classes are highly imbalanced.
- Multinomial Logistic Regression (Softmax Regression)
This is the direct extension of logistic regression for multiclass classification, where a single model predicts the probabilities for all K classes simultaneously. It uses the softmax function to ensure the output probabilities for all classes sum to 1.
Model
For a dataset with K classes, the probability of a data point \mathbf{x} belonging to class k is given by:
P(y = k | \mathbf{x}) = \frac{\exp(\mathbf{w}_k^T \mathbf{x} + b_k)}{\sum_{j=1}^K \exp(\mathbf{w}_j^T \mathbf{x} + b_j)}
where:
• \mathbf{w}_k and b_k are the weight vector and bias for class k .
• The denominator normalizes the probabilities.
Decision Rule
The predicted class is the one with the highest probability:
\hat{y} = \arg\max_{k \in \{1, 2, \dots, K\}} P(y = k | \mathbf{x})
Training
The model is trained by maximizing the log-likelihood for all classes. For n samples, the log-likelihood is:
\log L = \sum_{i=1}^n \sum_{k=1}^K \mathbf{1}(y_i = k) \log P(y_i = k | \mathbf{x}_i)
where \mathbf{1}(y_i = k) is an indicator function (1 if y_i = k , 0 otherwise).
Optimization is done using methods like Gradient Descent or Stochastic Gradient Descent.
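The model, log-likelihood, and training loop can be sketched together, assuming NumPy and classes numbered from 0. The learning rate, iteration count, and toy data are illustrative assumptions.

```python
import numpy as np

def softmax(Z):
    """Row-wise softmax; subtracting the row maximum only improves stability."""
    Z = Z - Z.max(axis=1, keepdims=True)
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=1, keepdims=True)

def log_likelihood(X, y, W, b):
    """Sum of log P(y_i | x_i): the indicator picks out the true-class column."""
    P = softmax(X @ W + b)
    return float(np.sum(np.log(P[np.arange(len(y)), y])))

def fit_softmax(X, y, n_classes, eta=0.1, n_iters=500):
    """Averaged gradient ascent on the multinomial log-likelihood."""
    W, b = np.zeros((X.shape[1], n_classes)), np.zeros(n_classes)
    Y = np.eye(n_classes)[y]            # one-hot rows encode 1(y_i = k)
    for _ in range(n_iters):
        error = Y - softmax(X @ W + b)  # gradient w.r.t. the class scores
        W += eta * (X.T @ error) / len(X)
        b += eta * error.mean(axis=0)
    return W, b

# Hypothetical data: three well-separated clusters.
X = np.array([[0.0, 0.0], [0.2, 0.1], [3.0, 0.0],
              [3.1, 0.2], [0.0, 3.0], [0.1, 3.2]])
y = np.array([0, 0, 1, 1, 2, 2])
W, b = fit_softmax(X, y, n_classes=3)
pred = np.argmax(softmax(X @ W + b), axis=1)
```

Each column of `W` plays the role of one \mathbf{w}_k, and all of them are updated jointly in a single model.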
Advantages of Softmax Regression:
• Single model handles all classes.
• Provides probabilistic outputs for all classes.
• Works well for balanced and separable datasets.
Limitations of Softmax Regression:
• Computationally expensive for datasets with a large number of classes.
• Assumes linear separability in the feature space.
- Regularization
To prevent overfitting, regularization can be applied to both approaches:
• L1 Regularization (LASSO): Encourages sparsity in weights.
• L2 Regularization (Ridge): Penalizes large weights to improve generalization.
- Comparison of OvR and Softmax
Feature | OvR | Softmax Regression
Number of Models | K binary models | 1 multinomial model
Training Complexity | Linear in K | More complex (joint training for all classes)
Output | Class probabilities for each binary model | Probabilities for all classes in one step
Use Case | Few classes, simpler datasets | Balanced and larger datasets
- Applications
• Image Classification: Recognizing objects (e.g., dog, cat, car) in images.
• Document Classification: Classifying documents into categories (e.g., sports, technology, politics).
• Medical Diagnosis: Predicting types of diseases or conditions.
By choosing between OvR and Softmax Regression based on the dataset and problem requirements, logistic regression becomes a versatile tool for multiclass classification tasks.
Define the confusion matrix and its attributes. What is the importance of distinguishing the different types of errors?
Confusion Matrix: Definition
A confusion matrix is a tool used to evaluate the performance of a classification model. It provides a summary of the predictions made by the model compared to the actual labels in the dataset. It breaks down the outcomes into four categories: True Positives, True Negatives, False Positives, and False Negatives, which give insight into the types of errors the model makes.
Attributes of the Confusion Matrix
1. True Positives (TP):
• Instances where the model correctly predicts the positive class.
• For example, the model predicts “disease present” when the disease is indeed present.
2. True Negatives (TN):
• Instances where the model correctly predicts the negative class.
• For example, the model predicts “no disease” when there is no disease.
3. False Positives (FP):
• Instances where the model incorrectly predicts the positive class.
• For example, the model predicts “disease present” when there is no disease.
• This is also known as a Type I error or a “false alarm.”
4. False Negatives (FN):
• Instances where the model incorrectly predicts the negative class.
• For example, the model predicts “no disease” when the disease is present.
• This is also known as a Type II error or a “miss.”
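The four counts can be obtained by comparing true and predicted labels, as in this sketch. It assumes NumPy and labels encoded as 1 (positive) and 0 (negative); the label vectors are made up.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP and FN for binary labels (1 = positive, 0 = negative)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # Type I error
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # Type II error
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}

# Hypothetical labels for eight patients.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
counts = confusion_counts(y_true, y_pred)
# → {"TP": 3, "TN": 3, "FP": 1, "FN": 1}
```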
Importance of Distinguishing Different Types of Errors
1. Context-Specific Impact:
• The severity of false positives and false negatives depends on the application.
• In medical diagnosis, a false negative (missing a disease) may be life-threatening, while a false positive (incorrectly diagnosing a disease) may cause unnecessary anxiety and tests.
2. Decision-Making:
• By understanding the types of errors, we can adjust the model to minimize the more critical error type. For example, in fraud detection, reducing false negatives (undetected fraud) is often more important than reducing false positives (flagging legitimate transactions as fraud).
3. Model Evaluation:
• Metrics like precision, recall, and F1-score depend on these error types. For instance, precision focuses on minimizing false positives, while recall emphasizes reducing false negatives.
4. Imbalanced Datasets:
• In datasets with imbalanced classes (e.g., rare diseases), accuracy alone can be misleading. Distinguishing errors helps ensure the model is evaluated based on how well it handles the minority class.
5. Real-World Implications:
• Understanding and balancing the trade-offs between false positives and false negatives ensures the model’s outputs align with the desired outcomes in practical scenarios.
By analyzing the confusion matrix, we can fine-tune a model to achieve a balance that best fits the specific goals and constraints of the application.
How can we use the confusion matrix to calculate the accuracy of a model? Why is that not enough? In which situations is accuracy not effective or useful?
Using the Confusion Matrix to Calculate Accuracy
The accuracy of a model measures the proportion of correct predictions (both true positives and true negatives) out of the total predictions. It can be calculated from the confusion matrix using the formula:
\text{Accuracy} = \frac{\text{True Positives (TP)} + \text{True Negatives (TN)}}{\text{Total Predictions (TP + TN + FP + FN)}}
In simple terms, it is the ratio of correctly classified instances (both positive and negative) to the total number of instances in the dataset.
Why Accuracy Is Not Always Enough
Although accuracy is intuitive and easy to calculate, it does not always provide a complete picture of model performance. This is because:
1. Class Imbalance:
• In datasets where one class dominates (e.g., 95% of samples belong to Class A and only 5% to Class B), a model that always predicts Class A will achieve 95% accuracy but will fail completely at identifying Class B.
• In such cases, accuracy is misleading because it does not account for the model’s ability to correctly classify the minority class.
2. No Insight into Error Types:
• Accuracy does not distinguish between false positives (Type I errors) and false negatives (Type II errors). For certain applications, one type of error might be far more critical than the other.
• Example: In cancer detection, missing a cancer case (false negative) is more serious than falsely diagnosing cancer (false positive).
3. Lack of Granularity:
• Accuracy is a single metric and does not provide insights into specific aspects of the model’s performance, such as precision, recall, or the trade-offs between them.
4. Overfitting and Bias:
• High accuracy on the training data might indicate overfitting or bias in the dataset, where the model memorizes patterns instead of generalizing well.
Situations Where Accuracy Is Not Effective
1. Imbalanced Datasets:
• Example: In fraud detection, where only 1% of transactions are fraudulent, a model predicting all transactions as “non-fraudulent” will have 99% accuracy but will fail to detect any fraud cases.
2. High Cost of Specific Errors:
• Example: In medical diagnosis, missing a disease (false negative) might have serious consequences, even if the model achieves high accuracy overall.
3. Multi-Class Problems:
• In multi-class classification, accuracy alone does not reveal which classes are being misclassified and whether certain classes are disproportionately affected.
4. Anomalies and Rare Events:
• Example: In cybersecurity, detecting rare attacks is crucial, and a high accuracy model might fail to identify these rare cases effectively.
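The fraud-detection example can be checked numerically. This sketch assumes NumPy and a hypothetical dataset of 1000 transactions, 10 of them fraudulent, scored by a model that always predicts "non-fraudulent".

```python
import numpy as np

# Hypothetical labels: 1 = fraudulent (1% of cases), 0 = legitimate.
y_true = np.zeros(1000, dtype=int)
y_true[:10] = 1
y_pred = np.zeros(1000, dtype=int)  # the model never flags fraud

tp = int(np.sum((y_true == 1) & (y_pred == 1)))
tn = int(np.sum((y_true == 0) & (y_pred == 0)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))

accuracy = (tp + tn) / (tp + tn + fp + fn)  # 0.99
recall = tp / (tp + fn)                     # 0.0: no fraud case is detected
```

The accuracy looks excellent, yet every one of the ten fraud cases is a false negative.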
Better Alternatives to Accuracy
When accuracy is not effective, other metrics derived from the confusion matrix are more informative:
1. Precision:
• Focuses on the reliability of positive predictions ( \frac{TP}{TP + FP} ).
• Useful when false positives are costly (e.g., spam filtering).
2. Recall (Sensitivity):
• Measures the ability to identify all actual positives ( \frac{TP}{TP + FN} ).
• Useful when false negatives are costly (e.g., medical diagnosis).
3. F1-Score:
• Combines precision and recall into a single metric ( 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} ).
• Useful for imbalanced datasets.
4. Specificity:
• Measures the ability to identify actual negatives ( \frac{TN}{TN + FP} ).
• Important when false positives need to be minimized.
5. ROC-AUC (Receiver Operating Characteristic - Area Under Curve):
• Evaluates the trade-off between true positive and false positive rates across different thresholds.
By considering these metrics alongside accuracy, we gain a more comprehensive understanding of model performance, especially in critical or imbalanced scenarios.
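Most of the metrics listed above follow directly from the four confusion-matrix counts, as this small sketch shows. The function name and the example counts are hypothetical; ROC-AUC is left out because it needs predicted scores at several thresholds, not a single set of counts.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, specificity and F1 from the counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "specificity": specificity, "f1": f1}

# Hypothetical counts for 100 predictions.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

With these counts the recall is 0.8 and the specificity is 0.9, while the accuracy of 0.85 sits between them and hides the difference.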