Supervised - regression Flashcards
What is Regression analysis
A type of predictive modeling technique used to find the relationship between a dependent variable (usually known as the “Y” variable) and either one independent variable (the “X” variable) or a series of independent variables.
Objective / what is a Regression model for?
It estimates the nature of the relationship between the independent and dependent variables
❖ Strength of the relationship and its significance
Explanatory variables?
independent variables
variables to be explained?
dependent variables
Regression analysis can be used for
- Predicting specific outcomes from changes, like estimating production needs for a product.
- Projecting future trends or values, such as forecasting stock prices.
- Assessing the influence of different factors on an outcome, like measuring the impact of advertising on sales during an event.
The simplest model is
linear regression
linear regression objective
Fit the data with the best hyperplane (a straight line in the simple case) which “goes through” the points
In the following regression model: y = β0 + β1x + ε
1. What is the relationship between x and y?
2. What are the two parameters to estimate?
3. What is ε?
- A linear or straight-line relationship
- The slope of the line β1 and the y-intercept β0 (estimated by least squares)
- ε is the unexplained, random, or error component
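A minimal sketch in Python of fitting this model by least squares; the synthetic data, noise level, and “true” coefficients below are assumptions for illustration only:

```python
import numpy as np

# Synthetic sample: y depends linearly on x plus random noise (the error component)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=50)  # assumed "true" b0 = 2.0, b1 = 0.5

# Least-squares estimates of the slope (b1) and the y-intercept (b0)
b1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0_hat = y.mean() - b1_hat * x.mean()

# Residuals approximate the unexplained, random error component
residuals = y - (b0_hat + b1_hat * x)
print(f"b0 ~ {b0_hat:.3f}, b1 ~ {b1_hat:.3f}")
```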
The estimates are determined by
❖ Drawing a sample from the population of interest
❖ Calculating sample statistics
❖ Producing a straight line that cuts into the data
The best line is
the one that minimizes the Sum of Squared Differences (SSD) between the points and the line
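As a short, self-contained illustration (toy data assumed, not from the source) that the least-squares line scores a smaller SSD than any other line you might draw:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])  # assumed toy data

def ssd(b0, b1):
    """Sum of squared differences between the points and the line y = b0 + b1*x."""
    return float(np.sum((y - (b0 + b1 * x)) ** 2))

# np.polyfit with deg=1 returns the least-squares [slope, intercept]
b1_hat, b0_hat = np.polyfit(x, y, deg=1)

print("SSD of the least-squares line:", ssd(b0_hat, b1_hat))
print("SSD of an arbitrary line     :", ssd(0.0, 1.5))  # noticeably larger
```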
The regression line’s representation of the data is evaluated through several methods
Coefficient of Determination (R-squared): This metric quantifies how much of the variance in the dependent variable is explained by the independent variable. Higher R-squared values, closer to 1, indicate better explanatory power.
Residual Analysis: By examining the differences between actual and predicted values (residuals), we assess how well the model captures the data’s variability. Smaller residuals suggest a better fit.
Visual Inspection: Plotting the regression line on a scatter plot allows for a visual assessment of its fit. A good fit is indicated by the line passing through the center of data points, capturing the overall trend.
Significance of Regression Coefficients: Evaluating the significance of coefficients, particularly the slope, determines if the relationship between variables is statistically meaningful. If coefficients are significant, the model provides valid insights into the relationship between X and Y.
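A hedged Python sketch of how these checks might be run with statsmodels; the synthetic data and coefficients here are assumptions, not values from the card:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
y = 3.0 + 1.5 * x + rng.normal(0, 2, size=100)  # assumed synthetic relationship

X = sm.add_constant(x)              # adds the intercept column
model = sm.OLS(y, X).fit()

print(model.rsquared)               # Coefficient of Determination (R-squared)
print(model.resid[:5])              # residuals: actual minus predicted values
print(model.pvalues)                # significance of the intercept and the slope
# Plotting x vs. y with model.fittedvalues overlaid gives the visual inspection step.
```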
What is the Coefficient of Determination?
A measure of how well the regression line represents the data
❖ The percentage of variability in Y that can be explained by variability in X
❖ The further the line is away from the points, the less of the variability it can explain
❖ The Coefficient of Determination lies between 0 and 1
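As a small numeric sketch (the actual and predicted values below are made up for illustration), the Coefficient of Determination can be computed as one minus the unexplained variability over the total variability in Y:

```python
import numpy as np

y     = np.array([2.1, 2.9, 4.2, 4.8, 6.1])   # actual values (assumed)
y_hat = np.array([2.0, 3.0, 4.0, 5.0, 6.0])   # predictions from the line (assumed)

ss_res = np.sum((y - y_hat) ** 2)              # variability left unexplained by the line
ss_tot = np.sum((y - y.mean()) ** 2)           # total variability in Y
r_squared = 1 - ss_res / ss_tot                # close to 1 when the line explains most of the variability
print(round(r_squared, 3))
```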
Outliers, Investigate possibilities, Identify outliers from, residual, standard residual
Outliers: Unusually small or large observations
Investigate possibilities: recording error, sample membership, validity
Identify outliers from scatter diagram
Suspect outlier if |standard residual| > 2
Residual: Difference between actual value and estimated value
Standard residual: Residual divided by standard deviation of residuals
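A minimal Python sketch of this rule; the data are assumed, with one value planted far off the line so the flagging step has something to find:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([1.1, 2.0, 2.9, 4.2, 5.1, 5.8, 14.0, 8.1])  # 14.0 is a planted outlier

# Fit the line, then compute residuals (actual value minus estimated value)
b1, b0 = np.polyfit(x, y, deg=1)
residuals = y - (b0 + b1 * x)

# Standard residual = residual divided by the standard deviation of the residuals
standard_residuals = residuals / residuals.std(ddof=1)

# Suspect an outlier where |standard residual| > 2
print(np.where(np.abs(standard_residuals) > 2)[0])  # flags the planted point
```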
One of the most challenging aspects of machine learning is
finding the right set of features, or variables, that can accurately capture the relationship between inputs and outputs
define feature selection
The process of selecting a subset of relevant features from the original set of features to improve model performance
❖ In essence, it is about identifying the most informative features that can help the model make accurate predictions
popular techniques for feature selection
stepwise regression
what is stepwise regression
a method that iteratively adds or removes features from a model based on their statistical significance
When is stepwise regression stopped?
The process is repeated until a set of features that maximizes model performance is identified
advantages of stepwise
It is useful when dealing with a large number of features, as it can help reduce the number of features in the model without sacrificing accuracy
limitation of stepwise
It assumes that the relationship between the features and the target variable is linear, which may not always be the case in real-world scenarios
Types of Stepwise Regression
- Forward selection: Starts with empty feature set, adds most statistically significant feature iteratively until model performance can’t be improved further.
- Backward elimination: Begins with full feature set, removes least statistically significant feature iteratively until model performance can’t be improved further.
- Bidirectional elimination: Combines forward and backward selection, alternates between adding and removing features until no further improvements in model performance can be made.
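A hedged scikit-learn sketch of forward vs. backward selection using SequentialFeatureSelector; note that this selector adds or removes features based on cross-validated score rather than statistical significance, so it is a variation on the stepwise idea above, and the dataset and feature count are assumptions for illustration:

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)
lr = LinearRegression()

# Forward selection: start with an empty set and add features one at a time
forward = SequentialFeatureSelector(lr, n_features_to_select=4, direction="forward").fit(X, y)

# Backward elimination: start with the full set and remove features one at a time
backward = SequentialFeatureSelector(lr, n_features_to_select=4, direction="backward").fit(X, y)

print(forward.get_support())    # boolean mask of the features each run kept
print(backward.get_support())
```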
Forward Selection steps
- Start with empty or intercept-only model.
- Conduct separate regressions for each predictor.
- Select predictor with strongest relationship to target.
- Add selected predictor to model.
- Repeat with remaining predictors.
- Stop based on predefined rule or validation metrics.
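A minimal sketch of those steps with statsmodels; the function name, the 0.05 significance threshold, and the p-value stopping rule are illustrative assumptions rather than the only way to define the rule:

```python
import pandas as pd
import statsmodels.api as sm

def forward_selection(X: pd.DataFrame, y: pd.Series, alpha: float = 0.05) -> list:
    """Greedy forward selection: add the most significant remaining predictor each pass."""
    selected, remaining = [], list(X.columns)
    while remaining:
        # Conduct a separate regression for each candidate predictor
        pvals = {}
        for col in remaining:
            model = sm.OLS(y, sm.add_constant(X[selected + [col]])).fit()
            pvals[col] = model.pvalues[col]
        # Select the predictor with the strongest relationship to the target
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:          # stopping rule: no significant predictor left
            break
        selected.append(best)
        remaining.remove(best)
    return selected
```

Each pass fits one regression per remaining predictor, keeps the one with the smallest p-value, and stops once nothing clears the threshold, mirroring steps 2, 3, and the stopping rule above.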