Supervised - regression Flashcards
What is regression analysis?
A type of predictive modeling technique used to find the relationship between a dependent variable (usually known as the “Y” variable) and either one independent variable (the “X” variable) or a series of independent variables
Objective of a regression model (what is it for?)
❖ Estimates the nature of the relationship between the independent and dependent variables
❖ Strength of the relationship and its significance
Explanatory variables?
independent variables
Variables to be explained?
dependent variables
Regression analysis can be used for
- Predicting specific outcomes from changes, like estimating production needs for a product.
- Projecting future trends or values, such as forecasting stock prices.
- Assessing the influence of different factors on an outcome, like measuring the impact of advertising on sales during an event.
The simplest model is
linear regression
linear regression objective
Fit the data with the best hyperplane (a straight line when there is a single independent variable) that “goes through” the points
In the regression model y = β0 + β1x + e:
1. What is the relationship between x and y?
2. What are the two parameters to estimate?
3. What is e?
- A linear (straight-line) relationship
- The slope of the line, β1, and the y-intercept, β0 (both estimated by least squares)
- e is the unexplained, random, or error component
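For reference, the least-squares estimates mentioned above have a standard closed form when there is a single independent variable. The notation here (sample size n, sample means \bar{x} and \bar{y}, hats marking estimates) is an assumption added for illustration and does not come from the cards:

```latex
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
```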
The estimates are determined by
❖ Drawing a sample from the population of interest
❖ Calculating sample statistics
❖ Producing a straight line that cuts through the data
The best line is
the one that minimizes the Sum of Squared Differences (SSD) between the points and the line
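A minimal sketch of the idea, assuming a single independent variable and a small made-up sample. The function names `fit_line` and `ssd` are hypothetical helpers, and the closed-form formulas are the standard least-squares solution rather than anything stated on the cards:

```python
def fit_line(xs, ys):
    """Return (b0, b1), the intercept and slope that minimize the SSD."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sample statistics: co-variation of x and y, and variation of x.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_mean - b1 * x_mean
    return b0, b1


def ssd(xs, ys, b0, b1):
    """Sum of squared differences between the points and the line."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Made-up sample, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_line(xs, ys)
best = ssd(xs, ys, b0, b1)
print(f"intercept={b0:.2f}, slope={b1:.2f}, SSD={best:.3f}")
```

Moving either the intercept or the slope away from the fitted values makes the SSD larger, which is what "best line" means here.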
How well the regression line represents the data is evaluated through several methods
Coefficient of Determination (R-squared): This metric quantifies how much of the variance in the dependent variable is explained by the independent variable. Higher R-squared values, closer to 1, indicate better explanatory power.
Residual Analysis: By examining the differences between actual and predicted values (residuals), we assess how well the model captures the data’s variability. Smaller residuals suggest a better fit.
Visual Inspection: Plotting the regression line on a scatter plot allows for a visual assessment of its fit. A good fit is indicated by the line passing through the center of data points, capturing the overall trend.
Significance of Regression Coefficients: Evaluating the significance of coefficients, particularly the slope, determines if the relationship between variables is statistically meaningful. If the slope is significantly different from zero, there is statistical evidence of a linear relationship between X and Y.
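One common way to check the significance of the slope is a t-test. The sketch below is an illustration under assumptions: the data is made up, `slope_t_statistic` is a hypothetical helper, and the critical value 3.182 is the two-sided 5% value of Student's t with 3 degrees of freedom taken from a standard t table. The cards do not prescribe this particular test.

```python
import math


def slope_t_statistic(xs, ys):
    """t statistic for testing whether the slope differs from zero."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_mean - b1 * x_mean
    # Residual sum of squares around the fitted line.
    ss_res = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    # Standard error of the slope, using n - 2 degrees of freedom.
    std_error = math.sqrt(ss_res / (n - 2) / sxx)
    return b1 / std_error


# Made-up sample, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

t_stat = slope_t_statistic(xs, ys)
T_CRITICAL = 3.182  # two-sided 5% level, n - 2 = 3 degrees of freedom
slope_is_significant = abs(t_stat) > T_CRITICAL
print(f"t = {t_stat:.2f}, significant: {slope_is_significant}")
```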
What is the coefficient of determination?
❖ A measure of how well the regression line represents the data
❖ The percentage of variability in Y that can be explained by variability in X
❖ The further the line is from the points, the less of the variability it can explain
❖ The coefficient of determination lies between 0 and 1
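As a rough illustration, R-squared can be computed as one minus the ratio of unexplained variation to total variation. The helper name `r_squared` and the sample values are assumptions made for this sketch; the predicted values are those of the line y = 0.05 + 1.99x on made-up data.

```python
def r_squared(actual, predicted):
    """Share of the variability in `actual` explained by `predicted`."""
    y_mean = sum(actual) / len(actual)
    # Variation left unexplained by the line.
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    # Total variation of the observed values around their mean.
    ss_tot = sum((y - y_mean) ** 2 for y in actual)
    return 1 - ss_res / ss_tot


# Made-up observations and the values predicted by y = 0.05 + 1.99x.
actual = [2.1, 3.9, 6.2, 7.8, 10.1]
predicted = [2.04, 4.03, 6.02, 8.01, 10.00]

r2 = r_squared(actual, predicted)
print(f"R-squared = {r2:.4f}")
```

A line that passes exactly through every point gives 1, while predicting the mean of Y for every observation gives 0.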
Outliers: definition, possibilities to investigate, how to identify them, residual, standard residual
Outliers: Unusually small or large observations
Investigate possibilities: recording error, sample membership, validity
Identify outliers from a scatter diagram
Suspect outlier if |standard residual| > 2
Residual: Difference between actual value and estimated value
Standard residual: Residual divided by standard deviation of residuals
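The rule above can be sketched as follows. This is a minimal illustration: `standard_residuals` and `suspect_outliers` are hypothetical names, the data is made up, and the sample standard deviation (`statistics.stdev`) is an assumption, since some textbooks divide by the standard error of estimate instead.

```python
import statistics


def standard_residuals(actual, predicted):
    """Each residual divided by the standard deviation of all residuals."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    spread = statistics.stdev(residuals)  # assumption: sample std. deviation
    return [r / spread for r in residuals]


def suspect_outliers(actual, predicted, threshold=2.0):
    """Indices of observations whose |standard residual| exceeds the threshold."""
    scores = standard_residuals(actual, predicted)
    return [i for i, z in enumerate(scores) if abs(z) > threshold]


# Made-up values; the last observation sits far above its predicted value.
predicted = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]
actual = [10.5, 11.6, 14.3, 15.4, 18.2, 19.7, 22.4, 23.8, 26.1, 34.0]

suspects = suspect_outliers(actual, predicted)
print("suspected outliers at positions:", suspects)
```

A flagged observation is only a suspect; the possibilities listed above (recording error, sample membership, validity) still need to be investigated.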
One of the most challenging aspects of machine learning is
finding the right set of features, or variables, that can accurately capture the relationship between inputs and outputs
Define feature selection
❖ The process of selecting a subset of relevant features from the original set of features to improve model performance
❖ In essence, it is about identifying the most informative features that can help the model make accurate predictions
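One simple way to pick a subset is to rank the features by how strongly each correlates with the target and keep the top few. This filter approach is only one of many feature selection methods and is not named on the cards; the helper names `pearson` and `select_top_features`, the feature names, and the data are all made up for this sketch.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    x_spread = sum((x - x_mean) ** 2 for x in xs) ** 0.5
    y_spread = sum((y - y_mean) ** 2 for y in ys) ** 0.5
    return cov / (x_spread * y_spread)


def select_top_features(features, target, k):
    """Names of the k features with the largest |correlation| to the target."""
    scores = {name: abs(pearson(values, target))
              for name, values in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical features and target, for illustration only.
features = {
    "ad_spend": [1, 2, 3, 4, 5],
    "store_size": [3, 1, 4, 1, 5],
    "noise": [2, 2, 3, 2, 2],
}
sales = [2.1, 3.9, 6.2, 7.8, 10.1]

selected = select_top_features(features, sales, k=2)
print("selected features:", selected)
```

Ranking by correlation looks at each feature in isolation, so it can miss features that are only informative in combination with others.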