Regression Flashcards
What is a linear model *
A linear model is a mathematical representation that describes the relationship between one or more predictor variables and a response variable, with the response modeled as a linear function of the model parameters.
What are the steps to model fitting
What is model fitting *
Model fitting is the process of adjusting the parameters of a statistical model to best match the observed data. The goal is to find the parameter values that minimize the discrepancy between the model's predictions and the actual data.
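As a rough illustration of "minimizing the discrepancy", the sketch below fits a straight line to a small made-up dataset. It assumes the discrepancy is measured by the sum of squared errors (the least squares criterion); the data and the helper names `fit_line` and `sum_squared_errors` are hypothetical and exist only for this example.

```python
def fit_line(x, y):
    """Fit y = b0 + b1*x by ordinary least squares; return (b0, b1)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Sums of squares and cross-products around the means
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b1 = sxy / sxx              # slope
    b0 = y_mean - b1 * x_mean   # intercept
    return b0, b1


def sum_squared_errors(x, y, b0, b1):
    """Discrepancy between the line's predictions and the observed data."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


# Made-up toy data (assumption for illustration only)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = fit_line(x, y)
sse_fit = sum_squared_errors(x, y, b0, b1)
# Nudging the slope away from the fitted value should increase the error
sse_other = sum_squared_errors(x, y, b0, b1 + 0.1)
print(b0, b1, sse_fit, sse_other)
```

For this toy data the fitted line is approximately y = 2.2 + 0.6x, and any other slope or intercept gives a larger sum of squared errors.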
What is the least squares approach
What is the equation for a line on a graph
What is the slope coefficient and what is it used for *
The slope coefficient gives the expected change in the response variable for a one-unit change in the predictor variable (holding any other predictors constant).
The slope coefficient quantifies the direction and size of the relationship between the predictor variable(s) and the response variable. A positive coefficient indicates a positive relationship (as the predictor variable increases, the response variable tends to increase), while a negative coefficient indicates a negative relationship (as the predictor increases, the response tends to decrease).
What is the intercept coefficient and what is it used for *
Model Interpretation: The intercept provides the baseline value of the response variable when all predictor variables are absent or have a value of zero. It represents the starting point of the regression line or plane.
Centering Data: In some cases, centering the predictor variables around their mean can be beneficial for interpretation, and the intercept then represents the estimated response value when the predictor variables are at their mean values.
Extrapolation: The intercept can be used for extrapolation beyond the observed range of predictor variables. However, caution should be exercised in extrapolating beyond the range of observed data, as the model’s validity may not hold outside this range.
Model Specification: The intercept helps define the structure of the regression model. Including an intercept term allows the regression line or plane to deviate from passing through the origin, accommodating situations where the response variable does not start at zero when all predictor variables are zero.
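The effect of centering on the intercept can be checked numerically. The sketch below is a minimal example on made-up data, assuming a single predictor and an ordinary least squares fit; `fit_line` is a hypothetical helper written for this card, not part of any library.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = b0 + b1*x; returns (b0, b1)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_mean - b1 * x_mean, b1


# Made-up toy data (assumption for illustration only)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# Center the predictor around its mean
x_mean = sum(x) / len(x)
x_centered = [xi - x_mean for xi in x]

b0_raw, b1_raw = fit_line(x, y)           # intercept = predicted y at x = 0
b0_ctr, b1_ctr = fit_line(x_centered, y)  # intercept = predicted y at x = mean(x)
print(b0_raw, b1_raw)
print(b0_ctr, b1_ctr)
```

Here the raw intercept is about 2.2 (the predicted response at x = 0), while the centered intercept is about 4.0, which is the mean of y. The slope is the same in both fits.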
How is epsilon calculated (errors)
What does epsilon tell you
What is the coefficient of determination and what is it used for
The coefficient of determination, often denoted as R squared, is a statistical measure that represents the proportion of the variance in the dependent variable (response variable) that is explained by the independent variables (predictor variables) in a regression model. In other words, R squared quantifies the goodness of fit of the regression model to the observed data.
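A minimal sketch of the calculation, assuming R squared is computed as 1 minus the ratio of the residual sum of squares to the total sum of squares. The data are made up, and the fitted values are assumed to come from the least squares line y = 2.2 + 0.6x for that data; `r_squared` is a hypothetical helper name.

```python
def r_squared(y, fitted):
    """Proportion of the variance in y explained by the fitted values."""
    y_mean = sum(y) / len(y)
    ss_total = sum((yi - y_mean) ** 2 for yi in y)               # total variability
    ss_resid = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # unexplained part
    return 1 - ss_resid / ss_total


# Made-up toy data and its least squares line (assumptions for illustration)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
fitted = [2.2 + 0.6 * xi for xi in x]

r2 = r_squared(y, fitted)
print(r2)
```

For this data the result is about 0.6, meaning roughly 60% of the variability in y is explained by the line.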
What is the equation for adjusted R squared
How do you test the significance of regression *
To test the overall significance of the regression model, you typically perform an F-test based on the analysis of variance (ANOVA) decomposition. The null hypothesis (H0) for this test is that all slope coefficients (every coefficient except the intercept) are equal to zero, meaning that the predictors do not explain any variability in the response variable. The alternative hypothesis (Ha) is that at least one slope coefficient is not equal to zero, indicating that the model explains some variability in the response variable.
Calculate the F-statistic: The ANOVA test calculates an F-statistic, which compares the variability explained by the regression model to the variability not explained by the model.
Determine the Critical Value: Determine the critical value of the F-statistic based on the chosen significance level (α) and the two degrees of freedom of the test: the model (numerator) degrees of freedom and the residual (denominator) degrees of freedom.
Make a Decision: If the calculated F-statistic is greater than the critical value, you reject the null hypothesis and conclude that the regression model is significant. Otherwise, you fail to reject the null hypothesis.
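The three steps above can be sketched as follows. This is an illustrative example on made-up data, assuming one predictor and fitted values from the least squares line y = 2.2 + 0.6x. The critical value 10.13 is the standard F-table value for α = 0.05 with 1 and 3 degrees of freedom; it is typed in by hand here rather than computed, and `f_statistic` is a hypothetical helper name.

```python
def f_statistic(y, fitted, n_predictors):
    """Return (F, df_model, df_resid) for the overall regression F-test."""
    n = len(y)
    y_mean = sum(y) / n
    ss_model = sum((fi - y_mean) ** 2 for fi in fitted)          # explained
    ss_resid = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # unexplained
    df_model = n_predictors
    df_resid = n - n_predictors - 1
    f_value = (ss_model / df_model) / (ss_resid / df_resid)
    return f_value, df_model, df_resid


# Made-up toy data and its least squares line (assumptions for illustration)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
fitted = [2.2 + 0.6 * xi for xi in x]

f_value, df_model, df_resid = f_statistic(y, fitted, n_predictors=1)

# Critical value looked up in an F table: alpha = 0.05, df = (1, 3)
F_CRITICAL = 10.13
reject_h0 = f_value > F_CRITICAL
print(f_value, df_model, df_resid, reject_h0)
```

With this data the F-statistic is about 4.5, which is below the critical value, so the null hypothesis is not rejected at the 5% level. With only five observations that outcome is unsurprising.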
Testing the Significance of Individual Coefficients:
To test the significance of individual coefficients (slope coefficients), you typically use t-tests. The null hypothesis (H0) for each coefficient test is that the corresponding coefficient is equal to zero, indicating that the predictor variable does not have a significant effect on the response variable. The alternative hypothesis (Ha) is that the coefficient is not equal to zero, indicating a significant effect.
Calculate the t-statistic: For each coefficient, calculate a t-statistic by dividing the estimated coefficient value by its standard error.
Determine the Critical Value: Determine the critical value of the t-statistic based on the chosen significance level (α) and the residual degrees of freedom of the regression model (the number of observations minus the number of estimated coefficients, including the intercept).
Make a Decision: If the absolute value of the calculated t-statistic is greater than the critical value, you reject the null hypothesis and conclude that the coefficient is significant. Otherwise, you fail to reject the null hypothesis.
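A minimal sketch of the slope test for a model with a single predictor, on made-up data. The critical value 3.182 is the standard t-table value for a two-sided test at α = 0.05 with 3 degrees of freedom; it is entered by hand rather than computed, and `slope_t_statistic` is a hypothetical helper name.

```python
import math


def slope_t_statistic(x, y):
    """Return (t, df_resid) for testing H0: slope = 0 in y = b0 + b1*x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_mean - b1 * x_mean
    ss_resid = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    df_resid = n - 2                              # two estimated coefficients
    se_b1 = math.sqrt((ss_resid / df_resid) / sxx)  # standard error of the slope
    return b1 / se_b1, df_resid


# Made-up toy data (assumption for illustration only)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

t_value, df_resid = slope_t_statistic(x, y)

# Critical value looked up in a t table: two-sided alpha = 0.05, df = 3
T_CRITICAL = 3.182
reject_h0 = abs(t_value) > T_CRITICAL
print(t_value, df_resid, reject_h0)
```

Here t is about 2.12, which is below the critical value, so the slope is not significant at the 5% level. With a single predictor, the square of this t-statistic equals the overall F-statistic.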
What are the assumptions of model validation
What are diagnostic plots *
Diagnostic plots, also known as residual plots, are graphical tools used in regression analysis to assess the assumptions and validity of the regression model. These plots allow you to examine the residuals (the differences between the observed values and the values predicted by the model) to identify any patterns or deviations that may indicate problems with the model. Here are some common diagnostic plots:
Residuals vs. Fitted Values Plot:
This plot shows the residuals (vertical axis) plotted against the fitted values (predicted values from the regression model, horizontal axis). It helps you assess whether the residuals have a consistent spread across the range of fitted values (the homoscedasticity assumption) and whether a curved pattern suggests nonlinearity. Ideally, the points should be randomly scattered around zero, without any clear patterns or trends.
Normal Q-Q (Quantile-Quantile) Plot:
This plot compares the quantiles of the residuals (vertical axis) to the quantiles of a theoretical normal distribution (horizontal axis). It helps you assess whether the residuals are normally distributed (the normality assumption). In a normal Q-Q plot, the points should fall approximately along a straight line if the residuals are normally distributed; systematic curvature or stray points at the ends suggest skewness or heavy tails.
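The coordinates of a normal Q-Q plot can be computed without any plotting library. The sketch below assumes the plotting positions (i - 0.5)/n, which is one common convention among several, and uses made-up residuals; `qq_points` is a hypothetical helper name.

```python
from statistics import NormalDist


def qq_points(residuals):
    """Pair each sorted residual with a theoretical standard normal quantile."""
    n = len(residuals)
    ordered = sorted(residuals)   # sample quantiles (vertical axis)
    std_normal = NormalDist()     # mean 0, standard deviation 1
    # Plotting positions (i - 0.5)/n are an assumed convention
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


# Made-up residuals (assumption for illustration only)
residuals = [-0.8, 0.6, 1.0, -0.6, -0.2]

points = qq_points(residuals)
for theoretical_q, sample_q in points:
    print(round(theoretical_q, 3), sample_q)
```

Plotting the second value of each pair against the first gives the Q-Q plot; points lying close to a straight line are consistent with normally distributed residuals.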
Residuals vs. Predictor Variables Plot:
Separate scatterplots of residuals against each predictor variable are created. These plots help you assess whether there are any systematic patterns or relationships between the residuals and the predictor variables. Deviations from randomness in these plots may indicate potential issues such as nonlinearity or omitted variable bias.
Leverage-Residuals Plot:
This plot shows leverage values (a measure of how far an observation's predictor values lie from the bulk of the data, and therefore how much potential it has to influence the fitted model) on the horizontal axis and standardized residuals on the vertical axis. It helps you identify influential observations that may have a disproportionate impact on the regression coefficients. Points with both high leverage and large residuals are the most likely to be influential.
Cook’s Distance Plot:
Cook’s distance is a measure of the influence of each observation on the regression coefficients. This plot shows Cook’s distances on the vertical axis and observation indices on the horizontal axis. It helps you identify influential observations that may significantly affect the regression coefficients. Points with high Cook’s distances may warrant further investigation.
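Leverage and Cook's distance can be computed directly for a single-predictor model. The sketch below uses the usual simple-regression formulas on made-up data: leverage h_i = 1/n + (x_i - mean(x))^2 / Sxx, and Cook's distance D_i = e_i^2 / (p * MSE) * h_i / (1 - h_i)^2, with p = 2 estimated coefficients. `leverage_and_cooks` is a hypothetical helper name.

```python
def leverage_and_cooks(x, y):
    """Return (leverages, cooks_distances) for the line y = b0 + b1*x."""
    n = len(x)
    p = 2  # estimated coefficients: intercept and slope
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_mean - b1 * x_mean
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in residuals) / (n - p)
    # Leverage grows with the distance of x_i from the mean of x
    leverages = [1 / n + (xi - x_mean) ** 2 / sxx for xi in x]
    # Cook's distance combines residual size and leverage
    cooks = [
        (e ** 2 / (p * mse)) * h / (1 - h) ** 2
        for e, h in zip(residuals, leverages)
    ]
    return leverages, cooks


# Made-up toy data (assumption for illustration only)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

leverages, cooks = leverage_and_cooks(x, y)
most_influential = cooks.index(max(cooks))
print(leverages)
print(cooks)
print(most_influential)
```

In this toy data the first observation has the largest Cook's distance (about 1.5) because it combines the highest leverage with a sizeable residual. The leverages sum to the number of estimated coefficients, which is a handy sanity check.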
Scale-Location Plot (Square Root of Absolute Residuals vs. Fitted Values Plot):
This plot is similar to the residuals vs. fitted values plot but uses the square root of the absolute residuals (usually standardized) on the vertical axis. It helps you assess whether the spread of residuals is consistent across the range of fitted values: a roughly horizontal band of points supports constant variance, while an upward or downward trend suggests heteroscedasticity.