Linear regression Flashcards

1
Q

Linear regression

A

For dependent variables measured on interval and ratio scales (not suited to ordinal dependent variables).

Notation: Yk = b0 + b1 Xk + ek

Yk: dependent variable
Xk: independent variable
b0: intercept (value of Y when X = 0)
b1: slope (change in Y when X increases by 1)
ek: error term / residual
k: data row k
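The notation above can be sketched in a few lines of Python using the closed-form least-squares formulas (the data points are invented for illustration):

```python
# Fit Y_k = b0 + b1*X_k by least squares using the closed-form
# formulas b1 = cov(X, Y) / var(X) and b0 = mean(Y) - b1*mean(X).
# Illustrative data, not from the source.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

b1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
     / sum((x - mean_x) ** 2 for x in xs)
b0 = mean_y - b1 * mean_x

residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]  # e_k
```

With an intercept in the model, the residuals always sum to (numerically) zero.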

2
Q

Dependent variable

A

A variable for which at least some of the variation is theorized to be caused by one or more independent variables.

Also termed response variable (experiments), outcome variable, criterion variable, target variable and output variable.

3
Q

Independent variable

A

A variable that is theorized to cause variation in the dependent variable.

Also termed predictor variable (regressions), explanatory variable, treatment variable (experiments), manipulated variable (experiments) and input variable.

4
Q

Null hypothesis

A

H0 is a theory-based statement about what we would expect to observe if there were no relationship between an independent variable and the dependent variable. It assumes that two possibilities are the same, i.e., that any observed differences are due to chance alone.

5
Q

p-value

A

Basis of a statistical hypothesis test. The probability, assuming the null hypothesis is true, of finding a relationship at least as strong as the one observed in the sample. Ranges between 0 and 1; the closer to zero, the less likely the result is due to chance alone.
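As a minimal Python sketch (assuming a normal test statistic), a two-sided p-value can be computed from the standard normal CDF; the conventional cutoff z = 1.96 corresponds to p ≈ 0.05:

```python
import math

def normal_cdf(z):
    """Standard normal CDF built from the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def two_sided_p(z):
    """Two-sided p-value for a z-statistic."""
    return 2 * (1 - normal_cdf(abs(z)))

p = two_sided_p(1.96)  # about 0.05
```

The larger the test statistic in absolute value, the smaller the p-value.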

6
Q

Bivariate relational hypotheses

A

Relationships between two variables. Directed and undirected relationships. Test for statistical significance via correlation or univariate regression analysis.

7
Q

Multivariate relational hypotheses

A

Relationships between more than two variables. Directed relationships. Test for statistical significance via multivariate regression analysis.

8
Q

Type I error

A

False positive. Rejecting the null hypothesis although the null hypothesis is true.

9
Q

Type II error

A

False negative. Failing to reject the null hypothesis although the null hypothesis is false. Typically happens because the sample is too small.

10
Q

Correlation (r)

A

A statistical measure of covariation which summarizes the direction and strength of the linear relationship between two variables.

Squaring r gives the proportion of shared variance: for example, r = 0.8 implies r² = 0.64, i.e., 64% of the variance in one variable can be explained by the variance in the other variable.

0.9 < |r| ≤ 1.0: Very strong correlation
0.7 < |r| ≤ 0.9: Strong correlation
0.5 < |r| ≤ 0.7: Average correlation
0.2 < |r| ≤ 0.5: Weak correlation
0.0 ≤ |r| ≤ 0.2: Very weak correlation
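A minimal Python sketch computes r directly from its definition (illustrative data):

```python
import math

# Pearson's r = cov(X, Y) / (SD(X) * SD(Y)); invented data points.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
mx = sum(xs) / n
my = sum(ys) / n

cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
r = cov / math.sqrt(sum((x - mx) ** 2 for x in xs)
                    * sum((y - my) ** 2 for y in ys))
r_squared = r ** 2  # share of variance explained
```

Here r ≈ 0.77 (a strong correlation on the scale above), so r² ≈ 0.6 of the variance is shared.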

11
Q

Spearman’s correlation

A

Rank correlation, i.e., a statistical dependence between the rankings of two variables. Dichotomous and ordinal scales.
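When there are no tied values, Spearman's ρ can be sketched via the rank-difference formula ρ = 1 − 6·Σd² / (n(n² − 1)):

```python
def spearman_rho(xs, ys):
    """Rank correlation via 1 - 6*sum(d^2)/(n*(n^2 - 1)); assumes no ties."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        out = [0] * len(vals)
        for rank, i in enumerate(order, start=1):
            out[i] = rank
        return out

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

Because only rankings matter, any increasing monotone relationship gives ρ = 1 even when it is not linear (e.g., y = x²  on positive x).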

12
Q

Pearson’s correlation

A

Linear correlation, i.e., a statistical dependence between two metric variables. Interval and ratio scales.

13
Q

Degrees of freedom

A

In general terms, degrees of freedom can be thought of as the number of independent pieces of information available for estimating a parameter or for calculating a statistic. It reflects the number of values in a calculation that are free to vary. It helps determine the appropriate distribution to use when making inferences about the population from a sample.

Degrees of freedom (df) play a critical role in various statistical tests, including ANOVA, the F-statistic, the t-statistic, Pearson’s r, and the chi-squared test.

14
Q

Scatterplots

A

Scatterplots are ideal for visualizing the relationship between two continuous variables. They help to identify whether a relationship exists (e.g., positive, negative, or no correlation). In other words, they help detect outliers, identify correlations, and assess linear vs. non-linear relationships.

15
Q

Ordinary-Least-Squares (OLS) regression

A

The most popular type of linear regression. The OLS estimator minimizes the sum of all squared estimation errors (i.e., residuals) in the sample.
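A small Python sanity check of this minimizing property: the closed-form OLS coefficients give a smaller sum of squared residuals than any perturbed line (illustrative data):

```python
def sse(xs, ys, b0, b1):
    """Sum of squared residuals for the line y = b0 + b1*x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Invented data for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
     / sum((x - mx) ** 2 for x in xs)
b0 = my - b1 * mx

best = sse(xs, ys, b0, b1)
# Moving away from the OLS solution increases the squared error.
assert best < sse(xs, ys, b0 + 0.1, b1)
assert best < sse(xs, ys, b0, b1 - 0.1)
```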

16
Q

Assumptions for OLS

A
  1. Linear relationship between the dependent and independent variables
  2. No multicollinearity
  3. Homoskedasticity (not heteroskedasticity), i.e., constant error variance; can be checked in a residual scatterplot
  4. Normally distributed error terms
17
Q

U-shaped relationships

A

You can still use a linear regression; just include a quadratic term (e.g., X²) as an additional predictor. The model remains linear in its coefficients.
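A minimal sketch with NumPy (assuming it is available): adding an X² column to the design matrix lets OLS capture a U-shape, using synthetic data where Y = X² exactly:

```python
import numpy as np

# Synthetic U-shaped data: y = x^2 exactly.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x ** 2

# Design matrix with intercept, linear term, and quadratic term.
X = np.column_stack([np.ones_like(x), x, x ** 2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2 = coef  # recovers roughly (0, 0, 1)
```

A plain line fitted to these points would have slope zero and miss the curvature entirely; the quadratic term captures it.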

18
Q

T-statistic

A

Tests significance against the hypothesis that the regression coefficient is equal to 0. It indicates how strongly each independent variable is associated with the dependent variable. A higher absolute value of the t-statistic suggests a stronger relationship, while a t-statistic close to zero indicates that the variable may not be a significant predictor.
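For a simple regression, the slope's t-statistic can be sketched as t = b1 / SE(b1), with SE(b1) = s / √Σ(x − x̄)² and s² = SSE / (n − 2) (illustrative data):

```python
import math

# Invented data for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
sxx = sum((x - mx) ** 2 for x in xs)
b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
b0 = my - b1 * mx

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
s2 = sse / (n - 2)             # residual variance (n - 2 degrees of freedom)
se_b1 = math.sqrt(s2 / sxx)    # standard error of the slope
t = b1 / se_b1                 # large |t| -> slope clearly differs from 0
```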

19
Q

Regression coefficients (β)

A

Indicate the change in the dependent variable (Y) for a one-unit change in the independent variable (X), holding other variables constant.

Interpretation: A positive coefficient indicates a positive relationship, while a negative coefficient indicates a negative relationship.

If β is 2, for every one-unit increase in the independent variable, the dependent variable increases by 2 units.

20
Q

Intercept

A

The expected value of the dependent variable when all independent variables are zero.

21
Q

R² coefficient of determination

A

A goodness of fit measure that varies between 0 and 1 representing the proportion of variation in the dependent variable that is accounted for by the model, i.e. how much variance the independent variables are able to explain. Higher values indicate that a larger proportion of variance is explained by the model.

R² = 0.70 means that 70% of the variance in the dependent variable is explained by the independent variables in the model.

Interpretation of R²:
Substantial: R² > 0.26
Moderate: 0.13 < R² < 0.26
Weak: R² < 0.13
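A minimal Python sketch: R² = 1 − SSE/SST, where SSE is the unexplained (residual) variation and SST the total variation around the mean of Y (illustrative data):

```python
# Invented data for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
     / sum((x - mx) ** 2 for x in xs)
b0 = my - b1 * mx

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))  # unexplained
sst = sum((y - my) ** 2 for y in ys)                          # total
r_squared = 1 - sse / sst
```

For this near-perfect linear pattern, R² comes out just below 1.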

22
Q

Standard Error (SE)

A

Standard error is a measure of the precision of a sample mean as an estimate of the population mean. It quantifies how much the sample mean is expected to vary from the true population mean due to sampling variability.
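A minimal Python sketch: the standard error of the mean is the sample standard deviation divided by √n (illustrative data):

```python
import math

# Invented data for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)
mean = sum(data) / n

# Sample standard deviation (n - 1 in the denominator).
sd = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
se = sd / math.sqrt(n)
```

Because of the √n divisor, the SE shrinks as the sample grows, while the SD describes the spread of the data itself.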

23
Q

Standard deviation (SD)

A

Measures variability within a dataset. Indicates how spread out the data points are around the mean

24
Q

Delta R² (ΔR²)

A

Delta R² indicates how much additional variance in the dependent variable is explained by the new predictors. A positive value suggests that the new predictors improve the model's explanatory power, while a negative or very small value suggests that the new predictors do not significantly improve the model.
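A sketch with NumPy (assuming it is available): fit a model with one predictor, add a second, and take the difference of the two R² values, using synthetic data where the second predictor genuinely matters:

```python
import numpy as np

# Synthetic data: y depends on both x1 and x2, plus noise.
rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2 * x1 + 1 * x2 + rng.normal(scale=0.5, size=n)

def r_squared(X, y):
    """R^2 of an OLS fit with design matrix X (intercept column included)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

ones = np.ones(n)
r2_base = r_squared(np.column_stack([ones, x1]), y)
r2_full = r_squared(np.column_stack([ones, x1, x2]), y)
delta_r2 = r2_full - r2_base  # extra variance explained by x2
```

Note that adding a predictor can never lower R² in-sample, so ΔR² must be weighed against the cost of the extra parameter (e.g., via an F-test).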

25
Q

F-statistic

A

The F-statistic is a ratio that compares the variance explained by the regression model to the variance that is not explained (the residual variance). It assesses whether at least one of the independent variables in the model significantly predicts the dependent variable.

It serves as the test criterion for whether the estimated model is also valid for the population beyond the sample. Significance can be read from p(F).
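Given R², the F-statistic can be computed as F = (R²/k) / ((1 − R²)/(n − k − 1)), with k predictors and n observations; the numbers below are illustrative (R² = 0.70, k = 3, n = 50):

```python
# F-statistic from R^2: ratio of explained to unexplained variance,
# adjusted for model (k) and residual (n - k - 1) degrees of freedom.
r2, k, n = 0.70, 3, 50

f = (r2 / k) / ((1 - r2) / (n - k - 1))  # about 35.8
```

The corresponding p(F) would then be read from the F-distribution with (k, n − k − 1) degrees of freedom.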