Regression Flashcards

1
Q

what is regression analysis used for in statistics?

A

regression analysis is used to explore relationships between variables, allowing for the prediction of one variable based on another

2
Q

what is a residual in regression analysis?

A

a residual is the difference between the observed value and the predicted value for a given data point

3
Q

what does the line of best fit represent in regression analysis?

A

the line of best fit represents the linear relationship between the dependent and independent variables, minimising the sum of squared residuals

4
Q

what is the method of least squares?

A

the method of least squares is a technique used in regression to find the line that minimises the sum of the squared residuals
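
as a rough illustration, the least-squares slope and intercept can be computed directly from the data. this is only a sketch: the data set and the helper name `least_squares` are made up for the example, and the formulas used are the standard closed-form ones for a single explanatory variable.

```python
# Hypothetical data, chosen only to keep the arithmetic simple.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

def least_squares(xs, ys):
    """Return (b0, b1) for the line y = b0 + b1*x that minimises
    the sum of squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross-deviations and sum of squared x-deviations.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx            # slope
    b0 = y_bar - b1 * x_bar   # intercept: the line passes through (x_bar, y_bar)
    return b0, b1

b0, b1 = least_squares(xs, ys)
print(f"y = {b0:.2f} + {b1:.2f}x")   # prints "y = 2.20 + 0.60x"
```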

5
Q

how can you interpret the regression equation y = b0 + b1x?

A

in the equation y = b0 + b1x, b0 is the y-intercept, representing the predicted value of y when x = 0, and b1 is the slope, indicating how much the predicted y changes for a one-unit increase in x.

6
Q

what does a negative residual indicate?

A

a negative residual indicates that the predicted value is higher than the observed value for that data point

7
Q

what is the significance of the sum of squared residuals in regression analysis?

A

the sum of squared residuals is minimised in the method of least squares to find the best-fitting line, ensuring the model’s predictions are as close as possible to the observed data
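
a small sketch of this idea, using a made-up data set whose least-squares line works out to y = 2.2 + 0.6x. the helper names `residuals` and `ssr` are hypothetical. it compares the sum of squared residuals for the least-squares line with that of a nearby line, and shows that negative residuals appear where the line predicts too high.

```python
# Hypothetical data; its least-squares line is y = 2.2 + 0.6x.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

def residuals(b0, b1):
    """Observed minus predicted value for each data point."""
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

def ssr(b0, b1):
    """Sum of squared residuals for the line y = b0 + b1*x."""
    return sum(e ** 2 for e in residuals(b0, b1))

best = ssr(2.2, 0.6)    # the least-squares line
other = ssr(2.0, 0.7)   # a nearby line that is not the least-squares solution

# Residuals are negative wherever the line over-predicts the observed y.
print([round(e, 2) for e in residuals(2.2, 0.6)])
print(round(best, 2), round(other, 2))   # prints "2.4 2.55"
```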

8
Q

what is extrapolation, and why is it dangerous in regression analysis?

A

extrapolation involves using a regression line to predict y values for x values outside the observed data range. it is risky because the trend might change, leading to poor predictions
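
to see how badly extrapolation can go, here is a sketch with invented data that secretly follow the curve y = x^2. a straight line is fitted on x = 0..4 and then used at x = 10, well outside the observed range.

```python
# Hypothetical data that actually follow the curve y = x**2.
xs = [0, 1, 2, 3, 4]
ys = [x ** 2 for x in xs]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
b1 = sxy / sxx            # fitted slope
b0 = y_bar - b1 * x_bar   # fitted intercept

x_new = 10                     # far outside the observed range 0..4
predicted = b0 + b1 * x_new    # what the fitted line says
actual = x_new ** 2            # what the underlying curve gives
print(predicted, actual)       # prints "38.0 100"
```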

9
Q

what are forecasts, and what assumption is made when using a regression line to predict future values?

A

forecasts are predictions about the future using time series data. the assumption made is that the past trend will remain the same in the future, which can be risky

10
Q

what is an influential outlier in regression analysis, and how does it affect results?

A

an influential outlier is a data point that significantly impacts the regression line and correlation, especially when the point is both far from the trend and has an extreme x value

11
Q

what is the difference between an outlier and a regression outlier?

A

an outlier is a point far from others in terms of x and y values, but a regression outlier is a point that is far from the overall trend, even if not an outlier on its own x or y values

12
Q

what should you do when you encounter an influential regression outlier?

A

investigate the observation to see if it was recorded incorrectly or if it is genuinely different. it may be useful to refit the regression line without the outlier to check its impact on results
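
a minimal sketch of the "refit without the outlier" check, assuming invented data: five points that lie exactly on y = x, plus one point with an extreme x value that falls far from the trend. the helper name `slope` is hypothetical.

```python
def slope(xs, ys):
    """Least-squares slope for a single explanatory variable."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical data: the last point has an extreme x and is far from the trend.
xs = [1, 2, 3, 4, 5, 20]
ys = [1, 2, 3, 4, 5, 0]

slope_all = slope(xs, ys)                 # fit using every point
slope_without = slope(xs[:-1], ys[:-1])   # refit with the last point left out

print(round(slope_all, 3), round(slope_without, 3))
```

here a single point is enough to flip the sign of the slope, which is what makes it influential. whether to keep or drop such a point still depends on investigating why it is unusual.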

13
Q

does correlation imply causation? why or why not?

A

no, correlation does not imply causation. an association between two variables may be due to a third variable, or there could be other explanations for the observed relationship.

14
Q

what is a lurking variable?

A

a lurking variable is an unobserved variable that influences the association between the response and explanatory variables

15
Q

what is simpson’s paradox?

A

simpson’s paradox occurs when the direction of an association between two variables changes after including a third variable and analysing data at separate levels of that variable
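
a sketch of the reversal with invented numbers: within each of two hypothetical groups the slope is positive, but pooling the groups gives a negative slope. the group labels and the helper name `slope` are assumptions made for the example.

```python
def slope(xs, ys):
    """Least-squares slope for a single explanatory variable."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical data split by a third variable (group A vs group B).
group_a = ([1, 2, 3], [8, 9, 10])
group_b = ([6, 7, 8], [3, 4, 5])

slope_a = slope(*group_a)   # association within group A
slope_b = slope(*group_b)   # association within group B

# Ignore the third variable and analyse all points together.
pooled_x = group_a[0] + group_b[0]
pooled_y = group_a[1] + group_b[1]
slope_pooled = slope(pooled_x, pooled_y)

print(slope_a, slope_b, round(slope_pooled, 3))
```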

16
Q

how can lurking variables affect the interpretation of correlations?

A

lurking variables can create spurious associations or distort the apparent relationship between two variables, making it seem as if one causes the other

17
Q

what is confounding in statistics?

A

confounding occurs when two explanatory variables are associated with both the response variable and each other, making it difficult to determine which variable is causing the observed effect

18
Q

what is the difference between a lurking variable and a confounding variable?

A

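a lurking variable is unmeasured (not included in the study) but affects the relationship between the explanatory and response variables. a confounding variable is measured in the study and is associated with both the explanatory variable and the response variable. a lurking variable would become a confounding variable if it were measured and included in the analysis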

19
Q

how can confounding affect the interpretation of a study’s results?

A

confounding can distort the apparent association between variables, making it seem as though one causes the other when, in fact, a third variable is influencing the results.

20
Q

what role do statistical methods play in analysing confounding variables?

A

statistical methods can adjust for confounding variables, isolating the effect of the explanatory variable, but there’s always a risk of omitting important confounders

21
Q

what is the response variable in regression analysis?

A

the response variable (y) is the variable you want to predict

22
Q

what is the explanatory variable in regression analysis?

A

the explanatory variable (x) is the predictor or independent variable used to predict the response variable

23
Q

what is the purpose of a scatterplot in regression analysis?

A

a scatterplot is used to visualise the relationship between two variables and to check if the relationship is linear

24
Q

what is the equation of the regression line?

A

the regression line equation is ŷ = a + bx, where ŷ is the predicted value of y, a is the y-intercept and b is the slope

25
Q

what are residuals in regression analysis?

A

residuals are the differences between observed values and predicted values: residual = y - ŷ

26
Q

what does the regression model describe in terms of the mean of y?

A

the regression model describes how the mean of y changes with x, represented by µy = alpha + beta * x

27
Q

what is the significance of the straight-line assumption in regression?

A

the straight-line assumption implies that the relationship between x and y is linear, but this is an approximation and may not always hold in practice

28
Q

what is the conditional distribution in regression analysis?

A

the conditional distribution is the distribution of y values at each fixed value of x, and it describes the variability of y around the regression line

29
Q

how is the variability in y values represented in regression analysis?

A

variability is represented by the standard deviation sigma of the conditional distribution of y for each fixed x value

30
Q

what happens if the true relationship between variables is nonlinear?

A

if the relationship is nonlinear (e.g., U-shaped), using a straight-line regression model could lead to inaccurate predictions

31
Q

what does the correlation r describe in the straight-line regression model?

A

the correlation r describes the strength and direction of the linear association between two variables

32
Q

what is the range of values for the correlation r?

A

the correlation r always falls between -1 and +1, i.e., -1 ≤ r ≤ 1

33
Q

how is the strength of the linear association related to the value of r?

A

the larger the absolute value of r, the stronger the linear association. when r = 1 or r = -1, the data points fall exactly on the regression line

34
Q

what is the relationship between the correlation r and the slope b of the regression line?

A

the correlation r has the same sign as the slope b. a positive correlation corresponds to an upward trend, and a negative correlation corresponds to a downward trend

35
Q

how does the correlation r relate to the slope b and standard deviations sx and sy?

A

the slope b is proportional to the correlation r and the ratio of the standard deviations: b = r * (sy/sx)
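
the identity b = r * (sy/sx) can be checked numerically. the sketch below uses a made-up data set and computes the slope and the correlation separately before comparing them. `statistics.stdev` gives the sample standard deviation.

```python
import math
import statistics

# Hypothetical data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)

b = sxy / sxx                    # least-squares slope
r = sxy / math.sqrt(sxx * syy)   # correlation
sx = statistics.stdev(xs)        # sample standard deviation of x
sy = statistics.stdev(ys)        # sample standard deviation of y

print(round(b, 4), round(r * sy / sx, 4))   # the two values agree
```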

36
Q

why can’t the slope alone describe the strength of an association?

A

the slope’s numerical value depends on the units of measurement, while the correlation r is a standardised version that does not depend on the units
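
a quick sketch of the units point, with invented data: the same x values are expressed first in hours and then in minutes. the variable names and the helper `slope_and_r` are hypothetical.

```python
import math

def slope_and_r(xs, ys):
    """Return (least-squares slope, correlation) for paired data."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sxx, sxy / math.sqrt(sxx * syy)

# Hypothetical data: study time and a test score.
hours = [1, 2, 3, 4, 5]
score = [2, 4, 5, 4, 5]
minutes = [60 * h for h in hours]   # same measurements, different units

b_hours, r_hours = slope_and_r(hours, score)
b_minutes, r_minutes = slope_and_r(minutes, score)

print(round(b_hours, 4), round(b_minutes, 4))   # slope shrinks by a factor of 60
print(round(r_hours, 4), round(r_minutes, 4))   # correlation is unchanged
```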

37
Q

what is “regression toward the mean”?

A

regression toward the mean refers to the tendency for extreme values of one variable (e.g., very tall parents) to be associated with values of the other variable that are closer to its mean (e.g., children who are tall, but less extremely so)

38
Q

how does the correlation r affect the predicted value of y in regression?

A

the predicted value of y is relatively closer to its mean than x is to its mean. if x is one standard deviation above its mean, the predicted y is r standard deviations (of y) from its mean
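
this can be checked with a sketch on made-up data: take an x value exactly one standard deviation above the mean of x, predict y from the least-squares line, and express that prediction in standard deviations of y.

```python
import math
import statistics

# Hypothetical data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)

b1 = sxy / sxx                   # least-squares slope
b0 = y_bar - b1 * x_bar          # least-squares intercept
r = sxy / math.sqrt(sxx * syy)   # correlation
sx, sy = statistics.stdev(xs), statistics.stdev(ys)

x_one_sd_up = x_bar + sx          # x one standard deviation above its mean
y_hat = b0 + b1 * x_one_sd_up     # predicted y at that x
z_y = (y_hat - y_bar) / sy        # prediction in standard deviations of y

print(round(z_y, 4), round(r, 4))   # the two values agree
```

because |r| is at most 1, the predicted y is never more standard deviations from its mean than x is from its own.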

39
Q

what does r^2 represent in regression analysis?

A

r^2 represents the proportion of the variance in y explained by x. it is the proportional reduction in error compared to predicting y by its mean

40
Q

what is the interpretation of r^2 = 0.4?

A

if r^2 = 0.4, this means that the prediction error (measured by the sum of squared errors) when using the regression equation to predict y is 40% smaller than the error when predicting y using its mean
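
the "proportional reduction in error" reading can be sketched as follows, assuming a made-up data set (for which r^2 happens to be 0.6 rather than 0.4). the error is measured as a sum of squared prediction errors in both cases.

```python
# Hypothetical data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Error when every y is predicted by the mean of y.
error_mean = sum((y - y_bar) ** 2 for y in ys)
# Error when each y is predicted by the regression line.
error_line = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

r_squared = (error_mean - error_line) / error_mean
print(round(r_squared, 4))   # prints "0.6"
```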

41
Q

what is the advantage of using the correlation r over r^2?

A

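the correlation r is easier to interpret because it works on the original scale of the variables (in standard deviation units) and shows the direction of the association, while r^2 is based on squared deviations and loses the sign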

42
Q

how can outliers affect the correlation?

A

outliers with unusually large or small x values that fall far from the trend can have a significant impact on the correlation and slope

43
Q

how does grouping data for observations affect the correlation?

A

grouping data, such as using summary statistics for countries rather than individual data, tends to increase the magnitude of the correlation

44
Q

what is the ecological fallacy?

A

the ecological fallacy occurs when results from grouped data (e.g., county averages) are incorrectly generalised to individual data

45
Q

how does the range of x values affect the correlation?

A

the correlation tends to be smaller when only a restricted range of x values is sampled, as it may not capture the full variability of the data
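
a sketch of the restricted-range effect on invented data: y follows x with a fixed alternating error of +1 and -1, and the correlation is computed once over all x values and once using only the middle of the range. the helper name `correlation` is hypothetical, and a single example like this only illustrates the tendency.

```python
import math

def correlation(xs, ys):
    """Correlation r for paired data."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: y = x plus an alternating error of +1, -1, +1, ...
xs = list(range(1, 11))
noise = [1, -1] * 5
ys = [x + e for x, e in zip(xs, noise)]

r_full = correlation(xs, ys)

# Keep only the observations with 4 <= x <= 7.
restricted = [(x, y) for x, y in zip(xs, ys) if 4 <= x <= 7]
r_restricted = correlation([x for x, _ in restricted],
                           [y for _, y in restricted])

print(round(r_full, 3), round(r_restricted, 3))
```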

46
Q

what are the key assumptions needed for regression analysis to make inferences?

A

  1. randomness: the sample must be representative of the population
  2. normality: the residuals should approximately follow a normal distribution
  3. linearity: the relationship between the variables should be linear

47
Q

what does the null hypothesis (H0 : beta = 0) in a regression significance test represent?

A

it states that the population slope is zero, i.e., that there is no linear relationship between the variables (under the straight-line model, x and y are independent)

48
Q

what is the formula for the t-statistic in a regression test of significance?

A

t = (b - 0)/se
where b is the sample slope and se is the standard error of the slope
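
a sketch of the calculation on made-up data. the card does not say how se is obtained, so the code assumes the usual formula for simple linear regression: se = s / sqrt(sum of squared x deviations), where s is the residual standard deviation with n - 2 degrees of freedom.

```python
import math

# Hypothetical data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)

b = sxy / sxx           # sample slope
a = y_bar - b * x_bar   # sample intercept

# Residual standard deviation, using n - 2 degrees of freedom.
ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
s = math.sqrt(ss_res / (n - 2))

se = s / math.sqrt(sxx)   # standard error of the slope (assumed formula)
t = (b - 0) / se          # test statistic for H0: beta = 0
df = n - 2                # degrees of freedom for the t distribution

print(round(t, 4), df)    # prints "2.1213 3"
```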

49
Q

what does a small p-value in a regression test suggest?

A

a small p-value indicates strong evidence against the null hypothesis, suggesting that there is a significant linear relationship between the variables

50
Q

what does a 95% confidence interval for the slope tell us?

A

it gives the range of plausible values for the true population slope. for example, a 95% confidence interval of (1.2, 1.8) means we are 95% confident that the true slope falls within this range

51
Q

how is the sample correlation r related to the sample slope b?

A

r = b * (sx/sy)
where sx and sy are the standard deviations of x and y, respectively

52
Q

what does it mean if the slope b = 0 in a regression analysis?

A

if b = 0, then r = 0, indicating no linear relationship between the variables

53
Q

what is the purpose of testing the population correlation using the t-statistic for the slope?

A

the t-statistic for testing whether the slope is zero is equivalent to testing whether the population correlation is zero. a small p-value indicates a significant correlation
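
the equivalence can be sketched numerically on made-up data. the code computes the t-statistic from the slope and its standard error, then again from the correlation using t = r * sqrt(n - 2) / sqrt(1 - r^2). both formulas are the standard ones for simple linear regression and are assumptions here, since the card does not state them.

```python
import math

# Hypothetical data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)

# t-statistic based on the slope.
b = sxy / sxx
a = y_bar - b * x_bar
ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
se = math.sqrt(ss_res / (n - 2)) / math.sqrt(sxx)
t_slope = b / se

# t-statistic based on the correlation.
r = sxy / math.sqrt(sxx * syy)
t_corr = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

print(round(t_slope, 4), round(t_corr, 4))   # the two values agree
```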