Linear Modeling and Regression Flashcards

Chapter 1

1
Q
In the context of predicting men’s 100m Olympic sprint times using past data, what does it mean to ‘choose a linear model,’ and how is such a model written mathematically?
A

A linear model assumes the winning time is a linear function of the input (e.g., Olympic year). Mathematically, it is written as t = w0 + w1*x, where w0 is the intercept and w1 is the slope, capturing how time changes with the year.

2
Q
Why do we square the difference between the predicted time and the actual time (the squared loss) when measuring how good a linear model is?
A

We square the difference so that positive and negative errors do not cancel out and so that large errors are penalized more heavily. The squared loss is also smooth and differentiable everywhere, which makes it convenient to minimize.

3
Q
Given data points (xn, tn) for Olympic years (xn) and winning times (tn), how do we define the average squared loss L for a proposed line t = w0 + w1*x?
A

We define L = (1/N) * Σn [tn - (w0 + w1*xn)]² over all N data points. Minimizing this L finds the best-fitting line in the least-squares sense.
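
A minimal sketch of this loss computation in Python, using approximate historical winning times purely for illustration:

```python
import numpy as np

def average_squared_loss(w0, w1, x, t):
    """Mean squared error of the line w0 + w1*x against targets t."""
    predictions = w0 + w1 * x
    return np.mean((t - predictions) ** 2)

# Illustrative (year, winning time in seconds) data, not an official dataset.
years = np.array([1980, 1984, 1988, 1992, 1996, 2000], dtype=float)
times = np.array([10.25, 9.99, 9.92, 9.96, 9.84, 9.87])

# Evaluate the loss for one candidate line (parameters chosen by hand).
print(average_squared_loss(36.4, -0.0133, years, times))
```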

4
Q
Suppose you are given the years and winning times for multiple Olympic 100m finals. Describe how you would find the optimal values of w0 and w1 using the normal equations.
A

First, construct the design matrix X (with a column of 1’s and a column of the years), and a vector t of winning times. Then solve w = (XᵀX)⁻¹ Xᵀ t for the parameter vector w = [w0, w1].
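
A minimal numpy sketch of this procedure, again with illustrative (year, time) values; `np.linalg.solve` is used rather than forming the inverse explicitly, which is numerically preferable:

```python
import numpy as np

# Illustrative (year, winning time) data.
years = np.array([1980, 1984, 1988, 1992, 1996, 2000], dtype=float)
times = np.array([10.25, 9.99, 9.92, 9.96, 9.84, 9.87])

# Design matrix: a column of 1's for the intercept, then the years.
X = np.column_stack([np.ones_like(years), years])

# Normal equations: solve (X^T X) w = X^T t for w = [w0, w1].
w0, w1 = np.linalg.solve(X.T @ X, X.T @ times)
print(f"intercept w0 = {w0:.3f}, slope w1 = {w1:.5f}")
```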

5
Q
In a back-of-the-envelope calculation to predict the 2012 Olympic winning time, we essentially drew a ‘good’ straight line through the data. What are the formal steps that mirror this intuition?
A

(1) Decide on a model (a straight line). (2) Define a loss function (squared error). (3) Fit the line by minimizing that loss (solve for w0, w1). (4) Extrapolate to the year 2012. (5) Use that line’s value at 2012 as the prediction.

6
Q
Give an example problem: Suppose you want to predict future winning times for 2024 and 2028. How would you proceed using the linear model framework from the lecture?
A

You would collect the historical (year, winning time) data. Then, fit a linear model by solving for w0 and w1 via least squares. Finally, plug x=2024 and x=2028 into t = w0 + w1*x to get the predictions for those years.
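
A minimal end-to-end sketch under the same illustrative data used above: fit w0 and w1 by least squares, then evaluate the line at 2024 and 2028:

```python
import numpy as np

# Illustrative (year, winning time) data.
years = np.array([1980, 1984, 1988, 1992, 1996, 2000], dtype=float)
times = np.array([10.25, 9.99, 9.92, 9.96, 9.84, 9.87])

# Least-squares fit via the normal equations.
X = np.column_stack([np.ones_like(years), years])
w0, w1 = np.linalg.solve(X.T @ X, X.T @ times)

# Plug the future years into t = w0 + w1*x.
for year in (2024, 2028):
    print(f"Predicted winning time for {year}: {w0 + w1 * year:.2f} s")
```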

7
Q
What key assumptions are we making if we use a simple linear model (t = w0 + w1*x) to predict future Olympic 100m winning times?
A

(1) There is a relationship between year and winning time. (2) The relationship is (approximately) linear. (3) This linear trend continues into the future, even though real-world factors may change over time.

8
Q
Explain why a perfectly fitting high-order polynomial (like an 8th-order polynomial) might lead to poor predictions for future Olympic sprint times.
A

Such a polynomial can overfit by curving too precisely to random variations/noise in the data. Although it might produce a near-zero training error, it often generalizes poorly to new data, leading to inaccurate future predictions.

9
Q
What is overfitting in regression, and how does it manifest when modeling data like the Olympic 100m times?
A

Overfitting occurs when a model fits random fluctuations or noise rather than the underlying trend. In the Olympic data, this can appear as a very high-order polynomial that gets very small training loss but makes unrealistic extrapolations for future years.

10
Q
Describe an example problem that demonstrates overfitting and how cross-validation can help choose a better model.
A

For instance, if you fit polynomial models of increasing degree (1st, 2nd, 8th, etc.) to the men’s 100m data, the training loss will keep decreasing, but predictions on held-out data may worsen. Using cross-validation (splitting data into training and validation sets) can reveal which degree best balances fitting the known data and predicting new data.

11
Q
How does cross-validation work conceptually to help decide on model complexity for linear or polynomial regression?
A

You reserve some data (validation set) for testing the model’s predictive performance. You fit the model on the remaining data (training set), then check how well it predicts the validation set. Repeating this for different model complexities (e.g., different polynomial degrees) or folds in the data helps you find the best trade-off between fitting and overfitting.

12
Q
If you only have one dataset of historical winning times, how can you systematically use cross-validation to assess multiple potential models (e.g., linear, quadratic, cubic)?
A

You can perform K-fold cross-validation: split your dataset into K segments, repeatedly train on K-1 of them and validate on the remaining 1. Summarize performance across all folds. This approach uses every data point for both training and validation (in turns), helping judge which model generalizes best.
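
A minimal K-fold cross-validation sketch in plain numpy, comparing polynomial degrees; the data are illustrative, and the fold splitting and `np.polyfit` fitting shown here are one possible implementation, not the only one:

```python
import numpy as np

def kfold_cv_error(x, t, degree, K=5, seed=0):
    """Average held-out MSE of a degree-`degree` polynomial over K folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), K)
    errors = []
    for k in range(K):
        val_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
        coeffs = np.polyfit(x[train_idx], t[train_idx], degree)  # fit on K-1 folds
        preds = np.polyval(coeffs, x[val_idx])                   # predict held-out fold
        errors.append(np.mean((t[val_idx] - preds) ** 2))
    return np.mean(errors)

# Illustrative (year, time) data; years are centered so polynomial fits stay well conditioned.
years = np.array([1948, 1952, 1956, 1960, 1964, 1968, 1972, 1976,
                  1980, 1984, 1988, 1992, 1996, 2000, 2004, 2008], dtype=float)
times = np.array([10.30, 10.40, 10.50, 10.20, 10.00, 9.95, 10.14, 10.06,
                  10.25, 9.99, 9.92, 9.96, 9.84, 9.87, 9.85, 9.69])
x = years - years.mean()

for degree in (1, 2, 3, 4):
    print(degree, round(kfold_cv_error(x, times, degree), 4))
```

Comparing the printed validation errors across degrees is what reveals which model complexity generalizes best.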

13
Q
In practice, why might you prefer a relatively simple linear model over a higher-order polynomial, even if the polynomial achieves a lower training loss?
A

Because simpler models tend to generalize better to future data, are less prone to fitting noise, and are easier to interpret. A higher-order polynomial can achieve a very low training loss yet suffer from large errors when extrapolating to new points.

14
Q
Suppose your dataset has more features than just the year (e.g., track conditions). How can you extend the linear model to handle multiple features?
A

You extend the model to t = w0 + w1*x1 + w2*x2 + … + wD*xD, where each xi is a feature (e.g., year, track condition, temperature, etc.). This is written in matrix form as t = Xw, and you solve for w via (XᵀX)⁻¹ Xᵀ t or gradient-based methods.

15
Q
Describe a short example problem that uses multiple features for predicting winning times, and outline how you would solve it.
A

Example: Collect (year, average track temperature, wind speed) as features and official winning times as targets. Form the design matrix X (adding a column of 1’s for the intercept). Fit the parameters w by minimizing squared loss. Then predict new winning times by plugging in new feature values into t = Xw.
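
A minimal sketch of such a multi-feature fit; the feature values below (year, temperature, wind speed) and times are made up purely for illustration:

```python
import numpy as np

# Hypothetical features: (year, track temperature in °C, wind speed in m/s).
features = np.array([
    [1992, 24.0,  0.5],
    [1996, 27.0,  0.7],
    [2000, 22.0, -0.2],
    [2004, 25.0,  0.1],
    [2008, 23.0,  0.0],
])
times = np.array([9.96, 9.84, 9.87, 9.85, 9.69])  # illustrative targets

# Design matrix with a leading column of 1's for the intercept w0.
X = np.column_stack([np.ones(len(features)), features])

# Least-squares fit; lstsq is more robust than explicitly inverting X^T X.
w, *_ = np.linalg.lstsq(X, times, rcond=None)

# Predict for a new (year, temperature, wind) setting.
x_new = np.array([1.0, 2028, 26.0, 0.3])
print(x_new @ w)
```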

16
Q
What does the gradient descent method do, and why is it guaranteed to converge to the global optimum for a linear least-squares problem?
A

Gradient descent iteratively adjusts the model parameters w in the direction opposite to the gradient of the loss (the gradient points in the direction of steepest increase in error) until it reaches a minimum. For a convex loss such as the squared error in linear regression, there is only one global optimum, so gradient descent will converge to it (given a suitable learning rate).
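
A minimal gradient-descent sketch for the squared-error loss; the input years are rescaled here (an implementation choice, not part of the card) so that a fixed learning rate behaves well:

```python
import numpy as np

def gradient_descent(X, t, eta=0.01, n_iters=5000):
    """Minimize the mean squared error L(w) = (1/N)||t - Xw||^2 by gradient descent."""
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(n_iters):
        residual = X @ w - t
        grad = (2.0 / N) * (X.T @ residual)  # gradient of the mean squared error
        w -= eta * grad                      # step opposite the gradient
    return w

# Illustrative data; inputs rescaled so the loss surface is well scaled.
years = np.array([1980, 1984, 1988, 1992, 1996, 2000], dtype=float)
times = np.array([10.25, 9.99, 9.92, 9.96, 9.84, 9.87])
x = (years - years.mean()) / 10.0
X = np.column_stack([np.ones_like(x), x])

print(gradient_descent(X, times))  # should match the least-squares solution on these inputs
```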

17
Q
Provide an example gradient descent challenge: if your loss does not decrease smoothly and you observe oscillations in w, what practical steps might you take?
A

You might reduce the learning rate η or use an adaptive learning rate method. Checking gradient calculations for errors and employing momentum or other optimization variants can also help stabilize convergence.

18
Q
How can you incorporate polynomial or other nonlinear terms into a linear model framework, and what does the design matrix look like in that case?
A

You treat each function of x (e.g., x², sin(x), x³) as an additional feature. In the design matrix X, each column corresponds to a function hᵢ(xn), allowing the model t = Σ wᵢ hᵢ(x). You still solve via the same linear algebra or gradient descent approach because it’s linear in the parameters w.
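
A minimal sketch of a basis-function design matrix, using powers of x plus a sine term as example basis functions (any fixed functions of x would do); the data are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic inputs on a small scale so the columns stay well conditioned.
x = np.linspace(-1.0, 1.0, 12)
t = 10.0 - 0.3 * x + 0.1 * x**2 + 0.02 * rng.normal(size=x.size)

# Each column of the design matrix is a fixed function h_i(x): 1, x, x^2, sin(x).
X = np.column_stack([np.ones_like(x), x, x**2, np.sin(x)])

# The model t = Σ w_i h_i(x) is still linear in w, so ordinary least squares applies.
w, *_ = np.linalg.lstsq(X, t, rcond=None)
print(w)
```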

19
Q
Why is it sometimes beneficial to validate or test on more recent Olympic data separately, rather than using random splits?
A

Because the main purpose is predicting future events. Using chronological splits (past vs. most recent data) checks how well your model might extrapolate to genuinely later outcomes, better simulating real-world forecasting conditions.
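
A minimal sketch of such a chronological split, assuming the illustrative data are ordered by year: train on the earlier Games, validate on the most recent ones:

```python
import numpy as np

# Illustrative (year, time) data, ordered by year.
years = np.array([1980, 1984, 1988, 1992, 1996, 2000, 2004, 2008], dtype=float)
times = np.array([10.25, 9.99, 9.92, 9.96, 9.84, 9.87, 9.85, 9.69])

# Chronological split: everything before 2004 trains, the rest validates.
train = years < 2004
X_train = np.column_stack([np.ones(train.sum()), years[train]])
X_val = np.column_stack([np.ones((~train).sum()), years[~train]])

w, *_ = np.linalg.lstsq(X_train, times[train], rcond=None)
val_mse = np.mean((times[~train] - X_val @ w) ** 2)
print(val_mse)
```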

20
Q
Summarize how you would explain the process of building, choosing, and evaluating a linear regression model to a newcomer in the context of predicting Olympic 100m winning times.
A

(1) Decide on the form of the model (linear, polynomial, etc.). (2) Gather data for features (year) and targets (winning time). (3) Define a loss function (often squared error). (4) Solve for parameters using either the normal equations or gradient descent. (5) Assess how well it predicts unseen or withheld data (validation). (6) Select the simplest model that makes accurate predictions while avoiding overfitting.