Statistical Learning Flashcards
Technical Definition: Statistical Learning
A set of approaches for estimating the relationship f between predictors (X) and an output (Y) using data.
Synonyms for X (input)
- Predictors
- Independent variables
- Features
- Variables
Synonyms for Y (output)
- Response
- Dependent variable
General Form for Y
Y = f(X) + ε, where f is an unknown function and ε is a random error term with mean zero.
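A minimal R sketch simulating data from this model; the particular f and noise level below are illustrative assumptions, not from the text.
set.seed(1)
n <- 100
X <- runif(n, min = 0, max = 10)
f <- function(x) 2 + 0.5 * x        # the systematic (normally unknown) part
eps <- rnorm(n, mean = 0, sd = 1)   # random error term with mean zero
Y <- f(X) + eps                     # observed response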
What is systematic information in the context of Y = f(X) + ε?
The portion of Y explained by f(X), i.e., the non-random component driven by the predictors.
Why Estimate f?
Prediction and inference
Why Estimate f?
What does prediction focus on?
Obtaining an accurate Ŷ for new observations.
Why Estimate f?
What does inference aim to achieve?
Understand how each predictor impacts Y.
Definition: Reducible Error
Error introduced because our estimate of f, f̂, is not perfect. It can potentially be reduced by improving the model.
Equation for Reducible Error
[f(X) − f̂(X)]²
Definition: Irreducible Error
Error that cannot be eliminated, even with a perfect model.
Equation for Irreducible Error
Var(ε)
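For a fixed X and a fixed estimate f̂, the two components combine in the expected squared prediction error (a standard decomposition):
E[(Y − Ŷ)²] = [f(X) − f̂(X)]² + Var(ε),
where the first term is the reducible error and the second is the irreducible error.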
Why is the irreducible error larger than zero?
Unmeasured variables that are useful in predicting Y or inherent randomness in Y.
What are some example questions one may be interested in answering in the case of inference?
3 questions
- Which predictors are associated with the response?
- What is the relationship between the response and each predictor?
- Can the relationship between Y and each predictor be adequately summarized using a linear equation, or is the relationship more complicated?
Questions Related to Inference
Which predictors are associated with the response?
Please explain.
Identifying the few important predictors among a large set of possible variables.
Questions Related to Inference
What is the relationship between the response and each predictor?
What is the overall goal?
Evaluating each predictor’s effect on Y.
Questions Related to Inference
What is the relationship between the response and each predictor?
What are some examples of the types of relationships?
- Positive
- Negative
- More complex (e.g., it may depend on other variables via interactions)
Questions Related to Inference
Can the relationship between Y and each predictor be adequately summarized using a linear equation, or is the relationship more complicated?
Please explain.
2 points.
- In some situations, assuming a linear relationship is reasonable or even desirable.
- However, the true relationship can be non-linear or more complex, in which case a linear model may not provide an accurate representation of the relationship between the input and output variables.
Is this a prediction or inference problem?
Consider a company that is interested in conducting a direct-marketing campaign. The goal is to identify individuals who will respond positively to a mailing, based on observations of demographic variables measured on each individual. In this case, the demographic variables serve as predictors, and response to the marketing campaign (either positive or negative) serves as the outcome.
Prediction
Prediction problem
Consider a company that is interested in conducting a direct-marketing campaign. The goal is to identify individuals who will respond positively to a mailing, based on observations of demographic variables measured on each individual. In this case, the demographic variables serve as predictors, and response to the marketing campaign (either positive or negative) serves as the outcome.
Why is this a prediction problem?
Want an accurate model to predict the response using the predictors.
Prediction problem
Consider a company that is interested in conducting a direct-marketing campaign. The goal is to identify individuals who will respond positively to a mailing, based on observations of demographic variables measured on each individual. In this case, the demographic variables serve as predictors, and response to the marketing campaign (either positive or negative) serves as the outcome.
Why is this not an inference problem?
Not interested in obtaining a deep understanding of the relationships between each individual predictor and the response.
Is this a prediction or inference problem?
Consider a company that wants to understand how advertising spending across different media (e.g., TV, radio, newspaper) relates to product sales. In this case, the media advertising budgets serve as predictors, and sales serves as the outcome.
Inference
Inference problem
Consider a company that wants to understand how advertising spending across different media (e.g., TV, radio, newspaper) relates to product sales.
Why is this an inference problem?
3 points
The company may want to understand:
* Which media contributes to sales
* Which media generate the biggest boost in sales
* How much increase in sales is associated with a given increase in TV advertising
Is this a prediction or inference problem?
Modeling the brand of a product that a customer might purchase based on variables such as price, store location, discount levels, competition price, and so forth. In this situation one might really be most interested in how each of the individual variables affects the probability of purchase.
Inference
Inference problem
Modeling the brand of a product that a customer might purchase based on variables such as price, store location, discount levels, competition price, and so forth. In this situation one might really be most interested in how each of the individual variables affects the probability of purchase.
Why is this an inference problem?
Examining the relationship between each predictor and the response.
Is this a prediction or inference problem?
In a real estate setting, one may seek to relate values of homes to inputs such as crime rate, zoning, distance from a river, air quality, schools, income level of community, size of houses, and so forth.
Both
Prediction + inference problem
In a real estate setting, one may seek to relate values of homes to inputs such as crime rate, zoning, distance from a river, air quality, schools, income level of community, size of houses, and so forth.
Why is this a prediction problem specifically?
Predicting the value of a home given its characteristics.
Prediction + inference problem
In a real estate setting, one may seek to relate values of homes to inputs such as crime rate, zoning, distance from a river, air quality, schools, income level of community, size of houses, and so forth.
Why is this an inference problem specifically?
Examining how individual input variables affect the prices.
Parametric Methods (Overview)
They assume a specific functional form for f and estimate a finite set of parameters (e.g., the coefficients in a linear regression).
Non-Parametric Methods (Overview)
They make fewer assumptions about the form of f and can fit a wide range of shapes, but they generally require more data to avoid overfitting.
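A minimal R sketch contrasting the two approaches on simulated data; the data-generating function and variable names are illustrative assumptions.
set.seed(1)
x <- runif(100, 0, 10)
y <- sin(x) + rnorm(100, sd = 0.3)        # the true f is non-linear
fit_param    <- lm(y ~ x)                 # parametric: assumes a linear form for f
fit_nonparam <- loess(y ~ x)              # non-parametric: no fixed form assumed
ord <- order(x)
plot(x, y)
lines(x[ord], fitted(fit_param)[ord], col = "red")      # straight-line fit
lines(x[ord], fitted(fit_nonparam)[ord], col = "blue")  # flexible smooth fit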
What are the advantages of parametric methods versus non-parametric methods?
3 points
- Simplified problem: much easier to estimate a set of parameters than fit an entirely arbitrary function f.
- Less prone to overfitting.
- Fewer observations are necessary, given the smaller number of parameters.
What are the disadvantages of parametric methods versus non-parametric methods?
2 points
- The model we choose will usually not match the true unknown form of f. If the chosen model is too far from the true f, then our estimate will be poor.
- More rigid. Parametric methods follow a fixed functional form whereas non-parametric methods do not assume a form for f, allowing for more flexible models.
Non-Technical Definition: Overfitting
Modeling random noise in the training data so closely that performance on new data suffers.
Interpretability vs. Flexibility Trade-Off
Highly flexible methods can fit data more accurately but are often less interpretable; simpler methods are more interpretable but may fit less accurately.
Why might we choose a more restrictive (less flexible) model?
2 points
- Restrictive models (e.g., linear regression) tend to be more interpretable; valuable when inference about how predictors affect the response is important.
- When a more flexible model would overfit the data.
Why might we choose a more flexible model?
Flexible models often capture complex relationships more accurately, which can improve prediction performance but reduce interpretability.
Technical Definition: Supervised Learning
A setting in which each observation has both predictors (X) and a known response (Y), allowing us to train models for prediction or inference.
Technical Definition: Unsupervised Learning
A setting in which observations have predictors (X) but no associated response (Y), so we can only find structure or groupings in the data.
Definition: Clustering
An unsupervised technique that groups observations into clusters based on similarities among their measured variables.
What is Semi-Supervised Learning?
A scenario where some observations have a response while others do not, mixing both supervised and unsupervised elements.
Definition: Quantitative Variables
Can be measured on a numeric scale.
Definition: Qualitative Variables
Qualitative (categorical) variables fall into distinct classes.
Method Selection Based on Response Type
- Regression for quantitative response
- Classification for qualitative response
What is the ‘No Free Lunch’ principle in statistics?
It states that no single learning method is guaranteed to dominate all others across every possible data set; method performance depends on the specific problem.
What is the most commonly-used goodness-of-fit measure in regression problems?
Mean Squared Error (MSE)
Mean Squared Error (MSE) Equation
MSE = (1/n) Σᵢ (yᵢ − f̂(xᵢ))²
Difference Between Training MSE and Test MSE
Training MSE measures how well the model fits the data it was trained on; test MSE measures how well the model predicts new, unseen data.
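A minimal R sketch computing both quantities on simulated data; the train/test split and data-generating model are illustrative assumptions.
set.seed(2)
x <- runif(200); y <- sin(2 * pi * x) + rnorm(200, sd = 0.2)
dat <- data.frame(x, y)
train <- sample(200, 100)                       # indices of the training half
fit <- lm(y ~ poly(x, 4), data = dat[train, ])
train_mse <- mean((dat$y[train]  - predict(fit))^2)
test_mse  <- mean((dat$y[-train] - predict(fit, newdata = dat[-train, ]))^2)
c(train_mse, test_mse)                          # test MSE is typically the larger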
Why Does Minimizing Training MSE Not Guarantee Minimizing Test MSE?
Because a model might overfit the training data, capturing noise rather than the true underlying relationship, resulting in poor performance on unseen data.
Technical Definition: Overfitting
A model fits the training data so closely that it fails to generalize to new data, typically showing low training error but high test error.
Non-Technical Definition: Degrees of Freedom
A way to quantify model complexity or flexibility; more degrees of freedom usually means a more complex model that can fit data more closely.
Which method (orange, blue, or green) is optimal?
Blue
Explain why the blue method is optimal.
Minimizes the test MSE.
Which method (orange, blue, or green) exhibits overfitting?
Green
Explain why the green method exhibits overfitting.
Exhibits a small training MSE but a large test MSE.
Equation for Bias-Variance Decomposition of Expected Test MSE
E[(Y₀ − f̂(X₀))²] = Var(f̂(X₀)) + [Bias(f̂(X₀))]² + Var(ε)
E[(Y₀ − f̂(X₀))²] = Var(f̂(X₀)) + [Bias(f̂(X₀))]² + Var(ε)
What does the Equation for Bias-Variance Decomposition of Expected Test MSE tell us?
2 points
- We aim for low variance and low bias to minimize overall error.
- The expected test MSE can never lie below Var(ε), the irreducible error.
Definition: Bias (in model fitting)
Error introduced by approximating a potentially complex real‐world process with an overly simple model, causing systematic under‐ or over‐estimation.
Definition: Variance (in model fitting)
Sensitivity of the model to the particular training set used. A high‐variance model changes drastically when the training data change.
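In symbols, for a fixed test point X₀ (a standard formulation, with the expectation taken over training sets):
Bias(f̂(X₀)) = E[f̂(X₀)] − f(X₀)
Var(f̂(X₀)) = E[(f̂(X₀) − E[f̂(X₀)])²]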
What happens to model variance and model bias as we use more flexible methods?
Variance will increase and bias will decrease.
More flexible methods = variance increases and bias decreases
Given this is the case, what determines whether the test MSE increases or decreases?
The relative rates of change of variance and bias. If bias decreases faster than variance increases, test MSE declines. If bias declines slowly or is unchanged while variance increases significantly, test MSE increases.
Classification
Error Rate Equation
(1/n) Σᵢ I(yᵢ ≠ ŷᵢ), i.e., the fraction of observations for which the predicted class differs from the true class.
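A minimal R sketch of this calculation; the label vectors are illustrative.
y_true <- c("A", "B", "A", "A", "B")
y_pred <- c("A", "B", "B", "A", "B")
mean(y_true != y_pred)   # fraction misclassified: 0.2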
Classification
Difference Between Training Error Rate and Test Error Rate
Training error rate measures how well the model classifies the data it was trained on; test error rate measures how well the model classifies new, unseen data.
What is the Bayes Classifier?
A theoretical classifier that assigns each observation to the most likely class, given its predictor values. It achieves the lowest possible test error rate on average but is generally unachievable in practice because the true distribution of Y given X is unknown.
Bayes Classifier Equation
Pr(Y = j | X = X₀)
Pr(Y = j | X = X₀)
How would a Bayes Classifier work, based on the equation above?
Assign a test observation with predictor vector X₀ to the class j for which the probability is the largest.
Definition: Bayes Decision Boundary
The boundary in predictor space where the conditional class probabilities are equal (Pr = 0.5 in a two-class setting). Points on one side are assigned to one class, points on the other side to the other class.
Assuming a Bayes classifier was used, what does the purple dashed line represent here?
Bayes Decision Boundary
Bayes Error Rate Equation
1 − E(max_j Pr(Y = j | X))
1 − E(max_j Pr(Y = j | X))
What does the Bayes Error Rate tell us?
It is the lowest achievable test error rate. It is analogous to the irreducible error in regression settings.
What is the K-Nearest Neighbors (KNN) Classifier?
Classifies a test observation by first finding the K closest training points and then assigning the majority class among those neighbors.
K-Nearest Neighbors (KNN) Equation
Pr(Y = j | X = x₀) = (1/K) Σ_{i ∈ N₀} I(yᵢ = j)
* K = number of nearest neighbors
* N₀ = indices of the K training observations closest to x₀
* I(·) = 1 if the condition is true, else 0
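A minimal R sketch of KNN classification using knn() from the class package; the simulated data and variable names are illustrative.
library(class)
set.seed(3)
train_X <- matrix(rnorm(40), ncol = 2)            # 20 training observations, 2 predictors
train_y <- factor(rep(c("A", "B"), each = 10))    # their class labels
test_X  <- matrix(rnorm(10), ncol = 2)            # 5 new observations
knn(train = train_X, test = test_X, cl = train_y, k = 3)   # predicted class for each test point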
What class will the black cross be assigned to in a KNN classifier (K = 3)?
Blue
How Does a Small K Influence Bias and Variance in KNN?
A small K (e.g. K=1) produces a highly flexible model that can overfit (low bias, high variance).
How Does a Large K Influence Bias and Variance in KNN?
A large K (e.g. K=100) yields a smoother, more biased but lower-variance model.
With a KNN classifier, how does the training error rate vary with K?
As 1/K increases (K decreases), the training error rate decreases.
With a KNN classifier, how does the test error rate vary with K?
As 1/K increases (K decreases), the test error rate first declines as flexibility increases before increasing as the method becomes excessively flexible and overfits.
Why Doesn’t Low Training Error Always Mean Low Test Error in KNN?
Because with small K, the method can memorize training data noise (overfitting). This leads to low training error but potentially high test error.
How Does KNN Compare with the Bayes Classifier?
KNN can approximate the Bayes rule if K is chosen well and there is enough data. However, the Bayes classifier remains the gold standard: it requires full knowledge of the true conditional distribution of Y given X, which is unavailable in practice.
Why Is Choosing the Correct Level of Model Flexibility Important?
Both in regression and classification, we want a balance between bias and variance. Too much flexibility can overfit; too little can underfit. Optimal flexibility minimizes test error, typically in a U-shaped pattern.
What does ls() do?
Lists the names of objects currently stored in the R environment.
What does rm(list=ls()) do?
Removes all objects in the environment, effectively clearing your workspace.
What does matrix() do in R?
Creates a matrix from a given set of values, arranged in rows and columns.
What does matrix(data=c(1,2,3,4), nrow=2, ncol=2) output in R?
A 2×2 matrix filled column-by-column (the default):
     [,1] [,2]
[1,]    1    3
[2,]    2    4
What does matrix(c(1,2,3,4), 2, 2, byrow=TRUE) output in R?
A 2×2 matrix filled row-by-row:
     [,1] [,2]
[1,]    1    2
[2,]    3    4
What does rnorm() do in R?
Generates random values from a normal (Gaussian) distribution.
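For example (the arguments shown are the defaults made explicit):
x <- rnorm(50, mean = 0, sd = 1)   # 50 draws from a standard normal distribution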
How do you create a PDF with a plot in R?
pdf("PlotName.pdf")
plot(...)
dev.off()
How do you create a JPEG with a plot in R?
jpeg("PlotName.jpg")
plot(...)
dev.off()
What does contour() do in R?
Creates a contour plot to visualize 3D data; draws contour lines of a 3D surface on a 2D plot.
What does image() do in R?
Creates a heatmap to visualize 3D data; displays a 2D image of a matrix of values using different colors to represent magnitude.
What does persp() do in R?
Creates a 3D perspective plot of a surface for 3D data.
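A minimal R sketch using all three functions on the same surface; the surface itself is an illustrative choice.
x <- seq(-pi, pi, length = 50)
y <- x
f <- outer(x, y, function(a, b) cos(b) / (1 + a^2))   # function values on a grid
contour(x, y, f)                        # contour lines
image(x, y, f)                          # heatmap
persp(x, y, f, theta = 30, phi = 20)    # 3D perspective view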
What does negative indexing (e.g., x[-1]) mean in R?
Selects all elements except the ones at the specified indices. For instance, x[-1] means ‘all but the first element.’
What does dim() do in R?
Returns the dimensions of an object (e.g., rows and columns for a matrix or data frame).
What does na.omit() do?
Removes rows or observations with missing values (NA) from an object like a data frame.
What does attach() do?
Makes the variables of a data set or environment accessible by name without using the $ operator or indexing.
What does pairs() do in R?
Creates a matrix of scatterplots for every pair of variables in a data frame or matrix.
What does identify() do in R plots?
Lets you click on plotted points to label or retrieve their coordinates (or row indices). Helpful for interactive identification.
Indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method. Justify your answer.
The sample size n is extremely large, and the number of predictors p is small.
Better - with an extremely large sample size and only a small number of predictors, a more flexible approach can fit the data more closely without overfitting, so it would generally outperform an inflexible approach.
Indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method. Justify your answer.
The number of predictors p is extremely large, and the number of observations n is small.
Worse - a flexible method would overfit the small number of observations
Indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method. Justify your answer.
The variance of the error terms, i.e. σ2 = Var(e), is extremely high.
Worse - flexible methods fit to the noise in the error terms and increase variance
What are the disadvantages of a parametric approach to regression or classification (as opposed to a nonparametric approach)?
The disadvantages of a parametric approach are the potential to estimate f inaccurately if the assumed form of f is wrong, and the potential to overfit the observations if more flexible parametric models are used.
How do you find the range of the first seven columns in a data.table called auto in R?
sapply(auto[, 1:7], range)
Do any of the suburbs of Boston appear to have particularly high crime rates? Comment on the range of this predictor.
Most suburbs have low crime rates, but there is a long tail: 15-20 suburbs appear to have a crime rate > 20, reaching above 80.
Do any of the suburbs of Boston appear to have particularly high tax rates? Comment on the range of this predictor.
There is a large divide between suburbs with low tax rates and a cluster of suburbs with very high rates, peaking at 660-680.
Do any of the suburbs of Boston appear to have particularly high pupil-teacher ratios? Comment on the range of this predictor.
The distribution is skewed towards high ratios, but no suburbs stand out with particularly extreme values.
Given a data set, boston, where the median value of owner-occupied homes column is called medv, how would you answer this question?
Which suburb of Boston has the lowest median value of owner-occupied homes? What are the values of the other predictors for that suburb, and how do those values compare to the overall ranges for those predictors? Comment on your findings.
- Subset the data to find the observation where medv equals the minimum medv: boston[medv == min(boston$medv)]
- Compare this to the distribution of each variable, using summary: summary(boston)
- Look at where each variable value falls within the variable’s distribution.
Given a data set, boston, where the average number of rooms per dwelling is called rm, how would you answer this question?
Comment on the suburbs that average more than eight rooms per dwelling.
1. Compare the distribution in variables for the subset of data where rm > 8: summary(boston[rm > 8])
2. Compare the distribution in variables for the overall data set: summary(boston)
3. Compare the distribution/range for (1) to (2) for each variable.